{"id":3551,"date":"2020-02-19T12:05:44","date_gmt":"2020-02-19T17:05:44","guid":{"rendered":"https:\/\/blogs.mathworks.com\/loren\/?p=3551"},"modified":"2020-02-21T08:33:21","modified_gmt":"2020-02-21T13:33:21","slug":"string-things","status":"publish","type":"post","link":"https:\/\/blogs.mathworks.com\/loren\/2020\/02\/19\/string-things\/","title":{"rendered":"String Things"},"content":{"rendered":"\r\n\r\n<div class=\"content\"><!--introduction--><p>Working with text in MATLAB has evolved over time.  Way back, text data was stored in double arrays with an internal flag to denote that it was meant to be text.  We then transformed this representation so character arrays were their very own type. And I mentioned <a href=\"https:\/\/blogs.mathworks.com\/loren\/2016\/12\/22\/singing-the-praises-of-strings\/\">earlier<\/a> that we introduced a <tt><a href=\"https:\/\/www.mathworks.com\/help\/matlab\/ref\/string.html\">string<\/a><\/tt> datatype to make working with text data more efficient and natural.  Let me show you a little more.<\/p><!--\/introduction--><h3>Contents<\/h3><div><ul><li><a href=\"#b483be9b-810a-4855-af09-c6563fd8ff54\">How to Compare Text: the Olden Days<\/a><\/li><li><a href=\"#f17238b1-54cd-493f-b551-590d5cc650e6\">More Modern, Not Identical Use<\/a><\/li><li><a href=\"#2082facb-a945-40cd-845b-51a08a981675\">String Comparisons Circa 2020<\/a><\/li><li><a href=\"#19eabca1-ef46-4fbe-b45b-2e1d85d7b2d1\">My Advice: Err on the Side of Code Readability<\/a><\/li><li><a href=\"#12a17c79-4e47-464f-8d80-fc7c6984fe0c\">String Adoption<\/a><\/li><\/ul><\/div><h4>How to Compare Text: the Olden Days<a name=\"b483be9b-810a-4855-af09-c6563fd8ff54\"><\/a><\/h4><p>Early on in MATLAB, we used the function <a href=\"https:\/\/www.mathworks.com\/help\/matlab\/ref\/strcmp.html\"><tt>strcmp<\/tt><\/a> to compare strings.  A big caveat for many people is that <tt>strcmp<\/tt> does not behave the same way as its C-language counterpart.  
In MATLAB, <tt>strcmp<\/tt> returns logical 1 (<tt>true<\/tt>) when its inputs match, whereas C's <tt>strcmp<\/tt> returns 0 for a match. Over time, we added a few more comparison functions:<\/p><div><ul><li><a href=\"https:\/\/www.mathworks.com\/help\/matlab\/ref\/strcmpi.html\"><tt>strcmpi<\/tt><\/a><\/li><li><a href=\"https:\/\/www.mathworks.com\/help\/matlab\/ref\/strncmp.html\"><tt>strncmp<\/tt><\/a><\/li><li><a href=\"https:\/\/www.mathworks.com\/help\/matlab\/ref\/strncmpi.html\"><tt>strncmpi<\/tt><\/a><\/li><\/ul><\/div><p>to allow case-insensitive matches and to compare only the first <tt>n<\/tt> characters.<\/p><p>Let's do some comparisons now.  First, on a cell array of character vectors...<\/p><pre class=\"codeinput\">cellChars = {<span class=\"string\">'Mercury'<\/span>,<span class=\"string\">'Venus'<\/span>,<span class=\"string\">'Earth'<\/span>,<span class=\"string\">'Mars'<\/span>}\r\n<\/pre><pre class=\"codeoutput\">cellChars =\r\n  1&times;4 cell array\r\n    {'Mercury'}    {'Venus'}    {'Earth'}    {'Mars'}\r\n<\/pre><pre class=\"codeinput\">TF = strcmp(<span class=\"string\">'fred'<\/span>,cellChars)\r\n<\/pre><pre class=\"codeoutput\">TF =\r\n  1&times;4 logical array\r\n   0   0   0   0\r\n<\/pre><pre class=\"codeinput\">TF = strcmp(<span class=\"string\">'Venus'<\/span>,cellChars)\r\n<\/pre><pre class=\"codeoutput\">TF =\r\n  1&times;4 logical array\r\n   0   1   0   0\r\n<\/pre><pre class=\"codeinput\">TF = strncmp(<span class=\"string\">'Mars'<\/span>, cellChars, 2)\r\n<\/pre><pre class=\"codeoutput\">TF =\r\n  1&times;4 logical array\r\n   0   0   0   1\r\n<\/pre><pre class=\"codeinput\">TF = strncmp(<span class=\"string\">'Marvelous'<\/span>, cellChars, 2)\r\n<\/pre><pre class=\"codeoutput\">TF =\r\n  1&times;4 logical array\r\n   0   0   0   1\r\n<\/pre><pre class=\"codeinput\">TF = strncmp(<span class=\"string\">'Marvelous'<\/span>, cellChars, 4)\r\n<\/pre><pre class=\"codeoutput\">TF =\r\n  1&times;4 logical array\r\n   0   0   0   0\r\n<\/pre><pre class=\"codeinput\">TF = strcmpi(<span class=\"string\">'mars'<\/span>, cellChars)\r\n<\/pre><pre class=\"codeoutput\">TF 
=\r\n  1&times;4 logical array\r\n   0   0   0   1\r\n<\/pre><pre class=\"codeinput\">TF = strcmpi(<span class=\"string\">'mar'<\/span>, cellChars)\r\n<\/pre><pre class=\"codeoutput\">TF =\r\n  1&times;4 logical array\r\n   0   0   0   0\r\n<\/pre><h4>More Modern, Not Identical Use<a name=\"f17238b1-54cd-493f-b551-590d5cc650e6\"><\/a><\/h4><p>We also introduced <a href=\"https:\/\/www.mathworks.com\/help\/matlab\/ref\/categorical.html\"><tt>categorical<\/tt><\/a> arrays for cases where limiting the set of string choices was appropriate.  When using <tt>categorical<\/tt> variables, you may use <a href=\"https:\/\/www.mathworks.com\/help\/matlab\/ref\/eq.html\"><tt>==<\/tt><\/a> for comparisons.<\/p><pre class=\"codeinput\">catStr = categorical(cellChars)\r\n<\/pre><pre class=\"codeoutput\">catStr = \r\n  1&times;4 categorical array\r\n     Mercury      Venus      Earth      Mars \r\n<\/pre><pre class=\"codeinput\">TF = <span class=\"string\">'Mars'<\/span> == catStr\r\n<\/pre><pre class=\"codeoutput\">TF =\r\n  1&times;4 logical array\r\n   0   0   0   1\r\n<\/pre><h4>String Comparisons Circa 2020<a name=\"2082facb-a945-40cd-845b-51a08a981675\"><\/a><\/h4><p>And now for <tt>string<\/tt> comparisons.<\/p><pre class=\"codeinput\">str = string(cellChars) <span class=\"comment\">% or [\"Mercury\",\"Venus\",\"Earth\",\"Mars\"]<\/span>\r\n<\/pre><pre class=\"codeoutput\">str = \r\n  1&times;4 string array\r\n    \"Mercury\"    \"Venus\"    \"Earth\"    \"Mars\"\r\n<\/pre><p>I can still use the <tt>str*cmp*<\/tt> functions.  
But we are not restricted to them.<\/p><pre class=\"codeinput\">TF = strcmp(<span class=\"string\">'Mars'<\/span>, str)\r\n<\/pre><pre class=\"codeoutput\">TF =\r\n  1&times;4 logical array\r\n   0   0   0   1\r\n<\/pre><p>We can now use <tt>==<\/tt> and related operators without worrying about the issues that arise with character arrays, where <tt>==<\/tt> compares element by element and requires compatible sizes.<\/p><pre class=\"codeinput\">TF = str ~= <span class=\"string\">\"Mars\"<\/span>\r\n<\/pre><pre class=\"codeoutput\">TF =\r\n  1&times;4 logical array\r\n   1   1   1   0\r\n<\/pre><p>And most recently, we introduced the function <tt><a href=\"https:\/\/www.mathworks.com\/help\/matlab\/ref\/matches.html\">matches<\/a><\/tt>.<\/p><pre class=\"codeinput\">TF = matches(str,<span class=\"string\">\"Earth\"<\/span>)\r\n<\/pre><pre class=\"codeoutput\">TF =\r\n  1&times;4 logical array\r\n   0   0   1   0\r\n<\/pre><p>It has some nifty features for handling string arrays, like looking for planets with an orbit inside Earth's.<\/p><pre class=\"codeinput\">TF = matches(str,[<span class=\"string\">\"Mercury\"<\/span>,<span class=\"string\">\"Venus\"<\/span>])\r\n<\/pre><pre class=\"codeoutput\">TF =\r\n  1&times;4 logical array\r\n   1   1   0   0\r\n<\/pre><p>And I can, of course, ignore case, with code that, to me, appears less cryptic.<\/p><pre class=\"codeinput\">TF = matches(str,<span class=\"string\">\"earth\"<\/span>,<span class=\"string\">\"IgnoreCase\"<\/span>,true)\r\n<\/pre><pre class=\"codeoutput\">TF =\r\n  1&times;4 logical array\r\n   0   0   1   0\r\n<\/pre><p>As is true in all of these cases, we can index into the original array with the logical output to extract the relevant item(s).<\/p><pre class=\"codeinput\">str(TF)\r\n<\/pre><pre class=\"codeoutput\">ans = \r\n    \"Earth\"\r\n<\/pre><h4>My Advice: Err on the Side of Code Readability<a name=\"19eabca1-ef46-4fbe-b45b-2e1d85d7b2d1\"><\/a><\/h4><p>I haven't touched on performance here, but one of the drivers for the recent 
<tt>string<\/tt> datatype is efficiency and performance.  We've worked hard to overlay that with functions that make your code highly readable. This makes code maintenance and code transfer go much more smoothly. I tend to favor this over eking out the last fractional second of speed. In the case of strings, you may not even need to make that tradeoff.<\/p><h4>String Adoption<a name=\"12a17c79-4e47-464f-8d80-fc7c6984fe0c\"><\/a><\/h4><p>Have you seen enough evidence that strings are the future for working with textual data in MATLAB?  Tell us what you think <a href=\"https:\/\/blogs.mathworks.com\/loren\/?p=3551#respond\">here<\/a>.<\/p><script language=\"JavaScript\"> <!-- \r\n    function grabCode_d3cb06febe42423285020e1f8d0edd72() {\r\n        \/\/ Remember the title so we can use it in the new page\r\n        title = document.title;\r\n\r\n        \/\/ Break up these strings so that their presence\r\n        \/\/ in the Javascript doesn't mess up the search for\r\n        \/\/ the MATLAB code.\r\n        t1='d3cb06febe42423285020e1f8d0edd72 ' + '##### ' + 'SOURCE BEGIN' + ' #####';\r\n        t2='##### ' + 'SOURCE END' + ' #####' + ' d3cb06febe42423285020e1f8d0edd72';\r\n    \r\n        b=document.getElementsByTagName('body')[0];\r\n        i1=b.innerHTML.indexOf(t1)+t1.length;\r\n        i2=b.innerHTML.indexOf(t2);\r\n \r\n        code_string = b.innerHTML.substring(i1, i2);\r\n        code_string = code_string.replace(\/REPLACE_WITH_DASH_DASH\/g,'--');\r\n\r\n        \/\/ Use \/x3C\/g instead of the less-than character to avoid errors \r\n        \/\/ in the XML parser.\r\n        \/\/ Use '\\x26#60;' instead of '<' so that the XML parser\r\n        \/\/ doesn't go ahead and substitute the less-than character. 
\r\n        code_string = code_string.replace(\/\\x3C\/g, '\\x26#60;');\r\n\r\n        copyright = 'Copyright 2020 The MathWorks, Inc.';\r\n\r\n        w = window.open();\r\n        d = w.document;\r\n        d.write('<pre>\\n');\r\n        d.write(code_string);\r\n\r\n        \/\/ Add copyright line at the bottom if specified.\r\n        if (copyright.length > 0) {\r\n            d.writeln('');\r\n            d.writeln('%%');\r\n            if (copyright.length > 0) {\r\n                d.writeln('% _' + copyright + '_');\r\n            }\r\n        }\r\n\r\n        d.write('<\/pre>\\n');\r\n\r\n        d.title = title + ' (MATLAB code)';\r\n        d.close();\r\n    }   \r\n     --> <\/script><p style=\"text-align: right; font-size: xx-small; font-weight:lighter;   font-style: italic; color: gray\"><br><a href=\"javascript:grabCode_d3cb06febe42423285020e1f8d0edd72()\"><span style=\"font-size: x-small;        font-style: italic;\">Get \r\n      the MATLAB code <noscript>(requires JavaScript)<\/noscript><\/span><\/a><br><br>\r\n      Published with MATLAB&reg; R2019b<br><\/p><\/div><!--\r\nd3cb06febe42423285020e1f8d0edd72 ##### SOURCE BEGIN #####\r\n%% String Things\r\n% Working with text in MATLAB has evolved over time.  Way back, text data\r\n% was stored in double arrays with an internal flag to denote that it was\r\n% meant to be text.  We then transformed this representation so character\r\n% arrays were their very own type. And I mentioned\r\n% <https:\/\/blogs.mathworks.com\/loren\/2016\/12\/22\/singing-the-praises-of-strings\/\r\n% earlier> that we introduced a\r\n% |<https:\/\/www.mathworks.com\/help\/matlab\/ref\/string.html string>| datatype\r\n% to make working with text data more efficient and natural.  Let me show\r\n% you a little more.\r\n\r\n%% How to Compare Text: the Olden Days\r\n% Early on in MATLAB, we used the function\r\n% <https:\/\/www.mathworks.com\/help\/matlab\/ref\/strcmp.html |strcmp|> to\r\n% compare strings.  
A big caveat for many people is that |strcmp| does not\r\n% behave the same way as its C-language counterpart.  In MATLAB, |strcmp|\r\n% returns logical 1 (|true|) when its inputs match, whereas C's |strcmp|\r\n% returns 0 for a match.  Over time, we added a few more comparison\r\n% functions:\r\n%\r\n% * <https:\/\/www.mathworks.com\/help\/matlab\/ref\/strcmpi.html |strcmpi|>\r\n% * <https:\/\/www.mathworks.com\/help\/matlab\/ref\/strncmp.html |strncmp|>\r\n% * <https:\/\/www.mathworks.com\/help\/matlab\/ref\/strncmpi.html |strncmpi|>\r\n% \r\n% to allow case-insensitive matches and to compare only the first |n|\r\n% characters.\r\n%\r\n%%\r\n% Let's do some comparisons now.  First, on a cell array of character\r\n% vectors...\r\n\r\ncellChars = {'Mercury','Venus','Earth','Mars'}\r\n%%\r\nTF = strcmp('fred',cellChars)\r\n%%\r\nTF = strcmp('Venus',cellChars)\r\n%% \r\nTF = strncmp('Mars', cellChars, 2)\r\n%% \r\nTF = strncmp('Marvelous', cellChars, 2)\r\n%%\r\nTF = strncmp('Marvelous', cellChars, 4)\r\n%%\r\nTF = strcmpi('mars', cellChars)\r\n%%\r\nTF = strcmpi('mar', cellChars)\r\n\r\n%% More Modern, Not Identical Use\r\n% We also introduced\r\n% <https:\/\/www.mathworks.com\/help\/matlab\/ref\/categorical.html\r\n% |categorical|> arrays for cases where limiting the set of string choices\r\n% was appropriate.  When using |categorical| variables, you may use\r\n% <https:\/\/www.mathworks.com\/help\/matlab\/ref\/eq.html |==|> for comparisons.\r\ncatStr = categorical(cellChars)\r\n%%\r\nTF = 'Mars' == catStr\r\n\r\n%% String Comparisons Circa 2020\r\n% And now for |string| comparisons.\r\nstr = string(cellChars) % or [\"Mercury\",\"Venus\",\"Earth\",\"Mars\"]\r\n%%\r\n% I can still use the |str*cmp*| functions.  But we are not restricted to\r\n% them.  \r\nTF = strcmp('Mars', str)\r\n%%\r\n% We can now use |==| and related operators without worrying about the\r\n% issues that arise with character arrays, where |==| compares element by\r\n% element and requires compatible sizes.\r\nTF = str ~= \"Mars\"\r\n\r\n%%\r\n% And most recently, we introduced the function\r\n% |<https:\/\/www.mathworks.com\/help\/matlab\/ref\/matches.html matches>|.  
\r\nTF = matches(str,\"Earth\")\r\n%%\r\n% It has some nifty features for handling string arrays, like looking for\r\n% planets with an orbit inside Earth's.\r\nTF = matches(str,[\"Mercury\",\"Venus\"])\r\n%%\r\n% And I can, of course, ignore case, with code that, to me, appears less\r\n% cryptic.\r\nTF = matches(str,\"earth\",\"IgnoreCase\",true)\r\n%%\r\n% As is true in all of these cases, we can index into the original array\r\n% with the logical output to extract the relevant item(s).\r\nstr(TF)\r\n%% My Advice: Err on the Side of Code Readability\r\n% I haven't touched on performance here, but one of the drivers for the\r\n% recent |string| datatype is efficiency and performance.  We've worked\r\n% hard to overlay that with functions that make your code highly readable.\r\n% This makes code maintenance and code transfer go much more smoothly. I\r\n% tend to favor this over eking out the last fractional second of speed.\r\n% In the case of strings, you may not even need to make that tradeoff.\r\n%% String Adoption\r\n% Have you seen enough evidence that strings are the future for working\r\n% with textual data in MATLAB?  Tell us what you think\r\n% <https:\/\/blogs.mathworks.com\/loren\/?p=3551#respond here>.\r\n \r\n##### SOURCE END ##### d3cb06febe42423285020e1f8d0edd72\r\n-->","protected":false},"excerpt":{"rendered":"<!--introduction--><p>Working with text in MATLAB has evolved over time.  Way back, text data was stored in double arrays with an internal flag to denote that it was meant to be text.  We then transformed this representation so character arrays were their very own type. And I mentioned <a href=\"https:\/\/blogs.mathworks.com\/loren\/2016\/12\/22\/singing-the-praises-of-strings\/\">earlier<\/a> that we introduced a <tt><a href=\"https:\/\/www.mathworks.com\/help\/matlab\/ref\/string.html\">string<\/a><\/tt> datatype to make working with text data more efficient and natural.  Let me show you a little more.... 
<a class=\"read-more\" href=\"https:\/\/blogs.mathworks.com\/loren\/2020\/02\/19\/string-things\/\">read more >><\/a><\/p>","protected":false},"author":39,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":[],"categories":[6,2],"tags":[],"_links":{"self":[{"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/posts\/3551"}],"collection":[{"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/users\/39"}],"replies":[{"embeddable":true,"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/comments?post=3551"}],"version-history":[{"count":5,"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/posts\/3551\/revisions"}],"predecessor-version":[{"id":3563,"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/posts\/3551\/revisions\/3563"}],"wp:attachment":[{"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/media?parent=3551"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/categories?post=3551"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/tags?post=3551"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}