{"id":2251,"date":"2017-04-24T08:43:01","date_gmt":"2017-04-24T13:43:01","guid":{"rendered":"https:\/\/blogs.mathworks.com\/loren\/?p=2251"},"modified":"2017-04-23T08:45:43","modified_gmt":"2017-04-23T13:45:43","slug":"working-with-text-in-matlab","status":"publish","type":"post","link":"https:\/\/blogs.mathworks.com\/loren\/2017\/04\/24\/working-with-text-in-matlab\/","title":{"rendered":"Working with Text in MATLAB"},"content":{"rendered":"\r\n<div class=\"content\"><!--introduction--><p><i>I'd like to introduce today's guest blogger, Dave Bergstein, a MATLAB Product Manager at MathWorks. In today's post, Dave discusses recent updates to text processing with MATLAB.<\/i><\/p><!--\/introduction--><h3>Contents<\/h3><div><ul><li><a href=\"#c630a184-dbc1-4157-ba0c-750734b2f0ea\">Example: How Late Is My Bus?<\/a><\/li><li><a href=\"#639de2e4-6b1a-4096-9de9-9c276c273f52\">Text as Data<\/a><\/li><li><a href=\"#92951a06-3078-468f-b632-41bf094f6f8a\">Recommendations on Text Type<\/a><\/li><li><a href=\"#368701e4-1a35-42d5-b39a-eb4771b5318d\">Looking to the Future<\/a><\/li><\/ul><\/div><p>In today's post I share a text processing example using the new <a href=\"https:\/\/www.mathworks.com\/help\/matlab\/matlab_prog\/represent-text-with-character-and-string-arrays.html\">string array<\/a> and a collection of new <a href=\"https:\/\/www.mathworks.com\/help\/matlab\/characters-and-strings.html\">text manipulation functions<\/a>, both introduced in R2016b. 
I also give recommendations on when best to use <tt>string<\/tt>, <tt>char<\/tt>, or <tt>cell<\/tt> for text and share some of our thinking on the future.<\/p><p>Also be sure to check out Toshi's post <a href=\"https:\/\/blogs.mathworks.com\/loren\/2016\/09\/15\/introducing-string-arrays\/\">Introducing String Arrays<\/a> and Loren's post <a href=\"https:\/\/blogs.mathworks.com\/loren\/2016\/12\/22\/singing-the-praises-of-strings\/\">Singing the Praises of Strings<\/a>.<\/p><h4>Example: How Late Is My Bus?<a name=\"c630a184-dbc1-4157-ba0c-750734b2f0ea\"><\/a><\/h4><p>My friend in New York City talks about the delays on her bus route. Let's look at some data to see what typical delays are for trains and buses. The <a href=\"https:\/\/www.ny.gov\/programs\/open-ny\">Open NY Initiative<\/a> shares data which includes over 150,000 public transit events spanning 6 years. I downloaded this data as a CSV file from: <a href=\"https:\/\/data.ny.gov\/Transportation\/511-NY-MTA-Events-Beginning-2010\/i8wu-pqzv\">https:\/\/data.ny.gov\/Transportation\/511-NY-MTA-Events-Beginning-2010\/i8wu-pqzv<\/a><\/p><p><b>Import the Data<\/b><\/p><p>I read the data into a table using <tt><a href=\"https:\/\/www.mathworks.com\/help\/matlab\/ref\/readtable.html\">readtable<\/a><\/tt> and specify the <tt>TextType<\/tt> name-value pair as <tt>string<\/tt> to read the text as string arrays.<\/p><pre class=\"codeinput\">data = readtable(<span class=\"string\">'511_NY_MTA_Events__Beginning_2010.csv'<\/span>,<span class=\"string\">'TextType'<\/span>,<span class=\"string\">'string'<\/span>);\r\n<\/pre><pre class=\"codeoutput\">Warning: Variable names were modified to make them valid MATLAB\r\nidentifiers. The original names are saved in the VariableDescriptions\r\nproperty. 
\r\n<\/pre><p>Here is a list of variables in the table:<\/p><pre class=\"codeinput\">data.Properties.VariableNames\r\n<\/pre><pre class=\"codeoutput\">ans =\r\n  1&times;13 cell array\r\n  Columns 1 through 4\r\n    'EventType'    'OrganizationName'    'FacilityName'    'Direction'\r\n  Columns 5 through 9\r\n    'City'    'County'    'State'    'CreateTime'    'CloseTime'\r\n  Columns 10 through 13\r\n    'EventDescription'    'RespondingOrganiz&#8230;'    'Latitude'    'Longitude'\r\n<\/pre><p><tt>data.EventDescription<\/tt> is a string array which contains the event descriptions. Let's take a closer look at the events.<\/p><pre class=\"codeinput\">eventsStr = data.EventDescription;\r\n<\/pre><p>Unlike character vectors or cell arrays of character vectors, each element of the string array is a string itself. See how I can index the string array just as I would a numeric array and get string arrays back.<\/p><pre class=\"codeinput\">eventsStr(1:3)\r\n<\/pre><pre class=\"codeoutput\">ans = \r\n  3&times;1 string array\r\n    \"MTA NYC Transit Bus: due to Earlier flooding Q11 into Hamilton Beach normal service resumed\"\r\n    \"MTA NYC Transit Bus: due to Construction, northbound M1 Bus area of 147th Street:Adam Clayton Powell Junior\"\r\n    \"MTA NYC Transit Subway: due to Delays, Bronx Bound # 2 &amp; 3 Lines at Nevins Street Station (Brooklyn)\"\r\n<\/pre><p>Many of the event descriptions report delays like 'operating 10 minutes late'. 
See for example how the 26-minute delay is reported in event 5180.<\/p><pre class=\"codeinput\">eventsStr(5180)\r\n<\/pre><pre class=\"codeoutput\">ans = \r\n    \"MTA Long Island Rail Road: due to Debris on tracks, westbound Montauk Branch between Montauk Station (Suffolk County)  and Jamaica Station (Queens)  The 6:44 AM from Montauk due Jamaica at 9:32 AM, is operating 26 minutes late due to an unauthorized vehicle on the tracks near Hampton Bays.\"\r\n<\/pre><p><b>Identify Delays<\/b><\/p><p>I want to find all the events which contain ' late '. MATLAB R2016b also introduced more than a dozen new functions for working with text. These functions work with character vectors, cell arrays of character vectors, and string arrays. You can learn about these functions from the <a href=\"https:\/\/www.mathworks.com\/help\/matlab\/characters-and-strings.html\">characters and strings<\/a> page in our documentation.<\/p><p>I convert the text to all lowercase and determine which events contain ' late ' using the <a href=\"https:\/\/www.mathworks.com\/help\/matlab\/ref\/contains.html\"><tt>contains<\/tt><\/a> function.<\/p><pre class=\"codeinput\">eventsStr = lower(eventsStr);\r\nidx = contains(eventsStr,<span class=\"string\">' late '<\/span>);\r\nlateEvents = eventsStr(idx);\r\n<\/pre><p><b>Extract the Delay Times<\/b><\/p><p>I extract the minutes late from phrases like 'operating 10 minutes late' using the functions <a href=\"https:\/\/www.mathworks.com\/help\/matlab\/ref\/extractafter.html\"><tt>extractAfter<\/tt><\/a> and <a href=\"https:\/\/www.mathworks.com\/help\/matlab\/ref\/extractbefore.html\"><tt>extractBefore<\/tt><\/a>.<\/p><p>Let's look at the first late event. The exact phrase we are seeking doesn't appear in this event. 
When we look for the text following 'operating' we get back a <tt>missing<\/tt> string.<\/p><pre class=\"codeinput\">lateEvents(1)\r\nextractAfter(lateEvents(1),<span class=\"string\">'operating'<\/span>)\r\n<\/pre><pre class=\"codeoutput\">ans = \r\n    \"mta long island rail road: due to delays, westbound babylon branch between speonk station (speonk)  and new york penn station (manhattan)  the 5:08 a.m. departure due ny @ 7:02 a.m. is 15 minutes late @ babylon.\"\r\nans = \r\n    &lt;missing&gt;\r\n<\/pre><p>Let's look at the second late event. This string contains the phrase 'operating 14 minutes late'. Extracting the text after 'operating' we get '14 minutes late due to signal problems'. Extracting the text before 'minutes late' we get back ' 14 ' which we can convert to a numeric value using <tt>double<\/tt>.<\/p><pre class=\"codeinput\">lateEvents(2)\r\ns = extractAfter(lateEvents(2),<span class=\"string\">'operating'<\/span>)\r\ns = extractBefore(s,<span class=\"string\">'minutes late'<\/span>)\r\nminLate = double(s)\r\n<\/pre><pre class=\"codeoutput\">ans = \r\n    \"mta long island rail road: due to delays westbound ronkonkoma branch out of bethpage station (suffolk county) the 8:01 am train due into penn station at 8:47 am is operating 14 minutes late due to signal problems\"\r\ns = \r\n    \" 14 minutes late due to signal problems\"\r\ns = \r\n    \" 14 \"\r\nminLate =\r\n    14\r\n<\/pre><p>Success! We extracted the train delay from the event description. Now let's put this all together. I extract the minutes late from all the events and drop the missing values using the <a href=\"https:\/\/www.mathworks.com\/help\/matlab\/ref\/rmmissing.html\"><tt>rmmissing<\/tt><\/a> function. 
I then convert the remaining values to numbers using <tt>double<\/tt> and plot a histogram of the results.<\/p><pre class=\"codeinput\">s = extractAfter(lateEvents,<span class=\"string\">'operating'<\/span>);\r\ns = extractBefore(s,<span class=\"string\">'minutes late'<\/span>);\r\ns = rmmissing(s);\r\nminLate = double(s);\r\n\r\nhistogram(minLate,0:5:40)\r\nylabel(<span class=\"string\">'Number of Events'<\/span>)\r\nxlabel(<span class=\"string\">'Minutes Late'<\/span>)\r\ntitle({<span class=\"string\">'Transit Delays'<\/span>,<span class=\"string\">'NY Metropolitan Transit Authority'<\/span>})\r\n<\/pre><img decoding=\"async\" vspace=\"5\" hspace=\"5\" src=\"https:\/\/blogs.mathworks.com\/images\/loren\/2017\/StringBlogV7_01.png\" alt=\"\"> <p>It looks like reported delays are often 10-15 minutes. This simple routine captures many of the transit delays, but not all. The pattern doesn't always fit (consider again <tt>lateEvents(1)<\/tt>). I also left out any delays that may be reported in hours. Can you improve it?<\/p><h4>Text as Data<a name=\"639de2e4-6b1a-4096-9de9-9c276c273f52\"><\/a><\/h4><p>String arrays are a great choice for text data like the example above because they are memory efficient and perform better than cell arrays of character vectors (previously known as <tt>cellstr<\/tt>).<\/p><p>Let's compare the memory usage. I convert the string array to a cell array of character vectors with the <tt>cellstr<\/tt> command and check the memory with <tt>whos<\/tt>. 
See the Bytes column: the string array uses about 12% less memory.<\/p><pre class=\"codeinput\">eventsCell = cellstr(eventsStr);\r\nwhos <span class=\"string\">events*<\/span>\r\n<\/pre><pre class=\"codeoutput\">  Name                 Size               Bytes  Class     Attributes\r\n\r\n  eventsCell      151225x1             73208886  cell                \r\n  eventsStr       151225x1             64662486  string              \r\n\r\n<\/pre><p>The memory savings can be much greater for many smaller pieces of text. For example, suppose I want to store each word as a separate array element. First I join all 150,000 reports into a single long string using the <tt><a href=\"https:\/\/www.mathworks.com\/help\/matlab\/ref\/join.html\">join<\/a><\/tt> function. I then split this long string on spaces using the <tt><a href=\"https:\/\/www.mathworks.com\/help\/matlab\/ref\/split.html\">split<\/a><\/tt> function. The result is a string array storing over 4 million words in separate elements. Here the memory savings is nearly 2X.<\/p><pre class=\"codeinput\">wordsStr = split(join(eventsStr));\r\nwordsCell = split(join(eventsCell));\r\nwhos <span class=\"string\">words*<\/span>\r\n<\/pre><pre class=\"codeoutput\">  Name                 Size                Bytes  Class     Attributes\r\n\r\n  wordsCell      4356256x1             535429652  cell                \r\n  wordsStr       4356256x1             284537656  string              \r\n\r\n<\/pre><p>String arrays also perform better. You can achieve the best performance using string arrays in combination with the text manipulation functions introduced in R2016b. Here I compare the performance of <a href=\"https:\/\/www.mathworks.com\/help\/matlab\/ref\/replace.html\"><tt>replace<\/tt><\/a> on a string array with that of <a href=\"https:\/\/www.mathworks.com\/help\/matlab\/ref\/strrep.html\"><tt>strrep<\/tt><\/a> on a cell array of character vectors. 
See how <tt>replace<\/tt> with a string array is about 4X faster than <tt>strrep<\/tt> with a cell array.<\/p><pre class=\"codeinput\">f1 = @() replace(eventsStr,<span class=\"string\">'delay'<\/span>,<span class=\"string\">'late'<\/span>);\r\nf2 = @() strrep(eventsCell,<span class=\"string\">'delay'<\/span>,<span class=\"string\">'late'<\/span>);\r\ntimeit(f1)\r\ntimeit(f2)\r\n<\/pre><pre class=\"codeoutput\">ans =\r\n     0.062507\r\nans =\r\n      0.23239\r\n<\/pre><h4>Recommendations on Text Type<a name=\"92951a06-3078-468f-b632-41bf094f6f8a\"><\/a><\/h4><p>So, should you use string arrays for all your text? Maybe not yet. MATLAB has three different ways to store text:<\/p><div><ul><li>character vectors (<tt>char<\/tt>)<\/li><li>string arrays (<tt>string<\/tt>)<\/li><li>cell arrays of character vectors (<tt>cell<\/tt>)<\/li><\/ul><\/div><p>For now (as of R2017a), we encourage you to use string arrays to store text data such as the transit events. We don&#8217;t recommend using string arrays elsewhere yet since string arrays aren&#8217;t yet accepted everywhere in MATLAB. Notice how I used a character vector for specifying the filename in <tt>readtable<\/tt> and a cell array of character vectors for the figure title.<\/p><h4>Looking to the Future<a name=\"368701e4-1a35-42d5-b39a-eb4771b5318d\"><\/a><\/h4><p>What about in the future? We feel string arrays provide a better experience than character vectors and cell arrays of character vectors. Our plan is to roll out broader use of string arrays over time.<\/p><p>In the next few releases we will update more MATLAB functions and properties to accept string arrays in addition to character vectors and cell arrays of character vectors. As we do so, it will become easier for you to use string arrays in more places.<\/p><p>Next we will replace cell arrays of character vectors in MATLAB with string arrays. Note that cell arrays themselves aren't going anywhere. 
They are an important MATLAB container type and good for storing mixed data types or arrays of jagged size among other uses. But we expect their use for text data will diminish and become largely replaced by string arrays which are more memory efficient and perform better for pure text data.<\/p><p>Beyond that, over time, we will use string arrays in new functions and new properties in place of character vectors (but will continue returning character vectors in many places for compatibility). We expect character vectors will continue to live on for version-to-version code compatibility and special use cases.<\/p><p>Speaking of compatibility: we care deeply about version-to-version compatibility of MATLAB code, today more than ever. So, we are taking the following steps in our roll out of string arrays:<\/p><div><ol><li>Text manipulation functions (both old and new) return the text type they are passed. This means you can opt-in to using string with these functions (string use isn't necessary). Note how I used <tt>split<\/tt> and <tt>join<\/tt> above with either string arrays or cell arrays of character vectors.<\/li><li>We are recommending string arrays today for text data applications. Here there are ways to opt-in to string use. In the example, I opted to get a string array from <tt>readtable<\/tt> using the <tt>TextType<\/tt> name-value pair. And string arrays were returned from functions like <tt>extractBefore<\/tt> because I passed a string array as input.<\/li><li>We added curly brace indexing to string arrays which returns a character vector for compatibility. Cell arrays return their contents when you index with curly braces <tt>{}<\/tt>. Code that uses cell arrays of character vectors usually indexes the array with curly braces to access the character vector. Such code can work with string arrays since curly brace indexing will also return a character vector. 
See how the following code returns the same result whether <tt>f<\/tt> is a cell array or a string:<\/li><\/ol><\/div><pre class=\"codeinput\">d = datetime(<span class=\"string\">'now'<\/span>);\r\nf = {<span class=\"string\">'h'<\/span>,<span class=\"string\">'m'<\/span>,<span class=\"string\">'s'<\/span>};   <span class=\"comment\">% use a cell array<\/span>\r\n<span class=\"keyword\">for<\/span> n = 1:3,\r\n    d.Format = f{n};\r\n    disp(d)\r\n<span class=\"keyword\">end<\/span>\r\n<\/pre><pre class=\"codeoutput\">   9\r\n   43\r\n   10\r\n<\/pre><pre class=\"codeinput\">f = [<span class=\"string\">\"h\"<\/span>,<span class=\"string\">\"m\"<\/span>,<span class=\"string\">\"s\"<\/span>];   <span class=\"comment\">% use a string array<\/span>\r\n<span class=\"keyword\">for<\/span> n = 1:3,\r\n    d.Format = f{n};\r\n    disp(d)\r\n<span class=\"keyword\">end<\/span>\r\n<\/pre><pre class=\"codeoutput\">   9\r\n   43\r\n   10\r\n<\/pre><p>Expect to hear more from me on this topic. And please share your input with us by leaving a comment below. We're interested to <a href=\"https:\/\/blogs.mathworks.com\/loren\/?p=2251#respond\">hear from you<\/a>.<\/p><p>We hope string arrays will help you accomplish your goals and that the steps we're taking provide a smooth adoption. 
If you haven't tried string arrays yet, learn more from our documentation on <a href=\"https:\/\/www.mathworks.com\/help\/matlab\/characters-and-strings.html\">characters and strings<\/a>.<\/p><script language=\"JavaScript\"> <!-- \r\n    function grabCode_84aa8c4710c3480fa6b2a8d097849117() {\r\n        \/\/ Remember the title so we can use it in the new page\r\n        title = document.title;\r\n\r\n        \/\/ Break up these strings so that their presence\r\n        \/\/ in the Javascript doesn't mess up the search for\r\n        \/\/ the MATLAB code.\r\n        t1='84aa8c4710c3480fa6b2a8d097849117 ' + '##### ' + 'SOURCE BEGIN' + ' #####';\r\n        t2='##### ' + 'SOURCE END' + ' #####' + ' 84aa8c4710c3480fa6b2a8d097849117';\r\n    \r\n        b=document.getElementsByTagName('body')[0];\r\n        i1=b.innerHTML.indexOf(t1)+t1.length;\r\n        i2=b.innerHTML.indexOf(t2);\r\n \r\n        code_string = b.innerHTML.substring(i1, i2);\r\n        code_string = code_string.replace(\/REPLACE_WITH_DASH_DASH\/g,'--');\r\n\r\n        \/\/ Use \/x3C\/g instead of the less-than character to avoid errors \r\n        \/\/ in the XML parser.\r\n        \/\/ Use '\\x26#60;' instead of '<' so that the XML parser\r\n        \/\/ doesn't go ahead and substitute the less-than character. 
\r\n        code_string = code_string.replace(\/\\x3C\/g, '\\x26#60;');\r\n\r\n        copyright = 'Copyright 2017 The MathWorks, Inc.';\r\n\r\n        w = window.open();\r\n        d = w.document;\r\n        d.write('<pre>\\n');\r\n        d.write(code_string);\r\n\r\n        \/\/ Add copyright line at the bottom if specified.\r\n        if (copyright.length > 0) {\r\n            d.writeln('');\r\n            d.writeln('%%');\r\n            if (copyright.length > 0) {\r\n                d.writeln('% _' + copyright + '_');\r\n            }\r\n        }\r\n\r\n        d.write('<\/pre>\\n');\r\n\r\n        d.title = title + ' (MATLAB code)';\r\n        d.close();\r\n    }   \r\n     --> <\/script><p style=\"text-align: right; font-size: xx-small; font-weight:lighter;   font-style: italic; color: gray\"><br><a href=\"javascript:grabCode_84aa8c4710c3480fa6b2a8d097849117()\"><span style=\"font-size: x-small;        font-style: italic;\">Get \r\n      the MATLAB code <noscript>(requires JavaScript)<\/noscript><\/span><\/a><br><br>\r\n      Published with MATLAB&reg; R2017a<br><\/p><\/div><!--\r\n84aa8c4710c3480fa6b2a8d097849117 ##### SOURCE BEGIN #####\r\n%% Working with Text in MATLAB\r\n% _I'd like to introduce today's guest blogger, Dave Bergstein, a MATLAB\r\n% Product Manager at MathWorks. In today's post, Dave discusses recent\r\n% updates to text processing with MATLAB._\r\n\r\n%%\r\n% In today's post I share a text processing example using the new\r\n% <https:\/\/www.mathworks.com\/help\/matlab\/matlab_prog\/represent-text-with-character-and-string-arrays.html\r\n% string array> and a collection of new\r\n% <https:\/\/www.mathworks.com\/help\/matlab\/characters-and-strings.html text\r\n% manipulation functions>, both introduced in R2016b. 
I also give\r\n% recommendations on when best to use |string|, |char|, or |cell| for text\r\n% and share some of our thinking on the future.\r\n%\r\n% Also be sure to check out Toshi's post \r\n% <https:\/\/blogs.mathworks.com\/loren\/2016\/09\/15\/introducing-string-arrays\/\r\n% Introducing String Arrays> and Loren's post\r\n% <https:\/\/blogs.mathworks.com\/loren\/2016\/12\/22\/singing-the-praises-of-strings\/\r\n% Singing the Praises of Strings>.\r\n\r\n%% Example: How Late Is My Bus?\r\n% My friend in New York City talks about the delays on her bus route. Let's\r\n% look at some data to see what typical delays are for trains and buses.\r\n% The <https:\/\/www.ny.gov\/programs\/open-ny Open NY Initiative> shares data\r\n% which includes over 150,000 public transit events spanning 6 years. I\r\n% downloaded this data as a CSV file from:\r\n% <https:\/\/data.ny.gov\/Transportation\/511-NY-MTA-Events-Beginning-2010\/i8wu-pqzv>\r\n\r\n%% \r\n% *Import the Data*\r\n%\r\n% I read the data into a table using\r\n% |<https:\/\/www.mathworks.com\/help\/matlab\/ref\/readtable.html readtable>|\r\n% and specify the |TextType| name-value pair as |string| to read the text\r\n% as string arrays.\r\ndata = readtable('511_NY_MTA_Events__Beginning_2010.csv','TextType','string');\r\n%% \r\n% Here is a list of variables in the table:\r\ndata.Properties.VariableNames\r\n%%\r\n% |data.EventDescription| is a string array which contains the event\r\n% descriptions. Let's take a closer look at the events.\r\neventsStr = data.EventDescription;\r\n%%\r\n% Unlike character vectors or cell arrays of character vectors, each element\r\n% of the string array is a string itself. See how I can index the string\r\n% array just as I would a numeric array and get string arrays back.\r\neventsStr(1:3)\r\n\r\n%% \r\n% Many of the event descriptions report delays like 'operating 10 minutes\r\n% late'. 
See for example how the 26-minute delay is reported in event 5180.\r\neventsStr(5180)\r\n%% \r\n% *Identify Delays*\r\n%\r\n% I want to find all the events which contain ' late '. MATLAB R2016b also\r\n% introduced more than a dozen new functions for working with text. These\r\n% functions work with character vectors, cell arrays of character vectors,\r\n% and string arrays. You can learn about these functions from the\r\n% <https:\/\/www.mathworks.com\/help\/matlab\/characters-and-strings.html\r\n% characters and strings> page in our documentation.\r\n%\r\n% I convert the text to all lowercase and determine which events contain '\r\n% late ' using the <https:\/\/www.mathworks.com\/help\/matlab\/ref\/contains.html\r\n% |contains|> function.\r\neventsStr = lower(eventsStr);\r\nidx = contains(eventsStr,' late ');\r\nlateEvents = eventsStr(idx);\r\n%% \r\n% *Extract the Delay Times*\r\n%\r\n% I extract the minutes late from phrases like 'operating 10 minutes late'\r\n% using the functions\r\n% <https:\/\/www.mathworks.com\/help\/matlab\/ref\/extractafter.html\r\n% |extractAfter|> and\r\n% <https:\/\/www.mathworks.com\/help\/matlab\/ref\/extractbefore.html\r\n% |extractBefore|>.\r\n%\r\n% Let's look at the first late event. The exact phrase we are seeking\r\n% doesn't appear in this event. When we look for the text following\r\n% 'operating' we get back a |missing| string.\r\nlateEvents(1)\r\nextractAfter(lateEvents(1),'operating')\r\n%%\r\n% Let's look at the second late event. This string contains the phrase\r\n% 'operating 14 minutes late'. Extracting the text after 'operating' we get\r\n% '14 minutes late due to signal problems'. Extracting the text before 'minutes late' we get back\r\n% ' 14 ' which we can convert to a numeric value using |double|.\r\nlateEvents(2)\r\ns = extractAfter(lateEvents(2),'operating')\r\ns = extractBefore(s,'minutes late')\r\nminLate = double(s)\r\n%%\r\n% Success! We extracted the train delay from the event description. 
Now\r\n% let's put this all together. I extract the minutes late from all the\r\n% events and drop the missing values using the\r\n% <https:\/\/www.mathworks.com\/help\/matlab\/ref\/rmmissing.html |rmmissing|>\r\n% function. I then convert the remaining values to numbers using |double|\r\n% and plot a histogram of the results.\r\ns = extractAfter(lateEvents,'operating');\r\ns = extractBefore(s,'minutes late');\r\ns = rmmissing(s);\r\nminLate = double(s);\r\n\r\nhistogram(minLate,0:5:40)\r\nylabel('Number of Events')\r\nxlabel('Minutes Late')\r\ntitle({'Transit Delays','NY Metropolitan Transit Authority'})\r\n%%\r\n% It looks like reported delays are often 10-15 minutes. This simple\r\n% routine captures many of the transit delays, but not all. The pattern\r\n% doesn't always fit (consider again |lateEvents(1)|). I also left out any\r\n% delays that may be reported in hours. Can you improve it?\r\n%\r\n%% Text as Data\r\n%\r\n% String arrays are a great choice for text data like the example above\r\n% because they are memory efficient and perform better than cell arrays of\r\n% character vectors (previously known as |cellstr|).\r\n%\r\n% Let's compare the memory usage. I convert the string array to a cell\r\n% array of character vectors with the |cellstr| command and check the\r\n% memory with |whos|. See the Bytes column: the string array uses about\r\n% 12% less memory.\r\neventsCell = cellstr(eventsStr);\r\nwhos events*\r\n%%\r\n% The memory savings can be much greater for many smaller pieces of text.\r\n% For example, suppose I want to store each word as a separate array\r\n% element. First I join all 150,000 reports into a single long string using\r\n% the |<https:\/\/www.mathworks.com\/help\/matlab\/ref\/join.html join>|\r\n% function. 
I then split this long string on spaces using the\r\n% |<https:\/\/www.mathworks.com\/help\/matlab\/ref\/split.html split>| function.\r\n% The result is a string array storing over 4 million words in separate\r\n% elements. Here the memory savings is nearly 2X.\r\nwordsStr = split(join(eventsStr));\r\nwordsCell = split(join(eventsCell));\r\nwhos words*\r\n%%\r\n% String arrays also perform better. You can achieve the best performance\r\n% using string arrays in combination with the text manipulation functions\r\n% introduced in R2016b. Here I compare the performance of\r\n% <https:\/\/www.mathworks.com\/help\/matlab\/ref\/replace.html |replace|> on a\r\n% string array with that of\r\n% <https:\/\/www.mathworks.com\/help\/matlab\/ref\/strrep.html |strrep|> on a\r\n% cell array of character vectors. See how |replace| with a string array is\r\n% about 4X faster than |strrep| with a cell array.\r\nf1 = @() replace(eventsStr,'delay','late'); \r\nf2 = @() strrep(eventsCell,'delay','late'); \r\ntimeit(f1)\r\ntimeit(f2)\r\n\r\n%% Recommendations on Text Type\r\n% So, should you use string arrays for all your text? Maybe not yet. MATLAB\r\n% has three different ways to store text:\r\n% \r\n% * character vectors (|char|)\r\n% * string arrays (|string|)\r\n% * cell arrays of character vectors (|cell|)\r\n%\r\n% For now (as of R2017a), we encourage you to use string arrays to store\r\n% text data such as the transit events. We don't recommend using string\r\n% arrays elsewhere yet since string arrays aren't yet accepted everywhere\r\n% in MATLAB. Notice how I used a character vector for specifying the\r\n% filename in |readtable| and a cell array of character vectors for the\r\n% figure title.\r\n\r\n%% Looking to the Future\r\n% What about in the future? 
We feel string arrays provide a better\r\n% experience than character vectors and cell arrays of character vectors.\r\n% Our plan is to roll out broader use of string arrays over time.\r\n%\r\n% In the next few releases we will update more MATLAB functions and\r\n% properties to accept string arrays in addition to character vectors and\r\n% cell arrays of character vectors. As we do so, it will become easier for\r\n% you to use string arrays in more places.\r\n%\r\n% Next we will replace cell arrays of character vectors in MATLAB with\r\n% string arrays. Note that cell arrays themselves aren't going anywhere.\r\n% They are an important MATLAB container type and good for storing mixed\r\n% data types or arrays of jagged size among other uses. But we expect their\r\n% use for text data will diminish and become largely replaced by string\r\n% arrays which are more memory efficient and perform better for pure text\r\n% data.\r\n%\r\n% Beyond that, over time, we will use string arrays in new functions and\r\n% new properties in place of character vectors (but will continue returning\r\n% character vectors in many places for compatibility). We expect character\r\n% vectors will continue to live on for version-to-version code\r\n% compatibility and special use cases.\r\n%\r\n% Speaking of compatibility: we care deeply about version-to-version\r\n% compatibility of MATLAB code, today more than ever. So, we are taking the\r\n% following steps in our roll out of string arrays:\r\n%%\r\n% # Text manipulation functions (both old and new) return the text type\r\n% they are passed. This means you can opt-in to using string with these\r\n% functions (string use isn't necessary). Note how I used |split| and\r\n% |join| above with either string arrays or cell arrays of character\r\n% vectors. \r\n% # We are recommending string arrays today for text data applications.\r\n% Here there are ways to opt-in to string use. 
In the example, I opted to\r\n% get a string array from |readtable| using the |TextType| name-value pair.\r\n% And string arrays were returned from functions like |extractBefore|\r\n% because I passed a string array as input.\r\n% # We added curly brace indexing to string arrays which returns a\r\n% character vector for compatibility. Cell arrays return their contents\r\n% when you index with curly braces |{}|. Code that uses cell arrays of\r\n% character vectors usually indexes the array with curly braces to access\r\n% the character vector. Such code can work with string arrays since curly\r\n% brace indexing will also return a character vector. See how the following\r\n% code returns the same result whether |f| is a cell array or a string:\r\nd = datetime('now');\r\nf = {'h','m','s'};   % use a cell array\r\nfor n = 1:3,\r\n    d.Format = f{n};\r\n    disp(d)\r\nend\r\n\r\n%%\r\n%\r\n\r\nf = [\"h\",\"m\",\"s\"];   % use a string array\r\nfor n = 1:3,\r\n    d.Format = f{n};\r\n    disp(d)\r\nend\r\n\r\n%%\r\n% Expect to hear more from me on this topic. And please share your input\r\n% with us by leaving a comment below. We're interested to\r\n% <https:\/\/blogs.mathworks.com\/loren\/?p=2251#respond hear from you>.\r\n%\r\n% We hope string arrays will help you accomplish your goals and that the\r\n% steps we're taking provide a smooth adoption. 
If you haven't tried string\r\n% arrays yet, learn more from our documentation on\r\n% <https:\/\/www.mathworks.com\/help\/matlab\/characters-and-strings.html\r\n% characters and strings>.\r\n\r\n\r\n##### SOURCE END ##### 84aa8c4710c3480fa6b2a8d097849117\r\n-->","protected":false},"excerpt":{"rendered":"<div class=\"overview-image\"><img decoding=\"async\"  class=\"img-responsive\" src=\"https:\/\/blogs.mathworks.com\/images\/loren\/2017\/StringBlogV7_01.png\" onError=\"this.style.display ='none';\" \/><\/div><!--introduction--><p><i>I'd like to introduce today's guest blogger, Dave Bergstein, a MATLAB Product Manager at MathWorks. In today's post, Dave discusses recent updates to text processing with MATLAB.<\/i>... <a class=\"read-more\" href=\"https:\/\/blogs.mathworks.com\/loren\/2017\/04\/24\/working-with-text-in-matlab\/\">read more >><\/a><\/p>","protected":false},"author":39,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":[],"categories":[7,6,58,2],"tags":[],"_links":{"self":[{"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/posts\/2251"}],"collection":[{"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/users\/39"}],"replies":[{"embeddable":true,"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/comments?post=2251"}],"version-history":[{"count":5,"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/posts\/2251\/revisions"}],"predecessor-version":[{"id":2296,"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/posts\/2251\/revisions\/2296"}],"wp:attachment":[{"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/media?parent=2251"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blogs.mathworks.com\/loren\/wp-j
son\/wp\/v2\/categories?post=2251"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/tags?post=2251"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}