{"id":2041,"date":"2016-09-15T08:02:03","date_gmt":"2016-09-15T13:02:03","guid":{"rendered":"https:\/\/blogs.mathworks.com\/loren\/?p=2041"},"modified":"2020-07-28T16:39:35","modified_gmt":"2020-07-28T20:39:35","slug":"introducing-string-arrays","status":"publish","type":"post","link":"https:\/\/blogs.mathworks.com\/loren\/2016\/09\/15\/introducing-string-arrays\/","title":{"rendered":"Introducing String Arrays"},"content":{"rendered":"<div class=\"content\"><!--introduction--><p><a href=\"https:\/\/www.mathworks.com\/matlabcentral\/profile\/authors\/951521\">Toshi<\/a> is back for today's guest post. You may have seen <a href=\"https:\/\/blogs.mathworks.com\/loren\/2015\/04\/08\/can-you-find-love-through-text-analytics\/\">Toshi's earlier posts about text analytics<\/a> and he often deals with text in his data analysis. So he is very excited about new <a href=\"https:\/\/www.mathworks.com\/help\/matlab\/ref\/string.html\">string<\/a> arrays in R2016b.<\/p><p><img decoding=\"async\" vspace=\"5\" hspace=\"5\" src=\"https:\/\/blogs.mathworks.com\/images\/loren\/2016\/joy.jpg\" alt=\"\"> <\/p><p>One of the new features I love in R2016b is string arrays, which give you a new way to handle text in MATLAB in addition to the familiar <a href=\"https:\/\/www.mathworks.com\/help\/matlab\/characters-and-strings.html\">character arrays and cell arrays of character vectors<\/a>. String arrays are most helpful when dealing with text in your data. 
In today's post I walk through some practical examples with text data to demonstrate how to use string arrays.<\/p><!--\/introduction--><h3>Contents<\/h3><div><ul><li><a href=\"#824b047c-01ff-4ffc-9217-66fe8611b88a\">Analyzing Baby Name Trends<\/a><\/li><li><a href=\"#cfa87aae-e529-497d-9166-4a2954e51b03\">String Concatenation<\/a><\/li><li><a href=\"#1d2f69c8-fa3a-4691-8939-3624a3892e9e\">Combine Files to Create a Single Table<\/a><\/li><li><a href=\"#11bb5d99-66c6-4109-995b-f29f206e6a56\">String Comparison<\/a><\/li><li><a href=\"#97f1dde6-5ba8-441b-9642-297242830ea5\">Memory Use and Performance of String Arrays vs. Cell Arrays<\/a><\/li><li><a href=\"#b5b080b1-b0f2-40a8-a95d-7b1f08bbe1df\">Data Wrangling Example<\/a><\/li><li><a href=\"#5fd5415b-6a18-45d5-bf7b-6b7d3a3b85a5\">Fixing Typos or Inconsistent Labeling<\/a><\/li><li><a href=\"#353def77-a519-4592-9ff8-929a60baa406\">Find and Convert Substrings<\/a><\/li><li><a href=\"#df521f3b-f989-4dd8-823a-7e286b962f7c\">Tokenization<\/a><\/li><li><a href=\"#2696bbc8-48ae-4a51-995b-36e4bbef78e0\">Document Term Frequency Matrix<\/a><\/li><li><a href=\"#7b1e1d51-9534-42a8-8767-42a484ee8c6f\">Non-English Text<\/a><\/li><li><a href=\"#66a37579-8c75-4441-a6d7-31b90e3d8318\">Summary<\/a><\/li><\/ul><\/div><h4>Analyzing Baby Name Trends<a name=\"824b047c-01ff-4ffc-9217-66fe8611b88a\"><\/a><\/h4><p>Let's play with strings using the <a href=\"http:\/\/www.ssa.gov\/oact\/babynames\/limits.html\">baby names dataset<\/a> from the Social Security Administration. The data is stored in separate text files by year of birth from 1880 to 2015. 
Let's begin by previewing one of them.<\/p><pre class=\"codeinput\"><span class=\"keyword\">if<\/span> ~isdir(<span class=\"string\">'names'<\/span>)                                          <span class=\"comment\">% if |names| folder doesn't exist<\/span>\r\n    url = <span class=\"string\">'https:\/\/www.ssa.gov\/oact\/babynames\/names.zip'<\/span>;   <span class=\"comment\">% url of the zipped data file<\/span>\r\n    unzip(url,<span class=\"string\">'names'<\/span>)                                      <span class=\"comment\">% download and unzip data into |names| folder<\/span>\r\n<span class=\"keyword\">end<\/span>\r\ntbl1880 = readtable(<span class=\"string\">'names\/yob1880.txt'<\/span>);                   <span class=\"comment\">% read the first file<\/span>\r\nvars = {<span class=\"string\">'name'<\/span>,<span class=\"string\">'sex'<\/span>,<span class=\"string\">'births'<\/span>};                             <span class=\"comment\">% column names<\/span>\r\ntbl1880.Properties.VariableNames = vars;                    <span class=\"comment\">% add column names<\/span>\r\ndisp(tbl1880(1:5,:))                                        <span class=\"comment\">% preview 5 rows<\/span>\r\n<\/pre><pre class=\"codeoutput\">       name        sex    births\r\n    ___________    ___    ______\r\n    'Mary'         'F'    7065  \r\n    'Anna'         'F'    2604  \r\n    'Emma'         'F'    2003  \r\n    'Elizabeth'    'F'    1939  \r\n    'Minnie'       'F'    1746  \r\n<\/pre><h4>String Concatenation<a name=\"cfa87aae-e529-497d-9166-4a2954e51b03\"><\/a><\/h4><p>Rather than loading each file into a separate table, we would like to create a single table that spans all the years available. The files are named with a convention: 'yob' + year + '.txt', which we can use to generate the file paths. 
With a string array, we can take advantage of array expansion to generate the list of filenames.<\/p><pre class=\"codeinput\">years = 1880:2015;                                          <span class=\"comment\">% vector of years in double<\/span>\r\nfilepaths = string(<span class=\"string\">'names\/yob'<\/span>) + years + <span class=\"string\">'.txt'<\/span>;           <span class=\"comment\">% concatenate string with numbers<\/span>\r\nfilepaths(1:3)                                              <span class=\"comment\">% indexing into the first 3 elements<\/span>\r\n<\/pre><pre class=\"codeoutput\">ans = \r\n  1&times;3 string array\r\n    \"names\/yob1880.txt\"    \"names\/yob1881.txt\"    \"names\/yob1882.txt\"\r\n<\/pre><h4>Combine Files to Create a Single Table<a name=\"1d2f69c8-fa3a-4691-8939-3624a3892e9e\"><\/a><\/h4><p>Let's create a single table that spans all the years available. Note that we need to use <tt><a href=\"https:\/\/www.mathworks.com\/help\/matlab\/ref\/char.html\">char<\/a><\/tt> to convert the individual filename strings to character vectors for use with <tt><a href=\"https:\/\/www.mathworks.com\/help\/matlab\/ref\/readtable.html\">readtable<\/a><\/tt>. We'll set the <tt>readtable<\/tt> parameter <tt><a href=\"https:\/\/www.mathworks.com\/help\/matlab\/ref\/table.html#namevaluepairs\">TextType<\/a><\/tt> to <tt>'string'<\/tt> so that the text data is read into the table as string arrays. 
When you preview the first five rows, notice that text is surrounded by double quotes rather than single quotes, which indicates they are represented as string arrays.<\/p><pre class=\"codeinput\">names = cell(length(years), 1);                             <span class=\"comment\">% accumulator<\/span>\r\n<span class=\"keyword\">for<\/span> ii = 1:length(years)                                    <span class=\"comment\">% for each year<\/span>\r\n    names{ii} = readtable(char(filepaths(ii)), <span class=\"keyword\">...<\/span><span class=\"comment\">          % read individual files<\/span>\r\n        <span class=\"string\">'ReadVariableNames'<\/span>, false, <span class=\"keyword\">...<\/span><span class=\"comment\">                     % into separate tables<\/span>\r\n        <span class=\"string\">'TextType'<\/span>,<span class=\"string\">'string'<\/span>);                               <span class=\"comment\">% with text in string arrays<\/span>\r\n    names{ii}.Properties.VariableNames = vars;              <span class=\"comment\">% add column names<\/span>\r\n    names{ii}.year = repmat(years(ii), <span class=\"keyword\">...<\/span><span class=\"comment\">                  % add |year| column<\/span>\r\n        height(names{ii}), 1);\r\n<span class=\"keyword\">end<\/span>\r\nnames = vertcat(names{:});                                  <span class=\"comment\">% concatenate tables<\/span>\r\ndisp(names(1:5,:))                                          <span class=\"comment\">% preview 5 rows<\/span>\r\n<\/pre><pre class=\"codeoutput\">       name        sex    births    year\r\n    ___________    ___    ______    ____\r\n    \"Mary\"         \"F\"    7065      1880\r\n    \"Anna\"         \"F\"    2604      1880\r\n    \"Emma\"         \"F\"    2003      1880\r\n    \"Elizabeth\"    \"F\"    1939      1880\r\n    \"Minnie\"       \"F\"    1746      1880\r\n<\/pre><h4>String Comparison<a name=\"11bb5d99-66c6-4109-995b-f29f206e6a56\"><\/a><\/h4><p>Let's plot how the 
popularity of the names 'Jack' and 'Emily' has changed over time. With string arrays, you can simply use the <tt><a href=\"https:\/\/www.mathworks.com\/help\/matlab\/ref\/eq.html\">==<\/a><\/tt> operator for comparison. This makes our code clearer than using <tt><a href=\"https:\/\/www.mathworks.com\/help\/matlab\/ref\/strcmp.html\">strcmp<\/a><\/tt>. We can observe that Emily has seen a surge of popularity in recent years and Jack is staging a comeback.<\/p><pre class=\"codeinput\">Jack = names(names.name == <span class=\"string\">'Jack'<\/span>, :);                      <span class=\"comment\">% rows named 'Jack' only<\/span>\r\nEmily = names(names.name == <span class=\"string\">'Emily'<\/span>, :);                    <span class=\"comment\">% rows named 'Emily' only<\/span>\r\nEmily = Emily(Emily.sex == <span class=\"string\">'F'<\/span>, :);                         <span class=\"comment\">% just girls<\/span>\r\nJack = Jack(Jack.sex == <span class=\"string\">'M'<\/span>, :);                            <span class=\"comment\">% just boys<\/span>\r\nfigure                                                      <span class=\"comment\">% new figure<\/span>\r\nplot(Jack.year, Jack.births);                               <span class=\"comment\">% plot Jack<\/span>\r\nhold <span class=\"string\">on<\/span>                                                     <span class=\"comment\">% don't overwrite<\/span>\r\nplot(Emily.year, Emily.births);                             <span class=\"comment\">% plot Emily<\/span>\r\nhold <span class=\"string\">off<\/span>                                                    <span class=\"comment\">% enable overwrite<\/span>\r\ntitle(<span class=\"string\">'Baby Name Popularity'<\/span>);                              <span class=\"comment\">% add title<\/span>\r\nxlabel(<span class=\"string\">'year'<\/span>); ylabel(<span class=\"string\">'births'<\/span>);                           <span class=\"comment\">% add axis 
labels<\/span>\r\nlegend(<span class=\"string\">'Jack'<\/span>, <span class=\"string\">'Emily'<\/span>, <span class=\"string\">'Location'<\/span>, <span class=\"string\">'NorthWest'<\/span>)            <span class=\"comment\">% add legend<\/span>\r\n<\/pre><img decoding=\"async\" vspace=\"5\" hspace=\"5\" src=\"https:\/\/blogs.mathworks.com\/images\/loren\/2016\/introducing_string2_01.png\" alt=\"\"> <h4>Memory Use and Performance of String Arrays vs. Cell Arrays<a name=\"97f1dde6-5ba8-441b-9642-297242830ea5\"><\/a><\/h4><p>Let's consider the impact string arrays have on memory usage.<\/p><pre class=\"codeinput\">namesString = names.name;                                   <span class=\"comment\">% this is string<\/span>\r\nnamesCellAr = cellstr(namesString);                         <span class=\"comment\">% convert to cellstr<\/span>\r\nwhos(<span class=\"string\">'namesString'<\/span>, <span class=\"string\">'namesCellAr'<\/span>)                          <span class=\"comment\">% check size and type<\/span>\r\n<\/pre><pre class=\"codeoutput\">  Name                   Size                Bytes  Class     Attributes\r\n\r\n  namesCellAr      1858689x1             231124058  cell                \r\n  namesString      1858689x1             120288006  string              \r\n\r\n<\/pre><p>The string array uses about half the memory of the cell array of character vectors in this case. The memory savings depend on the array data and size, and are most pronounced for arrays with many elements, like this one.<\/p><p>In most cases, you can also achieve better performance when you use string arrays with the new string manipulation methods. <tt><a href=\"https:\/\/www.mathworks.com\/help\/matlab\/ref\/replace.html\">replace<\/a><\/tt> is a new string method which you can often use in place of <tt><a href=\"https:\/\/www.mathworks.com\/help\/matlab\/ref\/strrep.html\">strrep<\/a><\/tt> for replacing substrings of text. 
Notice the performance difference:<\/p><pre class=\"codeinput\">tic, strrep(namesCellAr,<span class=\"string\">'Joey'<\/span>,<span class=\"string\">'Joe'<\/span>); toc,                 <span class=\"comment\">% time strrep operation<\/span>\r\ntic, replace(namesString,<span class=\"string\">'Joey'<\/span>,<span class=\"string\">'Joe'<\/span>); toc,                <span class=\"comment\">% time replace operation<\/span>\r\n<\/pre><pre class=\"codeoutput\">Elapsed time is 0.807283 seconds.\r\nElapsed time is 0.409385 seconds.\r\n<\/pre><h4>Data Wrangling Example<a name=\"b5b080b1-b0f2-40a8-a95d-7b1f08bbe1df\"><\/a><\/h4><p><a title=\"http:\/\/schoolofdata.org\/ (link no longer works)\">School of Data<\/a> hosts <a href=\"http:\/\/datahub.io\/dataset\/grain-landgrab-data\/resource\/af57b7b2-f4e7-4942-88d3-83912865d116\">GRAIN landgrab data<\/a> collected by an NGO. It is a typical messy dataset that requires some cleaning.<\/p><pre class=\"codeinput\"><span class=\"keyword\">if<\/span> exist(<span class=\"string\">'grain.xls'<\/span>, <span class=\"string\">'file'<\/span>) ~= 2                          <span class=\"comment\">% if file doesn't exist<\/span>\r\n    url = <span class=\"string\">'https:\/\/commondatastorage.googleapis.com\/ckannet-storage\/2012-08-14T085537\/GRAIN---Land-grab-deals---Jan-2012.xls'<\/span>;\r\n    websave(<span class=\"string\">'grain.xls'<\/span>, url);                              <span class=\"comment\">% save file from the web<\/span>\r\n<span class=\"keyword\">end<\/span>\r\ndata = readtable(<span class=\"string\">'grain.xls'<\/span>, <span class=\"string\">'Range'<\/span>, <span class=\"string\">'A2:I417'<\/span>, <span class=\"keyword\">...<\/span><span class=\"comment\">       % load data from file<\/span>\r\n    <span class=\"string\">'ReadVariableNames'<\/span>, false, <span class=\"string\">'TextType'<\/span>, <span class=\"string\">'string'<\/span>);\r\n<\/pre><h4>Fixing Typos or Inconsistent Labeling<a 
name=\"5fd5415b-6a18-45d5-bf7b-6b7d3a3b85a5\"><\/a><\/h4><p>One common data cleaning issue is dealing with typos or inconsistent labeling. Let's take an example from the table column <tt>Landgrabber<\/tt>, which contains entity names. You see two spelling variants for the same entity.<\/p><pre class=\"codeinput\">entities = string(data.(2));                               <span class=\"comment\">% entity as a string array<\/span>\r\nentities([18,350])                                         <span class=\"comment\">% subset<\/span>\r\n<\/pre><pre class=\"codeoutput\">ans = \r\n  2&times;1 string array\r\n    \"Almarai Co\"\r\n    \"Almarai Co.\"\r\n<\/pre><p>String arrays provide a variety of methods for efficiently manipulating text values, particularly when working with lots of text data. Here we'll use <tt><a href=\"https:\/\/www.mathworks.com\/help\/matlab\/ref\/endswith.html\">endsWith<\/a><\/tt> to find the names missing a period after 'Co'.<\/p><pre class=\"codeinput\">isCo = endsWith(entities,<span class=\"string\">'Co'<\/span>);                            <span class=\"comment\">% find all that end with 'Co'<\/span>\r\nentities(isCo) = entities(isCo) + <span class=\"string\">'.'<\/span>;                     <span class=\"comment\">% add period<\/span>\r\nentities(isCo)                                             <span class=\"comment\">% check the result<\/span>\r\n<\/pre><pre class=\"codeoutput\">ans = \r\n  8&times;1 string array\r\n    \"Almarai Co.\"\r\n    \"Shaanxi Kingbull Livestock Co.\"\r\n    \"Foras International Investment Co.\"\r\n    \"Foras International Investment Co.\"\r\n    \"Foras International Investment Co.\"\r\n    \"Foras International Investment Co.\"\r\n    \"Foras International Investment Co.\"\r\n    \"Foras International Investment Co.\"\r\n<\/pre><h4>Find and Convert Substrings<a name=\"353def77-a519-4592-9ff8-929a60baa406\"><\/a><\/h4><p>The table column <tt>ProjectedInvestment<\/tt> contains dollar amounts in both millions 
and billions as text. Let's use the <tt><a href=\"https:\/\/www.mathworks.com\/help\/matlab\/ref\/contains.html\">contains<\/a><\/tt> method to find where millions and billions are used.<\/p><pre class=\"codeinput\">investment = data.(7);                                      <span class=\"comment\">% subset a column<\/span>\r\ninvestment(1:5)                                             <span class=\"comment\">% preview the first 5 rows<\/span>\r\nisMillion = contains(investment,<span class=\"string\">'million'<\/span>);                 <span class=\"comment\">% find rows that contain substring<\/span>\r\nisBillion = contains(investment,<span class=\"string\">'billion'<\/span>);                 <span class=\"comment\">% find rows that contain substring<\/span>\r\n<\/pre><pre class=\"codeoutput\">ans = \r\n  5&times;1 string array\r\n    \"\"\r\n    \"US$77 million\"\r\n    \"\"\r\n    \"US$30-35 million\"\r\n    \"US$200 million\"\r\n<\/pre><p>Let's use <tt><a href=\"https:\/\/www.mathworks.com\/help\/matlab\/ref\/regexp.html\">regexp<\/a><\/tt> to extract numbers. When a range like 'US$30-35 million' is given, we will use the first number. String arrays work with regular expressions just like cell arrays of character vectors. 
Lastly, we'll remove commas with <tt><a href=\"https:\/\/www.mathworks.com\/help\/matlab\/ref\/replace.html\">replace<\/a><\/tt>.<\/p><pre class=\"codeinput\">pattern = <span class=\"string\">'\\d+\\.?,?\\d*'<\/span>;                                    <span class=\"comment\">% regex pattern<\/span>\r\nnum = regexp(investment, pattern, <span class=\"string\">'match'<\/span>, <span class=\"string\">'once'<\/span>);         <span class=\"comment\">% extract first matches<\/span>\r\nnum = replace(num, <span class=\"string\">','<\/span>, <span class=\"string\">''<\/span>);                                <span class=\"comment\">% remove commas<\/span>\r\nnum(1:5)                                                    <span class=\"comment\">% preview the first 5 rows<\/span>\r\n<\/pre><pre class=\"codeoutput\">ans = \r\n  5&times;1 string array\r\n    &lt;missing&gt;\r\n    \"77\"\r\n    &lt;missing&gt;\r\n    \"30\"\r\n    \"200\"\r\n<\/pre><p>Notice that this regular expression call created <tt>&lt;missing&gt;<\/tt> values where it didn't find matches. This is the string equivalent of <tt>NaN<\/tt> in numeric arrays. We don't need to treat these missing values differently here since they will convert to <tt>NaN<\/tt> when we cast the values to double, which is what we want. 
After casting to double, we'll scale each value by a million or a billion so that all amounts are in US dollars.<\/p><pre class=\"codeinput\">num = double(num);                                          <span class=\"comment\">% convert to double<\/span>\r\nnum(isMillion) = num(isMillion) * 10^6;                     <span class=\"comment\">% adjust the unit<\/span>\r\nnum(isBillion) = num(isBillion) * 10^9;                     <span class=\"comment\">% adjust the unit<\/span>\r\nnum(1:5)                                                    <span class=\"comment\">% preview the first 5 rows<\/span>\r\n<\/pre><pre class=\"codeoutput\">ans =\r\n         NaN\r\n    77000000\r\n         NaN\r\n    30000000\r\n   200000000\r\n<\/pre><p>Now let's plot the result as a histogram.<\/p><pre class=\"codeinput\">figure                                                      <span class=\"comment\">% new figure<\/span>\r\nhistogram(num)                                              <span class=\"comment\">% plot histogram<\/span>\r\nxlabel(<span class=\"string\">'Projected Investment (US$)'<\/span>)                        <span class=\"comment\">% add x-axis label<\/span>\r\nylabel(<span class=\"string\">'Count of Projects'<\/span>)                                 <span class=\"comment\">% add y-axis label<\/span>\r\ntitle(<span class=\"string\">'GRAIN Land Grab Dataset'<\/span>)                            <span class=\"comment\">% add title<\/span>\r\n<\/pre><img decoding=\"async\" vspace=\"5\" hspace=\"5\" src=\"https:\/\/blogs.mathworks.com\/images\/loren\/2016\/introducing_string2_02.png\" alt=\"\"> <h4>Tokenization<a name=\"df521f3b-f989-4dd8-823a-7e286b962f7c\"><\/a><\/h4><p>One of the common approaches in text analytics is to count the occurrences of words. We can tokenize the whole string array and generate a list of unique words as a dictionary. 
Check out how string arrays seamlessly work with familiar functions like <tt><a href=\"https:\/\/www.mathworks.com\/help\/matlab\/ref\/lower.html\">lower<\/a><\/tt>, <tt><a href=\"https:\/\/www.mathworks.com\/help\/matlab\/ref\/ismember.html\">ismember<\/a><\/tt> or <tt><a href=\"https:\/\/www.mathworks.com\/help\/matlab\/ref\/unique.html\">unique<\/a><\/tt>. We can also use new functions like <tt><a href=\"https:\/\/www.mathworks.com\/help\/matlab\/ref\/erase.html\">erase<\/a><\/tt>. To standardize the word form we will deal with things like plurals and conjugations using the legacy <a href=\"http:\/\/tartarus.org\/martin\/PorterStemmer\/\">Porter Stemmer<\/a> code. It takes character vectors, so we'll need to convert strings to character vectors with <tt>char<\/tt> when we use it.<\/p><pre class=\"codeinput\">summary = data.(9);                                         <span class=\"comment\">% extract Summary<\/span>\r\ndelimiters = {<span class=\"string\">' '<\/span>,<span class=\"string\">','<\/span>,<span class=\"string\">'.'<\/span>,<span class=\"string\">'-'<\/span>,<span class=\"string\">'\"'<\/span>,<span class=\"string\">'%'<\/span>,<span class=\"string\">'('<\/span>,<span class=\"string\">')'<\/span>,<span class=\"string\">'&amp;'<\/span>,<span class=\"string\">'\/'<\/span>,<span class=\"string\">'$'<\/span>}; <span class=\"comment\">% characters to split with<\/span>\r\nstopwordsURL =<span class=\"string\">'http:\/\/www.textfixer.com\/resources\/common-english-words.txt'<\/span>;\r\nstopWords = urlread(stopwordsURL);                          <span class=\"comment\">% read stop words<\/span>\r\nstopWords = split(string(stopWords),<span class=\"string\">','<\/span>);                   <span class=\"comment\">% split stop words<\/span>\r\nstemmer_url = <span class=\"string\">'http:\/\/tartarus.org\/martin\/PorterStemmer\/matlab.txt'<\/span>;\r\n<span class=\"keyword\">if<\/span> exist(<span class=\"string\">'porterStemmer.m'<\/span>, <span 
class=\"string\">'file'<\/span>) ~= 2                    <span class=\"comment\">% if file doesn't exist<\/span>\r\n    websave(<span class=\"string\">'porterStemmer.txt'<\/span>,stemmer_url);               <span class=\"comment\">% save file from the web<\/span>\r\n    movefile(<span class=\"string\">'porterStemmer.txt'<\/span>,<span class=\"string\">'porterStemmer.m'<\/span>,<span class=\"string\">'f'<\/span>)     <span class=\"comment\">% rename file<\/span>\r\n<span class=\"keyword\">end<\/span>\r\ntokens = cell(size(summary));                               <span class=\"comment\">% cell array as accumulator<\/span>\r\n<span class=\"keyword\">for<\/span> ii = 1:length(summary)                                  <span class=\"comment\">% for each row in summary<\/span>\r\n    s = split(summary(ii), delimiters)';                    <span class=\"comment\">% split content by delimiters<\/span>\r\n    s = lower(s);                                           <span class=\"comment\">% use lowercase<\/span>\r\n    s = regexprep(s, <span class=\"string\">'[0-9]+'<\/span>,<span class=\"string\">''<\/span>);                          <span class=\"comment\">% remove numbers<\/span>\r\n    s(s == <span class=\"string\">''<\/span>) = [];                                        <span class=\"comment\">% remove empty strings<\/span>\r\n    s(ismember(s, stopWords)) = [];                         <span class=\"comment\">% remove stop words<\/span>\r\n    s = erase(s,<span class=\"string\">'''s'<\/span>);                                     <span class=\"comment\">% remove possessive s<\/span>\r\n    <span class=\"keyword\">for<\/span> jj = 1:length(s)                                    <span class=\"comment\">% for each word<\/span>\r\n        s(jj) = porterStemmer(char(s(jj)));                 <span class=\"comment\">% get the word stem<\/span>\r\n    <span class=\"keyword\">end<\/span>\r\n    tokens{ii} = s;                                         <span class=\"comment\">% add 
to the accumulator<\/span>\r\n<span class=\"keyword\">end<\/span>\r\ndict = unique([tokens{:}]);                                 <span class=\"comment\">% dictionary of unique words<\/span>\r\n<\/pre><h4>Document Term Frequency Matrix<a name=\"2696bbc8-48ae-4a51-995b-36e4bbef78e0\"><\/a><\/h4><p>Now we can count the number of occurrences of all words in the dictionary across all rows by creating the document term frequency matrix.<\/p><pre class=\"codeinput\">DTM = zeros(length(tokens),length(dict));                   <span class=\"comment\">% accumulator<\/span>\r\n<span class=\"keyword\">for<\/span> ii = 1:length(tokens)                                   <span class=\"comment\">% loop over tokens<\/span>\r\n    [words,~,idx] = unique(tokens{ii});                     <span class=\"comment\">% get unique words<\/span>\r\n    wcounts = accumarray(idx, 1);                           <span class=\"comment\">% get word counts<\/span>\r\n    cols = ismember(dict, words);                           <span class=\"comment\">% find cols for words<\/span>\r\n    DTM(ii,cols) = wcounts;                                 <span class=\"comment\">% update DTM with word counts<\/span>\r\n<span class=\"keyword\">end<\/span>\r\n<\/pre><p>Let's plot the frequency of the top 20 stemmed words. 
We can do fancier analysis with the document term frequency matrix, such as text classification, sentiment analysis and text mining.<\/p><p>Check out my posts <a href=\"https:\/\/blogs.mathworks.com\/loren\/2015\/04\/08\/can-you-find-love-through-text-analytics\/\">Can You Find Love through Text Analytics?<\/a>, <a href=\"https:\/\/blogs.mathworks.com\/loren\/2014\/06\/04\/analyzing-twitter-with-matlab\/\">Analyzing Twitter with MATLAB<\/a>, <a href=\"https:\/\/blogs.mathworks.com\/loren\/2015\/09\/09\/text-mining-shakespeare-with-matlab\/\">Text Mining Shakespeare with MATLAB<\/a> and <a title=\"https:\/\/blogs.mathworks.com\/loren\/2016\/08\/08\/text-mining-machine-learning-research-papers-with-matlab\/ (link no longer works)\">Text Mining Machine Learning Research Papers with MATLAB<\/a> for more details.<\/p><pre class=\"codeinput\">[wc, ii] = sort(sum(DTM), <span class=\"string\">'descend'<\/span>);                       <span class=\"comment\">% sort dtm by word count<\/span>\r\nfigure                                                      <span class=\"comment\">% new figure<\/span>\r\nbar(wc(1:20))                                               <span class=\"comment\">% plot the top 20<\/span>\r\nax = gca;                                                   <span class=\"comment\">% get current axes handle<\/span>\r\nax.XTick = 1:20;                                            <span class=\"comment\">% set x axis tick<\/span>\r\nax.XTickLabel = dict(ii(1:20));                             <span class=\"comment\">% label x axis tick<\/span>\r\nax.XTickLabelRotation = 90;                                 <span class=\"comment\">% rotate x axis tick label<\/span>\r\nxlim([0 21])                                                <span class=\"comment\">% set x axis limits<\/span>\r\nxlabel(<span class=\"string\">'Words'<\/span>)                                             <span class=\"comment\">% add x axis label<\/span>\r\nylabel(<span class=\"string\">'Total Word 
Count'<\/span>)                                  <span class=\"comment\">% add y axis label<\/span>\r\ntitle(<span class=\"string\">'Top 20 Words in Summary Column'<\/span>)                     <span class=\"comment\">% add title<\/span>\r\n<\/pre><img decoding=\"async\" vspace=\"5\" hspace=\"5\" src=\"https:\/\/blogs.mathworks.com\/images\/loren\/2016\/introducing_string2_03.png\" alt=\"\"> <h4>Non-English Text<a name=\"7b1e1d51-9534-42a8-8767-42a484ee8c6f\"><\/a><\/h4><p>Let's not forget that we often deal with non-English text, especially if the data source is from the internet such as Twitter. Let's load the sample data from <a href=\"https:\/\/blogs.mathworks.com\/images\/loren\/2016\/non_english.xlsx\">an Excel file<\/a> that contains text in French, Italian, German, Spanish, Chinese, Japanese and Korean (so-called <a href=\"https:\/\/en.wikipedia.org\/wiki\/FIGS\">FIGS<\/a> and <a href=\"https:\/\/en.wikipedia.org\/wiki\/CJK_characters\">CJK<\/a> text). Chinese and Japanese words for \"string\" seem to share a common character. 
Let's confirm this using <tt>contains<\/tt>, which we saw earlier.<\/p><pre class=\"codeinput\">[~, ~, nonenglish] = xlsread(<span class=\"string\">'non_english.xlsx'<\/span>);           <span class=\"comment\">% load text from Excel<\/span>\r\nnonenglish = string(nonenglish);                            <span class=\"comment\">% convert to string<\/span>\r\ndisp(nonenglish(:,[1,6:7]))                                 <span class=\"comment\">% preview 3 columns<\/span>\r\ncj_string = nonenglish(3,6:7);                              <span class=\"comment\">% \"string\" in Chinese and Japanese<\/span>\r\ncontains(cj_string(1), cj_string{2}(2))                     <span class=\"comment\">% is Japanese char in Chinese text?<\/span>\r\n<\/pre><pre class=\"codeoutput\">    \"English\"                 \"Chinese\"           \"Japanese\"            \r\n    \"English\"                 \"&#20013;&#25991;\"              \"&#26085;&#26412;&#35486;\"               \r\n    \"string\"                  \"&#23383;&#31526;&#20018;\"             \"&#25991;&#23383;&#21015;\"               \r\n    \"Do you speak MATLAB?\"    \"&#20320;&#20250;&#35828;MATLAB&#21527;?\"    \" &#12354;&#12394;&#12383;&#12399;MATLAB&#12434;&#35441;...\"\r\nans =\r\n  logical\r\n   1\r\n<\/pre><p>Korean, like English, uses white space to separate words. 
We can use that to split a string into tokens.<\/p><pre class=\"codeinput\">kr_string = nonenglish(4,8);                                <span class=\"comment\">% \"Do you speak MATLAB\" in Korean<\/span>\r\nsplit(kr_string)                                            <span class=\"comment\">% split string by whitespace<\/span>\r\n<\/pre><pre class=\"codeoutput\">ans = \r\n  3&times;1 string array\r\n    \"&#45817;&#49888;&#51008;\"\r\n    \"MATLAB&#51012;\"\r\n    \"&#47568;&#54633;&#45768;&#44620;?\"\r\n<\/pre><p>For languages that do not use whitespace characters to separate words, we would need to use specialized tools to split words, such as the Japanese morphological analyzer MeCab, discussed in my post <a href=\"https:\/\/blogs.mathworks.com\/loren\/2015\/04\/08\/can-you-find-love-through-text-analytics\/\">Can You Find Love through Text Analytics?<\/a>. You can find more details about how to use it with MATLAB in <a href=\"https:\/\/www.mathworks.com\/matlabcentral\/fileexchange\/53-shift\">this File Exchange entry<\/a>.<\/p><h4>Summary<a name=\"66a37579-8c75-4441-a6d7-31b90e3d8318\"><\/a><\/h4><p>String arrays are a useful new data type for working with text data. String arrays behave more like numeric arrays, can make code more readable, and are more efficient for storing text data and performing string manipulations.<\/p><p>Try using string arrays with your text data instead of cell arrays of character vectors. Doing so will make your code clearer and more concise, especially if you take advantage of the new functions. You can also avoid the need for <tt><a href=\"https:\/\/www.mathworks.com\/help\/matlab\/ref\/cellfun.html\">cellfun<\/a><\/tt> with function handles and the <tt>UniformOutput<\/tt> flag that cell arrays of character vectors often require.<\/p><p>Obviously I am pretty excited about string arrays. 
Play with string arrays and let us know what you think <a href=\"https:\/\/blogs.mathworks.com\/loren\/?p=2041#respond\">here<\/a>!<\/p><script language=\"JavaScript\"> <!-- \r\n    function grabCode_587ed24299b94d289f758b936ddc17a9() {\r\n        \/\/ Remember the title so we can use it in the new page\r\n        title = document.title;\r\n\r\n        \/\/ Break up these strings so that their presence\r\n        \/\/ in the Javascript doesn't mess up the search for\r\n        \/\/ the MATLAB code.\r\n        t1='587ed24299b94d289f758b936ddc17a9 ' + '##### ' + 'SOURCE BEGIN' + ' #####';\r\n        t2='##### ' + 'SOURCE END' + ' #####' + ' 587ed24299b94d289f758b936ddc17a9';\r\n    \r\n        b=document.getElementsByTagName('body')[0];\r\n        i1=b.innerHTML.indexOf(t1)+t1.length;\r\n        i2=b.innerHTML.indexOf(t2);\r\n \r\n        code_string = b.innerHTML.substring(i1, i2);\r\n        code_string = code_string.replace(\/REPLACE_WITH_DASH_DASH\/g,'--');\r\n\r\n        \/\/ Use \/x3C\/g instead of the less-than character to avoid errors \r\n        \/\/ in the XML parser.\r\n        \/\/ Use '\\x26#60;' instead of '<' so that the XML parser\r\n        \/\/ doesn't go ahead and substitute the less-than character. 
\r\n        code_string = code_string.replace(\/\\x3C\/g, '\\x26#60;');\r\n\r\n        copyright = 'Copyright 2016 The MathWorks, Inc.';\r\n\r\n        w = window.open();\r\n        d = w.document;\r\n        d.write('<pre>\\n');\r\n        d.write(code_string);\r\n\r\n        \/\/ Add copyright line at the bottom if specified.\r\n        if (copyright.length > 0) {\r\n            d.writeln('');\r\n            d.writeln('%%');\r\n            if (copyright.length > 0) {\r\n                d.writeln('% _' + copyright + '_');\r\n            }\r\n        }\r\n\r\n        d.write('<\/pre>\\n');\r\n\r\n        d.title = title + ' (MATLAB code)';\r\n        d.close();\r\n    }   \r\n     --> <\/script><p style=\"text-align: right; font-size: xx-small; font-weight:lighter;   font-style: italic; color: gray\"><br><a href=\"javascript:grabCode_587ed24299b94d289f758b936ddc17a9()\"><span style=\"font-size: x-small;        font-style: italic;\">Get \r\n      the MATLAB code <noscript>(requires JavaScript)<\/noscript><\/span><\/a><br><br>\r\n      Published with MATLAB&reg; R2016b<br><\/p><\/div><!--\r\n587ed24299b94d289f758b936ddc17a9 ##### SOURCE BEGIN #####\r\n%% Introducing String Arrays\r\n% <https:\/\/www.mathworks.com\/matlabcentral\/profile\/authors\/951521 Toshi> is\r\n% back for today's guest post. You may have seen\r\n% <https:\/\/blogs.mathworks.com\/loren\/2015\/04\/08\/can-you-find-love-through-text-analytics\/\r\n% Toshi's earlier posts about text analytics> and he often deals with text\r\n% in his data analysis. So he is very excited about new\r\n% <https:\/\/www.mathworks.com\/help\/matlab\/ref\/string.html string> arrays in\r\n% R2016b.\r\n% \r\n% <<joy.jpg>>\r\n% \r\n% One of the new features I love in R2016b is string arrays, which give you\r\n% a new way to handle text in MATLAB in addition to the familiar\r\n% <https:\/\/www.mathworks.com\/help\/matlab\/characters-and-strings.html\r\n% character arrays and cell arrays of character vectors>. 
String arrays are\r\n% most helpful when dealing with text in your data. In today's post I walk\r\n% through some practical examples with text data to demonstrate how to use\r\n% string arrays.\r\n%\r\n%% Analyzing Baby Name Trends\r\n% Let's play with strings using the\r\n% <http:\/\/www.ssa.gov\/oact\/babynames\/limits.html baby names dataset> from\r\n% Social Security Administration. The data is stored in separate text files\r\n% by year of birth from 1880 to 2015. Let's begin by previewing one of\r\n% them.\r\n\r\nif ~isdir('names')                                          % if |names| folder doesn't exist\r\n    url = 'https:\/\/www.ssa.gov\/oact\/babynames\/names.zip';   % url of the zipped data file\r\n    unzip(url,'names')                                      % download and unzip data into |names| folder\r\nend\r\ntbl1880 = readtable('names\/yob1880.txt');                   % read the first file\r\nvars = {'name','sex','births'};                             % column names\r\ntbl1880.Properties.VariableNames = vars;                    % add column names\r\ndisp(tbl1880(1:5,:))                                        % preview 5 rows\r\n\r\n%% String Concatenation\r\n% Rather than loading each file into a separate table, we would like to\r\n% create a single table that spans all the years available. The files are\r\n% named with a convention: 'yob' + year + '.txt', which we can use to\r\n% generate the file paths. With a string array, we can take advantage of\r\n% array expansion to generate the list of filenames.\r\n\r\nyears = 1880:2015;                                          % vector of years in double\r\nfilepaths = string('names\/yob') + years + '.txt';           % concatenate string with numbers\r\nfilepaths(1:3)                                              % indexing into the first 3 elements\r\n\r\n%% Combine Files to Create a Single Table\r\n% Let's create a single table that spans all the years available. 
Note that\r\n% we need to use |<https:\/\/www.mathworks.com\/help\/matlab\/ref\/char.html\r\n% char>| to convert the individual filename strings to character vectors\r\n% for use with |<https:\/\/www.mathworks.com\/help\/matlab\/ref\/readtable.html\r\n% readtable>|. We'll set the |readtable| parameter\r\n% |<https:\/\/www.mathworks.com\/help\/matlab\/ref\/table.html#namevaluepairs\r\n% TextType>| to |'string'| so that the text data is read into the table as\r\n% string arrays. When you preview the first five rows, notice that text is\r\n% surrounded by double quotes rather than single quotes, which indicates\r\n% that it is represented as string arrays.\r\n\r\nnames = cell(length(years), 1);                             % accumulator\r\nfor ii = 1:length(years)                                    % for each year\r\n    names{ii} = readtable(char(filepaths(ii)), ...          % read individual files \r\n        'ReadVariableNames', false, ...                     % into separate tables\r\n        'TextType','string');                               % with text in string arrays\r\n    names{ii}.Properties.VariableNames = vars;              % add column names\r\n    names{ii}.year = repmat(years(ii), ...                  % add |year| column\r\n        height(names{ii}), 1);  \r\nend\r\nnames = vertcat(names{:});                                  % concatenate tables\r\ndisp(names(1:5,:))                                          % preview 5 rows\r\n\r\n%% String Comparison\r\n% Let's plot how the popularity of the names 'Jack' and 'Emily' has\r\n% changed over time. With string arrays, you can simply use the\r\n% |<https:\/\/www.mathworks.com\/help\/matlab\/ref\/eq.html ==>| operator for\r\n% comparison. This makes our code clearer than using\r\n% |<https:\/\/www.mathworks.com\/help\/matlab\/ref\/strcmp.html strcmp>|. 
We can\r\n% observe that Emily has seen a surge of popularity in recent years and\r\n% Jack is staging a comeback.\r\n\r\nJack = names(names.name == 'Jack', :);                      % rows named 'Jack' only\r\nEmily = names(names.name == 'Emily', :);                    % rows named 'Emily' only\r\nEmily = Emily(Emily.sex == 'F', :);                         % just girls\r\nJack = Jack(Jack.sex == 'M', :);                            % just boys\r\nfigure                                                      % new figure\r\nplot(Jack.year, Jack.births);                               % plot Jack\r\nhold on                                                     % don't overwrite\r\nplot(Emily.year, Emily.births);                             % plot Emily\r\nhold off                                                    % enable overwrite\r\ntitle('Baby Name Popularity');                              % add title\r\nxlabel('year'); ylabel('births');                           % add axis labels\r\nlegend('Jack', 'Emily', 'Location', 'NorthWest')            % add legend\r\n\r\n%% Memory Use and Performance of String Arrays vs. Cell Arrays\r\n% Let's consider the impact string arrays have on memory usage.\r\n\r\nnamesString = names.name;                                   % this is string\r\nnamesCellAr = cellstr(namesString);                         % convert to cellstr\r\nwhos('namesString', 'namesCellAr')                          % check size and type\r\n\r\n%% \r\n% The string array uses about half the memory of the cell array of\r\n% character vectors in this case. 
The memory savings depends on the array\r\n% data and size and is pronounced for arrays with many elements like this\r\n% one.\r\n% \r\n% In most cases, you can also achieve better performance when you use\r\n% string arrays with new string manipulation methods.\r\n% |<https:\/\/www.mathworks.com\/help\/matlab\/ref\/replace.html replace>| is a\r\n% new string method which you can often use in place of\r\n% |<https:\/\/www.mathworks.com\/help\/matlab\/ref\/strrep.html strrep>| for\r\n% replacing substrings of text. Notice the performance difference:\r\n\r\ntic, strrep(namesCellAr,'Joey','Joe'); toc,                 % time strrep operation\r\ntic, replace(namesString,'Joey','Joe'); toc,                % time replace operation\r\n\r\n%% Data Wrangling Example\r\n% <http:\/\/schoolofdata.org\/ School of Data> hosts\r\n% <http:\/\/datahub.io\/dataset\/grain-landgrab-data\/resource\/af57b7b2-f4e7-4942-88d3-83912865d116\r\n% GRAIN landgrab data> collected by an NGO. It is a typical messy dataset\r\n% that requires some cleaning.\r\n\r\nif exist('grain.xls', 'file') ~= 2                          % if file doesn't exist\r\n    url = 'https:\/\/commondatastorage.googleapis.com\/ckannet-storage\/2012-08-14T085537\/GRAINREPLACE_WITH_DASH_DASH-Land-grab-dealsREPLACE_WITH_DASH_DASH-Jan-2012.xls';\r\n    websave('grain.xls', url);                              % save file from the web\r\nend\r\ndata = readtable('grain.xls', 'Range', 'A2:I417', ...       % load data from file\r\n    'ReadVariableNames', false, 'TextType', 'string');   \r\n\r\n%% Fixing Typos or Inconsistent Labeling\r\n% One common data cleaning issue is dealing with typos or inconsistent\r\n% labeling. Let's take an example from the table column |Landgrabber|,\r\n% which contains entity names. 
You see two spelling variants for the same\r\n% entity.\r\n\r\nentities = string(data.(2));                               % entity as a string array\r\nentities([18,350])                                         % subset\r\n%% \r\n% String arrays provide a variety of methods for efficiently manipulating\r\n% text values, particularly when working with lots of text data. Here we'll\r\n% use |<https:\/\/www.mathworks.com\/help\/matlab\/ref\/endswith.html endsWith>|\r\n% to find the names missing a period after 'Co'.\r\n\r\nisCo = endsWith(entities,'Co');                            % find all that end with 'Co'\r\nentities(isCo) = entities(isCo) + '.';                     % add period\r\nentities(isCo)                                             % check the result\r\n\r\n%% Find and Convert Substrings\r\n% The table column |ProjectedInvestment| contains dollar amounts in both\r\n% millions and billions as text. Let's use the\r\n% |<https:\/\/www.mathworks.com\/help\/matlab\/ref\/contains.html contains>|\r\n% method to find where millions and billions are used.\r\n\r\ninvestment = data.(7);                                      % subset a column\r\ninvestment(1:5)                                             % preview the first 5 rows\r\nisMillion = contains(investment,'million');                 % find rows that contain 'million'\r\nisBillion = contains(investment,'billion');                 % find rows that contain 'billion'\r\n%% \r\n% Let's use |<https:\/\/www.mathworks.com\/help\/matlab\/ref\/regexp.html regexp>|\r\n% to extract numbers. When a range like 'US$30-35 million' is given, we\r\n% will use the first number. String arrays work with regular expressions\r\n% just like cell arrays of character vectors. 
Lastly, we'll remove commas\r\n% with |<https:\/\/www.mathworks.com\/help\/matlab\/ref\/replace.html replace>|.\r\n\r\npattern = '\\d+\\.?,?\\d*';                                    % regex pattern\r\nnum = regexp(investment, pattern, 'match', 'once');         % extract first matches\r\nnum = replace(num, ',', '');                                % remove commas\r\nnum(1:5)                                                    % preview the first 5 rows\r\n%% \r\n% Notice that this regular expression call created |&lt;missing&gt;|\r\n% values when it didn't find matches. This is the string equivalent to\r\n% |NaN| in numeric arrays. We don't need to treat these missing values\r\n% differently here since these missing values will convert to |NaN| when we\r\n% cast the value to double, which is what we want. After casting to double,\r\n% we'll scale each value by a million or a billion to express it in dollars.\r\n\r\nnum = double(num);                                          % convert to double\r\nnum(isMillion) = num(isMillion) * 10^6;                     % millions to US$\r\nnum(isBillion) = num(isBillion) * 10^9;                     % billions to US$\r\nnum(1:5)                                                    % preview the first 5 rows\r\n\r\n%% \r\n% Now let's plot the result as a histogram.\r\n\r\nfigure                                                      % new figure\r\nhistogram(num)                                              % plot histogram\r\nxlabel('Projected Investment (US$)')                        % add x-axis label\r\nylabel('Count of Projects')                                 % add y-axis label\r\ntitle('GRAIN Land Grab Dataset')                            % add title\r\n\r\n%% Tokenization\r\n% One of the common approaches in text analytics is to count the\r\n% occurrences of words. We can tokenize the whole string array and generate\r\n% a list of unique words as a dictionary. 
Check out how string arrays\r\n% seamlessly work with familiar functions like\r\n% |<https:\/\/www.mathworks.com\/help\/matlab\/ref\/lower.html lower>|,\r\n% |<https:\/\/www.mathworks.com\/help\/matlab\/ref\/ismember.html ismember>| or\r\n% |<https:\/\/www.mathworks.com\/help\/matlab\/ref\/unique.html unique>|. We can\r\n% also use new functions like\r\n% |<https:\/\/www.mathworks.com\/help\/matlab\/ref\/erase.html erase>|. To\r\n% standardize the word form, we will deal with things like plurals and\r\n% conjugations using the legacy <http:\/\/tartarus.org\/martin\/PorterStemmer\/\r\n% Porter Stemmer> code. It takes character vectors, so we'll need to\r\n% convert strings to character vectors with |char| when we use it.\r\n\r\nsummary = data.(9);                                         % extract Summary\r\ndelimiters = {' ',',','.','-','\"','%','(',')','&','\/','$'}; % characters to split with\r\nstopwordsURL = 'http:\/\/www.textfixer.com\/resources\/common-english-words.txt';\r\nstopWords = urlread(stopwordsURL);                          % read stop words\r\nstopWords = split(string(stopWords),',');                   % split stop words\r\nstemmer_url = 'http:\/\/tartarus.org\/martin\/PorterStemmer\/matlab.txt';\r\nif exist('porterStemmer.m', 'file') ~= 2                    % if file doesn't exist\r\n    websave('porterStemmer.txt',stemmer_url);               % save file from the web\r\n    movefile('porterStemmer.txt','porterStemmer.m','f')     % rename file\r\nend\r\ntokens = cell(size(summary));                               % cell array as accumulator\r\nfor ii = 1:length(summary)                                  % for each row in summary\r\n    s = split(summary(ii), delimiters)';                    % split content by delimiters\r\n    s = lower(s);                                           % use lowercase\r\n    s = regexprep(s, '[0-9]+','');                          % remove numbers\r\n    s(s == '') = [];                                        % remove empty 
strings\r\n    s(ismember(s, stopWords)) = [];                         % remove stop words\r\n    s = erase(s,'''s');                                     % remove possessive s\r\n    for jj = 1:length(s)                                    % for each word\r\n        s(jj) = porterStemmer(char(s(jj)));                 % get the word stem\r\n    end\r\n    tokens{ii} = s;                                         % add to the accumulator    \r\nend\r\ndict = unique([tokens{:}]);                                 % dictionary of unique words\r\n\r\n%% Document Term Frequency Matrix\r\n% Now we can count the number of occurrences of all words in the dictionary\r\n% across all rows by creating the document term frequency matrix.\r\n\r\nDTM = zeros(length(tokens),length(dict));                   % accumulator\r\nfor ii = 1:length(tokens)                                   % loop over tokens\r\n    [words,~,idx] = unique(tokens{ii});                     % get unique words\r\n    wcounts = accumarray(idx, 1);                           % get word counts\r\n    cols = ismember(dict, words);                           % find cols for words\r\n    DTM(ii,cols) = wcounts;                                 % update DTM with word counts\r\nend\r\n\r\n%% \r\n% Let's plot the frequency of the top 20 stemmed words. 
We can do fancier\r\n% analysis with document term frequency matrix, such as text\r\n% classification, sentiment analysis and text mining.\r\n% \r\n% Check out my post\r\n% <https:\/\/blogs.mathworks.com\/loren\/2015\/04\/08\/can-you-find-love-through-text-analytics\/\r\n% Can You Find Love ","protected":false},"excerpt":{"rendered":"<div class=\"overview-image\"><img decoding=\"async\"  class=\"img-responsive\" src=\"https:\/\/blogs.mathworks.com\/images\/loren\/2016\/introducing_string2_03.png\" onError=\"this.style.display ='none';\" \/><\/div><!--introduction--><p><a href=\"https:\/\/www.mathworks.com\/matlabcentral\/profile\/authors\/951521\">Toshi<\/a> is back for today's guest post. You may have seen <a href=\"https:\/\/blogs.mathworks.com\/loren\/2015\/04\/08\/can-you-find-love-through-text-analytics\/\">Toshi's earlier posts about text analytics<\/a> and he often deals with text in his data analysis. So he is very excited about new <a href=\"https:\/\/www.mathworks.com\/help\/matlab\/ref\/string.html\">string<\/a> arrays in R2016b.... 
<a class=\"read-more\" href=\"https:\/\/blogs.mathworks.com\/loren\/2016\/09\/15\/introducing-string-arrays\/\">read more >><\/a><\/p>","protected":false},"author":39,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":[],"categories":[6,2],"tags":[],"_links":{"self":[{"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/posts\/2041"}],"collection":[{"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/users\/39"}],"replies":[{"embeddable":true,"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/comments?post=2041"}],"version-history":[{"count":4,"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/posts\/2041\/revisions"}],"predecessor-version":[{"id":3798,"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/posts\/2041\/revisions\/3798"}],"wp:attachment":[{"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/media?parent=2041"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/categories?post=2041"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/tags?post=2041"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}