Working with Text in MATLAB
I'd like to introduce today's guest blogger, Dave Bergstein, a MATLAB Product Manager at MathWorks. In today's post, Dave discusses recent updates to text processing with MATLAB.
In today's post I share a text processing example using the new string array and a collection of new text manipulation functions, both introduced in R2016b. I also give recommendations on when best to use string, char, or cell for text and share some of our thinking on the future.
Also be sure to check out Toshi's post Introducing String Arrays and Loren's post Singing the Praises of Strings.
Example: How Late Is My Bus?
My friend in New York City talks about the delays on her bus route. Let's look at some data to see what typical delays are for trains and buses. The Open NY Initiative shares data which includes over 150,000 public transit events spanning 6 years. I downloaded this data as a CSV file from: https://data.ny.gov/Transportation/511-NY-MTA-Events-Beginning-2010/i8wu-pqzv
Import the Data
I read the data into a table using readtable and specify the TextType name-value pair as string to read the text as string arrays.
data = readtable('511_NY_MTA_Events__Beginning_2010.csv','TextType','string');
Warning: Variable names were modified to make them valid MATLAB identifiers. The original names are saved in the VariableDescriptions property.
Here is a list of variables in the table:
data.Properties.VariableNames
ans =
  1×13 cell array
  Columns 1 through 4
    'EventType'    'OrganizationName'    'FacilityName'    'Direction'
  Columns 5 through 9
    'City'    'County'    'State'    'CreateTime'    'CloseTime'
  Columns 10 through 13
    'EventDescription'    'RespondingOrganiz…'    'Latitude'    'Longitude'
data.EventDescription is a string array which contains the event descriptions. Let's take a closer look at the events.
eventsStr = data.EventDescription;
Unlike character vectors or cell arrays of character vectors, each element of the string array is a string itself. See how I can index the string array just as I would a numeric array and get string arrays back.
eventsStr(1:3)
ans =
  3×1 string array
    "MTA NYC Transit Bus: due to Earlier flooding Q11 into Hamilton Beach normal service resumed"
    "MTA NYC Transit Bus: due to Construction, northbound M1 Bus area of 147th Street:Adam Clayton Powell Junior"
    "MTA NYC Transit Subway: due to Delays, Bronx Bound # 2 & 3 Lines at Nevins Street Station (Brooklyn)"
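To make the indexing difference concrete, here is a minimal sketch with made-up values (the transit words below are placeholders, not taken from the data set). It only assumes the string array and strlength behavior available since R2016b, plus the double-quote syntax from R2017a.

```matlab
% Hypothetical values for illustration only
s = ["bus" "subway" "rail"];     % 1x3 string array
first2 = s(1:2);                 % parentheses return a string array
n = strlength(s);                % number of characters in each element

c = {'bus','subway','rail'};     % the same text as a cell array
sub = c(1:2);                    % parentheses return a cell array, not text
```

With the cell array, you still need curly braces (for example c{1}) to reach the text itself, whereas each element of the string array already is the text.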
Many of the event descriptions report delays like 'operating 10 minutes late'. See for example how the 26-minute delay is reported in event 5180.
eventsStr(5180)
ans =
    "MTA Long Island Rail Road: due to Debris on tracks, westbound Montauk Branch between Montauk Station (Suffolk County) and Jamaica Station (Queens) The 6:44 AM from Montauk due Jamaica at 9:32 AM, is operating 26 minutes late due to an unauthorized vehicle on the tracks near Hampton Bays."
Identify Delays
I want to find all the events which contain ' late '. MATLAB R2016b also introduced more than a dozen new functions for working with text. These functions work with character vectors, cell arrays of character vectors, and string arrays. You can learn about these functions from the characters and strings page in our documentation.
I convert the text to all lowercase and determine which events contain ' late ' using the contains function.
eventsStr = lower(eventsStr);
idx = contains(eventsStr,' late ');
lateEvents = eventsStr(idx);
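The same three steps can be tried on a few made-up descriptions (hypothetical text, not from the data set). This sketch also shows why the spaces around ' late ' matter: without them, a word such as "related" would match too.

```matlab
% Hypothetical descriptions for illustration only
events = ["Train is operating 10 minutes LATE due to signal problems"
          "Normal service resumed"
          "Related event"];

events = lower(events);             % make the search case-insensitive
idx = contains(events,' late ');    % logical index, one entry per description
lateOnly = events(idx);             % keep only the matching descriptions

% Without the surrounding spaces, "related" also matches
idxLoose = contains(events,'late');
```

Note that the spaces also exclude descriptions where "late" ends a sentence (for example "late."), so this filter is conservative.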
Extract the Delay Times
I extract the minutes late from phrases like 'operating 10 minutes late' using the functions extractAfter and extractBefore.
Let's look at the first late event. The exact phrase we are seeking doesn't appear in this event. When we look for the text following 'operating' we get back a missing string.
lateEvents(1)
extractAfter(lateEvents(1),'operating')
ans =
    "mta long island rail road: due to delays, westbound babylon branch between speonk station (speonk) and new york penn station (manhattan) the 5:08 a.m. departure due ny @ 7:02 a.m. is 15 minutes late @ babylon."
ans =
    <missing>
Let's look at the second late event. This string contains the phrase 'operating 14 minutes late'. Extracting the text after 'operating' we get '14 minutes late due to signal problems'. Extracting the text before 'minutes late' we get back ' 14 ' which we can convert to a numeric value using double.
lateEvents(2)
s = extractAfter(lateEvents(2),'operating')
s = extractBefore(s,'minutes late')
minLate = double(s)
ans =
    "mta long island rail road: due to delays westbound ronkonkoma branch out of bethpage station (suffolk county) the 8:01 am train due into penn station at 8:47 am is operating 14 minutes late due to signal problems"
s =
    " 14 minutes late due to signal problems"
s =
    " 14 "
minLate =
    14
Success! We extracted the train delay from the event description. Now let's put this all together. I extract the minutes late from all the events and drop the missing values using the rmmissing function. I then convert the remaining values to numbers using double and plot a histogram of the results.
s = extractAfter(lateEvents,'operating');
s = extractBefore(s,'minutes late');
s = rmmissing(s);
minLate = double(s);
histogram(minLate,0:5:40)
ylabel('Number of Events')
xlabel('Minutes Late')
title({'Transit Delays','NY Metropolitan Transit Authority'})
It looks like reported delays are often 10-15 minutes. This simple routine captures many of the transit delays, but not all. The pattern doesn't always fit (consider again lateEvents(1)). I also left out any delays that may be reported in hours. Can you improve it?
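One possible improvement, offered as a sketch rather than a definitive answer, is to capture the digits directly in front of 'minutes late' with regexp, so the word 'operating' is no longer required. The sample descriptions below are shortened, hypothetical versions of the events shown earlier, and the pattern is an assumption about how delays are phrased. I convert to a cell array of character vectors first because regexp then returns one cell of tokens per description, which is easy to test for emptiness.

```matlab
% Hypothetical, shortened descriptions for illustration only
events = ["the 5:08 a.m. departure is 15 minutes late @ babylon."
          "the 8:01 am train is operating 14 minutes late due to signal problems"
          "normal service resumed"];

% Capture the digits that immediately precede "minutes late".
% With 'tokens' and 'once', each output cell holds the captured
% text for one description, or is empty when nothing matched.
tok = regexp(cellstr(events),'(\d+)\s+minutes late','tokens','once');

hasDelay = ~cellfun(@isempty,tok);                        % true where the pattern matched
minLate = cellfun(@(t) str2double(t{1}),tok(hasDelay));   % numeric delays in minutes
```

This picks up the first description, which the extractAfter approach missed. It still ignores delays reported in hours.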
Text as Data
String arrays are a great choice for text data like the example above because they are memory efficient and perform better than cell arrays of character vectors (previously known as cellstr).
Let's compare the memory usage. I convert the string array to a cell array of character vectors with the cellstr command and check the memory with whos. See the Bytes column - it shows the string array is about 12% more efficient.
eventsCell = cellstr(eventsStr);
whos events*
  Name              Size          Bytes  Class     Attributes

  eventsCell   151225x1        73208886  cell
  eventsStr    151225x1        64662486  string

The memory savings can be much greater for many smaller pieces of text. For example, suppose I want to store each word as a separate array element. First I join all 150,000 reports into a single long string using the join function. I then split this long string on spaces using the split function. The result is a string array storing over 4 million words in separate elements. Here the memory savings are nearly 2X.
wordsStr = split(join(eventsStr));
wordsCell = split(join(eventsCell));
whos words*
  Name              Size           Bytes  Class     Attributes

  wordsCell   4356256x1        535429652  cell
  wordsStr    4356256x1        284537656  string
String arrays also perform better. You can achieve the best performance using string arrays in combination with the text manipulation functions introduced in R2016b. Here I compare the performance of replace on a string array with that of strrep on a cell array of character vectors. See how replace with a string array is about 4X faster than strrep with a cell array.
f1 = @() replace(eventsStr,'delay','late');
f2 = @() strrep(eventsCell,'delay','late');
timeit(f1)
timeit(f2)
ans =
    0.062507
ans =
    0.23239
Recommendations on Text Type
So, should you use string arrays for all your text? Maybe not yet. MATLAB has three different ways to store text:
- character vectors (char)
- string arrays (string)
- cell arrays of character vectors (cell)
For now (as of R2017a), we encourage you to use string arrays to store text data such as the transit events. We don’t recommend using string arrays elsewhere yet since string arrays aren’t yet accepted everywhere in MATLAB. Notice how I used a character vector for specifying the filename in readtable and a cell array of character vectors for the figure title.
Looking to the Future
What about in the future? We feel string arrays provide a better experience than character vectors and cell arrays of character vectors. Our plan is to roll out broader use of string arrays over time.
In the next few releases we will update more MATLAB functions and properties to accept string arrays in addition to character vectors and cell arrays of character vectors. As we do so, it will become easier for you to use string arrays in more places.
Next we will replace cell arrays of character vectors in MATLAB with string arrays. Note that cell arrays themselves aren't going anywhere. They are an important MATLAB container type and good for storing mixed data types or arrays of jagged size among other uses. But we expect their use for text data will diminish and become largely replaced by string arrays which are more memory efficient and perform better for pure text data.
Beyond that, over time, we will use string arrays in new functions and new properties in place of character vectors (but will continue returning character vectors in many places for compatibility). We expect character vectors will continue to live on for version-to-version code compatibility and special use cases.
Speaking of compatibility: we care deeply about version-to-version compatibility of MATLAB code, today more than ever. So, we are taking the following steps in our rollout of string arrays:
- Text manipulation functions (both old and new) return the text type they are passed. This means you can opt in to using string with these functions (string use isn't necessary). Note how I used split and join above with either string arrays or cell arrays of character vectors.
- We are recommending string arrays today for text data applications. Here there are ways to opt in to string use. In the example, I opted to get a string array from readtable using the TextType name-value pair. And string arrays were returned from functions like extractBefore because I passed a string array as input.
- We added curly brace indexing to string arrays which returns a character vector for compatibility. Cell arrays return their contents when you index with curly braces {}. Code that uses cell arrays of character vectors usually indexes the array with curly braces to access the character vector. Such code can work with string arrays since curly brace indexing will also return a character vector. See how the following code returns the same result whether f is a cell array or a string:
d = datetime('now');
f = {'h','m','s'}; % use a cell array
for n = 1:3
    d.Format = f{n};
    disp(d)
end
   9
   43
   10
f = ["h","m","s"]; % use a string array
for n = 1:3
    d.Format = f{n};
    disp(d)
end
   9
   43
   10
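The first step in the list, that text manipulation functions return the type they are passed, can be checked with a short sketch. The input text here is made up, and replace stands in for the other text functions.

```matlab
% The same replacement applied to three text types (hypothetical input text)
c = replace('a delay','delay','late');      % char in, char out
s = replace("a delay",'delay','late');      % string in, string out
z = replace({'a delay'},'delay','late');    % cell in, cell out

% Inspect the class of each result
classes = {class(c), class(s), class(z)};
```

Because the output type follows the input type, existing code that passes character vectors or cell arrays keeps getting the types it expects.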
Expect to hear more from me on this topic. And please share your input with us by leaving a comment below. We're interested to hear from you.
We hope string arrays will help you accomplish your goals and that the steps we're taking provide a smooth adoption. If you haven't tried string arrays yet, learn more from our documentation on characters and strings.