Loren on the Art of MATLAB

Working with Text in MATLAB

Posted by Loren Shure

I'd like to introduce today's guest blogger, Dave Bergstein, a MATLAB Product Manager at MathWorks. In today's post, Dave discusses recent updates to text processing with MATLAB.

In today's post I share a text processing example using the new string array and a collection of new text manipulation functions, both introduced in R2016b. I also give recommendations on when best to use string, char, or cell for text and share some of our thinking on the future.

Also be sure to check out Toshi's post Introducing String Arrays and Loren's post Singing the Praises of Strings.

Example: How Late Is My Bus?

My friend in New York City talks about the delays on her bus route. Let's look at some data to see what typical delays are for trains and buses. The Open NY Initiative shares data which includes over 150,000 public transit events spanning 6 years. I downloaded this data as a CSV file from: https://data.ny.gov/Transportation/511-NY-MTA-Events-Beginning-2010/i8wu-pqzv

Import the Data

I read the data into a table using readtable and specify the TextType name-value pair as string to read the text as string arrays.

data = readtable('511_NY_MTA_Events__Beginning_2010.csv','TextType','string');
Warning: Variable names were modified to make them valid MATLAB
identifiers. The original names are saved in the VariableDescriptions
property. 

Here is a list of variables in the table:

data.Properties.VariableNames
ans =
  1×13 cell array
  Columns 1 through 4
    'EventType'    'OrganizationName'    'FacilityName'    'Direction'
  Columns 5 through 9
    'City'    'County'    'State'    'CreateTime'    'CloseTime'
  Columns 10 through 13
    'EventDescription'    'RespondingOrganiz…'    'Latitude'    'Longitude'

data.EventDescription is a string array which contains the event descriptions. Let's take a closer look at the events.

eventsStr = data.EventDescription;

Unlike character vectors or cell arrays of character vectors, each element of the string array is a string itself. See how I can index the string array just as I would a numeric array and get string arrays back.

eventsStr(1:3)
ans = 
  3×1 string array
    "MTA NYC Transit Bus: due to Earlier flooding Q11 into Hamilton Beach normal service resumed"
    "MTA NYC Transit Bus: due to Construction, northbound M1 Bus area of 147th Street:Adam Clayton Powell Junior"
    "MTA NYC Transit Subway: due to Delays, Bronx Bound # 2 & 3 Lines at Nevins Street Station (Brooklyn)"

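To see the indexing difference in isolation, here is a minimal sketch using made-up sample text (the names str and c are illustrative and not part of the transit data). Parentheses on a string array return strings, parentheses on a cell array return cells, and curly braces on the cell array return the character vector stored inside.

```matlab
% Hypothetical sample text, not taken from the transit file
str = ["bus delayed"; "train on time"; "subway delayed"];
c = cellstr(str);        % same text as a cell array of character vectors

a = str(2);              % 1x1 string
b = c(2);                % 1x1 cell that holds a character vector
t = c{2};                % the character vector itself

n = strlength(str);      % number of characters in each element

disp(class(a))           % string
disp(class(b))           % cell
disp(class(t))           % char
```

The double-quoted literals assume R2017a or later. In R2016b the same array can be built with string({'bus delayed'; 'train on time'; 'subway delayed'}).
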
Many of the event descriptions report delays like 'operating 10 minutes late'. See for example how the 26-minute delay is reported in event 5180.

eventsStr(5180)
ans = 
    "MTA Long Island Rail Road: due to Debris on tracks, westbound Montauk Branch between Montauk Station (Suffolk County)  and Jamaica Station (Queens)  The 6:44 AM from Montauk due Jamaica at 9:32 AM, is operating 26 minutes late due to an unauthorized vehicle on the tracks near Hampton Bays."

Identify Delays

I want to find all the events which contain ' late '. MATLAB R2016b also introduced more than a dozen new functions for working with text. These functions work with character vectors, cell arrays of character vectors, and string arrays. You can learn about these functions from the characters and strings page in our documentation.

I convert the text to all lowercase and determine which events contain ' late ' using the contains function.

eventsStr = lower(eventsStr);
idx = contains(eventsStr,' late ');
lateEvents = eventsStr(idx);
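
The same three steps can be tried on a few made-up events without downloading the file. This is only a sketch: the sample text is hypothetical. It also shows one limitation of searching for ' late ' with a space on each side, since an event that ends with the word "late" is not matched.

```matlab
% Hypothetical event text for illustration only
events = ["Train is operating 10 minutes LATE due to signal problems"; ...
          "Normal service resumed"; ...
          "Bus running late"];

events = lower(events);             % make the search case-insensitive
idx = contains(events, ' late ');   % logical column, one entry per event
lateOnly = events(idx);             % keep only the matching events

disp(idx')                          % the third event has no trailing space, so it is missed
```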

Extract the Delay Times

I extract the minutes late from phrases like 'operating 10 minutes late' using the functions extractAfter and extractBefore.

Let's look at the first late event. The exact phrase we are seeking doesn't appear in this event. When we look for the text following 'operating' we get back a missing string.

lateEvents(1)
extractAfter(lateEvents(1),'operating')
ans = 
    "mta long island rail road: due to delays, westbound babylon branch between speonk station (speonk)  and new york penn station (manhattan)  the 5:08 a.m. departure due ny @ 7:02 a.m. is 15 minutes late @ babylon."
ans = 
    <missing>

Let's look at the second late event. This string contains the phrase 'operating 14 minutes late'. Extracting the text after 'operating' we get '14 minutes late due to signal problems'. Extracting the text before 'minutes late' we get back ' 14 ' which we can convert to a numeric value using double.

lateEvents(2)
s = extractAfter(lateEvents(2),'operating')
s = extractBefore(s,'minutes late')
minLate = double(s)
ans = 
    "mta long island rail road: due to delays westbound ronkonkoma branch out of bethpage station (suffolk county) the 8:01 am train due into penn station at 8:47 am is operating 14 minutes late due to signal problems"
s = 
    " 14 minutes late due to signal problems"
s = 
    " 14 "
minLate =
    14

Success! We extracted the train delay from the event description. Now let's put this all together. I extract the minutes late from all the events and drop the missing values using the rmmissing function. I then convert the remaining values to numbers using double and plot a histogram of the results.

s = extractAfter(lateEvents,'operating');
s = extractBefore(s,'minutes late');
s = rmmissing(s);
minLate = double(s);

histogram(minLate,0:5:40)
ylabel('Number of Events')
xlabel('Minutes Late')
title({'Transit Delays','NY Metropolitan Transit Authority'})

It looks like reported delays are often 10-15 minutes. This simple routine captures many of the transit delays, but not all. The pattern doesn't always fit (consider again lateEvents(1)). I also left out any delays that may be reported in hours. Can you improve it?
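
One possible improvement, sketched here under stated assumptions rather than offered as the approach used in this post, is to drop the requirement that the word 'operating' appears and instead look for digits directly followed by "minutes late". The sketch swaps in regexp with a lookahead for extractAfter and extractBefore, and converts to a cell array of character vectors first because regexp behavior on that type is long established. The sample events are hypothetical and modeled on the descriptions above. Delays reported in hours are still ignored.

```matlab
% Hypothetical sample events modeled on the descriptions above
sampleEvents = ["the 5:08 a.m. departure due ny @ 7:02 a.m. is 15 minutes late @ babylon."; ...
                "the 8:01 am train is operating 14 minutes late due to signal problems"; ...
                "normal service resumed"];

% Work on a cell array of character vectors for regexp
c = cellstr(sampleEvents);

% Match digits that are immediately followed by "minute late" or "minutes late"
m = regexp(c, '\d+(?=\s+minutes?\s+late)', 'match', 'once');

minLate = str2double(m);              % NaN where nothing matched
minLate = minLate(~isnan(minLate));   % keep only the delays that were found

disp(minLate')
```

The lookahead keeps the clock times (5:08, 7:02, 8:01) from being mistaken for delays, because none of them is followed by "minutes late".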

Text as Data

String arrays are a great choice for text data like the example above because they are more memory efficient and perform better than cell arrays of character vectors (previously known as cellstr).

Let's compare the memory usage. I convert the string array to a cell array of character vectors with the cellstr command and check the memory with whos. See the Bytes column - it shows the string array uses about 12% less memory.

eventsCell = cellstr(eventsStr);
whos events*
  Name                 Size               Bytes  Class     Attributes

  eventsCell      151225x1             73208886  cell                
  eventsStr       151225x1             64662486  string              

The memory savings can be much greater for many smaller pieces of text. For example, suppose I want to store each word as a separate array element. First I join all 150,000 reports into a single long string using the join function. I then split this long string on spaces using the split function. The result is a string array storing over 4 million words in separate elements. Here the memory savings is nearly 2X.

wordsStr = split(join(eventsStr));
wordsCell = split(join(eventsCell));
whos words*
  Name                 Size                Bytes  Class     Attributes

  wordsCell      4356256x1             535429652  cell                
  wordsStr       4356256x1             284537656  string              
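
To repeat this kind of comparison without the transit file, a minimal sketch on synthetic data follows. The variable names sampleStr and sampleCell and the repeated word are hypothetical. The byte counts depend on the MATLAB release, and whos does not account for data shared between variables, so treat the numbers as a rough guide.

```matlab
% Hypothetical data: 1000 copies of one short word
sampleStr = repmat("late", 1000, 1);
sampleCell = cellstr(sampleStr);      % same text as a cell array of character vectors

% whos returns a struct when called with an output argument
infoStr = whos('sampleStr');
infoCell = whos('sampleCell');

fprintf('string array: %d bytes\n', infoStr.bytes);
fprintf('cell array:   %d bytes\n', infoCell.bytes);
```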

String arrays also perform better. You can achieve the best performance using string arrays in combination with the text manipulation functions introduced in R2016b. Here I compare the performance of replace on a string array with that of strrep on a cell array of character vectors. See how replace with a string array is about 4X faster than strrep with a cell array.

f1 = @() replace(eventsStr,'delay','late');
f2 = @() strrep(eventsCell,'delay','late');
timeit(f1)
timeit(f2)
ans =
     0.062507
ans =
      0.23239

Recommendations on Text Type

So, should you use string arrays for all your text? Maybe not yet. MATLAB has three different ways to store text:

  • character vectors (char)
  • string arrays (string)
  • cell arrays of character vectors (cell)

For now (as of R2017a), we encourage you to use string arrays to store text data such as the transit events. We don’t recommend using string arrays elsewhere yet since string arrays aren’t yet accepted everywhere in MATLAB. Notice how I used a character vector for specifying the filename in readtable and a cell array of character vectors for the figure title.
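
When a function does not accept a string array, the text can be converted at the call site. The sketch below shows the conversions in both directions. The variable names are illustrative only.

```matlab
fmt = ["h", "m", "s"];          % string array

asCell = cellstr(fmt);          % 1x3 cell array of character vectors
asChar = char(fmt(1));          % character vector 'h'
backToStr = string(asCell);     % 1x3 string array again

disp(iscellstr(asCell))         % 1
disp(ischar(asChar))            % 1
disp(isequal(backToStr, fmt))   % 1
```

Keeping text as a string array for processing and converting only where an older interface requires it is one way to follow the recommendation above.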

Looking to the Future

What about in the future? We feel string arrays provide a better experience than character vectors and cell arrays of character vectors. Our plan is to roll out broader use of string arrays over time.

In the next few releases we will update more MATLAB functions and properties to accept string arrays in addition to character vectors and cell arrays of character vectors. As we do so, it will become easier for you to use string arrays in more places.

Next we will replace cell arrays of character vectors in MATLAB with string arrays. Note that cell arrays themselves aren't going anywhere. They are an important MATLAB container type and good for storing mixed data types or arrays of jagged size among other uses. But we expect their use for text data will diminish and become largely replaced by string arrays which are more memory efficient and perform better for pure text data.

Beyond that, over time, we will use string arrays in new functions and new properties in place of character vectors (but will continue returning character vectors in many places for compatibility). We expect character vectors will continue to live on for version-to-version code compatibility and special use cases.

Speaking of compatibility: we care deeply about version-to-version compatibility of MATLAB code, today more than ever. So, we are taking the following steps in our rollout of string arrays:

  1. Text manipulation functions (both old and new) return the text type they are passed. This means you can opt in to using string with these functions (string use isn't necessary). Note how I used split and join above with either string arrays or cell arrays of character vectors.
  2. We are recommending string arrays today for text data applications. Here there are ways to opt in to string use. In the example, I opted to get a string array from readtable using the TextType name-value pair. And string arrays were returned from functions like extractBefore because I passed a string array as input.
  3. We added curly brace indexing to string arrays which returns a character vector for compatibility. Cell arrays return their contents when you index with curly braces {}. Code that uses cell arrays of character vectors usually indexes the array with curly braces to access the character vector. Such code can work with string arrays since curly brace indexing will also return a character vector. See how the following code returns the same result whether f is a cell array or a string:
d = datetime('now');
f = {'h','m','s'};   % use a cell array
for n = 1:3,
    d.Format = f{n};
    disp(d)
end
   9
   43
   10
f = ["h","m","s"];   % use a string array
for n = 1:3,
    d.Format = f{n};
    disp(d)
end
   9
   43
   10
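
The first step in the list, that text functions return the type they are given, can be checked with a short sketch. The sample text is made up, and the double-quoted literal assumes R2017a or later.

```matlab
outChar = replace('late bus', 'late', 'delayed');     % char in, char out
outCell = replace({'late bus'}, 'late', 'delayed');   % cell in, cell out
outStr  = replace("late bus", 'late', 'delayed');     % string in, string out

disp(class(outChar))   % char
disp(class(outCell))   % cell
disp(class(outStr))    % string

% Curly braces pull a character vector out of either container
fromCell = outCell{1};
fromStr = outStr{1};
disp(isequal(fromCell, fromStr))   % 1
```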

Expect to hear more from me on this topic. And please share your input with us by leaving a comment below. We're interested to hear from you.

We hope string arrays will help you accomplish your goals and that the steps we're taking provide a smooth adoption. If you haven't tried string arrays yet, learn more from our documentation on characters and strings.


Published with MATLAB® R2017a

6 Comments

James Tursa replied on : 1 of 6

How is “string” data physically stored in the variable? E.g., suppose we have several lines of characters (same length) stored in a cell array and stored as a char array.

mycell(1:10,1) = {'This is a string'};
mychar = cell2mat(mycell);

In mychar, there is only one mxArray (variable header structure) in the background and all of the char data is physically stored in memory with no data sharing. In mycell, there is the mxArray for the cell array itself and each element has its own mxArray and char data as well. These element mxArray variables may or may not be shared so they may or may not be taking up additional memory (in the above example they are shared by design). Certain operations on the variable can cause more sharing to take place, or completely undo the current sharing. So getting an actual memory usage comparison is complicated and cannot always be done reliably with the whos command because data sharing is not accounted for with that command.

So, how does the character data stored in string variables compare to this? Why is it more “memory efficient” and “faster” than cell arrays of strings? And is there potential data sharing going on in the background that the whos command will not reveal? If there is potential data sharing, will it go away with “save” and “load” commands like it does for cell arrays of strings? Etc, etc.

These types of questions may seem to get at the implementation details too much, but for the programmer working with large data sets it can become of prime importance to know the answers in order to write code that is fast and resource efficient, and to know why some operations might suddenly run slow or blow up the memory.

David Barry replied on : 2 of 6

I’m going to miss using regular expressions… said no one ever! Great work with the string class.

Jotaf replied on : 3 of 6

Nice addition! What makes cell strings awkward for these tasks is that a cell array is a heterogeneous container, and it’s easier to vectorize code with arrays (not just for performance but for concise and expressive code). It’s good that many str* functions support cell strings but often you have to break out the good old cellfun to get the same effect, and it’s just not as readable.

Dave Bergstein replied on : 4 of 6

@David and @Jotaf – Thanks for your comments!

@James – Thanks for your comment. As you suspect, it’s a complicated topic. Sharing is an important principle of our implementation and we employ it widely. In our internal tests, string arrays use significantly less memory than cellstrs. This is in large part due to the overhead savings on each element in the cell array. Note I compare cellstr and string arrays since both store many pieces of text. The savings is different for a single char, which is already fairly efficient for storing a single piece of text. As for speed, we invested effort to optimize text operations for all three text types. I didn’t mention it in the post, but cellstr operations are generally faster in recent releases than they had been. That said, we spent more effort optimizing string operations and string use in the new text manipulation functions. I know this doesn’t fully answer your question. But I appreciate your interest in this topic. Perhaps we can devote another blog post here or elsewhere to achieving top performance.

Vincent Scalfani replied on : 6 of 6

Thanks for the great post and work with strings. I am excited for the new text capabilities within MATLAB. I have been using regular expressions to extract data from text files. One super useful feature for me is the extractBetween function. It makes it so easy to parse and extract data from text files. I almost feel like it is cheating and I should stop and go back to using regexp!!!! I have some more testing to do, but right now it feels like extractBetween is much faster than using regexp.
