Loren on the Art of MATLAB

April 5th, 2006

MATLAB, Strings, and Regular Expressions

I'm pleased to introduce Jason Breslau, our guest blogger this week, who gives us his take on MATLAB, strings, and regular expressions.

When you think about text processing, you probably think about Perl, as well you should. Perl is the de facto standard in text processing: it was created for the task and is fine-tuned to make it as easy as possible. To continue with my presupposition of your train of thought, when you think of Perl, you most likely think of regular expressions. And when you think of regular expressions, you probably think, "Yuck!"

For those of you unfamiliar with regular expressions, they provide a mechanism to describe patterns in text, for matching or replacing. They are generally considered very useful, yet ugly, and difficult to understand and use. Perl was created for processing text, and has regular expressions deeply ingrained into its language. Its most basic operators are match and substitute, both of which have regular expressions built into them. Virtually every Perl program uses regular expressions, and it is likely that thoughts of regular expressions lead to Perl, as much as the other way around.

One of MATLAB's hidden strengths is its ability to handle text processing. MATLAB supports all of the requisite file I/O functions and provides a wide selection of string functions, but most importantly, MATLAB has built-in regular expressions.

Text processing plays right into MATLAB's forte: matrices. In Perl, strings are an atomic data type, which earns them special care from the language. In MATLAB, strings are one-dimensional matrices of type char, and MATLAB can treat them as it does other matrices. This is useful if you want to perform math on a string. That might sound preposterous if you are thinking about taking the eigenvalues of your text file; however, useful math can be performed on text, for example in cryptography.
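To make the "math on a string" idea concrete, here is a minimal sketch of a Caesar shift, a toy cipher chosen purely for illustration (it is not from the scrambling script discussed next). The variable names and the shift of 3 are my own assumptions.

```matlab
% A string is a row vector of character codes, so arithmetic applies directly.
msg   = 'HELLO WORLD';
codes = double(msg);                    % numeric character codes, e.g. 'H' is 72

% Toy Caesar cipher: shift the letters A-Z by 3 places, wrapping around.
shift    = 3;                           % illustrative value
isLetter = msg >= 'A' & msg <= 'Z';     % logical mask, leaves the space alone
encoded  = msg;
encoded(isLetter) = char(mod(codes(isLetter) - 'A' + shift, 26) + 'A');

% Undo the shift to recover the original text.
decoded = encoded;
decoded(isLetter) = char(mod(double(encoded(isLetter)) - 'A' - shift, 26) + 'A');

disp(encoded)   % KHOOR ZRUOG
disp(decoded)   % HELLO WORLD
```

The comparison, subtraction, and mod calls are the same ones you would use on any numeric vector; only the final char call turns the result back into text.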

For example, you may have seen an article that has circulated on the internet that speculates that people can easily read text with each word scrambled, as long as the first and last letters of each word are preserved. Jamie Zawinski has written a Perl script to perform this scrambling on user input. Here is that script stripped to its core code:

while (<>) {
    foreach (split (/(\w+)/)) {
        if (m/\w/) {
            my @w = split (//);
            my $A = shift @w;
            my $Z = pop @w;
            print $A;
            if (defined ($Z)) {
                my $i = $#w+1;
                while ($i--) {
                    my $j = int rand ($i+1);
                    @w[$i,$j] = @w[$j,$i];
                }
                foreach (@w) {
                    print $_;
                }
                print $Z;
            }
        } else {
            print "$_";
        }
    }
}

In MATLAB it is possible to do this easily, using dynamic regular expressions, a new feature of regexp in Release 2006a.

while true
    line = input('', 's');
    line = regexprep(line, '(?<=\w)\w{2,}(?=\w)', '${$0(randperm(length($0)))}');
    disp(line);
end

Note that the regular expression is evaluating MATLAB code before replacing the text. This makes it possible to call MATLAB's randperm function.
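If the scrambling one-liner is hard to read, a smaller sketch of the same mechanism may help. It assumes only what the example above already uses: a ${...} replacement is evaluated as MATLAB code, and $0 stands for the matched text. The input string is made up for illustration.

```matlab
% ${...} in the replacement is evaluated as MATLAB code; $0 is the whole match.
shouted = regexprep('hello world', '\w+', '${upper($0)}');
disp(shouted)    % HELLO WORLD

% Any function that returns a string can be used, e.g. reverse each word.
reversed = regexprep('hello world', '\w+', '${fliplr($0)}');
disp(reversed)   % olleh dlrow
```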

With MATLAB's support of string functions, notably regexp, it is easier to use, and consequently more effective, than Perl for text processing.

Advantages to MATLAB regexp

There are several differences betwixt the Perl and MATLAB implementations of regular expressions. Most of these differences are nuances in the languages, such as default options and syntax for word breaks. Some advantageous aspects of regular expressions in MATLAB over Perl are named tokens and case preservation.

Named Tokens

I have found that once someone starts using regular expressions, a compelling feature they discover is tokens, also known as capture groups. Tokens are created by top level parenthetical subexpressions in a regular expression. These tokens are then available as outputs to the regexp function, as backreferences later in the same pattern, and as arguments that modify replacement text in regexprep. If you like using tokens in your regular expressions, you will love using named tokens.
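As a quick sketch of those three uses of tokens (the sample strings are my own, chosen for illustration):

```matlab
% 1. Tokens as outputs of regexp ('once' returns the tokens of the first match).
tok = regexp('11/26/1977', '(\d+)/(\d+)/(\d+)', 'tokens', 'once');
% tok is the cell array {'11', '26', '1977'}

% 2. A token as a backreference later in the same pattern:
%    \1 must repeat whatever the first token captured, so this finds doubled letters.
doubled = regexp('bookkeeper', '(\w)\1', 'match');
% doubled is {'oo', 'kk', 'ee'}

% 3. Tokens in the replacement text of regexprep: $1, $2, $3 are the captured pieces.
european = regexprep('11/26/1977', '(\d+)/(\d+)/(\d+)', '$2.$1.$3');
disp(european)   % 26.11.1977
```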

The named tokens feature allows you to specify names for the parenthetical subexpressions that capture tokens. Then the tokens may be referred to by name, as opposed to by number. This makes expressions clearer, as well as less prone to bugs. This is because appending two expressions will change the indices of the tokens. For example, write an expression to extract the month, day and year from a date without using named tokens:

>> date = regexp('11/26/1977', '(\d+)/(\d+)/(\d+)', 'tokens');
>> date{:}

ans = 

    '11'    '26'    '1977'

If the above code were in a function, day would be referred to as date{2}. Converting this example to use named tokens:

>> date = regexp('11/26/1977', '(?<month>\d+)/(?<day>\d+)/(?<year>\d+)', 'names')

date = 

    month: '11'
      day: '26'
     year: '1977'

Now day may be referred to as date.day. What if the pattern also needs to match dates in the European format of dd.mm.yyyy? The first pattern could be written as:

>> date = regexp('11/26/1977', '(\d+)/(\d+)/(\d+)|(\d+)\.(\d+)\.(\d+)', 'tokens');
>> date{:}

ans = 

    '11'    '26'    '1977'

>> date = regexp('26.11.1977', '(\d+)/(\d+)/(\d+)|(\d+)\.(\d+)\.(\d+)', 'tokens');
>> date{:}

ans = 

    '26'    '11'    '1977'

But notice that the two results are indistinguishable: nothing in the output tells you whether the first token is the month or the day. The same example can be fixed using named tokens:

>> date = regexp('11/26/1977', ...
'(?<month>\d+)/(?<day>\d+)/(?<year>\d+)|(?<day>\d+)\.(?<month>\d+)\.(?<year>\d+)',...
'names')

date = 

    month: '11'
      day: '26'
     year: '1977'


>> date = regexp('26.11.1977', ...
'(?<month>\d+)/(?<day>\d+)/(?<year>\d+)|(?<day>\d+)\.(?<month>\d+)\.(?<year>\d+)',...
'names')

date = 

    month: '11'
      day: '26'
     year: '1977'

As you can see, named tokens clarify the expression, results and surrounding code.
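To show the effect on surrounding code, here is a small sketch that feeds the named fields into a date calculation. The use of str2double, datenum, and datestr is my own choice for illustration, not something the examples above depend on.

```matlab
% Parse the date with named tokens; each name becomes a field of the result.
date = regexp('11/26/1977', '(?<month>\d+)/(?<day>\d+)/(?<year>\d+)', 'names');

% Downstream code reads the fields by name, so adding or reordering
% groups in the pattern does not silently change which piece is which.
y = str2double(date.year);
m = str2double(date.month);
d = str2double(date.day);

serial = datenum(y, m, d);             % serial date number
disp(datestr(serial, 'yyyy-mm-dd'))    % 1977-11-26
```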

Case Preservation

Case insensitivity is such a fundamental aspect of pattern matching that MATLAB provides it as a separate built-in function, regexpi. Many programs offer this option in their search-and-replace functions. Unfortunately, although they match case-insensitively, they do not address the need for the replacement to adapt to the case of the text it replaces. As an example, a word capitalized at the beginning of a sentence should remain capitalized after a replacement. MATLAB recognizes this need. In addition to providing options that will ignore the case and match the case, MATLAB also supplies the option to preserve the case. To support this, regexprep has the following options to handle case: matchcase, ignorecase and preservecase. The differences are shown in these examples:

>> regexprep('The Car, The car, THE CAR, the car, THE car', 'THE car', 'a BOAT', 'matchcase')

ans =

The Car, The car, THE CAR, the car, a BOAT

>> regexprep('The Car, The car, THE CAR, the car, THE car', 'THE car', 'a BOAT', 'ignorecase')

ans =

a BOAT, a BOAT, a BOAT, a BOAT, a BOAT

>> regexprep('The Car, The car, THE CAR, the car, THE car', 'THE car', 'a BOAT', 'preservecase')

ans =

A Boat, A boat, A BOAT, a boat, a BOAT

Note how the preservecase option does what you would most likely want it to.

MATLAB for Regular Expressions

The MATLAB regular expression functions are fully featured and refined text processing tools. Along with the other support that MATLAB provides for text manipulation, the suite within MATLAB is the easiest and cleanest way to write string processing code. When you next think of text processing, think of MATLAB, and maybe you won't have to think, "Yuck!"

  • Do you use MATLAB yet defer to Perl for text processing?
  • Is there functionality that you use in Perl for text processing that is absent from MATLAB?
Tell me about it here.

 

32 Responses to “MATLAB, Strings, and Regular Expressions”

  1. Andrej Mosat replied on :

    It would be great to read a note on this page saying: “Don`t try this at home with Release

  2. Brad Phelan replied on :

    I’m not a big perl user but Ruby has something similar to dynamic regular expressions.

    puts "Hello Mathworks, Matlab Is Amazing".gsub(/\w(\w+)\w/){ |m|
    m.gsub($1, $1.shuffle)
    }

    gives

    Hlelo Mhawrtoks, Matalb Is Amanizg

    The point is that Ruby allows you to pass a block to compute the substitution. The block can be arbitrarily complex. If I choose to inline the shuffle code I could nest my regular expression substitutions to achieve this effect.

    puts "Hello Mathworks, Matlab Is Amazing".gsub(/\w(\w+)\w/)do |m|
    m.gsub!($1) do |x|
    1.upto(x.size) do |n|
    nn = rand(n)
    y = x[nn]
    x[nn]=""
    x

  3. Brad Phelan replied on :

    Try again to post my code snippet to get the indenting right.

    puts "Hello Mathworks, Matlab Is Amazing".gsub(/\w(\w+)\w/)do |m|
    m.gsub!($1) do |x|
    1.upto(x.size) do |n|
    nn = rand(n)
    y = x[nn]
    x[nn]=""
    x

  4. Jason Breslau replied on :

    Dynamic regular expressions are a new feature in MATLAB with R2006a, sorry about any confusion with prior releases.

  5. SRL replied on :

    The blog states explicitly:

    In MATLAB it is possible to do this easily, using dynamic regular expressions, a new feature of regexp in Release 2006a.

  6. Loren replied on :

    That’s because I updated the blog, SRL, in order to help others.

  7. Michael replied on :

    Interesting article.. is it possible to do large scale text mining using Matlab? Would be interesting if one could write a spider to a certain number of web pages and have it count /analyze words and patterns.

  8. sy replied on :

    I use regular expressions in Matlab all the time. But most often, I apply them to text files. The way I have to do this is to read the entire file into a string and do my processing on that string. This becomes painful for large files, as I have to read in pieces of the file at a time. Maybe there is a better way of doing this in 2006a (I am still using R14 sp1). If there is not, it would be useful in the future to have a regular expression function that could accept a file name instead of a string. I think that would save me a lot.

    SY

  9. Loren replied on :

    SY-

    You might take a look at memmapfile, depending on the format of your file.

  10. Kelly replied on :

    I’m in the “use MATLAB yet defer to Perl for text processing” camp. Or rather, as sy states, for text file processing. Matlab’s regular expression functions are certainly useful in certain situations, but when parsing large files, Perl still comes out ahead. I work often with an acoustic model. All my wrapper scripts for this model are in Matlab except those that parse the output file, a text file consisting of various tables (with varying numbers of columns) interspersed with text, in no particular order. Perl’s ability to quickly read and parse a file line by line, along with its extremely useful split operator, makes it a better tool for the job, in my opinion.

  11. Nasser replied on :

    Do Matlab regular expressions support Unicode, in terms of manipulating other languages such as Arabic?

  12. Jason Breslau replied on :

    MATLAB Regular Expressions fully support Unicode.

  13. Shivaram replied on :

    I have used Matlab for converting HTML files into matrices. I had not used regular expressions for the processing. The information given above will be very helpful for my future work. Thank you.

  14. Ranjib Dey replied on :

    Is there any Matlab function like ’split’ in Perl, with which I can get an array out of a string?

  15. Loren replied on :

    Ranjib-

    In addition to regexp, have a look at strtok to see if that helps you out.

    –Loren

  16. Jason Breslau replied on :

    There is a new split option for regexp in the R2007b release.

    See the documentation here.

  17. Gohar replied on :

    Thanks Loren,
    for such a useful article. I have tried to use Matlab for regex instead of Perl, but there is one problem: because of the nature of my project, I need to access (HTML) files from a remote location or web site for parsing.

    I have no idea how to grab web page text (HTML) data in Matlab for string manipulation. How can I do that?

  18. Loren replied on :

    Gohar-

    Look in the documentation, for example here, to see information on functions such as urlread.

    –Loren

  19. Gohar replied on :

    Is there any easy method to calculate code running time in milliseconds?

  20. Loren replied on :

    Gohar-

    Please contact technical support. This question is off-topic.

    Thank you.
    –Loren

  21. Dmitry Markman replied on :

    Gohar-
    check out
    tic / toc functions

  22. Rob replied on :

    Splitting is not really the problem.

    But how do you automatically get a numeric array after parsing lines like:

    123 | 456 | 6.4|asdd
    124 | 457 | 6.8|asdd

    result should be something like:

    [123 456 6.4 NA; 124 457 6.8 NA; ]

    Rob.

  23. Jason Breslau replied on :

    Hi Rob,

    My initial thought was that you should use textscan for your problem. Unfortunately, textscan will stop processing if it encounters data that it can not convert, as opposed to the NaNs that you want.

    You can get your result using a couple of splits:

    Create some sample data:

    >> str = sprintf('123 | 456 | 6.4|asdd\n124 | 457 | 6.8|asdd\n125 | 458 | 7.0|qwerty')
    str =
    123 | 456 | 6.4|asdd
    124 | 457 | 6.8|asdd
    125 | 458 | 7.0|qwerty

    Split the data on newlines to find the rows:

    >> rows = regexp(str, '\n', 'split')
    rows =
    [1×20 char] [1×20 char] [1×22 char]

    Now split the rows into individual cells:

    >> cells = regexp(rows, '\s*\|\s*', 'split')
    cells =
    {1×4 cell} {1×4 cell} {1×4 cell}

    Since cells is now a 1×3 cell of 1×4 cells, vertically concatenate them to recreate the original shape of the data:

    >> cells = vertcat(cells{:})
    cells =
    '123' '456' '6.4' 'asdd'
    '124' '457' '6.8' 'asdd'
    '125' '458' '7.0' 'qwerty'

    Now str2double will do the rest of the work:

    >> str2double(cells)
    ans =
    123.0000 456.0000 6.4000 NaN
    124.0000 457.0000 6.8000 NaN
    125.0000 458.0000 7.0000 NaN

    -=>J

  24. per isakson replied on :

    Is the flavour of regular expressions implemented in Matlab identical (or close) to a flavour that one can find in a book like “Mastering Regular Expressions” or supported by a tool like RegexBuddy, http://www.regexbuddy.com/?

  25. Jason Breslau replied on :

    MATLAB’s regular expressions are very similar to what you will find described in “Mastering Regular Expressions”. RegexBuddy supports many flavors, the most similar to MATLAB being Perl and Java. I highly recommend the “Mastering Regular Expressions” book. The book mentions what features and syntaxes are flavor specific, and often presents tables of the various languages and their support of a specific feature. MATLAB’s syntax will generally be one of those. The details of MATLAB’s regular expressions are found here: http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/matlab_prog/f0-42649.html

    -=>J

  26. David replied on :

    strread

  27. JB replied on :

    Please enlighten me as to how to elegantly perform the following:

    I have an ascii text file (notepad) with lines of values with the following format.

    first= 320 second= 234566 third= -0.24 …
    first= 788 second= 256667 third= -0.24 …
    and so on.

    I anticipated using code like the following to create a matrix of the numeric values in my text file that match my patterns.

    fid = fopen('accuSUMok.txt');
    i = 1;
    while 1
    tline = fgetl(fid);
    pat = ' first=\s+\d+';
    a = regexp(tline,pat,'match');
    disp(a);
    pat = '\d+';
    c = regexp(a,pat,'match');
    disp(c);
    pat = ' third=\s+\-0\.\d+';
    b = regexp(tline,pat,'match');
    disp(b);
    pat ='\-0\.\d+';
    d = regexp(b,pat,'match');
    disp(d);
    j = [ a c b d]
    %printf('%s %s', a, b, c, d);
    if ~ischar(tline), break, end
    end
    fclose(fid);

    But what I see is many lines of the following, with an error at the end of the file:

    ' first= 325'

    {1×1 cell}

    ' third= -0.36'

    {1×1 cell}

    ??? Undefined function or method 'regexp' for input arguments of type 'double'.

    Error in ==> reg at 7
    a = regexp(tline,pat,'match');

    What can I do to elegantly put the numeric values into an array?
    tia

  28. JB replied on :

    simplifying:
    fid = fopen('ascii.txt');
    while 1
    tline = fgetl(fid);
    pat = ' first=\s+(\d+)';
    a = regexp(tline,pat,'match');
    a(:)
    pat = '\d+';
    c = regexp(a,pat,'match');
    c(:)
    if ~ischar(tline), break, end
    end
    fclose(fid);
    ~~~~~~~~~~~~~~
    produces
    ans =
    ' first= 325'
    ans =
    {1×1 cell}
    ans =
    ' first= 325'
    ans =
    {1×1 cell}
    ??? Undefined function or method 'regexp' for input arguments of type 'double'.
    Error in ==> reg at 7
    a = regexp(tline,pat,'match');

    What’s the appropriate, elegant way to get at the numeric value?

  29. Loren replied on :

    Here’s the corrected link for tokens.

    –Loren

  30. Jason Breslau replied on :

    Hi JB,

    You need to modify your end-of-file check to be at the beginning of your while loop, not at the end.

    The error you are getting is because you are passing -1 (end-of-file) to regexp.

    -=>J
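
    A minimal sketch of that change, for readers following along. It is not JB's or Jason's actual code: the temporary file, the two sample lines (modelled on the format in the question, without the elided trailing fields), the combined pattern, and the variable names are all assumptions made for illustration. It also uses tokens, rather than a second regexp call, to pull out the numbers.

```matlab
% Write two sample lines to a temporary file so the sketch is self-contained.
fname = [tempname '.txt'];
fid = fopen(fname, 'w');
fprintf(fid, ' first= 320 second= 234566 third= -0.24\n');
fprintf(fid, ' first= 788 second= 256667 third= -0.24\n');
fclose(fid);

fid = fopen(fname);
values = zeros(0, 2);                   % one row per line: [first third]
tline = fgetl(fid);
while ischar(tline)                     % fgetl returns -1 (not a char) at end of file
    % Capture the two numbers as tokens in a single pass over the line.
    tok = regexp(tline, 'first=\s+(\d+).*third=\s+(-?\d*\.?\d+)', 'tokens', 'once');
    if ~isempty(tok)
        values(end+1, :) = str2double(tok);   % cell of strings -> row of doubles
    end
    tline = fgetl(fid);                 % read the next line before re-testing
end
fclose(fid);
delete(fname);

disp(values)
```

    Because the ischar test now runs before regexp is ever called on the line, the -1 returned at end of file never reaches regexp.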

  31. Alexander Panchenko replied on :

    Could you please give a pointer to the documentation of the regular expression language supported in Matlab, where one can see to what extent it supports the Perl dialect?

  32. Loren replied on :

    Alexander-

    The documentation covers what MATLAB supports.

    –Loren



Loren Shure works on design of the MATLAB language at The MathWorks. She writes here about once a week on MATLAB programming and related topics.


These postings are the author's and don't necessarily represent the opinions of The MathWorks.