Loren on the Art of MATLAB

MATLAB, Strings, and Regular Expressions

Posted by Loren Shure

I'm pleased to introduce Jason Breslau, our guest blogger this week, who gives us his take on MATLAB, strings, and regular expressions.

When you think about text processing, you probably think about Perl, as well you should. Perl is the de facto standard in text processing: it was created for the task and is fine-tuned to make it as easy as possible. To continue with my presupposition of your train of thought, when you think of Perl, you most likely think of regular expressions. And when you think of regular expressions, you probably think, "Yuck!"

For those of you unfamiliar with regular expressions, they provide a mechanism to describe patterns in text, for matching or replacing. They are generally considered very useful, yet ugly, and difficult to understand and use. Perl was created for processing text, and has regular expressions deeply ingrained into its language. Its most basic operators are match and substitute, both of which have regular expressions built into them. Virtually every Perl program uses regular expressions, and it is likely that thoughts of regular expressions lead to Perl, as much as the other way around.

One of MATLAB's hidden strengths is its ability to handle text processing. MATLAB supports all of the requisite file I/O functions and provides a wide selection of string functions, but most importantly, MATLAB has built-in regular expressions.

Text processing plays right into MATLAB's forte: matrices. In Perl, strings are an atomic data type, which earns them special care from the language. In MATLAB, strings are one-dimensional matrices of type char, and MATLAB can treat them as it does other matrices. This is useful if you want to perform math on a string. That might sound preposterous if you are thinking about taking the eigenvalues of your text file; however, useful math can be performed on text, for example in cryptography.
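As a small illustration of arithmetic on text, here is a minimal sketch of a ROT13 cipher written with ordinary matrix operations. The variable names are invented for this example, and the cipher is only a toy, not serious cryptography.

```matlab
% A char row vector is a 1-by-N matrix, so logical indexing and
% arithmetic work on it just as they do on numbers.
msg = 'HELLO WORLD';

% Logical mask selecting the uppercase letters (the space is left alone).
isLetter = msg >= 'A' & msg <= 'Z';

% Shift each letter 13 places, wrapping around the 26-letter alphabet.
enc = msg;
enc(isLetter) = char(mod(msg(isLetter) - 'A' + 13, 26) + 'A');

disp(enc)   % URYYB JBEYQ
```

Subtracting 'A' turns the letters into the numbers 0 through 25, which is what makes the modular arithmetic possible.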

For example, you may have seen an article that has circulated on the internet that speculates that people can easily read text with each word scrambled, as long as the first and last letters of each word are preserved. Jamie Zawinski has written a Perl script to perform this scrambling on user input. Here is that script stripped to its core code:

while (<>) {
    foreach (split (/(\w+)/)) {
        if (m/\w/) {
            my @w = split (//);
            my $A = shift @w;
            my $Z = pop @w;
            print $A;
            if (defined ($Z)) {
                my $i = $#w+1;
                while ($i--) {
                    my $j = int rand ($i+1);
                    @w[$i,$j] = @w[$j,$i];
                }
                foreach (@w) {
                    print $_;
                }
                print $Z;
            }
        } else {
            print "$_";
        }
    }
}

In MATLAB it is possible to do this easily, using dynamic regular expressions, a new feature of regexp in Release 2006a.

while true
    line = input('', 's');
    line = regexprep(line, '(?<=\w)\w{2,}(?=\w)', '${$0(randperm(length($0)))}');
    disp(line);
end

Note that the regular expression is evaluating MATLAB code before replacing the text. This makes it possible to call MATLAB's randperm function.
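To see the mechanism in isolation, here is a hedged sketch that uses the same ${...} replacement syntax with a deterministic function, followed by a non-interactive variant of the scrambler. The sample strings and variable names are assumptions made for this illustration.

```matlab
% Inside ${...}, $0 stands for the matched text, and the result of the
% MATLAB expression becomes the replacement. Here each word is reversed.
reversed = regexprep('hello world', '\w+', '${fliplr($0)}');
disp(reversed)   % olleh dlrow

% The scrambler applied to a fixed string instead of user input.
% The lookbehind and lookahead keep the first and last letter of each
% word out of the match, so only the interior letters are permuted.
line = 'Regular expressions are powerful';
scrambled = regexprep(line, '(?<=\w)\w{2,}(?=\w)', ...
    '${$0(randperm(length($0)))}');
disp(scrambled)
```

Because randperm is random, the second result differs from run to run; only the first and last letters of each word are guaranteed to stay put.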

With MATLAB's support for string functions, notably regexp, MATLAB is easier to use, and consequently more effective, than Perl for text processing.

Advantages to MATLAB regexp

There are several differences betwixt the Perl and MATLAB implementations of regular expressions. Most of these differences are nuances in the languages, such as default options and syntax for word breaks. Some advantageous aspects of regular expressions in MATLAB over Perl are named tokens and case preservation.

Named Tokens

I have found that once someone starts using regular expressions, a compelling feature they discover is tokens, also known as capture groups. Tokens are created by top level parenthetical subexpressions in a regular expression. These tokens are then available as outputs to the regexp function, as backreferences later in the same pattern, and as arguments that modify replacement text in regexprep. If you like using tokens in your regular expressions, you will love using named tokens.
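The three uses of tokens mentioned above can be sketched as follows. The sample strings and variable names are assumptions chosen for illustration.

```matlab
% 1) Tokens as outputs of regexp: one cell per match, one string per token.
tok = regexp('11/26/1977', '(\d+)/(\d+)/(\d+)', 'tokens');
month = tok{1}{1};   % '11'

% 2) A backreference (\1) reuses the first token later in the same
%    pattern, here to find doubled letters.
doubled = regexp('bookkeeper', '(\w)\1', 'match');   % {'oo','kk','ee'}

% 3) Tokens in the replacement text of regexprep ($1, $2, ...),
%    here to reorder mm/dd/yyyy into dd.mm.yyyy.
european = regexprep('11/26/1977', '(\d+)/(\d+)/(\d+)', '$2.$1.$3');
disp(european)   % 26.11.1977
```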

The named tokens feature allows you to specify names for the parenthetical subexpressions that capture tokens. Then the tokens may be referred to by name, as opposed to by number. This makes expressions clearer and less prone to bugs, because combining two expressions changes the numeric indices of the tokens but not their names. For example, write an expression to extract the month, day and year from a date without using named tokens:

>> date = regexp('11/26/1977', '(\d+)/(\d+)/(\d+)', 'tokens');
>> date{:}

ans = 

    '11'    '26'    '1977'

If the above code were in a function, day would be referred to as date{1}{2}. Converting this example to use named tokens:

>> date = regexp('11/26/1977', '(?<month>\d+)/(?<day>\d+)/(?<year>\d+)', 'names')

date = 

    month: '11'
      day: '26'
     year: '1977'

Now day may be referred to as date.day. What if the pattern also needs to match dates in the European format of dd.mm.yyyy? The first pattern could be written as follows (note that the dot must be escaped as \. to match a literal period):

>> date = regexp('11/26/1977', '(\d+)/(\d+)/(\d+)|(\d+)\.(\d+)\.(\d+)', 'tokens');
>> date{:}

ans = 

    '11'    '26'    '1977'

>> date = regexp('26.11.1977', '(\d+)/(\d+)/(\d+)|(\d+)\.(\d+)\.(\d+)', 'tokens');
>> date{:}

ans = 

    '26'    '11'    '1977'

But notice that the two results are indistinguishable: nothing in the output says which token is the day and which is the month. The same example can be fixed using named tokens:

>> date = regexp('11/26/1977', ...
'(?<month>\d+)/(?<day>\d+)/(?<year>\d+)|(?<day>\d+)\.(?<month>\d+)\.(?<year>\d+)',...
'names')

date = 

    month: '11'
      day: '26'
     year: '1977'


>> date = regexp('26.11.1977', ...
'(?<month>\d+)/(?<day>\d+)/(?<year>\d+)|(?<day>\d+)\.(?<month>\d+)\.(?<year>\d+)',...
'names')

date = 

    month: '11'
      day: '26'
     year: '1977'

As you can see, named tokens clarify the expression, results and surrounding code.
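The claim that names survive when expressions are combined can be sketched like this. The weekday prefix and the variable names are hypothetical additions made for this illustration.

```matlab
pattern = '(?<month>\d+)/(?<day>\d+)/(?<year>\d+)';
d = regexp('11/26/1977', pattern, 'names');

% The fields are strings; convert them before doing arithmetic.
iso = sprintf('%04d-%02d-%02d', ...
    str2double(d.year), str2double(d.month), str2double(d.day));
disp(iso)   % 1977-11-26

% Prepending another capture group shifts every numeric token index
% by one, but code that reads d2.day does not need to change.
pattern2 = ['(?<weekday>\w+),\s*' pattern];
d2 = regexp('Sat, 11/26/1977', pattern2, 'names');
disp(d2.day)   % 26
```

With numbered tokens, the second pattern would have moved the day from position 2 to position 3.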

Case Preservation

Case insensitivity is such a fundamental aspect of pattern matching that MATLAB emphasizes it as a separate built-in function, regexpi. Many programs use this option in search and replace functions. Unfortunately, although these programs match case insensitively, they do not adapt the replacement text to the case of the text that was matched. As an example, a word capitalized at the beginning of a sentence should remain capitalized after a replacement. MATLAB recognizes this need. In addition to providing options that will ignore the case and match the case, MATLAB also supplies the option to preserve the case. To support this, regexprep has the following options to handle case: matchcase, ignorecase and preservecase. The differences are shown in these examples:

>> regexprep('The Car, The car, THE CAR, the car, THE car', 'THE car', 'a BOAT', 'matchcase')

ans =

The Car, The car, THE CAR, the car, a BOAT

>> regexprep('The Car, The car, THE CAR, the car, THE car', 'THE car', 'a BOAT', 'ignorecase')

ans =

a BOAT, a BOAT, a BOAT, a BOAT, a BOAT

>> regexprep('The Car, The car, THE CAR, the car, THE car', 'THE car', 'a BOAT', 'preservecase')

ans =

A Boat, A boat, A BOAT, a boat, a BOAT

Note how the preservecase option does what you would most likely want it to.

MATLAB for Regular Expressions

The MATLAB regular expression functions are fully featured and refined text processing tools. Along with the other support that MATLAB provides for text manipulation, the suite within MATLAB is the easiest and cleanest way to write string processing code. When you next think of text processing, think of MATLAB, and maybe you won't have to think, "Yuck!"

  • Do you use MATLAB yet defer to Perl for text processing?
  • Is there functionality that you use in Perl for text processing that is absent from MATLAB?
Tell me about it here.

 

41 Comments

I’m not a big Perl user but Ruby has something similar to dynamic regular expressions.

puts "Hello Mathworks, Matlab Is Amazing".gsub(/\w(\w+)\w/) { |m|
  m.gsub($1, $1.chars.shuffle.join)
}

gives, for example,

Hlelo Mhawrtoks, Matalb Is Amanizg

The point is that Ruby allows you to pass a block to compute the substitution. The block can be arbitrarily complex. If I choose to inline the shuffle code I could nest my regular expression substitutions to achieve this effect.

puts "Hello Mathworks, Matlab Is Amazing".gsub(/\w(\w+)\w/) do |m|
  m.gsub!($1) do |x|
    1.upto(x.size) do |n|
      nn = rand(n)
      y = x[nn]
      x[nn]=""
      x

Dynamic regular expressions are a new feature in MATLAB with R2006a. Sorry about any confusion with prior releases.

The blog states explicitly:

In MATLAB it is possible to do this easily, using dynamic regular expressions, a new feature of regexp in Release 2006a.

Interesting article. Is it possible to do large-scale text mining using Matlab? It would be interesting if one could write a spider to visit a certain number of web pages and have it count and analyze words and patterns.

I use regular expressions in Matlab all the time. But most often, I apply them to text files. The way I have to do this is to read the entire file into a string and do my processing on that string. This becomes painful for large files, as I have to read in pieces of the file at a time. Maybe there is a better way of doing this in 2006a (I am still using R14 sp1). If there is not, it would be useful in the future to have a regular expression function that could accept a file name instead of a string. I think that would save me a lot.

SY

I’m in the “use MATLAB yet defer to Perl for text processing” camp. Or rather, as sy states, for text file processing. Matlab’s regular expression functions are certainly useful in certain situations, but when parsing large files, Perl still comes out ahead. I work often with an acoustic model. All my wrapper scripts for this model are in Matlab except those that parse the output file, a text file consisting of various tables (with varying numbers of columns) interspersed with text, in no particular order. Perl’s ability to quickly read and parse a file line by line, along with its extremely useful split operator, make it a better tool for the job, in my opinion.

Do Matlab regular expressions support Unicode, for manipulating text in other languages such as Arabic?

I have used Matlab for converting HTML files into matrices. I had not used regular expressions for the processing. The information given above will be very helpful for my future work. Thank you.

Is there any Matlab function like ‘split’ in Perl, with which I can get an array out of a string?

Ranjib-

In addition to regexp, have a look at strtok to see if that helps you out.
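As a rough sketch of both approaches, assuming a release in which regexp supports the 'split' option; the sample string and variable names are made up for illustration:

```matlab
str = 'alpha,beta,gamma';

% regexp with the 'split' option returns the pieces between matches.
parts = regexp(str, ',', 'split');      % {'alpha','beta','gamma'}

% strtok peels off one token at a time; loop until nothing is left.
tokens = {};
rest = str;
while ~isempty(rest)
    [tok, rest] = strtok(rest, ',');
    if ~isempty(tok)
        tokens{end+1} = tok;
    end
end

disp(isequal(parts, tokens))   % 1
```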

–Loren

Thanks, Loren, for such a useful article. I have tried to use Matlab for regex instead of Perl, but there is one problem: because of the nature of my project, I need to access HTML files from a remote location or web site for parsing.

I have no idea how to grab web page text (HTML) in Matlab for string manipulation.

Splitting is not really the problem.

But how do you automatically get a numeric array after parsing lines like:

123 | 456 | 6.4|asdd
124 | 457 | 6.8|asdd

result should be something like:

[123 456 6.4 NA; 124 457 6.8 NA; ]

Rob.

Hi Rob,

My initial thought was that you should use textscan for your problem. Unfortunately, textscan will stop processing if it encounters data that it cannot convert, rather than producing the NaNs that you want.

You can get your result using a couple of splits:

Create some sample data:

>> str = sprintf('123 | 456 | 6.4|asdd\n124 | 457 | 6.8|asdd\n125 | 458 | 7.0|qwerty')
str =
123 | 456 | 6.4|asdd
124 | 457 | 6.8|asdd
125 | 458 | 7.0|qwerty

Split the data on newlines to find the rows:

>> rows = regexp(str, '\n', 'split')
rows =
[1x20 char] [1x20 char] [1x22 char]

Now split the rows into individual cells:

>> cells = regexp(rows, '\s*\|\s*', 'split')
cells =
{1x4 cell} {1x4 cell} {1x4 cell}

Since cells is now a 1x3 cell of 1x4 cells, vertically concatenate them to recreate the original shape of the data:

>> cells = vertcat(cells{:})
cells =
'123' '456' '6.4' 'asdd'
'124' '457' '6.8' 'asdd'
'125' '458' '7.0' 'qwerty'

Now str2double will do the rest of the work:

>> str2double(cells)
ans =
123.0000 456.0000 6.4000 NaN
124.0000 457.0000 6.8000 NaN
125.0000 458.0000 7.0000 NaN

-=>J

Is the flavour of regular expressions implemented in Matlab identical (or close) to a flavour that one can find in a book like “Mastering Regular Expressions”, or that is supported by a tool like RegexBuddy, http://www.regexbuddy.com/?

MATLAB’s regular expressions are very similar to what you will find described in “Mastering Regular Expressions”. RegexBuddy supports many flavors, the most similar to MATLAB being Perl and Java. I highly recommend the “Mastering Regular Expressions” book. The book mentions what features and syntaxes are flavor specific, and often presents tables of the various languages and their support of a specific feature. MATLAB’s syntax will generally be one of those. The details of MATLAB’s regular expressions are found here: http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/matlab_prog/f0-42649.html

-=>J

Please enlighten me as to how to elegantly perform the following:

I have an ASCII text file (Notepad) with lines of values in the following format.

first= 320 second= 234566 third= -0.24 …
first= 788 second= 256667 third= -0.24 …
and so on.

I anticipated using code like the following to create a matrix of the numeric values in my text file that match my patterns.

fid = fopen('accuSUMok.txt');
i = 1;
while 1
    tline = fgetl(fid);
    pat = ' first=\s+\d+';
    a = regexp(tline,pat,'match');
    disp(a);
    pat = '\d+';
    c = regexp(a,pat,'match');
    disp(c);
    pat = ' third=\s+\-0\.\d+';
    b = regexp(tline,pat,'match');
    disp(b);
    pat ='\-0\.\d+';
    d = regexp(b,pat,'match');
    disp(d);
    j = [ a c b d]
    %printf('%s %s', a, b, c, d);
    if ~ischar(tline), break, end
end
fclose(fid);

But what I see is many lines like the following, with an error at the end of the file:

' first= 325'

{1x1 cell}

' third= -0.36'

{1x1 cell}

??? Undefined function or method 'regexp' for input arguments of type 'double'.

Error in ==> reg at 7
a = regexp(tline,pat,'match');

What can I do to elegantly put the numeric values into an array?
tia

Simplifying:
fid = fopen('ascii.txt');
while 1
    tline = fgetl(fid);
    pat = ' first=\s+(\d+)';
    a = regexp(tline,pat,'match');
    a(:)
    pat = '\d+';
    c = regexp(a,pat,'match');
    c(:)
    if ~ischar(tline), break, end
end
fclose(fid);
~~~~~~~~~~~~~~
produces
ans =
' first= 325'
ans =
{1x1 cell}
ans =
' first= 325'
ans =
{1x1 cell}
??? Undefined function or method 'regexp' for input arguments of type 'double'.
Error in ==> reg at 7
a = regexp(tline,pat,'match');

What’s the appropriate, elegant way to get at the numeric value?

Hi JB,

You need to modify your end-of-file check to be at the beginning of your while loop, not at the end.

The error you are getting is because you are passing -1 (end-of-file) to regexp.
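A minimal sketch of that restructuring follows. It assumes the line format from your example; the temporary file, the combined pattern, and the variable names are stand-ins for illustration rather than a definitive solution.

```matlab
% Write two sample lines to a temporary file so the sketch is self-contained.
fname = [tempname '.txt'];
fid = fopen(fname, 'w');
fprintf(fid, 'first= 320 second= 234566 third= -0.24\n');
fprintf(fid, 'first= 788 second= 256667 third= -0.24\n');
fclose(fid);

fid = fopen(fname);
vals = zeros(0, 2);
tline = fgetl(fid);
while ischar(tline)      % fgetl returns -1 (a double) at end-of-file
    % Capture the numbers after "first=" and "third=" as tokens.
    tok = regexp(tline, 'first=\s*(\d+).*?third=\s*(-?[\d.]+)', ...
        'tokens', 'once');
    if ~isempty(tok)
        vals(end+1, :) = str2double(tok);   % cell of strings -> numeric row
    end
    tline = fgetl(fid);
end
fclose(fid);
delete(fname);

disp(vals)
```

Testing ischar(tline) in the loop condition means regexp never sees the end-of-file marker, and using tokens avoids the second regexp call on a cell array.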

-=>J

Could you please give a pointer to the documentation of the regular expression language supported in Matlab, where one can see to what extent it supports the Perl dialect?

I am unable to get numbers from tokens. Say token c is shown as 8X42 (cellarray?). How do I get a number corresponding to say element in 7th row and 31st column?

How do I use str2double? For a 1X1 cellarray
str2double(c{1}{1});
seems to work as an argument to str2double. I cannot figure out what would work for later elements of the cellarray.

Figured out just now. For getting the number from the 7th row and 31st column of a cellarray, use

str2double(c{7,31}{1})

long live braces. Thanks for the platform.

I occasionally need to use regexp() within MATLAB. When I do it’s usually a frustrating experience. The documentation for regular expressions is by necessity & by nature really long & really dense! The doc page: User’s Guide\Programming Fundamentals\ Basic Program Components\Regular Expressions is 48 letter-size pages long, 13,000 words!!

Consequently, if I’m doing anything even slightly non-trivial, it can be very time-consuming to work out the proper match expression by trial-and-error at the MATLAB command-line.

In desperation I Googled for a Reg. Expr. utility & found a free .NET-based offering called Expresso. This is a multi-pane app (similar in look to MATLAB) that allows one to see (in a tree view) how match expressions parse out. It’s really lowered the stress-level for me & speeds things up dramatically when I have to use regexp() in MATLAB.

IMHO, the Mathworks should consider offering some kind of analyzer like Expresso within the MATLAB environment. Otherwise, in my experience, reg. expr’s can be a real bottleneck for rapid code development.

Any comments appreciated,
Respectfully,
Brad

Brad-

I have the same troubles with regexp as you describe and would love to see a GUI that could help me out! Please make this into an enhancement request by using the support link on the right side of my blog. Thanks!

–Loren

Hi Loren, thanks for your empathy & encouragement. I’ve submitted an enhancement request & referred to our dialogue here as supporting evidence. Fingers crossed! ;)

Happy holidays,
Brad

Although this post is from 2006, it’s now 2011, so we’ve caught up!

The same regex can be used in Perl: they just didn’t because it’s horrible :) Clear code is preferable unless you’re intentionally obfuscating.

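use List::Util qw(shuffle); # List::Util is core
while (<>) {
  # /e evaluates the replacement as code; /g scrambles every word on the line
  s{(?<=\w)(\w{2,})(?=\w)}{join '', shuffle split //, $1}eg;
  print;   # $_ still ends with its newline
}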

Code evaluation like this has always been possible in Perl 5 as far as I can tell.

We also have named capture groups, which are stored in the special hash %+ since changing the way regexes return would introduce inconsistency:

(dump function provided by Data::Dump)

use 5.010;
my @date = '11/26/1977' =~ m{(\d+)/(\d+)/(\d+)};
dump \@date;

# [11, 26, 1977]


'11/26/1977' =~ m{(?<month>\d+)/(?<day>\d+)/(?<year>\d+)};
my %date = %+;

dump \%date;

# { day => 26, month => 11, year => 1977 }

This is a feature of 5.10. Although many places are still running 5.8.8 it’s hardly Perl’s fault that people can’t keep up :) 5.10 is itself end-of-life; 5.14 is current and 5.12 is considered old.

Case preservation is a thing we don’t do. I suspect if you ask one of the core team why they will give you one of two answers:

1. We don’t know of anyone who has ever wanted it
2. We can’t make it work in a consistent way.

Kirk out

Loren,

Something’s not working for me.

Why does the following use of a dynamic regular expression not return the same result as its static equivalent? Am I missing something about when dynamic expressions can be used in R2011a?

>> tst='4 A B C D 3';
>> ptrn='^(\d)(\s+[A-D]){4}';
>> [m,t]=regexpi(tst,ptrn,'match','tokens','once')

m =

4 A B C D

t =

'4' ' A B C D'

>> ptrn='^(\d)(\s+[A-D]){(??$1)}'

ptrn =

^(\d)(\s+[A-D]){(??$1)}

>> [m,t]=regexpi(tst,ptrn,'match','tokens','once')

m =

t =

{}

NOTE: The only thing that has changed is that I’ve replaced the number 4 with the dynamic regular expression (??$1) that should equate to exactly the same character '4', as per the tokens shown above. I also have the same problem when I use a dynamic function call, as in (?@return_same_character($1)). The token is definitely captured, but it seems like the string is not being properly updated before the final matching is attempted.

Thanks for your help on this; I’m stymied.

Regards,
Adrian

This is a limitation of dynamic expressions which is not well documented.

The expression generated dynamically needs to be a complete expression. What this means is that if you take the pattern returned by the dynamic operator, it should be able to match as a standalone expression.

The dynamic portion of your pattern produces the expression "4" which will match the number "4", as opposed to being used as a quantifier for the greater expression which contains it. The way you handle this is to generate more of your expression as part of the dynamic component:

>> ptrn = '^(\d)((??(\\s+[A-D]){$1}))';
>> [m,t]=regexpi(tst,ptrn,'match','tokens','once')

m =

4 A B C D

t =

'4' ' A B C D'

A few things to note here:

1) Changing the dynamic component to (??{$1}) is not enough, as it is still not a complete expression, just the quantifier.

2) You have to escape the \s in the expression, as the entire subexpression is parsed like a replace string for regexprep.

3) To capture the second portion, I needed to add another set of parentheses, as dynamic expressions cannot create new capturing groups.

I hope that helps,

-=>J

These postings are the author's and don't necessarily represent the opinions of MathWorks.