MATLAB, Strings, and Regular Expressions

Posted by Loren Shure, April 5, 2006

I'm pleased to introduce Jason Breslau, our guest blogger this week, who gives us his take on MATLAB, strings, and regular expressions.

When you think about text processing, you probably think about Perl, as well you should. Perl is the de facto standard in text processing: it was created for the task and is fine-tuned to make it as easy as possible. To continue with my presupposition of your train of thought, when you think of Perl, you most likely think of regular expressions. And when you think of regular expressions, you probably think, "Yuck!"

For those of you unfamiliar with regular expressions, they provide a mechanism to describe patterns in text, for matching or replacing. They are generally considered very useful, yet ugly, and difficult to understand and use. Perl was created for processing text, and has regular expressions deeply ingrained into its language. Its most basic operators are match and substitute, both of which have regular expressions built into them. Virtually every Perl program uses regular expressions, and it is likely that thoughts of regular expressions lead to Perl, as much as the other way around.

One of MATLAB's hidden strengths is its ability to handle text processing. MATLAB supports all of the requisite file I/O functions and provides a wide selection of string functions, but most importantly, MATLAB has built-in regular expressions.

Text processing plays right into MATLAB's forte: matrices. In Perl, strings are an atomic data type, which earns them special treatment from the language. In MATLAB, strings are one-dimensional matrices of type char, and MATLAB can treat them as it does other matrices. This is useful if you want to perform math on a string. That might sound preposterous if you are thinking about computing the eigenvalues of your text file. However, useful math can be performed on text, for example in cryptography.
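As a small illustration of doing math on a string, here is a sketch of a Caesar shift, a toy cipher chosen purely for demonstration (it is not part of the original discussion and is not real cryptography). The variable names are mine, and the code assumes the input contains only lowercase letters a to z.

```matlab
% A string is a row vector of char, so indexing and arithmetic just work.
msg = 'hello';
codes = double(msg);        % numeric character codes: [104 101 108 108 111]
reversed = msg(end:-1:1);   % ordinary matrix indexing reverses the text

% Caesar shift: rotate each lowercase letter three places, wrapping at 'z'.
% Assumes msg contains only the letters a-z.
shift = 3;
encoded = char(mod(msg - 'a' + shift, 26) + 'a');
decoded = char(mod(encoded - 'a' - shift, 26) + 'a');

disp(encoded)   % khoor
disp(decoded)   % hello
```

Subtracting 'a' turns the characters into numbers from 0 to 25, so the wraparound is just a call to mod.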

For example, you may have seen an article that has circulated on the internet that speculates that people can easily read text with each word scrambled, as long as the first and last letters of each word are preserved. Jamie Zawinski has written a Perl script to perform this scrambling on user input. Here is that script stripped to its core code:

while (<>) {
    foreach (split (/(\w+)/)) {
        if (m/\w/) {
            my @w = split (//);
            my $A = shift @w;
            my $Z = pop @w;
            print $A;
            if (defined ($Z)) {
                my $i = $#w+1;
                while ($i--) {
                    my $j = int rand ($i+1);
                    @w[$i,$j] = @w[$j,$i];
                }
                foreach (@w) {
                    print $_;
                }
                print $Z;
            }
        } else {
            print "$_";
        }
    }
}

In MATLAB it is possible to do this easily, using dynamic regular expressions, a new feature of regexp in Release 2006a.

while true
    line = input('', 's');
    line = regexprep(line, '(?<=\w)\w{2,}(?=\w)', '${$0(randperm(length($0)))}');
    disp(line);
end

Note that the replacement text contains a dynamic expression, ${...}, which evaluates MATLAB code on each match (referred to as $0) before the text is replaced. This makes it possible to call MATLAB's randperm function.
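If you would rather not type at a prompt that loops forever, the same replacement can be tried on a fixed string. The following is a sketch: the sample sentence and the checks at the end are my own additions, meant only to confirm that each word keeps its first letter, its last letter, and its set of letters.

```matlab
% Scramble the interior letters of every word in a fixed string.
% The pattern matches runs of two or more word characters that have a
% word character on each side, i.e. the interior of words of 4+ letters.
str = 'Regular expressions scramble words nicely';
scrambled = regexprep(str, '(?<=\w)\w{2,}(?=\w)', ...
    '${$0(randperm(length($0)))}');
disp(scrambled);

% Check the invariants word by word.
orig = regexp(str, '\w+', 'match');
new = regexp(scrambled, '\w+', 'match');
for k = 1:numel(orig)
    % First and last letters are untouched by the scramble.
    assert(orig{k}(1) == new{k}(1) && orig{k}(end) == new{k}(end));
    % The letters are only permuted, never added or dropped.
    assert(isequal(sort(orig{k}), sort(new{k})));
end
```

Because randperm is random, the displayed text differs from run to run, which is why the checks compare properties rather than a fixed expected string.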

With its support for string functions, notably regexp, MATLAB is easier to use, and consequently more effective, than Perl for text processing.

Advantages to MATLAB `regexp`

There are several differences betwixt the Perl and MATLAB implementations of regular expressions. Most of these differences are nuances in the languages, such as default options and syntax for word breaks. Some advantageous aspects of regular expressions in MATLAB over Perl are named tokens and case preservation.

Named Tokens

I have found that once someone starts using regular expressions, a compelling feature they discover is tokens, also known as capture groups. Tokens are created by top level parenthetical subexpressions in a regular expression. These tokens are then available as outputs to the regexp function, as backreferences later in the same pattern, and as arguments that modify replacement text in regexprep. If you like using tokens in your regular expressions, you will love using named tokens.
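The three uses of tokens described above can be sketched in a few lines. The sample string and variable names here are only for illustration.

```matlab
str = 'hello world';

% 1. Tokens as an output of regexp: one cell per match, and inside it
%    one cell per token.
tok = regexp(str, '(\w+) (\w+)', 'tokens');
first = tok{1}{1};     % 'hello'
second = tok{1}{2};    % 'world'

% 2. Tokens as backreferences inside the pattern: \1 must repeat
%    whatever the first group captured, so this finds doubled letters.
doubled = regexp(str, '(\w)\1', 'match');    % {'ll'}

% 3. Tokens in replacement text: $1 and $2 refer to the captured groups.
swapped = regexprep(str, '(\w+) (\w+)', '$2 $1');    % 'world hello'
```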

The named tokens feature allows you to specify names for the parenthetical subexpressions that capture tokens. The tokens may then be referred to by name, as opposed to by number. This makes expressions clearer and less prone to bugs, because combining two expressions changes the indices of numbered tokens but not their names. For example, here is an expression that extracts the month, day and year from a date without using named tokens:

>> date = regexp('11/26/1977', '(\d+)/(\d+)/(\d+)', 'tokens');
>> date{:}

ans = 

    '11'    '26'    '1977'

If the above code were in a function, day would be referred to as date{1}{2}, since the tokens output holds one cell per match and, inside it, one cell per token. Converting this example to use named tokens:

>> date = regexp('11/26/1977', '(?<month>\d+)/(?<day>\d+)/(?<year>\d+)', 'names')

date = 

    month: '11'
      day: '26'
     year: '1977'

Now day may be referred to as date.day. What if the pattern also needs to match dates in the European format of dd.mm.yyyy? The first pattern could be written as:

>> date = regexp('11/26/1977', '(\d+)/(\d+)/(\d+)|(\d+)\.(\d+)\.(\d+)', 'tokens');
>> date{:}

ans = 

    '11'    '26'    '1977'

>> date = regexp('26.11.1977', '(\d+)/(\d+)/(\d+)|(\d+)\.(\d+)\.(\d+)', 'tokens');
>> date{:}

ans = 

    '26'    '11'    '1977'

Notice, however, that nothing in the output tells you which token is the month and which is the day; the order depends on which format matched. The same example can be fixed using named tokens:

>> date = regexp('11/26/1977', ...
'(?<month>\d+)/(?<day>\d+)/(?<year>\d+)|(?<day>\d+)\.(?<month>\d+)\.(?<year>\d+)',...
'names')

date = 

    month: '11'
      day: '26'
     year: '1977'


>> date = regexp('26.11.1977', ...
'(?<month>\d+)/(?<day>\d+)/(?<year>\d+)|(?<day>\d+)\.(?<month>\d+)\.(?<year>\d+)',...
'names')

date = 

    month: '11'
      day: '26'
     year: '1977'

As you can see, named tokens clarify the expression, results and surrounding code.
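To show how named tokens tidy up the surrounding code, here is a sketch that reformats the parsed date as yyyy-mm-dd. The target format and the variable names are my own choices for illustration. The point is that the code reads the fields by name and never needs to know the order in which they were captured.

```matlab
% Parse a US-style date into named fields.
d = regexp('11/26/1977', '(?<month>\d+)/(?<day>\d+)/(?<year>\d+)', 'names');

% The fields are text, so convert them to numbers before formatting.
y = str2double(d.year);
m = str2double(d.month);
dd = str2double(d.day);

% Rebuild the date as yyyy-mm-dd with zero padding.
iso = sprintf('%04d-%02d-%02d', y, m, dd);
disp(iso)   % 1977-11-26
```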

Case Preservation

Case insensitivity is such a fundamental aspect of pattern matching that MATLAB provides it as a separate built-in function, regexpi. Many programs offer this option in their search and replace functions. Unfortunately, although they match case insensitively, they do not adapt the replacement text to the case of the text that was matched. As an example, a word capitalized at the beginning of a sentence should remain capitalized after a replacement. MATLAB recognizes this need. In addition to providing options that ignore the case and match the case, MATLAB also supplies the option to preserve the case. To support this, regexprep has the following options to handle case: matchcase, ignorecase and preservecase. The differences are shown in these examples:

>> regexprep('The Car, The car, THE CAR, the car, THE car', 'THE car', 'a BOAT', 'matchcase')

ans =

The Car, The car, THE CAR, the car, a BOAT

>> regexprep('The Car, The car, THE CAR, the car, THE car', 'THE car', 'a BOAT', 'ignorecase')

ans =

a BOAT, a BOAT, a BOAT, a BOAT, a BOAT

>> regexprep('The Car, The car, THE CAR, the car, THE car', 'THE car', 'a BOAT', 'preservecase')

ans =

A Boat, A boat, A BOAT, a boat, a BOAT

Note how the preservecase option does what you would most likely want it to.
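The case-insensitive matching mentioned at the start of this section is also available outside of replacements. As a brief sketch, with a sample string of my own, regexpi and the ignorecase option of regexp find the same matches:

```matlab
str = 'The Car, The car, THE CAR';

% Case-sensitive by default: only the all-lowercase occurrence matches.
sensitive = regexp(str, 'car', 'match');       % {'car'}

% regexpi ignores case, so all three occurrences match.
insensitive = regexpi(str, 'car', 'match');    % {'Car', 'car', 'CAR'}

% The same behavior through regexp with the ignorecase option.
viaOption = regexp(str, 'car', 'match', 'ignorecase');
```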

MATLAB for Regular Expressions

The MATLAB regular expression functions are fully featured and refined text processing tools. Along with the other support that MATLAB provides for text manipulation, the suite within MATLAB is the easiest and cleanest way to write string processing code. When you next think of text processing, think of MATLAB, and maybe you won't have to think, "Yuck!"