Loren on the Art of MATLAB
April 5th, 2006
MATLAB, Strings, and Regular Expressions
I'm pleased to introduce Jason Breslau, our guest blogger this week, who gives us his take on MATLAB, strings, and regular expressions.
When you think about text processing, you probably think about Perl, as well
you should. Perl is the de facto standard in text processing; it was created
for the task, and is fine-tuned to make it as easy as possible. To continue
with my presupposition of your train of thought, when you think of Perl, you
most likely think of regular expressions. Subsequently, when you think of
regular expressions, you probably think, "Yuck!"
For those of you unfamiliar with regular
expressions, they provide a mechanism to describe patterns in text, for
matching or replacing. They are generally considered very useful, yet ugly,
and difficult to understand and use. Perl was created for processing text,
and has regular expressions deeply ingrained into its language. Its most
basic operators are match and substitute, both of which have regular
expressions built into them. Virtually every Perl program uses regular
expressions, and it is likely that thoughts of regular expressions lead to
Perl, as much as the other way around.
One of MATLAB's hidden strengths is its ability to handle text processing.
MATLAB supports all of the requisite file
I/O functions, and provides a wide selection of string
functions, but most importantly, MATLAB has built-in regular
expressions.
Text processing plays right into MATLAB's forte: matrices. In Perl, strings
are an atomic data type, which lends them to special care from the language.
In MATLAB, strings are one dimensional matrices of type char, and MATLAB can
treat them as it does other matrices. This is useful if you want to perform
math on a string. This might sound preposterous if you are thinking about
taking the eigenvalues of your text file; however, useful math can be
performed on text, for example in cryptography.
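To make the "math on a string" idea concrete, here is a minimal sketch of a Caesar shift written as ordinary vector arithmetic on character codes. The message and the key `shift` are made up for illustration, and this is a toy cipher, not real cryptography.

```matlab
% Toy Caesar shift: treat the string as a numeric vector (illustration only).
msg   = 'attack at dawn';
shift = 3;                             % hypothetical key, chosen for this example
mask  = msg >= 'a' & msg <= 'z';       % logical index of the lowercase letters

codes = double(msg);                   % character codes as plain numbers
codes(mask) = mod(codes(mask) - double('a') + shift, 26) + double('a');
secret = char(codes);                  % convert the numbers back to text
disp(secret)                           % dwwdfn dw gdzq

% Undo the shift with the same arithmetic in reverse.
codes = double(secret);
codes(mask) = mod(codes(mask) - double('a') - shift, 26) + double('a');
plain = char(codes);
disp(plain)                            % attack at dawn
```

Spaces are left alone because only the positions selected by `mask` are modified.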
For example, you may have seen an article
that has circulated on the internet that speculates that people can easily
read text with each word scrambled, as long as the first and last letters of
each word are preserved. Jamie Zawinski has written a Perl script to perform
this scrambling on user input. Here is that script stripped to its core code:
while (<>) {
    foreach (split (/(\w+)/)) {
        if (m/\w/) {
            my @w = split (//);
            my $A = shift @w;
            my $Z = pop @w;
            print $A;
            if (defined ($Z)) {
                my $i = $#w+1;
                while ($i--) {
                    my $j = int rand ($i+1);
                    @w[$i,$j] = @w[$j,$i];
                }
                foreach (@w) {
                    print $_;
                }
                print $Z;
            }
        } else {
            print "$_";
        }
    }
}
In MATLAB it is possible to do this easily, using dynamic
regular expressions, a new feature of regexp in Release 2006a.
while true
    line = input('', 's');
    line = regexprep(line, '(?<=\w)\w{2,}(?=\w)', '${$0(randperm(length($0)))}');
    disp(line);
end
Note that the regular expression is evaluating MATLAB code before replacing
the text. This makes it possible to call MATLAB's randperm
function.
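As a smaller illustration of the same mechanism, the sketch below uses two other functions inside the `${...}` replacement. The sample string is made up; `$0` stands for the whole matched text, and the dynamic syntax assumes R2006a or later.

```matlab
% The text inside ${...} is evaluated as MATLAB code once per match,
% and its result becomes the replacement for that match.
str = 'hello regexp world';

reversed = regexprep(str, '\w+', '${fliplr($0)}');  % reverse each word
shouted  = regexprep(str, '\w+', '${upper($0)}');   % upper-case each word

disp(reversed)   % olleh pxeger dlrow
disp(shouted)    % HELLO REGEXP WORLD
```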
With MATLAB's support of string functions, notably regexp, it is easier to
use, and consequently more effective, than Perl for text processing.
Advantages to MATLAB regexp
There are several differences betwixt the Perl and MATLAB implementations of
regular expressions. Most of these differences are nuances in the languages,
such as default options and syntax for word breaks. Some advantageous
aspects of regular expressions in MATLAB over Perl are named tokens and case
preservation.
Named Tokens
I have found that once someone starts using regular expressions, a
compelling feature they discover is tokens, also known as capture groups. Tokens
are created by top level parenthetical subexpressions in a regular
expression. These tokens are then available as outputs to the regexp
function, as backreferences later in the same pattern, and as arguments that
modify replacement text in regexprep.
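The three uses of tokens listed above can be sketched as follows. The sample strings and variable names are invented for illustration and do not come from the post.

```matlab
% Illustrative strings only.
str = 'Smith, John';
pat = '(\w+),\s*(\w+)';

% 1) Tokens as outputs: one cell per match, each holding that match's tokens.
tok = regexp(str, pat, 'tokens');
lastName  = tok{1}{1};                  % 'Smith'
firstName = tok{1}{2};                  % 'John'

% 2) Tokens in replacement text: $1 and $2 refer to the captured groups.
swapped = regexprep(str, pat, '$2 $1'); % 'John Smith'

% 3) Tokens as backreferences: \1 must repeat whatever group 1 captured.
repeated = regexp('it is is fine', '(\w+)\s+\1', 'match');  % {'is is'}
```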
If you like using tokens in your regular expressions, you will love using named
tokens.
The named tokens feature allows you to specify names for the parenthetical
subexpressions that capture tokens. Then the tokens may be referred to by
name, as opposed to by number. This makes expressions clearer, as well as
less prone to bugs. This is because appending two expressions will change the
indices of the tokens. For example, write an expression to extract the
month, day and year from a date without using named tokens:
>> date = regexp('11/26/1977', '(\d+)/(\d+)/(\d+)', 'tokens');
>> date{:}
ans =
'11' '26' '1977'
If the above code were in a function, day would be referred to as
date{2}. Converting this example to use named tokens:
>> date = regexp('11/26/1977', '(?<month>\d+)/(?<day>\d+)/(?<year>\d+)', 'names')
date =
month: '11'
day: '26'
year: '1977'
Now day may be referred to as date.day. What if the pattern also
needs to match dates in the European format of dd.mm.yyyy? The first
pattern could be written as:
>> date = regexp('11/26/1977', '(\d+)/(\d+)/(\d+)|(\d+)\.(\d+)\.(\d+)', 'tokens');
>> date{:}
ans =
'11' '26' '1977'
>> date = regexp('26.11.1977', '(\d+)/(\d+)/(\d+)|(\d+)\.(\d+)\.(\d+)', 'tokens');
>> date{:}
ans =
'26' '11' '1977'
Notice, however, that the two results are indistinguishable: nothing in the
output says which token is the day and which is the month. The same example
can be fixed using named tokens:
>> date = regexp('11/26/1977', ...
'(?<month>\d+)/(?<day>\d+)/(?<year>\d+)|(?<day>\d+)\.(?<month>\d+)\.(?<year>\d+)',...
'names')
date =
month: '11'
day: '26'
year: '1977'
>> date = regexp('26.11.1977', ...
'(?<month>\d+)/(?<day>\d+)/(?<year>\d+)|(?<day>\d+)\.(?<month>\d+)\.(?<year>\d+)',...
'names')
date =
month: '11'
day: '26'
year: '1977'
As you can see, named tokens clarify the expression, results and surrounding code.
Case Preservation
Case insensitivity is such a fundamental aspect of pattern matching that
MATLAB emphasizes it as a separate built-in function, regexpi.
Many programs use this option in search and replace functions.
Unfortunately, although these instances match case insensitively, they do
not properly address the need to also have dynamic replacements. As an
example, a word capitalized at the beginning of a sentence should remain
capitalized after a replacement. MATLAB recognizes this need. In addition to
providing options that will ignore the case and match the case, MATLAB also
supplies the option to preserve the case. To support this,
regexprep has the following options to handle case:
matchcase, ignorecase and preservecase. The
differences are shown in these examples:
>> regexprep('The Car, The car, THE CAR, the car, THE car', 'THE car', 'a BOAT', 'matchcase')
ans =
The Car, The car, THE CAR, the car, a BOAT
>> regexprep('The Car, The car, THE CAR, the car, THE car', 'THE car', 'a BOAT', 'ignorecase')
ans =
a BOAT, a BOAT, a BOAT, a BOAT, a BOAT
>> regexprep('The Car, The car, THE CAR, the car, THE car', 'THE car', 'a BOAT', 'preservecase')
ans =
A Boat, A boat, A BOAT, a boat, a BOAT
Note how the preservecase option does what you would most likely
want it to.
MATLAB for Regular Expressions
The MATLAB regular expression functions are fully featured and refined text
processing tools. Along with the other support that MATLAB provides for text
manipulation, the suite within MATLAB is the easiest and cleanest way to
write string processing code. When you next think of text processing, think
of MATLAB, and maybe you won't have to think, "Yuck!"
- Do you use MATLAB yet defer to Perl for text processing?
- Is there functionality that you use in Perl for text processing that is absent from MATLAB?
Tell me about it here.
Loren Shure works on design of the MATLAB language at The MathWorks. She writes here about once a week on MATLAB programming and related topics. 
It would be great to read a note on this page saying: “Don`t try this at home with Release
I’m not a big perl user but Ruby has something similar to dynamic regular expressions.
puts "Hello Mathworks, Matlab Is Amazing".gsub(/\w(\w+)\w/) { |m|
  m.gsub($1, $1.shuffle)
}
gives
Hlelo Mhawrtoks, Matalb Is Amanizg
The point is that Ruby allows you to pass a block to compute the substitution. The block can be arbitrarily complex. If I chose to inline the shuffle code, I could nest my regular expression substitutions to achieve this effect.
puts "Hello Mathworks, Matlab Is Amazing".gsub(/\w(\w+)\w/) do |m|
  m.gsub!($1) do |x|
    1.upto(x.size) do |n|
      nn = rand(n)
      y = x[nn]
      x[nn]=""
      x
Dynamic regular expressions are a new feature in MATLAB with R2006a, sorry about any confusion with prior releases.
The blog states explicitly:
In MATLAB it is possible to do this easily, using dynamic regular expressions, a new feature of regexp in Release 2006a.
That’s because I updated the blog, SRL, in order to help others.
Interesting article. Is it possible to do large-scale text mining using MATLAB? It would be interesting if one could write a spider to crawl a certain number of web pages and have it count and analyze words and patterns.
I use regular expressions in Matlab all the time. But most often, I apply them to text files. The way I have to do this is to read the entire file into a string and do my processing on that string. This becomes painful for large files, as I have to read in pieces of the file at a time. Maybe there is a better way of doing this in 2006a (I am still using R14 sp1). If there is not, it would be useful in the future to have a regular expression function that could accept a file name instead of a string. I think that would save me a lot.
SY
SY-
You might take a look at memmapfile, depending on the format of your file.
I’m in the “use MATLAB yet defer to Perl for text processing” camp. Or rather, as SY states, for text file processing. MATLAB’s regular expression functions are certainly useful in certain situations, but when parsing large files, Perl still comes out ahead. I often work with an acoustic model. All my wrapper scripts for this model are in MATLAB except those that parse the output file, a text file consisting of various tables (with varying numbers of columns) interspersed with text, in no particular order. Perl’s ability to quickly read and parse a file line by line, along with its extremely useful split operator, makes it a better tool for the job, in my opinion.
Do MATLAB regular expressions support Unicode, for manipulating text in other languages such as Arabic?
MATLAB Regular Expressions fully support Unicode.
I have used MATLAB for converting HTML files into matrices. I had not used regular expressions for the processing. The information given above will be very helpful for my future work. Thank you.
Is there a MATLAB function like split in Perl that I can use to get an array out of a string?
Ranjib-
In addition to regexp, have a look at strtok to see if that helps you out.
–Loren
There is a new split option for regexp in the R2007b release.
See the documentation here.
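For readers who want a quick picture of the two suggestions above, here is a small sketch. The input strings are invented for illustration, and the 'split' option assumes R2007b or later.

```matlab
% Split on commas or semicolons, swallowing any surrounding whitespace.
str   = 'alpha, beta;gamma';
parts = regexp(str, '\s*[,;]\s*', 'split');   % {'alpha', 'beta', 'gamma'}

% strtok peels off one token at a time (whitespace-delimited by default).
[tok, rest] = strtok('one two three');        % tok = 'one', rest = ' two three'
```

The 'split' form returns a cell array of strings, which is the closest analogue to the list that Perl's split returns.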
Thanks Loren,
for such a useful article. I have tried to use MATLAB for regular expressions instead of Perl, but there is one problem: because of the nature of my project, I need to access HTML files from a remote location or web site for parsing.
I have no idea how to grab web page text (HTML) in MATLAB for string manipulation.
Gohar-
Look in the documentation, for example here, to see information on functions such as urlread.
–Loren
Is there any easy method to calculate code running time in milliseconds?
Gohar-
Please contact technical support. This question is off-topic.
Thank you.
–Loren
Gohar-
check out the tic / toc functions
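A minimal sketch of that suggestion: toc returns elapsed seconds, so multiplying by 1000 gives milliseconds. The loop being timed is just a placeholder workload.

```matlab
tic;                                   % start the stopwatch
s = 0;
for k = 1:1e5                          % placeholder work to be timed
    s = s + sqrt(k);
end
elapsedMs = toc * 1000;                % toc is in seconds; convert to ms
fprintf('Elapsed: %.3f ms\n', elapsedMs);
```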
Splitting is not really the problem.
But how do you automatically get a numeric array after parsing lines like:
123 | 456 | 6.4|asdd
124 | 457 | 6.8|asdd
result should be something like:
[123 456 6.4 NA; 124 457 6.8 NA; ]
Rob.
Hi Rob,
My initial thought was that you should use textscan for your problem. Unfortunately, textscan will stop processing if it encounters data that it can not convert, as opposed to the NaNs that you want.
You can get your result using a couple of splits:
Create some sample data:
>> str = sprintf('123 | 456 | 6.4|asdd\n124 | 457 | 6.8|asdd\n125 | 458 | 7.0|qwerty')
str =
123 | 456 | 6.4|asdd
124 | 457 | 6.8|asdd
125 | 458 | 7.0|qwerty
Split the data on newlines to find the rows:
>> rows = regexp(str, '\n', 'split')
rows =
[1×20 char] [1×20 char] [1×22 char]
Now split the rows into individual cells:
>> cells = regexp(rows, '\s*\|\s*', 'split')
cells =
{1×4 cell} {1×4 cell} {1×4 cell}
Since cells is now a 1×3 cell of 1×4 cells, vertically concatenate them to recreate the original shape of the data:
>> cells = vertcat(cells{:})
cells =
'123' '456' '6.4' 'asdd'
'124' '457' '6.8' 'asdd'
'125' '458' '7.0' 'qwerty'
Now str2double will do the rest of the work:
>> str2double(cells)
ans =
123.0000 456.0000 6.4000 NaN
124.0000 457.0000 6.8000 NaN
125.0000 458.0000 7.0000 NaN
-=>J
Is the flavour of regular expressions implemented in MATLAB identical (or close) to a flavour that one can find in a book like “Mastering Regular Expressions”, or that is supported by a tool like RegexBuddy, http://www.regexbuddy.com/?
MATLAB’s regular expressions are very similar to what you will find described in “Mastering Regular Expressions”. RegexBuddy supports many flavors, the most similar to MATLAB being Perl and Java. I highly recommend the “Mastering Regular Expressions” book. The book mentions what features and syntaxes are flavor specific, and often presents tables of the various languages and their support of a specific feature. MATLAB’s syntax will generally be one of those. The details of MATLAB’s regular expressions are found here: http://www.mathworks.com/access/helpdesk/help/techdoc/index.html?/access/helpdesk/help/techdoc/matlab_prog/f0-42649.html
-=>J
strread
Please enlighten me as to how to elegantly perform the following:
I have an ascii text file (notepad) with lines of values with the following format.
first= 320 second= 234566 third= -0.24 …
first= 788 second= 256667 third= -0.24 …
and so on.
I anticipated using code like the following to create a matrix of the numeric values in my text file that match my patterns.
fid = fopen('accuSUMok.txt');
i = 1;
while 1
    tline = fgetl(fid);
    pat = ' first=\s+\d+';
    a = regexp(tline,pat,'match');
    disp(a);
    pat = '\d+';
    c = regexp(a,pat,'match');
    disp(c);
    pat = ' third=\s+\-0\.\d+';
    b = regexp(tline,pat,'match');
    disp(b);
    pat ='\-0\.\d+';
    d = regexp(b,pat,'match');
    disp(d);
    j = [ a c b d]
    %printf('%s %s', a, b, c, d);
    if ~ischar(tline), break, end
end
fclose(fid);
But what I see is many lines of the following, with an error at the end of the file:
' first= 325'
{1×1 cell}
' third= -0.36'
{1×1 cell}
??? Undefined function or method 'regexp' for input arguments of type 'double'.
Error in ==> reg at 7
a = regexp(tline,pat,'match');
What can I do to elegantly put the numeric values into an array?
tia
simplifying:
fid = fopen('ascii.txt');
while 1
    tline = fgetl(fid);
    pat = ' first=\s+(\d+)';
    a = regexp(tline,pat,'match');
    a(:)
    pat = '\d+';
    c = regexp(a,pat,'match');
    c(:)
    if ~ischar(tline), break, end
end
fclose(fid);
~~~~~~~~~~~~~~
produces
ans =
' first= 325'
ans =
{1×1 cell}
ans =
' first= 325'
ans =
{1×1 cell}
??? Undefined function or method 'regexp' for input arguments of type 'double'.
Error in ==> reg at 7
a = regexp(tline,pat,'match');
What’s the appropriate, elegant way to get at the numeric value?
Here’s the corrected link for tokens.
–Loren
Hi JB,
You need to modify your end-of-file check to be at the beginning of your while loop, not at the end.
The error you are getting is because you are passing -1 (end-of-file) to regexp.
-=>J
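One way the advice above might look in practice is sketched here. It moves the end-of-file check to the top of the loop and uses tokens to pull out the numbers directly. The temporary file, its two sample lines, the pattern, and the variable names are all assumptions made so that the example is self-contained; adapt the pattern to the real file format.

```matlab
% Write a small sample file so the sketch can run on its own (hypothetical data).
fname = [tempname '.txt'];
fid = fopen(fname, 'w');
fprintf(fid, 'first= 320 second= 234566 third= -0.24\n');
fprintf(fid, 'first= 788 second= 256667 third= -0.36\n');
fclose(fid);

vals = zeros(0, 2);                     % one row per line: [first third]
fid = fopen(fname, 'r');
while true
    tline = fgetl(fid);
    if ~ischar(tline), break, end       % fgetl returns -1 at end of file
    % Capture the two numbers as tokens; 'once' returns a flat 1x2 cell.
    tok = regexp(tline, 'first=\s*(\d+).*?third=\s*(-?\d*\.?\d+)', ...
                 'tokens', 'once');
    if ~isempty(tok)
        vals(end+1, :) = str2double(tok);   % convert both tokens to numbers
    end
end
fclose(fid);
delete(fname);                          % clean up the temporary file
disp(vals)
```

Because the check runs before regexp is called, the -1 that fgetl returns at end of file never reaches the pattern match.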