Learning to Love Regular Expressions

Posted by Loren Shure, October 18, 2012

Today I’d like to introduce guest blogger Sarah Wait Zaranek, who works for the MATLAB Marketing team here at MathWorks. Sarah has previously written about using GPUs in MATLAB. Today she will discuss how she got started using regular expressions.

Overview
The Basics
Example #1 - Splitting a String into Separate Words
Example #2 - Creating Short Labels for a Plot
Example #3 - Finding Data Marked by Repeat Letters
Conclusion

Overview

Over the past few years, I have had the honor of doing several guest blog posts for Loren. Usually, I am blogging about something that I know quite well - but this time it is different. I wanted to write about regular expressions, talk a little about how I am starting to use them, and show some examples that I created along the way. My background is in computational geophysics, so I am pretty comfortable with numbers, parallel computing and a whole bunch of other MATLAB stuff. But I never really had to manipulate strings. In my limited work with strings, I found that functions like strfind were enough for me to get the proverbial job done.

Well, then I found Cody. If you haven't started using Cody - you might not understand how addictive it can be! All those cool little coding puzzles in MATLAB, I just couldn't stop. However, I found myself pretty consistently skipping over any challenge that had to do with manipulating strings. I figured this was a sign that I had a hole in my MATLAB skills, and I needed to start remedying it. So, I started on a quest to learn more about regular expressions. The more I learn and play with them, the more I am impressed with just how powerful they are.

If you are new to regular expressions, I hope this blog post will inspire you to start embracing them as well. If you are an experienced regular expression user, hopefully you will enjoy some of my examples and find my newly budding excitement about regular expressions amusing. You might want to check out this guest post made by one of our developers, Jason Breslau, which discusses the differences between the Perl and MATLAB implementations of regular expressions. Also, please consider posting your favorite examples in the comments at the end of the post.

The Basics

Regular expressions are a way to describe a pattern within text. With regular expressions, you can match or alter parts (substrings) of a text string that match the described pattern. Regular expressions are found in text editors and in a range of languages including Perl, Java, Ruby, and of course, MATLAB.

In this post, I am going to focus on the function regexp. There are several other regular expression related functions in MATLAB, so I encourage you to read more about them as well.

In MATLAB, the calling syntax for regexp that we will be using is:

[selected_outputs] = regexp(string,expr,outselect)

string is the string or the cell array of strings that I want to search for the pattern. expr is the regular expression that specifies the pattern I want to match. outselect specifies the output I want from the function, including such options as the location of the start or end of the substring that matches the expression, and the text of the substring of the input string that matches the pattern. All the possible output options are explained in more detail in the documentation.
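As a minimal sketch of how the outselect argument changes what regexp returns, the snippet below searches a made-up string (not from my data) for a literal pattern and asks for the start indices, the end indices, and the matched text in turn. The variable names are my own choices for illustration.

```matlab
% Hypothetical input string chosen only for illustration
str  = 'MATLAB and more MATLAB';
expr = 'MATLAB';                         % a literal pattern with no special characters

startIdx = regexp(str, expr, 'start');   % index of the first character of each match
endIdx   = regexp(str, expr, 'end');     % index of the last character of each match
matches  = regexp(str, expr, 'match');   % cell array holding the text of each match

disp(startIdx)   % 1 17
disp(endIdx)     % 6 22
disp(matches)
```

The same expression is used in all three calls; only the output option differs.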

Enough background, let's look at three examples of places where I have been using regular expressions lately.

Example #1 - Splitting a String into Separate Words

This example is so deceptively easy - but is so totally useful, I just had to include it. Have you ever wanted to split a string into separate words? This does it for you in one easy step.

Let's go through the basic syntax. First, I need to define the expression to specify the pattern I want to match in the string. In this case, I want to find the spaces - so I choose '\s' which represents any white space character. It is what is known as a "character type", and it represents either a specific set of characters or a certain type of character. The documentation has a list of possible character types available to use with regexp.

Then I have to decide what I want as outputs from regexp. In this case, I pick split, which indicates that I want to split the string into parts determined by the substring that matches the expression (i.e., break up the string into substrings based on where there is a space).

mystring = 'My name is Sarah Wait Zaranek';
splitstring = regexp(mystring,'\s','split');
disp(splitstring)

    'My'    'name'    'is'    'Sarah'    'Wait'    'Zaranek'

The initial string is now broken up into a cell array containing all the separate words in the sentence.
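One thing to watch for, shown here as a small sketch with a made-up input: '\s' matches a single white space character, so a string with runs of spaces produces empty strings in the result. Adding the + quantifier (one or more times) treats each run of white space as a single separator.

```matlab
% Hypothetical input with repeated spaces between some words
spaced = 'My  name   is Sarah';

% One split per space, so empty strings appear between adjacent spaces
parts1 = regexp(spaced, '\s', 'split');

% Each run of white space counts as a single separator
parts2 = regexp(spaced, '\s+', 'split');

disp(numel(parts1))   % 7 elements, three of them empty
disp(parts2)
```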

I could do a similar thing if I wanted to break up a string based on sentences. In this case, I want to split at a space immediately preceded by a !, ., or ?. This is slightly more complicated, and I might want to use a lookaround operator.

First, I have to figure out how to let regexp know that I want to match from a set of possible characters; I do this by enclosing the possible matches in square brackets. [!.?] means match any one of the characters listed.

Second, I indicate that I want to match a space preceded by any of those characters. To do so I use a lookaround operator. Lookaround operators let you look around a current position in a string. For instance, to look ahead (to the right) of a position to test if a particular expression is found, you use the lookahead operator (?=expr). In this case, I use a lookbehind operator (?<=expr) which allows me to look behind a current position to test if an expression is found. In particular, I am looking for matches where I find a space and, when I "look behind" the space (to the left of it), I find a !, ., or ?. I, again, can use the split output option to split by the matched substring.

mystring = 'My name is Sarah. I love MATLAB! Do you?';
splitstring = regexp(mystring,'(?<=[!.?])\s','split');
disp(splitstring)

    'My name is Sarah.'    'I love MATLAB!'    'Do you?'
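To see why the lookbehind matters, here is a small sketch comparing it with a plain character set, plus a lookahead for contrast. The second pattern, '\w+(?=!)', is my own illustration and not part of the original task: \w+ matches a run of word characters, and (?=!) requires the next character to be a ! without including it in the match.

```matlab
mystring = 'My name is Sarah. I love MATLAB! Do you?';

% Without the lookbehind, the punctuation is part of the match,
% so 'split' throws it away along with the space.
noLookbehind = regexp(mystring, '[!.?]\s', 'split');
disp(noLookbehind)   % 'My name is Sarah'  'I love MATLAB'  'Do you?'

% Lookahead: match a word only when it is followed by '!'.
beforeBang = regexp(mystring, '\w+(?=!)', 'match');
disp(beforeBang)     % 'MATLAB'
```

Lookaround operators test the text next to the match but do not consume it, which is why the sentences keep their punctuation in the lookbehind version.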

Example #2 - Creating Short Labels for a Plot

In this example, I was working on a problem that involved data from several locations in California. I wanted to plot data from several locations on the same plot and label each set of data accordingly. However, since the city names were long and there was a lot of data in my plot, I wanted to create abbreviated city names to do my labeling.

My first attempt used the commands listed below. First, I want to find the locations of the capital letters in the cell array of strings. To do this, I can use the [A-Z] character range operator, which allows me to specify any character within the range of capital A to capital Z (aka any capital letter). The default output of regexp with a single output variable is the position of the start of each matched substring, and I use that here. I can then use these locations to create my abbreviated city names by taking each capital letter and the one letter to the right of it.

regexp returns a 1 x 6 cell array, each element holding the location of the capital letters for the corresponding input strings.

To extract the capital letters and the letters next to them, I use cellfun to operate on each element of the output indices and input string. I use sort to sort the indices into monotonically increasing order. This method assumes I have no capital letters in a row.

locationNames = {'Bennett Valley' , 'Bishop' , 'Camino', 'Santa Rosa', ...
                   'U.C. Riverside', 'Windsor'};

idx = regexp(locationNames, '[A-Z]');

shortLabels = cellfun(@(label,idx) label(sort([idx idx+1])),...
    locationNames,idx,'UniformOutput',false);

disp(shortLabels)

    'BeVa'    'Bi'    'Ca'    'SaRo'    'U.C.Ri'    'Wi'

When I learned more about regular expressions, I discovered a new and cleaner way to accomplish the same task. I can extend the pattern to be any capital letter followed by any character. A dot (.) is used to represent any single character.

Since this actually matches the substrings I am interested in extracting, I don't need to output the indices. Instead, I just indicate I want the matched substrings by specifying 'match' as my output option. I then concatenate the matched substrings from each city name into a single abbreviation for that city by using cellfun.

shortLabels2 = regexp(locationNames, '[A-Z].', 'match');

shortLabelsFinal = cellfun(@(x) [x{:}], shortLabels2, 'UniformOutput',false);
disp(shortLabelsFinal)

    'BeVa'    'Bi'    'Ca'    'SaRo'    'U.C.Ri'    'Wi'
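The two methods agree on my data, but they differ when capital letters sit next to each other, which is the assumption I mentioned for the index-based version. Here is a rough sketch using a hypothetical name, 'UC Davis', that is not in my original list.

```matlab
name = 'UC Davis';   % hypothetical name, not part of the original data set

% Index-based method: each capital plus the character to its right
idx     = regexp(name, '[A-Z]');          % [1 2 4]
byIndex = name(sort([idx idx+1]));        % index 2 appears twice, so 'C' is duplicated

% Pattern-based method: matches never overlap, so nothing is duplicated
pieces  = regexp(name, '[A-Z].', 'match');
byMatch = [pieces{:}];

disp(byIndex)   % UCC Da
disp(byMatch)   % UCDa
```

Because regexp resumes searching after the end of each match, the second capital in 'UC' is consumed as the "any character" part of the first match instead of starting a match of its own.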

Example #3 - Finding Data Marked by Repeat Letters

In this case, I am working with text strings that look something like 'CC=0/CT=1/TT=5375'. I wanted to extract the numeric values that follow a repeated letter. This is actually genetic data, and in this particular string I want to extract the number of people associated with either the CC or TT genotype.

There are a couple of ways to approach this. Since there are only a few letters that could be present (A,G,C,T), I could use | as an alternative match operator. Unlike using [] as above in Example 1, | allows me to combine multiple expressions as the possible alternatives to match. If you enclose these alternatives in square brackets, [], regexp will treat each character, including the |, as part of the list of possible single characters to match, so be sure to enclose them in parentheses, (), instead.

This difference is explained nicely in Friedl's Mastering Regular Expressions. Note: he uses character class to refer to using []. He states, "Don't confuse the alternation with a character class...A character class can match exactly one character, and that's true no matter how long or short the specified list of acceptable characters might be. Alternation, on the other hand, can have arbitrarily long alternatives, each textually unrelated to the other".
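A quick sketch of that difference on my gene string, using just two of the alternatives to keep it short:

```matlab
geneString = 'CC=0/CT=1/TT=5375';

% Square brackets: any ONE of the characters C, |, or T, followed by '='
asClass = regexp(geneString, '[CC|TT]=', 'match');
disp(asClass)         % 'C='  'T='  'T='

% Parentheses with |: either the two-character text CC or TT, followed by '='
asAlternation = regexp(geneString, '(CC|TT)=', 'match');
disp(asAlternation)   % 'CC='  'TT='
```

Note that the bracketed version also picks up the T= from CT=1, which is exactly the entry I want to skip.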

I first create an expression which represents the possible double letters followed by an equals sign, (CC|TT|AA|GG)=. Then, I place this expression in a lookbehind operator because I want to find the numbers immediately preceded by this pattern, (?<=(CC|TT|AA|GG)=).

I use \d+ to define the number I want to match. This is made up of the metacharacter \d which represents any numeric digit and the + which is a quantifier. Quantifiers are used to match consecutive occurrences of a pattern within a string. In this case, + means match the pattern one or more times.
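To make the effect of the quantifier concrete, here is a small sketch on a fragment of my gene string, with and without the +:

```matlab
% \d alone matches one digit at a time
oneDigit = regexp('TT=5375', '\d', 'match');
disp(oneDigit)   % '5'  '3'  '7'  '5'

% \d+ matches each run of consecutive digits as a single substring
digitRun = regexp('TT=5375', '\d+', 'match');
disp(digitRun)   % '5375'
```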

When these pieces are put together, they make an expression that represents a number that is preceded by CC, TT, AA, or GG and an equals sign, (?<=(CC|TT|AA|GG)=)\d+.

geneString = 'CC=0/CT=1/TT=5375';

doubleValues1 = regexp(geneString,'(?<=(CC|TT|AA|GG)=)\d+','match');
disp(doubleValues1)

    '0'    '5375'

Alternatively, I could use tokens to find where letters were repeated. In this case, since I have a relatively short list of possible repeated letters, tokens might be a bit heavy-handed. However, tokens can help me find any repeated letter and are useful if I don't want to write out all possible double letter combinations.

Parentheses allow you to group multiple characters and designate matched expressions found as tokens. Tokens allow you to remember matched elements and allow you to match other parts of the string with these captured elements. Using \N, I can reference the Nth matched token in my expression.
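Before applying this to the gene string, here is a minimal sketch of a token and a backreference on a made-up word. The word 'bookkeeper' is just an illustration; \w is the character type for a word character.

```matlab
word = 'bookkeeper';   % hypothetical input chosen for its doubled letters

% (\w) captures one word character as token 1; \1 requires the same character again
doubled = regexp(word, '(\w)\1', 'match');
disp(doubled)          % 'oo'  'kk'  'ee'

% With 'tokens', each match returns a cell holding only the captured character
firstLetters = regexp(word, '(\w)\1', 'tokens');
celldisp(firstLetters)
```

The 'tokens' output is a cell array with one cell per match, and each of those cells holds the tokens captured for that match.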

In this case, I want to find the letters in the string and see if the next letter matches the previous letter found. I do this by using \w, which matches any letter, digit, or underscore, and grouping it in parentheses to make it a token. Then, I reference that token with \1 to indicate that I want to find instances where the matched character is repeated. I follow that with an = and a \d+ as before. To capture the numeric value, I also enclose \d+ in parentheses so that it becomes a second token. By choosing tokens as the output option, I get just the matched tokens and not the whole matched string. I can then use cellfun to extract the 2nd token (the numeric values) from the output.

doubleValues2 = regexp(geneString,'(\w)\1=(\d+)','tokens');
celldisp(doubleValues2);

doubleValues2{1}{1} =
C
doubleValues2{1}{2} =
0
doubleValues2{2}{1} =
T
doubleValues2{2}{2} =
5375

doubleValues2 = cellfun(@(x) x{2}, doubleValues2,'UniformOutput',false);
disp(doubleValues2);

    '0'    '5375'
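The values returned so far are still strings. If I want actual numbers to compute with, one possible follow-up (a sketch, not the only way to do it) is to convert the second token of each match with str2double. The variable names tokens and counts are my own for this illustration.

```matlab
geneString = 'CC=0/CT=1/TT=5375';

% Each match yields a 1-by-2 cell: {repeated letter, digits}
tokens = regexp(geneString, '(\w)\1=(\d+)', 'tokens');

% Convert the second token of each match from text to a number
counts = cellfun(@(x) str2double(x{2}), tokens);

disp(counts)   % 0 5375
```

Since str2double returns a scalar for each token, cellfun can collect the results into a regular numeric array and 'UniformOutput' does not need to be set to false.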

Conclusion

I hope you enjoyed this post on regular expressions. This is only the tip of the iceberg, and there is much more that regular expressions can do. Check out the documentation for more examples, and have fun!

If you are currently using regular expressions, do you have any advice for those new to using regular expressions? If you are new to using regular expressions, do you have any questions on getting started? Let me know by leaving a comment for this post.

Published with MATLAB® R2012b