{"id":557,"date":"2012-10-18T12:46:42","date_gmt":"2012-10-18T17:46:42","guid":{"rendered":"https:\/\/blogs.mathworks.com\/loren\/?p=557"},"modified":"2012-10-01T12:48:22","modified_gmt":"2012-10-01T17:48:22","slug":"learning-to-love-regular-expressions","status":"publish","type":"post","link":"https:\/\/blogs.mathworks.com\/loren\/2012\/10\/18\/learning-to-love-regular-expressions\/","title":{"rendered":"Learning to Love Regular Expressions"},"content":{"rendered":"<!DOCTYPE html\r\n  PUBLIC \"-\/\/W3C\/\/DTD HTML 4.01 Transitional\/\/EN\">\r\n<style type=\"text\/css\">\r\n\r\nh1 { font-size:18pt; }\r\nh2.titlebg { font-size:13pt; }\r\nh3 { color:#4A4F55; padding:0px; margin:5px 0px 5px; font-family:Arial, Helvetica, sans-serif; font-size:11pt; font-weight:bold; line-height:140%; border-bottom:1px solid #d6d4d4; display:block; }\r\nh4 { color:#4A4F55; padding:0px; margin:0px 0px 5px; font-family:Arial, Helvetica, sans-serif; font-size:10pt; font-weight:bold; line-height:140%; border-bottom:1px solid #d6d4d4; display:block; }\r\n   \r\np { padding:0px; margin:0px 0px 20px; }\r\nimg { padding:0px; margin:0px 0px 20px; border:none; }\r\np img, pre img, tt img, li img { margin-bottom:0px; } \r\n\r\nul { padding:0px; margin:0px 0px 20px 23px; list-style:square; }\r\nul li { padding:0px; margin:0px 0px 7px 0px; background:none; }\r\nul li ul { padding:5px 0px 0px; margin:0px 0px 7px 23px; }\r\nul li ol li { list-style:decimal; }\r\nol { padding:0px; margin:0px 0px 20px 0px; list-style:decimal; }\r\nol li { padding:0px; margin:0px 0px 7px 23px; list-style-type:decimal; }\r\nol li ol { padding:5px 0px 0px; margin:0px 0px 7px 0px; }\r\nol li ol li { list-style-type:lower-alpha; }\r\nol li ul { padding-top:7px; }\r\nol li ul li { list-style:square; }\r\n\r\npre, tt, code { font-size:12px; }\r\npre { margin:0px 0px 20px; }\r\npre.error { color:red; }\r\npre.codeinput { padding:10px; border:1px solid #d3d3d3; background:#f7f7f7; }\r\npre.codeoutput { padding:10px 11px; 
margin:0px 0px 20px; color:#4c4c4c; }\r\n\r\n@media print { pre.codeinput, pre.codeoutput { word-wrap:break-word; width:100%; } }\r\n\r\nspan.keyword { color:#0000FF }\r\nspan.comment { color:#228B22 }\r\nspan.string { color:#A020F0 }\r\nspan.untermstring { color:#B20000 }\r\nspan.syscmd { color:#B28C00 }\r\n\r\n.footer { width:auto; padding:10px 0px; margin:25px 0px 0px; border-top:1px dotted #878787; font-size:0.8em; line-height:140%; font-style:italic; color:#878787; text-align:left; float:none; }\r\n.footer p { margin:0px; }\r\n\r\n  <\/style><div class=\"content\"><!--introduction--><p>Today I&#8217;d like to introduce guest blogger <a href=\"mailto:sarah.zaranek@mathworks.com\">Sarah Wait Zaranek<\/a> who works for the MATLAB Marketing team here at MathWorks. Sarah previously has <a href=\"https:\/\/blogs.mathworks.com\/loren\/2012\/02\/06\/using-gpus-in-matlab\/\">written<\/a> about using GPUs in MATLAB. Sarah will be discussing how she got started using regular expressions.<\/p><!--\/introduction--><h3>Contents<\/h3><div><ul><li><a href=\"#191c5ebc-2786-4e8a-a574-1032c9c4e871\">Overview<\/a><\/li><li><a href=\"#744cb18f-6ecd-4a1f-8fff-769d379aeae1\">The Basics<\/a><\/li><li><a href=\"#89c29ef6-d7ec-4f57-a80c-702aec223965\">Example #1 - Splitting a String into Separate Words<\/a><\/li><li><a href=\"#97ef1ce0-20e4-42e4-8135-56e29458dc2e\">Example #2 - Creating Short Labels for a Plot<\/a><\/li><li><a href=\"#fd114ae2-2e1a-4bf3-8298-0aa2d4f6c2de\">Example #3 - Finding Data Marked by Repeat Letters<\/a><\/li><li><a href=\"#9e8db6bd-4517-47fe-b8bb-25279826fc1b\">Conclusion<\/a><\/li><\/ul><\/div><h4>Overview<a name=\"191c5ebc-2786-4e8a-a574-1032c9c4e871\"><\/a><\/h4><p>Over the past few years, I have had the honor of doing several guest blog posts for Loren. Usually, I am blogging about something that I know quite well - but this time it is different.  
I wanted to write about regular expressions, talk a little about how I am starting to use them, and show some examples that I created along the way.  My background is in computational geophysics, so I am pretty comfortable with numbers, parallel computing and a whole bunch of other MATLAB stuff. But I never really had to manipulate strings.  In my limited work with strings, I found that functions like <tt>strfind<\/tt> were enough for me to get the proverbial job done.<\/p><p>Well, then I found <a href=\"https:\/\/www.mathworks.com\/matlabcentral\/cody\/\">Cody<\/a>.  If you haven't started using Cody, you might not understand how addictive it can be!  All those cool little coding puzzles in MATLAB, I just couldn't stop.  However, I found myself pretty consistently skipping over any challenge that had to do with manipulating strings. I figured this was a sign that I had a hole in my MATLAB skills, and I needed to start remedying it.  So, I started on a quest to learn more about regular expressions. The more I learn and play with them, the more I am impressed with just how powerful they are.<\/p><p>If you are new to regular expressions, I hope this blog post will inspire you to start embracing them as well.  If you are an experienced regular expression user, hopefully you will enjoy some of my examples and find my newly budding excitement about regular expressions amusing. You might want to check out <a href=\"https:\/\/blogs.mathworks.com\/loren\/2006\/04\/05\/regexp-how-tos\/\">this<\/a> guest post by one of our developers, Jason Breslau, which discusses the differences between the Perl and MATLAB implementations of regular expressions. Also, please consider posting your favorite examples in the comments at the end of the post.<\/p><h4>The Basics<a name=\"744cb18f-6ecd-4a1f-8fff-769d379aeae1\"><\/a><\/h4><p>Regular expressions are a way to describe a pattern within text. 
With regular expressions, you can match or alter parts (substrings) of a text string that match the described pattern. Regular expressions are found in text editors and in a range of languages including Perl, Java, Ruby, and of course, MATLAB.<\/p><p>In this post, I am going to focus on the function <tt>regexp<\/tt>.  There are several other regular expression related functions in MATLAB, so I encourage you to read more about <a href=\"https:\/\/www.mathworks.com\/help\/matlab\/matlab_prog\/regular-expressions.html#bsgto96-1\">them<\/a> as well.<\/p><p>In MATLAB, the calling syntax for <tt>regexp<\/tt> that we will be using is:<\/p><p><tt>[selected_outputs] = regexp(string,expr,outselect)<\/tt>.<\/p><p><tt>string<\/tt> is the string or the cell array of strings that I want to search for the pattern.  <tt>expr<\/tt> is the regular expression that specifies the pattern I want to match. <tt>outselect<\/tt> specifies the output I want from the function, including such options as the location of the start or end of the substring that matches the expression, and the text of the substring of the input string that matches the pattern. All the possible output options are explained in more detail in the <a href=\"https:\/\/www.mathworks.com\/help\/matlab\/ref\/regexp.html#bsyicm1-5\">documentation<\/a>.<\/p><p>Enough background, let's look at three examples of places where I have been using regular expressions lately.<\/p><h4>Example #1 - Splitting a String into Separate Words<a name=\"89c29ef6-d7ec-4f57-a80c-702aec223965\"><\/a><\/h4><p>This example is so deceptively easy - but is so totally useful, I just had to include it. Have you ever wanted to split a string into separate words? This does it for you in one easy step.<\/p><p>Let's go through the basic syntax. First, I need to define the expression to specify the pattern I want to match in the string. In this case, I want to find the spaces - so I choose <tt>'\\s'<\/tt> which represents any white space character.  
It is what is known as a \"character type\", and it represents either a specific set of characters or a certain type of character.  The documentation has a <a href=\"https:\/\/www.mathworks.com\/help\/matlab\/matlab_prog\/regular-expressions.html#f0-42723\">list<\/a> of possible character types available to use with <tt>regexp<\/tt>.<\/p><p>Then I have to decide what I want as outputs from <tt>regexp<\/tt>. In this case, I pick <tt>split<\/tt>, which indicates that I want to split the string into parts determined by the substring that matches the expression (i.e., break up the string into substrings based on where there is a space).<\/p><pre class=\"codeinput\">mystring = <span class=\"string\">'My name is Sarah Wait Zaranek'<\/span>;\r\nsplitstring = regexp(mystring,<span class=\"string\">'\\s'<\/span>,<span class=\"string\">'split'<\/span>);\r\ndisp(splitstring)\r\n<\/pre><pre class=\"codeoutput\">    'My'    'name'    'is'    'Sarah'    'Wait'    'Zaranek'\r\n<\/pre><p>The initial string is now broken up into a cell array containing all the separate words in the sentence.<\/p><p>I could do a similar thing if I wanted to break up a string based on sentences. In this case, I want to split at a space immediately preceded by a !, ., or ?. This is slightly more complicated, and I might want to use a <a href=\"https:\/\/www.mathworks.com\/help\/matlab\/matlab_prog\/regular-expressions.html#brchk1t\">lookaround<\/a> operator.<\/p><p>First, I have to figure out how to let <tt>regexp<\/tt> know that I want to match from a set of possible characters; I do this by enclosing the possible matches in square brackets. <tt>[!.?]<\/tt> means match any one of the characters listed.<\/p><p>Second, I indicate that I want to match a space preceded by any of those characters. To do so I use a lookaround operator.  Lookaround operators let you look around a current position in a string. 
For instance, to look ahead of (to the right of) a position to test if a particular expression is found, you use the lookahead operator <tt>(?=expr)<\/tt>. In this case, I use a lookbehind operator <tt>(?&lt;=expr)<\/tt> which allows me to look behind a current position to test if an expression is found. In particular, I am looking for matches where I find a space and when I \"look behind\" the space (to the left of it) I find a !, ., or ?.  Again, I can use the <tt>split<\/tt> output option to split by the matched substring.<\/p><pre class=\"codeinput\">mystring = <span class=\"string\">'My name is Sarah. I love MATLAB! Do you?'<\/span>;\r\nsplitstring = regexp(mystring,<span class=\"string\">'(?&lt;=[!.?])\\s'<\/span>,<span class=\"string\">'split'<\/span>);\r\ndisp(splitstring)\r\n<\/pre><pre class=\"codeoutput\">    'My name is Sarah.'    'I love MATLAB!'    'Do you?'\r\n<\/pre><h4>Example #2 - Creating Short Labels for a Plot<a name=\"97ef1ce0-20e4-42e4-8135-56e29458dc2e\"><\/a><\/h4><p>In this example, I was working on a problem that involved data from several locations in California. I wanted to plot data from several locations on the same plot and label each set of data accordingly. However, since the city names were long and there was a lot of data in my plot, I wanted to create abbreviated city names to do my labeling.<\/p><p>My first attempt was the command listed below. First, I want to find the locations of the capital letters in the cell array of strings. To do this, I can use the <tt>[A-Z]<\/tt> character range operator, which allows me to specify any character within the range of capital A to capital Z (i.e., any capital letter). The default output of <tt>regexp<\/tt> with a single output variable is to give me the start position of each matched substring, and I use that here. 
I can then use these locations to create my abbreviated city names by taking the capital letter and one letter to the right of it to create my abbreviation.<\/p><p><tt>regexp<\/tt> returns a 1 x 6 cell array, each element holding the location of the capital letters for the corresponding input strings.<\/p><p>To extract the capital letters and the letters next to them, I use <tt>cellfun<\/tt> to operate on each element of the output indices and input string. I use <tt>sort<\/tt> to sort the indices into monotonically increasing order. This method assumes I have no capital letters in a row.<\/p><pre class=\"codeinput\">locationNames = {<span class=\"string\">'Bennett Valley'<\/span> , <span class=\"string\">'Bishop'<\/span> , <span class=\"string\">'Camino'<\/span>, <span class=\"string\">'Santa Rosa'<\/span>, <span class=\"keyword\">...<\/span>\r\n                   <span class=\"string\">'U.C. Riverside'<\/span>, <span class=\"string\">'Windsor'<\/span>};\r\n\r\nidx = regexp(locationNames, <span class=\"string\">'[A-Z]'<\/span>);\r\n\r\nshortLabels = cellfun(@(label,idx) label(sort([idx idx+1])),<span class=\"keyword\">...<\/span>\r\n    locationNames,idx,<span class=\"string\">'UniformOutput'<\/span>,false);\r\n\r\ndisp(shortLabels)\r\n<\/pre><pre class=\"codeoutput\">    'BeVa'    'Bi'    'Ca'    'SaRo'    'U.C.Ri'    'Wi'\r\n<\/pre><p>When I learned more about regular expressions, I discovered a new and cleaner way to accomplish the same task. I can extend the pattern to be any capital letter followed by any character.  A dot (<tt>.<\/tt>) is used to represent any single character.<\/p><p>Since this actually matches the substrings I am interested in extracting, I don't need to output the indices.  Instead, I just indicate I want the matched substrings by specifying <tt>'match'<\/tt> as my output option. 
I, then, concatenate the matched substrings from each city name into a single abbreviation for that city by using <tt>cellfun<\/tt>.<\/p><pre class=\"codeinput\">shortLabels2 = regexp(locationNames, <span class=\"string\">'[A-Z].'<\/span>, <span class=\"string\">'match'<\/span>);\r\n\r\nshortLabelsFinal = cellfun(@(x) [x{:}], shortLabels2, <span class=\"string\">'UniformOutput'<\/span>,false);\r\ndisp(shortLabelsFinal)\r\n<\/pre><pre class=\"codeoutput\">    'BeVa'    'Bi'    'Ca'    'SaRo'    'U.C.Ri'    'Wi'\r\n<\/pre><h4>Example #3 - Finding Data Marked by Repeat Letters<a name=\"fd114ae2-2e1a-4bf3-8298-0aa2d4f6c2de\"><\/a><\/h4><p>In this case, I am working with text strings that look something like - 'CC=0\/CT=1\/TT=5375'. I wanted to extract out the numeric values that follow a repeated letter.  This is actually genetic data, and in this particular string I want to extract the number of people associated with either the CC or TT <a href=\"http:\/\/en.wikipedia.org\/wiki\/Genotype\">genotype<\/a>.<\/p><p>There are a couple ways to approach this. Since there are only a few letters that could be present (A,G,C,T), I could use | as an alternative match operator.  Unlike using <tt>[]<\/tt> as above in Example 1, | allows me to combine multiple expressions as the possible alternatives to match. If you enclose these alternatives with the square brackets, <tt>[]<\/tt>, it will take each character including the | as the list of possible characters to match, so be sure to enclose with parentheses, <tt>()<\/tt>.<\/p><p>This difference is explained nicely in Friedl's <i>Mastering Regular Expressions<\/i>. Note: he uses character class to refer to using <tt>[]<\/tt>. He states, \"Don't confuse the alternation with a character class...A character class can match exactly one character, and that's true no matter how long or short the specified list of acceptable characters might be. 
Alternation, on the other hand, can have arbitrarily long alternatives, each textually unrelated to the other\".<\/p><p>I first create an expression which represents the possible double letters followed by an equals sign, <tt>(CC|TT|AA|GG)=<\/tt>.  Then, I place this expression in a lookbehind operator because I want to find the numbers immediately preceded by this pattern, <tt>(?&lt;=(CC|TT|AA|GG)=)<\/tt>.<\/p><p>I use <tt>\\d+<\/tt> to define the number I want to match. This is made up of the metacharacter <tt>\\d<\/tt> which represents any numeric digit and the <tt>+<\/tt> which is a quantifier. <a href=\"https:\/\/www.mathworks.com\/help\/matlab\/matlab_prog\/regular-expressions.html#f0-43073\">Quantifiers<\/a> are used to match consecutive occurrences of a pattern within a string. In this case, <tt>+<\/tt> means match the pattern one or more times.<\/p><p>When these pieces are put together, they make an expression that represents a number that is preceded by CC, TT, AA, or GG and an equals sign, <tt>(?&lt;=(CC|TT|AA|GG)=)\\d+<\/tt>.<\/p><pre class=\"codeinput\">geneString = <span class=\"string\">'CC=0\/CT=1\/TT=5375'<\/span>;\r\n\r\ndoubleValues1 = regexp(geneString,<span class=\"string\">'(?&lt;=(CC|TT|AA|GG)=)\\d+'<\/span>,<span class=\"string\">'match'<\/span>);\r\ndisp(doubleValues1)\r\n<\/pre><pre class=\"codeoutput\">    '0'    '5375'\r\n<\/pre><p>Alternatively, I could use <a href=\"https:\/\/www.mathworks.com\/help\/matlab\/matlab_prog\/regular-expressions.html#f0-56360\">tokens<\/a> to find where letters were repeated. In this case, since I have a relatively short list of possible repeated letters, tokens might be a bit heavy-handed. However, tokens can help me find any repeated letter and are useful if I don't want to write out all possible double letter combinations.<\/p><p>Parentheses allow you to group multiple characters and designate matched expressions found as tokens. 
Tokens allow you to remember matched elements and allow you to match other parts of the string with these captured elements.  Using <tt>\\N<\/tt>, I can reference the Nth matched token in my expression.<\/p><p>In this case, I want to find the letters in the string and see if the next letter matches the previous letter found. I do this by using <tt>\\w<\/tt> (which matches any letter, digit, or underscore) to find a letter and grouping it in parentheses to make it a token. Then, I reference that token with <tt>\\1<\/tt> to indicate that I want to find instances where the matched letter was repeated. I follow that by an <tt>=<\/tt> and a <tt>\\d+<\/tt> as before. To capture the numeric value as output, I enclose <tt>\\d+<\/tt> in parentheses so that it becomes a token as well.  By choosing tokens as the output option, I can just get the matched tokens and not the whole matched string. I can then use <tt>cellfun<\/tt> to extract the second token (the numeric value) from each match in the output.<\/p><pre class=\"codeinput\">doubleValues2 = regexp(geneString,<span class=\"string\">'(\\w)\\1=(\\d+)'<\/span>,<span class=\"string\">'tokens'<\/span>);\r\ncelldisp(doubleValues2);\r\n<\/pre><pre class=\"codeoutput\">doubleValues2{1}{1} =\r\nC\r\ndoubleValues2{1}{2} =\r\n0\r\ndoubleValues2{2}{1} =\r\nT\r\ndoubleValues2{2}{2} =\r\n5375\r\n<\/pre><pre class=\"codeinput\">doubleValues2 = cellfun(@(x) x{2}, doubleValues2,<span class=\"string\">'UniformOutput'<\/span>,false);\r\ndisp(doubleValues2);\r\n<\/pre><pre class=\"codeoutput\">    '0'    '5375'\r\n<\/pre><h4>Conclusion<a name=\"9e8db6bd-4517-47fe-b8bb-25279826fc1b\"><\/a><\/h4><p>I hope you enjoyed this post on regular expressions.  This is only the tip of the iceberg, and there is much more that regular expressions can do.  
Check out the <a href=\"https:\/\/www.mathworks.com\/help\/matlab\/matlab_prog\/regular-expressions.html\">documentation<\/a> for more examples, and have fun!<\/p><p>If you are currently using regular expressions, do you have any advice for those new to using regular expressions?  If you are new to using regular expressions, do you have any questions on getting started? Let me know by leaving a comment for <a href=\"https:\/\/blogs.mathworks.com\/loren\/?=p=557#respond\">this post<\/a>.<\/p><script language=\"JavaScript\"> <!-- \r\n    function grabCode_606058981b45441eaad3613d86b90625() {\r\n        \/\/ Remember the title so we can use it in the new page\r\n        title = document.title;\r\n\r\n        \/\/ Break up these strings so that their presence\r\n        \/\/ in the Javascript doesn't mess up the search for\r\n        \/\/ the MATLAB code.\r\n        t1='606058981b45441eaad3613d86b90625 ' + '##### ' + 'SOURCE BEGIN' + ' #####';\r\n        t2='##### ' + 'SOURCE END' + ' #####' + ' 606058981b45441eaad3613d86b90625';\r\n    \r\n        b=document.getElementsByTagName('body')[0];\r\n        i1=b.innerHTML.indexOf(t1)+t1.length;\r\n        i2=b.innerHTML.indexOf(t2);\r\n \r\n        code_string = b.innerHTML.substring(i1, i2);\r\n        code_string = code_string.replace(\/REPLACE_WITH_DASH_DASH\/g,'--');\r\n\r\n        \/\/ Use \/x3C\/g instead of the less-than character to avoid errors \r\n        \/\/ in the XML parser.\r\n        \/\/ Use '\\x26#60;' instead of '<' so that the XML parser\r\n        \/\/ doesn't go ahead and substitute the less-than character. 
\r\n        code_string = code_string.replace(\/\\x3C\/g, '\\x26#60;');\r\n\r\n        copyright = 'Copyright 2012 The MathWorks, Inc.';\r\n\r\n        w = window.open();\r\n        d = w.document;\r\n        d.write('<pre>\\n');\r\n        d.write(code_string);\r\n\r\n        \/\/ Add copyright line at the bottom if specified.\r\n        if (copyright.length > 0) {\r\n            d.writeln('');\r\n            d.writeln('%%');\r\n            if (copyright.length > 0) {\r\n                d.writeln('% _' + copyright + '_');\r\n            }\r\n        }\r\n\r\n        d.write('<\/pre>\\n');\r\n\r\n        d.title = title + ' (MATLAB code)';\r\n        d.close();\r\n    }   \r\n     --> <\/script><p style=\"text-align: right; font-size: xx-small; font-weight:lighter;   font-style: italic; color: gray\"><br><a href=\"javascript:grabCode_606058981b45441eaad3613d86b90625()\"><span style=\"font-size: x-small;        font-style: italic;\">Get \r\n      the MATLAB code <noscript>(requires JavaScript)<\/noscript><\/span><\/a><br><br>\r\n      Published with MATLAB&reg; R2012b<br><\/p><p class=\"footer\"><br>\r\n      Published with MATLAB&reg; R2012b<br><\/p><\/div><!--\r\n606058981b45441eaad3613d86b90625 ##### SOURCE BEGIN #####\r\n%% Learning to Love Regular Expressions \r\n% Today I'd like to introduce guest blogger\r\n% <mailto:sarah.zaranek@mathworks.com Sarah Wait Zaranek> who works for the\r\n% MATLAB Marketing team here at MathWorks. Sarah previously has\r\n% <https:\/\/blogs.mathworks.com\/loren\/2012\/02\/06\/using-gpus-in-matlab\/ written>\r\n% about using GPUs in MATLAB. Sarah will be discussing how she got started\r\n% using regular expressions.\r\n\r\n%% Overview  \r\n% Over the past few years, I have had the honor of doing several guest blog\r\n% posts for Loren. Usually, I am blogging about something that I know quite\r\n% well - but this time it is different.  
I wanted to write about regular\r\n% expressions, talk a little about how I am starting to use them, and show\r\n% some examples that I created along the way.  My background is in\r\n% computational geophysics, so I am pretty comfortable with numbers, parallel\r\n% computing and a whole bunch of other MATLAB stuff. But I never really had\r\n% to manipulate strings.  In my limited work with strings, I found\r\n% that functions like |strfind| were enough for me to get the proverbial\r\n% job done.\r\n%\r\n% Well, then I found <https:\/\/www.mathworks.com\/matlabcentral\/cody\/\r\n% Cody>.  If you haven't started using Cody, you might not understand how\r\n% addictive it can be!  All those cool little coding puzzles in MATLAB, I\r\n% just couldn't stop.  However, I found myself pretty consistently skipping\r\n% over any challenge that had to do with manipulating strings. I figured\r\n% this was a sign that I had a hole in my MATLAB skills, and I needed to\r\n% start remedying it.  So, I started on a quest to learn more about\r\n% regular expressions. The more I learn and play with them, the more I am\r\n% impressed with just how powerful they are.\r\n%\r\n% If you are new to regular expressions, I hope this blog post will inspire\r\n% you to start embracing them as well.  If you are an experienced regular\r\n% expression user, hopefully you will enjoy some of my examples and find my\r\n% newly budding excitement about regular expressions amusing. You might\r\n% want to check out <https:\/\/blogs.mathworks.com\/loren\/2006\/04\/05\/regexp-how-tos\/ this> \r\n% guest post by one of our developers, Jason Breslau, which discusses \r\n% the differences between the Perl and MATLAB implementations of regular\r\n% expressions. Also, please consider posting your favorite examples in the\r\n% comments at the end of the post.\r\n\r\n%% The Basics\r\n% Regular expressions are a way to describe a pattern within text. 
With\r\n% regular expressions, you can match or alter parts (substrings) of a text\r\n% string that match the described pattern. Regular expressions are found in\r\n% text editors and in a range of languages including Perl, Java, Ruby, and\r\n% of course, MATLAB.\r\n%\r\n% In this post, I am going to focus on the function |regexp|.  There\r\n% are several other regular expression related functions in MATLAB, so I\r\n% encourage you to read more about\r\n% <https:\/\/www.mathworks.com\/help\/matlab\/matlab_prog\/regular-expressions.html#bsgto96-1 them>\r\n% as well.\r\n%\r\n% In MATLAB, the calling syntax for |regexp| that we will be using is:\r\n%\r\n% |[selected_outputs] = regexp(string,expr,outselect)|.  \r\n%\r\n% |string| is the string or the cell array of strings that I want to\r\n% search for the pattern.  |expr| is the regular expression that specifies\r\n% the pattern I want to match. |outselect| specifies the output I want\r\n% from the function, including such options as the location of the start or\r\n% end of the substring that matches the expression, and the text of the\r\n% substring of the input string that matches the pattern. All the possible\r\n% output options are explained in more detail in the\r\n% <https:\/\/www.mathworks.com\/help\/matlab\/ref\/regexp.html#bsyicm1-5\r\n% documentation>.\r\n\r\n%%\r\n% Enough background, let's look at three examples of places where I have\r\n% been using regular expressions lately.\r\n\r\n%% Example #1 - Splitting a String into Separate Words\r\n%\r\n% This example is so deceptively easy - but is so totally useful, I just\r\n% had to include it. Have you ever wanted to split a string into separate\r\n% words? This does it for you in one easy step.\r\n%\r\n% Let's go through the basic syntax. First, I need to define the expression\r\n% to specify the pattern I want to match in the string. In this case, I\r\n% want to find the spaces - so I choose |'\\s'| which represents any white\r\n% space character.  
It is what is known as a \"character type\", and it\r\n% represents either a specific set of characters or a certain type of\r\n% character.  The documentation has a\r\n% <https:\/\/www.mathworks.com\/help\/matlab\/matlab_prog\/regular-expressions.html#f0-42723 list>\r\n% of possible character types available to use with |regexp|.\r\n%\r\n% Then I have to decide what I want as outputs from |regexp|. In this case,\r\n% I pick |split|, which indicates that I want to split the string into parts\r\n% determined by the substring that matches the expression (i.e., break up the\r\n% string into substrings based on where there is a space).\r\n\r\nmystring = 'My name is Sarah Wait Zaranek';\r\nsplitstring = regexp(mystring,'\\s','split');\r\ndisp(splitstring)\r\n\r\n%%\r\n%\r\n% The initial string is now broken up into a cell array containing all the\r\n% separate words in the sentence.\r\n\r\n%%\r\n% I could do a similar thing if I wanted to break up a string based on\r\n% sentences. In this case, I want to split at a space immediately preceded\r\n% by a !, ., or ?. This is slightly more complicated, and I might want to\r\n% use a\r\n% <https:\/\/www.mathworks.com\/help\/matlab\/matlab_prog\/regular-expressions.html#brchk1t lookaround> \r\n% operator.\r\n%\r\n% First, I have to figure out how to let |regexp| know that I want to match\r\n% from a set of possible characters; I do this by enclosing the possible matches\r\n% in square brackets. |[!.?]| means match any one of the characters\r\n% listed.  \r\n%\r\n% Second, I indicate that I want to match a space preceded by any of those\r\n% characters. To do so I use a lookaround operator.  Lookaround operators\r\n% let you look around a current position in a string. For instance, to look\r\n% ahead of (to the right of) a position to test if a particular expression is\r\n% found, you use the lookahead operator |(?=expr)|. 
In this case, I use a\r\n% lookbehind operator |(?<=expr)| which allows me to look behind a current\r\n% position to test if an expression is found. In particular, I am looking\r\n% for matches where I find a space and when I \"look behind\" the space (to\r\n% the left of it) I find a !, ., or ?.  Again, I can use the |split|\r\n% output option to split by the matched substring.\r\n\r\nmystring = 'My name is Sarah. I love MATLAB! Do you?';\r\nsplitstring = regexp(mystring,'(?<=[!.?])\\s','split');\r\ndisp(splitstring)\r\n\r\n%% Example #2 - Creating Short Labels for a Plot\r\n%\r\n% In this example, I was working on a problem that involved data from\r\n% several locations in California. I wanted to plot data from several\r\n% locations on the same plot and label each set of data accordingly.\r\n% However, since the city names were long and there was a lot of data in my\r\n% plot, I wanted to create abbreviated city names to do my labeling. \r\n\r\n%% \r\n% My first attempt was the command listed below. First, I want to find\r\n% the locations of the capital letters in the cell array of strings. To do\r\n% this, I can use the |[A-Z]| character range operator, which allows me to\r\n% specify any character within the range of capital A to capital Z (i.e.,\r\n% any capital letter). The default output of |regexp| with a single output\r\n% variable is to give me the start position of each matched substring, and\r\n% I use that here. I can then use these locations to create my abbreviated\r\n% city names by taking the capital letter and one letter to the right of it\r\n% to create my abbreviation. \r\n%\r\n% |regexp| returns a 1 x 6 cell array, each element holding the location of\r\n% the capital letters for the corresponding input strings. \r\n%\r\n% To extract the capital letters and the letters next to them, I use\r\n% |cellfun| to operate on each element of the output indices and input\r\n% string. 
I use |sort| to sort the indices into monotonically increasing\r\n% order. This method assumes I have no capital letters in a row.\r\n\r\nlocationNames = {'Bennett Valley' , 'Bishop' , 'Camino', 'Santa Rosa', ...\r\n                   'U.C. Riverside', 'Windsor'};\r\n\r\nidx = regexp(locationNames, '[A-Z]');\r\n\r\nshortLabels = cellfun(@(label,idx) label(sort([idx idx+1])),...\r\n    locationNames,idx,'UniformOutput',false);\r\n\r\ndisp(shortLabels)\r\n\r\n                         \r\n%%\r\n% When I learned more about regular expressions, I discovered a\r\n% new and cleaner way to accomplish the same task. I can extend the pattern\r\n% to be any capital letter followed by any character.  A dot (|.|) is used to\r\n% represent any single character.\r\n%\r\n% Since this actually matches the substrings I am interested in extracting,\r\n% I don't need to output the indices.  Instead, I just indicate I want the\r\n% matched substrings by specifying |'match'| as my output option. I, then,\r\n% concatenate the matched substrings from each city name into a single\r\n% abbreviation for that city by using |cellfun|.\r\n\r\nshortLabels2 = regexp(locationNames, '[A-Z].', 'match');\r\n\r\nshortLabelsFinal = cellfun(@(x) [x{:}], shortLabels2, 'UniformOutput',false);\r\ndisp(shortLabelsFinal)\r\n\r\n\r\n%% Example #3 - Finding Data Marked by Repeat Letters\r\n% In this case, I am working with text strings that look something like -\r\n% 'CC=0\/CT=1\/TT=5375'. I wanted to extract out the numeric values that\r\n% follow a repeated letter.  This is actually genetic data, and in this\r\n% particular string I want to extract the number of people associated with\r\n% either the CC or TT <http:\/\/en.wikipedia.org\/wiki\/Genotype genotype>. \r\n\r\n%%\r\n% There are a couple ways to approach this. Since there are only a few\r\n% letters that could be present (A,G,C,T), I could use | as an alternative\r\n% match operator.  
Unlike using |[]| as above in Example 1, ||| allows me\r\n% to combine multiple expressions as the possible alternatives to match. If\r\n% you enclose these alternatives in square brackets, |[]|, every\r\n% character, including the |||, is treated as part of the list of possible\r\n% characters to match, so be sure to enclose them in parentheses, |()|, instead.\r\n%\r\n% This difference is explained nicely in Friedl's _Mastering Regular\r\n% Expressions_. Note: he uses character class to refer to using |[]|. \r\n% He states, \"Don't confuse the alternation with a character class...A\r\n% character class can match exactly one character, and that's true no\r\n% matter how long or short the specified list of acceptable characters\r\n% might be. Alternation, on the other hand, can have arbitrarily long\r\n% alternatives, each textually unrelated to the other\".\r\n% \r\n% I first create an expression which represents the possible double letters\r\n% followed by an equals sign, |(CC|TT|AA|GG)=|.  Then, I place this\r\n% expression in a lookbehind operator because I want to find the numbers\r\n% immediately preceded by this pattern, |(?<=(CC|TT|AA|GG)=)|.\r\n%\r\n% I use |\\d+| to define the number I want to match. This is made up of the\r\n% metacharacter |\\d|, which represents any numeric digit, and the |+|, which\r\n% is a quantifier.\r\n% <https:\/\/www.mathworks.com\/help\/matlab\/matlab_prog\/regular-expressions.html#f0-43073 Quantifiers> \r\n% are used to match consecutive occurrences of a pattern within a string.\r\n% In this case, |+| means match the pattern one or more times.\r\n%\r\n% When these pieces are put together, they make an expression that\r\n% represents a number that is preceded by one of the doubled letters CC,\r\n% TT, AA, or GG and an equals sign, |(?<=(CC|TT|AA|GG)=)\\d+|. 
\r\n\r\n%%\r\ngeneString = 'CC=0\/CT=1\/TT=5375';\r\n\r\ndoubleValues1 = regexp(geneString,'(?<=(CC|TT|AA|GG)=)\\d+','match');\r\ndisp(doubleValues1)\r\n\r\n%%\r\n% Alternatively, I could use <https:\/\/www.mathworks.com\/help\/matlab\/matlab_prog\/regular-expressions.html#f0-56360 tokens> \r\n% to find where letters were repeated. In this case, since I have a\r\n% relatively short list of possible repeated letters, tokens might be a bit\r\n% heavy-handed. However, tokens can help me find any repeated letter and\r\n% are useful if I don't want to write out all possible double-letter\r\n% combinations.\r\n%\r\n% Parentheses allow you to group multiple characters and designate matched\r\n% expressions found as tokens. Tokens allow you to remember matched\r\n% elements and allow you to match other parts of the string with these\r\n% captured elements.  Using |\\N|, I can reference the Nth matched token in\r\n% my expression.\r\n%\r\n% In this case, I want to find the letters in the string and\r\n% see if the next letter matches the previous letter found. I do this by\r\n% using |\\w| to find a letter (note that |\\w| also matches digits and\r\n% underscores) and grouping it in parentheses to make it a\r\n% token. Then, I reference that token with |\\1| to indicate I want to find\r\n% instances where the matched letter was repeated. I follow that by an |=|\r\n% and a |\\d+| as before. I capture the numeric value as output by exporting\r\n% just the tokens and by indicating I want |\\d+| to be a token by\r\n% enclosing it in parentheses.  By choosing tokens as the output option, I\r\n% can just get the matched tokens and not the whole matched string. 
I can\r\n% then use |cellfun| to extract the 2nd token (the numeric values) from the\r\n% output.\r\n\r\ndoubleValues2 = regexp(geneString,'(\\w)\\1=(\\d+)','tokens');\r\ncelldisp(doubleValues2);\r\n\r\n%%\r\ndoubleValues2 = cellfun(@(x) x{2}, doubleValues2,'UniformOutput',false);\r\ndisp(doubleValues2);\r\n\r\n%% Conclusion\r\n% I hope you enjoyed this post on regular expressions.  This is only the\r\n% tip of the iceberg, and there is much more that regular expressions can\r\n% do.  Check out the <https:\/\/www.mathworks.com\/help\/matlab\/matlab_prog\/regular-expressions.html documentation>\r\n% for more examples, and have fun!\r\n%\r\n% If you are currently using regular expressions, do you have any advice\r\n% for those new to using regular expressions?  If you are new to using\r\n% regular expressions, do you have any questions on getting started? Let me\r\n% know by leaving a comment for\r\n% <https:\/\/blogs.mathworks.com\/loren\/?=p=557#respond this post>.\r\n\r\n##### SOURCE END ##### 606058981b45441eaad3613d86b90625\r\n-->","protected":false},"excerpt":{"rendered":"<!--introduction--><p>Today I&#8217;d like to introduce guest blogger <a href=\"mailto:sarah.zaranek@mathworks.com\">Sarah Wait Zaranek<\/a> who works for the MATLAB Marketing team here at MathWorks. Sarah previously has <a href=\"https:\/\/blogs.mathworks.com\/loren\/2012\/02\/06\/using-gpus-in-matlab\/\">written<\/a> about using GPUs in MATLAB. Sarah will be discussing how she got started using regular expressions.... 
<a class=\"read-more\" href=\"https:\/\/blogs.mathworks.com\/loren\/2012\/10\/18\/learning-to-love-regular-expressions\/\">read more >><\/a><\/p>","protected":false},"author":39,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":[],"categories":[39,15,2],"tags":[],"_links":{"self":[{"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/posts\/557"}],"collection":[{"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/users\/39"}],"replies":[{"embeddable":true,"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/comments?post=557"}],"version-history":[{"count":4,"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/posts\/557\/revisions"}],"predecessor-version":[{"id":561,"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/posts\/557\/revisions\/561"}],"wp:attachment":[{"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/media?parent=557"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/categories?post=557"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/tags?post=557"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}