{"id":27,"date":"2006-04-05T00:00:10","date_gmt":"2006-04-05T05:00:10","guid":{"rendered":"https:\/\/blogs.mathworks.com\/loren\/?p=27"},"modified":"2016-07-28T14:20:20","modified_gmt":"2016-07-28T19:20:20","slug":"regexp-how-tos","status":"publish","type":"post","link":"https:\/\/blogs.mathworks.com\/loren\/2006\/04\/05\/regexp-how-tos\/","title":{"rendered":"MATLAB, Strings, and Regular Expressions"},"content":{"rendered":"<p>\r\nI'm pleased to introduce <a href=\"mailto:tendiamonds@mathworks.com?subject=Your%20post:%20MATLAB,%20Strings%20and%20Regular%20Expressions.\">Jason Breslau<\/a>, our guest blogger this week, who gives us his take on MATLAB, strings, and regular expressions.\r\n\r\n<\/p>\r\n   <p>\r\n    When you think about text processing, you probably think about Perl, as well \r\n    you should. Perl is the de facto standard in text processing, it was created \r\n    for the task, and is fine tuned to make it as easy as possible. To continue \r\n    with my presupposition of your train of thought, when you think of Perl, you \r\n    most likely think of regular expressions. Subsequently, when you think of \r\n    regular expressions, you probably think, \"Yuck!\"\r\n   <\/p>\r\n   <p>\r\n    For those of you unfamiliar with <a \r\n    href=\"http:\/\/en.wikipedia.org\/wiki\/Regular_expression\">regular \r\n    expressions<\/a>, they provide a mechanism to describe patterns in text, for \r\n    matching or replacing. They are generally considered very useful, yet ugly, \r\n    and difficult to understand and use. Perl was created for processing text, \r\n    and has regular expressions deeply ingrained into its language. Its most \r\n    basic operators are match and substitute, both of which have regular \r\n    expressions built into them. 
Virtually every Perl program uses regular \r\n    expressions, and it is likely that thoughts of regular expressions lead to \r\n    Perl, as much as the other way around.\r\n   <\/p>\r\n\r\n   <p>\r\n    One of MATLAB's hidden strengths is its ability to handle text processing.\r\n    MATLAB supports all of the requisite file\r\n    I\/O functions, and provides a wide selection of string\r\n    functions, but most importantly, MATLAB has built-in regular\r\n    expressions.\r\n   <\/p>\r\n   <p>\r\n    Text processing plays right into MATLAB's forte: matrices. In Perl, strings \r\n    are an atomic data type, which lends them to special care from the language. \r\n    In MATLAB, strings are one-dimensional matrices of type char, and MATLAB can \r\n    treat them as it does other matrices. This is useful if you want to perform \r\n    math on a string. This might sound preposterous if you are thinking about \r\n    taking the eigenvalues of your text file; however, useful math can be \r\n    performed on text, for example in cryptography.\r\n   <\/p>\r\n\r\n   <p>\r\n    For example, you may have seen an <a \r\n    href=\"http:\/\/www.snopes.com\/language\/apocryph\/cambridge.asp\">article<\/a> \r\n    that has circulated on the internet that speculates that people can easily \r\n    read text with each word scrambled, as long as the first and last letters of \r\n    each word are preserved. Jamie Zawinski has written a <a \r\n    href=\"http:\/\/www.jwz.org\/hacks\/scrmable.pl\">Perl script<\/a> to perform \r\n    this scrambling on user input. 
Here is that script stripped to its core code:\r\n   <\/p>\r\n<pre class=\"code\">\r\nwhile (&lt;&gt;) {\r\n    foreach (split (\/(\\w+)\/)) {\r\n        if (m\/\\w\/) {\r\n            my @w = split (\/\/);\r\n            my $A = shift @w;\r\n            my $Z = pop @w;\r\n            print $A;\r\n            if (defined ($Z)) {\r\n                my $i = $#w+1;\r\n                while ($i--) {\r\n                    my $j = int rand ($i+1);\r\n                    @w[$i,$j] = @w[$j,$i];\r\n                }\r\n                foreach (@w) {\r\n                    print $_;\r\n                }\r\n                print $Z;\r\n            }\r\n        } else {\r\n            print \"$_\";\r\n        }\r\n    }\r\n}\r\n<\/pre>\r\n   <p>\r\n\r\n    In MATLAB it is possible to do this easily, using dynamic\r\n    regular expressions, a new feature of <kbd>regexp<\/kbd> in Release 2006a.\r\n   <\/p>\r\n<pre class=\"code\">\r\nwhile true\r\n    line = input('', 's');\r\n    line = regexprep(line, '(?&lt;=\\w)\\w{2,}(?=\\w)', '${$0(randperm(length($0)))}');\r\n    disp(line);\r\nend\r\n<\/pre>\r\n   <p>\r\n    Note that the replacement string contains MATLAB code, which is evaluated \r\n    for each match before the text is replaced. This makes it possible to call MATLAB's <a href=\"https:\/\/www.mathworks.com\/help\/matlab\/ref\/randperm.html\"><kbd>randperm<\/kbd><\/a> \r\n    function.\r\n   <\/p>\r\n\r\n   <p>\r\n    With its support for string functions, notably <kbd>regexp<\/kbd>, MATLAB is easier to \r\n    use, and consequently more effective, than Perl for text processing.\r\n   <\/p>\r\n   <h3>\r\n    Advantages to MATLAB <kbd>regexp<\/kbd>\r\n   <\/h3>\r\n   <p>\r\n    There are several differences between the Perl and MATLAB implementations of \r\n    regular expressions. Most of these differences are nuances in the languages, \r\n    such as default options and syntax for word breaks. 
Some advantageous \r\n    aspects of regular expressions in MATLAB over Perl are named tokens and case \r\n    preservation. \r\n   <\/p>\r\n   <h4>\r\n\r\n    Named Tokens\r\n   <\/h4>\r\n  <table rowspan=1 colspan=1 border=0 cellpadding=10>\r\n  <tr><td>\r\n   <p>\r\n    I have found that once someone starts using regular expressions, a \r\n    compelling feature they discover is tokens, also known as capture groups. Tokens \r\n    are created by top level parenthetical subexpressions in a regular \r\n    expression. These tokens are then available as outputs to the <kbd><a \r\n    href=\"https:\/\/www.mathworks.com\/help\/matlab\/ref\/regexp.html\">regexp<\/a><\/kbd>\r\n    function, as backreferences later in the same pattern, and as arguments that \r\n    modify replacement text in <kbd><a \r\n    href=\"https:\/\/www.mathworks.com\/help\/matlab\/ref\/regexprep.html\">regexprep<\/a><\/kbd>. \r\n    If you like using tokens in your regular expressions, you will love using named\r\n    tokens.\r\n   <\/p>\r\n\r\n   <p>\r\n    The named tokens feature allows you to specify names for the parenthetical \r\n    subexpressions that capture tokens. Then the tokens may be referred to by \r\n    name, as opposed to by number. This makes expressions clearer, as well as \r\n    less prone to bugs. This is because appending two expressions will change the \r\n    indices of the tokens. For example, write an expression to extract the \r\n    month, day and year from a date without using named tokens:\r\n   <\/p>\r\n<pre class=\"code\">\r\n&gt;&gt; date = regexp('11\/26\/1977', '(\\d+)\/(\\d+)\/(\\d+)', 'tokens');\r\n&gt;&gt; date{:}\r\n\r\nans = \r\n\r\n    '11'    '26'    '1977'\r\n<\/pre>\r\n   <p>\r\n    If the above code were in a function, day would be referred to as \r\n    <kbd>date{2}<\/kbd>. 
Converting this example to use named tokens:\r\n   <\/p>\r\n\r\n<pre class=\"code\">\r\n&gt;&gt; date = regexp('11\/26\/1977', '(?&lt;month&gt;\\d+)\/(?&lt;day&gt;\\d+)\/(?&lt;year&gt;\\d+)', 'names')\r\n\r\ndate = \r\n\r\n    month: '11'\r\n      day: '26'\r\n     year: '1977'\r\n<\/pre>\r\n   <p>\r\n    Now day may be referred to as <kbd>date.day<\/kbd>. What if the pattern also \r\n    needs to match dates in the European format of <i>dd.mm.yyyy<\/i>? The first \r\n    pattern could be extended as:\r\n   <\/p>\r\n\r\n<pre class=\"code\">\r\n&gt;&gt; date = regexp('11\/26\/1977', '(\\d+)\/(\\d+)\/(\\d+)|(\\d+)\\.(\\d+)\\.(\\d+)', 'tokens');\r\n&gt;&gt; date{:}\r\n\r\nans = \r\n\r\n    '11'    '26'    '1977'\r\n\r\n&gt;&gt; date = regexp('26.11.1977', '(\\d+)\/(\\d+)\/(\\d+)|(\\d+)\\.(\\d+)\\.(\\d+)', 'tokens');\r\n&gt;&gt; date{:}\r\n\r\nans = \r\n\r\n    '26'    '11'    '1977'\r\n<\/pre>\r\n   <p>\r\n    Notice, however, that the token order no longer tells you which value is \r\n    the month and which is the day. The same example \r\n    can be fixed using named tokens:\r\n   <\/p>\r\n<pre class=\"code\">\r\n&gt;&gt; date = regexp('11\/26\/1977', ...\r\n'(?&lt;month&gt;\\d+)\/(?&lt;day&gt;\\d+)\/(?&lt;year&gt;\\d+)|(?&lt;day&gt;\\d+)\\.(?&lt;month&gt;\\d+)\\.(?&lt;year&gt;\\d+)',...\r\n'names')\r\n\r\ndate = \r\n\r\n    month: '11'\r\n      day: '26'\r\n     year: '1977'\r\n\r\n\r\n&gt;&gt; date = regexp('26.11.1977', ...\r\n'(?&lt;month&gt;\\d+)\/(?&lt;day&gt;\\d+)\/(?&lt;year&gt;\\d+)|(?&lt;day&gt;\\d+)\\.(?&lt;month&gt;\\d+)\\.(?&lt;year&gt;\\d+)',...\r\n'names')\r\n\r\ndate = \r\n\r\n    month: '11'\r\n      day: '26'\r\n     year: '1977'\r\n<\/pre>\r\n   <p>\r\n\r\n    As you can see, named tokens clarify the expression, results and surrounding code.\r\n   <\/p>\r\n   <\/td><\/tr><\/table>\r\n   <h4>\r\n    Case Preservation\r\n   <\/h4>\r\n  <table rowspan=1 colspan=1 border=0 cellpadding=10>\r\n  <tr><td>\r\n   <p>\r\n    Case insensitivity is such a fundamental aspect of pattern matching that \r\n    MATLAB emphasizes it 
as a separate built-in function, <kbd>regexpi<\/kbd>. \r\n    Many programs offer this option in their search-and-replace functions. \r\n    Unfortunately, although these programs match case-insensitively, they do \r\n    not address the need for the replacement to adapt to the case of the matched text. As an \r\n    example, a word capitalized at the beginning of a sentence should remain \r\n    capitalized after a replacement. MATLAB recognizes this need. In addition to \r\n    providing options that will ignore the case and match the case, MATLAB also \r\n    supplies the option to preserve the case. To support this, \r\n    <kbd>regexprep<\/kbd> has the following options to handle case: \r\n    <kbd>matchcase<\/kbd>, <kbd>ignorecase<\/kbd> and <kbd>preservecase<\/kbd>. The \r\n    differences are shown in these examples:\r\n   <\/p>\r\n\r\n<pre class=\"code\">\r\n&gt;&gt; regexprep('The Car, The car, THE CAR, the car, THE car', 'THE car', 'a BOAT', 'matchcase')\r\n\r\nans =\r\n\r\nThe Car, The car, THE CAR, the car, a BOAT\r\n\r\n&gt;&gt; regexprep('The Car, The car, THE CAR, the car, THE car', 'THE car', 'a BOAT', 'ignorecase')\r\n\r\nans =\r\n\r\na BOAT, a BOAT, a BOAT, a BOAT, a BOAT\r\n\r\n&gt;&gt; regexprep('The Car, The car, THE CAR, the car, THE car', 'THE car', 'a BOAT', 'preservecase')\r\n\r\nans =\r\n\r\nA Boat, A boat, A BOAT, a boat, a BOAT\r\n<\/pre>\r\n   <p>\r\n    Note how the <kbd>preservecase<\/kbd> option does what you would most likely \r\n    want it to.\r\n   <\/p>\r\n   <\/td><\/tr><\/table>\r\n   <h3>\r\n\r\n    MATLAB for Regular Expressions\r\n   <\/h3>\r\n   <p>\r\n    The MATLAB regular expression functions are fully featured and refined text \r\n    processing tools. Along with the other support that MATLAB provides for text \r\n    manipulation, the suite within MATLAB is the easiest and cleanest way to \r\n    write string processing code. 
When you next think of text processing, think \r\n    of MATLAB, and maybe you won't have to think, \"Yuck!\"\r\n   <\/p>\r\n   <p>\r\n   <ul>\r\n   <li>\r\n    Do you use MATLAB yet defer to Perl for text processing?\r\n   <\/li>\r\n   <li>\r\n    Is there functionality that you use in Perl for text processing that is absent from MATLAB?\r\n   <\/li>\r\n   <\/ul>\r\n    Tell me about it <a href=\"?p=27#respond\">here<\/a>.\r\n   <\/p>\r\n   <p>\r\n    &nbsp;\r\n   <\/p>","protected":false},"excerpt":{"rendered":"<p>\r\nI'm pleased to introduce Jason Breslau, our guest blogger this week, who gives us his take on MATLAB, strings, and regular expressions.\r\n\r\n\r\n   \r\n    When you think about text processing, you... <a class=\"read-more\" href=\"https:\/\/blogs.mathworks.com\/loren\/2006\/04\/05\/regexp-how-tos\/\">read more >><\/a><\/p>","protected":false},"author":39,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":[],"categories":[2],"tags":[],"_links":{"self":[{"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/posts\/27"}],"collection":[{"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/users\/39"}],"replies":[{"embeddable":true,"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/comments?post=27"}],"version-history":[{"count":2,"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/posts\/27\/revisions"}],"predecessor-version":[{"id":1796,"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/posts\/27\/revisions\/1796"}],"wp:attachment":[{"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/media?parent=27"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/categories?post=
27"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/tags?post=27"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}