Loren on the Art of MATLAB

Turn ideas into MATLAB

Stressed When Searching for Strings?

Do you get clammy hands when you have to search for a string pattern, not just a particular string? Does the thought of struggling with regexp make you sweat?

Well, worry no more! Many of your searches can now be done more easily using the new pattern feature in MATLAB. And in some cases, you can get away with even less.

For today's post, my co-authors are Jason Breslau and Curtis Anderson, since they know MUCH more about regexp than I do, and many more of the nuances of the functionality. We're going to walk through it by showcasing a few examples. You might also want to check out Jiro's recent Pick of the Week.

Example 0: For Those Who Love regexp

This example is for you if you love regexp and don't see why you should consider anything else. You can use regexpPattern to convert your favorite regular expression to a pattern, so you can take advantage of the functions that accept patterns. Compare these two ways to see if a string is contained in some text.

               contains(str, regexpPattern(expr))
               ~cellfun('isempty',regexp(str,expr))

Which one of these can you quickly understand without having to go through the logic each time you read it?
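
As a small illustration of the comparison above, here is a minimal sketch that runs both forms on a made-up string array. The sample text and the expression are assumptions for illustration, not part of the original example.

```matlab
% Hypothetical sample data: which entries contain a run of digits?
str  = ["abc123", "hello", "x9"];
expr = "\d+";

% Pattern-based form: reads as "str contains something matching expr".
viaPattern = contains(str, regexpPattern(expr));

% Classic form: for array input, regexp returns a cell array of start
% indices, so each cell has to be tested for emptiness and then negated.
viaRegexp = ~cellfun('isempty', regexp(str, expr));
```

Both variables should hold the same logical array, one element per entry of str.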

And now, more for those who would really prefer to skip regexp more often.

Example 1: Counting Comment Lines

Suppose I want to count the lines in a MATLAB file that are comments (not block comments). Here's how to do this with regexp:

codeFile = fileread('num2str.m');
comments = regexp(codeFile, '^\s*%', 'lineanchors');
numel(comments)
ans =
    37

That's annoying, as you had to know about lineanchors, and that regular expression is a little ugly. Plus, it returned an array of indices that we don't really care about. Instead, try this:

count(codeFile, lineBoundary + whitespacePattern + "%")
ans =
    37

We still need to read the file, but now we simply count the lines that, ignoring leading whitespace, start with %.
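
One detail worth knowing: called with no arguments, whitespacePattern matches one or more whitespace characters, while the regular expression's \s* allows zero. Here is a minimal sketch, assuming a made-up three-line snippet rather than num2str.m, that spells out the "zero or more" explicitly and anchors at the start of each line:

```matlab
% Hypothetical sample: two comment lines and one trailing comment.
sample = "% header comment" + newline + ...
         "    % indented comment" + newline + ...
         "x = 1; % trailing comment";

% Zero or more whitespace characters between the line start and "%".
commentLine = lineBoundary("start") + whitespacePattern(0, inf) + "%";
numComments = count(sample, commentLine);

% The regexp form from above, for comparison.
numViaRegexp = numel(regexp(sample, '^\s*%', 'lineanchors'));
```

The trailing comment on the third line is not counted by either approach, because the line does not start with %.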

Example 2: How to Find Words in a File Starting with a Vowel

Suppose we want to get some statistics about words starting with a vowel.

vowelWords = regexpi(codeFile, '\<[aeiou][a-z]*', 'match');
howManyWords = length(vowelWords)
howManyWords =
   176

Using pattern, we first extract the words, that is, runs of alphabetic characters, and then keep only the ones starting with a vowel.

words = extract(codeFile, lettersPattern);
vowelWords1 = words(startsWith(words, characterListPattern('aeiou'),'IgnoreCase', true));
howManyWords = length(vowelWords1)
howManyWords =
   176

And here's perhaps an even better way to do this! Build a pattern from a list of the vowels. Then look for a letter boundary (the position between a non-letter character, such as whitespace, and a letter), followed by a vowel, and then optionally the rest of the word.

vowel = caseInsensitivePattern(characterListPattern("aeiou"));
vowelWords2Pat = letterBoundary + vowel + lettersPattern(0,inf);
vowelWords2 = extract(codeFile, vowelWords2Pat);
howManyWords = length(vowelWords2)
howManyWords =
   176

Assuming we only want the count here, we can replace the last two lines of the previous code with a call to count and attain our goal more efficiently. The great thing about building up a complex pattern from pieces is the versatility it affords you.

hmw = count(codeFile,vowelWords2Pat,"IgnoreCase", true)
hmw =
   176

Or I could replace lettersPattern(0,inf) with optionalPattern(lettersPattern). Being able to give patterns to functions like count, startsWith, and contains is the biggest win.
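
To see the pieces working together on something small, here is a minimal sketch using a made-up sentence (an assumption, not text from num2str.m) and the optionalPattern variant just mentioned:

```matlab
% Hypothetical sample sentence.
txt = "3 eggs, an apple, one Umbrella and ice";

% Build the pattern from small, named pieces.
vowel      = caseInsensitivePattern(characterListPattern("aeiou"));
restOfWord = optionalPattern(lettersPattern);  % same idea as lettersPattern(0,inf)
vowelWord  = letterBoundary + vowel + restOfWord;

% The same pattern works for extracting and for counting.
found   = extract(txt, vowelWord);
howMany = count(txt, vowelWord);
```

Every word in this sentence starts with a vowel, so found holds all seven of them, including the capitalized Umbrella.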

Best Practice

We have found that it is best to build up a pattern by joining smaller pieces. This makes it easier to understand what you are doing, where you are or are not applying case sensitivity, and so on.
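
Here is a minimal sketch of that practice. The release-name task, the sample text, and the variable names are all hypothetical and chosen only to show case insensitivity applied to one piece and not the others:

```matlab
% Hypothetical example: pick out release names such as "R2020b".
txt = "Tested on R2019A and R2020b, but not on r2018b.";

releaseYear = digitsPattern(4);                                    % exactly four digits
releaseHalf = caseInsensitivePattern(characterListPattern("ab"));  % a/A or b/B
releaseName = "R" + releaseYear + releaseHalf;                     % "R" stays case sensitive

releases = extract(txt, releaseName);
```

Because only releaseHalf is wrapped in caseInsensitivePattern, R2019A matches but r2018b does not.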

Example 3: Converse - Find Words Beginning with Consonants

Suppose instead we want words starting with consonants. Here's the regexp way.

consRegexp = regexpi(codeFile, '\<(?![aeiou])[a-z]+', 'match');

And using a pattern:

consPat = extract(codeFile, ...
    alphanumericBoundary + ...
    ~lookAheadBoundary(caseInsensitivePattern(characterListPattern('aeiou')))...
    + lettersPattern);

And finally, using neither regexp nor a consonant pattern. Instead, negate the test for words starting with a vowel. This is perhaps the easiest to understand.

consWords = words(~startsWith(words, caseInsensitivePattern(characterListPattern('aeiou'))));

The astute reader will see that the answers here do not agree. See the Appendix at the end for details and how to get the answers to align.

Example 4: Looking for Files with Certain Extensions

We audited all of the regular expressions used in one stage of our test system, and found that around 50% of them could be replaced by endsWith with NO PATTERN at all. Previously we used regexp, but that is a huge hammer for the job. I think looking for files with a particular file extension may have been a common use case, like this:

               regexp(fileName, '.txt$')

which has two bugs! You need isempty, and 'once':

               ~isempty(regexp(fileName, '.txt$', 'once'))

And you also have to escape the dot, which everyone forgets to do.

               ~isempty(regexp(fileName, '\.txt$', 'once'))

Instead, now you simply do

               endsWith(fileName, '.txt')

The interesting thing is that this uses no pattern at all, but uses a function, endsWith, that could take a pattern.
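
To make the unescaped-dot pitfall concrete, here is a minimal sketch with a made-up file name (an assumption for illustration) that has no .txt extension at all:

```matlab
% Hypothetical file name: ends in "txt" but has no ".txt" extension.
fileName = 'notes_txt';

% Unescaped dot: "." matches any character, here the underscore.
unescaped = ~isempty(regexp(fileName, '.txt$', 'once'));

% Escaped dot: only a literal "." is accepted.
escaped = ~isempty(regexp(fileName, '\.txt$', 'once'));

% endsWith compares literal text, so there is nothing to escape.
viaEndsWith = endsWith(fileName, '.txt');
```

Only the unescaped version reports a match, which is exactly the bug that is easy to overlook.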

Suppose you now want to check for two different extensions. Easily done. endsWith supports multiple search strings and treats them as an "or". This is faster, but a bit more limited, than doing a search with a proper pattern.

               endsWith(fileName, [".txt", ".somethingElse"])

Or use a pattern with an explicit or:

               endsWith(fileName, ".txt" | ".somethingElse")

What About a File with No Extension?

What if a filename has either a .txt extension or no extension at all?

               endsWith(fileName, '.txt') || ~contains(fileName, '.')

This is for a single file, without a full pathname.
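
If you have several file names at once, the same checks extend to a string array. This is a minimal sketch with made-up file names (assumptions for illustration); note the element-wise | in place of the scalar ||:

```matlab
% Hypothetical file names (no paths).
files = ["notes.txt", "README", "data.csv", "log.TXT"];

isTxt        = endsWith(files, ".txt");                        % case sensitive by default
isTxtAnyCase = endsWith(files, ".txt", "IgnoreCase", true);
isTxtOrCsv   = endsWith(files, pattern(".txt") | pattern(".csv"));

% Element-wise | replaces the scalar || when testing many names at once.
isTxtOrNoExt = endsWith(files, ".txt") | ~contains(files, ".");
```

Each result is a logical array with one element per file name.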

How's Your Search Going?

Are you able to make good use of patterns in MATLAB, and have you been able (or not) to eliminate some or all uses of regexp? Let us know here.

Appendix

As promised, I will describe here why the consonant answers do not agree and how to make them the same.

The words variable holds groups of consecutive letters. Some names in the num2str code contain digits as well, e.g., mat2str, and each of these was split into two words, here mat and str. We can fix this using

               words = extract(codeFile, alphanumericBoundary ...
                       + lettersPattern + alphanumericBoundary);

This means the regexp version is:

               consRegexp = regexpi(codeFile, '\<(?![aeiou])[a-z]+\>', 'match');

and the corresponding pattern:

               consPat = extract(codeFile, ...
                 alphanumericBoundary + ...
                 ~lookAheadBoundary(caseInsensitivePattern(characterListPattern('aeiou')))...
                 + lettersPattern ...
                 + alphanumericBoundary);

Phew! That's a mouthful! But pretty readable too.
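
To watch the mismatch and the fix on something small, here is a minimal sketch using a made-up line of text (an assumption, not taken from num2str.m) that contains two names with digits in them:

```matlab
% Hypothetical sample with digits inside two names, wrapped in parentheses.
txt   = "(call mat2str and num2str in code)";
vowel = caseInsensitivePattern(characterListPattern('aeiou'));

% Original approach: runs of letters, so mat2str splits into mat and str.
looseWords = extract(txt, lettersPattern);
looseCons  = looseWords(~startsWith(looseWords, vowel));

% Whole-word approach from this appendix.
wholeWords = extract(txt, alphanumericBoundary + lettersPattern + alphanumericBoundary);
wholeCons  = wholeWords(~startsWith(wholeWords, vowel));

% Consonant pattern with the trailing boundary added.
consPat = extract(txt, alphanumericBoundary + ~lookAheadBoundary(vowel) ...
    + lettersPattern + alphanumericBoundary);
```

With the loose definition, looseCons picks up the fragments mat, str, and num, while wholeCons and consPat agree on just call and code.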




Published with MATLAB® R2020b
