Stressed When Searching for Strings?
Do you get clammy hands when you have to search for a string pattern, not just a particular string? Does the thought of struggling with regexp make you sweat?
Well, worry no more! Many of your searches may now be done more easily using the new pattern feature in MATLAB. And in some cases, you can get away with even less.
For today's post, my co-authors are Jason Breslau and Curtis Anderson, since they know MUCH more about regexp than I do, and many more nuances about the functionality. We're going to do this by showcasing a few examples. You might also want to check out Jiro's recent Pick of the Week.
Contents
- Example 0: For Those Who Love regexp
- Example 1: Counting Comment Lines
- Example 2: How to Find Words in a File Starting with a Vowel
- Example 3: Converse - Find Words Beginning with Consonants
- Example 4: Looking for Files with Certain Extensions
- What about a File with No Extension
- How's Your Search Going?
- Appendix
Example 0: For Those Who Love regexp
This example is for you if you love regexp and don't see why you should consider anything else. You can use regexpPattern to convert your favorite regular expression to a pattern, so you can take advantage of the functions that accept patterns. Compare these two ways to see if a string is contained in some text.
contains(str, regexpPattern(expr))
~cellfun('isempty',regexp(str,expr))
Which one of these can you quickly understand without having to go through the logic each time you read it?
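To make the comparison concrete, here is a minimal sketch on made-up data. The file names in str and the expression in expr are hypothetical and were chosen only for illustration.

```matlab
% Hypothetical sample data (not from the original post).
str  = ["report_2021.txt", "notes.md", "data_07.csv"];
expr = "\d+";    % one or more digits

% Pattern route: contains returns one logical per element of str.
hasDigitsPat = contains(str, regexpPattern(expr));

% regexp route: for a string array, regexp returns a cell array of
% start indices, so each cell has to be tested for emptiness.
hasDigitsRe = ~cellfun('isempty', regexp(str, expr));

disp(hasDigitsPat)
disp(isequal(hasDigitsPat, hasDigitsRe))
```

Both routes should flag the first and third names; the difference is in how much you have to unpack when reading the code.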
And now, more for those who would really prefer to skip regexp more often.
Example 1: Counting Comment Lines
Suppose I want to count the lines in a MATLAB file that are comments (not block comments). Here's how to do this with regexp:
codeFile = fileread('num2str.m');
comments = regexp(codeFile, '^\s*%', 'lineanchors');
numel(comments)
ans = 37
That's annoying, as you had to know about lineanchors, and that regular expression is a little ugly. Plus, it returned an array of indices that we don't really care about. Instead, try this:
count(codeFile, lineBoundary + whitespacePattern + "%")
ans = 37
We still need to read the file, and then we look for lines that, ignoring leading whitespace, effectively start with %.
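If you want to try the idea without a file on disk, the sketch below counts comment lines in a small in-memory snippet. The sample text is hypothetical, and the pattern is a variant of the one above, not the same pattern: it assumes lineBoundary("start") and whitespacePattern(0,inf) to spell out "start of line, then zero or more whitespace characters", mirroring the ^\s* in the regular expression.

```matlab
% Hypothetical stand-in for text returned by fileread (not from the post).
sampleCode = join([ ...
    "function y = f(x)"
    "% F doubles its input."
    "    % an indented comment"
    "y = 2*x; % a trailing comment, not a comment line"
    "end"], newline);

% regexp route: one start index per comment line.
nRegexp = numel(regexp(sampleCode, '^\s*%', 'lineanchors'));

% Pattern route (variant): start of a line, optional whitespace, then %.
commentLine = lineBoundary("start") + whitespacePattern(0, inf) + "%";
nPattern = count(sampleCode, commentLine);

disp([nRegexp nPattern])
```

Only the second and third lines of the snippet are comment lines; the trailing comment on the fourth line is not counted by either approach.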
Example 2: How to Find Words in a File Starting with a Vowel
Suppose we want to get some statistics about words starting with a vowel.
vowelWords = regexpi(codeFile, '\<[aeiou][a-z]*', 'match');
howManyWords = length(vowelWords)
howManyWords = 176
Using pattern, we first extract the words, which are runs of alphabetic characters, and then keep only the ones starting with a vowel.
words = extract(codeFile, lettersPattern);
vowelWords1 = words(startsWith(words, characterListPattern('aeiou'), 'IgnoreCase', true));
howManyWords = length(vowelWords1)
howManyWords = 176
And here's perhaps an even better way to do this! Build a pattern from a list of the vowels. Then look for a letter boundary (the spot where a non-letter, such as whitespace, meets a letter), followed by a vowel and then the possible rest of the word.
vowel = caseInsensitivePattern(characterListPattern("aeiou"));
vowelWords2Pat = letterBoundary + vowel + lettersPattern(0,inf);
vowelWords2 = extract(codeFile, vowelWords2Pat);
howManyWords = length(vowelWords2)
howManyWords = 176
Assuming we only want the count here, we can replace the last 2 lines of the previous code with a call to count and reach our goal more efficiently. The great thing about the workflow of building up a complex pattern is the versatility it affords you.
hmw = count(codeFile,vowelWords2Pat,"IgnoreCase", true)
hmw = 176
Or I could replace lettersPattern(0,Inf) with optionalPattern(lettersPattern). Being able to give patterns to functions like count, startsWith and contains is the biggest win.
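As a quick check of that equivalence, here is a small sketch on a made-up sentence. The text in txt is hypothetical; the two patterns differ only in how "the possible rest of the word" is written.

```matlab
% Hypothetical sample sentence (not from the original post).
txt = "He said: an apple and one orange sat under Every umbrella.";

vowel = caseInsensitivePattern(characterListPattern("aeiou"));

% Two spellings of "a vowel at a letter boundary, then maybe more letters".
patA = letterBoundary + vowel + lettersPattern(0, inf);
patB = letterBoundary + vowel + optionalPattern(lettersPattern);

nA = count(txt, patA);
nB = count(txt, patB);
disp([nA nB])

% The same pattern object also works with extract if you want the words.
disp(extract(txt, patA))
```

The words starting with a vowel here are an, apple, and, one, orange, under, Every and umbrella, so both counts should come out the same.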
Best Practice
We have found that it is best to build up a pattern by joining smaller pieces. It makes it easier to understand what you are doing, where you are or are not applying case sensitivity, etc.
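Here is one way this can look in practice, as a sketch on invented data. The file names and the naming scheme (a fixed prefix, four digits, and an extension) are hypothetical; the point is that each named piece makes it obvious where case-insensitivity applies and where it does not.

```matlab
% Hypothetical file names (not from the original post).
names = ["IMG_0042.JPG", "IMG_0043.jpg", "img_0044.jpg", "IMG_12.jpg"];

% Build the pattern from small, named pieces.
prefix    = "IMG_";                              % case-sensitive on purpose
number    = digitsPattern(4);                    % exactly four digits
extension = "." + caseInsensitivePattern("jpg"); % jpg, JPG, Jpg, ...

imagePat = prefix + number + extension;

% matches requires the whole name to fit the pattern.
isImage = matches(names, imagePat);
disp(isImage)
```

Only the first two names fit: the third fails on the case-sensitive prefix and the fourth has too few digits.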
Example 3: Converse - Find Words Beginning with Consonants
Suppose instead we want words starting with consonants. Here's the regexp way.
consRegexp = regexpi(codeFile, '\<(?![aeiou])[a-z]+', 'match');
And using a pattern
consPat = extract(codeFile, ...
    alphanumericBoundary + ...
    ~lookAheadBoundary(caseInsensitivePattern(characterListPattern('aeiou'))) ...
    + lettersPattern);
And finally, using neither regexp nor a consonant pattern. Instead, take the words that do not start with a vowel. This is perhaps the easiest to understand.
consWords = words(~startsWith(words, caseInsensitivePattern(characterListPattern('aeiou'))));
The astute reader will see that the answers here do not agree. See the Appendix at the end for details and how to get the answers to align.
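For comparison, here is a sketch of the three approaches on a short, made-up sentence that was deliberately chosen to contain no digits or underscores. The text in txt and the variable names are hypothetical. On input like this, all three approaches should return the same words.

```matlab
% Hypothetical sample text with no digits or underscores.
txt = "(an old cat saw three birds)";

vowel = caseInsensitivePattern(characterListPattern("aeiou"));

% 1) regexp with a negative lookahead; string() normalizes the output type.
viaRegexp = string(regexpi(txt, '\<(?![aeiou])[a-z]+', 'match'));

% 2) pattern with a negated look-ahead boundary.
viaPattern = extract(txt, ...
    alphanumericBoundary + ~lookAheadBoundary(vowel) + lettersPattern);

% 3) extract all words, then drop the ones that start with a vowel.
allWords    = extract(txt, lettersPattern);
viaNegation = allWords(~startsWith(allWords, vowel));

disp(viaPattern)
```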
Example 4: Looking for Files with Certain Extensions
We audited all of the regular expressions used in one stage in our test system, and found that around 50% of them could be replaced by endsWith with NO PATTERN at all. Previously we used regexp, but that is a huge hammer for the job. I think looking for files with a particular file extension may have been a common use case, like
regexp(fileName, '.txt$')
which has two bugs! You need isempty, and 'once':
~isempty(regexp(fileName, '.txt$', 'once'))
And you also have to escape the dot, which everyone forgets to do.
~isempty(regexp(fileName, '\.txt$', 'once'))
Instead now you simply do
endsWith(fileName, '.txt')
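To see the unescaped-dot bug in action, here is a small sketch with a made-up file name that has no extension at all but happens to end in the letters txt.

```matlab
% Hypothetical file name: no extension, but it ends in "txt".
fileName = 'notes_txt';

% Unescaped dot: "." matches any character, here the underscore.
unescaped = ~isempty(regexp(fileName, '.txt$', 'once'));

% Escaped dot: only a literal "." is accepted.
escaped = ~isempty(regexp(fileName, '\.txt$', 'once'));

% endsWith treats the search text literally, so there is nothing to escape.
plain = endsWith(fileName, '.txt');

disp([unescaped escaped plain])
```

The unescaped version wrongly reports a match, while the escaped regexp and endsWith agree that this name does not end in .txt.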
The interesting thing is that this uses no pattern at all, but it uses a function, endsWith, that could take a pattern.
Suppose you now want to check for 2 different extensions. Easily done. endsWith supports multiple search strings, and treats them as an or. This is faster but a bit more limited than doing a search with a proper pattern.
endsWith(fileName, [".txt", ".somethingElse"])
Or use a pattern with an explicit or:
endsWith(fileName, ".txt" | ".somethingElse")
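Both forms also work on a whole list of names at once. The sketch below uses hypothetical file names and extensions, and it wraps the first alternative in pattern(...) so that the | is unambiguously a pattern operation.

```matlab
% Hypothetical file names (not from the original post).
fileNames = ["notes.txt", "data.csv", "readme.TXT", "archive.txt.bak"];

% Multiple search strings are treated as an "or".
viaList = endsWith(fileNames, [".txt", ".csv"]);

% The same test written as a pattern with an explicit "or".
viaPattern = endsWith(fileNames, pattern(".txt") | ".csv");

% Optional: ignore case so that .TXT counts as well.
anyCase = endsWith(fileNames, [".txt", ".csv"], "IgnoreCase", true);

disp([viaList; viaPattern; anyCase])
```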
What about a File with No Extension
What if a filename ends with either txt or no extension at all?
endsWith(fileName, '.txt') || ~contains(fileName, '.')
This is for a single file, without a full pathname.
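If you have several names in a string array, the short-circuit || no longer applies, because it needs scalar operands. A minimal sketch with hypothetical names, using the element-wise | instead:

```matlab
% Hypothetical file names without path information.
fileNames = ["notes.txt", "Makefile", "data.csv", "LICENSE"];

% Element-wise "or": ends in .txt, or contains no dot at all.
isTxtOrBare = endsWith(fileNames, ".txt") | ~contains(fileNames, ".");

disp(isTxtOrBare)
```

As with the single-file version, this assumes the names carry no folder part, since a dot in a folder name would defeat the contains test.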
How's Your Search Going?
Are you able to make good use of patterns in MATLAB, and are you able (or not) to eliminate some or all uses of regexp? Let us know here.
Appendix
As promised, I will describe here why the consonant answers do not agree and how to make them the same.
The words variable holds groups of consecutive letters. But the num2str code also contains names that include digits, e.g., mat2str. Such a name was split into 2 words, mat and str. We can fix this using
words = extract(codeFile, alphanumericBoundary ...
    + lettersPattern + alphanumericBoundary);
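The effect of the extra boundaries is easy to see on a tiny, made-up snippet. The text below is hypothetical and is wrapped in punctuation so that every word is delimited by a non-alphanumeric character.

```matlab
% Hypothetical snippet containing names with digits in them.
snippet = "(mat2str(x), int2str(n))";

% Runs of letters only: names with digits get split into pieces.
pieces = extract(snippet, lettersPattern);

% Letters that are bounded on both sides by non-alphanumeric characters:
% mat2str and int2str no longer contribute fragments.
whole = extract(snippet, ...
    alphanumericBoundary + lettersPattern + alphanumericBoundary);

disp(pieces)
disp(whole)
```

With the boundaries in place, only x and n survive, because mat, str and int each touch a digit on one side.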
This means the regexp version is:
consRegexp = regexpi(codeFile, '\<(?![aeiou])[a-z]+\>', 'match');
and the corresponding pattern:
consPat = extract(codeFile, ...
    alphanumericBoundary + ...
    ~lookAheadBoundary(caseInsensitivePattern(characterListPattern('aeiou'))) ...
    + lettersPattern ...
    + alphanumericBoundary);
Phew! That's a mouthful! But pretty readable too.