Fast, programmatic string search in MATLAB files

Jiro's pick this week is findInM by our very own Brett Shoelson.

If you know Brett, which you probably do if you've spent any time on MATLAB Central, you know about all the useful File Exchange entries he has contributed. It's no surprise he's ranked number 8. Besides his many entries related to image processing, he has uploaded a number of utility functions, and findInM is a must-have. The title of the File Exchange entry starts with "FAST, PROGRAMMATIC string searching...", and that's exactly what it is: it searches for a string of text in MATLAB files, and it does so quickly and programmatically. There are already a number of File Exchange entries that search for text within files, including mfilegrep, mgrep, and grep. There is also an interactive way of searching from the toolstrip.

From the description of Brett's entry, findInM "can be much faster than any other method [he's] seen." Brett achieves this by first creating an index of the MATLAB files in the folders. Building the index takes some time, but afterwards each search runs against the index file and is very efficient.

In this example, I first created an index of the /toolbox folder (and its subfolders) of my MATLAB installation. Searching for some text across more than 20,000 files then took less than 10 seconds.

tic
s = findInM('graph theory','toolbox')
toc
SORTED BY DATE, MOST RECENT ON TOP:

s =
'C:\Program Files\MATLAB\R2014b\toolbox\bioinfo\bioinfo\Contents.m'
'C:\Program Files\MATLAB\R2014b\toolbox\bioinfo\biodemos\graphtheorydemo.m'
'C:\Program Files\MATLAB\R2014b\toolbox\matlab\sparfun\Contents.m'
'C:\Program Files\MATLAB\R2014b\toolbox\matlab\sparfun\gplot.m'
'C:\Program Files\MATLAB\R2014b\toolbox\bioinfo\graphtheory\Contents.m'
Elapsed time is 8.334972 seconds.


For comparison, the interactive "Find Files" tool in MATLAB took over 5 minutes to do the same search.
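
To give a feel for why an index helps, here is a minimal sketch of the index-then-search idea. It is not Brett's implementation: the variable names, the MAT-file named mIndexSketch.mat, and the choice to cache whole file contents are all assumptions made for illustration. The slow part (reading every file) happens once, and each later search only scans text that is already cached.

```matlab
% Illustrative sketch only; this is NOT how findInM is implemented.
% Step 1 (slow, done once): read every .m file under rootFolder.
rootFolder = pwd;                           % assumption: index the current folder
folders = strsplit(genpath(rootFolder), pathsep);
folders = folders(~cellfun(@isempty, folders));   % drop the empty trailing entry

files = {};                                 % full path of each indexed file
texts = {};                                 % contents of each indexed file
for k = 1:numel(folders)
    listing = dir(fullfile(folders{k}, '*.m'));
    listing = listing(~[listing.isdir]);    % ignore folders that end in .m
    for j = 1:numel(listing)
        files{end+1, 1} = fullfile(folders{k}, listing(j).name); %#ok<SAGROW>
        texts{end+1, 1} = fileread(files{end});                  %#ok<SAGROW>
    end
end
save('mIndexSketch.mat', 'files', 'texts'); % hypothetical index file name

% Step 2 (fast, repeated): search the cached text instead of the disk.
idx   = load('mIndexSketch.mat');
isHit = ~cellfun(@isempty, strfind(idx.texts, 'graph theory'));
hits  = idx.files(isHit)
```

Note that genpath skips some folders (for example, private and class folders), so a real tool would need its own folder traversal. The search here is also case-sensitive and does not sort the results by date the way findInM does.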

Thanks for this great tool, Brett! I do have a couple of suggestions for improvement.

• Every 7 days, the function asks the user whether to re-generate the index file. Perhaps this could be partly automated if the indexing process captured the state of the files (file sizes, modified dates). The function could then recommend re-generating the index only when it notices a change in that state.
• The index file is a DOC file, and DOC files are easy to open and edit. It might be better to use a non-standard extension, so that the index can't be opened accidentally and is easily distinguished from a regular DOC file. For example, in Windows, some folders with images have a thumbnail database called "Thumbs.db". Perhaps findInM could create a file called "mIndex.mi" or something like that.
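
The first suggestion could be sketched roughly as follows. This is only an illustration of the idea, not code from findInM: the file name mIndexState.mat and the variable names are made up, and for brevity it looks at a single folder rather than a whole folder tree. It records each file's name, size, and modified date at index time, then compares that record with the current state before a search.

```matlab
% Illustrative sketch only; not part of findInM.
% At index time: record name, size (bytes), and modified date of each .m file.
listing  = dir(fullfile(pwd, '*.m'));       % assumption: one folder, no subfolders
snapshot = [{listing.name}', {listing.bytes}', {listing.datenum}'];
save('mIndexState.mat', 'snapshot');        % hypothetical state file name

% Later, before searching: rebuild the record and compare it with the saved one.
old     = load('mIndexState.mat');
listing = dir(fullfile(pwd, '*.m'));
current = [{listing.name}', {listing.bytes}', {listing.datenum}'];

if ~isequal(old.snapshot, current)
    disp('Files have changed since indexing; consider re-generating the index.');
end
```

A check like this is cheap because it only reads folder listings, not file contents, so it could run on every search in place of the fixed 7-day prompt.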