grep – Text searching utility

Posted by Jiro Doke, May 25, 2012

Jiro's pick this week is grep: a pedestrian, very fast grep utility by Us.

This week's pick is a recommendation from Yair, who himself is a prominent participant on MATLAB Central. Us has created a number of very useful functions over the years, and there's nothing pedestrian about his entries!

This week, I was in Seattle presenting at the University of Washington, and during the seminar an attendee asked about the best way to scan through a file for a text pattern that marks the beginning of the data. She was scanning the file line by line with textscan to find the text of interest and wondered whether there was a more efficient way. The method she described seemed pretty reasonable: textscan is very efficient at scanning text files, and it's a good choice for extremely large files that you want to read in chunks. Then I remembered Us's grep, which Yair had suggested and which has received many positive responses. I suggested that she take a look at the function and use it in conjunction with textscan to read the data afterwards.
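For comparison, a line-by-line search can be sketched as follows. This is only an illustration of the general idea, not the attendee's actual code: it uses fgetl rather than textscan, and the demo file, its contents, and the variable names are all made up.

```matlab
% Baseline sketch: find the line number of a marker by reading line by line.
% The demo file, its contents, and the marker text are made up for illustration.
fname = [tempname '.txt'];
fid = fopen(fname, 'w');
fprintf(fid, 'Test 1\n1.0 2.0 3.0\n4.0 5.0 6.0\nTest 2\n7.0 8.0 9.0\n');
fclose(fid);

marker = 'Test 2';
markerLine = 0;                  % stays 0 if the marker is never found
lineNo = 0;
fid = fopen(fname, 'r');
tline = fgetl(fid);
while ischar(tline)              % fgetl returns -1 at end of file
    lineNo = lineNo + 1;
    if strcmp(strtrim(tline), marker)   % exact match on the whole line
        markerLine = lineNo;
        break
    end
    tline = fgetl(fid);
end
fclose(fid);
delete(fname);
fprintf('Marker found on line %d\n', markerLine);
```

A loop like this visits every line up to the marker in MATLAB code, which is the work a search utility such as grep takes over.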

Here's a quick example of how it works. In a folder called "data_files", I have 10 files, each of which contains experimental data from 100 tests. Each test is separated by a line indicating the test number.

Test 1
1.776199 3.552398 5.328597
. . .
Test 2
7.250518 4.510056 5.797272
. . .
Test 3
. . .

Because each file contains a different number of data points, the file sizes differ.

% List the data files and print each file's size in MB
fInfo = dir('data_files/*.txt');
s = [{fInfo.name}; num2cell([fInfo.bytes]/2^20)];  % 2-by-N cell: names over sizes
fprintf('%s: %4.1f MB\n', s{:})

ModelResults01.txt:  8.2 MB
ModelResults02.txt:  7.7 MB
ModelResults03.txt: 13.0 MB
ModelResults04.txt:  1.9 MB
ModelResults05.txt:  1.6 MB
ModelResults06.txt: 24.1 MB
ModelResults07.txt: 11.8 MB
ModelResults08.txt:  9.1 MB
ModelResults09.txt: 20.6 MB
ModelResults10.txt: 20.7 MB

Let's say that I want to extract the data for Test 60. That data sits between the header lines for Test 60 and Test 61, so to find where it starts and ends, we can search for the texts "Test 60" and "Test 61".

tic;
[fl, p] = grep('-u -n', {'Test  60', 'Test  61'}, 'data_files/*.txt');
disp(' '); toc;

ModelResults01.txt:175704: Test  60
ModelResults01.txt:178682: Test  61
ModelResults02.txt:164907: Test  60
ModelResults02.txt:167702: Test  61
ModelResults03.txt:278423: Test  60
ModelResults03.txt:283142: Test  61
ModelResults04.txt:40417: Test  60
ModelResults04.txt:41102: Test  61
ModelResults05.txt:33337: Test  60
ModelResults05.txt:33902: Test  61
ModelResults06.txt:514069: Test  60
ModelResults06.txt:522782: Test  61
ModelResults07.txt:251578: Test  60
ModelResults07.txt:255842: Test  61
ModelResults08.txt:194584: Test  60
ModelResults08.txt:197882: Test  61
ModelResults09.txt:440201: Test  60
ModelResults09.txt:447662: Test  61
ModelResults10.txt:440850: Test  60
ModelResults10.txt:448322: Test  61
 
Elapsed time is 1.997956 seconds.

As you can see from the comments on the File Exchange entry page, the function runs extremely efficiently; here it searched about 119 MB across the 10 files in roughly 2 seconds. The outputs from the function provide details about the search result, including line numbers. What I like most about Us's entry is the extensive HTML help he provides for the function. He explains the various options grep takes and the results structure it returns, and he includes several examples to get you started.
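Once the line numbers of the two headers are known, the data reading can be handed to textscan. Here is a minimal sketch, assuming each data line holds three numeric columns (as the sample above suggests) and that there are no blank lines between the headers. It uses a small made-up file in place of the real data, and startLine and endLine are placeholder names for the line numbers reported for the two headers.

```matlab
% Sketch: read the rows between two header lines whose line numbers are known.
% The demo file and the three-column '%f %f %f' format are assumptions.
fname = [tempname '.txt'];
fid = fopen(fname, 'w');
fprintf(fid, 'Test 1\n1 2 3\nTest 2\n4 5 6\n7 8 9\nTest 3\n10 11 12\n');
fclose(fid);

startLine = 3;                       % line of the 'Test 2' header
endLine   = 6;                       % line of the 'Test 3' header
nRows = endLine - startLine - 1;     % data rows between the two headers

fid = fopen(fname, 'r');
% Skip everything up to and including the first header, then read nRows rows
C = textscan(fid, '%f %f %f', nRows, 'HeaderLines', startLine);
fclose(fid);
delete(fname);

data = [C{:}];                       % nRows-by-3 numeric matrix
disp(data)
```

Applied to ModelResults01.txt, the headers reported on lines 175704 and 178682 would give nRows = 178682 - 175704 - 1 = 2977 under the same assumptions.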

Thanks Us for this great utility and Yair for the recommendation!

Comments

If you haven't used this, give it a spin and let us know what you think here or leave a comment for Us.

Please keep nominating your favorite File Exchange entries here.

Published with MATLAB® 7.14