grep – Text searching utility
Jiro's pick this week is grep: a pedestrian, very fast grep utility by Us.
This week's pick is a recommendation from Yair, who himself is a prominent participant on MATLAB Central. Us has created a number of very useful functions over the years, and there's nothing pedestrian about his entries!
This week, I was in Seattle presenting at the University of Washington, and during the seminar an attendee asked about the best way to scan through a file for a text pattern that marks the beginning of data. She was scanning the file line by line using textscan to find the text of interest, and she wondered whether there was a more efficient way. The method she described seemed pretty reasonable. textscan is very efficient at scanning text files, and it's a good function for dealing with extremely large files if you want to read them in chunks. Then I remembered Us's grep. I remembered that Yair had suggested it, and that there were many positive responses to the entry. I suggested that she take a look at the function and use it in conjunction with textscan to do the data reading afterwards.
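For reference, a line-by-line scan of the kind described above might look like the following. This is only a sketch: it uses fgetl rather than textscan for the scan, and the stand-in file, its contents, and the variable names are made up for illustration.

```matlab
% Stand-in file (hypothetical contents) with "Test N" marker lines.
fname = [tempname '.txt'];
fid = fopen(fname, 'w');
fprintf(fid, 'Test 1\n1 2 3\n4 5 6\nTest 2\n7 8 9\nTest 3\n10 11 12\n');
fclose(fid);

pattern = 'Test 2';   % marker text to look for
lineNo  = 0;          % running line counter
found   = 0;          % line number of the first match, 0 if none

fid = fopen(fname, 'r');
tline = fgetl(fid);               % fgetl returns -1 (not a char) at end of file
while ischar(tline)
    lineNo = lineNo + 1;
    % Compare the whole trimmed line so that 'Test 2' does not match 'Test 20'.
    if strcmp(strtrim(tline), pattern)
        found = lineNo;
        break
    end
    tline = fgetl(fid);
end
fclose(fid);
delete(fname);

fprintf('"%s" starts at line %d\n', pattern, found);
```

The loop touches every line in MATLAB code until it hits the marker, which is where the cost comes from on large files. A dedicated search utility such as grep replaces the loop with a single call.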
Here's a quick example of how it works. In a folder called "data_files", I have 10 files, each of which contains experimental data from 100 tests. Each test is separated by a line indicating the test number.
Test 1
1.776199 3.552398 5.328597
. . .
Test 2
7.250518 4.510056 5.797272
. . .
Test 3
. . .
Because each file contains a different number of data points, the file sizes differ.
fInfo = dir('data_files/*.txt');
s = [{fInfo.name}; num2cell([fInfo.bytes]/2^20)];
fprintf('%s: %4.1f MB\n', s{:})
ModelResults01.txt:  8.2 MB
ModelResults02.txt:  7.7 MB
ModelResults03.txt: 13.0 MB
ModelResults04.txt:  1.9 MB
ModelResults05.txt:  1.6 MB
ModelResults06.txt: 24.1 MB
ModelResults07.txt: 11.8 MB
ModelResults08.txt:  9.1 MB
ModelResults09.txt: 20.6 MB
ModelResults10.txt: 20.7 MB
Let's say that I want to extract the data for Test 60. To identify the lines where Test 60 starts and ends, we can search for the strings "Test 60" and "Test 61".
tic;
[fl, p] = grep('-u -n', {'Test 60', 'Test 61'}, 'data_files/*.txt');
disp(' ');
toc;
ModelResults01.txt:175704: Test 60
ModelResults01.txt:178682: Test 61
ModelResults02.txt:164907: Test 60
ModelResults02.txt:167702: Test 61
ModelResults03.txt:278423: Test 60
ModelResults03.txt:283142: Test 61
ModelResults04.txt:40417: Test 60
ModelResults04.txt:41102: Test 61
ModelResults05.txt:33337: Test 60
ModelResults05.txt:33902: Test 61
ModelResults06.txt:514069: Test 60
ModelResults06.txt:522782: Test 61
ModelResults07.txt:251578: Test 60
ModelResults07.txt:255842: Test 61
ModelResults08.txt:194584: Test 60
ModelResults08.txt:197882: Test 61
ModelResults09.txt:440201: Test 60
ModelResults09.txt:447662: Test 61
ModelResults10.txt:440850: Test 60
ModelResults10.txt:448322: Test 61

Elapsed time is 1.997956 seconds.
As you can see from the comments on the File Exchange entry page, the function runs extremely efficiently. The outputs from the function provide details about the search result, including line numbers. What I like most about Us's entry is the extensive HTML help he has on the function. He explains all the various options grep takes and the results structure that it returns, and he includes several examples that get you started.
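Once the marker line numbers are known, the data reading with textscan could go roughly as follows. This is a sketch under assumptions, not a tested recipe for the files above: it assumes three whitespace-separated values per data line (the sample suggests this, but the post does not state it), it hard-codes the two line numbers instead of pulling them out of grep's results structure, and it runs against a small made-up stand-in file so that it is self-contained.

```matlab
% Stand-in file (hypothetical contents) laid out like the data files:
% a "Test N" marker line followed by rows of three values.
fname = [tempname '.txt'];
fid = fopen(fname, 'w');
fprintf(fid, 'Test 1\n1 2 3\n4 5 6\nTest 2\n7 8 9\n10 11 12\nTest 3\n13 14 15\n');
fclose(fid);

% Line numbers of the two marker lines, as a line-numbered search would
% report them. Hard-coded here: "Test 2" is line 4, "Test 3" is line 7.
startLine = 4;
endLine   = 7;

% Rows of data that sit strictly between the two marker lines.
nRows = endLine - startLine - 1;

% Skip everything up to and including the first marker line, then apply
% the three-column format exactly nRows times.
fid = fopen(fname, 'r');
c = textscan(fid, '%f %f %f', nRows, 'HeaderLines', startLine);
fclose(fid);
delete(fname);

data = [c{:}];   % nRows-by-3 numeric matrix
disp(data)
```

For ModelResults01.txt, the output above gives startLine = 175704 and endLine = 178682, so the same call would skip 175704 lines and read the 2977 rows belonging to Test 60.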
Thanks Us for this great utility and Yair for the recommendation!
Comments
If you haven't used this, give it a spin and let us know what you think here or leave a comment for Us.
Please keep nominating your favorite File Exchange entries here.