{"id":3576,"date":"2012-05-25T08:26:26","date_gmt":"2012-05-25T13:26:26","guid":{"rendered":"https:\/\/blogs.mathworks.com\/pick\/?p=3576"},"modified":"2012-05-25T12:03:59","modified_gmt":"2012-05-25T17:03:59","slug":"grep-text-searching-utility","status":"publish","type":"post","link":"https:\/\/blogs.mathworks.com\/pick\/2012\/05\/25\/grep-text-searching-utility\/","title":{"rendered":"grep &#8211; Text searching utility"},"content":{"rendered":"<div xmlns:mwsh=\"https:\/\/www.mathworks.com\/namespace\/mcode\/v1\/syntaxhighlight.dtd\" class=\"content\">\r\n   <p><a href=\"https:\/\/www.mathworks.com\/matlabcentral\/fileexchange\/authors\/15007\">Jiro<\/a>'s pick this week is <a href=\"https:\/\/www.mathworks.com\/matlabcentral\/fileexchange\/9647-grep--a-pedestrian--very-fast-grep-utility\">grep: a pedestrian, very fast grep utility<\/a> by <a href=\"https:\/\/www.mathworks.com\/matlabcentral\/fileexchange\/authors\/4309\">Us<\/a>.\r\n   <\/p>\r\n   <p>This week's pick is a recommendation from <a href=\"https:\/\/www.mathworks.com\/matlabcentral\/fileexchange\/authors\/27420\">Yair<\/a>, who himself is a prominent participant on MATLAB Central. Us has created a number of very useful functions over the years,\r\n      and there's nothing pedestrian about his entries!\r\n   <\/p>\r\n   <p>This week, I was in Seattle presenting at the University of Washington, and during the seminar an attendee asked about the\r\n      best way to scan through a file for a certain text pattern that denotes the beginning of data. She was scanning the file line\r\n      by line using <a href=\"https:\/\/www.mathworks.com\/help\/matlab\/ref\/textscan.html\"><tt>textscan<\/tt><\/a> to find the text of interest, and she was wondering if there was a more efficient way. The method she described seemed pretty\r\n      reasonable.
<tt>textscan<\/tt> is very efficient at scanning text files, and it's a good function for dealing with extremely large files if you want to\r\n      read them in chunks. Then I remembered Us's <tt>grep<\/tt>. I recalled that Yair had <a href=\"https:\/\/blogs.mathworks.com\/pick\/2012\/04\/13\/what-is-your-favorite-unrecognized-file-exchange-submission\/#comment-14491\">suggested<\/a> it and that there were many positive responses to the entry. I suggested that she take a look at the function and use\r\n      it in conjunction with <tt>textscan<\/tt> to do the data reading afterwards.\r\n   <\/p>\r\n   <p>Here's a quick example of how it works. In a folder called \"data_files\", I have 10 files, each of which contains experimental\r\n      data from 100 tests. Each test begins with a line indicating the test number.\r\n   <\/p><pre>   Test  1\r\n   1.776199\r\n   3.552398\r\n   5.328597\r\n     .\r\n     .\r\n     .\r\n   Test  2\r\n   7.250518\r\n   4.510056\r\n   5.797272\r\n     .\r\n     .\r\n     .\r\n   Test  3\r\n     .\r\n     .\r\n     .<\/pre><p>Because each file contains a different number of data points, the file sizes are different.<\/p><pre style=\"background: #F9F7F3; padding: 10px; border: 1px solid rgb(200,200,200)\">fInfo = dir(<span style=\"color: #A020F0\">'data_files\/*.txt'<\/span>);\r\ns = [{fInfo.name}; num2cell([fInfo.bytes]\/2^20)];\r\nfprintf(<span style=\"color: #A020F0\">'%s: %4.1f MB\\n'<\/span>, s{:})<\/pre><pre style=\"font-style:oblique\">ModelResults01.txt:  8.2 MB\r\nModelResults02.txt:  7.7 MB\r\nModelResults03.txt: 13.0 MB\r\nModelResults04.txt:  1.9 MB\r\nModelResults05.txt:  1.6 MB\r\nModelResults06.txt: 24.1 MB\r\nModelResults07.txt: 11.8 MB\r\nModelResults08.txt:  9.1 MB\r\nModelResults09.txt: 20.6 MB\r\nModelResults10.txt: 20.7 MB\r\n<\/pre><p>Let's say that I want to extract data for Test 60.
To identify the lines where Test 60 starts and ends, we can look for the\r\n      strings \"Test  60\" and \"Test  61\".\r\n   <\/p><pre style=\"background: #F9F7F3; padding: 10px; border: 1px solid rgb(200,200,200)\">tic;\r\n[fl, p] = grep(<span style=\"color: #A020F0\">'-u -n'<\/span>, {<span style=\"color: #A020F0\">'Test  60'<\/span>, <span style=\"color: #A020F0\">'Test  61'<\/span>}, <span style=\"color: #A020F0\">'data_files\/*.txt'<\/span>);\r\ndisp(<span style=\"color: #A020F0\">' '<\/span>); toc;<\/pre><pre style=\"font-style:oblique\">ModelResults01.txt:175704: Test  60\r\nModelResults01.txt:178682: Test  61\r\nModelResults02.txt:164907: Test  60\r\nModelResults02.txt:167702: Test  61\r\nModelResults03.txt:278423: Test  60\r\nModelResults03.txt:283142: Test  61\r\nModelResults04.txt:40417: Test  60\r\nModelResults04.txt:41102: Test  61\r\nModelResults05.txt:33337: Test  60\r\nModelResults05.txt:33902: Test  61\r\nModelResults06.txt:514069: Test  60\r\nModelResults06.txt:522782: Test  61\r\nModelResults07.txt:251578: Test  60\r\nModelResults07.txt:255842: Test  61\r\nModelResults08.txt:194584: Test  60\r\nModelResults08.txt:197882: Test  61\r\nModelResults09.txt:440201: Test  60\r\nModelResults09.txt:447662: Test  61\r\nModelResults10.txt:440850: Test  60\r\nModelResults10.txt:448322: Test  61\r\n \r\nElapsed time is 1.997956 seconds.\r\n<\/pre><p>As you can see from the comments on the File Exchange entry page, the function runs extremely efficiently. The outputs from\r\n      the function provide details about the search results, including line numbers. What I like most about Us's entry is the extensive\r\n      HTML help he provides for the function.
He explains all the various options <tt>grep<\/tt> takes and the results structure that it returns, and he includes several examples that get you started.\r\n   <\/p>\r\n   <p>Thanks Us for this great utility and Yair for the recommendation!<\/p>\r\n   <p><b>Comments<\/b><\/p>\r\n   <p>If you haven't used this, give it a spin and let us know what you think <a href=\"https:\/\/blogs.mathworks.com\/pick\/?p=3576#respond\">here<\/a> or leave a <a href=\"https:\/\/www.mathworks.com\/matlabcentral\/fileexchange\/9647-grep--a-pedestrian--very-fast-grep-utility#comments\">comment<\/a> for Us.\r\n   <\/p>\r\n   <p>Please keep nominating your favorite File Exchange entries <a href=\"https:\/\/blogs.mathworks.com\/pick\/2012\/04\/13\/what-is-your-favorite-unrecognized-file-exchange-submission\/#respond\">here<\/a>.\r\n   <\/p><script language=\"JavaScript\">\r\n<!--\r\n\r\n    function grabCode_528ad6a720f74e90975ee77a8750d569() {\r\n        \/\/ Remember the title so we can use it in the new page\r\n        title = document.title;\r\n\r\n        \/\/ Break up these strings so that their presence\r\n        \/\/ in the Javascript doesn't mess up the search for\r\n        \/\/ the MATLAB code.\r\n        t1='528ad6a720f74e90975ee77a8750d569 ' + '##### ' + 'SOURCE BEGIN' + ' #####';\r\n        t2='##### ' + 'SOURCE END' + ' #####' + ' 528ad6a720f74e90975ee77a8750d569';\r\n    \r\n        b=document.getElementsByTagName('body')[0];\r\n        i1=b.innerHTML.indexOf(t1)+t1.length;\r\n        i2=b.innerHTML.indexOf(t2);\r\n \r\n        code_string = b.innerHTML.substring(i1, i2);\r\n        code_string = code_string.replace(\/REPLACE_WITH_DASH_DASH\/g,'--');\r\n\r\n        \/\/ Use \/x3C\/g instead of the less-than character to avoid errors \r\n        \/\/ in the XML parser.\r\n        \/\/ Use '\\x26#60;' instead of '<' so that the XML parser\r\n        \/\/ doesn't go ahead and substitute the less-than character. 
\r\n        code_string = code_string.replace(\/\\x3C\/g, '\\x26#60;');\r\n\r\n        author = 'Jiro Doke';\r\n        copyright = 'Copyright 2012 The MathWorks, Inc.';\r\n\r\n        w = window.open();\r\n        d = w.document;\r\n        d.write('<pre>\\n');\r\n        d.write(code_string);\r\n\r\n        \/\/ Add author and copyright lines at the bottom if specified.\r\n        if ((author.length > 0) || (copyright.length > 0)) {\r\n            d.writeln('');\r\n            d.writeln('%%');\r\n            if (author.length > 0) {\r\n                d.writeln('% _' + author + '_');\r\n            }\r\n            if (copyright.length > 0) {\r\n                d.writeln('% _' + copyright + '_');\r\n            }\r\n        }\r\n\r\n        d.write('<\/pre>\\n');\r\n      \r\n      d.title = title + ' (MATLAB code)';\r\n      d.close();\r\n      }   \r\n      \r\n-->\r\n<\/script><p style=\"text-align: right; font-size: xx-small; font-weight:lighter;   font-style: italic; color: gray\"><br><a href=\"javascript:grabCode_528ad6a720f74e90975ee77a8750d569()\"><span style=\"font-size: x-small;        font-style: italic;\">Get \r\n            the MATLAB code \r\n            <noscript>(requires JavaScript)<\/noscript><\/span><\/a><br><br>\r\n      Published with MATLAB&reg; 7.14<br><\/p>\r\n<\/div>\r\n<!--\r\n528ad6a720f74e90975ee77a8750d569 ##### SOURCE BEGIN #####\r\n%%\r\n% <https:\/\/www.mathworks.com\/matlabcentral\/fileexchange\/authors\/15007\r\n% Jiro>'s pick this week is\r\n% <https:\/\/www.mathworks.com\/matlabcentral\/fileexchange\/9647-grep--a-pedestrian--very-fast-grep-utility grep: a\r\n% pedestrian, very fast grep utility> by\r\n% <https:\/\/www.mathworks.com\/matlabcentral\/fileexchange\/authors\/4309 Us>.\r\n%\r\n% This week's pick is a recommendation from\r\n% <https:\/\/www.mathworks.com\/matlabcentral\/fileexchange\/authors\/27420 Yair>,\r\n% who himself is a prominent participant on MATLAB Central. 
Us has created\r\n% a number of very useful functions over the years, and there's nothing\r\n% pedestrian about his entries!\r\n%\r\n% This week, I was in Seattle presenting at the University of Washington,\r\n% and during the seminar an attendee asked about the best way to scan\r\n% through a file for a certain text pattern that denotes the beginning of\r\n% data. She was scanning the file line by line using\r\n% <https:\/\/www.mathworks.com\/help\/matlab\/ref\/textscan.html |textscan|> to\r\n% find the text of interest, and she was wondering if there was a more\r\n% efficient way. The method she described seemed pretty reasonable.\r\n% |textscan| is very efficient at scanning text files, and it's a good\r\n% function for dealing with extremely large files if you want to read them\r\n% in chunks. Then I remembered Us's |grep|. I recalled that Yair had\r\n% <https:\/\/blogs.mathworks.com\/pick\/2012\/04\/13\/what-is-your-favorite-unrecognized-file-exchange-submission\/#comment-14491\r\n% suggested> it and that there were many positive responses to the entry.\r\n% I suggested that she take a look at the function and use it in\r\n% conjunction with |textscan| to do the data reading afterwards.\r\n%\r\n% Here's a quick example of how it works.
In a folder called \"data_files\",\r\n% I have 10 files, each of which contains experimental data from 100 tests.\r\n% Each test begins with a line indicating the test number.\r\n%\r\n%     Test  1\r\n%     1.776199 \r\n%     3.552398 \r\n%     5.328597\r\n%       .\r\n%       .\r\n%       .\r\n%     Test  2\r\n%     7.250518\r\n%     4.510056\r\n%     5.797272\r\n%       .\r\n%       .\r\n%       .\r\n%     Test  3\r\n%       .\r\n%       .\r\n%       .\r\n%\r\n% Because each file contains a different number of data points, the file\r\n% sizes are different.\r\n\r\nfInfo = dir('data_files\/*.txt');\r\ns = [{fInfo.name}; num2cell([fInfo.bytes]\/2^20)];\r\nfprintf('%s: %4.1f MB\\n', s{:})\r\n\r\n%%\r\n% Let's say that I want to extract data for Test 60. To identify the lines\r\n% where Test 60 starts and ends, we can look for the strings \"Test  60\" and\r\n% \"Test  61\".\r\n\r\ntic;\r\n[fl, p] = grep('-u -n', {'Test  60', 'Test  61'}, 'data_files\/*.txt');\r\ndisp(' '); toc;\r\n\r\n%%\r\n% As you can see from the comments on the File Exchange entry page, the\r\n% function runs extremely efficiently. The outputs from the function\r\n% provide details about the search results, including line numbers. What I\r\n% like most about Us's entry is the extensive HTML help he provides for the\r\n% function.
He explains all the various options |grep| takes and the\r\n% results structure that it returns, and he includes several examples that\r\n% get you started.\r\n% \r\n% Thanks Us for this great utility and Yair for the recommendation!\r\n%\r\n% *Comments*\r\n%\r\n% If you haven't used this, give it a spin and let us know what you think\r\n% <https:\/\/blogs.mathworks.com\/pick\/?p=3576#respond here> or leave a\r\n% <https:\/\/www.mathworks.com\/matlabcentral\/fileexchange\/9647-grep--a-pedestrian--very-fast-grep-utility#comments\r\n% comment> for Us.\r\n%\r\n% Please keep nominating your favorite File Exchange entries\r\n% <https:\/\/blogs.mathworks.com\/pick\/2012\/04\/13\/what-is-your-favorite-unrecognized-file-exchange-submission\/#respond\r\n% here>.\r\n\r\n##### SOURCE END ##### 528ad6a720f74e90975ee77a8750d569\r\n-->","protected":false},"excerpt":{"rendered":"<p>\r\n   Jiro's pick this week is grep: a pedestrian, very fast grep utility by Us.\r\n   \r\n   This week's pick is a recommendation from Yair, who himself is a prominent participant on MATLAB Central. Us... 
<a class=\"read-more\" href=\"https:\/\/blogs.mathworks.com\/pick\/2012\/05\/25\/grep-text-searching-utility\/\">read more >><\/a><\/p>","protected":false},"author":35,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":[],"categories":[16],"tags":[],"_links":{"self":[{"href":"https:\/\/blogs.mathworks.com\/pick\/wp-json\/wp\/v2\/posts\/3576"}],"collection":[{"href":"https:\/\/blogs.mathworks.com\/pick\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blogs.mathworks.com\/pick\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blogs.mathworks.com\/pick\/wp-json\/wp\/v2\/users\/35"}],"replies":[{"embeddable":true,"href":"https:\/\/blogs.mathworks.com\/pick\/wp-json\/wp\/v2\/comments?post=3576"}],"version-history":[{"count":15,"href":"https:\/\/blogs.mathworks.com\/pick\/wp-json\/wp\/v2\/posts\/3576\/revisions"}],"predecessor-version":[{"id":3591,"href":"https:\/\/blogs.mathworks.com\/pick\/wp-json\/wp\/v2\/posts\/3576\/revisions\/3591"}],"wp:attachment":[{"href":"https:\/\/blogs.mathworks.com\/pick\/wp-json\/wp\/v2\/media?parent=3576"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blogs.mathworks.com\/pick\/wp-json\/wp\/v2\/categories?post=3576"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blogs.mathworks.com\/pick\/wp-json\/wp\/v2\/tags?post=3576"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}