{"id":8842,"date":"2017-09-01T09:00:13","date_gmt":"2017-09-01T13:00:13","guid":{"rendered":"https:\/\/blogs.mathworks.com\/pick\/?p=8842"},"modified":"2017-11-03T18:37:46","modified_gmt":"2017-11-03T22:37:46","slug":"reading-the-last-n-lines-from-a-csv-file","status":"publish","type":"post","link":"https:\/\/blogs.mathworks.com\/pick\/2017\/09\/01\/reading-the-last-n-lines-from-a-csv-file\/","title":{"rendered":"Reading the last N lines from a CSV file"},"content":{"rendered":"\r\n<div class=\"content\"><p><a href=\"https:\/\/www.mathworks.com\/matlabcentral\/profile\/authors\/869871\">Jiro<\/a>&#8216;s pick this week is <a href=\"https:\/\/www.mathworks.com\/matlabcentral\/fileexchange\/64278-csvreadtail\"><tt>csvreadtail<\/tt><\/a> by <a href=\"https:\/\/www.mathworks.com\/matlabcentral\/profile\/authors\/4841613\">Mike<\/a>.<\/p><p>Recently, I&#8217;ve been working on customer projects around improving the performance of analysis code. These projects typically involve using the <a href=\"https:\/\/www.mathworks.com\/help\/matlab\/matlab_prog\/profiling-for-improving-performance.html\">Profiler<\/a> to find bottlenecks, <a href=\"https:\/\/www.mathworks.com\/help\/matlab\/matlab_prog\/vectorization.html\">vectorizing<\/a> code wherever possible, and using the appropriate data types that fit the particular tasks. Regarding the last point, I was pleasantly surprised to find out how string manipulations were so much more efficient using the new <a href=\"https:\/\/www.mathworks.com\/help\/matlab\/matlab_prog\/create-string-arrays.html\">string data type<\/a> rather than <a href=\"https:\/\/www.mathworks.com\/help\/matlab\/matlab_prog\/creating-character-arrays.html\">character arrays<\/a> or <a href=\"https:\/\/www.mathworks.com\/help\/matlab\/matlab_prog\/cell-arrays-of-strings.html\">cell array of chars<\/a>.<\/p><p>In some cases, the bottlenecks I found with the Profiler could be solved by vectorizing or changing the algorithm. 
But one area that I found a bit challenging was when the bottleneck involved file I\/O. When you need to read or write from\/to files, there&#8217;s not much you can do in terms of speeding up that process.<\/p><p>Well, that&#8217;s not entirely true. With <a href=\"https:\/\/www.mathworks.com\/help\/matlab\/ref\/readtable.html\"><tt>readtable<\/tt><\/a> or <a href=\"https:\/\/www.mathworks.com\/help\/matlab\/ref\/datastore.html\"><tt>datastore<\/tt><\/a>, you can specify import options to optimize how you import your data. You can do something similar with <a href=\"https:\/\/www.mathworks.com\/help\/matlab\/ref\/textscan.html\"><tt>textscan<\/tt><\/a> by specifying a format spec.<\/p><p>My pick this week falls into this category of efficient file reading. The use case is quite specific, but say you just need to read in the <b>last<\/b> N lines from a large CSV file. If you know beforehand how many lines the file has, <a href=\"https:\/\/www.mathworks.com\/help\/matlab\/ref\/csvread.html\"><tt>csvread<\/tt><\/a> has an option for specifying the range. If you don&#8217;t know the number of lines, how can you read the last N lines? Would you need to first scan the file to figure out how many lines it has? Mike&#8217;s <tt>csvreadtail<\/tt> uses a clever technique of moving the file position, using <a href=\"https:\/\/www.mathworks.com\/help\/matlab\/ref\/fseek.html\"><tt>fseek<\/tt><\/a>, to the end of the file and then &#8220;reading back&#8221; until a specified number of lines have been read. Great idea!<\/p><p><img decoding=\"async\" vspace=\"5\" hspace=\"5\" src=\"https:\/\/blogs.mathworks.com\/images\/pick\/jiro\/potw_csvreadtail\/csvreadtail_screenshot.png\" alt=\"\"> <\/p><p>I do have a few suggestions for improvement, though. 
I&#8217;m pretty certain these will make this code much more efficient.<\/p><div><ul><li>This one is not about efficiency, but in the <tt>parsecsv.m<\/tt> helper function, <a href=\"https:\/\/www.mathworks.com\/help\/matlab\/ref\/strsplit.html\"><tt>strsplit<\/tt><\/a> is used to split up a line of text (with commas) into a vector. That command should set the <a href=\"https:\/\/www.mathworks.com\/help\/matlab\/ref\/strsplit.html#input_argument_d0e965957\"><tt>CollapseDelimiters<\/tt><\/a> option to <tt>false<\/tt>, in order to accurately capture empty entries.<\/li><\/ul><\/div><pre>strsplit(s{iLine},',','CollapseDelimiters',false)<\/pre><div><ul><li>Currently, the way the file position is moved back is by repeatedly calling <tt>fseek<\/tt> relative to the <em>end of file<\/em>. This can be done more efficiently by moving relative to the <em>current position<\/em>. Since the algorithm reads one byte at a time, and each read advances the position by one byte, you can move back 2 bytes from the current position to reach the preceding byte.<\/li><li>Rather than reading one byte at a time, it is more efficient to read a chunk of data (several hundred or several thousand bytes) at a time. The tricky part is figuring out how many bytes to go back and read. Perhaps one approach is to adaptively change the amount to go back. For example, depending on how many lines of data were read with a single read, you can read more (or less) next time.<\/li><\/ul><\/div><p><b>Tall Arrays<\/b><\/p><p>For those of you on R2016b or newer, you may be interested in checking out the new <a href=\"https:\/\/www.mathworks.com\/help\/matlab\/import_export\/tall-arrays.html\">tall array<\/a> capability. Tall arrays allow you to work with large data sets in out-of-memory fashion, only loading parts of data necessary for analysis. 
In that framework, the <a href=\"https:\/\/www.mathworks.com\/help\/matlab\/ref\/tail.html\"><tt>tail<\/tt><\/a> function reads in the last N rows of data.<\/p><p><b>Comments<\/b><\/p><p>Give it a try and let us know what you think <a href=\"https:\/\/blogs.mathworks.com\/pick\/?p=8842#respond\">here<\/a> or leave a <a href=\"https:\/\/www.mathworks.com\/matlabcentral\/fileexchange\/64278-csvreadtail#comment\">comment<\/a> for Mike.<\/p><script language=\"JavaScript\"> <!-- \r\n    function grabCode_eecee464713545db90f4f7e1fb74be09() {\r\n        \/\/ Remember the title so we can use it in the new page\r\n        title = document.title;\r\n\r\n        \/\/ Break up these strings so that their presence\r\n        \/\/ in the Javascript doesn't mess up the search for\r\n        \/\/ the MATLAB code.\r\n        t1='eecee464713545db90f4f7e1fb74be09 ' + '##### ' + 'SOURCE BEGIN' + ' #####';\r\n        t2='##### ' + 'SOURCE END' + ' #####' + ' eecee464713545db90f4f7e1fb74be09';\r\n    \r\n        b=document.getElementsByTagName('body')[0];\r\n        i1=b.innerHTML.indexOf(t1)+t1.length;\r\n        i2=b.innerHTML.indexOf(t2);\r\n \r\n        code_string = b.innerHTML.substring(i1, i2);\r\n        code_string = code_string.replace(\/REPLACE_WITH_DASH_DASH\/g,'--');\r\n\r\n        \/\/ Use \/x3C\/g instead of the less-than character to avoid errors \r\n        \/\/ in the XML parser.\r\n        \/\/ Use '\\x26#60;' instead of '<' so that the XML parser\r\n        \/\/ doesn't go ahead and substitute the less-than character. 
\r\n        code_string = code_string.replace(\/\\x3C\/g, '\\x26#60;');\r\n\r\n        copyright = 'Copyright 2017 The MathWorks, Inc.';\r\n\r\n        w = window.open();\r\n        d = w.document;\r\n        d.write('<pre>\\n');\r\n        d.write(code_string);\r\n\r\n        \/\/ Add copyright line at the bottom if specified.\r\n        if (copyright.length > 0) {\r\n            d.writeln('');\r\n            d.writeln('%%');\r\n            if (copyright.length > 0) {\r\n                d.writeln('% _' + copyright + '_');\r\n            }\r\n        }\r\n\r\n        d.write('<\/pre>\\n');\r\n\r\n        d.title = title + ' (MATLAB code)';\r\n        d.close();\r\n    }   \r\n     --> <\/script><p style=\"text-align: right; font-size: xx-small; font-weight:lighter;   font-style: italic; color: gray\"><br><a href=\"javascript:grabCode_eecee464713545db90f4f7e1fb74be09()\"><span style=\"font-size: x-small;        font-style: italic;\">Get \r\n      the MATLAB code <noscript>(requires JavaScript)<\/noscript><\/span><\/a><br><br>\r\n      Published with MATLAB&reg; R2017a<br><\/p><p class=\"footer\"><br>\r\n      Published with MATLAB&reg; R2017a<br><\/p><\/div><!--\r\neecee464713545db90f4f7e1fb74be09 ##### SOURCE BEGIN #####\r\n%%\r\n% <https:\/\/www.mathworks.com\/matlabcentral\/profile\/authors\/869871 Jiro>'s\r\n% pick this week is\r\n% <https:\/\/www.mathworks.com\/matlabcentral\/fileexchange\/64278-csvreadtail |csvreadtail|>\r\n% by <https:\/\/www.mathworks.com\/matlabcentral\/profile\/authors\/4841613\r\n% Mike>.\r\n%\r\n% Recently, I've been working on customer projects around improving the\r\n% performance of analysis code. 
These projects typically involve using the\r\n% <https:\/\/www.mathworks.com\/help\/matlab\/matlab_prog\/profiling-for-improving-performance.html\r\n% Profiler> to find bottlenecks,\r\n% <https:\/\/www.mathworks.com\/help\/matlab\/matlab_prog\/vectorization.html\r\n% vectorizing> code wherever possible, and using the appropriate data types\r\n% that fit the particular tasks. Regarding the last point, I was pleasantly\r\n% surprised to find out how string manipulations were so much more\r\n% efficient using the new\r\n% <https:\/\/www.mathworks.com\/help\/matlab\/matlab_prog\/create-string-arrays.html\r\n% string data type> rather than\r\n% <https:\/\/www.mathworks.com\/help\/matlab\/matlab_prog\/creating-character-arrays.html\r\n% character arrays> or\r\n% <https:\/\/www.mathworks.com\/help\/matlab\/matlab_prog\/cell-arrays-of-strings.html\r\n% cell array of chars>.\r\n%\r\n% In some cases, the bottlenecks I found with the Profiler could be solved\r\n% by vectorizing or changing the algorithm. But one area that I found a bit\r\n% challenging was when the bottleneck involved file I\/O. When you need to\r\n% read or write from\/to files, there's not much you can do in terms of\r\n% speeding up that process.\r\n%\r\n% Well, that's not entirely true. With\r\n% <https:\/\/www.mathworks.com\/help\/matlab\/ref\/readtable.html |readtable|> or\r\n% <https:\/\/www.mathworks.com\/help\/matlab\/ref\/datastore.html |datastore|>,\r\n% you can specify import options to optimize how you import your data. You\r\n% can do something similar with\r\n% <https:\/\/www.mathworks.com\/help\/matlab\/ref\/textscan.html |textscan|> by\r\n% specifying a format spec.\r\n%\r\n% My pick this week falls into this category of efficient file reading. The\r\n% use case is quite specific, but say you just need to read in the *last* N\r\n% lines from a large CSV file. 
If you know beforehand how many lines the\r\n% file has, <https:\/\/www.mathworks.com\/help\/matlab\/ref\/csvread.html\r\n% |csvread|> has an option for specifying the range. If you don't know the\r\n% number of lines, how can you read the last N lines? Would you need to\r\n% first scan the file to figure out how many lines it has? Mike's\r\n% |csvreadtail| uses a clever technique of moving the file position, using\r\n% <https:\/\/www.mathworks.com\/help\/matlab\/ref\/fseek.html |fseek|>, to the\r\n% end of the file and then \"reading back\" until a specified number of lines\r\n% have been read. Great idea!\r\n%\r\n% <<csvreadtail_screenshot.png>>\r\n%\r\n% I do have a few suggestions for improvement, though. I'm pretty certain\r\n% these will make this code much more efficient.\r\n%\r\n% * This one is not about efficiency, but in the |parsecsv.m| helper\r\n% function, <https:\/\/www.mathworks.com\/help\/matlab\/ref\/strsplit.html\r\n% |strsplit|> is used to split up a line of text (with commas) into a\r\n% vector. That command should set the\r\n% <https:\/\/www.mathworks.com\/help\/matlab\/ref\/strsplit.html#input_argument_d0e965957\r\n% |CollapseDelimiters|> option to |false|, in order to accurately capture\r\n% empty entries.\r\n%\r\n%  strsplit(s{iLine},',','CollapseDelimiters',false)\r\n% \r\n% * Currently, the way the file position is moved back is by repeatedly\r\n% calling |fseek| relative to the _end of file_. This can be done more\r\n% efficiently by moving relative to the _current position_. Since the\r\n% algorithm reads one byte at a time, and each read advances the position\r\n% by one byte, you can move back 2 bytes from the current position to\r\n% reach the preceding byte.\r\n% * Rather than reading one byte at a time, it is more efficient to read a\r\n% chunk of data (several hundred or several thousand bytes) at a time. The\r\n% tricky part is figuring out how many bytes to go back and read. Perhaps\r\n% one approach is to adaptively change the amount to go back. 
For example,\r\n% depending on how many lines of data were read with a single read, you can\r\n% read more (or less) next time.\r\n%\r\n% *Tall Arrays*\r\n%\r\n% For those of you on R2016b or newer, you may be interested in checking\r\n% out the new\r\n% <https:\/\/www.mathworks.com\/help\/matlab\/import_export\/tall-arrays.html\r\n% tall array> capability. Tall arrays allow you to work with large data\r\n% sets in out-of-memory fashion, only loading parts of data necessary for\r\n% analysis. In that framework, the\r\n% <https:\/\/www.mathworks.com\/help\/matlab\/ref\/tail.html |tail|> function\r\n% reads in the last N rows of data.\r\n%\r\n% *Comments*\r\n%\r\n% Give it a try and let us know what you think\r\n% <https:\/\/blogs.mathworks.com\/pick\/?p=8842#respond here> or leave a\r\n% <https:\/\/www.mathworks.com\/matlabcentral\/fileexchange\/64278-csvreadtail#comment\r\n% comment> for Mike.\r\n\r\n##### SOURCE END ##### eecee464713545db90f4f7e1fb74be09\r\n-->","protected":false},"excerpt":{"rendered":"<div class=\"overview-image\"><img decoding=\"async\"  class=\"img-responsive\" src=\"https:\/\/blogs.mathworks.com\/images\/pick\/jiro\/potw_csvreadtail\/csvreadtail_screenshot.png\" onError=\"this.style.display ='none';\" \/><\/div><p>\r\nJiro&#8216;s pick this week is csvreadtail by Mike.Recently, I&#8217;ve been working on customer projects around improving the performance of analysis code. 
These projects typically involve using&#8230; <a class=\"read-more\" href=\"https:\/\/blogs.mathworks.com\/pick\/2017\/09\/01\/reading-the-last-n-lines-from-a-csv-file\/\">read more >><\/a><\/p>","protected":false},"author":35,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":[],"categories":[16],"tags":[],"_links":{"self":[{"href":"https:\/\/blogs.mathworks.com\/pick\/wp-json\/wp\/v2\/posts\/8842"}],"collection":[{"href":"https:\/\/blogs.mathworks.com\/pick\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blogs.mathworks.com\/pick\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blogs.mathworks.com\/pick\/wp-json\/wp\/v2\/users\/35"}],"replies":[{"embeddable":true,"href":"https:\/\/blogs.mathworks.com\/pick\/wp-json\/wp\/v2\/comments?post=8842"}],"version-history":[{"count":6,"href":"https:\/\/blogs.mathworks.com\/pick\/wp-json\/wp\/v2\/posts\/8842\/revisions"}],"predecessor-version":[{"id":8933,"href":"https:\/\/blogs.mathworks.com\/pick\/wp-json\/wp\/v2\/posts\/8842\/revisions\/8933"}],"wp:attachment":[{"href":"https:\/\/blogs.mathworks.com\/pick\/wp-json\/wp\/v2\/media?parent=8842"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blogs.mathworks.com\/pick\/wp-json\/wp\/v2\/categories?post=8842"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blogs.mathworks.com\/pick\/wp-json\/wp\/v2\/tags?post=8842"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}