{"id":2251,"date":"2008-06-13T07:50:51","date_gmt":"2008-06-13T12:50:51","guid":{"rendered":"https:\/\/blogs.mathworks.com\/pick\/2008\/06\/13\/readtext\/"},"modified":"2016-05-11T10:17:47","modified_gmt":"2016-05-11T14:17:47","slug":"readtext","status":"publish","type":"post","link":"https:\/\/blogs.mathworks.com\/pick\/2008\/06\/13\/readtext\/","title":{"rendered":"readtext"},"content":{"rendered":"<div class=\"content\">\r\n\r\nBob's pick this week is <a title=\"https:\/\/www.mathworks.com\/matlabcentral\/fileexchange\/loadFile.do?objectId=10946&amp;objectType=file (link no longer works)\">readtext<\/a> by Peder Axensten.\r\n\r\n&nbsp;\r\n\r\nJiro recently highlighted <tt><a href=\"https:\/\/blogs.mathworks.com\/pick\/2008\/06\/06\/reading-formatted-text\/\">textscantool<\/a><\/tt> which can make it much easier to import text data into MATLAB. But you may have encountered data that frustrates you and\r\n<tt><a href=\"https:\/\/www.mathworks.com\/help\/matlab\/ref\/textscan.html\">textscan<\/a><\/tt>. I recently analyzed some data I got from a web source as a CSV file. The comma seperated values had single quotes around\r\nthem all - both string and numeric types. Here's a sample.\r\n<pre style=\"background: #F9F7F3; padding: 10px; border: 1px solid #c8c8c8;\">type <span style=\"color: #a020f0;\">SampleData.csv<\/span><\/pre>\r\n<pre style=\"font-style: oblique;\">'James Murphy','471'\r\n'John Doe, Jr.','44'\r\n'Bill O'Brien','127'\r\n\r\n<\/pre>\r\nSo there are really three kinds of delimiters per line.\r\n<div>\r\n<ul>\r\n\t<li>a leading quote before the first value<\/li>\r\n\t<li>quote-comma-quote between values<\/li>\r\n\t<li>a trailing quote after the last value<\/li>\r\n<\/ul>\r\n<\/div>\r\nIf you've been doing much text data importing into MATLAB than you probably know that <tt>textscan<\/tt> is good but it cannot parse this file correctly.\r\n<pre style=\"background: #F9F7F3; padding: 10px; border: 1px solid #c8c8c8;\">fid = fopen(<span style=\"color: #a020f0;\">'SampleData.csv'<\/span>);\r\ndata = textscan(fid,<span style=\"color: #a020f0;\">'%q%q'<\/span>,<span style=\"color: #a020f0;\">'delimiter'<\/span>,<span style=\"color: #a020f0;\">','<\/span>)\r\nfclose(fid);<\/pre>\r\n<pre style=\"font-style: oblique;\">data = \r\n    {4x1 cell}    {4x1 cell}\r\n<\/pre>\r\nSee the problem? <tt>data<\/tt> should be 3 (not 4) rows. Look closer at column 1.\r\n<pre style=\"background: #F9F7F3; padding: 10px; border: 1px solid #c8c8c8;\">data{1}<\/pre>\r\n<pre style=\"font-style: oblique;\">ans = \r\n    ''James Murphy''\r\n    ''John Doe'\r\n    ''44''\r\n    ''Bill O'Brien''\r\n<\/pre>\r\nNow look at column 2.\r\n<pre style=\"background: #F9F7F3; padding: 10px; border: 1px solid #c8c8c8;\">data{2}<\/pre>\r\n<pre style=\"font-style: oblique;\">ans = \r\n    ''471''\r\n    'Jr.''\r\n    ''\r\n    ''127''\r\n<\/pre>\r\nAh. The comma in \"John Doe, Jr.\" was interpretted as a delimiter so \"Jr.\" was taken as the second column value. Then the number\r\n\"44\" dropped to the next line. Also notice that all returned cells are strings - even the numeric values. Moreover, most\r\n(but not all) of the cells have those pesky bookend quotes embedded. Yuck!\r\n\r\nThere are lots of ways to solve this problem. <tt>readtext<\/tt> by Peder is one of them. In particular, I'm fascinated by the power of using a regular expression based delimiter.\r\n<pre style=\"background: #F9F7F3; padding: 10px; border: 1px solid #c8c8c8;\">data = readtext(<span style=\"color: #a020f0;\">'SampleData.csv'<\/span>, <span style=\"color: #a020f0;\">'(?m)^''|'',''|''(?m)$'<\/span>)<\/pre>\r\n<pre style=\"font-style: oblique;\">data = \r\n     []    'James Murphy'     [471]\r\n     []    'John Doe, Jr.'    [ 44]\r\n     []    'Bill O'Brien'     [127]\r\n<\/pre>\r\nThe empty first column is an artifact that can easily be suppressed.\r\n<pre style=\"background: #F9F7F3; padding: 10px; border: 1px solid #c8c8c8;\">data(:,1) = []<\/pre>\r\n<pre style=\"font-style: oblique;\">data = \r\n    'James Murphy'     [471]\r\n    'John Doe, Jr.'    [ 44]\r\n    'Bill O'Brien'     [127]\r\n<\/pre>\r\nIn a word - wow! The embedded comma was no problem. Moreover, first column values are strings and second column values are\r\nnumbers. In another word - sweet.\r\n\r\nWhat's your favorite trick or tool for reading particularly nasty data files? Tell us about it <a href=\"https:\/\/blogs.mathworks.com\/pick\/?p=2251#respond\">here<\/a>.\r\n\r\n<script>\/\/ <![CDATA[\r\nfunction grabCode_f680df65a33247ac856be9f4ad68aded() {\r\n        \/\/ Remember the title so we can use it in the new page\r\n        title = document.title;\r\n\r\n        \/\/ Break up these strings so that their presence\r\n        \/\/ in the Javascript doesn't mess up the search for\r\n        \/\/ the MATLAB code.\r\n        t1='f680df65a33247ac856be9f4ad68aded ' + '##### ' + 'SOURCE BEGIN' + ' #####';\r\n        t2='##### ' + 'SOURCE END' + ' #####' + ' f680df65a33247ac856be9f4ad68aded';\r\n    \r\n        b=document.getElementsByTagName('body')[0];\r\n        i1=b.innerHTML.indexOf(t1)+t1.length;\r\n        i2=b.innerHTML.indexOf(t2);\r\n \r\n        code_string = b.innerHTML.substring(i1, i2);\r\n        code_string = code_string.replace(\/REPLACE_WITH_DASH_DASH\/g,'--');\r\n\r\n        \/\/ Use \/x3C\/g instead of the less-than character to avoid errors \r\n        \/\/ in the XML parser.\r\n        \/\/ Use '\\x26#60;' instead of '<' so that the XML parser\r\n        \/\/ doesn't go ahead and substitute the less-than character. \r\n        code_string = code_string.replace(\/\\x3C\/g, '\\x26#60;');\r\n\r\n        author = 'Robert Bemis';\r\n        copyright = 'Copyright 2008 The MathWorks, Inc.';\r\n\r\n        w = window.open();\r\n        d = w.document;\r\n        d.write('\r\n\r\n<pre>\\n');\r\n        d.write(code_string);\r\n\r\n        \/\/ Add author and copyright lines at the bottom if specified.\r\n        if ((author.length > 0) || (copyright.length > 0)) {\r\n            d.writeln('');\r\n            d.writeln('%%');\r\n            if (author.length > 0) {\r\n                d.writeln('% _' + author + '_');\r\n            }\r\n            if (copyright.length > 0) {\r\n                d.writeln('% _' + copyright + '_');\r\n            }\r\n        }\r\n\r\n        d.write('<\/pre>\r\n\r\n\r\n\\n');\r\n      \r\n      d.title = title + ' (MATLAB code)';\r\n      d.close();\r\n      }\r\n\/\/ ]]><\/script>\r\n<p style=\"text-align: right; font-size: xx-small; font-weight: lighter; font-style: italic; color: gray;\">\r\n<a><span style=\"font-size: x-small; font-style: italic;\">Get\r\nthe MATLAB code\r\n<noscript>(requires JavaScript)<\/noscript><\/span><\/a>\r\n\r\nPublished with MATLAB\u00ae 7.6<\/p>\r\n\r\n<\/div>\r\n<!--\r\nf680df65a33247ac856be9f4ad68aded ##### SOURCE BEGIN #####\r\n%% readtext\r\n% <https:\/\/www.mathworks.com\/matlabcentral\/fileexchange\/loadAuthor.do?objectId=969735&objectType=author % Bob>'s pick this week is\r\n% <https:\/\/www.mathworks.com\/matlabcentral\/fileexchange\/loadFile.do?objectId=10946&objectType=file readtext>\r\n% by Peder Axensten.\r\n%%\r\n% Jiro recently highlighted\r\n% |<https:\/\/blogs.mathworks.com\/pick\/2008\/06\/06\/reading-formatted-text\/ textscantoool>|\r\n% which can make it much easier to import text data into MATLAB. But you may\r\n% have encountered data that frustrates you and\r\n% |<https:\/\/www.mathworks.com\/help\/matlab\/ref\/textscan.html textscan>|.\r\n% I recently analyzed some data I got from a web source as a CSV file. The\r\n% comma seperated values had single quotes around them all - both string and\r\n% numeric types. Here's a sample.\r\ntype SampleData.csv\r\n%%\r\n% So there are really three kinds of delimiters per line.\r\n%%\r\n% * a leading quote before the first value\r\n% * quote-comma-quote between values\r\n% * a trailing quote after the last value\r\n%%\r\n% If you've been doing much text data importing into MATLAB than you probably\r\n% know that |textscan| is good but it cannot parse this file correctly.\r\nfid = fopen('SampleData.csv');\r\ndata = textscan(fid,'%q%q','delimiter',',')\r\nfclose(fid);\r\n%%\r\n% See the problem? |data| should be 3 (not 4) rows. Look closer at column 1.\r\ndata{1}\r\n%%\r\n% Now look at column 2.\r\ndata{2}\r\n%%\r\n% Ah. The comma in \"John Doe, Jr.\" was interpretted as a delimiter so \"Jr.\"\r\n% was taken as the second column value. Then the number \"44\" dropped to the\r\n% next line. Also notice that all returned cells are strings - even the\r\n% numeric values. Moreover,  most (but not all) of the cells have those pesky\r\n% bookend quotes embedded. Yuck!\r\n%%\r\n% There are lots of ways to solve this problem. |readtext| by Peder is one of\r\n% them. In particular, I'm fascinated by the power of using a\r\n% <https:\/\/www.mathworks.com\/access\/helpdesk\/help\/techdoc\/matlab_prog\/f0-42649.html regular expression>\r\n% based delimiter.\r\ndata = readtext('SampleData.csv', '(?m)^''|'',''|''(?m)$')\r\n%%\r\n% The empty first column is an artifact that can easily be suppressed.\r\ndata(:,1) = []\r\n%%\r\n% In a word - wow! The embedded comma was no problem. Moreover, first column\r\n% values are strings and second column values are numbers. In another word -\r\n% sweet.\r\n%%\r\n% What's your favorite trick or tool for reading particularly nasty data\r\n% files? Tell us about it\r\n% <https:\/\/blogs.mathworks.com\/pick\/?p=2251#respond here>.\r\n\r\n##### SOURCE END ##### f680df65a33247ac856be9f4ad68aded\r\n-->","protected":false},"excerpt":{"rendered":"<p>\r\n\r\nBob's pick this week is readtext by Peder Axensten.\r\n\r\n&nbsp;\r\n\r\nJiro recently highlighted textscantool which can make it much easier to import text data into MATLAB. But you may have encountered... <a class=\"read-more\" href=\"https:\/\/blogs.mathworks.com\/pick\/2008\/06\/13\/readtext\/\">read more >><\/a><\/p>","protected":false},"author":46,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":[],"categories":[16],"tags":[],"_links":{"self":[{"href":"https:\/\/blogs.mathworks.com\/pick\/wp-json\/wp\/v2\/posts\/2251"}],"collection":[{"href":"https:\/\/blogs.mathworks.com\/pick\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blogs.mathworks.com\/pick\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blogs.mathworks.com\/pick\/wp-json\/wp\/v2\/users\/46"}],"replies":[{"embeddable":true,"href":"https:\/\/blogs.mathworks.com\/pick\/wp-json\/wp\/v2\/comments?post=2251"}],"version-history":[{"count":1,"href":"https:\/\/blogs.mathworks.com\/pick\/wp-json\/wp\/v2\/posts\/2251\/revisions"}],"predecessor-version":[{"id":7149,"href":"https:\/\/blogs.mathworks.com\/pick\/wp-json\/wp\/v2\/posts\/2251\/revisions\/7149"}],"wp:attachment":[{"href":"https:\/\/blogs.mathworks.com\/pick\/wp-json\/wp\/v2\/media?parent=2251"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blogs.mathworks.com\/pick\/wp-json\/wp\/v2\/categories?post=2251"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blogs.mathworks.com\/pick\/wp-json\/wp\/v2\/tags?post=2251"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}