Bob's pick this week is readtext by Peder Axensten.
Jiro recently highlighted textscantool, which can make it much easier to import text data into MATLAB. But you may have encountered data that frustrates both you and
textscan. I recently analyzed some data I got from a web source as a CSV file. The comma-separated values had single quotes around
them all - both string and numeric types. Here's a sample.
So there are really three kinds of delimiters per line:
a leading quote before the first value
quote-comma-quote between values
a trailing quote after the last value
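To make the three delimiter kinds concrete, here is a minimal sketch that expresses them as one regular expression and applies it to a single line with the built-in regexp function. The sample line is an illustrative assumption in the format described above, not the actual file contents, and the variable names are hypothetical.

```matlab
% Illustrative line in the format described above (an assumption,
% not taken from the actual file)
sampleLine = '''John Doe, Jr.'',''44''';

% One pattern covering all three delimiter kinds:
%   ^'     a leading quote before the first value
%   ','    quote-comma-quote between values
%   '$     a trailing quote after the last value
pattern = '^''|'',''|''$';

% Split the line wherever the pattern matches
parts = regexp(sampleLine, pattern, 'split');

% The matches at the very start and very end leave an empty piece on
% each side, so keep only what lies between them
parts = parts(2:end-1);
```

Because a bare comma is not part of the pattern, the comma inside the name is left alone, and so is a lone apostrophe inside a value.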
If you've been doing much text data importing into MATLAB, then you probably know that textscan is good, but it cannot parse this file correctly.
fid = fopen('SampleData.csv');
data = textscan(fid,'%q%q','delimiter',',')
fclose(fid);
data =
{4x1 cell} {4x1 cell}
See the problem? data should have 3 rows, not 4. Look closer at column 1.
data{1}
ans =
''James Murphy''
''John Doe'
''44''
''Bill O'Brien''
Now look at column 2.
data{2}
ans =
''471''
'Jr.''
''
''127''
Ah. The comma in "John Doe, Jr." was interpreted as a delimiter, so "Jr." was taken as the second column value. Then the number
"44" dropped to the next line. Also notice that all returned cells are strings - even the numeric values. Moreover, most
(but not all) of the cells have those pesky bookend quotes embedded. Yuck!
There are lots of ways to solve this problem. readtext by Peder is one of them. In particular, I'm fascinated by the power of using a regular-expression-based delimiter.
data = readtext('SampleData.csv', '(?m)^''|'',''|''(?m)$')
In a word - wow! The embedded comma was no problem. Moreover, first column values are strings and second column values are
numbers. In another word - sweet.
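If you don't have readtext handy, the same regex-delimiter idea can be approximated with built-in functions. The following is only a rough sketch, not how readtext works internally. The text in txt is an assumption: it is reconstructed from the textscan output shown earlier, since the original sample is not reproduced here. For a real file you would load it with fileread('SampleData.csv') instead.

```matlab
% Hypothetical file contents, reconstructed from the textscan output
% above (an assumption). For a real file: txt = fileread('SampleData.csv');
txt = sprintf('%s\n', ...
    '''James Murphy'',''471''', ...
    '''John Doe, Jr.'',''44''', ...
    '''Bill O''Brien'',''127''');

% Split into lines and discard empty ones (e.g. after the final newline)
rows = regexp(txt, '\r?\n', 'split');
rows = rows(~cellfun('isempty', rows));

% Split each line on the leading quote, quote-comma-quote, or trailing quote
data = cell(numel(rows), 2);           % assumes exactly two values per line
for k = 1:numel(rows)
    fields = regexp(rows{k}, '^''|'',''|''$', 'split');
    data(k, :) = fields(2:end-1);      % drop the empty pieces at both ends
end

% Convert values that parse as numbers; everything else stays a string
num = str2double(data);                % NaN wherever the text is not numeric
isnum = ~isnan(num);
data(isnum) = num2cell(num(isnum));
```

With the sample text above, data ends up as a 3-by-2 cell array with strings in the first column and doubles in the second. Note that str2double decides what counts as a number, so a text value such as 'Inf' would also be converted.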
What's your favorite trick or tool for reading particularly nasty data files? Tell us about it here.