File Exchange Pick of the Week

Our best user submissions

readtext

Bob's pick this week is readtext by Peder Axensten. Jiro recently highlighted textscantool, which can make it much easier to import text data into MATLAB. But you may have encountered data that frustrates both you and textscan. I recently analyzed some data I got from a web source as a CSV file. The comma-separated values all had single quotes around them - both string and numeric types. Here's a sample.
type SampleData.csv
'James Murphy','471'
'John Doe, Jr.','44'
'Bill O'Brien','127'

So there are really three kinds of delimiters per line.
  • a leading quote before the first value
  • quote-comma-quote between values
  • a trailing quote after the last value
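As a quick check of that idea, the three delimiter kinds can be folded into one regular expression and tried on a single line with MATLAB's built-in regexp. This is only a sketch: the variable names are made up for illustration, and it assumes no value ever contains the quote-comma-quote sequence itself.

```matlab
% One line from the sample file (single quotes are doubled inside a MATLAB string)
txt = '''John Doe, Jr.'',''44''';

% Three alternatives: leading quote | quote-comma-quote | trailing quote
pattern = '^''|'',''|''$';

% Split the line wherever the pattern matches
parts = regexp(txt, pattern, 'split');

% Drop the empty strings left by the matches at the start and end of the line
parts = parts(~cellfun('isempty', parts));
% parts is now {'John Doe, Jr.', '44'}
```

The embedded comma survives because a bare comma is not one of the three alternatives. The matches at the very start and very end of the line each leave an empty string behind, which is why the last step filters them out.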
If you've been importing much text data into MATLAB, then you probably know that textscan is good, but it cannot parse this file correctly.
fid = fopen('SampleData.csv');
data = textscan(fid,'%q%q','delimiter',',')
fclose(fid);
data = 
    {4x1 cell}    {4x1 cell}
See the problem? data should have 3 rows, not 4. Look closer at column 1.
data{1}
ans = 
    ''James Murphy''
    ''John Doe'
    ''44''
    ''Bill O'Brien''
Now look at column 2.
data{2}
ans = 
    ''471''
    'Jr.''
    ''
    ''127''
Ah. The comma in "John Doe, Jr." was interpreted as a delimiter, so "Jr." was taken as the second column value. Then the number "44" dropped to the next line. The %q format only strips double quotes, so the single quotes here gave the embedded comma no protection. Also notice that all returned cells are strings - even the numeric values. Moreover, most (but not all) of the cells have those pesky bookend quotes embedded. Yuck!

There are lots of ways to solve this problem. readtext by Peder is one of them. In particular, I'm fascinated by the power of using a regular expression based delimiter.
data = readtext('SampleData.csv', '(?m)^''|'',''|''(?m)$')
data = 
     []    'James Murphy'     [471]
     []    'John Doe, Jr.'    [ 44]
     []    'Bill O'Brien'     [127]
The empty first column is an artifact: the leading quote on each line is itself a delimiter, so there is an empty field in front of it. It is easily removed.
data(:,1) = []
data = 
    'James Murphy'     [471]
    'John Doe, Jr.'    [ 44]
    'Bill O'Brien'     [127]
In a word - wow! The embedded comma was no problem. Moreover, first column values are strings and second column values are numbers. In another word - sweet. What's your favorite trick or tool for reading particularly nasty data files? Tell us about it here.
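How readtext decides which fields become numbers isn't shown here, but the effect can be approximated with built-in functions. A minimal sketch, assuming the fields have already been split into a cell array of strings (raw below is typed in by hand purely for illustration): str2double returns NaN for anything that doesn't parse as a number, so those cells are left as strings.

```matlab
% Fields as they might look after splitting, all still strings (illustrative data)
raw = {'James Murphy',  '471'
       'John Doe, Jr.', '44'
       'Bill O''Brien', '127'};

% str2double works element-wise on a cell array and gives NaN for non-numeric text
nums  = str2double(raw);
isNum = ~isnan(nums);

% Replace only the cells that parsed as numbers; leave the rest untouched
data = raw;
data(isNum) = num2cell(nums(isNum));
```

With this approach a field whose text is literally NaN would stay a string, which may or may not be what you want.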

Published with MATLAB® 7.6
