File Exchange Pick of the Week

June 13th, 2008

readtext

Bob's pick this week is readtext by Peder Axensten.

Jiro recently highlighted textscantool, which can make it much easier to import text data into MATLAB. But you may have encountered data that frustrates both you and textscan. I recently analyzed some data I got from a web source as a CSV file. The comma-separated values all had single quotes around them, both string and numeric types. Here's a sample.

type SampleData.csv
'James Murphy','471'
'John Doe, Jr.','44'
'Bill O'Brien','127'

So there are really three kinds of delimiters per line.

  • a leading quote before the first value
  • quote-comma-quote between values
  • a trailing quote after the last value
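
To make those three delimiter kinds concrete, here is a minimal sketch that splits one line of the sample with MATLAB's built-in regexp and its 'split' option. The variable names txt, parts, and fields are made up for illustration, and this is not a description of how any File Exchange submission works internally.

```matlab
% One line of the sample file: 'John Doe, Jr.','44'
% (single quotes are doubled inside a MATLAB string literal)
txt = '''John Doe, Jr.'',''44''';

% Split on any of the three delimiter kinds:
%   ^'    leading quote before the first value
%   ','   quote-comma-quote between values
%   '$    trailing quote after the last value
parts = regexp(txt, '^''|'',''|''$', 'split');

% The delimiters at the start and end of the line leave an empty
% piece on each side, so keep only the pieces in between.
fields = parts(2:end-1);

disp(fields)
```

The comma inside "John Doe, Jr." survives because a bare comma is not one of the delimiters; only the quote-comma-quote sequence is.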

If you've been doing much text data importing into MATLAB, then you probably know that textscan is good, but it cannot parse this file correctly.

fid = fopen('SampleData.csv');
data = textscan(fid,'%q%q','delimiter',',')
fclose(fid);
data = 
    {4x1 cell}    {4x1 cell}

See the problem? data should have 3 rows, not 4. Look closer at column 1.

data{1}
ans = 
    ''James Murphy''
    ''John Doe'
    ''44''
    ''Bill O'Brien''

Now look at column 2.

data{2}
ans = 
    ''471''
    'Jr.''
    ''
    ''127''

Ah. The comma in "John Doe, Jr." was interpreted as a delimiter, so "Jr." was taken as the second column value. Then the number "44" dropped to the next line. Also notice that all returned cells are strings - even the numeric values. Moreover, most (but not all) of the cells have those pesky bookend quotes embedded. Yuck!

There are lots of ways to solve this problem. readtext by Peder is one of them. In particular, I'm fascinated by the power of using a regular-expression-based delimiter.

data = readtext('SampleData.csv', '(?m)^''|'',''|''(?m)$')
data = 
     []    'James Murphy'     [471]
     []    'John Doe, Jr.'    [ 44]
     []    'Bill O'Brien'     [127]

The empty first column is an artifact that can easily be suppressed.

data(:,1) = []
data = 
    'James Murphy'     [471]
    'John Doe, Jr.'    [ 44]
    'Bill O'Brien'     [127]

In a word - wow! The embedded comma was no problem. Moreover, first column values are strings and second column values are numbers. In another word - sweet.
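
For comparison, a similar result can be approximated with built-in functions alone. The sketch below is only an illustration under a few assumptions: it recreates the three sample lines in a temporary file (fname is a made-up name) so that it runs on its own, it treats any field that str2double can convert as a number, and it assumes every line has the same number of fields. readtext handles far more cases than this.

```matlab
% Recreate the sample data in a temporary file so the sketch is self-contained.
fname = [tempname '.csv'];           % hypothetical file name, for illustration only
fid = fopen(fname, 'w');
fprintf(fid, '%s\n', ...
    '''James Murphy'',''471''', ...
    '''John Doe, Jr.'',''44''', ...
    '''Bill O''Brien'',''127''');
fclose(fid);

% Read line by line, splitting on the three delimiter kinds.
rows = {};
fid = fopen(fname);
txt = fgetl(fid);
while ischar(txt)
    parts  = regexp(txt, '^''|'',''|''$', 'split');
    fields = parts(2:end-1);         % drop the empty pieces at both ends
    nums   = str2double(fields);     % NaN wherever a field is not numeric
    for k = find(~isnan(nums))
        fields{k} = nums(k);         % store numeric fields as doubles
    end
    rows(end+1, :) = fields;         % assumes the same field count on every line
    txt = fgetl(fid);
end
fclose(fid);
delete(fname);                       % clean up the temporary file

disp(rows)
```

One design caveat: str2double will also convert text such as "Inf" or "NaN", so a name column containing those strings would need extra care.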

What's your favorite trick or tool for reading particularly nasty data files? Tell us about it here.



Published with MATLAB® 7.6

3 Responses to “readtext”

  1. Rick Stauf replied on :

    I’m a novice at Matlab scripts but this textscantool works
    fine for our current needs.

    I just got textscantool to work on a project at work. I had to change the ‘verlessthan’ to use the ‘getversion’ instead since 7.1 doesn’t recognize verlessthan or I don’t have this script in the .m folders. I copied the ‘getversion’ from your website, I think, and then modified the code to be the following instead:

    matver = getversion;
    if (matver < 7.6) % Before 8a

    I’d like to fine-tune the procedure more. I think there may be version compatibility issues that would be good to resolve.

    One issue is with the following line:
    com.mathworks.mlservices.MLEditorServices.newDocument(str,true);

    Evidently, version 7.1 does not have this newDocument capability. Every time the generate-code step is launched, an error regarding newDocument is encountered.

    The other is not a problem just a desire. It would be helpful if the browse window was able to navigate to the same server location each time.

  2. Bob replied on :

    Rick, it looks like you are using Tim Davis’ getversion. To get a copy of verLessThan() for older MATLAB versions please see solution 1-38LI61 on the Tech Support web site.

    Also note that Stuart’s textscantool requires R2007a (MATLAB 7.4). In general, suggestions for a particular submission should be made in comments to that submission. That way the author gets notified and your chances of getting a response are greatly improved. :)

  3. Rick Stauf replied on :

    Thanks for your help. The getversion works. Stuart responded to my email so I can now communicate with him directly. I’d like to get his script working for my version of matlab. He gave me some code to fix the problem for 7.1. Thanks for your protocol instruction.

Bob, Brett & Jiro share their favorite user-contributed submissions from the File Exchange.

These postings are the author's and don't necessarily represent the opinions of The MathWorks.