File Exchange Pick of the Week

Our best user submissions

readtext

Posted by Robert Bemis

Bob's pick this week is readtext by Peder Axensten.

Jiro recently highlighted textscantool, which can make it much easier to import text data into MATLAB. But you may have encountered data that frustrates both you and textscan. I recently analyzed some data I got from a web source as a CSV file. The comma-separated values all had single quotes around them, both string and numeric types. Here's a sample.
type SampleData.csv
'James Murphy','471'
'John Doe, Jr.','44'
'Bill O'Brien','127'

So there are really three kinds of delimiters per line.
  • a leading quote before the first value
  • quote-comma-quote between values
  • a trailing quote after the last value
If you've been doing much text data importing into MATLAB, then you probably know that textscan is good, but it cannot parse this file correctly.
fid = fopen('SampleData.csv');
data = textscan(fid,'%q%q','delimiter',',')
fclose(fid);
data = 
    {4x1 cell}    {4x1 cell}
See the problem? data should have 3 rows, not 4. Look closer at column 1.
data{1}
ans = 
    ''James Murphy''
    ''John Doe'
    ''44''
    ''Bill O'Brien''
Now look at column 2.
data{2}
ans = 
    ''471''
    'Jr.''
    ''
    ''127''
Ah. The comma in "John Doe, Jr." was interpreted as a delimiter, so "Jr." was taken as the second column value and the number "44" dropped to the next row. Also notice that all returned cells are strings, even the numeric values. Moreover, most (but not all) of the cells still have those pesky bookend quotes embedded, because %q strips double quotes, not single quotes. Yuck!

There are lots of ways to solve this problem. readtext by Peder is one of them. In particular, I'm fascinated by the power of using a regular-expression-based delimiter.
data = readtext('SampleData.csv', '(?m)^''|'',''|''(?m)$')
data = 
     []    'James Murphy'     [471]
     []    'John Doe, Jr.'    [ 44]
     []    'Bill O'Brien'     [127]
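To see what the delimiter pattern is doing, here is a minimal sketch in base MATLAB that applies the same three alternatives to a single line using regexp with its 'split' option. It is only an illustration, not readtext's implementation: the variable names are made up, the (?m) flags are dropped because only one line is processed, and it assumes a MATLAB release whose regexp supports 'split'.

```matlab
% One line from the sample file (quotes are doubled inside a MATLAB literal).
txt = '''John Doe, Jr.'',''44''';

% Three alternatives: leading quote | quote-comma-quote | trailing quote.
pattern = '^''|'',''|''$';

% Splitting on every match leaves an empty piece before the leading quote
% and another after the trailing quote.
pieces = regexp(txt, pattern, 'split');

% Keep only the two real fields.
values = pieces(2:end-1);
disp(values)
```

The empty first piece comes from the leading quote being treated as a delimiter with nothing in front of it, which is likely the same reason an empty first column shows up in the readtext result.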
The empty first column is an artifact that can easily be suppressed.
data(:,1) = []
data = 
    'James Murphy'     [471]
    'John Doe, Jr.'    [ 44]
    'Bill O'Brien'     [127]
In a word - wow! The embedded comma was no problem. Moreover, first column values are strings and second column values are numbers. In another word - sweet.

What's your favorite trick or tool for reading particularly nasty data files? Tell us about it here.
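For readers who cannot install readtext, a similar result can be sketched in base MATLAB by reading the file line by line and splitting each line on a regular expression. This is a rough sketch under stated assumptions, not a replacement for readtext: it writes the three sample lines to a temporary file so it runs on its own, it assumes every line has exactly two quoted fields, and it treats any field that str2double can convert as numeric (so a text field such as 'Inf' would also be converted).

```matlab
% Write the three sample lines to a temporary file (hypothetical location).
fname = [tempname '.csv'];
fid = fopen(fname, 'w');
fprintf(fid, '%s\n', '''James Murphy'',''471''', ...
                     '''John Doe, Jr.'',''44''', ...
                     '''Bill O''Brien'',''127''');
fclose(fid);

% Read the file back one line at a time.
fid = fopen(fname, 'r');
data = cell(0, 2);                    % assumes two fields per line
tline = fgetl(fid);
while ischar(tline)
    % Split on leading quote, quote-comma-quote, or trailing quote.
    fields = regexp(tline, '^''|'',''|''$', 'split');
    fields = fields(2:end-1);         % drop the empty pieces at both ends

    % Convert fields that look numeric; others come back as NaN and stay text.
    nums = str2double(fields);
    isNum = ~isnan(nums);
    fields(isNum) = num2cell(nums(isNum));

    data(end+1, :) = fields;          % append one row
    tline = fgetl(fid);
end
fclose(fid);
delete(fname);

disp(data)
```

Growing data one row at a time is fine for a small file like this one; for large files, preallocating or collecting the lines first would be a better choice.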

Published with MATLAB® 7.6

Note

Comments are closed.

3 Comments

Bob replied (1 of 3):
Rick, note that Stuart's textscantool requires R2007a (MATLAB 7.4). In general, suggestions for a particular submission should be made in comments to that submission. That way the author gets notified and your chances of getting a response are greatly improved. :)
Rick Stauf replied (2 of 3):
I'm a novice at MATLAB scripts, but this textscantool works fine for our current needs. I just got textscantool to work on a project at work. I had to change the 'verlessthan' call to use 'getversion' instead, since 7.1 doesn't recognize verlessthan, or I don't have this script in the .m folders. I copied 'getversion' from your website, I think, and then modified the code to be the following instead:
matver = getversion;
if (matver < 7.6) % Before 8a
I'd like to fine-tune the procedure more. I think there may be version compatibility issues that would be good to resolve. One issue is with the following line:
com.mathworks.mlservices.MLEditorServices.newDocument(str,true);
Evidently, version 7.1 does not have this newDocument capability. Every time the generate-code step is launched, an error regarding newDocument is encountered. The other is not a problem, just a desire: it would be helpful if the browse window navigated to the same server location each time.
Rick Stauf replied (3 of 3):
Thanks for your help. The getversion works. Stuart responded to my email, so I can now communicate with him directly. I'd like to get his script working for my version of MATLAB. He gave me some code to fix the problem for 7.1. Thanks for your protocol instruction.