Reading Big Data into MATLAB

Posted by Loren Shure, December 3, 2014

10 views (last 30 days) | 1 Likes | 8 comments

Today I’d like to introduce guest blogger Sarah Wait Zaranek who works for the MATLAB Marketing team here at MathWorks. Sarah previously has written about a variety of topics. Mostly recently, she cowrote a post with me about the new webcam capabilities in MATLAB. Today, Sarah will be discussing datastore, one of the new big data capabilities introduced in MATLAB R2014b.

About the Data
What is a Datastore?
Defining Our Input Data
Creating the DataStore
Preview the Data
Select Data to Import
Adjust Variable Format
Read in First Chunk
Example 1: Read Selected Columns of Data for Use in Memory
Example 2: Filter Data Down to a Subset for Use in Memory
Example 3: Perform Analysis on Chunks of Data and Combine the Results
Extending the Use of Datastore
Conclusion

About the Data

datastore is used for reading data that is too large to fit in memory. For this example, we will be reading in data from the vehicle census of Massachusetts. It is a catalog of information about vehicle registered from 2008 to 2011. The dataset contains information about individual cars registered including vehicle type, location where the vehicle is housed, rated MPG, and measured CO_2 emissions. You can learn more about the data and even download it yourself, here. I have renamed the files in the demo for clarity's sake, but this is where they came from originally.

What is a Datastore?

As mentioned, a datastore is an object useful for reading collections of data that are too large to fit in memory.

Defining Our Input Data

datastore can work with a single file or a collection of files. In this case, we will be reading from a single file. Our file does not include variable names at the top of the file. They are listed in separate header file as defined below.

% Define Data File and Header File
datafile = 'vehiclebig.csv';
headerfile = 'varnames.txt';

% Read in Variable Names
fileID = fopen(headerfile);
varnames = textscan(fileID,'%s');
varnames = varnames{:};
fclose(fileID);

Creating the DataStore

We can now create our datastore by giving the name of the data file as the import to the datastore function. We also specify our datastore not use the first row of our file as variable names. We will set those variable names explicitly using the names found in the 'varnames.txt' file.

ds = datastore(datafile,'ReadVariableNames',false);

% Set Variable Names
ds.VariableNames = varnames

ds = 
  TabularTextDatastore with properties:

                      Files: {
                             'H:\Documents\LOREN\MyJob\Art of MATLAB\SarahZ\datastore\vehiclebig.csv'
                             }
          ReadVariableNames: false
              VariableNames: {'record_id', 'vin_id', 'plate_id' ... and 42 more}

  Text Format Properties:
             NumHeaderLines: 0
                  Delimiter: ','
               RowDelimiter: '\r\n'
             TreatAsMissing: ''
               MissingValue: NaN

  Advanced Text Format Properties:
            TextscanFormats: {'%f', '%f', '%f' ... and 42 more}
         ExponentCharacters: 'eEdD'
               CommentStyle: ''
                 Whitespace: ' \b\t'
    MultipleDelimitersAsOne: false

  Properties that control the table returned by preview, read, readall:
      SelectedVariableNames: {'record_id', 'vin_id', 'plate_id' ... and 42 more}
            SelectedFormats: {'%f', '%f', '%f' ... and 42 more}
                RowsPerRead: 20000

Note that we haven't read in our data yet. We have just provided an easy way to access it through ds, our datastore.

Preview the Data

A really nice thing about a datastore is that you can preview your data without having to load it all into memory. datastore reads the data into a table which is a data type in MATLAB designed to work well with tabular data.

data = preview(ds);
whos data

data(:,1:7) % Look at first 7 variables

  Name      Size            Bytes  Class    Attributes

  data      8x45            21426  table              

ans = 
    record_id    vin_id     plate_id       me_id       owner_type    start_odom     start_date 
    _________    ______    __________    __________    __________    __________    ____________
     2           1         5.3466e+06    1.2801e+07    1               NaN         '2011-11-07'
     4           1         5.3466e+06    1.1499e+07    1               NaN         '2009-08-27'
     7           2         6.6148e+06    1.2801e+07    1               NaN         '2011-11-19'
     9           2         6.6148e+06    1.1499e+07    1               NaN         '2008-07-01'
    10           3         6.4173e+06    1.2801e+07    1               NaN         '2011-12-06'
     1           1         5.3466e+06             1    1             30490         '2009-09-01'
     3           1         5.3466e+06             2    1             55155         '2010-10-02'
     5           2         6.6148e+06             3    1                 5         '2008-07-02'

By default, datastore will read in every column of our dataset. datastore makes an educated guess for the appropriate format for each column (variable) of our data. We can, however, specify a subset of columns or different formats if we wish.

Select Data to Import

We can specify which variables (columns) by setting the SelectedVariableNames property of our datastore. In this case, we only want to bring in 5 columns out of the 45.

ds.SelectedVariableNames = {'model_year', 'veh_type', ...
    'curbwt','mpg_adj','hybrid'};

preview(ds)

ans = 
    model_year    veh_type    curbwt    mpg_adj    hybrid
    __________    ________    ______    _______    ______
    2008          'Car'       3500      21.65      0     
    2008          'Car'       3500      22.54      0     
    2008          'SUV'       4500         16      0     
    2008          'SUV'       4500         17      0     
    2005          'Truck'     5000      13.29      0     
    2008          'Car'       3500      22.09      0     
    2008          'Car'       3500      21.65      0     
    2008          'SUV'       4500      16.66      0

Adjust Variable Format

We can adjust the format of the data we wish to access by using the SelectedFormats property. We can specify to bring in the vehicle type as a categorical variable by using the %C specifier. You can learn more here about the benefits of using categorical arrays.

ds.SelectedFormats;
ds.SelectedFormats{2} = '%C'  % read  in as a categorical

ds = 
  TabularTextDatastore with properties:

                      Files: {
                             'H:\Documents\LOREN\MyJob\Art of MATLAB\SarahZ\datastore\vehiclebig.csv'
                             }
          ReadVariableNames: false
              VariableNames: {'record_id', 'vin_id', 'plate_id' ... and 42 more}

  Text Format Properties:
             NumHeaderLines: 0
                  Delimiter: ','
               RowDelimiter: '\r\n'
             TreatAsMissing: ''
               MissingValue: NaN

  Advanced Text Format Properties:
            TextscanFormats: {'%*q', '%*q', '%*q' ... and 42 more}
         ExponentCharacters: 'eEdD'
               CommentStyle: ''
                 Whitespace: ' \b\t'
    MultipleDelimitersAsOne: false

  Properties that control the table returned by preview, read, readall:
      SelectedVariableNames: {'model_year', 'veh_type', 'curbwt' ... and 2 more}
            SelectedFormats: {'%f', '%C', '%f' ... and 2 more}
                RowsPerRead: 20000

Read in First Chunk

We can use the read function to read in a chunk of our data. By default, read reads in 20000 rows at a time. This value can be adjusted using the RowsPerRead property.

testdata = read(ds);
whos testdata

  Name              Size             Bytes  Class    Attributes

  testdata      20000x5             683152  table

After you read in a chunk, you can use the hasdata function to see if there is still additional data available to read from the datastore.

hasdata(ds)

ans =
     1

By using hasdata and read in a while loop with your datastore, you can read in your entire dataset a piece at a time. We will put in a counter just to track how many read operations took place in our loop.

counter = 0;

while hasdata(ds)
    % Read in Chunk
    dataChunk = read(ds);
    counter = counter + 1;
end

counter

counter =
   825

By using reset, we can reset our datastore and start reading at the beginning of the file.

reset(ds)

Now that we see how to get started using datastore, let's look at 3 different ways to use it to work with a dataset that does not entirely fit in the memory of your machine.

Example 1: Read Selected Columns of Data for Use in Memory

If you are interested in only processing certain columns of your text file and those columns can fit in memory, you can use datastore to bring in those particular columns from your text file. Then, you can work with that data directly in memory. In this example, we are only interested in the model year and vehicle type of the cars that were registered. We can use readall instead of read to import all the selected data instead of just a chunk of it at a time.

ds.SelectedVariableNames = {'model_year', 'veh_type'};

cardata = readall(ds);
whos cardata

  Name                Size                Bytes  Class    Attributes

  cardata      16145383x2             161456272  table

Now that you have the data read into MATLAB, you can work with it like you would normally work with your data in MATLAB. For this example, we will just use the new histogram function introduced in R2014b to look at the distribution of vehicle model years registered.

figure
histogram(cardata.model_year)
hold on
histogram(cardata.model_year(cardata.veh_type == 'Car'))
hold off

xlabel('Model Year')
legend({'all vehicles', 'only cars'},'Location','southwest')

Example 2: Filter Data Down to a Subset for Use in Memory

Another way to subset your data is to filter the data down a chunk at time. Using datastore you can read in a chunk of data and keep only data you need from that chunk. You then continue this process, chunk by chunk, until you reach the end of the file and have only the subset of the data you want to use.

In this case, we want to extract a subset of data for cars that were registered in 2011. The new variables we are loading in (e.g., q1_2011), contain either a one or zero. Ones represent valid car registrations during that time. So we only save the rows which contain a valid registration sometime in 2011 and discard the rest.

reset(ds)
ds.SelectedVariableNames = {'model_year','veh_type',...
    'q1_2011','q2_2011','q3_2011','q4_2011'};

data2011 = table;

while hasdata(ds)

    % Read in Chunk
    dataChunk = read(ds);

    % Find if Valid During Any Quarter in 2011
    reg2011 = sum(dataChunk{:,3:end},2);

    % Extract Data to Keep (Cars Registered in 2011)
    idx = reg2011 >= 1 & dataChunk.veh_type == 'Car';

    % Save to Final Table
    data2011 = [data2011; dataChunk(idx,1:2)];

end

whos data2011

figure
histogram(data2011.model_year)
xlabel('Model Year')
legend({'cars registered in 2011'})

  Name                Size               Bytes  Class    Attributes

  data2011      3503434x2             35036782  table

Example 3: Perform Analysis on Chunks of Data and Combine the Results

But what if we can't hold the subset of the data we are interested in analyzing in memory? We could instead process the data a section at a time and then combine intermediate results to get a final result. In this case, let's look at the % of hybrid cars registered every quarter (in terms of total cars registered). So, we compute a running total of the number of cars registered per quarter as well as the number of hybrids registered per quarter. Then, we calculate the final % when we have read through the entire dataset.

% Reset Datastore
reset(ds)

% Select Data to Import
quarterNames = varnames(end-15:end);
ds.SelectedVariableNames = [{'veh_type', 'hybrid'} quarterNames'];

% Read in Vehicle Type as a Categorical Variable
ds.SelectedFormats{1} = '%C';

totalCars = zeros(length(quarterNames),1);
totalHybrids = zeros(length(quarterNames),1);

while hasdata(ds)

    % Read in Chunk
    dataChunk = read(ds);

    for ii = 1:length(quarterNames) % Loop over car model years

    % Extract Data
    idx = dataChunk{:,quarterNames(ii)}== 1 & dataChunk.veh_type == 'Car';
    idxHy = idx & dataChunk.hybrid == 1;

    % Perform Calculation
    totalCarsChunk = sum(idx);
    totalHybridsChunk = sum(idxHy);

    % Save Result
    totalCars(ii) = totalCarsChunk + totalCars(ii);
    totalHybrids(ii) = totalHybridsChunk + totalHybrids(ii);

    end
end

percentHybrid = (totalHybrids./totalCars)*100;

figure
scatter(1:length(percentHybrid),percentHybrid,'filled')
xlabel('Inspection Year')
ylabel('% Hybrids')

% Label tick axes
ax = gca;
ax.TickLabelInterpreter = 'none';
ax.XTick = 1:length(quarterNames);
ax.XTickLabel = quarterNames;
ax.XTickLabelRotation = -45;

Extending the Use of Datastore

You can also use datastore as the first step to creating and running your own MapReduce algorithms in MATLAB. Perhaps this topic will be a blog post in the future.