File Exchange Pick of the Week

Our best user submissions

Cell2Underlying

Posted by Sean de Wolski

Sean’s pick this week is cell2underlying by the MathWorks Parallel Computing Toolbox Team.

 

Datastores

What is a datastore? You may have already seen me using one here.

Datastores are a way to point at a collection of data and describe how it is stored. You can create them for text files,
spreadsheets, images, databases, Hadoop, or anything you can write a reader for. This last option is my favorite because
it means I never need to write a kludgy for-loop over dir again.

Here’s a simple example with a directory containing 10 Excel files with fuel economy data:

ds = spreadsheetDatastore('.\data\*.xlsx')
ds = 

  SpreadsheetDatastore with properties:

                      Files: {
                             'C:\Documents\MATLAB\potw\Cell2Underlying\data\2000dat.xlsx';
                             'C:\Documents\MATLAB\potw\Cell2Underlying\data\2001dat.xlsx';
                             'C:\Documents\MATLAB\potw\Cell2Underlying\data\2002dat.xlsx'
                              ... and 6 more
                             }
                     Sheets: ''
                      Range: ''

  Sheet Format Properties:
             NumHeaderLines: 0
          ReadVariableNames: true
              VariableNames: {'Year', 'MfrName', 'CarLine' ... and 21 more}
              VariableTypes: {'double', 'char', 'char' ... and 21 more}

  Properties that control the table returned by preview, read, readall:
      SelectedVariableNames: {'Year', 'MfrName', 'CarLine' ... and 21 more}
      SelectedVariableTypes: {'double', 'char', 'char' ... and 21 more}
                   ReadSize: 'file'

At this point, we have loaded no data. We can load it partially with read or entirely with readall:

T = readall(ds);

Now I have a table containing all 10 Excel sheets’ worth of data. Let’s find the most powerful car:

T(T.RatedHP==max(T.RatedHP), {'MfrName', 'CarLine', 'RatedHP'})
ans =

  2×3 table

              MfrName               CarLine     RatedHP
    ____________________________    ________    _______

    'Bugatti Automobiles S.A.S.'    'VEYRON'    1001   
    'Bugatti Automobiles S.A.S.'    'VEYRON'    1001   

Not too surprising.
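For the partial-loading route with read, a minimal sketch might look like the following. It assumes the same ds created above; because ReadSize is 'file', each call to read should return one file's table. The names maxHP and Tk are only illustrative.

```matlab
% Sketch: find the highest rated horsepower without calling readall.
% Assumes ds is the spreadsheetDatastore created above.
reset(ds);                      % start again from the first file
maxHP = -Inf;                   % running maximum (illustrative name)
while hasdata(ds)
    Tk = read(ds);              % one file's worth of rows, as a table
    maxHP = max(maxHP, max(Tk.RatedHP));
end
disp(maxHP)
```

Only one file's data is held in memory at a time, which is the same idea tall arrays automate.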

Tall Arrays

I was able to read all of those files into MATLAB because they’re not particularly big, so they are no problem for the memory of my
laptop. However, the main design case for datastores is working with data that are far too big to fit in memory. Sitting
on top of a datastore is something called a tall array: an array that lives out of memory but can be used like any other array in MATLAB. Tall arrays can
represent Big Data at the scale of whatever your favorite decimal prefix is, and they can live locally, on a cluster, in the cloud, or in Spark/Hadoop.

Here’s the same example with a tall array.

T = tall(ds);
gather(T(T.RatedHP==max(T.RatedHP), {'MfrName', 'CarLine', 'RatedHP'}))
Evaluating tall expression using the Parallel Pool 'local':
- Pass 1 of 2: Completed in 2 sec
- Pass 2 of 2: Completed in 2 sec
Evaluation completed in 6 sec

ans =

  2×3 table

              MfrName               CarLine     RatedHP
    ____________________________    ________    _______

    'Bugatti Automobiles S.A.S.'    'VEYRON'    1001   
    'Bugatti Automobiles S.A.S.'    'VEYRON'    1001   

The only difference is the gather, which actually forces MATLAB to hit the disk. Without it, MATLAB defers evaluation, which allows other operations to be
queued up to minimize passes through the data and improve performance.

Cell2Underlying

So where does cell2underlying play into this? When building a tall array from a fileDatastore, you get back a cell array with the results for each file. If you want it “flattened” into the underlying format, like a
table for the above use-case, then use cell2underlying.
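As an in-memory analogy only (this is not the implementation of cell2underlying), flattening a cell array of per-file tables into one table can be sketched with vertcat. The two small tables below are made-up stand-ins for the results of two files.

```matlab
% In-memory analogy: a cell array with one table per "file".
% The values are invented purely for illustration.
C = {table([1; 2], [2.3; 2.4], 'VariableNames', {'Time', 'Power'}); ...
     table(3, 2.5, 'VariableNames', {'Time', 'Power'})};

% Concatenate the tables vertically into a single 3-row table.
flat = vertcat(C{:});
disp(flat)
```

With a tall cell array the data are not in memory, so this direct concatenation is not available in the same way; that is the gap cell2underlying fills.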

Here is an example where I want to build the tall array, but need a custom read function to parse the files created by a hardware
device.

ds = fileDatastore('.\data\*.txt', 'ReadFcn', @readTestFile);
T =  tall(ds);
TF = cell2underlying(T);
whos T TF
display(TF)
  Name      Size            Bytes  Class    Attributes

  T         3x1                25  tall               
  TF        Mx2              1551  tall               


TF =

  M×2 tall table

    Time    Power
    ____    _____

    27      2.349
    28      2.349
    29      2.304
    30      2.286
    31      2.286
    32      2.304
    33      2.286
    34      2.286
    :       :
    :       :
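The post does not show readTestFile, so the following is only a hypothetical sketch of what such a read function might look like. It assumes each text file has one header line followed by two comma-separated numeric columns; the real device format may well differ.

```matlab
function T = readTestFile(filename)
% Hypothetical reader for one device file (the format is an assumption):
% one header line, then two comma-separated numeric columns.
T = readtable(filename, ...
    'Delimiter', ',', ...
    'HeaderLines', 1, ...
    'ReadVariableNames', false);

% Name the columns to match the tall table shown above.
T.Properties.VariableNames = {'Time', 'Power'};
end
```

Whatever the function returns for each file becomes one cell of the tall array T, which is why returning a table per file lets cell2underlying produce a tall table.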

With T, I would have to work with cell indexing.  For example, even a simple operation like taking the max becomes this:

gather(max(cellfun(@(x)max(x.Power),T)));

Max is easy because the max of the per-file maxes is the overall max; I don’t have to weight things by file size as I would for a mean or standard deviation.
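To see why the mean needs weighting while the max does not, here is a small in-memory sketch. The two vectors in parts are made-up stand-ins for the Power columns of two files of different lengths; none of the numbers come from the data above.

```matlab
% Two "files" of different lengths (invented values).
parts = {[1; 2; 3], [10; 20]};

n        = cellfun(@numel, parts);   % rows per file
partMax  = cellfun(@max,   parts);   % per-file maximum
partMean = cellfun(@mean,  parts);   % per-file mean

overallMax   = max(partMax);                     % max of maxes is the true max
naiveMean    = mean(partMean);                   % ignores file sizes
weightedMean = sum(n .* partMean) / sum(n);      % weights each file by its rows
trueMean     = mean(vertcat(parts{:}));          % mean over all rows at once

fprintf('max %g, naive mean %g, weighted mean %g, true mean %g\n', ...
    overallMax, naiveMean, weightedMean, trueMean);
```

The naive mean of the per-file means differs from the true mean, while the size-weighted version matches it.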

With TF I can operate directly on the table variables, as shown below. Also note that for all three calculations, it only hits the disk once!
This extends to more complicated operations as well, such as machine learning algorithms.

Pstd = std(TF.Power);
Pmean = mean(TF.Power);
Pmax = max(TF.Power);
[Pstd, Pmean, Pmax] = gather(Pstd, Pmean, Pmax)
Evaluating tall expression using the Parallel Pool 'local':
- Pass 1 of 1: Completed in 0 sec
Evaluation completed in 1 sec

Pstd =

   2.613130685159478


Pmean =

   3.140150370869870


Pmax =

  19.359000000000002

To use cell2underlying, copy it into the following folder: [matlabroot '\toolbox\matlab\bigdata\@tall'] and then run rehash toolboxcache. You will need administrator privileges to copy it in. Alternatively, you can put it in an @tall folder anywhere on the MATLAB path.
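The first option could be scripted roughly as follows. This is a sketch that assumes the downloaded file is named cell2underlying.m, sits in the current folder, and that MATLAB is running with permission to write into its installation folder.

```matlab
% Sketch of the install steps described above.
% Assumption: cell2underlying.m is in the current folder.
target = fullfile(matlabroot, 'toolbox', 'matlab', 'bigdata', '@tall');
copyfile('cell2underlying.m', target);

% Refresh MATLAB's cache of toolbox files so the new method is found.
rehash toolboxcache

% exist returns 2 when the file is present at the target location.
exist(fullfile(target, 'cell2underlying.m'), 'file')
```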

Comments

Give it a try and let us know what you think here or leave a comment for The MathWorks Parallel Computing Toolbox Team.

Published with MATLAB® R2017a
