# Cell2Underlying

Sean’s pick this week is cell2underlying by the MathWorks Parallel Computing Toolbox Team.

### Datastores

What is a datastore? You may have already seen me using one here.

Datastores are a way to point at a collection of data and describe how they’re stored. You can create them for text files,
spreadsheets, images, databases, Hadoop, or anything you can write a reader for. This last option is my favorite because
it means I never need to write a kludgy for-loop over dir again.

Here’s a simple example with a directory containing 10 Excel files with fuel economy data:

ds = spreadsheetDatastore('.\data\*.xlsx')
ds =

Files: {
'C:\Documents\MATLAB\potw\Cell2Underlying\data\2000dat.xlsx';
'C:\Documents\MATLAB\potw\Cell2Underlying\data\2001dat.xlsx';
'C:\Documents\MATLAB\potw\Cell2Underlying\data\2002dat.xlsx'
... and 6 more
}
Sheets: ''
Range: ''

Sheet Format Properties:
VariableNames: {'Year', 'MfrName', 'CarLine' ... and 21 more}
VariableTypes: {'double', 'char', 'char' ... and 21 more}

Properties that control the table returned by preview, read, readall:
SelectedVariableNames: {'Year', 'MfrName', 'CarLine' ... and 21 more}
SelectedVariableTypes: {'double', 'char', 'char' ... and 21 more}



At this point, we have loaded no data. We can load it partially with read or entirely with readall:

T = readall(ds);

Now I have a table containing all 10 Excel sheets’ worth of data. Let’s find the most powerful car:

T(T.RatedHP==max(T.RatedHP), {'MfrName', 'CarLine', 'RatedHP'})
ans =

2×3 table

MfrName               CarLine     RatedHP
____________________________    ________    _______

'Bugatti Automobiles S.A.S.'    'VEYRON'    1001
'Bugatti Automobiles S.A.S.'    'VEYRON'    1001



Not too surprising.
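
For comparison, here is roughly what the partial-loading route with read looks like. This is a minimal sketch, not the code from this post: it assumes two tiny made-up CSV files written to a temporary folder and uses a tabularTextDatastore in place of the Excel files above, so it can run anywhere.

```matlab
% Assumption: two tiny CSV files stand in for the Excel files in the post.
folder = tempname;
mkdir(folder);
writetable(table([2000; 2000], [150; 320], 'VariableNames', {'Year', 'RatedHP'}), ...
    fullfile(folder, 'a.csv'));
writetable(table([2001; 2001], [1001; 90], 'VariableNames', {'Year', 'RatedHP'}), ...
    fullfile(folder, 'b.csv'));

ds = tabularTextDatastore(fullfile(folder, '*.csv'));

% Track a running maximum while holding only one chunk in memory at a time.
maxHP = -Inf;
while hasdata(ds)
    chunk = read(ds);                        % next chunk, returned as a table
    maxHP = max(maxHP, max(chunk.RatedHP));
end
reset(ds);                                   % rewind so the datastore can be read again
```

The loop never needs the whole data set at once, which is the same idea tall arrays automate in the next section.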

### Tall Arrays

I was able to read all of those files into MATLAB because they’re not particularly big, so they fit comfortably in my laptop’s memory.
However, the main design case for datastores is working with data that are way too big to fit in memory. Sitting
on top of a datastore is something called a tall array: an array that lives out of memory but can be used like any other array in MATLAB. Tall arrays can
then represent Big Data of whatever size your favorite decimal prefix implies, and the data can live locally, on a cluster, in the cloud, or in Spark/Hadoop.

Here’s the same example with a tall array.

T = tall(ds);
gather(T(T.RatedHP==max(T.RatedHP), {'MfrName', 'CarLine', 'RatedHP'}))
Evaluating tall expression using the Parallel Pool 'local':
- Pass 1 of 2: Completed in 2 sec
- Pass 2 of 2: Completed in 2 sec
Evaluation completed in 6 sec

ans =

2×3 table

MfrName               CarLine     RatedHP
____________________________    ________    _______

'Bugatti Automobiles S.A.S.'    'VEYRON'    1001
'Bugatti Automobiles S.A.S.'    'VEYRON'    1001



The only difference is the gather, which forces MATLAB to actually hit the disk. Without it, evaluation is deferred. This allows other operations to be
queued up so that MATLAB can minimize the number of passes through the data.

### Cell2Underlying

So where does cell2underlying come into this? When you build a tall array from a fileDatastore, you get back a tall cell array with one cell holding the result for each file. If you want it “flattened” into the underlying format, like a
table for the use case above, then use cell2underlying.
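
The per-file cell structure is easy to see without tall arrays at all. Below is a minimal in-memory sketch, assuming two small made-up text files and a stand-in reader built on readtable (it is not the reader used in this post). The vertcat call only illustrates what “flattening” means; it is not a claim about how cell2underlying is implemented.

```matlab
% Assumption: two small delimited text files stand in for real device logs.
folder = tempname;
mkdir(folder);
writetable(table([1; 2], [2.3; 2.4], 'VariableNames', {'Time', 'Power'}), ...
    fullfile(folder, 'run1.txt'));
writetable(table([3; 4; 5], [2.2; 2.5; 2.1], 'VariableNames', {'Time', 'Power'}), ...
    fullfile(folder, 'run2.txt'));

% Hypothetical reader for illustration only.
readFcn = @(f) readtable(f, 'FileType', 'text');
fds = fileDatastore(fullfile(folder, '*.txt'), 'ReadFcn', readFcn);

C = readall(fds);        % cell array with one table per file
flat = vertcat(C{:});    % a single table holding the rows of every file
```

Here C has one cell per file, while flat is a single table with Time and Power variables, which is the shape you want to compute on.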

Here is an example where I want to build the tall array, but need a custom read function to parse the files created by a hardware
device.

ds = fileDatastore('.\data\*.txt', 'ReadFcn', @readTestFile);
T = tall(ds);
TF = cell2underlying(T);
whos T TF
display(TF)
  Name      Size            Bytes  Class    Attributes

T         3x1                25  tall
TF        Mx2              1551  tall

TF =

M×2 tall table

Time    Power
____    _____

27      2.349
28      2.349
29      2.304
30      2.286
31      2.286
32      2.304
33      2.286
34      2.286
:       :
:       :



With T, I would have to work with cell indexing.  For example, even a simple operation like taking the max becomes this:

gather(max(cellfun(@(x)max(x.Power),T)));

Max is easy because the max of the per-file maxima is the overall max; I don’t have to weight things by file size as I would for a mean or standard deviation.
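
To make the weighting point concrete, here is a small in-memory sketch with made-up numbers: two “files” of different lengths, held in an ordinary cell array rather than a tall one. The values and variable names are illustrative assumptions only.

```matlab
% Made-up data: two "files" of different lengths.
C = {table([1; 2; 3; 4], 'VariableNames', {'Power'});
     table(10, 'VariableNames', {'Power'})};

% The max composes directly: take the max of the per-file maxima.
Pmax = max(cellfun(@(x) max(x.Power), C));

% The mean does not: averaging the per-file means ignores file length.
naiveMean = mean(cellfun(@(x) mean(x.Power), C));

% Weighting by row count (sum of sums over sum of counts) gives the true mean.
total = sum(cellfun(@(x) sum(x.Power), C));
count = sum(cellfun(@(x) height(x), C));
Pmean = total / count;
```

With these numbers the naive mean of means is 6.25, while the true mean over all five rows is 4, so the bookkeeping matters as soon as the files differ in length.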

With TF I can operate directly on the table variables, as below. Also note that the three calculations hit the disk only once!
This extends to more complicated operations as well, like machine learning algorithms.

Pstd = std(TF.Power);
Pmean = mean(TF.Power);
Pmax = max(TF.Power);
[Pstd, Pmean, Pmax] = gather(Pstd, Pmean, Pmax)
Evaluating tall expression using the Parallel Pool 'local':
- Pass 1 of 1: Completed in 0 sec
Evaluation completed in 1 sec

Pstd =

2.613130685159478

Pmean =

3.140150370869870

Pmax =

19.359000000000002



To use cell2underlying, copy it into the following folder: [matlabroot '\toolbox\matlab\bigdata\@tall'] and then run rehash toolboxcache. You will need administrator privileges to copy it in. Alternatively, you can put it in an @tall folder anywhere on the MATLAB path.