Cell2Underlying
Sean’s pick this week is cell2underlying by the MathWorks Parallel Computing Toolbox Team.
Datastores
What is a datastore? You may have already seen me using one here.
Datastores are a way to point at a collection of data and describe how it’s stored. You can create them for text files,
spreadsheets, images, databases, Hadoop, or anything you can write a reader for. This last option is my favorite because
it means I never need to write a kludgy for-loop over dir again.
Here’s a simple example with a directory containing 10 Excel files with fuel economy data:
ds = spreadsheetDatastore('.\data\*.xlsx')
ds =
  SpreadsheetDatastore with properties:

                      Files: {
                             'C:\Documents\MATLAB\potw\Cell2Underlying\data\2000dat.xlsx';
                             'C:\Documents\MATLAB\potw\Cell2Underlying\data\2001dat.xlsx';
                             'C:\Documents\MATLAB\potw\Cell2Underlying\data\2002dat.xlsx'
                              ... and 6 more
                             }
                     Sheets: ''
                      Range: ''

  Sheet Format Properties:
             NumHeaderLines: 0
          ReadVariableNames: true
              VariableNames: {'Year', 'MfrName', 'CarLine' ... and 21 more}
              VariableTypes: {'double', 'char', 'char' ... and 21 more}

  Properties that control the table returned by preview, read, readall:
      SelectedVariableNames: {'Year', 'MfrName', 'CarLine' ... and 21 more}
      SelectedVariableTypes: {'double', 'char', 'char' ... and 21 more}
                   ReadSize: 'file'
At this point, we have loaded no data. We can load it partially with read or entirely with readall:
T = readall(ds);
Now I have a table containing all 10 Excel sheets’ worth of data. Let’s find the most powerful car:
T(T.RatedHP==max(T.RatedHP), {'MfrName', 'CarLine', 'RatedHP'})
ans =
  2×3 table
              MfrName               CarLine     RatedHP
    ____________________________    ________    _______
    'Bugatti Automobiles S.A.S.'    'VEYRON'    1001
    'Bugatti Automobiles S.A.S.'    'VEYRON'    1001
Not too surprising.
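The partial loading with read mentioned above can be sketched as a loop that keeps only a running result in memory. This is a minimal sketch, not the code used for this post: to stay self-contained it writes a few small CSV files with made-up RatedHP values to a temporary folder and uses tabularTextDatastore instead of the spreadsheet datastore shown earlier.

```matlab
% Build a small throwaway dataset so the sketch is self-contained.
% The file names and values are made up for illustration.
folder = tempname;
mkdir(folder);
for k = 1:3
    RatedHP = (100*k + (1:4))';
    writetable(table(RatedHP), fullfile(folder, sprintf('part%d.csv', k)));
end

ds = tabularTextDatastore(fullfile(folder, '*.csv'));

% Read one chunk at a time and keep only a running maximum in memory.
maxHP = -Inf;
while hasdata(ds)
    chunk = read(ds);                        % next block of rows, as a table
    maxHP = max(maxHP, max(chunk.RatedHP));
end
disp(maxHP)
```

The same hasdata/read pattern should carry over to other datastore types; only the construction of ds changes.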
Tall Arrays
I was able to read all of those files into MATLAB because they’re not particularly big, so they pose no problem for the memory of my
laptop. However, the main design case for datastores is working with data that are way too big to fit in memory. Sitting
on top of a datastore is something called a tall array: an array that lives out of memory but can be used like any other array in MATLAB. Tall arrays can
represent Big Data of whatever size your favorite decimal prefix describes, and the data can live locally, on a cluster, in the cloud, or in Spark/Hadoop.
Here’s the same example with a tall array.
T = tall(ds);
gather(T(T.RatedHP==max(T.RatedHP), {'MfrName', 'CarLine', 'RatedHP'}))
Evaluating tall expression using the Parallel Pool 'local':
- Pass 1 of 2: Completed in 2 sec
- Pass 2 of 2: Completed in 2 sec
Evaluation completed in 6 sec
ans =
  2×3 table
              MfrName               CarLine     RatedHP
    ____________________________    ________    _______
    'Bugatti Automobiles S.A.S.'    'VEYRON'    1001
    'Bugatti Automobiles S.A.S.'    'VEYRON'    1001
Apart from wrapping the datastore in tall, the only difference is the gather, which forces MATLAB to actually hit the disk. Without it, evaluation is deferred. Deferring allows other operations to be
queued up, which minimizes passes through the data and improves performance.
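Deferred evaluation can be seen on a tiny example. The sketch below assumes nothing about the fuel economy files: it builds a tall array from an in-memory vector, queues two operations, and evaluates both with a single gather.

```matlab
x = tall((1:10)');           % tall array backed by an in-memory column vector

total   = sum(x);            % deferred: still an unevaluated tall scalar
largest = max(x);            % deferred as well
stillDeferred = isa(total, 'tall');

% One gather call evaluates both queued operations together.
[total, largest] = gather(total, largest);
```

After the gather, total and largest are ordinary in-memory doubles (55 and 10 here).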
Cell2Underlying
So where does cell2underlying play into this? When building a tall array from a fileDatastore, you get back a tall cell array with one cell holding the result for each file. If you want it “flattened” into the underlying format, like a
table for the above use-case, then use cell2underlying.
Here is an example where I want to build the tall array, but need a custom read function to parse the files created by a hardware
device.
ds = fileDatastore('.\data\*.txt', 'ReadFcn', @readTestFile);
T = tall(ds);
TF = cell2underlying(T);
whos T TF
display(TF)
  Name      Size      Bytes  Class    Attributes

  T         3x1          25  tall
  TF        Mx2        1551  tall

TF =
  M×2 tall table
    Time    Power
    ____    _____
    27      2.349
    28      2.349
    29      2.304
    30      2.286
    31      2.286
    32      2.304
    33      2.286
    34      2.286
    :       :
    :       :
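The readTestFile function itself is not shown in this post. As a purely hypothetical sketch, a reader of this kind might look like the following, assuming each text file holds two numeric columns (time and power) with no header line. The real device format may well differ.

```matlab
function data = readTestFile(filename)
% Hypothetical reader, not the one used in this post.
% Assumes two numeric columns (time, power) and no header line.
data = readtable(filename, 'FileType', 'text', 'ReadVariableNames', false);
data = data(:, 1:2);                                 % keep the first two columns
data.Properties.VariableNames = {'Time', 'Power'};   % names seen in the TF display
end
```

The important part is that every file comes back as a table with the same variables, since that is what lets the per-file results be flattened into a single tall table.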
With T, I would have to work with cell indexing. For example, even a simple operation like taking the max becomes this:
gather(max(cellfun(@(x)max(x.Power),T)));
Max is easy because the max of the per-file maxima is the overall max; I don’t have to weight things by file size as I would for a mean or standard deviation.
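To see why a mean needs weighting, here is a small in-memory sketch with two made-up “files” of different lengths, held in an ordinary cell array rather than a tall one. Averaging the per-file means gives the wrong answer; dividing the total sum by the total row count gives the right one.

```matlab
% Two made-up "files" of different lengths (values are illustrative only).
C = {table([1; 2; 3], 'VariableNames', {'Power'});
     table(10,        'VariableNames', {'Power'})};

% Naive: average of per-file means ignores how many rows each file has.
naiveMean = mean(cellfun(@(x) mean(x.Power), C));     % (2 + 10) / 2

% Weighted: total sum divided by total row count.
sums   = cellfun(@(x) sum(x.Power), C);
counts = cellfun(@height, C);
weightedMean = sum(sums) / sum(counts);               % 16 / 4

% Reference: mean over the flattened data.
flatMean = mean([1; 2; 3; 10]);
```

Here naiveMean is 6, while weightedMean and flatMean are both 4.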
With TF I can operate directly on the table variables, as shown below. Also note how the three calculations hit the disk only once!
This extends to more complicated operations as well, like machine learning algorithms.
Pstd = std(TF.Power);
Pmean = mean(TF.Power);
Pmax = max(TF.Power);
[Pstd, Pmean, Pmax] = gather(Pstd, Pmean, Pmax)
Evaluating tall expression using the Parallel Pool 'local':
- Pass 1 of 1: Completed in 0 sec
Evaluation completed in 1 sec
Pstd =
   2.613130685159478
Pmean =
   3.140150370869870
Pmax =
  19.359000000000002
To use cell2underlying, copy it into the following folder: [matlabroot '\toolbox\matlab\bigdata\@tall'] and then run rehash toolboxcache. You will need administrator privileges to copy it in. Alternatively, you can put it in an @tall folder anywhere on the MATLAB path.
Comments
Give it a try and let us know what you think here or leave a comment for The MathWorks Parallel Computing Toolbox Team.
Published with MATLAB® R2017a