{"id":8599,"date":"2017-05-26T09:00:42","date_gmt":"2017-05-26T13:00:42","guid":{"rendered":"https:\/\/blogs.mathworks.com\/pick\/?p=8599"},"modified":"2017-05-22T10:10:02","modified_gmt":"2017-05-22T14:10:02","slug":"cell2underlying","status":"publish","type":"post","link":"https:\/\/blogs.mathworks.com\/pick\/2017\/05\/26\/cell2underlying\/","title":{"rendered":"Cell2Underlying"},"content":{"rendered":"<div class=\"content\">\n<p><a href=\"https:\/\/www.mathworks.com\/matlabcentral\/profile\/authors\/3208495\">Sean<\/a>&#8217;s pick this week is <a href=\"https:\/\/www.mathworks.com\/matlabcentral\/fileexchange\/61517-cell2underlying-for-tall-arrays\">cell2underlying<\/a> by <a href=\"https:\/\/www.mathworks.com\/matlabcentral\/profile\/authors\/4855226\">MathWorks Parallel Computing Toolbox Team<\/a>.<\/p>\n<p>&nbsp;<\/p>\n<h3>Contents<\/h3>\n<div>\n<ul>\n<li><a href=\"#1\">Datastores<\/a><\/li>\n<li><a href=\"#5\">Tall Arrays<\/a><\/li>\n<li><a href=\"#7\">Cell2Underlying<\/a><\/li>\n<li><a href=\"#10\">Comments<\/a><\/li>\n<\/ul>\n<\/div>\n<h3>Datastores<a name=\"1\"><\/a><\/h3>\n<p>What is a <a href=\"https:\/\/www.mathworks.com\/help\/matlab\/ref\/datastore.html\"><tt>datastore<\/tt><\/a>? You may have already seen me using one <a href=\"https:\/\/blogs.mathworks.com\/pick\/2017\/04\/28\/the-speech-transmission-index-sti\/\">here<\/a>.<\/p>\n<p>Datastores are a way to point at a collection of data and describe how they&#8217;re stored. You can create them for text files,<br \/>\nspreadsheets, images, databases, Hadoop, or anything you can write a reader for. 
This last option is my favorite because<br \/>\nit means I never need to write a kludgy <tt>for<\/tt>-loop over <tt>dir<\/tt> again.<\/p>\n<p>Here&#8217;s a simple example with a directory containing 10 Excel files with fuel economy data:<\/p>\n<pre style=\"background: #F9F7F3; padding: 10px; border: 1px solid #c8c8c8;\">ds = spreadsheetDatastore(<span style=\"color: #a020f0;\">'.\\data\\*.xlsx'<\/span>)<\/pre>\n<pre style=\"font-style: oblique;\">ds = \r\n\r\n  SpreadsheetDatastore with properties:\r\n\r\n                      Files: {\r\n                             'C:\\Documents\\MATLAB\\potw\\Cell2Underlying\\data\\2000dat.xlsx';\r\n                             'C:\\Documents\\MATLAB\\potw\\Cell2Underlying\\data\\2001dat.xlsx';\r\n                             'C:\\Documents\\MATLAB\\potw\\Cell2Underlying\\data\\2002dat.xlsx'\r\n                              ... and 6 more\r\n                             }\r\n                     Sheets: ''\r\n                      Range: ''\r\n\r\n  Sheet Format Properties:\r\n             NumHeaderLines: 0\r\n          ReadVariableNames: true\r\n              VariableNames: {'Year', 'MfrName', 'CarLine' ... and 21 more}\r\n              VariableTypes: {'double', 'char', 'char' ... and 21 more}\r\n\r\n  Properties that control the table returned by preview, read, readall:\r\n      SelectedVariableNames: {'Year', 'MfrName', 'CarLine' ... and 21 more}\r\n      SelectedVariableTypes: {'double', 'char', 'char' ... and 21 more}\r\n                   ReadSize: 'file'\r\n\r\n<\/pre>\n<p>At this point, we have loaded no data. We can load it partially with <tt>read<\/tt> or entirely with <tt>readall<\/tt>:<\/p>\n<pre style=\"background: #F9F7F3; padding: 10px; border: 1px solid #c8c8c8;\">T = readall(ds);<\/pre>\n<p>Now I have a table containing all 10 Excel sheets&#8217; worth of data. 
Let&#8217;s find the most powerful car:<\/p>\n<pre style=\"background: #F9F7F3; padding: 10px; border: 1px solid #c8c8c8;\">T(T.RatedHP==max(T.RatedHP), {<span style=\"color: #a020f0;\">'MfrName'<\/span>, <span style=\"color: #a020f0;\">'CarLine'<\/span>, <span style=\"color: #a020f0;\">'RatedHP'<\/span>})<\/pre>\n<pre style=\"font-style: oblique;\">ans =\r\n\r\n  2\u00d73 table\r\n\r\n              MfrName               CarLine     RatedHP\r\n    ____________________________    ________    _______\r\n\r\n    'Bugatti Automobiles S.A.S.'    'VEYRON'    1001   \r\n    'Bugatti Automobiles S.A.S.'    'VEYRON'    1001   \r\n\r\n<\/pre>\n<p>Not too surprising.<\/p>\n<h3>Tall Arrays<a name=\"5\"><\/a><\/h3>\n<p>I was able to read all of those files into MATLAB because they&#8217;re not particularly big, so they pose no problem for the memory of my<br \/>\nlaptop. However, the main design case for datastores is to work with data that are way too big to fit in memory. Sitting<br \/>\non top of a datastore is something called a <a href=\"https:\/\/www.mathworks.com\/help\/matlab\/ref\/tall.html\"><tt>tall<\/tt><\/a> array: an array that lives out of memory but can be used much like any other array in MATLAB. 
These tall arrays can<br \/>\nthen represent Big Data of whatever size your <a href=\"https:\/\/en.wikipedia.org\/wiki\/Binary_prefix\">favorite decimal prefix<\/a> describes and live locally, on a cluster, in the cloud, or in <a href=\"https:\/\/spark.apache.org\/\">Spark\/Hadoop<\/a>.<\/p>\n<p>Here&#8217;s the same example with a tall array.<\/p>\n<pre style=\"background: #F9F7F3; padding: 10px; border: 1px solid #c8c8c8;\">T = tall(ds);\r\ngather(T(T.RatedHP==max(T.RatedHP), {<span style=\"color: #a020f0;\">'MfrName'<\/span>, <span style=\"color: #a020f0;\">'CarLine'<\/span>, <span style=\"color: #a020f0;\">'RatedHP'<\/span>}))<\/pre>\n<pre style=\"font-style: oblique;\">Evaluating tall expression using the Parallel Pool 'local':\r\n- Pass 1 of 2: Completed in 2 sec\r\n- Pass 2 of 2: Completed in 2 sec\r\nEvaluation completed in 6 sec\r\n\r\nans =\r\n\r\n  2\u00d73 table\r\n\r\n              MfrName               CarLine     RatedHP\r\n    ____________________________    ________    _______\r\n\r\n    'Bugatti Automobiles S.A.S.'    'VEYRON'    1001   \r\n    'Bugatti Automobiles S.A.S.'    'VEYRON'    1001   \r\n\r\n<\/pre>\n<p>The only difference is the <tt>gather<\/tt>, which forces MATLAB to actually read the data from disk. Without it, evaluation is deferred, which allows other operations to be<br \/>\nqueued up so that MATLAB can minimize the number of passes through the data.<\/p>\n<h3>Cell2Underlying<a name=\"7\"><\/a><\/h3>\n<p>So where does <tt>cell2underlying<\/tt> come into this? When building a tall array from a <a href=\"https:\/\/www.mathworks.com\/help\/matlab\/ref\/filedatastore.html\"><tt>fileDatastore<\/tt><\/a>, you get back a tall cell array with the results for each file in its own cell. 
If you want it &#8220;flattened&#8221; into the underlying format, like a<br \/>\ntable for the above use-case, then use <tt>cell2underlying<\/tt>.<\/p>\n<p>Here is an example where I want to build the tall array, but need a custom read function to parse the files created by a hardware<br \/>\ndevice.<\/p>\n<pre style=\"background: #F9F7F3; padding: 10px; border: 1px solid #c8c8c8;\">ds = fileDatastore(<span style=\"color: #a020f0;\">'.\\data\\*.txt'<\/span>, <span style=\"color: #a020f0;\">'ReadFcn'<\/span>, @readTestFile);\r\nT =  tall(ds);\r\nTF = cell2underlying(T);\r\nwhos <span style=\"color: #a020f0;\">T<\/span> <span style=\"color: #a020f0;\">TF<\/span>\r\ndisplay(TF)<\/pre>\n<pre style=\"font-style: oblique;\">  Name      Size            Bytes  Class    Attributes\r\n\r\n  T         3x1                25  tall               \r\n  TF        Mx2              1551  tall               \r\n\r\n\r\nTF =\r\n\r\n  M\u00d72 tall table\r\n\r\n    Time    Power\r\n    ____    _____\r\n\r\n    27      2.349\r\n    28      2.349\r\n    29      2.304\r\n    30      2.286\r\n    31      2.286\r\n    32      2.304\r\n    33      2.286\r\n    34      2.286\r\n    :       :\r\n    :       :\r\n\r\n<\/pre>\n<p>With <i>T<\/i>, I would have to work with cell indexing. For example, even a simple operation like taking the max becomes this:<\/p>\n<pre style=\"background: #F9F7F3; padding: 10px; border: 1px solid #c8c8c8;\">gather(max(cellfun(@(x)max(x.Power),T)));<\/pre>\n<p>Max is easy because the max of the per-file maxima is the overall max; I don&#8217;t have to weight things by file size as I would for a mean or standard deviation.<\/p>\n<p>With <i>TF<\/i> I can operate directly on the table variables like below. 
Also note that all three calculations are evaluated in a single pass through the data!<br \/>\nThis extends to more complicated operations as well, such as machine learning algorithms.<\/p>\n<pre style=\"background: #F9F7F3; padding: 10px; border: 1px solid #c8c8c8;\">Pstd = std(TF.Power);\r\nPmean = mean(TF.Power);\r\nPmax = max(TF.Power);\r\n[Pstd, Pmean, Pmax] = gather(Pstd, Pmean, Pmax)<\/pre>\n<pre style=\"font-style: oblique;\">Evaluating tall expression using the Parallel Pool 'local':\r\n- Pass 1 of 1: Completed in 0 sec\r\nEvaluation completed in 1 sec\r\n\r\nPstd =\r\n\r\n   2.613130685159478\r\n\r\n\r\nPmean =\r\n\r\n   3.140150370869870\r\n\r\n\r\nPmax =\r\n\r\n  19.359000000000002\r\n\r\n<\/pre>\n<p>To use <tt>cell2underlying<\/tt>, copy it into the following folder: <tt>[matlabroot '\\toolbox\\matlab\\bigdata\\@tall']<\/tt> and then run <tt>rehash toolboxcache<\/tt>. You will need administrator privileges to copy it in. Alternatively, you can put it in an <tt>@tall<\/tt> folder whose parent folder is on the MATLAB path.<\/p>\n<h3>Comments<a name=\"10\"><\/a><\/h3>\n<p>Give it a try and let us know what you think <a href=\"https:\/\/blogs.mathworks.com\/pick\/?p=8599#respond\">here<\/a> or leave a <a href=\"https:\/\/www.mathworks.com\/matlabcentral\/fileexchange\/61517-cell2underlying-for-tall-arrays#comments\">comment<\/a> for The MathWorks Parallel Computing Toolbox Team.<\/p>\n<p>Published with MATLAB\u00ae 
R2017a<\/p>\n<\/div>\n","protected":false},"excerpt":{"rendered":"<p>\nSean&#8216;s pick this week is cell2underlying by MathWorks Parallel Computing Toolbox Team.<br \/>\n&nbsp;<br \/>\nContents<\/p>\n<p>Datastores<br \/>\nTall Arrays<br \/>\nCell2Underlying<br \/>\nComments<\/p>\n<p>Datastores<br \/>\nWhat is a datastore? 
You&#8230; <a class=\"read-more\" href=\"https:\/\/blogs.mathworks.com\/pick\/2017\/05\/26\/cell2underlying\/\">read more >><\/a><\/p>\n","protected":false},"author":87,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":[],"categories":[16],"tags":[],"_links":{"self":[{"href":"https:\/\/blogs.mathworks.com\/pick\/wp-json\/wp\/v2\/posts\/8599"}],"collection":[{"href":"https:\/\/blogs.mathworks.com\/pick\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blogs.mathworks.com\/pick\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blogs.mathworks.com\/pick\/wp-json\/wp\/v2\/users\/87"}],"replies":[{"embeddable":true,"href":"https:\/\/blogs.mathworks.com\/pick\/wp-json\/wp\/v2\/comments?post=8599"}],"version-history":[{"count":5,"href":"https:\/\/blogs.mathworks.com\/pick\/wp-json\/wp\/v2\/posts\/8599\/revisions"}],"predecessor-version":[{"id":8609,"href":"https:\/\/blogs.mathworks.com\/pick\/wp-json\/wp\/v2\/posts\/8599\/revisions\/8609"}],"wp:attachment":[{"href":"https:\/\/blogs.mathworks.com\/pick\/wp-json\/wp\/v2\/media?parent=8599"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blogs.mathworks.com\/pick\/wp-json\/wp\/v2\/categories?post=8599"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blogs.mathworks.com\/pick\/wp-json\/wp\/v2\/tags?post=8599"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}