{"id":803,"date":"2013-11-11T08:05:22","date_gmt":"2013-11-11T13:05:22","guid":{"rendered":"https:\/\/blogs.mathworks.com\/loren\/?p=803"},"modified":"2013-11-11T08:05:22","modified_gmt":"2013-11-11T13:05:22","slug":"in-memory-big-data-analysis-with-pct-and-mdcs","status":"publish","type":"post","link":"https:\/\/blogs.mathworks.com\/loren\/2013\/11\/11\/in-memory-big-data-analysis-with-pct-and-mdcs\/","title":{"rendered":"In-memory Big Data Analysis with PCT and MDCS"},"content":{"rendered":"<div class=\"content\"><!--introduction--><p><i>Ken Atwell in the MATLAB product management group is guest blogging about using <tt><a href=\"https:\/\/www.mathworks.com\/help\/distcomp\/distributed.html\">distributed<\/a><\/tt> arrays to perform data analytics on in-memory \"big data\".<\/i><\/p><p>Big data applications can take many forms.  Unlike unstructured text processing, common in many business and consumer applications, big data in science and engineering fields can often take a more \"regular\" form, either a lengthy data set of observations or time series data, or a monolithic, multi-dimensional matrix capturing some problem space.<\/p><p>64-bit MATLAB is fundamentally well-suited to processing data of these types, as long as the data fits within the memory of your computer.  When data size outgrows the capability of a single computer, <a href=\"https:\/\/www.mathworks.com\/products\/distriben\">MATLAB Distributed Computing Server<\/a> (MDCS) can allow MATLAB applications to simply scale and leverage the aggregate memory and computational capabilities of a cluster of computers.<\/p><p>This blog post explores the use of MDCS to analyze a tabular data set that does not fit into the memory of the typical desktop computer. Highlights include:<\/p><div><ul><li>Importing a large data set sourced from multiple files with parallel file I\/O, and post-processing that data set<\/li><li>Performing parallel computation across the aggregate data set<\/li><li>Visualizing the statistical characteristics of the aggregate data set<\/li><li>Optimizing file I\/O performance with MATLAB MAT-Files<\/li><li>Rebalancing data and workload between workers<\/li><\/ul><\/div><p>You will need <a href=\"https:\/\/www.mathworks.com\/products\/parallel-computing\">Parallel Computing Toolbox<\/a> (PCT) to access the <tt>distributed<\/tt> array, a data type for working with data storage across a cluster.  Its usage is virtually identical to that of a normal MATLAB matrix, supporting easy and rapid adoption.  The in-memory nature of the distributed array facilitates experimentation and the rapid iteration workflows that MATLAB users have come to expect.<\/p><p>Finally, note that this example has been designed to run in about 20 GB of memory.  That is hardly \"big data\", but the problem size has been purposefully reduced to be small enough to fit into a mid- to high-end workstation, should you not have access to cluster to follow along.<\/p><!--\/introduction--><h3>Contents<\/h3><div><ul><li><a href=\"#ed9f38ce-e77a-4e63-8287-c29d06d1ddc1\">Load the Data Set<\/a><\/li><li><a href=\"#5ac8beb2-ef70-47b3-aabf-5e3602f8ab8f\">Identify Files to Process<\/a><\/li><li><a href=\"#df018758-f550-47cd-86a1-0577f07bfbca\">Examine the Data<\/a><\/li><li><a href=\"#5efa19e5-e9ce-46e8-8fb1-9d26fb28477b\">Load One File<\/a><\/li><li><a href=\"#1b0fe155-bf36-42f0-b356-308346cd26ef\">Open a Pool of Local Workers<\/a><\/li><li><a href=\"#7b0e0723-0d37-4cf6-be84-a770a6f604d9\">Load One File to All Workers<\/a><\/li><li><a href=\"#d85234ab-3338-42ba-8343-1509a8b42cc2\">Load a Different File on Each Worker<\/a><\/li><li><a href=\"#b70096c5-581c-47da-bcd2-2b37d87a112c\">Performing Non-cooperative Computation per Worker<\/a><\/li><li><a href=\"#8a264427-7f0c-4bea-afc9-d490624e938b\">Visualize Departure Times<\/a><\/li><li><a href=\"#2eae956f-46c2-4cd5-a8a4-ccedc7155336\">Perform Cooperative Computation per Worker<\/a><\/li><li><a href=\"#fc313250-9dc3-4f6d-b7b3-a77d06abc7a4\">Visualize Year Across the Entire Data Set<\/a><\/li><li><a href=\"#7eb2e4e8-b3eb-420c-ac73-73b5188ff370\">Visualize Departure Time Across the Entire Data Set<\/a><\/li><li><a href=\"#4ef79d0b-9e0e-45c3-8473-29be1640a3ad\">Use 24 Histogram Bins<\/a><\/li><li><a href=\"#6058d57b-a468-40fe-bf22-00e6c50e6a19\">Re-express the Departure Time<\/a><\/li><li><a href=\"#a9d32630-1a09-40ea-a663-ea8bf4a610ed\">Removing Missing Data Points<\/a><\/li><li><a href=\"#b64ba57d-d3ce-46fd-b56f-c0c9058160a8\">Work With the Full Data Set<\/a><\/li><li><a href=\"#52a1fe8e-7435-4e91-8c66-0d0c01008523\">Load the Entire Data Set<\/a><\/li><li><a href=\"#fbbb7e76-bfde-4356-b0a6-051e0704295e\">Rerun Visualizations<\/a><\/li><li><a href=\"#02c70e44-4c1e-4159-840a-bab745fb7db6\">Rebalance Data Among the Pool of Workers<\/a><\/li><li><a href=\"#f5c53fe7-94a9-4249-b129-488eca61c918\">In Closing<\/a><\/li><\/ul><\/div><h4>Load the Data Set<a name=\"ed9f38ce-e77a-4e63-8287-c29d06d1ddc1\"><\/a><\/h4><p>This example uses a publicly available data set of over 100 million airline flight records from over a 20-year time span from 1987 to 2008. Download years individually:<\/p><div><ol><li><a href=\"https:\/\/s3.amazonaws.com\/h2o-airlines-unpacked\/year1987.csv\">https:\/\/s3.amazonaws.com\/h2o-airlines-unpacked\/year1987.csv<\/a><\/li><li><a href=\"https:\/\/s3.amazonaws.com\/h2o-airlines-unpacked\/year1988.csv\">https:\/\/s3.amazonaws.com\/h2o-airlines-unpacked\/year1988.csv<\/a><\/li><li><a href=\"https:\/\/s3.amazonaws.com\/h2o-airlines-unpacked\/year1989.csv\">https:\/\/s3.amazonaws.com\/h2o-airlines-unpacked\/year1989.csv<\/a><\/li><li><a href=\"https:\/\/s3.amazonaws.com\/h2o-airlines-unpacked\/year1990.csv\">https:\/\/s3.amazonaws.com\/h2o-airlines-unpacked\/year1990.csv<\/a><\/li><li><a href=\"https:\/\/s3.amazonaws.com\/h2o-airlines-unpacked\/year1991.csv\">https:\/\/s3.amazonaws.com\/h2o-airlines-unpacked\/year1991.csv<\/a><\/li><li><a href=\"https:\/\/s3.amazonaws.com\/h2o-airlines-unpacked\/year1992.csv\">https:\/\/s3.amazonaws.com\/h2o-airlines-unpacked\/year1992.csv<\/a><\/li><li><a href=\"https:\/\/s3.amazonaws.com\/h2o-airlines-unpacked\/year1993.csv\">https:\/\/s3.amazonaws.com\/h2o-airlines-unpacked\/year1993.csv<\/a><\/li><li><a href=\"https:\/\/s3.amazonaws.com\/h2o-airlines-unpacked\/year1994.csv\">https:\/\/s3.amazonaws.com\/h2o-airlines-unpacked\/year1994.csv<\/a><\/li><li><a href=\"https:\/\/s3.amazonaws.com\/h2o-airlines-unpacked\/year1995.csv\">https:\/\/s3.amazonaws.com\/h2o-airlines-unpacked\/year1995.csv<\/a><\/li><li><a href=\"https:\/\/s3.amazonaws.com\/h2o-airlines-unpacked\/year1996.csv\">https:\/\/s3.amazonaws.com\/h2o-airlines-unpacked\/year1996.csv<\/a><\/li><li><a href=\"https:\/\/s3.amazonaws.com\/h2o-airlines-unpacked\/year1997.csv\">https:\/\/s3.amazonaws.com\/h2o-airlines-unpacked\/year1997.csv<\/a><\/li><li><a href=\"https:\/\/s3.amazonaws.com\/h2o-airlines-unpacked\/year1998.csv\">https:\/\/s3.amazonaws.com\/h2o-airlines-unpacked\/year1998.csv<\/a><\/li><li><a href=\"https:\/\/s3.amazonaws.com\/h2o-airlines-unpacked\/year1999.csv\">https:\/\/s3.amazonaws.com\/h2o-airlines-unpacked\/year1999.csv<\/a><\/li><li><a href=\"https:\/\/s3.amazonaws.com\/h2o-airlines-unpacked\/year2000.csv\">https:\/\/s3.amazonaws.com\/h2o-airlines-unpacked\/year2000.csv<\/a><\/li><li><a href=\"https:\/\/s3.amazonaws.com\/h2o-airlines-unpacked\/year2001.csv\">https:\/\/s3.amazonaws.com\/h2o-airlines-unpacked\/year2001.csv<\/a><\/li><li><a href=\"https:\/\/s3.amazonaws.com\/h2o-airlines-unpacked\/year2002.csv\">https:\/\/s3.amazonaws.com\/h2o-airlines-unpacked\/year2002.csv<\/a><\/li><li><a href=\"https:\/\/s3.amazonaws.com\/h2o-airlines-unpacked\/year2003.csv\">https:\/\/s3.amazonaws.com\/h2o-airlines-unpacked\/year2003.csv<\/a><\/li><li><a href=\"https:\/\/s3.amazonaws.com\/h2o-airlines-unpacked\/year2004.csv\">https:\/\/s3.amazonaws.com\/h2o-airlines-unpacked\/year2004.csv<\/a><\/li><li><a href=\"https:\/\/s3.amazonaws.com\/h2o-airlines-unpacked\/year2005.csv\">https:\/\/s3.amazonaws.com\/h2o-airlines-unpacked\/year2005.csv<\/a><\/li><li><a href=\"https:\/\/s3.amazonaws.com\/h2o-airlines-unpacked\/year2006.csv\">https:\/\/s3.amazonaws.com\/h2o-airlines-unpacked\/year2006.csv<\/a><\/li><li><a href=\"https:\/\/s3.amazonaws.com\/h2o-airlines-unpacked\/year2007.csv\">https:\/\/s3.amazonaws.com\/h2o-airlines-unpacked\/year2007.csv<\/a><\/li><li><a href=\"https:\/\/s3.amazonaws.com\/h2o-airlines-unpacked\/year2008.csv\">https:\/\/s3.amazonaws.com\/h2o-airlines-unpacked\/year2008.csv<\/a><\/li><\/ol><\/div><p>As is typical in many big data applications, data is spread over a collection of medium-to-large files, rather than a single very large file.  This is in fact advantageous, allowing the data to be more easily distributed across a cluster running MDCS.  During this development phase, use a modest pool (8 workers) and only the first eight files in the data set -- the full data set will be seen later in this post.<\/p><h4>Identify Files to Process<a name=\"5ac8beb2-ef70-47b3-aabf-5e3602f8ab8f\"><\/a><\/h4><p>Once the files are downloaded, look for them in MATLAB.  Modify the value of <tt>dataFolder<\/tt> to match wherever you downloaded your files to.<\/p><pre class=\"codeinput\">close <span class=\"string\">all<\/span>\r\nclear <span class=\"string\">all<\/span>\r\ndataFolder = <span class=\"string\">'\/Volumes\/HD2\/airline\/data'<\/span>;\r\nallFiles = dir(fullfile(dataFolder, <span class=\"string\">'*.csv'<\/span>));\r\n<span class=\"comment\">% Remove \"dot\" files (Mac)<\/span>\r\nallFiles = allFiles(~strncmp({allFiles(:).name}, <span class=\"string\">'.'<\/span>, 1));\r\nfiles = allFiles(1:8);  <span class=\"comment\">% For initial development<\/span>\r\n<\/pre><h4>Examine the Data<a name=\"df018758-f550-47cd-86a1-0577f07bfbca\"><\/a><\/h4><p>The data set is comma-separated text.  To keep things simple in this particular example, only read in a few of the columns in the data set. Importing from text is an expensive operation -- reducing the importation to only the needed columns can speed up a MATLAB application considerably.  First look at left-most part of the raw data:<\/p><pre class=\"codeinput\"><span class=\"keyword\">if<\/span> ~ispc\r\n    system([<span class=\"string\">'head -20 '<\/span> fullfile(dataFolder, files(8).name) <span class=\"string\">' | cut -c -60'<\/span>]);\r\n<span class=\"keyword\">end<\/span>\r\n<\/pre><pre class=\"codeoutput\">Year,Month,DayofMonth,DayOfWeek,DepTime,CRSDepTime,ArrTime,C\r\n1994,1,7,5,858,900,954,1003,US,227,NA,56,63,NA,-9,-2,CLT,ORF\r\n1994,1,8,6,859,900,952,1003,US,227,NA,53,63,NA,-11,-1,CLT,OR\r\n1994,1,10,1,935,900,1023,1003,US,227,NA,48,63,NA,20,35,CLT,O\r\n1994,1,11,2,903,900,1131,1003,US,227,NA,148,63,NA,88,3,CLT,O\r\n1994,1,12,3,933,900,1024,1003,US,227,NA,51,63,NA,21,33,CLT,O\r\n1994,1,13,4,NA,900,NA,1003,US,227,NA,NA,63,NA,NA,NA,CLT,ORF,\r\n1994,1,14,5,903,900,1005,1003,US,227,NA,62,63,NA,2,3,CLT,ORF\r\n1994,1,15,6,859,900,1004,1003,US,227,NA,65,63,NA,1,-1,CLT,OR\r\n1994,1,17,1,859,900,955,1003,US,227,NA,56,63,NA,-8,-1,CLT,OR\r\n1994,1,18,2,904,900,959,1003,US,227,NA,55,63,NA,-4,4,CLT,ORF\r\n1994,1,19,3,858,900,947,1003,US,227,NA,49,63,NA,-16,-2,CLT,O\r\n1994,1,20,4,923,900,1015,1003,US,227,NA,52,63,NA,12,23,CLT,O\r\n1994,1,21,5,NA,900,NA,1003,US,227,NA,NA,63,NA,NA,NA,CLT,ORF,\r\n1994,1,22,6,859,900,1001,1003,US,227,NA,62,63,NA,-2,-1,CLT,O\r\n1994,1,24,1,901,900,1006,1003,US,227,NA,65,63,NA,3,1,CLT,ORF\r\n1994,1,25,2,859,900,952,1003,US,227,NA,53,63,NA,-11,-1,CLT,O\r\n1994,1,27,4,910,900,1008,1003,US,227,NA,58,63,NA,5,10,CLT,OR\r\n1994,1,28,5,858,900,1020,1003,US,227,NA,82,63,NA,17,-2,CLT,O\r\n1994,1,29,6,859,900,953,1003,US,227,NA,54,63,NA,-10,-1,CLT,O\r\n<\/pre><h4>Load One File<a name=\"5efa19e5-e9ce-46e8-8fb1-9d26fb28477b\"><\/a><\/h4><p>Use the <tt><a href=\"https:\/\/www.mathworks.com\/help\/matlab\/ref\/textscan.html\">textscan<\/a><\/tt> function to import the data, as it gives the flexibility needed to import particular columns from a file of mixed text of data; it also can contend with \"NA\" as an indicator of a missing value.  Extract columns 1, 2, 4, and 5 (<tt>Year<\/tt>, <tt>Month<\/tt>, <tt>DayOfWeek<\/tt>, <tt>DepTime<\/tt>) from one file on the local machine.<\/p><pre class=\"codeinput\">tic\r\nsrcFile = fullfile(dataFolder, files(8).name);\r\nf = fopen(srcFile);\r\nC = textscan(f, <span class=\"string\">'%f %f %*f %f %f %*[^\\n]'<\/span>, <span class=\"string\">'Delimiter'<\/span>, <span class=\"string\">','<\/span>, <span class=\"keyword\">...<\/span>\r\n    <span class=\"string\">'HeaderLines'<\/span>, 1, <span class=\"string\">'TreatAsEmpty'<\/span>, <span class=\"string\">'NA'<\/span>);\r\nfclose(f);\r\n[Year, Month, DayOfWeek, DepTime] = C{:};\r\nsprintf(<span class=\"string\">'%g seconds to load \"%s\".\\n'<\/span>, toc, srcFile)\r\n<\/pre><pre class=\"codeoutput\">\r\nans =\r\n\r\n52.4414 seconds to load \"\/Volumes\/HD2\/airline\/data\/1994.csv\".\r\n\r\n\r\n<\/pre><h4>Open a Pool of Local Workers<a name=\"1b0fe155-bf36-42f0-b356-308346cd26ef\"><\/a><\/h4><p>The previous step probably took a bit of time to import, maybe a full minute.  Luckily, this operation will run in parallel later.  If necessary, open an 8-worker pool now:<\/p><pre class=\"codeinput\"><span class=\"keyword\">try<\/span>\r\n    parpool(<span class=\"string\">'local'<\/span>, 8);\r\n<span class=\"keyword\">catch<\/span>\r\n<span class=\"keyword\">end<\/span>\r\n<\/pre><pre class=\"codeoutput\">Starting parallel pool (parpool) using the 'local' profile ... connected to 8 workers.\r\n<\/pre><h4>Load One File to All Workers<a name=\"7b0e0723-0d37-4cf6-be84-a770a6f604d9\"><\/a><\/h4><p>Move the <tt>textscan<\/tt> code to the cluster, and have all of the nodes in the cluster load this file.  This is done with a <i>single program\/multiple data<\/i> (SPMD) block.  An <tt><a href=\"https:\/\/www.mathworks.com\/help\/distcomp\/spmd.html\">spmd<\/a><\/tt> block of MATLAB code is executed on all eight of the workers in parallel:<\/p><pre class=\"codeinput\">tic\r\n<span class=\"keyword\">spmd<\/span>\r\n    srcFile = fullfile(dataFolder, files(8).name);\r\n    f = fopen(srcFile);\r\n    C = textscan(f, <span class=\"string\">'%f %f %*f %f %f %*[^\\n]'<\/span>, <span class=\"string\">'Delimiter'<\/span>, <span class=\"string\">','<\/span>, <span class=\"keyword\">...<\/span>\r\n        <span class=\"string\">'HeaderLines'<\/span>, 1, <span class=\"string\">'TreatAsEmpty'<\/span>, <span class=\"string\">'NA'<\/span>);\r\n    [Year, Month, DayOfWeek, DepTime] = C{:};\r\n    fclose(f);\r\n<span class=\"keyword\">end<\/span>\r\nsprintf(<span class=\"string\">'%g seconds to load \"%s\" on all workers.\\n'<\/span>, toc, srcFile{1})\r\n<\/pre><pre class=\"codeoutput\">\r\nans =\r\n\r\n65.2914 seconds to load \"\/Volumes\/HD2\/airline\/data\/1994.csv\" on all workers.\r\n\r\n\r\n<\/pre><h4>Load a Different File on Each Worker<a name=\"d85234ab-3338-42ba-8343-1509a8b42cc2\"><\/a><\/h4><p>This previous step takes about the same length of  time as the previous load, a near linear speed-up.  But, observe that the same file was loaded eight times. To allow for different behaviors on different workers, each worker is assigned a unique index, called <tt><a href=\"https:\/\/www.mathworks.com\/help\/distcomp\/labindex.html\">labindex<\/a><\/tt>.  Use <tt>labindex<\/tt> to make each worker behave differently within an <tt>spmd<\/tt> block.  In this case, by simply replacing the hard-coded index <tt>files(8)<\/tt> with <tt>files(labindex)<\/tt>, the <tt>spmd<\/tt> block loads different files on different workers (recall that <tt>files<\/tt> is a vector of all filenames in the data set) -- this <tt>spmd<\/tt> block is otherwise identical to the previous one.<\/p><pre class=\"codeinput\">tic\r\n<span class=\"keyword\">spmd<\/span>\r\n    srcFile = fullfile(dataFolder, files(labindex).name);\r\n    f = fopen(srcFile);\r\n    C = textscan(f, <span class=\"string\">'%f %f %*f %f %f %*[^\\n]'<\/span>, <span class=\"string\">'Delimiter'<\/span>, <span class=\"string\">','<\/span>, <span class=\"keyword\">...<\/span>\r\n        <span class=\"string\">'HeaderLines'<\/span>, 1, <span class=\"string\">'TreatAsEmpty'<\/span>, <span class=\"string\">'NA'<\/span>);\r\n    [Year, Month, DayOfWeek, DepTime] = C{:};\r\n    fclose(f);\r\n<span class=\"keyword\">end<\/span>\r\nsprintf(<span class=\"string\">'%g seconds to load different files on each worker.\\n'<\/span>, toc)\r\n<\/pre><pre class=\"codeoutput\">\r\nans =\r\n\r\n61.4875 seconds to load different files on each worker.\r\n\r\n\r\n<\/pre><h4>Performing Non-cooperative Computation per Worker<a name=\"b70096c5-581c-47da-bcd2-2b37d87a112c\"><\/a><\/h4><p>Each worker now has its own set of data.  In some applications, this may be enough, and each data set can be worked on independently from each other.  This is sometimes referred to as an \"embarrassingly parallel\" problem. In the case of this data set, each file is data for an entire year, so calculations by year are trivial to compute.  The value of <tt>Year<\/tt> should be identical within each worker. Confirm this in an <tt>spmd<\/tt> block, and then bring the results of that computation back locally for examination.  Note the use of cell array-style indexing immediately after the <tt>spmd<\/tt> block; this is the means to retrieve all values a variable has on each worker.  Since there are eight workers, eight values are returned -- this is referred to as a <tt><a href=\"https:\/\/www.mathworks.com\/help\/distcomp\/composite.html\">Composite<\/a><\/tt> in MATLAB.<\/p><pre class=\"codeinput\"><span class=\"keyword\">spmd<\/span>\r\n    allYearsSame = all(Year==Year(1));\r\n    myYear = Year(1);\r\n<span class=\"keyword\">end<\/span>\r\n\r\n[allYearsSame{:}]\r\n[myYear{:}]\r\n<\/pre><pre class=\"codeoutput\">\r\nans =\r\n\r\n     1     1     1     1     1     1     1     1\r\n\r\n\r\nans =\r\n\r\n  Columns 1 through 6\r\n\r\n        1987        1988        1989        1990        1991        1992\r\n\r\n  Columns 7 through 8\r\n\r\n        1993        1994\r\n\r\n<\/pre><h4>Visualize Departure Times<a name=\"8a264427-7f0c-4bea-afc9-d490624e938b\"><\/a><\/h4><p>Create a histogram of departure hour over eight years to get a first real look at the data.  Note that one year (1987) has significantly less magnitude than the other years -- this is because it is only a partial year's worth of data.  A course-grain partitioning of your data using files and folder hierarchy can be a straightforward way of segmenting your data set into meaningful (and manageable) subsets.<\/p><pre class=\"codeinput\"><span class=\"keyword\">spmd<\/span>\r\n    H = hist(DepTime, 24);\r\n<span class=\"keyword\">end<\/span>\r\n\r\nplot(cell2mat(H(:))', <span class=\"string\">'-o'<\/span>)\r\ntitle(<span class=\"string\">'Histogram of departure times by year'<\/span>)\r\nxlabel(<span class=\"string\">'Hour of the day'<\/span>)\r\nylabel(<span class=\"string\">'Number of flights'<\/span>)\r\nlegend(arrayfun(@(x) sprintf(<span class=\"string\">'%d'<\/span>, x), [myYear{:}], <span class=\"string\">'UniformOutput'<\/span>, false), <span class=\"keyword\">...<\/span>\r\n    <span class=\"string\">'Location'<\/span>, <span class=\"string\">'NorthWest'<\/span>)\r\n<\/pre><img decoding=\"async\" vspace=\"5\" hspace=\"5\" src=\"https:\/\/blogs.mathworks.com\/images\/loren\/2013\/spmdDataAnalysis_01.png\" alt=\"\"> <h4>Perform Cooperative Computation per Worker<a name=\"2eae956f-46c2-4cd5-a8a4-ccedc7155336\"><\/a><\/h4><p>But suppose analysis must be performed across all observations in the data set. For this, <tt>distributed<\/tt> arrays can be used. A <tt>distributed<\/tt> array allows MATLAB to work with an array spread across many workers as-if it were a single, \"normal\" MATLAB array. <tt>distributed<\/tt> arrays are supported by a large set of MATLAB functions, including linear algebra.<\/p><p>To create a <tt>distributed<\/tt> array from the data already in memory, first determine how big each worker's piece of the array is, and its total size. Calculate the individual and total lengths, and then create four <tt>distributed<\/tt> arrays, one for each column in the original data set, using a so-called <tt><a href=\"https:\/\/www.mathworks.com\/help\/distcomp\/codistributor.html\">codistributor<\/a><\/tt> (called a co-distributor because this is from the point of view of each worker; each needs to work cooperatively with other workers).<\/p><pre class=\"codeinput\">tic\r\n<span class=\"keyword\">spmd<\/span>\r\n    lengths = size(Year,1);\r\n<span class=\"keyword\">end<\/span>\r\n\r\nlengths = cell2mat(lengths(:));\r\nlengths = lengths';\r\ntotalLength = sum(lengths);\r\n\r\n<span class=\"keyword\">spmd<\/span>\r\n    codistr = codistributor1d(1, lengths, [totalLength, 1]);\r\n    Year = codistributed.build(Year, codistr);\r\n    Month = codistributed.build(Month, codistr);\r\n    DayOfWeek = codistributed.build(DayOfWeek, codistr);\r\n    DepTime = codistributed.build(DepTime, codistr);\r\n<span class=\"keyword\">end<\/span>\r\n\r\nsprintf(<span class=\"string\">'%g seconds to create distributed arrays from Composites.\\n'<\/span>, toc)\r\n<\/pre><pre class=\"codeoutput\">\r\nans =\r\n\r\n1.86178 seconds to create distributed arrays from Composites.\r\n\r\n\r\n<\/pre><h4>Visualize Year Across the Entire Data Set<a name=\"fc313250-9dc3-4f6d-b7b3-a77d06abc7a4\"><\/a><\/h4><p>There are now four variables (<tt>Year<\/tt>, <tt>Month<\/tt>, <tt>DayOfWeek<\/tt>, and <tt>DepTime<\/tt>) that contain the entire 8-year data set; these variables can be used much like normal MATLAB variables.  This example is using a 1-D array, but multidimensional <tt>distributed<\/tt> arrays are also supported should you need to distribute \"tiles\" of data across a cluster. If you ever want to gather up the <tt>distributed<\/tt> array into one normal MATLAB array, you can do so with the <tt><a href=\"https:\/\/www.mathworks.com\/help\/distcomp\/gather.html\">gather<\/a><\/tt> function.  Be careful when doing this, and ensure that your local computer has enough memory to hold the aggregate content of the entire <tt>distributed<\/tt> array.  To create a histogram of departure year:<\/p><pre class=\"codeinput\">firstYear = gather(min(Year));\r\nlastYear = gather(max(Year));\r\nfigure\r\nhist(Year, firstYear:lastYear, <span class=\"string\">'EdgeColor'<\/span>, <span class=\"string\">'w'<\/span>)\r\ntitle(<span class=\"string\">'Histogram of year'<\/span>)\r\nxlabel(<span class=\"string\">'Year'<\/span>)\r\nylabel(<span class=\"string\">'Number of flights'<\/span>)\r\n<\/pre><img decoding=\"async\" vspace=\"5\" hspace=\"5\" src=\"https:\/\/blogs.mathworks.com\/images\/loren\/2013\/spmdDataAnalysis_02.png\" alt=\"\"> <p>Note the use of <tt>gather<\/tt> here.  Any operation on a <tt>distributed<\/tt> array produces another <tt>distributed<\/tt> array. Even though <tt>min<\/tt> and <tt>max<\/tt> return small arrays, they must nevertheless be <tt>gather<\/tt>'ed back to your local MATLAB before using them in the second argument to the <tt><a href=\"https:\/\/www.mathworks.com\/help\/matlab\/ref\/hist.html\">hist<\/a><\/tt> function. You can see more starkly now that 1987 has far fewer observations than the subsequent years.<\/p><h4>Visualize Departure Time Across the Entire Data Set<a name=\"7eb2e4e8-b3eb-420c-ac73-73b5188ff370\"><\/a><\/h4><p>Next produce a histogram of the departure time.  There is no need to gather anything this time because the second argument to <tt>hist<\/tt> (24, as in hours in a day) is a constant value and not something that needs to be computed from the <tt>distributed<\/tt> data.<\/p><pre class=\"codeinput\">figure;\r\nhist(DepTime,24);\r\ntitle(<span class=\"string\">'Histogram of departure time'<\/span>)\r\nxlabel(<span class=\"string\">'Hour of Day'<\/span>)\r\nylabel(<span class=\"string\">'Number of flights'<\/span>)\r\n<\/pre><img decoding=\"async\" vspace=\"5\" hspace=\"5\" src=\"https:\/\/blogs.mathworks.com\/images\/loren\/2013\/spmdDataAnalysis_03.png\" alt=\"\"> <h4>Use 24 Histogram Bins<a name=\"4ef79d0b-9e0e-45c3-8473-29be1640a3ad\"><\/a><\/h4><p>Readers will notice that the histogram bins run up to 2500, not 25.  This is because the original data set expressed time as an integer encoding HHMM.  To consider only the hour:<\/p><pre class=\"codeinput\">hist(floor(DepTime\/100),24)\r\ntitle(<span class=\"string\">'Histogram of departure time (Hour only)'<\/span>)\r\nxlabel(<span class=\"string\">'Hour of Day'<\/span>)\r\nylabel(<span class=\"string\">'Number of flights'<\/span>)\r\n<\/pre><img decoding=\"async\" vspace=\"5\" hspace=\"5\" src=\"https:\/\/blogs.mathworks.com\/images\/loren\/2013\/spmdDataAnalysis_04.png\" alt=\"\"> <h4>Re-express the Departure Time<a name=\"6058d57b-a468-40fe-bf22-00e6c50e6a19\"><\/a><\/h4><p>In fact, re-express HHMM as a number between 0 and 24, with the value after the decimal point expressing each minute as 1\/60th. Update the value of the <tt>distributed<\/tt> array and then again plot the histogram with <tt>DepTime<\/tt>'s improved representation.<\/p><pre class=\"codeinput\">clear <span class=\"string\">C<\/span>;\r\nDepTime = floor(DepTime\/100)+mod(DepTime,60)\/60;\r\nhist(DepTime,24)\r\ntitle(<span class=\"string\">'Histogram of departure time (Fractional time)'<\/span>)\r\nxlabel(<span class=\"string\">'Hour of Day'<\/span>)\r\nylabel(<span class=\"string\">'Number of flights'<\/span>)\r\n<\/pre><img decoding=\"async\" vspace=\"5\" hspace=\"5\" src=\"https:\/\/blogs.mathworks.com\/images\/loren\/2013\/spmdDataAnalysis_05.png\" alt=\"\"> <p>Note the use of <tt><a href=\"https:\/\/www.mathworks.com\/help\/matlab\/ref\/clear.html\">clear<\/a><\/tt> to rid the workspace of <tt>C<\/tt>, a temporary created earlier in the code when the file was imported. MATLAB generally avoids keeping unnecessary copies of data around, and keeping the temporary had no material effect on the program up till now. However, since the value of <tt>DepTime<\/tt> will change, MATLAB will be in a position where it needs to keep similar-but-distinct copies of the arrays.  This is to be avoided for obvious reasons, so clear these older \"snapshots\" before modifying <tt>DepTime<\/tt>.<\/p><h4>Removing Missing Data Points<a name=\"a9d32630-1a09-40ea-a663-ea8bf4a610ed\"><\/a><\/h4><p>Up to this point, the data has not been checked for the presence of missing data, represented by not-a-number (NaN) in the departure time. Look for NaNs, and then remove those bad rows from all of the distributed arrays.<\/p><pre class=\"codeinput\">tic\r\ngoodRows = ~isnan(DepTime);\r\nfprintf(<span class=\"string\">'There are %d rows to remove from the data.\\n'<\/span>, <span class=\"keyword\">...<\/span>\r\n    totalLength-gather(sum(goodRows)));\r\nYear = Year(goodRows);\r\nMonth = Month(goodRows);\r\nDayOfWeek = DayOfWeek(goodRows);\r\nDepTime = DepTime(goodRows);\r\n\r\nsprintf(<span class=\"string\">'%g seconds to trim bad rows.\\n'<\/span>, toc)\r\n<\/pre><pre class=\"codeoutput\">There are 419397 rows to remove from the data.\r\n\r\nans =\r\n\r\n23.9631 seconds to trim bad rows.\r\n\r\n\r\n<\/pre><h4>Work With the Full Data Set<a name=\"b64ba57d-d3ce-46fd-b56f-c0c9058160a8\"><\/a><\/h4><p>It is time to work with all of the files in the data set.  Because the data files are text, this application will spend most of its runtime parsing text and converting to numeric.  This may be acceptable for a one-time analysis, but if you anticipate the need to load the data set more than once, it is likely worth the effort to save that numeric representation for future use.  This is done in a simple <tt><a href=\"https:\/\/www.mathworks.com\/help\/distcomp\/parfor.html\">parfor<\/a><\/tt> loop, creating a MATLAB MAT-File for each file in the original data set.  This loop takes a few minutes to run the first time.  However, the code is also smart enough to not recreate a MAT-File when it already exists, so this loop finishes quickly when later rerun.<\/p><p>Note: If you have a cluster with more than eight cores available to you, now would be a good time to close the eight-worker pool and open a larger pool.  If you do not have access to a cluster, that is okay too as long as your local computer has around 16 GB of RAM or more.<\/p><pre class=\"codeinput\">clear <span class=\"string\">all<\/span>\r\ndataFolder = <span class=\"string\">'\/Volumes\/HD2\/airline\/data'<\/span>;\r\nfiles = dir(fullfile(dataFolder, <span class=\"string\">'*.csv'<\/span>));\r\nfiles = files(~strncmp({files(:).name}, <span class=\"string\">'.'<\/span>, 1));  <span class=\"comment\">% Remove \"dot\" files (Mac)<\/span>\r\n\r\ntic\r\n<span class=\"keyword\">parfor<\/span> i=1:numel(files)\r\n    <span class=\"comment\">% See https:\/\/www.mathworks.com\/support\/solutions\/en\/data\/1-D8103H for the trick below<\/span>\r\n    parSave = @(fname, Year, Month, DayOfWeek, DepTime) <span class=\"keyword\">...<\/span>\r\n        save(fname, <span class=\"string\">'Year'<\/span>, <span class=\"string\">'Month'<\/span>, <span class=\"string\">'DayOfWeek'<\/span>, <span class=\"string\">'DepTime'<\/span>);\r\n\r\n    srcFile = fullfile(dataFolder, files(i).name);\r\n    [path, name, ~] = fileparts(srcFile);\r\n    cacheFile = fullfile(path, [name <span class=\"string\">'.mat'<\/span>]);\r\n    <span class=\"keyword\">if<\/span> isempty(dir(cacheFile))\r\n        f = fopen(srcFile);\r\n        C = textscan(f, <span class=\"string\">'%f %f %*f %f %f %*[^\\n]'<\/span>, <span class=\"string\">'Delimiter'<\/span>, <span class=\"string\">','<\/span>, <span class=\"keyword\">...<\/span>\r\n            <span class=\"string\">'HeaderLines'<\/span>, 1, <span class=\"string\">'TreatAsEmpty'<\/span>, <span class=\"string\">'NA'<\/span>);\r\n        fclose(f);\r\n        [Year, Month, DayOfWeek, DepTime] = C{:};\r\n        parSave(cacheFile, Year, Month, DayOfWeek, DepTime);\r\n    <span class=\"keyword\">end<\/span>\r\n<span class=\"keyword\">end<\/span>\r\n\r\nsprintf(<span class=\"string\">'%g seconds to import text files to save to MAT-Files.\\n'<\/span>, toc)\r\n<\/pre><pre class=\"codeoutput\">\r\nans =\r\n\r\n220.984 seconds to import text files to save to MAT-Files.\r\n\r\n\r\n<\/pre><h4>Load the Entire Data Set<a name=\"52a1fe8e-7435-4e91-8c66-0d0c01008523\"><\/a><\/h4><p>Now it is time to load the entire data set, which was just saved into MAT-Files.  Since the number of workers may be fewer than the number of files, each worker will be responsible for a collection of files.  Files are assigned to workers using their <tt>labindex<\/tt> and modulo arithmetic. Each worker reads all of its files, and <tt><a href=\"https:\/\/www.mathworks.com\/help\/matlab\/ref\/vertcat.html\">vertcat<\/a><\/tt>'s their contents into a single array.  While importing the data, this is a convienent time to perform the processing introduced above -- filtering NaN rows, converting the HHMM representation of departure time, and noting the length of each worker's piece of the data set.  Next, create the <tt>distributed<\/tt> arrays.  Though this section of code is the longest of this post, it is largely a copy-and-paste of code already discussed.<\/p><pre class=\"codeinput\">tic\r\n<span class=\"keyword\">spmd<\/span>\r\n    myIndicies = labindex:numlabs:numel(files);\r\n\r\n    Year = cell(size(myIndicies));\r\n    Month = cell(size(myIndicies));\r\n    DayOfWeek = cell(size(myIndicies));\r\n    DepTime = cell(size(myIndicies));\r\n\r\n    <span class=\"keyword\">for<\/span> i = 1:numel(myIndicies)\r\n        srcFile = fullfile(dataFolder, files(myIndicies(i)).name);\r\n        [path, name, ~] = fileparts(srcFile);\r\n        cacheFile = fullfile(path, [name <span class=\"string\">'.mat'<\/span>]);\r\n        tmpT = load(cacheFile);\r\n        Year{i} = tmpT.Year;\r\n        Month{i} = tmpT.Month;\r\n        DayOfWeek{i} = tmpT.DayOfWeek;\r\n        DepTime{i} = tmpT.DepTime;\r\n    <span class=\"keyword\">end<\/span>\r\n<span class=\"keyword\">end<\/span>\r\n\r\nclear <span class=\"string\">tmpT<\/span>  <span class=\"comment\">% As above, avoid deep copies<\/span>\r\n\r\n<span class=\"keyword\">spmd<\/span>\r\n    Year = vertcat(Year{:});\r\n    Month = vertcat(Month{:});\r\n    DayOfWeek = vertcat(DayOfWeek{:});\r\n    DepTime = vertcat(DepTime{:});\r\n\r\n    goodRows = ~isnan(DepTime);\r\n    Year = Year(goodRows);\r\n    Month = Month(goodRows);\r\n    DayOfWeek = DayOfWeek(goodRows);\r\n    DepTime = DepTime(goodRows);\r\n\r\n    DepTime = floor(DepTime\/100)+mod(DepTime,60)\/60;\r\n    lengths = size(Year,1);\r\n<span class=\"keyword\">end<\/span>\r\n\r\nlengths = cell2mat(lengths(:));\r\nlengths = lengths';\r\ntotalLength = sum(lengths);\r\n\r\n<span class=\"keyword\">spmd<\/span>\r\n    codistr = codistributor1d(1, lengths, [totalLength, 1]);\r\n    Year = codistributed.build(Year, codistr);\r\n    Month = codistributed.build(Month, codistr);\r\n    DayOfWeek = codistributed.build(DayOfWeek, codistr);\r\n    DepTime = codistributed.build(DepTime, codistr);\r\n<span class=\"keyword\">end<\/span>\r\n\r\nsprintf(<span class=\"string\">'%g seconds to import from MAT-Files and create distributed arrays.\\n'<\/span>, toc)\r\n<\/pre><pre class=\"codeoutput\">\r\nans =\r\n\r\n12.121 seconds to import from MAT-Files and create distributed arrays.\r\n\r\n\r\n<\/pre><h4>Rerun Visualizations<a name=\"fbbb7e76-bfde-4356-b0a6-051e0704295e\"><\/a><\/h4><p>You can now perform some of the visualizations over the entire data set. <tt>DayOfWeek<\/tt> is visualized this time.<\/p><pre class=\"codeinput\">figure\r\nhist(DayOfWeek, 1:7)\r\ntitle(<span class=\"string\">'Histogram of day of the week'<\/span>)\r\nxlabel(<span class=\"string\">'Day of the week'<\/span>)\r\nylabel(<span class=\"string\">'Number of flights'<\/span>)\r\nset(gca, <span class=\"string\">'XTickLabel'<\/span>, {<span class=\"string\">'Mon'<\/span>, <span class=\"string\">'Tues'<\/span>, <span class=\"string\">'Wed'<\/span>, <span class=\"string\">'Thurs'<\/span>, <span class=\"string\">'Fri'<\/span>, <span class=\"string\">'Sat'<\/span>, <span class=\"string\">'Sun'<\/span>});\r\n<\/pre><img decoding=\"async\" vspace=\"5\" hspace=\"5\" src=\"https:\/\/blogs.mathworks.com\/images\/loren\/2013\/spmdDataAnalysis_06.png\" alt=\"\"> <h4>Rebalance Data Among the Pool of Workers<a name=\"02c70e44-4c1e-4159-840a-bab745fb7db6\"><\/a><\/h4><p>Load balancing is the final topic of discussion.  The <tt>distributed<\/tt> array is partitioned at the file level, with zero or more files per worker. This is probably fine if your input files are roughly the same size and the number of workers is smaller than the number of files.  However, uneven file size may create an imbalance of work, or, even worse, more workers than files will result in completely idle workers! If you are running more than 22 workers, you will see this effect when working with this data set.<\/p><p>It therefore can be helpful to rebalance the distribution of a <tt>distributed<\/tt> array.  This will require some transfer of data between workers, but it may be a worthwhile investment to accelerate subsequent computations.  Calculate the \"ideal\" (balanced) distribution, create a new <tt>codistributor<\/tt> with that ideal distribution, and then <tt><a title=\"https:\/\/www.mathworks.com\/help\/distcomp\/redistribute.html (link no longer works)\">redistribute<\/a><\/tt> using that ideal.<\/p><pre class=\"codeinput\">tic\r\n<span class=\"keyword\">spmd<\/span>\r\n    beforeSizes = size(getLocalPart(Year), 1);\r\n<span class=\"keyword\">end<\/span>\r\n\r\nnewBreaks=round(linspace(0, totalLength, numel(lengths)+1));\r\nnewLengths=newBreaks(2:end)-newBreaks(1:end-1);\r\n\r\n<span class=\"keyword\">spmd<\/span>\r\n    newCodistr = codistributor1d(1, newLengths, [totalLength, 1]);\r\n    Year = redistribute(Year, newCodistr);\r\n    Month = redistribute(Month, newCodistr);\r\n    DayOfWeek = redistribute(DayOfWeek, newCodistr);\r\n    DepTime = redistribute(DepTime, newCodistr);\r\n<span class=\"keyword\">end<\/span>\r\n\r\n<span class=\"keyword\">spmd<\/span>\r\n    afterSizes = size(getLocalPart(Year), 1);\r\n<span class=\"keyword\">end<\/span>\r\n\r\nsprintf(<span class=\"string\">'%g seconds to redistribute the arrays.\\n'<\/span>, toc)\r\n\r\nfigure\r\nbar([[beforeSizes{:}]', [afterSizes{:}]'])\r\ntitle(<span class=\"string\">'Data per worker (Before and After)'<\/span>)\r\nxlabel(<span class=\"string\">'Worker Number'<\/span>)\r\nylabel(<span class=\"string\">'Amount of data'<\/span>)\r\nlegend({<span class=\"string\">'Before'<\/span>, <span class=\"string\">'After'<\/span>}, <span class=\"string\">'Location'<\/span>, <span class=\"string\">'NorthWest'<\/span>)\r\n<\/pre><pre class=\"codeoutput\">\r\nans =\r\n\r\n14.1069 seconds to redistribute the arrays.\r\n\r\n\r\n<\/pre><img decoding=\"async\" vspace=\"5\" hspace=\"5\" src=\"https:\/\/blogs.mathworks.com\/images\/loren\/2013\/spmdDataAnalysis_07.png\" alt=\"\"> <h4>In Closing<a name=\"f5c53fe7-94a9-4249-b129-488eca61c918\"><\/a><\/h4><p>Hopefully this paper and code has given you a sense of the kind of interactive workflow that is possible with big data when using a distributed array.  Parallel file I\/O, distributed and cooperative operations, visualization, and load balancing are introduced and explored.  Though the examples used here were simple, these basic principles extend to other data sets similarly distributed among a collection of files.<\/p><p>Significantly, this post only scratched the surface of the actual numeric capabilities of distributed arrays -- higher order operations such as single-value decomposition (SVD) and other linear algebra functions, FFTs, and so on are also available.  For further reading about distributed arrays, try the following:<\/p><div><ul><li><a href=\"https:\/\/www.mathworks.com\/company\/newsletters\/articles\/solving-large-scale-linear-algebra-problems-using-spmd-and-distributed-arrays.html\">Solving Large-Scale Linear Algebra Problems Using SPMD and Distributed Arrays<\/a> for a closer look at the linear algebra capabilities of distributed arrays<\/li><li><a href=\"https:\/\/www.mathworks.com\/help\/distcomp\/working-with-codistributed-arrays.html\">Working with Codistributed Arrays<\/a> for more on creating and redistributing distributed arrays<\/li><\/ul><\/div><p>How do you manage your large data analysis?  Let us know <a href=\"https:\/\/blogs.mathworks.com\/loren\/?p=803#respond\">here<\/a>.<\/p><script language=\"JavaScript\"> <!-- \r\n    function grabCode_8c4987e217494b05bd50b4e7bb9ac76e() {\r\n        \/\/ Remember the title so we can use it in the new page\r\n        title = document.title;\r\n\r\n        \/\/ Break up these strings so that their presence\r\n        \/\/ in the Javascript doesn't mess up the search for\r\n        \/\/ the MATLAB code.\r\n        t1='8c4987e217494b05bd50b4e7bb9ac76e ' + '##### ' + 'SOURCE BEGIN' + ' #####';\r\n        t2='##### ' + 'SOURCE END' + ' #####' + ' 8c4987e217494b05bd50b4e7bb9ac76e';\r\n    \r\n        b=document.getElementsByTagName('body')[0];\r\n        i1=b.innerHTML.indexOf(t1)+t1.length;\r\n        i2=b.innerHTML.indexOf(t2);\r\n \r\n        code_string = b.innerHTML.substring(i1, i2);\r\n        code_string = code_string.replace(\/REPLACE_WITH_DASH_DASH\/g,'--');\r\n\r\n        \/\/ Use \/x3C\/g instead of the less-than character to avoid errors \r\n        \/\/ in the XML parser.\r\n        \/\/ Use '\\x26#60;' instead of '<' so that the XML parser\r\n        \/\/ doesn't go ahead and substitute the less-than character. \r\n        code_string = code_string.replace(\/\\x3C\/g, '\\x26#60;');\r\n\r\n        copyright = 'Copyright 2013 The MathWorks, Inc.';\r\n\r\n        w = window.open();\r\n        d = w.document;\r\n        d.write('<pre>\\n');\r\n        d.write(code_string);\r\n\r\n        \/\/ Add copyright line at the bottom if specified.\r\n        if (copyright.length > 0) {\r\n            d.writeln('');\r\n            d.writeln('%%');\r\n            if (copyright.length > 0) {\r\n                d.writeln('% _' + copyright + '_');\r\n            }\r\n        }\r\n\r\n        d.write('<\/pre>\\n');\r\n\r\n        d.title = title + ' (MATLAB code)';\r\n        d.close();\r\n    }   \r\n     --> <\/script><p style=\"text-align: right; font-size: xx-small; font-weight:lighter;   font-style: italic; color: gray\"><br><a href=\"javascript:grabCode_8c4987e217494b05bd50b4e7bb9ac76e()\"><span style=\"font-size: x-small;        font-style: italic;\">Get \r\n      the MATLAB code <noscript>(requires JavaScript)<\/noscript><\/span><\/a><br><br>\r\n      Published with MATLAB&reg; R2013b<br><\/p><p class=\"footer\"><br>\r\n      Published with MATLAB&reg; R2013b<br><\/p><\/div><!--\r\n8c4987e217494b05bd50b4e7bb9ac76e ##### SOURCE BEGIN #####\r\n%% In-memory Big Data Analysis with PCT and MDCS\r\n%\r\n% _Ken Atwell in the MATLAB product management group is guest blogging about\r\n% using |<https:\/\/www.mathworks.com\/help\/distcomp\/distributed.html distributed>| arrays to perform data analytics on in-memory \"big\r\n% data\"._\r\n%\r\n% Big data applications can take many forms.  Unlike unstructured text\r\n% processing, common in many business and consumer applications, big data\r\n% in science and engineering fields can often take a more \"regular\" form,\r\n% either a lengthy data set of observations or time series data, or a\r\n% monolithic, multi-dimensional matrix capturing some problem space.\r\n%\r\n% 64-bit MATLAB is fundamentally well-suited to processing data of these\r\n% types, as long as the data fits within the memory of your computer.  When\r\n% data size outgrows the capability of a single computer, <https:\/\/www.mathworks.com\/products\/distriben MATLAB\r\n% Distributed Computing Server> (MDCS) can allow MATLAB applications to simply\r\n% scale and leverage the aggregate memory and computational capabilities of\r\n% a cluster of computers.\r\n%\r\n% This blog post explores the use of MDCS to analyze a tabular data set\r\n% that does not fit into the memory of the typical desktop computer.\r\n% Highlights include:\r\n%\r\n% * Importing a large data set sourced from multiple files with parallel\r\n% file I\/O, and post-processing that data set\r\n% * Performing parallel computation across the aggregate data set\r\n% * Visualizing the statistical characteristics of the aggregate data set\r\n% * Optimizing file I\/O performance with MATLAB MAT-Files\r\n% * Rebalancing data and workload between workers\r\n%\r\n% You will need <https:\/\/www.mathworks.com\/products\/parallel-computing Parallel Computing Toolbox> (PCT) to access the |distributed| array, a data type for\r\n% working with data storage across a cluster.  Its usage is virtually\r\n% identical to that of a normal MATLAB matrix, supporting easy and rapid\r\n% adoption.  The in-memory nature of the distributed array facilitates\r\n% experimentation and the rapid iteration workflows that MATLAB users have\r\n% come to expect.\r\n%\r\n% Finally, note that this example has been designed to run in about 20 GB\r\n% of memory.  That is hardly \"big data\", but the problem size has been\r\n% purposefully reduced to be small enough to fit into a mid- to high-end\r\n% workstation, should you not have access to cluster to follow along.\r\n\r\n%% Load the Data Set\r\n%\r\n% This example uses a publicly available data set of over 100 million\r\n% airline flight records from over a 20-year time span from 1987 to 2008.\r\n% Download years individually:\r\n%\r\n% # <https:\/\/s3.amazonaws.com\/h2o-airlines-unpacked\/year1987.csv>\r\n% # <https:\/\/s3.amazonaws.com\/h2o-airlines-unpacked\/year1988.csv>\r\n% # <https:\/\/s3.amazonaws.com\/h2o-airlines-unpacked\/year1989.csv>\r\n% # <https:\/\/s3.amazonaws.com\/h2o-airlines-unpacked\/year1990.csv>\r\n% # <https:\/\/s3.amazonaws.com\/h2o-airlines-unpacked\/year1991.csv>\r\n% # <https:\/\/s3.amazonaws.com\/h2o-airlines-unpacked\/year1992.csv>\r\n% # <https:\/\/s3.amazonaws.com\/h2o-airlines-unpacked\/year1993.csv>\r\n% # <https:\/\/s3.amazonaws.com\/h2o-airlines-unpacked\/year1994.csv>\r\n% # <https:\/\/s3.amazonaws.com\/h2o-airlines-unpacked\/year1995.csv>\r\n% # <https:\/\/s3.amazonaws.com\/h2o-airlines-unpacked\/year1996.csv>\r\n% # <https:\/\/s3.amazonaws.com\/h2o-airlines-unpacked\/year1997.csv>\r\n% # <https:\/\/s3.amazonaws.com\/h2o-airlines-unpacked\/year1998.csv>\r\n% # <https:\/\/s3.amazonaws.com\/h2o-airlines-unpacked\/year1999.csv>\r\n% # <https:\/\/s3.amazonaws.com\/h2o-airlines-unpacked\/year2000.csv>\r\n% # <https:\/\/s3.amazonaws.com\/h2o-airlines-unpacked\/year2001.csv>\r\n% # <https:\/\/s3.amazonaws.com\/h2o-airlines-unpacked\/year2002.csv>\r\n% # <https:\/\/s3.amazonaws.com\/h2o-airlines-unpacked\/year2003.csv>\r\n% # <https:\/\/s3.amazonaws.com\/h2o-airlines-unpacked\/year2004.csv>\r\n% # <https:\/\/s3.amazonaws.com\/h2o-airlines-unpacked\/year2005.csv>\r\n% # <https:\/\/s3.amazonaws.com\/h2o-airlines-unpacked\/year2006.csv>\r\n% # <https:\/\/s3.amazonaws.com\/h2o-airlines-unpacked\/year2007.csv>\r\n% # <https:\/\/s3.amazonaws.com\/h2o-airlines-unpacked\/year2008.csv>\r\n%\r\n% As is typical in many big data applications, data is spread over a\r\n% collection of medium-to-large files, rather than a single very large\r\n% file.  This is in fact advantageous, allowing the data to be more easily\r\n% distributed across a cluster running MDCS.  During this development\r\n% phase, use a modest pool (8 workers) and only the first eight files in\r\n% the data set REPLACE_WITH_DASH_DASH the full data set will be seen later in this post.\r\n\r\n%% Identify Files to Process\r\n%\r\n% Once the files are downloaded, look for them in MATLAB.  Modify the value\r\n% of |dataFolder| to match wherever you downloaded your files to.\r\n\r\nclose all\r\nclear all\r\ndataFolder = '\/Volumes\/HD2\/airline\/data';\r\nallFiles = dir(fullfile(dataFolder, '*.csv'));\r\n% Remove \"dot\" files (Mac)\r\nallFiles = allFiles(~strncmp({allFiles(:).name}, '.', 1));\r\nfiles = allFiles(1:8);  % For initial development\r\n\r\n%% Examine the Data\r\n% The data set is comma-separated text.  To keep things simple in this\r\n% particular example, only read in a few of the columns in the data set.\r\n% Importing from text is an expensive operation REPLACE_WITH_DASH_DASH reducing the importation\r\n% to only the needed columns can speed up a MATLAB application\r\n% considerably.  First look at left-most part of the raw data:\r\n\r\nif ~ispc\r\n    system(['head -20 ' fullfile(dataFolder, files(8).name) ' | cut -c -60']);\r\nend\r\n\r\n%% Load One File\r\n%\r\n% Use the |<https:\/\/www.mathworks.com\/help\/matlab\/ref\/textscan.html textscan>| function to import the data, as it gives the\r\n% flexibility needed to import particular columns from a file of mixed text\r\n% of data; it also can contend with \"NA\" as an indicator of a missing\r\n% value.  Extract columns 1, 2, 4, and 5 (|Year|, |Month|, |DayOfWeek|,\r\n% |DepTime|) from one file on the local machine.\r\n\r\ntic\r\nsrcFile = fullfile(dataFolder, files(8).name);\r\nf = fopen(srcFile);\r\nC = textscan(f, '%f %f %*f %f %f %*[^\\n]', 'Delimiter', ',', ...\r\n    'HeaderLines', 1, 'TreatAsEmpty', 'NA');\r\nfclose(f);\r\n[Year, Month, DayOfWeek, DepTime] = C{:};\r\nsprintf('%g seconds to load \"%s\".\\n', toc, srcFile)\r\n\r\n%% Open a Pool of Local Workers\r\n% The previous step probably took a bit of time to import, maybe a full\r\n% minute.  Luckily, this operation will run in parallel later.  If\r\n% necessary, open an 8-worker pool now:\r\n\r\ntry\r\n    parpool('local', 8);\r\ncatch\r\nend\r\n\r\n%% Load One File to All Workers\r\n%\r\n% Move the |textscan| code to the cluster, and have all of the nodes in the\r\n% cluster load this file.  This is done with a _single program\/multiple\r\n% data_ (SPMD) block.  An |<https:\/\/www.mathworks.com\/help\/distcomp\/spmd.html spmd>| block of MATLAB code is executed on all\r\n% eight of the workers in parallel:\r\n\r\ntic\r\nspmd\r\n    srcFile = fullfile(dataFolder, files(8).name);\r\n    f = fopen(srcFile);\r\n    C = textscan(f, '%f %f %*f %f %f %*[^\\n]', 'Delimiter', ',', ...\r\n        'HeaderLines', 1, 'TreatAsEmpty', 'NA');\r\n    [Year, Month, DayOfWeek, DepTime] = C{:};\r\n    fclose(f);\r\nend\r\nsprintf('%g seconds to load \"%s\" on all workers.\\n', toc, srcFile{1})\r\n\r\n%% Load a Different File on Each Worker\r\n%\r\n% This previous step takes about the same length of  time as the previous load,\r\n% a near linear speed-up.  But, observe that the same file was loaded eight times.\r\n% To allow for different behaviors on different workers, each worker is assigned \r\n% a unique index, called |<https:\/\/www.mathworks.com\/help\/distcomp\/labindex.html labindex>|.  Use |labindex| to make each worker behave \r\n% differently within an |spmd| block.  In this case, by simply replacing the\r\n% hard-coded index |files(8)| with |files(labindex)|, the |spmd| block loads\r\n% different files on different workers (recall that |files| is a vector of all\r\n% filenames in the data set) REPLACE_WITH_DASH_DASH this |spmd| block is otherwise identical to the previous one.\r\n\r\ntic\r\nspmd\r\n    srcFile = fullfile(dataFolder, files(labindex).name);\r\n    f = fopen(srcFile);\r\n    C = textscan(f, '%f %f %*f %f %f %*[^\\n]', 'Delimiter', ',', ...\r\n        'HeaderLines', 1, 'TreatAsEmpty', 'NA');\r\n    [Year, Month, DayOfWeek, DepTime] = C{:};\r\n    fclose(f);\r\nend\r\nsprintf('%g seconds to load different files on each worker.\\n', toc)\r\n\r\n%% Performing Non-cooperative Computation per Worker\r\n%\r\n% Each worker now has its own set of data.  In some applications, this may\r\n% be enough, and each data set can be worked on independently from each\r\n% other.  This is sometimes referred to as an \"embarrassingly parallel\"\r\n% problem. In the case of this data set, each file is data for an entire\r\n% year, so calculations by year are trivial to compute.  The value of\r\n% |Year| should be identical within each worker. Confirm this in an |spmd|\r\n% block, and then bring the results of that computation back locally for\r\n% examination.  Note the use of cell array-style indexing immediately after\r\n% the |spmd| block; this is the means to retrieve all values a variable has\r\n% on each worker.  Since there are eight workers, eight values are returned\r\n% REPLACE_WITH_DASH_DASH this is referred to as a |<https:\/\/www.mathworks.com\/help\/distcomp\/composite.html Composite>| in MATLAB.\r\n\r\nspmd\r\n    allYearsSame = all(Year==Year(1));\r\n    myYear = Year(1);\r\nend\r\n\r\n[allYearsSame{:}]\r\n[myYear{:}]\r\n\r\n%% Visualize Departure Times\r\n%\r\n% Create a histogram of departure hour over eight years to get a first\r\n% real look at the data.  Note that one year (1987) has significantly less\r\n% magnitude than the other years REPLACE_WITH_DASH_DASH this is because it is only a partial\r\n% year's worth of data.  A course-grain partitioning of your data using\r\n% files and folder hierarchy can be a straightforward way of segmenting\r\n% your data set into meaningful (and manageable) subsets.\r\n\r\nspmd\r\n    H = hist(DepTime, 24);\r\nend\r\n\r\nplot(cell2mat(H(:))', '-o')\r\ntitle('Histogram of departure times by year')\r\nxlabel('Hour of the day')\r\nylabel('Number of flights')\r\nlegend(arrayfun(@(x) sprintf('%d', x), [myYear{:}], 'UniformOutput', false), ...\r\n    'Location', 'NorthWest')\r\n\r\n%% Perform Cooperative Computation per Worker\r\n%\r\n% But suppose analysis must be performed across all observations in the\r\n% data set. For this, |distributed| arrays can be used. A |distributed|\r\n% array allows MATLAB to work with an array spread across many workers\r\n% as-if it were a single, \"normal\" MATLAB array. |distributed| arrays are\r\n% supported by a large set of MATLAB functions, including linear algebra.\r\n%\r\n% To create a |distributed| array from the data already in memory, first\r\n% determine how big each worker's piece of the array is, and its total\r\n% size. Calculate the individual and total lengths, and then create four\r\n% |distributed| arrays, one for each column in the original data set, using\r\n% a so-called |<https:\/\/www.mathworks.com\/help\/distcomp\/codistributor.html codistributor>| (called a co-distributor because this is from\r\n% the point of view of each worker; each needs to work cooperatively with\r\n% other workers).\r\n\r\ntic\r\nspmd\r\n    lengths = size(Year,1);\r\nend\r\n\r\nlengths = cell2mat(lengths(:));\r\nlengths = lengths';\r\ntotalLength = sum(lengths);\r\n\r\nspmd\r\n    codistr = codistributor1d(1, lengths, [totalLength, 1]);\r\n    Year = codistributed.build(Year, codistr);\r\n    Month = codistributed.build(Month, codistr);\r\n    DayOfWeek = codistributed.build(DayOfWeek, codistr);\r\n    DepTime = codistributed.build(DepTime, codistr);\r\nend\r\n\r\nsprintf('%g seconds to create distributed arrays from Composites.\\n', toc)\r\n\r\n%% Visualize Year Across the Entire Data Set\r\n%\r\n% There are now four variables (|Year|, |Month|, |DayOfWeek|, and\r\n% |DepTime|) that contain the entire 8-year data set; these variables can\r\n% be used much like normal MATLAB variables.  This example is using a 1-D\r\n% array, but multidimensional |distributed| arrays are also supported\r\n% should you need to distribute \"tiles\" of data across a cluster. If you\r\n% ever want to gather up the |distributed| array into one normal MATLAB\r\n% array, you can do so with the |<https:\/\/www.mathworks.com\/help\/distcomp\/gather.html gather>| function.  Be careful when doing\r\n% this, and ensure that your local computer has enough memory to hold the\r\n% aggregate content of the entire |distributed| array.  To create a histogram of\r\n% departure year:\r\n\r\nfirstYear = gather(min(Year));\r\nlastYear = gather(max(Year));\r\nfigure\r\nhist(Year, firstYear:lastYear, 'EdgeColor', 'w')\r\ntitle('Histogram of year')\r\nxlabel('Year')\r\nylabel('Number of flights')\r\n\r\n%%\r\n% Note the use of |gather| here.  Any operation on a |distributed| array\r\n% produces another |distributed| array. Even though |min| and |max| return\r\n% small arrays, they must nevertheless be |gather|'ed back to your local\r\n% MATLAB before using them in the second argument to the |<https:\/\/www.mathworks.com\/help\/matlab\/ref\/hist.html hist>| function.\r\n% You can see more starkly now that 1987 has far fewer observations than\r\n% the subsequent years.\r\n\r\n%% Visualize Departure Time Across the Entire Data Set\r\n%\r\n% Next produce a histogram of the departure time.  There is no need to\r\n% gather anything this time because the second argument to |hist| (24, as\r\n% in hours in a day) is a constant value and not something that needs to be\r\n% computed from the |distributed| data.\r\n\r\nfigure;\r\nhist(DepTime,24);\r\ntitle('Histogram of departure time')\r\nxlabel('Hour of Day')\r\nylabel('Number of flights')\r\n\r\n%% Use 24 Histogram Bins\r\n%\r\n% Readers will notice that the histogram bins run up to 2500, not 25.  This\r\n% is because the original data set expressed time as an integer encoding\r\n% HHMM.  To consider only the hour:\r\n\r\nhist(floor(DepTime\/100),24)\r\ntitle('Histogram of departure time (Hour only)')\r\nxlabel('Hour of Day')\r\nylabel('Number of flights')\r\n\r\n%% Re-express the Departure Time\r\n%\r\n% In fact, re-express HHMM as a number between 0 and 24, with the value\r\n% after the decimal point expressing each minute as 1\/60th. Update the\r\n% value of the |distributed| array and then again plot the histogram with\r\n% |DepTime|'s improved representation.\r\n\r\nclear C;\r\nDepTime = floor(DepTime\/100)+mod(DepTime,60)\/60;\r\nhist(DepTime,24)\r\ntitle('Histogram of departure time (Fractional time)')\r\nxlabel('Hour of Day')\r\nylabel('Number of flights')\r\n\r\n%%\r\n% Note the use of |<https:\/\/www.mathworks.com\/help\/matlab\/ref\/clear.html clear>| to rid the workspace of |C|, a temporary created\r\n% earlier in the code when the file was imported. MATLAB generally avoids\r\n% keeping unnecessary copies of data around, and keeping the temporary\r\n% had no material effect on the program up till now. However, since the\r\n% value of |DepTime| will change, MATLAB will be in a position where it\r\n% needs to keep similar-but-distinct copies of the arrays.  This is to be\r\n% avoided for obvious reasons, so clear these older \"snapshots\" before\r\n% modifying |DepTime|.\r\n\r\n\r\n%% Removing Missing Data Points\r\n%\r\n% Up to this point, the data has not been checked for the presence of\r\n% missing data, represented by not-a-number (NaN) in the departure time.\r\n% Look for NaNs, and then remove those bad rows from all of the distributed\r\n% arrays.\r\n\r\ntic\r\ngoodRows = ~isnan(DepTime);\r\nfprintf('There are %d rows to remove from the data.\\n', ...\r\n    totalLength-gather(sum(goodRows)));\r\nYear = Year(goodRows);\r\nMonth = Month(goodRows);\r\nDayOfWeek = DayOfWeek(goodRows);\r\nDepTime = DepTime(goodRows);\r\n\r\nsprintf('%g seconds to trim bad rows.\\n', toc)\r\n\r\n%% Work With the Full Data Set\r\n%\r\n% It is time to work with all of the files in the data set.  Because the\r\n% data files are text, this application will spend most of its runtime\r\n% parsing text and converting to numeric.  This may be acceptable for a\r\n% one-time analysis, but if you anticipate the need to load the data set\r\n% more than once, it is likely worth the effort to save that numeric\r\n% representation for future use.  This is done in a simple |<https:\/\/www.mathworks.com\/help\/distcomp\/parfor.html parfor>| loop,\r\n% creating a MATLAB MAT-File for each file in the original data set.  This\r\n% loop takes a few minutes to run the first time.  However, the code is\r\n% also smart enough to not recreate a MAT-File when it already exists, so\r\n% this loop finishes quickly when later rerun.\r\n%\r\n% Note: If you have a cluster with more than eight cores available to you,\r\n% now would be a good time to close the eight-worker pool and open a larger\r\n% pool.  If you do not have access to a cluster, that is okay too as long as your\r\n% local computer has around 16 GB of RAM or more.\r\n\r\n\r\nclear all\r\ndataFolder = '\/Volumes\/HD2\/airline\/data';\r\nfiles = dir(fullfile(dataFolder, '*.csv'));\r\nfiles = files(~strncmp({files(:).name}, '.', 1));  % Remove \"dot\" files (Mac)\r\n\r\ntic\r\nparfor i=1:numel(files)\r\n    % See https:\/\/www.mathworks.com\/support\/solutions\/en\/data\/1-D8103H for the trick below\r\n    parSave = @(fname, Year, Month, DayOfWeek, DepTime) ...\r\n        save(fname, 'Year', 'Month', 'DayOfWeek', 'DepTime');\r\n    \r\n    srcFile = fullfile(dataFolder, files(i).name);\r\n    [path, name, ~] = fileparts(srcFile);\r\n    cacheFile = fullfile(path, [name '.mat']);\r\n    if isempty(dir(cacheFile))\r\n        f = fopen(srcFile);\r\n        C = textscan(f, '%f %f %*f %f %f %*[^\\n]', 'Delimiter', ',', ...\r\n            'HeaderLines', 1, 'TreatAsEmpty', 'NA');\r\n        fclose(f);\r\n        [Year, Month, DayOfWeek, DepTime] = C{:};\r\n        parSave(cacheFile, Year, Month, DayOfWeek, DepTime);\r\n    end\r\nend\r\n\r\nsprintf('%g seconds to import text files to save to MAT-Files.\\n', toc)\r\n\r\n%% Load the Entire Data Set\r\n%\r\n% Now it is time to load the entire data set, which was just saved into\r\n% MAT-Files.  Since the number of workers may be fewer than the number of\r\n% files, each worker will be responsible for a collection of files.  Files\r\n% are assigned to workers using their |labindex| and modulo arithmetic.\r\n% Each worker reads all of its files, and |<https:\/\/www.mathworks.com\/help\/matlab\/ref\/vertcat.html vertcat>|'s their contents into a\r\n% single array.  While importing the data, this is a convienent time to\r\n% perform the processing introduced above REPLACE_WITH_DASH_DASH filtering NaN rows,\r\n% converting the HHMM representation of departure time, and noting the\r\n% length of each worker's piece of the data set.  Next, create the\r\n% |distributed| arrays.  Though this section of code is the longest of this\r\n% post, it is largely a copy-and-paste of code already discussed.\r\n\r\ntic\r\nspmd\r\n    myIndicies = labindex:numlabs:numel(files);\r\n    \r\n    Year = cell(size(myIndicies));\r\n    Month = cell(size(myIndicies));\r\n    DayOfWeek = cell(size(myIndicies));\r\n    DepTime = cell(size(myIndicies));\r\n    \r\n    for i = 1:numel(myIndicies)\r\n        srcFile = fullfile(dataFolder, files(myIndicies(i)).name);\r\n        [path, name, ~] = fileparts(srcFile);\r\n        cacheFile = fullfile(path, [name '.mat']);\r\n        tmpT = load(cacheFile);\r\n        Year{i} = tmpT.Year;\r\n        Month{i} = tmpT.Month;\r\n        DayOfWeek{i} = tmpT.DayOfWeek;\r\n        DepTime{i} = tmpT.DepTime;\r\n    end\r\nend\r\n\r\nclear tmpT  % As above, avoid deep copies\r\n\r\nspmd\r\n    Year = vertcat(Year{:});\r\n    Month = vertcat(Month{:});\r\n    DayOfWeek = vertcat(DayOfWeek{:});\r\n    DepTime = vertcat(DepTime{:});\r\n    \r\n    goodRows = ~isnan(DepTime);\r\n    Year = Year(goodRows);\r\n    Month = Month(goodRows);\r\n    DayOfWeek = DayOfWeek(goodRows);\r\n    DepTime = DepTime(goodRows);\r\n    \r\n    DepTime = floor(DepTime\/100)+mod(DepTime,60)\/60;\r\n    lengths = size(Year,1);\r\nend\r\n\r\nlengths = cell2mat(lengths(:));\r\nlengths = lengths';\r\ntotalLength = sum(lengths);\r\n\r\nspmd\r\n    codistr = codistributor1d(1, lengths, [totalLength, 1]);\r\n    Year = codistributed.build(Year, codistr);\r\n    Month = codistributed.build(Month, codistr);\r\n    DayOfWeek = codistributed.build(DayOfWeek, codistr);\r\n    DepTime = codistributed.build(DepTime, codistr);\r\nend\r\n\r\nsprintf('%g seconds to import from MAT-Files and create distributed arrays.\\n', toc)\r\n\r\n%% Rerun Visualizations\r\n%\r\n% You can now perform some of the visualizations over the entire data set.\r\n% |DayOfWeek| is visualized this time.\r\n\r\nfigure\r\nhist(DayOfWeek, 1:7)\r\ntitle('Histogram of day of the week')\r\nxlabel('Day of the week')\r\nylabel('Number of flights')\r\nset(gca, 'XTickLabel', {'Mon', 'Tues', 'Wed', 'Thurs', 'Fri', 'Sat', 'Sun'});\r\n\r\n%% Rebalance Data Among the Pool of Workers\r\n%\r\n% Load balancing is the final topic of discussion.  The |distributed| array\r\n% is partitioned at the file level, with zero or more files per worker.\r\n% This is probably fine if your input files are roughly the same size and\r\n% the number of workers is smaller than the number of files.  However,\r\n% uneven file size may create an imbalance of work, or, even worse, more\r\n% workers than files will result in completely idle workers! If you are\r\n% running more than 22 workers, you will see this effect when working with\r\n% this data set.\r\n%\r\n% It therefore can be helpful to rebalance the distribution of a\r\n% |distributed| array.  This will require some transfer of data between\r\n% workers, but it may be a worthwhile investment to accelerate subsequent\r\n% computations.  Calculate the \"ideal\" (balanced) distribution, create a\r\n% new |codistributor| with that ideal distribution, and then |<https:\/\/www.mathworks.com\/help\/distcomp\/redistribute.html redistribute>|\r\n% using that ideal.\r\n\r\ntic\r\nspmd\r\n    beforeSizes = size(getLocalPart(Year), 1);\r\nend\r\n\r\nnewBreaks=round(linspace(0, totalLength, numel(lengths)+1));\r\nnewLengths=newBreaks(2:end)-newBreaks(1:end-1);\r\n\r\nspmd\r\n    newCodistr = codistributor1d(1, newLengths, [totalLength, 1]);\r\n    Year = redistribute(Year, newCodistr);\r\n    Month = redistribute(Month, newCodistr);\r\n    DayOfWeek = redistribute(DayOfWeek, newCodistr);\r\n    DepTime = redistribute(DepTime, newCodistr);\r\nend\r\n\r\nspmd\r\n    afterSizes = size(getLocalPart(Year), 1);\r\nend\r\n\r\nsprintf('%g seconds to redistribute the arrays.\\n', toc)\r\n\r\nfigure\r\nbar([[beforeSizes{:}]', [afterSizes{:}]'])\r\ntitle('Data per worker (Before and After)')\r\nxlabel('Worker Number')\r\nylabel('Amount of data')\r\nlegend({'Before', 'After'}, 'Location', 'NorthWest')\r\n\r\n%% In Closing\r\n% Hopefully this paper and code has given you a sense of the kind of\r\n% interactive workflow that is possible with big data when using a\r\n% distributed array.  Parallel file I\/O, distributed and cooperative\r\n% operations, visualization, and load balancing are introduced and\r\n% explored.  Though the examples used here were simple, these basic\r\n% principles extend to other data sets similarly distributed among a\r\n% collection of files.\r\n%\r\n% Significantly, this post only scratched the surface of the actual\r\n% numeric capabilities of distributed arrays REPLACE_WITH_DASH_DASH higher order operations such\r\n% as single-value decomposition (SVD) and other linear algebra functions,\r\n% FFTs, and so on are also available.  For further reading about\r\n% distributed arrays, try the following:\r\n%\r\n% * <https:\/\/www.mathworks.com\/company\/newsletters\/articles\/solving-large-scale-linear-algebra-problems-using-spmd-and-distributed-arrays.html Solving Large-Scale Linear Algebra Problems Using SPMD and Distributed\r\n% Arrays> for a closer look at the linear algebra capabilities of\r\n% distributed arrays\r\n% * <https:\/\/www.mathworks.com\/help\/distcomp\/working-with-codistributed-arrays.html Working with Codistributed Arrays> for more on creating and\r\n% redistributing distributed arrays\r\n%\r\n% How do you manage your large data analysis?  Let us know\r\n% <https:\/\/blogs.mathworks.com\/loren\/?p=803#respond here>.\r\n\r\n\r\n##### SOURCE END ##### 8c4987e217494b05bd50b4e7bb9ac76e\r\n-->\r\n","protected":false},"excerpt":{"rendered":"<div class=\"overview-image\"><img decoding=\"async\"  class=\"img-responsive\" src=\"https:\/\/blogs.mathworks.com\/images\/loren\/2013\/spmdDataAnalysis_07.png\" onError=\"this.style.display ='none';\" \/><\/div><!--introduction--><p><i>Ken Atwell in the MATLAB product management group is guest blogging about using <tt><a href=\"https:\/\/www.mathworks.com\/help\/distcomp\/distributed.html\">distributed<\/a><\/tt> arrays to perform data analytics on in-memory \"big data\".<\/i>... <a class=\"read-more\" href=\"https:\/\/blogs.mathworks.com\/loren\/2013\/11\/11\/in-memory-big-data-analysis-with-pct-and-mdcs\/\">read more >><\/a><\/p>","protected":false},"author":39,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":[],"categories":[6,34],"tags":[],"_links":{"self":[{"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/posts\/803"}],"collection":[{"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/users\/39"}],"replies":[{"embeddable":true,"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/comments?post=803"}],"version-history":[{"count":7,"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/posts\/803\/revisions"}],"predecessor-version":[{"id":810,"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/posts\/803\/revisions\/810"}],"wp:attachment":[{"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/media?parent=803"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/categories?post=803"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/tags?post=803"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}