{"id":1063,"date":"2014-12-03T16:29:11","date_gmt":"2014-12-03T21:29:11","guid":{"rendered":"https:\/\/blogs.mathworks.com\/loren\/?p=1063"},"modified":"2021-10-04T08:33:30","modified_gmt":"2021-10-04T12:33:30","slug":"reading-big-data-into-matlab","status":"publish","type":"post","link":"https:\/\/blogs.mathworks.com\/loren\/2014\/12\/03\/reading-big-data-into-matlab\/","title":{"rendered":"Reading Big Data into MATLAB"},"content":{"rendered":"<div class=\"content\"><!--introduction--><p>Today I&#8217;d like to introduce guest blogger <a href=\"mailto:sarah.zaranek@mathworks.com\">Sarah Wait Zaranek<\/a> who works for the MATLAB Marketing team here at MathWorks. Sarah has previously written about a variety of topics.  Most recently, she cowrote a post with me about the <a href=\"https:\/\/blogs.mathworks.com\/loren\/2014\/06\/17\/webcam-support-new-in-r2014a\">new webcam capabilities<\/a> in MATLAB. Today, Sarah will be discussing <tt>datastore<\/tt>, one of the new big data capabilities introduced in MATLAB R2014b.<\/p><!--\/introduction--><h3>Contents<\/h3><div><ul><li><a href=\"#c8bbcc26-4ed9-48ed-bf9b-f11b7c5d2e03\">About the Data<\/a><\/li><li><a href=\"#7be657c9-9796-4abb-95aa-399697359602\">What is a Datastore?<\/a><\/li><li><a href=\"#53f9b60d-ea5f-48fb-a456-81f35395b6cc\">Defining Our Input Data<\/a><\/li><li><a href=\"#ba3eabf6-f81b-49db-b545-58767d302f39\">Creating the DataStore<\/a><\/li><li><a href=\"#c47b9ad8-0dd0-4618-ac62-33c7ba923a97\">Preview the Data<\/a><\/li><li><a href=\"#1fb7b78b-b6e3-4523-a008-906806499780\">Select Data to Import<\/a><\/li><li><a href=\"#8a26002b-ec1c-4799-9b07-54f0f535fd59\">Adjust Variable Format<\/a><\/li><li><a href=\"#4983375b-c112-402f-b9a4-2aeffacf526a\">Read in First Chunk<\/a><\/li><li><a href=\"#ccb3d979-bede-4b1e-9875-a1a1579b3431\">Example 1: Read Selected Columns of Data for Use in Memory<\/a><\/li><li><a href=\"#b1d17a68-470f-460c-9d94-3c2d1b0b219f\">Example 2: Filter Data Down to a Subset for Use in 
Memory<\/a><\/li><li><a href=\"#3954fced-bb52-479f-b51a-1c872ba564e8\">Example 3:  Perform Analysis on Chunks of Data and Combine the Results<\/a><\/li><li><a href=\"#2198cc21-e255-4d22-9da4-7177a9e5b11f\">Extending the Use of Datastore<\/a><\/li><li><a href=\"#e9856795-1c05-4a1d-b706-07322a431ace\">Conclusion<\/a><\/li><\/ul><\/div><h4>About the Data<a name=\"c8bbcc26-4ed9-48ed-bf9b-f11b7c5d2e03\"><\/a><\/h4><p><tt>datastore<\/tt> is used for reading data that is too large to fit in memory. For this example, we will be reading in data from the vehicle census of Massachusetts. It is a catalog of information about vehicles registered from 2008 to 2011.  The dataset contains information about individual cars registered, including vehicle type, location where the vehicle is housed, rated MPG, and measured CO_2 emissions. You can learn more about the data and even download it yourself, <a title=\"http:\/\/www.37billionmilechallenge.org\/ (link no longer works)\">here<\/a>.  I have renamed the files in the demo for clarity's sake, but this is where they came from originally.<\/p><h4>What is a Datastore?<a name=\"7be657c9-9796-4abb-95aa-399697359602\"><\/a><\/h4><p>As mentioned, a <a href=\"https:\/\/www.mathworks.com\/help\/matlab\/import_export\/what-is-a-datastore.html\">datastore<\/a> is an object useful for reading collections of data that are too large to fit in memory.<\/p><h4>Defining Our Input Data<a name=\"53f9b60d-ea5f-48fb-a456-81f35395b6cc\"><\/a><\/h4><p><tt>datastore<\/tt> can work with a single file or a collection of files. In this case, we will be reading from a single file. Our file does not include variable names at the top of the file.  
They are listed in a separate header file, as defined below.<\/p><pre class=\"codeinput\"><span class=\"comment\">% Define Data File and Header File<\/span>\r\ndatafile = <span class=\"string\">'vehiclebig.csv'<\/span>;\r\nheaderfile = <span class=\"string\">'varnames.txt'<\/span>;\r\n\r\n<span class=\"comment\">% Read in Variable Names<\/span>\r\nfileID = fopen(headerfile);\r\nvarnames = textscan(fileID,<span class=\"string\">'%s'<\/span>);\r\nvarnames = varnames{:};\r\nfclose(fileID);\r\n<\/pre><h4>Creating the DataStore<a name=\"ba3eabf6-f81b-49db-b545-58767d302f39\"><\/a><\/h4><p>We can now create our datastore by giving the name of the data file as the input to the <tt>datastore<\/tt> function.  We also specify that our datastore should not use the first row of our file as variable names.  We will set those variable names explicitly using the names found in the 'varnames.txt' file.<\/p><pre class=\"codeinput\">ds = datastore(datafile,<span class=\"string\">'ReadVariableNames'<\/span>,false);\r\n\r\n<span class=\"comment\">% Set Variable Names<\/span>\r\nds.VariableNames = varnames\r\n<\/pre><pre class=\"codeoutput\">ds = \r\n  TabularTextDatastore with properties:\r\n\r\n                      Files: {\r\n                             'H:\\Documents\\LOREN\\MyJob\\Art of MATLAB\\SarahZ\\datastore\\vehiclebig.csv'\r\n                             }\r\n          ReadVariableNames: false\r\n              VariableNames: {'record_id', 'vin_id', 'plate_id' ... and 42 more}\r\n\r\n  Text Format Properties:\r\n             NumHeaderLines: 0\r\n                  Delimiter: ','\r\n               RowDelimiter: '\\r\\n'\r\n             TreatAsMissing: ''\r\n               MissingValue: NaN\r\n\r\n  Advanced Text Format Properties:\r\n            TextscanFormats: {'%f', '%f', '%f' ... 
and 42 more}\r\n         ExponentCharacters: 'eEdD'\r\n               CommentStyle: ''\r\n                 Whitespace: ' \\b\\t'\r\n    MultipleDelimitersAsOne: false\r\n\r\n  Properties that control the table returned by preview, read, readall:\r\n      SelectedVariableNames: {'record_id', 'vin_id', 'plate_id' ... and 42 more}\r\n            SelectedFormats: {'%f', '%f', '%f' ... and 42 more}\r\n                RowsPerRead: 20000\r\n\r\n<\/pre><p>Note that we haven't read in our data yet.  We have just provided an easy way to access it through <tt>ds<\/tt>, our datastore.<\/p><h4>Preview the Data<a name=\"c47b9ad8-0dd0-4618-ac62-33c7ba923a97\"><\/a><\/h4><p>A really nice thing about a datastore is that you can preview your data without having to load it all into memory.  <tt>datastore<\/tt> reads the data into a <a href=\"https:\/\/www.mathworks.com\/help\/matlab\/tables.html\">table<\/a> which is a data type in MATLAB designed to work well with tabular data.<\/p><pre class=\"codeinput\">data = preview(ds);\r\nwhos <span class=\"string\">data<\/span>\r\n\r\ndata(:,1:7) <span class=\"comment\">% Look at first 7 variables<\/span>\r\n<\/pre><pre class=\"codeoutput\">  Name      Size            Bytes  Class    Attributes\r\n\r\n  data      8x45            21426  table              \r\n\r\nans = \r\n    record_id    vin_id     plate_id       me_id       owner_type    start_odom     start_date \r\n    _________    ______    __________    __________    __________    __________    ____________\r\n     2           1         5.3466e+06    1.2801e+07    1               NaN         '2011-11-07'\r\n     4           1         5.3466e+06    1.1499e+07    1               NaN         '2009-08-27'\r\n     7           2         6.6148e+06    1.2801e+07    1               NaN         '2011-11-19'\r\n     9           2         6.6148e+06    1.1499e+07    1               NaN         '2008-07-01'\r\n    10           3         6.4173e+06    1.2801e+07    1               NaN         
'2011-12-06'\r\n     1           1         5.3466e+06             1    1             30490         '2009-09-01'\r\n     3           1         5.3466e+06             2    1             55155         '2010-10-02'\r\n     5           2         6.6148e+06             3    1                 5         '2008-07-02'\r\n<\/pre><p>By default, <tt>datastore<\/tt> will read in every column of our dataset. <tt>datastore<\/tt> makes an educated guess for the appropriate format for each column (variable) of our data.  We can, however, specify a subset of columns or different formats if we wish.<\/p><h4>Select Data to Import<a name=\"1fb7b78b-b6e3-4523-a008-906806499780\"><\/a><\/h4><p>We can specify which variables (columns) by setting the <tt>SelectedVariableNames<\/tt> property of our datastore.  In this case, we only want to bring in 5 columns out of the 45.<\/p><pre class=\"codeinput\">ds.SelectedVariableNames = {<span class=\"string\">'model_year'<\/span>, <span class=\"string\">'veh_type'<\/span>, <span class=\"keyword\">...<\/span>\r\n    <span class=\"string\">'curbwt'<\/span>,<span class=\"string\">'mpg_adj'<\/span>,<span class=\"string\">'hybrid'<\/span>};\r\n\r\npreview(ds)\r\n<\/pre><pre class=\"codeoutput\">ans = \r\n    model_year    veh_type    curbwt    mpg_adj    hybrid\r\n    __________    ________    ______    _______    ______\r\n    2008          'Car'       3500      21.65      0     \r\n    2008          'Car'       3500      22.54      0     \r\n    2008          'SUV'       4500         16      0     \r\n    2008          'SUV'       4500         17      0     \r\n    2005          'Truck'     5000      13.29      0     \r\n    2008          'Car'       3500      22.09      0     \r\n    2008          'Car'       3500      21.65      0     \r\n    2008          'SUV'       4500      16.66      0     \r\n<\/pre><h4>Adjust Variable Format<a name=\"8a26002b-ec1c-4799-9b07-54f0f535fd59\"><\/a><\/h4><p>We can adjust the format of the data we wish to access by 
using the <tt>SelectedFormats<\/tt> property.  We can specify to bring in the vehicle type as a categorical variable by using the <tt>%C<\/tt> specifier. You can learn more <a href=\"https:\/\/www.mathworks.com\/help\/matlab\/matlab_prog\/advantages-of-using-categorical-arrays.html\">here<\/a> about the benefits of using categorical arrays.<\/p><pre class=\"codeinput\">ds.SelectedFormats;\r\nds.SelectedFormats{2} = <span class=\"string\">'%C'<\/span>  <span class=\"comment\">% read  in as a categorical<\/span>\r\n<\/pre><pre class=\"codeoutput\">ds = \r\n  TabularTextDatastore with properties:\r\n\r\n                      Files: {\r\n                             'H:\\Documents\\LOREN\\MyJob\\Art of MATLAB\\SarahZ\\datastore\\vehiclebig.csv'\r\n                             }\r\n          ReadVariableNames: false\r\n              VariableNames: {'record_id', 'vin_id', 'plate_id' ... and 42 more}\r\n\r\n  Text Format Properties:\r\n             NumHeaderLines: 0\r\n                  Delimiter: ','\r\n               RowDelimiter: '\\r\\n'\r\n             TreatAsMissing: ''\r\n               MissingValue: NaN\r\n\r\n  Advanced Text Format Properties:\r\n            TextscanFormats: {'%*q', '%*q', '%*q' ... and 42 more}\r\n         ExponentCharacters: 'eEdD'\r\n               CommentStyle: ''\r\n                 Whitespace: ' \\b\\t'\r\n    MultipleDelimitersAsOne: false\r\n\r\n  Properties that control the table returned by preview, read, readall:\r\n      SelectedVariableNames: {'model_year', 'veh_type', 'curbwt' ... and 2 more}\r\n            SelectedFormats: {'%f', '%C', '%f' ... and 2 more}\r\n                RowsPerRead: 20000\r\n\r\n<\/pre><h4>Read in First Chunk<a name=\"4983375b-c112-402f-b9a4-2aeffacf526a\"><\/a><\/h4><p>We can use the <tt>read<\/tt> function to read in a chunk of our data. By default, <tt>read<\/tt> reads in 20000 rows at a time.  
This value can be adjusted using the <tt>RowsPerRead<\/tt> property.<\/p><pre class=\"codeinput\">testdata = read(ds);\r\nwhos <span class=\"string\">testdata<\/span>\r\n<\/pre><pre class=\"codeoutput\">  Name              Size             Bytes  Class    Attributes\r\n\r\n  testdata      20000x5             683152  table              \r\n\r\n<\/pre><p>After you read in a chunk, you can use the <tt>hasdata<\/tt> function to see if there is still additional data available to read from the datastore.<\/p><pre class=\"codeinput\">hasdata(ds)\r\n<\/pre><pre class=\"codeoutput\">ans =\r\n     1\r\n<\/pre><p>By using <tt>hasdata<\/tt> and <tt>read<\/tt> in a <tt>while<\/tt> loop with your <tt>datastore<\/tt>, you can read in your entire dataset a piece at a time. We will put in a counter just to track how many read operations took place in our loop.<\/p><pre class=\"codeinput\">counter = 0;\r\n\r\n<span class=\"keyword\">while<\/span> hasdata(ds)\r\n    <span class=\"comment\">% Read in Chunk<\/span>\r\n    dataChunk = read(ds);\r\n    counter = counter + 1;\r\n<span class=\"keyword\">end<\/span>\r\n\r\ncounter\r\n<\/pre><pre class=\"codeoutput\">counter =\r\n   825\r\n<\/pre><p>By using <tt>reset<\/tt>, we can reset our datastore and start reading at the beginning of the file.<\/p><pre class=\"codeinput\">reset(ds)\r\n<\/pre><p>Now that we see how to get started using datastore, let's look at 3 different ways to use it to work with a dataset that does not entirely fit in the memory of your machine.<\/p><h4>Example 1: Read Selected Columns of Data for Use in Memory<a name=\"ccb3d979-bede-4b1e-9875-a1a1579b3431\"><\/a><\/h4><p>If you are interested in only processing certain columns of your text file and those columns can fit in memory, you can use <tt>datastore<\/tt> to bring in those particular columns from your text file. Then, you can work with that data directly in memory.  
In this example, we are only interested in the model year and vehicle type of the cars that were registered. We can use <tt>readall<\/tt> instead of <tt>read<\/tt> to import all the selected data at once rather than a chunk at a time.<\/p><pre class=\"codeinput\">ds.SelectedVariableNames = {<span class=\"string\">'model_year'<\/span>, <span class=\"string\">'veh_type'<\/span>};\r\n\r\ncardata = readall(ds);\r\nwhos <span class=\"string\">cardata<\/span>\r\n<\/pre><pre class=\"codeoutput\">  Name                Size                Bytes  Class    Attributes\r\n\r\n  cardata      16145383x2             161456272  table              \r\n\r\n<\/pre><p>Now that you have the data read into MATLAB, you can work with it as you normally would with any data in MATLAB. For this example, we will just use the new <tt>histogram<\/tt> function introduced in R2014b to look at the distribution of vehicle model years registered.<\/p><pre class=\"codeinput\">figure\r\nhistogram(cardata.model_year)\r\nhold <span class=\"string\">on<\/span>\r\nhistogram(cardata.model_year(cardata.veh_type == <span class=\"string\">'Car'<\/span>))\r\nhold <span class=\"string\">off<\/span>\r\n\r\nxlabel(<span class=\"string\">'Model Year'<\/span>)\r\nlegend({<span class=\"string\">'all vehicles'<\/span>, <span class=\"string\">'only cars'<\/span>},<span class=\"string\">'Location'<\/span>,<span class=\"string\">'southwest'<\/span>)\r\n<\/pre><img decoding=\"async\" vspace=\"5\" hspace=\"5\" src=\"https:\/\/blogs.mathworks.com\/images\/loren\/2014\/intro_datastore_blog_01.png\" alt=\"\"> <h4>Example 2: Filter Data Down to a Subset for Use in Memory<a name=\"b1d17a68-470f-460c-9d94-3c2d1b0b219f\"><\/a><\/h4><p>Another way to subset your data is to filter the data down a chunk at a time. Using <tt>datastore<\/tt>, you can read in a chunk of data and keep only the data you need from that chunk. 
You then continue this process, chunk by chunk, until you reach the end of the file and have only the subset of the data you want to use.<\/p><p>In this case, we want to extract a subset of data for cars that were registered in 2011.  The new variables we are loading in (e.g., q1_2011), contain either a one or zero.  Ones represent valid car registrations during that time.  So we only save the rows which contain a valid registration sometime in 2011 and discard the rest.<\/p><pre class=\"codeinput\">reset(ds)\r\nds.SelectedVariableNames = {<span class=\"string\">'model_year'<\/span>,<span class=\"string\">'veh_type'<\/span>,<span class=\"keyword\">...<\/span>\r\n    <span class=\"string\">'q1_2011'<\/span>,<span class=\"string\">'q2_2011'<\/span>,<span class=\"string\">'q3_2011'<\/span>,<span class=\"string\">'q4_2011'<\/span>};\r\n\r\ndata2011 = table;\r\n\r\n<span class=\"keyword\">while<\/span> hasdata(ds)\r\n\r\n    <span class=\"comment\">% Read in Chunk<\/span>\r\n    dataChunk = read(ds);\r\n\r\n    <span class=\"comment\">% Find if Valid During Any Quarter in 2011<\/span>\r\n    reg2011 = sum(dataChunk{:,3:end},2);\r\n\r\n    <span class=\"comment\">% Extract Data to Keep (Cars Registered in 2011)<\/span>\r\n    idx = reg2011 &gt;= 1 &amp; dataChunk.veh_type == <span class=\"string\">'Car'<\/span>;\r\n\r\n    <span class=\"comment\">% Save to Final Table<\/span>\r\n    data2011 = [data2011; dataChunk(idx,1:2)];\r\n\r\n<span class=\"keyword\">end<\/span>\r\n\r\nwhos <span class=\"string\">data2011<\/span>\r\n\r\nfigure\r\nhistogram(data2011.model_year)\r\nxlabel(<span class=\"string\">'Model Year'<\/span>)\r\nlegend({<span class=\"string\">'cars registered in 2011'<\/span>})\r\n<\/pre><pre class=\"codeoutput\">  Name                Size               Bytes  Class    Attributes\r\n\r\n  data2011      3503434x2             35036782  table              \r\n\r\n<\/pre><img decoding=\"async\" vspace=\"5\" hspace=\"5\" 
src=\"https:\/\/blogs.mathworks.com\/images\/loren\/2014\/intro_datastore_blog_02.png\" alt=\"\"> <h4>Example 3:  Perform Analysis on Chunks of Data and Combine the Results<a name=\"3954fced-bb52-479f-b51a-1c872ba564e8\"><\/a><\/h4><p>But what if we can't hold the subset of the data we are interested in analyzing in memory? We could instead process the data a section at a time and then combine intermediate results to get a final result. In this case, let's look at the % of hybrid cars registered every quarter (in terms of total cars registered). So, we compute a running total of the number of cars registered per quarter as well as the number of hybrids registered per quarter. Then, we calculate the final % when we have read through the entire dataset.<\/p><pre class=\"codeinput\"><span class=\"comment\">% Reset Datastore<\/span>\r\nreset(ds)\r\n\r\n<span class=\"comment\">% Select Data to Import<\/span>\r\nquarterNames = varnames(end-15:end);\r\nds.SelectedVariableNames = [{<span class=\"string\">'veh_type'<\/span>, <span class=\"string\">'hybrid'<\/span>} quarterNames'];\r\n\r\n<span class=\"comment\">% Read in Vehicle Type as a Categorical Variable<\/span>\r\nds.SelectedFormats{1} = <span class=\"string\">'%C'<\/span>;\r\n\r\ntotalCars = zeros(length(quarterNames),1);\r\ntotalHybrids = zeros(length(quarterNames),1);\r\n\r\n<span class=\"keyword\">while<\/span> hasdata(ds)\r\n\r\n    <span class=\"comment\">% Read in Chunk<\/span>\r\n    dataChunk = read(ds);\r\n\r\n    <span class=\"keyword\">for<\/span> ii = 1:length(quarterNames) <span class=\"comment\">% Loop over quarters<\/span>\r\n\r\n    <span class=\"comment\">% Extract Data<\/span>\r\n    idx = dataChunk{:,quarterNames(ii)}== 1 &amp; dataChunk.veh_type == <span class=\"string\">'Car'<\/span>;\r\n    idxHy = idx &amp; dataChunk.hybrid == 1;\r\n\r\n    <span class=\"comment\">% Perform Calculation<\/span>\r\n    totalCarsChunk = sum(idx);\r\n    totalHybridsChunk = sum(idxHy);\r\n\r\n    <span 
class=\"comment\">% Save Result<\/span>\r\n    totalCars(ii) = totalCarsChunk + totalCars(ii);\r\n    totalHybrids(ii) = totalHybridsChunk + totalHybrids(ii);\r\n\r\n    <span class=\"keyword\">end<\/span>\r\n<span class=\"keyword\">end<\/span>\r\n<\/pre><pre class=\"codeinput\">percentHybrid = (totalHybrids.\/totalCars)*100;\r\n\r\nfigure\r\nscatter(1:length(percentHybrid),percentHybrid,<span class=\"string\">'filled'<\/span>)\r\nxlabel(<span class=\"string\">'Inspection Year'<\/span>)\r\nylabel(<span class=\"string\">'% Hybrids'<\/span>)\r\n\r\n<span class=\"comment\">% Label tick axes<\/span>\r\nax = gca;\r\nax.TickLabelInterpreter = <span class=\"string\">'none'<\/span>;\r\nax.XTick = 1:length(quarterNames);\r\nax.XTickLabel = quarterNames;\r\nax.XTickLabelRotation = -45;\r\n<\/pre><img decoding=\"async\" vspace=\"5\" hspace=\"5\" src=\"https:\/\/blogs.mathworks.com\/images\/loren\/2014\/intro_datastore_blog_03.png\" alt=\"\"> <h4>Extending the Use of Datastore<a name=\"2198cc21-e255-4d22-9da4-7177a9e5b11f\"><\/a><\/h4><p>You can also use <tt>datastore<\/tt> as the first step to creating and running your own MapReduce algorithms in MATLAB.  Perhaps this topic will be a blog post in the future.<\/p><h4>Conclusion<a name=\"e9856795-1c05-4a1d-b706-07322a431ace\"><\/a><\/h4><p>Do you think you can use datastore with your big data? 
Let us know <a href=\"https:\/\/blogs.mathworks.com\/loren\/?p=1063#respond\">here<\/a>.<\/p><script language=\"JavaScript\"> <!-- \r\n    function grabCode_50f0cdb2f9dc4bff80c2ad47088e8fd6() {\r\n        \/\/ Remember the title so we can use it in the new page\r\n        title = document.title;\r\n\r\n        \/\/ Break up these strings so that their presence\r\n        \/\/ in the Javascript doesn't mess up the search for\r\n        \/\/ the MATLAB code.\r\n        t1='50f0cdb2f9dc4bff80c2ad47088e8fd6 ' + '##### ' + 'SOURCE BEGIN' + ' #####';\r\n        t2='##### ' + 'SOURCE END' + ' #####' + ' 50f0cdb2f9dc4bff80c2ad47088e8fd6';\r\n    \r\n        b=document.getElementsByTagName('body')[0];\r\n        i1=b.innerHTML.indexOf(t1)+t1.length;\r\n        i2=b.innerHTML.indexOf(t2);\r\n \r\n        code_string = b.innerHTML.substring(i1, i2);\r\n        code_string = code_string.replace(\/REPLACE_WITH_DASH_DASH\/g,'--');\r\n\r\n        \/\/ Use \/x3C\/g instead of the less-than character to avoid errors \r\n        \/\/ in the XML parser.\r\n        \/\/ Use '\\x26#60;' instead of '<' so that the XML parser\r\n        \/\/ doesn't go ahead and substitute the less-than character. 
 \r\n        code_string = code_string.replace(\/\\x3C\/g, '\\x26#60;');\r\n\r\n        copyright = 'Copyright 2014 The MathWorks, Inc.';\r\n\r\n        w = window.open();\r\n        d = w.document;\r\n        d.write('<pre>\\n');\r\n        d.write(code_string);\r\n\r\n        \/\/ Add copyright line at the bottom if specified.\r\n        if (copyright.length > 0) {\r\n            d.writeln('');\r\n            d.writeln('%%');\r\n            if (copyright.length > 0) {\r\n                d.writeln('% _' + copyright + '_');\r\n            }\r\n        }\r\n\r\n        d.write('<\/pre>\\n');\r\n\r\n        d.title = title + ' (MATLAB code)';\r\n        d.close();\r\n    }   \r\n     --> <\/script><p style=\"text-align: right; font-size: xx-small; font-weight:lighter;   font-style: italic; color: gray\"><br><a href=\"javascript:grabCode_50f0cdb2f9dc4bff80c2ad47088e8fd6()\"><span style=\"font-size: x-small;        font-style: italic;\">Get \r\n      the MATLAB code <noscript>(requires JavaScript)<\/noscript><\/span><\/a><br><br>\r\n      Published with MATLAB&reg; R2014b<br><\/p><\/div><!--\r\n50f0cdb2f9dc4bff80c2ad47088e8fd6 ##### SOURCE BEGIN #####\r\n%% Reading Big Data into MATLAB\r\n% Today I'd like to introduce guest blogger\r\n% <mailto:sarah.zaranek@mathworks.com Sarah Wait Zaranek> who works for the\r\n% MATLAB Marketing team here at MathWorks. Sarah has previously written\r\n% about a variety of topics.  Most recently, she cowrote a post with me about\r\n% the <https:\/\/blogs.mathworks.com\/loren\/2014\/06\/17\/webcam-support-new-in-r2014a new webcam capabilities> \r\n% in MATLAB. Today, Sarah will be discussing\r\n% |datastore|, one of the new big data capabilities introduced in MATLAB\r\n% R2014b.\r\n\r\n%% About the Data\r\n% |datastore| is used for reading data that is too large to fit in memory.\r\n% For this example, we will be reading in data from the vehicle census of\r\n% Massachusetts. 
It is a catalog of information about vehicles registered\r\n% from 2008 to 2011.  The dataset contains information about individual\r\n% cars registered, including vehicle type, location where the vehicle is\r\n% housed, rated MPG, and measured CO_2 emissions. You can learn more about\r\n% the data and even download it yourself,\r\n% <http:\/\/www.37billionmilechallenge.org\/ here>.  I have renamed the files\r\n% in the demo for clarity's sake, but this is where they came from\r\n% originally.\r\n\r\n%% What is a Datastore?\r\n% As mentioned, a <https:\/\/www.mathworks.com\/help\/matlab\/import_export\/what-is-a-datastore.html datastore>\r\n% is an object useful for reading collections of\r\n% data that are too large to fit in memory. \r\n\r\n%% Defining Our Input Data\r\n% |datastore| can work with a single file or a collection of files. In this\r\n% case, we will be reading from a single file. Our file does not include\r\n% variable names at the top of the file.  They are listed in a separate\r\n% header file, as defined below.\r\n\r\n% Define Data File and Header File\r\ndatafile = 'vehiclebig.csv';\r\nheaderfile = 'varnames.txt';\r\n\r\n% Read in Variable Names\r\nfileID = fopen(headerfile);\r\nvarnames = textscan(fileID,'%s');\r\nvarnames = varnames{:};\r\nfclose(fileID);\r\n\r\n%% Creating the DataStore\r\n% We can now create our datastore by giving the name of the data file as\r\n% the input to the |datastore| function.  We also specify that our\r\n% datastore should not use the first row of our file as variable names.  We\r\n% will set those variable names explicitly using the names found in the\r\n% 'varnames.txt' file.\r\n\r\nds = datastore(datafile,'ReadVariableNames',false);\r\n\r\n% Set Variable Names \r\nds.VariableNames = varnames\r\n\r\n%%\r\n% Note that we haven't read in our data yet.  
We have just provided an\r\n% easy way to access it through |ds|, our datastore.\r\n\r\n%% Preview the Data\r\n% A really nice thing about a datastore is that you can preview your data\r\n% without having to load it all into memory.  |datastore| reads the data\r\n% into a <https:\/\/www.mathworks.com\/help\/matlab\/tables.html table> which is\r\n% a data type in MATLAB designed to work well with tabular data.\r\n\r\ndata = preview(ds);\r\nwhos data\r\n\r\ndata(:,1:7) % Look at first 7 variables\r\n\r\n\r\n%%\r\n% By default, |datastore| will read in every column of our dataset.\r\n% |datastore| makes an educated guess for the appropriate format for each\r\n% column (variable) of our data.  We can, however, specify a subset of\r\n% columns or different formats if we wish.\r\n\r\n%% Select Data to Import\r\n% We can specify which variables (columns) by setting the\r\n% |SelectedVariableNames| property of our datastore.  In this case, we only\r\n% want to bring in 5 columns out of the 45.\r\n\r\nds.SelectedVariableNames = {'model_year', 'veh_type', ...\r\n    'curbwt','mpg_adj','hybrid'};\r\n\r\npreview(ds)\r\n\r\n%% Adjust Variable Format\r\n% We can adjust the format of the data we wish to access by using the\r\n% |SelectedFormats| property.  We can specify to bring in the vehicle type\r\n% as a categorical variable by using the |%C| specifier. You can\r\n% learn more <https:\/\/www.mathworks.com\/help\/matlab\/matlab_prog\/advantages-of-using-categorical-arrays.html here>\r\n% about the benefits of using categorical arrays. \r\n\r\nds.SelectedFormats;\r\nds.SelectedFormats{2} = '%C'  % read  in as a categorical\r\n\r\n%% Read in First Chunk\r\n% We can use the |read| function to read in a chunk of our data. By\r\n% default, |read| reads in 20000 rows at a time.  
This value can be\r\n% adjusted using the |RowsPerRead| property.\r\n\r\ntestdata = read(ds);\r\nwhos testdata\r\n\r\n%%\r\n% After you read in a chunk, you can use the |hasdata| function to see if\r\n% there is still additional data available to read from the datastore. \r\n\r\nhasdata(ds)\r\n\r\n%%\r\n% By using |hasdata| and |read| in a |while| loop with your |datastore|, you\r\n% can read in your entire dataset a piece at a time. We will put in \r\n% a counter just to track how many read operations took place in our loop.\r\n\r\ncounter = 0;\r\n\r\nwhile hasdata(ds)\r\n    % Read in Chunk\r\n    dataChunk = read(ds);\r\n    counter = counter + 1;\r\nend\r\n\r\ncounter\r\n\r\n%%\r\n% By using |reset|, we can reset our datastore and start reading at\r\n% the beginning of the file.\r\n\r\nreset(ds)\r\n\r\n%%\r\n% Now that we see how to get started using datastore, let's look at 3\r\n% different ways to use it to work with a dataset that does not entirely\r\n% fit in the memory of your machine. \r\n\r\n%% Example 1: Read Selected Columns of Data for Use in Memory\r\n% If you are interested in only processing certain columns of your text\r\n% file and those columns can fit in memory, you can use |datastore| to\r\n% bring in those particular columns from your text file. Then, you can work\r\n% with that data directly in memory.  In this example, we are only\r\n% interested in the model year and vehicle type of the cars that were\r\n% registered. We can use |readall| instead of |read| to import all the\r\n% selected data instead of just a chunk of it at a time.\r\n\r\nds.SelectedVariableNames = {'model_year', 'veh_type'};\r\n\r\ncardata = readall(ds);\r\nwhos cardata\r\n\r\n%%\r\n% Now that you have the data read into MATLAB, you can work with it like\r\n% you would normally work with your data in MATLAB. 
For this example, we\r\n% will just use the new |histogram| function introduced in R2014b to look\r\n% at the distribution of vehicle model years registered.\r\n\r\nfigure\r\nhistogram(cardata.model_year)\r\nhold on\r\nhistogram(cardata.model_year(cardata.veh_type == 'Car'))\r\nhold off\r\n\r\nxlabel('Model Year')\r\nlegend({'all vehicles', 'only cars'},'Location','southwest')\r\n\r\n%% Example 2: Filter Data Down to a Subset for Use in Memory\r\n% Another way to subset your data is to filter the data down a chunk at a\r\n% time. Using |datastore|, you can read in a chunk of data and keep only\r\n% the data you need from that chunk. You then continue this process, chunk by\r\n% chunk, until you reach the end of the file and have only the subset of\r\n% the data you want to use.\r\n%\r\n% In this case, we want to extract a subset of data for cars that were\r\n% registered in 2011.  The new variables we are loading in (e.g., q1_2011)\r\n% contain either a one or zero.  Ones represent valid car registrations\r\n% during that time.  
So we only save the rows which contain a valid\r\n% registration sometime in 2011 and discard the rest.\r\n\r\nreset(ds)\r\nds.SelectedVariableNames = {'model_year','veh_type',...\r\n    'q1_2011','q2_2011','q3_2011','q4_2011'};\r\n\r\ndata2011 = table;\r\n\r\nwhile hasdata(ds)\r\n    \r\n    % Read in Chunk\r\n    dataChunk = read(ds);\r\n    \r\n    % Find if Valid During Any Quarter in 2011\r\n    reg2011 = sum(dataChunk{:,3:end},2);\r\n    \r\n    % Extract Data to Keep (Cars Registered in 2011)\r\n    idx = reg2011 >= 1 & dataChunk.veh_type == 'Car';\r\n    \r\n    % Save to Final Table\r\n    data2011 = [data2011; dataChunk(idx,1:2)]; \r\n    \r\nend\r\n\r\nwhos data2011\r\n\r\nfigure\r\nhistogram(data2011.model_year)\r\nxlabel('Model Year')\r\nlegend({'cars registered in 2011'})\r\n\r\n%% Example 3:  Perform Analysis on Chunks of Data and Combine the Results\r\n% But what if we can't hold the subset of the data we are interested in\r\n% analyzing in memory? We could instead process the data a section at a\r\n% time and then combine intermediate results to get a final result. In this\r\n% case, let's look at the % of hybrid cars registered every quarter (in\r\n% terms of total cars registered). So, we compute a running total of the\r\n% number of cars registered per quarter as well as the number of hybrids\r\n% registered per quarter. 
Then, we calculate the final % when we have read\r\n% through the entire dataset.\r\n\r\n% Reset Datastore\r\nreset(ds)\r\n\r\n% Select Data to Import\r\nquarterNames = varnames(end-15:end);\r\nds.SelectedVariableNames = [{'veh_type', 'hybrid'} quarterNames'];\r\n\r\n% Read in Vehicle Type as a Categorical Variable\r\nds.SelectedFormats{1} = '%C';  \r\n\r\ntotalCars = zeros(length(quarterNames),1);\r\ntotalHybrids = zeros(length(quarterNames),1);\r\n\r\nwhile hasdata(ds)\r\n    \r\n    % Read in Chunk\r\n    dataChunk = read(ds);\r\n    \r\n    for ii = 1:length(quarterNames) % Loop over quarters\r\n        \r\n    % Extract Data \r\n    idx = dataChunk{:,quarterNames(ii)}== 1 & dataChunk.veh_type == 'Car';\r\n    idxHy = idx & dataChunk.hybrid == 1;\r\n   \r\n    % Perform Calculation\r\n    totalCarsChunk = sum(idx);\r\n    totalHybridsChunk = sum(idxHy);\r\n    \r\n    % Save Result \r\n    totalCars(ii) = totalCarsChunk + totalCars(ii);\r\n    totalHybrids(ii) = totalHybridsChunk + totalHybrids(ii);\r\n       \r\n    end\r\nend\r\n\r\n%% \r\npercentHybrid = (totalHybrids.\/totalCars)*100;\r\n\r\nfigure\r\nscatter(1:length(percentHybrid),percentHybrid,'filled')\r\nxlabel('Inspection Year')\r\nylabel('% Hybrids')\r\n\r\n% Label tick axes\r\nax = gca;\r\nax.TickLabelInterpreter = 'none';\r\nax.XTick = 1:length(quarterNames);\r\nax.XTickLabel = quarterNames;\r\nax.XTickLabelRotation = -45;\r\n\r\n%% Extending the Use of Datastore\r\n% You can also use |datastore| as the first step to creating and running\r\n% your own MapReduce algorithms in MATLAB.  Learn more about running\r\n% MapReduce algorithms with MATLAB\r\n% <https:\/\/www.mathworks.com\/discovery\/matlab-mapreduce-hadoop.html here>.\r\n% Perhaps this topic will be a blog post in the future.\r\n\r\n%% Conclusion\r\n% Do you think you can use datastore with your big data? Let us know\r\n% <https:\/\/blogs.mathworks.com\/loren\/?p=1063#respond here>. 
 \r\n##### SOURCE END ##### 50f0cdb2f9dc4bff80c2ad47088e8fd6\r\n-->","protected":false},"excerpt":{"rendered":"<div class=\"overview-image\"><img decoding=\"async\"  class=\"img-responsive\" src=\"https:\/\/blogs.mathworks.com\/images\/loren\/2014\/intro_datastore_blog_03.png\" onError=\"this.style.display ='none';\" \/><\/div><!--introduction--><p>Today I&#8217;d like to introduce guest blogger <a href=\"mailto:sarah.zaranek@mathworks.com\">Sarah Wait Zaranek<\/a> who works for the MATLAB Marketing team here at MathWorks. Sarah has previously written about a variety of topics.  Most recently, she cowrote a post with me about the <a href=\"https:\/\/blogs.mathworks.com\/loren\/2014\/06\/17\/webcam-support-new-in-r2014a\">new webcam capabilities<\/a> in MATLAB. Today, Sarah will be discussing <tt>datastore<\/tt>, one of the new big data capabilities introduced in MATLAB R2014b.... <a class=\"read-more\" href=\"https:\/\/blogs.mathworks.com\/loren\/2014\/12\/03\/reading-big-data-into-matlab\/\">read more 
>><\/a><\/p>","protected":false},"author":39,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":[],"categories":[63,64,45,6],"tags":[],"_links":{"self":[{"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/posts\/1063"}],"collection":[{"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/users\/39"}],"replies":[{"embeddable":true,"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/comments?post=1063"}],"version-history":[{"count":10,"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/posts\/1063\/revisions"}],"predecessor-version":[{"id":4734,"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/posts\/1063\/revisions\/4734"}],"wp:attachment":[{"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/media?parent=1063"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/categories?post=1063"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/tags?post=1063"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}