{"id":725,"date":"2013-07-09T09:29:13","date_gmt":"2013-07-09T14:29:13","guid":{"rendered":"https:\/\/blogs.mathworks.com\/loren\/?p=725"},"modified":"2019-10-23T10:58:17","modified_gmt":"2019-10-23T15:58:17","slug":"using-memmapfile-to-navigate-through-big-data-binary-files","status":"publish","type":"post","link":"https:\/\/blogs.mathworks.com\/loren\/2013\/07\/09\/using-memmapfile-to-navigate-through-big-data-binary-files\/","title":{"rendered":"Using memmapfile to Navigate through &#8220;Big Data&#8221; Binary Files"},"content":{"rendered":"<div class=\"content\"><!--introduction--><p>This week, Ken Atwell from MATLAB product management weighs in with using a <tt>memmapfile<\/tt> as a way to navigate through binary files of \"big data\".<\/p><p><tt><a href=\"https:\/\/www.mathworks.com\/help\/matlab\/ref\/memmapfile.html\">memmapfile<\/a><\/tt> (for \"memory-mapped file\") is used to access binary files without needing to resort to low-level file I\/O functions like <tt><a href=\"https:\/\/www.mathworks.com\/help\/matlab\/ref\/fread.html\">fread<\/a><\/tt>.  It includes an ability to declare the structure of your binary data, freely mixing data types and sizes.  Originally targeted at easing the reading of lists of records, <tt>memmapfile<\/tt> also has application in big data. 
Today's post will examine column-wise access of big binary files, and how to navigate through metadata that sometimes is at the beginning of binary files.<\/p><!--\/introduction--><h3>Contents<\/h3><div><ul><li><a href=\"#64df40cc-f9b4-438d-8c1a-71c6b2e81b1a\">Experiment Parameters<\/a><\/li><li><a href=\"#f4f11858-00f9-4c51-b348-4cfd139fef4d\">Create Test File<\/a><\/li><li><a href=\"#84c0e2d2-4a9d-4fb1-9694-e6636570d2b0\"><tt>memmapfile<\/tt> for Entire Data Set<\/a><\/li><li><a href=\"#fe06e7ba-2bac-4cd0-9836-e3a7f56bc45e\"><tt>memmapfile<\/tt> with Columnwise Access<\/a><\/li><li><a href=\"#66afe53b-bbc8-4b94-a660-b9b810f2283a\">Data File with XML Header<\/a><\/li><li><a href=\"#3db9cefe-f94a-4351-9105-c03fa7e50027\">Read XML Header<\/a><\/li><li><a href=\"#3474f0f7-3c07-4c32-84e6-04f8310006e1\">Create the Memory-mapped File<\/a><\/li><li><a href=\"#0ffb6572-7578-47e6-bc0a-ebd4345e8d01\">Conclusion<\/a><\/li><\/ul><\/div><h4>Experiment Parameters<a name=\"64df40cc-f9b4-438d-8c1a-71c6b2e81b1a\"><\/a><\/h4><p>To get started, create a potentially large 2D matrix that is stored on disk.  <tt>numRows<\/tt> and <tt>numColumns<\/tt> can be changed to experiment with different sizes.  To keep things simple and snappy here, the matrix is under a gigabyte in size.  This is hardly \"big data\", and you can adjust the parameters here to create a larger problem.  Do note that, of course, the disk space required to run this code will grow with the matrix size you create.<\/p><pre class=\"codeinput\">scratchFolder = tempdir;\r\nnumRows = 1e5;\r\nnumColumns = 1e3;\r\n<\/pre><h4>Create Test File<a name=\"f4f11858-00f9-4c51-b348-4cfd139fef4d\"><\/a><\/h4><p>Create the scratch file.  This can take from a moment to many minutes to run, depending on the sizes declared above.  
Because data of type <tt>double<\/tt> is being created, the file will consume <tt>8*numRows*numColumns<\/tt> bytes of free disk space.<\/p><p>The value of <tt>[r,c]<\/tt> in the matrix is set to be <tt>c*1,000,000+r<\/tt>.  This will make it easy to glance at our output and recognize that we are getting the values that are expected.<\/p><pre class=\"codeinput\">filename = [<span class=\"string\">'mmf'<\/span> int2str(numRows) <span class=\"string\">'x'<\/span> int2str(numColumns) <span class=\"string\">'.dat'<\/span>];\r\nfilename = fullfile(scratchFolder, filename);\r\nf = fopen(filename, <span class=\"string\">'w'<\/span>);\r\n<span class=\"keyword\">for<\/span> colNum = 1:numColumns\r\n    column = (1:numRows)' + colNum*1000000;\r\n    fwrite(f,column,<span class=\"string\">'double'<\/span>);\r\n<span class=\"keyword\">end<\/span>\r\nfclose(f);\r\n<\/pre><h4><tt>memmapfile<\/tt> for Entire Data Set<a name=\"84c0e2d2-4a9d-4fb1-9694-e6636570d2b0\"><\/a><\/h4><p>To create a memory-mapped file, we call <tt>memmapfile<\/tt> with these two arguments:<\/p><div><ol><li>The filename containing the data<\/li><li>The <tt>'Format'<\/tt> of the data, which is a cell array with three components: a. The data type (<tt>double<\/tt> in this example), b. the size of the data (a matrix of size <tt>numRows<\/tt> by <tt>numColumns<\/tt> in this example), and c. a name to assign to this data (<tt>m<\/tt> for \"matrix\" in this example)<\/li><\/ol><\/div><p>This is basic usage of <tt>memmapfile<\/tt>, and it encapsulates the entire data set in a single access. <b>When working with \"big data\", you will want to avoid singular accesses like this.<\/b>  If the size of the data is large enough, your computer may become unresponsive (\" <a href=\"http:\/\/en.wikipedia.org\/wiki\/Thrashing_(computer_science)\">thrash<\/a> \") as it busily creates swap space in an effort to read in the entire matrix.  The <tt>if<\/tt> statement is here to prevent you from doing this accidentally.  
If you are experimenting with data sizes larger than the physical memory available in your computer, you will want to skip this step.<\/p><pre class=\"codeinput\"><span class=\"comment\">% Prevent a memory-busting matrix from being created.<\/span>\r\n<span class=\"keyword\">if<\/span> numRows*numColumns*8 &gt; 1e9\r\n    error(<span class=\"string\">'Size possibly too big; are you sure you want to do this?'<\/span>)\r\n<span class=\"keyword\">end<\/span>\r\n\r\nmm = memmapfile(filename, <span class=\"string\">'Format'<\/span>, {<span class=\"string\">'double'<\/span>, [numRows numColumns], <span class=\"string\">'m'<\/span>});\r\nm = mm.Data.m;  <span class=\"comment\">%#ok&lt;NASGU&gt;<\/span>\r\n<\/pre><p>Regardless, clear <tt>m<\/tt> to free up whatever memory was used.<\/p><pre class=\"codeinput\">clear(<span class=\"string\">'m'<\/span>);\r\n<\/pre><h4><tt>memmapfile<\/tt> with Columnwise Access<a name=\"fe06e7ba-2bac-4cd0-9836-e3a7f56bc45e\"><\/a><\/h4><p>Here is a smarter way to access the big data a column at a time.  Instead of creating a single variable that is <tt>numRows * numColumns<\/tt> large, we create a <tt>numRows * 1<\/tt> vector, which is repeated <tt>numColumns<\/tt> times (note this code is now using the optional <tt>'Repeat'<\/tt> argument to <tt>memmapfile<\/tt>).  This subtle difference allows the big matrix to be read in one column at a time, presumably staying within available memory.  
The variable is named <tt>mj<\/tt> to indicate the j-th column of data.<\/p><pre class=\"codeinput\">mm = memmapfile(filename, <span class=\"string\">'Format'<\/span>, {<span class=\"string\">'double'<\/span>, [numRows 1], <span class=\"string\">'mj'<\/span>}, <span class=\"keyword\">...<\/span>\r\n               <span class=\"string\">'Repeat'<\/span>, numColumns);\r\n<\/pre><p>The code spot-checks the 17th column.<\/p><pre class=\"codeinput\"><span class=\"keyword\">if<\/span> ~isequal(mm.Data(17).mj, (1:numRows)' + 17*1000000)\r\n    error(<span class=\"string\">'The data was not read back in correctly!'<\/span>);\r\n<span class=\"keyword\">end<\/span>\r\n<\/pre><p><tt>memmapfile<\/tt> allows for creative uses of <tt>'Repeat'<\/tt> if your application needs it.  For example, rather than a vector of an entire column, you can read in blocks of half a column:<\/p><pre class=\"language-matlab\">memmapfile(filename, <span class=\"string\">'Format'<\/span>, {<span class=\"string\">'double'<\/span>, [numRows\/2 1], <span class=\"string\">'mj'<\/span>}, <span class=\"string\">'Repeat'<\/span>, numColumns*2);\r\n<\/pre><p>or blocks containing multiple columns:<\/p><pre class=\"language-matlab\">memmapfile(filename, <span class=\"string\">'Format'<\/span>, {<span class=\"string\">'double'<\/span>, [numRows*10 1], <span class=\"string\">'mj'<\/span>}, <span class=\"string\">'Repeat'<\/span>, numColumns\/10);\r\n<\/pre><p>Of course, first ensure that your data's size is evenly divisible by these multiples, or you will create a <tt>memmapfile<\/tt> that does not accurately reflect the actual file that underlies it.<\/p><p><b>A note about memory-mapped files and virtual memory<\/b>: If your application loops over many columns of memory-mapped data, you may find that memory usage as reported by the <a title=\"http:\/\/support.microsoft.com\/kb\/323527 (link no longer works)\">Windows Task Manager<\/a> or the <a title=\"http:\/\/support.apple.com\/kb\/ht1342\">OS X Activity Monitor<\/a> 
will begin to climb.  This can be a little misleading.  While <tt>memmapfile<\/tt> will consume sections of your computer's virtual memory space (only of practical consequence if you are still using a 32-bit version of MATLAB), physical memory (RAM) is not committed for the whole file: the operating system pages data in only as it is accessed and can reclaim those pages when memory is needed.  The assignment of <tt>m<\/tt> above has the potential to fail only because that operation is pulling the contents of the entire <tt>memmapfile<\/tt> into a workspace variable, and workspace variables (including <tt><a href=\"https:\/\/www.mathworks.com\/help\/matlab\/ref\/ans.html\">ans<\/a><\/tt>) reside in RAM.  A comprehensive discussion of virtual memory is beyond the scope of this blog; the <a href=\"http:\/\/en.wikipedia.org\/wiki\/Virtual_memory\">Wikipedia article on virtual memory<\/a> is a starting point if you want to learn more.<\/p><h4>Data File with XML Header<a name=\"66afe53b-bbc8-4b94-a660-b9b810f2283a\"><\/a><\/h4><p>The above code assumes that the matrix appears at the very beginning of the data file.  However, a number of data files begin with some form of metadata, followed by the \"payload\", the data itself.<\/p><p>For this blog, a file with some metadata followed by the \"real\" data will be created. The metadata is expressed using XML-style formatting.  This particular format was created for this post, but it is representative of actual metadata.  Typically, the metadata indicates an offset into the file where the actual data begins, which is expressed here in the <tt>headerLength<\/tt> attribute in the first line of the header.  What follows next is a <tt>var<\/tt> to declare the name, type, and size of the variable contained in the file.  
This file will contain only one variable, but conceptually the file could contain multiple variables.<\/p><pre class=\"codeinput\">strNumC = int2str(numColumns);\r\nstrNumR = int2str(numRows);\r\n\r\nheader = [<span class=\"keyword\">...<\/span>\r\n    <span class=\"string\">'&lt;datFile headerLength=00000000&gt;'<\/span> char(10) <span class=\"keyword\">...<\/span>\r\n    <span class=\"string\">'  &lt;var name=\"mj\" type=\"double\" size=\"'<\/span> strNumR <span class=\"string\">','<\/span> strNumC <span class=\"string\">'\"\/&gt;'<\/span> char(10) <span class=\"keyword\">...<\/span>\r\n    <span class=\"string\">'&lt;\/datFile&gt;'<\/span> char(10) <span class=\"keyword\">...<\/span>\r\n    ];\r\n\r\n<span class=\"comment\">% Insert header length<\/span>\r\nheader = strrep(header, <span class=\"string\">'00000000'<\/span>, sprintf(<span class=\"string\">'%08.0f'<\/span>, length(header)));\r\ndisp(header)\r\n<\/pre><pre class=\"codeoutput\">&lt;datFile headerLength=00000095&gt;\r\n  &lt;var name=\"mj\" type=\"double\" size=\"100000,1000\"\/&gt;\r\n&lt;\/datFile&gt;\r\n\r\n<\/pre><pre class=\"codeinput\">filename = [<span class=\"string\">'mmf'<\/span> int2str(numRows) <span class=\"string\">'x'<\/span> int2str(numColumns) <span class=\"string\">'_header.dat'<\/span>];\r\nfilename = fullfile(scratchFolder, filename);\r\nf = fopen(filename, <span class=\"string\">'w'<\/span>);\r\nfwrite(f, header, <span class=\"string\">'char'<\/span>);\r\n<span class=\"keyword\">for<\/span> colNum = 1:numColumns\r\n    column = (1:numRows)' + colNum*1000000;\r\n    fwrite(f, column, <span class=\"string\">'double'<\/span>);\r\n<span class=\"keyword\">end<\/span>\r\nfclose(f);\r\n<\/pre><h4>Read XML Header<a name=\"3db9cefe-f94a-4351-9105-c03fa7e50027\"><\/a><\/h4><p>The header will now be read back in and parsed.  
While <tt><a href=\"https:\/\/www.mathworks.com\/help\/matlab\/ref\/xmlread.html\">xmlread<\/a><\/tt> could be used to get a DOM node to traverse the XML data structure, <a href=\"https:\/\/www.mathworks.com\/help\/matlab\/matlab_prog\/regular-expressions.html\">regular expressions<\/a> can often be used as a quick and dirty way to scrape information from XML.  If you are unfamiliar with regular expressions, it is sufficient for this example just to understand that:<\/p><div><ul><li><tt>(\\d+)<\/tt> extracts a string of digits<\/li><li><tt>(\\w+)<\/tt> extracts a word (a string of letters, digits, or underscores)<\/li><li><tt>\\s+<\/tt> skips over whitespace<\/li><\/ul><\/div><p>The first line of the file is read to determine the length of the header (extracted by a regular expression), and then the full header is read using this information.  Finally, a second, more complex regular expression is used to extract the name, type, and size information for the variable contained in the binary data \"blob\" that follows the header.<\/p><pre class=\"codeinput\">f = fopen(filename, <span class=\"string\">'r'<\/span>);\r\nfirstLine = fgetl(f);\r\nfclose(f);\r\n\r\nfirstLine <span class=\"comment\">%#ok&lt;NOPTS&gt;<\/span>\r\n<\/pre><pre class=\"codeoutput\">firstLine =\r\n&lt;datFile headerLength=00000095&gt;\r\n<\/pre><pre class=\"codeinput\"><span class=\"comment\">% Get the length and convert the string to a double<\/span>\r\nheaderLength = regexp(firstLine, <span class=\"string\">'headerLength=(\\d+)'<\/span>, <span class=\"string\">'tokens'<\/span>);\r\nheaderLength = (str2double(headerLength{1}{1})) <span class=\"comment\">%#ok<\/span>\r\n<\/pre><pre class=\"codeoutput\">headerLength =\r\n    95\r\n<\/pre><pre class=\"codeinput\">f = fopen(filename, <span class=\"string\">'r'<\/span>);\r\nheader = fread(f, headerLength, <span class=\"string\">'char=&gt;char'<\/span>)';\r\nfclose(f);\r\n\r\n<span class=\"comment\">% Scan the metadata for type, size, and name<\/span>\r\nvars = 
regexp(header, <span class=\"string\">'name=\"(\\w+)\"\\s+type=\"(\\w+)\"\\s+size=\"(\\d+),(\\d+)\"'<\/span>, <span class=\"keyword\">...<\/span>\r\n    <span class=\"string\">'tokens'<\/span>);\r\n<\/pre><h4>Create the Memory-mapped File<a name=\"3474f0f7-3c07-4c32-84e6-04f8310006e1\"><\/a><\/h4><p>Lastly, create a <tt>memmapfile<\/tt> for the variable.  The cell array returned by <tt><a href=\"https:\/\/www.mathworks.com\/help\/matlab\/ref\/regexp.html\">regexp<\/a><\/tt> is transformed into a new cell array that matches the expected input arguments to the <tt>memmapfile<\/tt> function.<\/p><pre class=\"codeinput\"><span class=\"comment\">% Reorganize the data from XML into the form expected by memmapfile<\/span>\r\nmmfFormater = {<span class=\"keyword\">...<\/span>\r\n    <span class=\"string\">'Format'<\/span>, <span class=\"keyword\">...<\/span>\r\n        {vars{1}{2}, <span class=\"keyword\">...<\/span>\r\n        [str2double(vars{1}{3}), 1], <span class=\"keyword\">...<\/span>\r\n        vars{1}{1}} <span class=\"keyword\">...<\/span>\r\n    <span class=\"string\">'Repeat'<\/span>, str2double(vars{1}{4})};\r\n\r\nmm = memmapfile(filename, <span class=\"string\">'Offset'<\/span>, headerLength, mmfFormater{:});\r\nmj = mm.Data(17).mj;  <span class=\"comment\">% Check the 17th column<\/span>\r\n<span class=\"keyword\">if<\/span> ~isequal(mj, (1:numRows)' + 17*1000000)\r\n    error(<span class=\"string\">'The matrix ''mj'' was not read in correctly!'<\/span>);\r\n<span class=\"keyword\">end<\/span>\r\n<\/pre><h4>Conclusion<a name=\"0ffb6572-7578-47e6-bc0a-ebd4345e8d01\"><\/a><\/h4><p>I hope this blog will be useful to those readers struggling to import big blocks of binary data into MATLAB.  
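<\/p><p>As a recap of the column-wise pattern, the following is a minimal, self-contained sketch (not part of the original example) that sums each column of a small memory-mapped file, copying only one column into the workspace at a time.  The tiny matrix size and the names <tt>nR<\/tt>, <tt>nC<\/tt>, <tt>demoFile<\/tt>, <tt>mmDemo<\/tt>, and <tt>colSums<\/tt> are assumptions made for illustration.<\/p><pre class=\"codeinput\"><span class=\"comment\">% Write a small nR-by-nC matrix of doubles in column-major order<\/span>\r\nnR = 4;\r\nnC = 3;\r\ndemoFile = [tempname() <span class=\"string\">'.dat'<\/span>];\r\nf = fopen(demoFile, <span class=\"string\">'w'<\/span>);\r\nfwrite(f, reshape(1:nR*nC, nR, nC), <span class=\"string\">'double'<\/span>);\r\nfclose(f);\r\n\r\n<span class=\"comment\">% Map the file as nC repeats of an nR-by-1 column named mj<\/span>\r\nmmDemo = memmapfile(demoFile, <span class=\"string\">'Format'<\/span>, {<span class=\"string\">'double'<\/span>, [nR 1], <span class=\"string\">'mj'<\/span>}, <span class=\"keyword\">...<\/span>\r\n                    <span class=\"string\">'Repeat'<\/span>, nC);\r\n\r\n<span class=\"comment\">% Reduce one column at a time<\/span>\r\ncolSums = zeros(1, nC);\r\n<span class=\"keyword\">for<\/span> j = 1:nC\r\n    colSums(j) = sum(mmDemo.Data(j).mj);\r\n<span class=\"keyword\">end<\/span>\r\n\r\n<span class=\"comment\">% Release the mapping before deleting the scratch file<\/span>\r\nclear(<span class=\"string\">'mmDemo'<\/span>);\r\ndelete(demoFile);\r\n<\/pre><p>With larger files, the same loop should keep workspace memory near the size of a single column rather than the whole matrix.<\/p><p>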
Though not covered in this post, <tt>memmapfile<\/tt> can also be used to load row-major data, and 2D \"tiles\" of data.<\/p><p>When you are done experimenting, remember to delete the scratch files you have been creating.<\/p><p>Have you used <tt>memmapfile<\/tt> or some other technique to incrementally read from large binary files?  Share your tips <a href=\"https:\/\/blogs.mathworks.com\/loren\/?p=725#respond\">here<\/a>!<\/p><script language=\"JavaScript\"> <!-- \r\n    function grabCode_f1023e4f469c4e8ab98d0b0c2568f364() {\r\n        \/\/ Remember the title so we can use it in the new page\r\n        title = document.title;\r\n\r\n        \/\/ Break up these strings so that their presence\r\n        \/\/ in the Javascript doesn't mess up the search for\r\n        \/\/ the MATLAB code.\r\n        t1='f1023e4f469c4e8ab98d0b0c2568f364 ' + '##### ' + 'SOURCE BEGIN' + ' #####';\r\n        t2='##### ' + 'SOURCE END' + ' #####' + ' f1023e4f469c4e8ab98d0b0c2568f364';\r\n    \r\n        b=document.getElementsByTagName('body')[0];\r\n        i1=b.innerHTML.indexOf(t1)+t1.length;\r\n        i2=b.innerHTML.indexOf(t2);\r\n \r\n        code_string = b.innerHTML.substring(i1, i2);\r\n        code_string = code_string.replace(\/REPLACE_WITH_DASH_DASH\/g,'--');\r\n\r\n        \/\/ Use \/x3C\/g instead of the less-than character to avoid errors \r\n        \/\/ in the XML parser.\r\n        \/\/ Use '\\x26#60;' instead of '<' so that the XML parser\r\n        \/\/ doesn't go ahead and substitute the less-than character. 
\r\n        code_string = code_string.replace(\/\\x3C\/g, '\\x26#60;');\r\n\r\n        copyright = 'Copyright 2013 The MathWorks, Inc.';\r\n\r\n        w = window.open();\r\n        d = w.document;\r\n        d.write('<pre>\\n');\r\n        d.write(code_string);\r\n\r\n        \/\/ Add copyright line at the bottom if specified.\r\n        if (copyright.length > 0) {\r\n            d.writeln('');\r\n            d.writeln('%%');\r\n            if (copyright.length > 0) {\r\n                d.writeln('% _' + copyright + '_');\r\n            }\r\n        }\r\n\r\n        d.write('<\/pre>\\n');\r\n\r\n        d.title = title + ' (MATLAB code)';\r\n        d.close();\r\n    }   \r\n     --> <\/script><p style=\"text-align: right; font-size: xx-small; font-weight:lighter;   font-style: italic; color: gray\"><br><a href=\"javascript:grabCode_f1023e4f469c4e8ab98d0b0c2568f364()\"><span style=\"font-size: x-small;        font-style: italic;\">Get \r\n      the MATLAB code <noscript>(requires JavaScript)<\/noscript><\/span><\/a><br><br>\r\n      Published with MATLAB&reg; R2013a<br><\/p><p class=\"footer\"><br>\r\n      Published with MATLAB&reg; R2013a<br><\/p><\/div><!--\r\nf1023e4f469c4e8ab98d0b0c2568f364 ##### SOURCE BEGIN #####\r\n%% Using memmapfile to Navigate through \"Big Data\" Binary Files\r\n%\r\n% This week, Ken Atwell from MATLAB product management weighs in with using\r\n% a |memmapfile| as a way to navigate through binary files of \"big data\".\r\n%\r\n% |<https:\/\/www.mathworks.com\/help\/matlab\/ref\/memmapfile.html memmapfile>|\r\n% (for \"memory-mapped file\") is used to access binary files\r\n% without needing to resort to low-level file I\/O functions like |<https:\/\/www.mathworks.com\/help\/matlab\/ref\/fread.html fread>|.  It includes an ability to\r\n% declare the structure of your binary data, freely mixing data types and sizes.  
Originally\r\n% targeted at easing the reading of lists of records, |memmapfile| also has application in big data.\r\n% Today's post will examine column-wise access of big binary files, and how to navigate \r\n% through metadata that sometimes is at the beginning of binary files.\r\n\r\n%% Experiment Parameters \r\n% To get started, create a potentially large 2D matrix that is stored on disk.  |numRows| and\r\n% |numColumns| can be changed to experiment with different sizes.  To keep things simple and\r\n% snappy here, the matrix is under a gigabyte in size.  This is hardly \"big\r\n% data\", and you can adjust the parameters here to create a larger problem.  Do note that, of course,\r\n% the disk space required to run this code will grow with the matrix size you create.\r\n\r\nscratchFolder = tempdir;\r\nnumRows = 1e5;\r\nnumColumns = 1e3;\r\n\r\n%% Create Test File\r\n% Create the scratch file.  This can take from a moment to many minutes to run, depending\r\n% on the sizes declared above.  Because data of type |double| is being created, the file will consume |8*numRows*numColumns| bytes of\r\n% free disk space.\r\n%\r\n% The value of |[r,c]| in the matrix is set to be |c*1,000,000+r|.  This\r\n% will make it easy to\r\n% glance at our output and recognize that we are getting the values that are expected.\r\n\r\nfilename = ['mmf' int2str(numRows) 'x' int2str(numColumns) '.dat'];\r\nfilename = fullfile(scratchFolder, filename);\r\nf = fopen(filename, 'w');\r\nfor colNum = 1:numColumns\r\n    column = (1:numRows)' + colNum*1000000;\r\n    fwrite(f,column,'double');\r\nend\r\nfclose(f);\r\n\r\n%% |memmapfile| for Entire Data Set\r\n% To create a memory-mapped file, we call |memmapfile| with these two arguments:\r\n%\r\n% # The filename containing the data\r\n% # The |'Format'| of the data, which is a cell array with three\r\n% components: a. The data type (|double| in this example), b. 
the size of\r\n% the data (a matrix of size |numRows| by |numColumns| in this example),\r\n% and c. a name to assign to this data (|m| for \"matrix\" in this example)\r\n%\r\n% This is basic usage of |memmapfile|, and it encapsulates the entire data set in a single access.  \r\n% *When working with \"big data\", you will want to avoid singular accesses\r\n% like this.*  If the\r\n% size of the data is large enough, your computer may become unresponsive (\" <http:\/\/en.wikipedia.org\/wiki\/Thrashing_(computer_science) thrash> \") as it busily\r\n% creates swap space in an effort to read in the entire matrix.  The |if| statement is here to prevent\r\n% you from doing this accidentally.  If you are experimenting with data sizes larger than the\r\n% physical memory available in your computer, you will want to skip this step.\r\n\r\n% Prevent a memory-busting matrix from being created.\r\nif numRows*numColumns*8 > 1e9 \r\n    error('Size possibly too big; are you sure you want to do this?')\r\nend\r\n\r\nmm = memmapfile(filename, 'Format', {'double', [numRows numColumns], 'm'});\r\nm = mm.Data.m;  %#ok<NASGU>\r\n\r\n%%\r\n% Regardless, clear |m| to free up whatever memory was used.\r\n\r\nclear('m');\r\n\r\n%% |memmapfile| with Columnwise Access\r\n% Here is a smarter way to access the big data a column at a time.  Instead of creating a single\r\n% variable that is |numRows * numColumns| large, we create a |numRows * 1| vector, which is repeated\r\n% |numColumns| times (note this code is now using the optional |'Repeat'| argument to |memmapfile|).  This subtle\r\n% difference allows the big matrix to be read in one column at a time, presumably staying within\r\n% available memory.  
The variable is named |mj| to indicate the\r\n% j-th column of data.\r\n\r\nmm = memmapfile(filename, 'Format', {'double', [numRows 1], 'mj'}, ...\r\n               'Repeat', numColumns);\r\n\r\n%%\r\n% The code spot-checks the 17th column.\r\n\r\nif ~isequal(mm.Data(17).mj, (1:numRows)' + 17*1000000)\r\n    error('The data was not read back in correctly!');\r\nend\r\n\r\n%%\r\n% |memmapfile| allows for creative uses of |'Repeat'| if\r\n% your application needs it.  For example, rather than a vector of an\r\n% entire column, you can read in blocks of half a column:\r\n%\r\n%   memmapfile(filename, 'Format', {'double', [numRows\/2 1], 'mj'}, 'Repeat', numColumns*2);\r\n%\r\n% or blocks containing multiple columns:\r\n%\r\n%   memmapfile(filename, 'Format', {'double', [numRows*10 1], 'mj'}, 'Repeat', numColumns\/10);\r\n%\r\n% Of course, first ensure that your data's size is evenly divisible by\r\n% these multiples, or you will create a |memmapfile| that does not\r\n% accurately reflect the actual file that underlies it.\r\n\r\n%%\r\n% *A note about memory-mapped files and virtual memory*:\r\n% If your application loops over many columns of memory-mapped\r\n% data, you may find that memory usage as reported by the\r\n% <http:\/\/support.microsoft.com\/kb\/323527 Windows Task Manager>\r\n% or the <http:\/\/support.apple.com\/kb\/ht1342 OS X Activity Monitor> will begin to climb.  This can be a little\r\n% misleading.  While |memmapfile| will consume sections of your computer's virtual memory space\r\n% (only of practical consequence if you are still using a 32-bit version of MATLAB),\r\n% physical memory (RAM) is not committed for the whole file: the operating system\r\n% pages data in only as it is accessed and can reclaim those pages when memory is\r\n% needed.  The assignment of |m| above has\r\n% the potential to fail only because that operation is pulling the contents of the entire\r\n% |memmapfile| into a workspace variable, and workspace variables\r\n% (including |<https:\/\/www.mathworks.com\/help\/matlab\/ref\/ans.html ans>|)\r\n% reside in RAM.  
A comprehensive discussion of virtual memory is beyond\r\n% the scope of this blog; the <http:\/\/en.wikipedia.org\/wiki\/Virtual_memory\r\n% Wikipedia article on virtual memory> is a starting point if you want\r\n% to learn more.\r\n\r\n%% Data File with XML Header\r\n% The above code assumes that the matrix appears at the very beginning of the data file.  However,\r\n% a number of data files begin with some form of metadata, followed by the \"payload\",\r\n% the data itself.\r\n%\r\n% For this blog, a file with some metadata followed by the \"real\" data will be created.\r\n% The metadata is expressed using XML-style formatting.  This particular format was created for this\r\n% post, but it is representative of actual metadata.  Typically, the metadata indicates\r\n% an offset into the file where the actual data begins, which is expressed\r\n% here in the |headerLength|\r\n% attribute in the first line of the header.  What follows next is a |var|\r\n% to declare the name, type, and size of the variable contained in the\r\n% file.  This file will contain only one variable, but conceptually the\r\n% file could contain multiple variables.\r\n\r\nstrNumC = int2str(numColumns);\r\nstrNumR = int2str(numRows);\r\n\r\nheader = [...\r\n    '<datFile headerLength=00000000>' char(10) ...\r\n    '  <var name=\"mj\" type=\"double\" size=\"' strNumR ',' strNumC '\"\/>' char(10) ...\r\n    '<\/datFile>' char(10) ...\r\n    ];\r\n\r\n% Insert header length\r\nheader = strrep(header, '00000000', sprintf('%08.0f', length(header)));\r\ndisp(header)\r\n\r\n%%\r\nfilename = ['mmf' int2str(numRows) 'x' int2str(numColumns) '_header.dat'];\r\nfilename = fullfile(scratchFolder, filename);\r\nf = fopen(filename, 'w');\r\nfwrite(f, header, 'char');\r\nfor colNum = 1:numColumns\r\n    column = (1:numRows)' + colNum*1000000;\r\n    fwrite(f, column, 'double');\r\nend\r\nfclose(f);\r\n\r\n%% Read XML Header\r\n% The header will now be read back in and parsed.  
While |<https:\/\/www.mathworks.com\/help\/matlab\/ref\/xmlread.html xmlread>| could be used to get a DOM node to\r\n% traverse the XML data structure, <https:\/\/www.mathworks.com\/help\/matlab\/matlab_prog\/regular-expressions.html regular expressions> can often be used\r\n% as a quick and dirty way to scrape information from XML.  If you are\r\n% unfamiliar with regular expressions, it is sufficient for this\r\n% example just to understand that:\r\n%\r\n% * |(\\d+)| extracts a string of digits \r\n% * |(\\w+)| extracts a word (a string of letters, digits, or underscores)\r\n% * |\\s+| skips over whitespace\r\n%\r\n% The first line of the file is read to determine the length of the header (extracted by a regular\r\n% expression), and then the full header is read using this information.  Finally, a second, more\r\n% complex regular expression is used to extract the name, type, and size information for the \r\n% variable contained in the binary data \"blob\" that follows the header.\r\n\r\nf = fopen(filename, 'r');\r\nfirstLine = fgetl(f);\r\nfclose(f);\r\n\r\nfirstLine %#ok<NOPTS>\r\n\r\n%%\r\n\r\n% Get the length and convert the string to a double\r\nheaderLength = regexp(firstLine, 'headerLength=(\\d+)', 'tokens'); \r\nheaderLength = (str2double(headerLength{1}{1})) %#ok\r\n\r\n%%\r\n\r\nf = fopen(filename, 'r');\r\nheader = fread(f, headerLength, 'char=>char')';\r\nfclose(f);\r\n\r\n% Scan the metadata for type, size, and name\r\nvars = regexp(header, 'name=\"(\\w+)\"\\s+type=\"(\\w+)\"\\s+size=\"(\\d+),(\\d+)\"', ...\r\n    'tokens');\r\n\r\n%% Create the Memory-mapped File\r\n% Lastly, create a |memmapfile| for the variable.  
The cell array returned by |<https:\/\/www.mathworks.com\/help\/matlab\/ref\/regexp.html regexp>| is\r\n% transformed into a new cell array that matches the expected input arguments to the |memmapfile| function.\r\n\r\n% Reorganize the data from XML into the form expected by memmapfile\r\nmmfFormater = {...\r\n    'Format', ...\r\n        {vars{1}{2}, ...\r\n        [str2double(vars{1}{3}), 1], ...\r\n        vars{1}{1}} ...\r\n    'Repeat', str2double(vars{1}{4})};\r\n\r\nmm = memmapfile(filename, 'Offset', headerLength, mmfFormater{:});\r\nmj = mm.Data(17).mj;  % Check the 17th column\r\nif ~isequal(mj, (1:numRows)' + 17*1000000)\r\n    error('The matrix ''mj'' was not read in correctly!');\r\nend\r\n\r\n%% Conclusion\r\n% I hope this blog will be useful to those readers struggling to import big blocks of binary data\r\n% into MATLAB.  Though not covered in this post, |memmapfile| can also be used to load row-major data,\r\n% and 2D \"tiles\" of data.\r\n%\r\n% When you are done experimenting, remember to delete the scratch files you have been creating.\r\n%\r\n% Have you used |memmapfile| or some other technique to incrementally read from large binary files?  Share\r\n% your tips <https:\/\/blogs.mathworks.com\/loren\/?p=725#respond here>!\r\n\r\n\r\n##### SOURCE END ##### f1023e4f469c4e8ab98d0b0c2568f364\r\n-->","protected":false},"excerpt":{"rendered":"<!--introduction--><p>This week, Ken Atwell from MATLAB product management weighs in with using a <tt>memmapfile<\/tt> as a way to navigate through binary files of \"big data\".... 
<a class=\"read-more\" href=\"https:\/\/blogs.mathworks.com\/loren\/2013\/07\/09\/using-memmapfile-to-navigate-through-big-data-binary-files\/\">read more >><\/a><\/p>","protected":false},"author":39,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":[],"categories":[45,7],"tags":[],"_links":{"self":[{"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/posts\/725"}],"collection":[{"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/users\/39"}],"replies":[{"embeddable":true,"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/comments?post=725"}],"version-history":[{"count":17,"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/posts\/725\/revisions"}],"predecessor-version":[{"id":3480,"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/posts\/725\/revisions\/3480"}],"wp:attachment":[{"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/media?parent=725"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/categories?post=725"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/tags?post=725"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}