Using memmapfile to Navigate through “Big Data” Binary Files

Posted by Loren Shure, July 9, 2013

This week, Ken Atwell from MATLAB product management weighs in on using memmapfile as a way to navigate through binary files of "big data".

memmapfile (for "memory-mapped file") is used to access binary files without needing to resort to low-level file I/O functions like fread. It includes an ability to declare the structure of your binary data, freely mixing data types and sizes. Originally targeted at easing the reading of lists of records, memmapfile also has application in big data. Today's post will examine column-wise access of big binary files, and how to navigate through metadata that sometimes is at the beginning of binary files.

Experiment Parameters
Create Test File
memmapfile for Entire Data Set
memmapfile with Columnwise Access
Data File with XML Header
Read XML Header
Create the Memory-mapped File
Conclusion

Experiment Parameters

To get started, create a potentially large 2D matrix that is stored on disk. numRows and numColumns can be changed to experiment with different sizes. To keep things simple and snappy here, the matrix is under a gigabyte in size. This is hardly "big data", and you can adjust the parameters here to create a larger problem. Do note that, of course, the disk space required to run this code will grow with the matrix size you create.

scratchFolder = tempdir;
numRows = 1e5;
numColumns = 1e3;

Create Test File

Create the scratch file. This can take from a moment to many minutes to run, depending on the sizes declared above. Because data of type double is being created, the file will consume 8*numRows*numColumns bytes of free disk space.

The value at row r, column c of the matrix is set to c*1,000,000 + r. This makes it easy to glance at the output and recognize that we are getting the expected values.

filename = ['mmf' int2str(numRows) 'x' int2str(numColumns) '.dat'];
filename = fullfile(scratchFolder, filename);
f = fopen(filename, 'w');
for colNum = 1:numColumns
    column = (1:numRows)' + colNum*1000000;
    fwrite(f,column,'double');
end
fclose(f);
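
A quick way to confirm that a file was written as intended is to compare its size on disk with 8*numRows*numColumns. The sketch below is self-contained and uses deliberately tiny sizes; the file name mmf_size_check.dat is a hypothetical scratch name, not part of the experiment above.

```matlab
% Sketch: write a tiny column-major file and confirm its size on disk.
numRows = 4;                                         % small, hypothetical sizes
numColumns = 3;
demoFile = fullfile(tempdir, 'mmf_size_check.dat');  % hypothetical scratch file

f = fopen(demoFile, 'w');
for colNum = 1:numColumns
    fwrite(f, (1:numRows)' + colNum*1000000, 'double');
end
fclose(f);

info = dir(demoFile);                       % file metadata, including size
expectedBytes = 8 * numRows * numColumns;   % 8 bytes per double
if info.bytes ~= expectedBytes
    error('File is %d bytes, expected %d.', info.bytes, expectedBytes);
end

delete(demoFile);                           % remove the scratch file
```

The same comparison can be applied to the full-size file before mapping it.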

memmapfile for Entire Data Set

To create a memory-mapped file, we call memmapfile with these two arguments:

The filename containing the data
The 'Format' of the data, which is a cell array with three components:
  a. the data type (double in this example)
  b. the size of the data (a matrix of size numRows by numColumns in this example)
  c. a name to assign to this data (m for "matrix" in this example)

This is basic usage of memmapfile, and it encapsulates the entire data set in a single access. When working with "big data", you will want to avoid all-at-once accesses like this. If the data is large enough, your computer may become unresponsive ("thrash") as it busily creates swap space in an effort to read in the entire matrix. The if statement below is there to prevent you from doing this accidentally. If you are experimenting with data sizes larger than the physical memory available in your computer, you will want to skip this step.

% Prevent a memory-busting matrix from being created.
if numRows*numColumns*8 > 1e9
    error('Size possibly too big; are you sure you want to do this?')
end

mm = memmapfile(filename, 'Format', {'double', [numRows numColumns], 'm'});
m = mm.Data.m;  %#ok<NASGU>

Regardless, clear m to free up whatever memory was used.

clear('m');
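
Even with a whole-matrix Format, the mapped data can be indexed directly instead of being copied out in full. The following is a minimal sketch on a tiny file; the sizes and the file name mmf_whole_demo.dat are assumptions made for illustration.

```matlab
% Sketch: index into a mapped matrix rather than copying all of it.
numRows = 5;                                         % small, hypothetical sizes
numColumns = 4;
demoFile = fullfile(tempdir, 'mmf_whole_demo.dat');  % hypothetical scratch file

f = fopen(demoFile, 'w');
for colNum = 1:numColumns
    fwrite(f, (1:numRows)' + colNum*1000000, 'double');
end
fclose(f);

mm = memmapfile(demoFile, 'Format', {'double', [numRows numColumns], 'm'});

% Request only the elements of interest; just these values are assigned
% to workspace variables.
block = mm.Data.m(2:3, 4);   % rows 2-3 of column 4: [4000002; 4000003]
oneValue = mm.Data.m(5, 1);  % row 5 of column 1: 1000005

clear mm                     % release the mapping before deleting the file
delete(demoFile);
```

Each value follows the c*1,000,000 + r pattern, so a wrong row or column shows up immediately.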

memmapfile with Columnwise Access

Here is a smarter way to access the big data, one column at a time. Instead of creating a single variable of size numRows-by-numColumns, we create a numRows-by-1 vector that is repeated numColumns times (note that this code now uses the optional 'Repeat' argument to memmapfile). This subtle difference allows the big matrix to be read in one column at a time, presumably staying within available memory. The variable is named mj to indicate the j-th column of data.

mm = memmapfile(filename, 'Format', {'double', [numRows 1], 'mj'}, ...
               'Repeat', numColumns);

The code spot-checks the 17th column.

if ~isequal(mm.Data(17).mj, (1:numRows)' + 17*1000000)
    error('The data was not read back in correctly!');
end
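
The spot check above reads a single column; the same pattern extends to a loop that visits every column while holding only one in a workspace variable at a time. This is a sketch under assumed tiny sizes, with a hypothetical scratch file name (mmf_loop_demo.dat), that computes one sum per column.

```matlab
% Sketch: process a mapped file one column at a time.
numRows = 5;                                        % small, hypothetical sizes
numColumns = 4;
demoFile = fullfile(tempdir, 'mmf_loop_demo.dat');  % hypothetical scratch file

f = fopen(demoFile, 'w');
for colNum = 1:numColumns
    fwrite(f, (1:numRows)' + colNum*1000000, 'double');
end
fclose(f);

mm = memmapfile(demoFile, 'Format', {'double', [numRows 1], 'mj'}, ...
                'Repeat', numColumns);

colSums = zeros(1, numColumns);
for j = 1:numColumns
    column = mm.Data(j).mj;       % only column j is copied into the workspace
    colSums(j) = sum(column);
end

% Closed-form sum of (1:numRows)' + j*1000000 for each column j
expectedSums = numRows*(numRows+1)/2 + numRows*(1:numColumns)*1000000;
if ~isequal(colSums, expectedSums)
    error('Column sums do not match the expected values.');
end

clear mm                          % release the mapping before deleting the file
delete(demoFile);
```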

memmapfile allows for creative uses of 'Repeat' if your application needs it. For example, rather than a vector of an entire column, you can read in blocks of half a column:

memmapfile(filename, 'Format', {'double', [numRows/2 1], 'mj'}, 'Repeat', numColumns*2);

or blocks containing multiple columns:

memmapfile(filename, 'Format', {'double', [numRows*10 1], 'mj'}, 'Repeat', numColumns/10);

Of course, first ensure that your data's size is evenly divisible by these multiples, or you will create a memmapfile that does not accurately reflect the actual file that underlies it.
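
One way to guard against a mismatched block size is to check divisibility, and the file size itself, before creating the map. The sketch below uses half-column blocks on a tiny file; the sizes, the variable names blockRows and numBlocks, and the file name mmf_block_demo.dat are all assumptions for illustration.

```matlab
% Sketch: validate a block size before mapping the file in blocks.
numRows = 6;                                         % small, hypothetical sizes
numColumns = 4;
demoFile = fullfile(tempdir, 'mmf_block_demo.dat');  % hypothetical scratch file

f = fopen(demoFile, 'w');
for colNum = 1:numColumns
    fwrite(f, (1:numRows)' + colNum*1000000, 'double');
end
fclose(f);

blockRows = numRows/2;                    % half a column per block
totalElements = numRows*numColumns;

% The block length must divide the element count evenly, and the file
% must hold exactly that many doubles.
info = dir(demoFile);
if mod(totalElements, blockRows) ~= 0 || info.bytes ~= 8*totalElements
    error('Block size does not evenly tile the data file.');
end
numBlocks = totalElements/blockRows;

mm = memmapfile(demoFile, 'Format', {'double', [blockRows 1], 'mj'}, ...
                'Repeat', numBlocks);

secondBlock = mm.Data(2).mj;   % second half of column 1
thirdBlock  = mm.Data(3).mj;   % first half of column 2

clear mm                       % release the mapping before deleting the file
delete(demoFile);
```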

A note about memory-mapped files and virtual memory: If your application loops over many columns of memory-mapped data, you may find that memory usage as reported by the Windows Task Manager or the OS X Activity Monitor will begin to climb. This can be a little misleading. While memmapfile will consume sections of your computer's virtual memory space (only of practical consequence if you are still using a 32-bit version of MATLAB), physical memory (RAM) will not be used. The assignment of m above has the potential to fail only because that operation is pulling the contents of the entire memmapfile into a workspace variable, and workspace variables (including ans) reside in RAM. A comprehensive discussion of virtual memory is beyond the scope of this blog; the Wikipedia article on virtual memory is a starting point if you want to learn more.

Data File with XML Header

The above code assumes that the matrix appears at the very beginning of the data file. However, a number of data files begin with some form of metadata, followed by the "payload", the data itself.

For this example, a file with some metadata followed by the "real" data will be created. The metadata is expressed using XML-style formatting. This particular format was created for this post, but it is representative of actual metadata. Typically, the metadata indicates an offset into the file where the actual data begins, which is expressed here in the headerLength attribute in the first line of the header. Next comes a var element that declares the name, type, and size of the variable contained in the file. This file will contain only one variable, but conceptually the file could contain multiple variables.

strNumC = int2str(numColumns);
strNumR = int2str(numRows);

header = [...
    '<datFile headerLength=00000000>' char(10) ...
    '  <var name="mj" type="double" size="' strNumR ',' strNumC '"/>' char(10) ...
    '</datFile>' char(10) ...
    ];

% Insert header length
header = strrep(header, '00000000', sprintf('%08.0f', length(header)));
disp(header)

<datFile headerLength=00000095>
  <var name="mj" type="double" size="100000,1000"/>
</datFile>

filename = ['mmf' int2str(numRows) 'x' int2str(numColumns) '_header.dat'];
filename = fullfile(scratchFolder, filename);
f = fopen(filename, 'w');
fwrite(f, header, 'char');
for colNum = 1:numColumns
    column = (1:numRows)' + colNum*1000000;
    fwrite(f, column, 'double');
end
fclose(f);

Read XML Header

The header will now be read back in and parsed. While xmlread could be used to get a DOM node to traverse the XML data structure, regular expressions can often be used as a quick and dirty way to scrape information from XML. If you are unfamiliar with regular expressions, it is sufficient for this example just to understand that:

(\d+) extracts a string of digits
(\w+) extracts a word (an alphanumeric string)
\s+ skips over whitespace

The first line of the file is read to determine the length of the header (extracted by a regular expression), and then the full header is read using this information. Finally, a second, more complex regular expression is used to extract the name, type, and size information for the variable contained in the binary data "blob" that follows the header.

f = fopen(filename, 'r');
firstLine = fgetl(f);
fclose(f);

firstLine %#ok<NOPTS>

firstLine =
<datFile headerLength=00000095>

% Get the length and convert the string to a double
headerLength = regexp(firstLine, 'headerLength=(\d+)', 'tokens');
headerLength = (str2double(headerLength{1}{1})) %#ok

headerLength =
    95

f = fopen(filename, 'r');
header = fread(f, headerLength, 'char=>char')';
fclose(f);

% Scan the metadata for type, size, and name
vars = regexp(header, 'name="(\w+)"\s+type="(\w+)"\s+size="(\d+),(\d+)"', ...
    'tokens');
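
With the 'tokens' option, regexp returns a cell array holding one cell per match, and each of those cells holds the captured strings in order. The sketch below applies the same expression to a literal copy of the var line so the nesting can be inspected without a data file; the variable names sampleLine and sampleVars are introduced only for this illustration.

```matlab
% Sketch: inspect the nested cell array that regexp returns for 'tokens'.
sampleLine = '  <var name="mj" type="double" size="100000,1000"/>';

sampleVars = regexp(sampleLine, ...
    'name="(\w+)"\s+type="(\w+)"\s+size="(\d+),(\d+)"', 'tokens');

varName = sampleVars{1}{1};              % 'mj'
varType = sampleVars{1}{2};              % 'double'
nRows = str2double(sampleVars{1}{3});    % 100000
nCols = str2double(sampleVars{1}{4});    % 1000

fprintf('%s: %s, %d-by-%d\n', varName, varType, nRows, nCols);
```

The outer index selects the match (only one here), and the inner index selects the captured group.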

Create the Memory-mapped File

Lastly, create a memmapfile for the variable mj. The cell array returned by regexp is transformed into a new cell array that matches the expected input arguments to the memmapfile function.

% Reorganize the data from XML into the form expected by memmapfile
mmfFormater = {...
    'Format', ...
        {vars{1}{2}, ...
        [str2double(vars{1}{3}), 1], ...
        vars{1}{1}} ...
    'Repeat', str2double(vars{1}{4})};

mm = memmapfile(filename, 'Offset', headerLength, mmfFormater{:});
mj = mm.Data(17).mj;  % Check the 17th column
if ~isequal(mj, (1:numRows)' + 17*1000000)
    error('The matrix ''mj'' was not read in correctly!');
end

Conclusion

I hope this post will be useful to readers struggling to import big blocks of binary data into MATLAB. Though not covered here, memmapfile can also be used to load row-major data and 2D "tiles" of data.

When you are done experimenting, remember to delete the scratch files you have been creating.
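
A minimal cleanup sketch follows. It assumes that the mapping should be released before the file is removed, since on some platforms (Windows in particular) a file can stay locked while a memmapfile object still refers to it. The file name mmf_cleanup_demo.dat is hypothetical.

```matlab
% Sketch: release a memory map, then delete the scratch file behind it.
demoFile = fullfile(tempdir, 'mmf_cleanup_demo.dat');  % hypothetical scratch file

f = fopen(demoFile, 'w');
fwrite(f, (1:10)', 'double');
fclose(f);

mm = memmapfile(demoFile, 'Format', 'double');
firstValue = mm.Data(1);      % read something through the map

clear mm                      % drop the mapping first
delete(demoFile);             % then remove the file

if exist(demoFile, 'file') == 2
    warning('Scratch file %s could not be deleted.', demoFile);
end
```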

Have you used memmapfile or some other technique to incrementally read from large binary files? Share your tips here!

Published with MATLAB® R2013a