Big Data in MAT Files
Today's guest blogger is Adam Filion, a Senior Product Manager at MathWorks. Adam helps manage and prioritize our development efforts in data science and big data.
MAT files are an easy and common way to store MATLAB variables to disk. They support all MATLAB variable types, have good data compression, and can be accessed or created from other applications through an external API. MATLAB users sometimes have so much data stored in MAT files that they can't load all the data at once. In this post, we will explore different situations and solutions for analyzing large amounts of data stored in MAT files.
Introduction to MAT Files
Using MAT files
MATLAB provides the ability to save variables to MAT files through the save command.
a = pi;
b = rand(1,10);
save mydata.mat a b
These variables can be returned to the workspace using load.
% return all variables to the workspace
load mydata.mat
% find which variables are contained in a MAT file
varNames = who("-file","mydata.mat");
% return only the second variable to the workspace
load("mydata.mat",varNames{2})
MAT file versions
MAT files have evolved over time and several different versions exist. You can change the version to use when saving data by passing an additional flag, such as "-v7.3", to the save command. The biggest differences are summarized below.
- Version 7.3: supports saving and loading parts of variables, and variables larger than 2 GB on 64-bit systems; based on the HDF5 format
- Version 7 (the default): supports data compression and Unicode character encoding
- Version 6: no compression or Unicode; readable by MATLAB 5 and later
- Version 4: supports only two-dimensional double, character, and sparse arrays; readable by all MATLAB versions
Version 7 is the default and should be used unless you need the additional functionality provided in Version 7.3. This is because, as mentioned in the documentation, Version 7.3 contains additional header information and may result in larger files than Version 7 when storing small amounts of data. Only create Version 6 or Version 4 MAT files if you need compatibility with older legacy applications.
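As a quick sanity check (the file and variable names here are just for illustration), you can save the same small variable in both versions and compare the resulting file sizes:
% save the same small variable in two versions and compare on-disk sizes
s = rand(100,1);
save small_v7.mat s -v7
save small_v73.mat s -v7.3
d7 = dir("small_v7.mat");
d73 = dir("small_v73.mat");
fprintf("v7: %d bytes, v7.3: %d bytes\n", d7.bytes, d73.bytes)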
When to store big data in MAT files
Most users analyzing large amounts of MAT file data did not choose the storage format themselves. But if you could choose, when would it make sense to store big data in MAT files? MAT files are a good choice when the following three conditions apply:
- The data is originally recorded in MAT files. This occurs when saving variables from the MATLAB workspace, logging data from Simulink simulations, or recording data from certain third-party data loggers which generate MAT files automatically. If your data does not naturally come in MAT files, it usually should be left in its original format.
- The data is staying in the MATLAB ecosystem. MAT files are simple to use and are a lossless storage format, meaning that you will never lose any information or accuracy when storing MATLAB variables. However, since they are not easily accessible from other applications, it is typically better to use another file format (e.g. csv, Parquet, etc.) when exchanging data with other applications.
- MAT files easily work in your file storage system. Some file systems impose additional requirements on the files stored within them. For example, the Hadoop Distributed File System (HDFS) works best with files that are splittable, and MAT files are not. In such situations, consider whether a different file format that meets the file system's requirements would be a better choice.
Big data in MAT file situations
If all your MAT file data can be easily loaded into memory and analyzed at the same time, use the load command outlined at the beginning. For the rest of this post, we will explore the four general situations when MAT file data gets too large to work with at once.
- Large collections of small MAT files
- Large MAT files with many small variables
- Large MAT files with large variables
- MAT files logged from Simulink simulations
Large Collections of Small MAT Files
Often data is recorded from different entities (e.g. weather stations, vehicles, simulations, etc.) and each entity is stored in a separate file. Even if each individual MAT file can easily fit into memory, the total collection can grow large enough that we cannot work with all of it at once. When this happens, there are two solutions based on the type of analysis we need to do.
Embarrassingly Parallel Analysis
If the work we are doing is embarrassingly parallel, meaning that each file can be analyzed in isolation, then we can loop through the files one at a time. If Parallel Computing Toolbox is available, we can accelerate the process by using a parfor loop instead of a for loop.
% find .mat files in current directory
d = dir("*.mat");
% loop through with a for loop, or use parfor
parfor ii = 1:length(d)
    % load the next .mat file
    data = load(d(ii).name);
    % perform your analysis on each individual file
    doAnalysis(data)
end
Inherently Sequential Analysis
When our files cannot be analyzed in isolation, we need to change our approach. The fileDatastore gives access to large collections of files by using a custom file reader function. For example, if your analysis only needs the variable "b" from your MAT files, you can use a reader function such as:
function data = myReader(fileName,varName)
    matData = load(fileName,varName);
    data = matData.(varName);
end
If your data is stored in more complicated or irregular formats, you can use any arbitrary code in your reader function to return the values in the format you need. Once we define our reader function, we can create a fileDatastore, which will read one file at a time using our reader function.
fds = fileDatastore("*.mat", "ReadFcn", @(fn) myReader(fn,"b"), "UniformRead", true);
Note that by default the fileDatastore will return each file's contents as an element in a cell array. The UniformRead option will instead keep the data's original format and vertically concatenate the data from different files.
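To see the difference, here is a small sketch of the default behavior, reusing the myReader function above; without UniformRead, the result of each read comes back wrapped in a cell array:
fds_cell = fileDatastore("*.mat", "ReadFcn", @(fn) myReader(fn,"b"));
% each read returns the file's data wrapped in a cell array
c = read(fds_cell);
class(c)   % 'cell'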
After creating the datastore we can read a portion of the dataset with the read method or analyze the full out-of-memory dataset with tall arrays.
% read once from the datastore
t = read(fds);
% create a tall array
tall_t = tall(fds);
Unlike the load command, fileDatastore also supports remote storage systems including Amazon S3, Azure Blob Storage and the Hadoop Distributed File System. For example, to use Amazon S3 make the following modifications:
setenv("AWS_ACCESS_KEY_ID", "YOUR_AWS_ACCESS_KEY") setenv("AWS_SECRET_ACCESS_KEY", "YOUR_AWS_SECRET_ACCESS_KEY") fds = fileDatastore("s3://bucketname/dataset/*.mat", "ReadFcn", @(fn) myReader(fn,"b"), "UniformRead", true);
However, note that the fileDatastore automatically makes a local copy of each file it reads, which may result in downloading the entire dataset when it is stored remotely. If this is problematic, consider rewriting your data to another file format so you can use a datastore that does not require local copies. This is discussed in more detail later in this post.
Large MAT Files with Many Small Variables
MAT files can individually be too large to load either because they contain many small variables or because they contain large variables. Files with many small variables arise when logging many signals from simulations or data loggers, or when adding variables to a MAT file over time using the save command's -append option.
c = eye(10);
% add another variable to the file
save mydata.mat c -append
Use a Subset of Variables
When working with MAT files containing too many small variables to load all at once, one approach is to only load certain variables needed for your analysis as we did in the prior section. If this reduces the data needed from each individual file such that each call to the read method fits into memory, then we can use the example from the previous section to avoid running out of memory.
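As a reminder, the load command accepts a list of variable names, so a minimal sketch of this approach (using the file from the earlier examples) is:
% load only the variables "a" and "c" into a struct
data = load("mydata.mat","a","c");
fieldnames(data)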
Use a Portion of All Variables
However, if the data from individual files is still too large to fit into memory even after selecting only the necessary variables, we must try a different approach. In the prior section we used fileDatastore to read entire MAT files with a custom reader function. The fileDatastore also supports reading only part of a file at a time. By adding logic to our reader function to track how far we have read through a large file, we can grab a portion of each variable.
Let's assume that in our collection of MAT files each file contains the same number of variables with the same names. Let's also assume all variables within a particular file are column vectors of the same length. We can then use matfile objects (described in more detail below) within the following reader function to partially read only a certain number of rows from each variable and concatenate them into a table.
function [data,readCounter,done] = partialReadFcn(filename,readCounter)
    % create MAT file object for partial reading
    m = matfile(filename);
    % initialize readCounter on the first read of each file
    if isempty(readCounter)
        readCounter = 0;
    end
    % default read size in number of rows
    readSize = 3e4;
    % number of rows in the column vectors
    arrayLength = size(m,"x",1);
    % number of rows already read from this file
    offset = readCounter*readSize;
    if (arrayLength - offset) > readSize
        % if there's more left to read than readSize, we're not done...
        done = false;
    else
        % ...otherwise we are
        done = true;
        % adjust readSize to finish the file
        readSize = arrayLength - offset;
    end
    readRange = (offset + 1):(offset + readSize);
    readCounter = readCounter + 1;
    % read a portion of all variables and collect them in a table
    varNames = who("-file",filename);
    data = nan(readSize,length(varNames));
    for ii = 1:length(varNames)
        data(:,ii) = m.(varNames{ii})(readRange,1);
    end
    data = array2table(data,"VariableNames",varNames);
end
x = rand(1e6,1); y = rand(1e6,1); z = rand(1e6,1);
save smallVars1.mat x y z -v7.3
save smallVars2.mat x y z -v7.3
fds_partial = fileDatastore("smallVars*.mat", "ReadFcn", @partialReadFcn, ...
    "UniformRead", true, "ReadMode", "partialfile");
% reads number of rows specified in reader function
t_partial = read(fds_partial);
size(t_partial)
ans =
       30000           3
The partial reading of fileDatastore lets you parse arbitrarily large files with an arbitrary reader function. If you want even more control over how a datastore processes a data source, consider using custom datastores. By using custom datastores you get access to the low-level tools that MathWorks developers use when developing datastores for new data sources. While they can be challenging to write from scratch, custom datastores give you complete control over how the datastore behaves.
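To give a flavor of what this looks like, here is a hypothetical minimal custom datastore that returns one whole MAT file per read; a real implementation would typically add partial reads, previews, and parallelization support:
classdef WholeMatDatastore < matlab.io.Datastore
    % Hypothetical minimal custom datastore: one whole MAT file per read
    properties (Access = private)
        CurrentFileIndex double
        FileSet matlab.io.datastore.DsFileSet
    end
    methods
        function ds = WholeMatDatastore(location)
            ds.FileSet = matlab.io.datastore.DsFileSet(location, ...
                "FileExtensions",".mat");
            ds.CurrentFileIndex = 1;
            reset(ds);
        end
        function tf = hasdata(ds)
            % true while unread files remain
            tf = hasfile(ds.FileSet);
        end
        function [data,info] = read(ds)
            % load all variables from the next file as a struct
            fileInfoTbl = nextfile(ds.FileSet);
            data = load(fileInfoTbl.FileName);
            info.FileName = fileInfoTbl.FileName;
            ds.CurrentFileIndex = ds.CurrentFileIndex + 1;
        end
        function reset(ds)
            reset(ds.FileSet);
            ds.CurrentFileIndex = 1;
        end
    end
    methods (Hidden)
        function frac = progress(ds)
            % fraction of files read so far
            frac = (ds.CurrentFileIndex - 1)/ds.FileSet.NumFiles;
        end
    end
end
You would then use it like any built-in datastore, for example ds = WholeMatDatastore(pwd); data = read(ds);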
Large MAT Files with Large Variables
MATFILE Objects
Up through Version 7 MAT files, individual variables are limited to 2GB in size. Version 7.3 removes this restriction, allowing variables to be arbitrarily large. MATLAB's matfile objects enable users to access and change variables stored in Version 7.3 MAT files without loading the entire variable into memory.
save mydata_7_3.mat a b c -v7.3
m = matfile("mydata_7_3.mat","Writable",true);
% read only first three rows of variable "c"
m.c(1:3,:)
ans =
     1     0     0     0     0     0     0     0     0     0
     0     1     0     0     0     0     0     0     0     0
     0     0     1     0     0     0     0     0     0     0
% write values from "b" to "c"
m.c(1:2,:) = [m.b(1,:); m.b(1,:)];
m.c(1:3,:)
ans =
  Columns 1 through 7
    0.3973    0.0812    0.5761    0.3502    0.3579    0.3944    0.0965
    0.3973    0.0812    0.5761    0.3502    0.3579    0.3944    0.0965
         0         0    1.0000         0         0         0         0
  Columns 8 through 10
    0.1076    0.8506    0.1651
    0.1076    0.8506    0.1651
         0         0         0
These matfile objects can then be used in a loop or combined with a fileDatastore as in the above example to process individual variables that are arbitrarily large.
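For example, here is a minimal sketch that chunks through the variable x in the smallVars files created earlier (the chunk size is arbitrary):
d = dir("smallVars*.mat");
for ii = 1:length(d)
    m = matfile(d(ii).name);
    % number of rows in the column vector "x"
    n = size(m,"x",1);
    for startRow = 1:1e5:n
        endRow = min(startRow + 1e5 - 1, n);
        % read only this chunk of rows into memory
        chunk = m.x(startRow:endRow, 1);
        % ... analyze the in-memory chunk here ...
    end
end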
While matfile objects are easy to use, they have several limitations that restrict the situations where they can be used. The biggest restrictions include:
- Partial reading/writing of variables is only supported with Version 7.3 MAT files
- Partial reading/writing is not supported for some heterogeneous datatypes, such as tables, meaning those datatypes must be read or written as whole variables (see the sketch below)
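For example, here is a small sketch of the table restriction (the file and variable names are just for illustration):
tbl = table((1:5)', rand(5,1));
save tableData.mat tbl -v7.3
mt = matfile("tableData.mat");
% reading the entire table works
wholeTbl = mt.tbl;
% but partial indexing into the table errors:
% mt.tbl(1:2,:)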
Rewriting MAT Files to Another Format
If matfile objects don't meet your needs, you could consider the custom datastores mentioned above or refactor the MAT files into another format. Rewriting your data from MAT files to another file format may make sense when you either:
- Need functionality not available with MAT files (e.g. splittability)
- Need to interchange data with other applications
- Need to work with remote datasets, and the local file requirements of fileDatastore are problematic
One such format is Parquet. The Parquet file format is a columnar data storage format designed for the Hadoop ecosystem, though it can be used in any environment. Parquet files are splittable, offer fast I/O performance, and are a common data interchange format. They are typically kept relatively small so they fit within the Hadoop Distributed File System's 128 MB block size.
As of R2019a MATLAB has built-in support for reading and writing Parquet files. As Parquet is designed for heterogeneous columnar data, it requires a table or timetable variable. You can interact with Parquet files from MATLAB using parquetread, parquetwrite, and parquetinfo.
parquetwrite("parquetData.parquet",t_partial)
If you need to do some processing using tall arrays before rewriting the data, the tall array results can be written directly to Parquet using the tall array write command.
tall_partial = tall(fds_partial);
write("data/p*.parquet",tall_partial,"FileType","parquet")
Writing tall data to folder C:\Work\ArtofMATLAB\AdamF\largeMATfiles\data
Evaluating tall expression using the Local MATLAB Session:
- Pass 1 of 1: Completed in 3.9 sec
Evaluation completed in 4 sec
Once you have rewritten your data to Parquet, you can use the full Parquet dataset with the parquetDatastore. Unlike the fileDatastore, the parquetDatastore does not require making a local copy of each file.
pds = parquetDatastore("data\*.parquet");
t_parquet = read(pds);
tall_parquet = tall(pds);
Switching to another file format does come with its own concerns:
- Large amounts of data are being duplicated, which can consume large amounts of both time and disk space.
- Some information may be lost when writing to a new format. For example, Parquet files do not preserve the timezone property of datetime values. To maintain the timezone information, you must manually save the information to another variable in your table or timetable before writing it to the Parquet file.
MAT Files Logged from Simulink Simulations
The situations discussed above can arise from many different data sources. One special source of large quantities of MAT file data is data logged from Simulink simulations. We treat this situation separately because MAT files logged from Simulink:
- Store their data in a special, nested data structure within the MAT file
- Can use a simulationDatastore, which is specifically designed for working with large amounts of data logged to MAT files from Simulink
A simulationDatastore enables a Simulink model to interact with big data. You can load big data as simulation input and log big output data from a simulation. The documentation page for Working with Big Data for Simulations contains details of creating and using data logged from Simulink. The general idea is to start with generating the data from Simulink:
% load Simulink model
load_system("sldemo_fuelsys")
% turn on data logging
set_param("sldemo_fuelsys","LoggingToFile","on")
% run model and log data
sim("sldemo_fuelsys")
% close model without saving
close_system("sldemo_fuelsys",0)
Once your data is logged, use DatasetRef to access the simulationDatastores.
DSRef = Simulink.SimulationData.DatasetRef("out.mat","sldemo_fuelsys_output");
% return a simulationDatastore for fuel signal
ds = DSRef.getAsDatastore("fuel").Values;
% take a single read of the fuel signal from the MAT file
t_sim = read(ds);
% treat all the fuel data as a tall variable
tall_t_sim = tall(ds);
Summary
In this post we explored several different situations and solutions for dealing with big data in MAT files. Many of these solutions can be mixed and combined. As a general starting point: use load when everything fits in memory, a fileDatastore with a custom reader for large collections of files or partial reads, matfile objects for large individual variables in Version 7.3 files, and a simulationDatastore for data logged from Simulink. Leave a comment here and let us know what enhancements you would like to see in the next version of MAT files.