Big Data in MAT Files
Today's guest blogger is Adam Filion, a Senior Product Manager at MathWorks. Adam helps manage and prioritize our development efforts in data science and big data.
MAT files are an easy and common way to store MATLAB variables to disk. They support all MATLAB variable types, have good data compression, and can be accessed or created from other applications through an external API. MATLAB users sometimes have so much data stored in MAT files that they can't load all the data at once. In this post, we will explore different situations and solutions for analyzing large amounts of data stored in MAT files.
Introduction to MAT Files
Using MAT files
MATLAB provides the ability to save variables to MAT files through the save command.
a = pi;
b = rand(1,10);
save mydata.mat a b
These variables can be returned to the workspace using load.
% return all variables to the workspace
load mydata.mat
% find which variables are contained in a MAT file
varNames = who("-file","mydata.mat");
% return only the second variable to the workspace
load("mydata.mat",varNames{2})
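As a small complement to the calls above, load can also return the variables as fields of a struct instead of placing them directly in the workspace, which avoids silently overwriting existing variables. The sketch below assumes a scratch file name (loadDemo.mat in the temporary folder) chosen only for illustration.

```matlab
% Save two variables to a scratch file (file name is an illustrative assumption)
fname = fullfile(tempdir,"loadDemo.mat");
a = pi;
b = rand(1,10);
save(fname,"a","b");

% Load everything into a struct: the variables become fields s.a and s.b
s = load(fname);

% Load only the variable "b"; the result is a struct with a single field
onlyB = load(fname,"b");

disp(fieldnames(s))
```

The struct form is also the one used inside loops and reader functions later in this post.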
MAT file versions
MAT files have evolved over time and several different versions exist. You can change the version to use when saving data by passing an additional flag, such as "-v7.3", to the save command. The biggest differences are summarized in the table below.
Version 7 is the default and should be used unless you need the additional functionality provided in Version 7.3. This is because, as mentioned in the documentation, Version 7.3 contains additional header information and may result in larger files than Version 7 when storing small amounts of data. Only create Version 6 or Version 4 MAT files if you need compatibility with older legacy applications.
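To see the effect of the version flag on a small variable, one option is to save the same data twice and compare the sizes on disk. This is only a sketch: the file names are made up for the example, and the exact byte counts will depend on your data and MATLAB release.

```matlab
% Save the same small variable as Version 7 and as Version 7.3
% (file names are illustrative assumptions)
data = rand(1,10);
f7  = fullfile(tempdir,"versionDemo_v7.mat");
f73 = fullfile(tempdir,"versionDemo_v73.mat");
save(f7,"data","-v7");
save(f73,"data","-v7.3");

% Compare the resulting file sizes on disk
info7  = dir(f7);
info73 = dir(f73);
fprintf("Version 7:   %d bytes\n",info7.bytes);
fprintf("Version 7.3: %d bytes\n",info73.bytes);
```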
When to store big data in MAT files
Most users analyzing large amounts of MAT file data did not choose the storage format themselves. But if you could choose, when would it make sense to store big data in MAT files? They are a good choice when the following three conditions apply:
- The data is originally recorded in MAT files. This occurs when saving variables from the MATLAB workspace, logging data from Simulink simulations, or recording data from certain third-party data loggers which generate MAT files automatically. If your data does not naturally come in MAT files, it usually should be left in its original format.
- The data is staying in the MATLAB ecosystem. MAT files are simple to use and are a lossless storage format, meaning that you will never lose any information or accuracy when storing MATLAB variables. However, since they are not easily accessible from other applications, it is typically better to use another file format (e.g. CSV or Parquet) when exchanging data with other applications.
- MAT files easily work in your file storage system. Some file systems impose additional requirements on files stored within them. For example, in the Hadoop Distributed File System (HDFS) it is difficult to use files that are not splittable, and MAT files are not splittable. In such situations, you should consider whether a different file format that supports the file system requirements would be a better choice.
Big data in MAT file situations
If all your MAT file data can be easily loaded into memory and analyzed at the same time, use the load command outlined at the beginning. For the rest of this post, we will explore the four general situations when MAT file data gets too large to work with at once.
- Large collections of small MAT files
- Large MAT files with many small variables
- Large MAT files with large variables
- MAT files logged from Simulink simulations
Large Collections of Small MAT Files
Often data is recorded from different entities (e.g. weather stations, vehicles, simulations, etc.) and each entity is stored in a separate file. Even if each individual MAT file can easily fit into memory, the total collection can grow large enough that we cannot work with all of it at once. When this happens, there are two solutions based on the type of analysis we need to do.
Embarrassingly Parallel Analysis
If the work we are doing is embarrassingly parallel, meaning that each file can be analyzed in isolation, then we can loop through the files one at a time. If Parallel Computing Toolbox is available, we can accelerate the process by using a parfor loop instead of a for loop.
% find .mat files in current directory
d = dir("*.mat");
% loop through with a for loop, or use parfor
parfor ii = 1:length(d)
    % load the next .mat file
    data = load(d(ii).name);
    % perform your analysis on each individual file
    % (doAnalysis is a placeholder for your own function)
    doAnalysis(data)
end
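In practice we usually want to keep one result per file. A minimal sketch of that pattern follows, assuming each file holds a numeric variable named "b" and that the per-file analysis is simply its mean; the folder and file names are invented for the example. The loop is written with for, but because each iteration only writes to its own element of fileMeans, it can be switched to parfor if Parallel Computing Toolbox is available.

```matlab
% Create three small example files (folder and file names are assumptions)
folder = fullfile(tempdir,"matLoopDemo");
if ~isfolder(folder)
    mkdir(folder);
end
for k = 1:3
    b = k*ones(1,10);
    save(fullfile(folder,"station"+k+".mat"),"b");
end

% Find the files and preallocate one result slot per file
d = dir(fullfile(folder,"*.mat"));
fileMeans = nan(numel(d),1);

% Analyze each file in isolation; this loop could also be a parfor loop
for ii = 1:numel(d)
    s = load(fullfile(d(ii).folder,d(ii).name),"b");
    fileMeans(ii) = mean(s.b);
end

disp(fileMeans)
```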
Inherently Sequential Analysis
When our files cannot be analyzed in isolation, we need to change our approach. The fileDatastore gives access to large collections of files by using a custom file reader function. For example, if your analysis only needs the variable "b" from your MAT files, you can use a reader function such as:
function data = myReader(fileName,varName)
    matData = load(fileName,varName);
    data = matData.(varName);
end
If your data is stored in more complicated or irregular formats, you can use any arbitrary code in your reader function to return the values in the format you need. Once we define our reader function, we can create a fileDatastore, which will read one file at a time using our reader function.
fds = fileDatastore("*.mat", "ReadFcn", @(fn) myReader(fn,"b"), "UniformRead", true);
Note that by default the fileDatastore will return each file's contents as an element in a cell array. The UniformRead option will instead keep the data's original format and vertically concatenate the data from different files.
After creating the datastore we can read a portion of the dataset with the read method or analyze the full out-of-memory dataset with tall arrays.
% read once from the datastore
t = read(fds);
% create a tall array
tall_t = tall(fds);
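Operations on a tall array are deferred until we ask for the result with gather. The sketch below shows the whole path from files to a gathered result. It assumes each file stores a 1-by-10 row vector "b", uses invented folder and file names, and replaces myReader with an equivalent anonymous function so that the example runs as a single script.

```matlab
% Create three small example files (folder and file names are assumptions)
folder = fullfile(tempdir,"tallDemo");
if ~isfolder(folder)
    mkdir(folder);
end
for k = 1:3
    b = k*ones(1,10);
    save(fullfile(folder,"part"+k+".mat"),"b");
end

% Anonymous stand-in for myReader: return only the variable "b"
readB = @(fn) getfield(load(fn,"b"),"b");
fds = fileDatastore(fullfile(folder,"*.mat"), ...
    "ReadFcn",readB,"UniformRead",true);

% Build a tall array and compute column means across all files
tall_b = tall(fds);
colMeans = gather(mean(tall_b,1));   % evaluation happens here
disp(colMeans)
```

Until gather is called, mean only records the operation, so several calculations can be queued and evaluated together in as few passes over the files as possible.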
Unlike the load command, fileDatastore also supports remote storage systems including Amazon S3, Azure Blob Storage and the Hadoop Distributed File System. For example, to use Amazon S3 make the following modifications:
setenv("AWS_ACCESS_KEY_ID", "YOUR_AWS_ACCESS_KEY")
setenv("AWS_SECRET_ACCESS_KEY", "YOUR_AWS_SECRET_ACCESS_KEY")
fds = fileDatastore("s3://bucketname/dataset/*.mat", "ReadFcn", @(fn) myReader(fn,"b"), "UniformRead", true);
However, note that the fileDatastore automatically makes a local copy of each file it reads, which may result in downloading the entire dataset when it is stored remotely. If this is problematic, consider rewriting your data to another file format so you can use a datastore that does not require local copies. This is discussed in more detail later in this post.
Large MAT Files with Many Small Variables
MAT files can individually be too large to load either because they have many small variables or have large variables. Files with many small variables arise when logging many signals from simulations or data loggers, or by adding more variables to a MAT file over time using the save command's -append option.
c = eye(10);
% add another variable to the file
save mydata.mat c -append
Use a Subset of Variables
When working with MAT files containing too many small variables to load all at once, one approach is to only load certain variables needed for your analysis as we did in the prior section. If this reduces the data needed from each individual file such that each call to the read method fits into memory, then we can use the example from the previous section to avoid running out of memory.
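One way to decide which variables to load is to inspect the file first with whos, which reports each variable's name, size, and in-memory bytes without loading any data. The sketch below keeps only variables under a size threshold; the file name, variable names, and the 10 kB threshold are assumptions made for the example.

```matlab
% Create a file with two small variables and one larger one
% (file and variable names are illustrative assumptions)
fname = fullfile(tempdir,"subsetDemo.mat");
small1 = rand(10,1);
small2 = rand(20,1);
big    = rand(1000,100);
save(fname,"small1","small2","big");

% Inspect the contents without loading any data
info = whos("-file",fname);

% Keep only the variables that need at most maxBytes in memory
maxBytes = 1e4;
keep = {info([info.bytes] <= maxBytes).name};

% Load just that subset into a struct
s = load(fname,keep{:});
disp(fieldnames(s))
```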
Use a Portion of All Variables
However, if the data from individual files is still too large to fit into memory even after selecting only the necessary variables, then we must try a different approach. In the prior section we used fileDatastore to read entire MAT files with a custom reader function. The fileDatastore also supports reading only parts of a file at a time. By adding logic to our reader function to manage the current state of reading through a large file, we can grab a portion of each variable.
Let's assume that in our collection of MAT files each file contains the same number of variables with the same names. Let's also assume all variables within a particular file are column vectors of the same length. We can then use matfile objects (described in more detail below) within the following reader function to partially read only a certain number of rows from each variable and concatenate them into a table.
function [data,readCounter,done] = partialReadFcn(filename,readCounter)
    % create MAT file object
    m = matfile(filename);
    % initialize readCounter (number of chunks already read from this file)
    if isempty(readCounter)
        readCounter = 0;
    end
    % default read size in number of rows
    readSize = 3e4;
    % number of rows in the column vectors
    arrayLength = size(m,"x",1);
    % first row of this chunk, computed before readSize is adjusted
    startRow = 1 + readSize*readCounter;
    if (arrayLength - readCounter*readSize) > readSize
        % if there's more left to read than readSize, we're not done...
        done = false;
    else
        % ...otherwise we are
        done = true;
        % adjust readSize to finish file
        readSize = arrayLength - readCounter*readSize;
    end
    readRange = startRow : (startRow + readSize - 1);
    readCounter = readCounter+1;
    % read portion of all variables
    varNames = who("-file",filename);
    data = nan(readSize,length(varNames));
    for ii = 1:length(varNames)
        data(:,ii) = m.(varNames{ii})(readRange,1);
    end
    data = array2table(data,"VariableNames",varNames);
end
x = rand(1e6,1);
y = rand(1e6,1);
z = rand(1e6,1);
save smallVars1.mat x y z -v7.3
save smallVars2.mat x y z -v7.3
fds_partial = fileDatastore("smallVars*.mat", "ReadFcn", @partialReadFcn, ...
    "UniformRead", true, "ReadMode", "partialfile");
% reads number of rows specified in reader function
t_partial = read(fds_partial);
size(t_partial)

ans =
       30000           3
The partial reading of fileDatastore lets you parse arbitrarily large files with an arbitrary reader function. If you want even more control over how a datastore processes a data source, consider using custom datastores. By using custom datastores you get access to the low-level tools that MathWorks developers use when developing datastores for new data sources. While they can be challenging to write from scratch, custom datastores give you complete control over how the datastore behaves.
Large MAT Files with Large Variables
MATFILE Objects
Up through Version 7 MAT files, individual variables are limited to 2GB in size. Version 7.3 removes this restriction, allowing variables to be arbitrarily large. MATLAB's matfile objects enable users to access and change variables stored in Version 7.3 MAT files without loading the entire variable into memory.
save mydata_7_3.mat a b c -v7.3
m = matfile("mydata_7_3.mat","Writable",true);
% read only first three rows of variable "c"
m.c(1:3,:)

ans =
     1     0     0     0     0     0     0     0     0     0
     0     1     0     0     0     0     0     0     0     0
     0     0     1     0     0     0     0     0     0     0

% write values from "b" to "c"
m.c(1:2,:) = [m.b(1,:); m.b(1,:)];
m.c(1:3,:)

ans =
  Columns 1 through 7
    0.3973    0.0812    0.5761    0.3502    0.3579    0.3944    0.0965
    0.3973    0.0812    0.5761    0.3502    0.3579    0.3944    0.0965
         0         0    1.0000         0         0         0         0
  Columns 8 through 10
    0.1076    0.8506    0.1651
    0.1076    0.8506    0.1651
         0         0         0
These matfile objects can then be used in a loop or combined with a fileDatastore as in the above example to process individual variables that are arbitrarily large.
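As a rough illustration of the loop approach, the sketch below walks through one large column vector in fixed-size chunks and accumulates a running sum, so that only one chunk is in memory at a time. The file name, the variable name "x", and the chunk size are assumptions chosen for the example.

```matlab
% Create a Version 7.3 file holding one column vector
% (file name, variable name and sizes are illustrative assumptions)
fname = fullfile(tempdir,"chunkLoopDemo.mat");
x = rand(1e5,1);
save(fname,"x","-v7.3");

% Open the file without loading the variable
m = matfile(fname);
nRows = size(m,"x",1);

% Process the variable one chunk of rows at a time
chunkSize = 3e4;
runningSum = 0;
for startRow = 1:chunkSize:nRows
    stopRow = min(startRow + chunkSize - 1, nRows);   % last chunk may be shorter
    runningSum = runningSum + sum(m.x(startRow:stopRow,1));
end
meanX = runningSum/nRows;
fprintf("Mean of x computed in chunks: %.6f\n",meanX);
```

Clamping stopRow with min handles a final chunk that is shorter than chunkSize.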
While matfile objects are easy to use, they have several limitations that restrict the situations where they can be used. The biggest restrictions include:
- Partial reading/writing of variables is only supported with Version 7.3 MAT files
- Partial reading/writing is not supported for some heterogeneous datatypes such as tables, meaning those datatypes must be read or written as whole variables
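Because partial access depends on the file version, it can help to check a file before relying on matfile. One possible check, sketched below, reads the descriptive text at the start of the file. It assumes that text begins with "MATLAB 7.3" for Version 7.3 files and "MATLAB 5.0" for Version 5 through 7 files. Treat it as a heuristic, not an official API; the file names are invented for the example.

```matlab
% Save the same variable in two versions (file names are assumptions)
data = rand(1,10);
f7  = fullfile(tempdir,"headerDemo_v7.mat");
f73 = fullfile(tempdir,"headerDemo_v73.mat");
save(f7,"data","-v7");
save(f73,"data","-v7.3");

% Read the first 10 bytes of each file as text
files = [string(f7) string(f73)];
isV73 = false(size(files));
for k = 1:numel(files)
    fid = fopen(files(k),"r");
    header = fread(fid,[1 10],"uint8=>char");
    fclose(fid);
    % Heuristic: Version 7.3 files are assumed to start with "MATLAB 7.3"
    isV73(k) = startsWith(header,"MATLAB 7.3");
end
disp(isV73)
```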
Rewriting MAT Files to Another Format
If matfile objects don't meet your needs, you could consider the custom datastores mentioned above or refactor the MAT files into another format. Rewriting your data from MAT files to another file format may make sense when you either:
- Need functionality not available with MAT files (e.g. splittability)
- Need to interchange data with other applications
- Need to work with remote datasets, and the local file requirements of fileDatastore are problematic
One such format is Parquet. Parquet is a columnar data storage format designed for the Hadoop ecosystem, though it can be used in any environment. Parquet files support splittability and fast I/O, and are a common data interchange format. They are typically kept relatively small, as they are meant to fit within the Hadoop Distributed File System's 128 MB block size.
As of R2019a MATLAB has built-in support for reading and writing Parquet files. As Parquet is designed for heterogeneous columnar data, it requires a table or timetable variable. You can interact with Parquet files from MATLAB using parquetread, parquetwrite, and parquetinfo.
parquetwrite("parquetData.parquet",t_partial)
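For a self-contained look at the three functions together, the sketch below writes a small table, inspects the file with parquetinfo, and reads it back. The file name and table contents are made up for the example, and it assumes a release with Parquet support (R2019a or later).

```matlab
% Build a small table and write it to a Parquet file
% (file name and table contents are illustrative assumptions)
fname = fullfile(tempdir,"roundTripDemo.parquet");
t = table((1:5)',rand(5,1),'VariableNames',{'id','value'});
parquetwrite(fname,t);

% Inspect the file metadata without reading the data
info = parquetinfo(fname);
disp(info.VariableNames)

% Read the whole file back, then only one column
t2 = parquetread(fname);
valueOnly = parquetread(fname,"SelectedVariableNames","value");
```

Reading a single column this way is where a columnar format helps, since the other columns do not need to be read.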
If you need to do some processing using tall arrays before rewriting the data, the tall array results can be written directly to Parquet using the tall array write command.
tall_partial = tall(fds_partial);
write("data/p*.parquet",tall_partial,"FileType","parquet")

Writing tall data to folder C:\Work\ArtofMATLAB\AdamF\largeMATfiles\data
Evaluating tall expression using the Local MATLAB Session:
- Pass 1 of 1: Completed in 3.9 sec
Evaluation completed in 4 sec
Once you have rewritten your data to Parquet, you can use the full Parquet dataset with the parquetDatastore. Unlike the fileDatastore, the parquetDatastore does not require making a local copy of each file.
pds = parquetDatastore("data\*.parquet");
t_parquet = read(pds);
tall_parquet = tall(pds);
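The same pattern can be tried end to end on a throwaway dataset. The sketch below writes three small Parquet files, points a parquetDatastore at the folder, and gathers a mean through a tall table; the folder name, file names, and the column name "x" are assumptions for the example.

```matlab
% Write three small Parquet files to a scratch folder
% (folder, file and column names are illustrative assumptions)
folder = fullfile(tempdir,"parquetDsDemo");
if ~isfolder(folder)
    mkdir(folder);
end
for k = 1:3
    t = table(k*ones(4,1),'VariableNames',{'x'});
    parquetwrite(fullfile(folder,"part"+k+".parquet"),t);
end

% Treat the folder as one dataset
pds = parquetDatastore(folder);

% Compute the mean of column x across all files via a tall table
tall_x = tall(pds);
meanX = gather(mean(tall_x.x));
fprintf("Mean of x across all files: %g\n",meanX);
```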
Switching to another file format does come with its own concerns:
- Large amounts of data are being duplicated, which can consume large amounts of both time and disk space.
- Some information may be lost when writing to a new format. For example, Parquet files do not preserve the timezone property of datetime values. To maintain the timezone information, you must manually save the information to another variable in your table or timetable before writing it to the Parquet file.
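For the time zone case, one possible workaround is sketched below: copy the time zone name into its own string column before writing, so that it travels with the data. The table contents, the column name "TimeZone", and the file name are assumptions for the example. How you reapply the zone after reading depends on how your release returns the datetime values, so that step is left out here.

```matlab
% Build a table with zoned datetime values
% (table contents and file name are illustrative assumptions)
fname = fullfile(tempdir,"timezoneDemo.parquet");
Time  = datetime(2019,1,1:3,'TimeZone','America/New_York')';
Value = rand(3,1);
t = table(Time,Value);

% Store the time zone name in a separate string column before writing
t.TimeZone = repmat(string(t.Time.TimeZone),height(t),1);
parquetwrite(fname,t);

% After reading, the zone name is still available as ordinary data
t2 = parquetread(fname);
disp(t2.TimeZone(1))
```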
MAT Files Logged from Simulink Simulations
The situations discussed above can arise from many different data sources. One special source of large quantities of MAT file data is data logged from Simulink simulations. In this post we treat this situation differently as MAT files that come from Simulink logging:
- Store their data in a special, nested data structure within the MAT file
- Can use a simulationDatastore specifically designed for working with large amounts of MAT files logged from Simulink
A simulationDatastore enables a Simulink model to interact with big data. You can load big data as simulation input and log big output data from a simulation. The documentation page for Working with Big Data for Simulations contains details of creating and using data logged from Simulink. The general idea is to start with generating the data from Simulink:
% load Simulink model
load_system("sldemo_fuelsys")
% turn on data logging
set_param("sldemo_fuelsys","LoggingToFile","on")
% run model and log data
sim("sldemo_fuelsys")
% close model without saving
close_system("sldemo_fuelsys",0)
Once your data is logged, use DatasetRef to access the simulationDatastores.
DSRef = Simulink.SimulationData.DatasetRef("out.mat","sldemo_fuelsys_output");
% return a simulationDatastore for fuel signal
ds = DSRef.getAsDatastore("fuel").Values;
% take a single read of the fuel signal from the MAT file
t_sim = read(ds);
% treat all the fuel data as a tall variable
tall_t_sim = tall(ds);
Summary
In this post we explored several different situations and solutions when dealing with big data in MAT files. Many of the solutions we explored can be mixed and combined together. General recommendations for which tool to start with are summarized in the figure below. Leave a comment here and let us know what enhancements you would like to see in the next version of MAT files.