Working efficiently with data: Parquet files and the Needle in a Haystack problem

Posted by Mike Croucher, May 5, 2023


This is a guest post by Onomitra Ghosh who is the Product Manager for MATLAB Data Analysis and Big Data workflows.

How big does data need to be before it can be called big? While there is no definite number beyond which data can be called big data, it's safe to say that when it becomes uncomfortably large for the memory of your computer, it is getting big. One way to tackle big data is to throw a lot of processing power at it (either locally or in the cloud). But quite often, when working with big data, we do not need all of it for our analysis. So, a smarter first step is to read only the data that is needed. This is especially true for the "needle in a haystack" problem, where we need to find a small slice of information in an ocean of data. In this blog, we will show how storing big data in Parquet files and filtering it at read time can make the subsequent analysis more efficient than storing and reading the data from conventional file formats such as CSV and spreadsheets.

A quick introduction to Parquet

Before we get too far ahead, let's do a quick introduction to the Parquet file format. Parquet is an open-source, column-oriented data storage format developed and maintained as part of the Apache Software Foundation. Usage of Parquet files for big data has been growing steadily since its inception because they store and read data very efficiently.

At the lowest level, a Parquet file stores data in a columnar format. Unlike more traditional row-based storages, Parquet files store data of each column together. As a result, if we want to read specific columns, we can read a contiguous set of data instead of spending time skipping and seeking or loading the whole file and then selecting specific columns in memory. Selecting and reading specific columns is also called projection pushdown.
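As a minimal sketch of projection pushdown, the snippet below writes a tiny synthetic table (made-up values, not the flight data) to a temporary Parquet file and reads back only two of its three columns. The file name and table contents are assumptions chosen for illustration.

```matlab
% Small synthetic table standing in for real data (hypothetical values).
T = table((1:5)', rand(5,1), rand(5,1), ...
    'VariableNames', ["TripNo","LATP","LONP"]);

% Write the table to a temporary Parquet file.
demoFile = string(tempname) + ".parquet";
parquetwrite(demoFile, T);

% Request only two columns; the TripNo column is not imported.
subset = parquetread(demoFile, "SelectedVariableNames", ["LATP","LONP"]);
disp(subset.Properties.VariableNames)

% Clean up the temporary file.
delete(demoFile);
```

Because the columns are stored separately on disk, selecting them at read time avoids loading the rest of the table first and discarding it afterwards.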

Along with reading specific columns, often we also only need a subset of rows for our analysis. To support this, Parquet allows writing the data into row groups. While within each row group data is laid out in a columnar format, the row groups themselves separate out the slices of data that can be useful for filtering based on specific row values.
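To make row groups concrete, here is a small sketch that writes a 10-row synthetic table into three row groups and then reads just one of them. It assumes a MATLAB release in which parquetwrite accepts RowGroupHeights and parquetread accepts RowGroups (R2022a or later, to the best of my knowledge); the group sizes are arbitrary.

```matlab
% Synthetic 10-row table (hypothetical data).
T = table((1:10)', (10:-1:1)', 'VariableNames', ["ID","Value"]);

% Split the rows into three row groups of 4, 4 and 2 rows.
rgFile = string(tempname) + ".parquet";
parquetwrite(rgFile, T, "RowGroupHeights", [4 4 2]);

% The file metadata reports how the rows were grouped.
info = parquetinfo(rgFile);
disp(info.NumRowGroups)
disp(info.RowGroupHeights)

% Read only the second row group (rows 5 to 8 of the original table).
second = parquetread(rgFile, "RowGroups", 2);
disp(second)

% Clean up the temporary file.
delete(rgFile);
```

How rows are assigned to groups matters for filtering: a reader can skip a whole group only when the group's metadata shows that none of its rows can match.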

In addition, each Parquet file maintains a set of metadata about the file, each row group, and each column. This may contain information about data type, codec, number of values, read offset, etc. This metadata helps to quickly seek the right set of rows to read for a specific filtering condition, which is also called predicate pushdown. These pushdown capabilities help to filter the data while reading from the file instead of reading the entire dataset into memory and then filtering it. Last but not least, Parquet also supports efficient compression techniques. Together, these capabilities help in storing large amounts of data with a relatively small footprint and reading only the values that are needed.
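A minimal sketch of read-time row filtering, using MATLAB's rowfilter object (R2022a or later) with parquetread. The six-row table is made up for illustration, and the bounds are written as literals so the snippet runs on its own.

```matlab
% Synthetic positions (hypothetical values, not the flight data).
T = table((1:6)', ...
    [41.5; 42.1; 42.5; 43.0; 42.3; 40.0], ...
    [-71.0; -71.5; -72.0; -71.2; -69.0; -71.1], ...
    'VariableNames', ["TripNo","LATP","LONP"]);

pfFile = string(tempname) + ".parquet";
parquetwrite(pfFile, T);

% Describe the filter symbolically; no data is read at this point.
rf = rowfilter(["LATP","LONP"]);
condition = rf.LATP > 42.03 & rf.LATP < 42.72 & ...
            rf.LONP > -73.37 & rf.LONP < -70.06;

% The filter is applied while reading, so only matching rows are returned.
hits = parquetread(pfFile, "RowFilter", condition, ...
    "SelectedVariableNames", ["TripNo","LATP","LONP"]);
disp(hits)

% Clean up the temporary file.
delete(pfFile);
```

Combining RowFilter with SelectedVariableNames applies both pushdowns in a single read.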

In this blog, we are going to focus on how writing out flat tabular data into Parquet files using row groups can help in faster data analysis. To keep things simple, we will not explore any kind of parallelization (that's for a future blog), although parallelization using Parallel Computing Toolbox can make the code run even faster. Today, we will measure and highlight the performance impact of simply using Parquet files instead of CSV while analyzing data with just your desktop machine's resources.

Dataset

The data we are going to use is a modified subset of the Flight Data from Dashlink. We are going to work with the flight data for Tail 660. The dataset contains information on 4578 unique flights flown by this plane between 2001 and 2010. The data for each flight is stored in a separate CSV file. Individually, none of these files is very big. But together there are about 22 million rows of flight information contained in these 4578 files. Also, each file has different flight sensor information stored in its 87 columns. The entire dataset is about 15 GB in memory. While this is not terabytes or petabytes of data, it is still big enough to make analyzing it on a standard computer uncomfortable and frustrating.

Problem statement

For our analysis problem, we want to find if any of these 4578 flights flew over Massachusetts. Two of the flight sensors record the latitude and longitude information. We can use this information to track the flight path for each of these flights. Let's read a sample file and plot the flight path.

exampleCSVFile = "data\csv\660200110191613_1HZ.csv";

tcsv = readtable(exampleCSVFile);

head(tcsv)

     TripNo                Time               ABRK     ACMT    AIL_1     AIL_2     ALTS    APFD    ATEN    A_T    BLV    BPGR_1    BPGR_2    BPYR_1    BPYR_2    CALT    CASS     CRSS      DFGS    DWPT     EAI    ELEV_1    ELEV_2    EVNT    FADF    FADS    FGC3    FIRE_1    FIRE_2    FIRE_3    FIRE_4    FLAP    FQTY_1    FQTY_2    FQTY_3    FQTY_4      GLS       GPWS     HDGS      HF1    HF2    HYDG    HYDY      ILSF        LATP     LGDN    LGUP    LMOD      LOC       LONP     MNS    MRK    MW    N1CO    OIPL    OIP_1    OIP_2    OIP_3    OIP_4    OIT_1     OIT_2     OIT_3     OIT_4     PACK    PH    POVT     PTRM     PUSH    SAT    SMKB    SMOK    SNAP    SPLG    SPLY    SPL_1     SPL_2    TAI    TAT     TCAS    TMAG    TMODE    VHF1    VHF2    VHF3    VMODE    VSPS     WAI_1    WAI_2    WOW    WSHR
    _________    ________________________    ______    ____    ______    ______    ____    ____    ____    ___    ___    ______    ______    ______    ______    ____    ____    _______    ____    _____    ___    ______    ______    ____    ____    ____    ____    ______    ______    ______    ______    ____    ______    ______    ______    ______    ________    ____    _______    ___    ___    ____    ____    _________    ______    ____    ____    ____    ________    _____    ___    ___    __    ____    ____    _____    _____    _____    _____    ______    ______    ______    ______    ____    __    ____    ______    ____    ___    ____    ____    ____    ____    ____    ______    _____    ___    ____    ____    ____    _____    ____    ____    ____    _____    _____    _____    _____    ___    ____

    6.602e+14    19-Oct-2001 16:13:18.000    119.32     60     80.422    80.668    4000     2       0       1      0       0        34.18    58.594      0        0      134     -70.048     1      32924     0     17.551    65.979     1       15      15     120       0         0         0         0        98      8216      4464       0        8048     -0.00195     1      -131.04     1      1      0       0      1.189e+06    36.897     0       1       12     -0.36868    -76.2    201     7     0      1       0        0        0        0        0      26.361    25.018    18.302    20.989     2      1      0      3.7082     1      2.5     1       0       1       1       1      72.275    72.41     0        3     0       1        2       1       1       1       11      -2000      0        0       0      1  
    6.602e+14    19-Oct-2001 16:13:19.000    119.32     59     80.402    80.688    4000     2       0       1      0       0       29.297    39.063      0        0      134     -70.048     1      32924     0     17.571    65.959     1       15      15     120       0         0         0         0        98      8216      4464       0        8048     -0.03939     1      -131.04     1      1      0       0      1.189e+06    36.897     0       1       12     -0.37181    -76.2    201     7     0      1       0        0        0        0        0      26.361    26.361    19.645    20.989     2      1      0      3.7082     1      2.5     1       0       1       1       1      72.275    72.41     0        3     0       1        2       1       1       1       11      -2000      0        0       0      1  
    6.602e+14    19-Oct-2001 16:13:20.000    119.31     60     80.402    80.668    4000     2       0       1      0       0        34.18    43.945      0        0      134     -70.048     1      32924     0     17.571    65.979     1       15      15     120       0         0         0         0        98      8216      4464       0        8048     -0.00819     1      -131.04     1      1      0       0      1.189e+06    36.897     0       1       12     -0.37162    -76.2    201     7     0      1       0        0        0        0        0      26.361    26.361    19.645    20.989     2      1      0      3.7287     1      2.5     1       0       1       1       1      72.275    72.41     0        3     0       1        2       1       1       1       11      -2000      0        0       0      1  
    6.602e+14    19-Oct-2001 16:13:21.000    119.32     60     80.402    80.668    4000     2       0       1      0       0        34.18    43.945      0        0      134     -70.048     1      32924     0     17.551    65.959     1       15      15     120       0         0         0         0        98      8216      4464       0        8048     -0.00156     1      -131.04     1      1      0       0      1.189e+06    36.897     0       1       12     -0.37024    -76.2    201     7     0      1       0        0        0        0        0      26.361    26.361    19.645    22.332     2      1      0      3.7287     1      2.5     1       0       1       1       1      72.275     72.4     0     2.75     0       1        2       1       1       1       11      -2000      0        0       0      1  
    6.602e+14    19-Oct-2001 16:13:22.000    119.31     60     80.402    80.688    4000     2       0       1      0       0        34.18    43.945      0        0      134     -70.048     1      32924     0     17.571    65.959     1       15      15     120       0         0         0         0        98      8216      4464       0        8048     -0.00156     1      -131.04     1      1      0       0      1.189e+06    36.897     0       1       12     -0.36593    -76.2    201     7     0      1       0        0        0        0        0      26.361    26.361    19.645    22.332     2      1      0      3.7082     1      2.5     1       0       1       1       1      72.275     72.4     0     2.75     0       1        2       1       1       1       11      -2000      0        0       0      1  
    6.602e+14    19-Oct-2001 16:13:23.000    119.31     60     80.402    80.668    4000     2       0       1      0       0       29.297    43.945      0        0      134     -70.048     1      32924     0     17.571    65.959     1       15      15     120       0         0         0         0        98      8216      4464       0        8048      0.00585     1      -131.04     1      1      0       0      1.189e+06    36.897     0       1       12     -0.36907    -76.2    201     7     0      1       0        0        0        0        0      26.361    26.361    19.645    22.332     2      1      0      3.7287     1      2.5     1       0       1       1       1      72.275     72.4     0     2.75     0       1        2       1       1       1       11      -2000      0        0       0      1  
    6.602e+14    19-Oct-2001 16:13:24.000    119.32     59     80.402    80.668    4000     2       0       1      0       0        34.18    43.945      0        0      134     -70.048     1          0     0     17.571    65.959     1       15      15     120       0         0         0         0        98      8216      4464       0        8048       0.0039     1      -131.04     1      1      0       0      1.189e+06    36.897     0       1       12     -0.36907    -76.2    201     7     0      1       0        0        0        0        0      26.361    26.361    19.645    22.332     2      1      0      3.7287     1      2.5     1       0       1       1       1      72.275    72.41     0     2.75     0       1        2       1       1       1       11      -2000      0        0       0      1  
    6.602e+14    19-Oct-2001 16:13:25.000    119.31     60     80.402    80.668    4000     2       0       1      0       0       29.297    43.945      0        0      134     -70.048     1      32928     0     17.592    65.959     1       15      15     120       0         0         0         0        98      8216      4464       0        8048      0.01092     1      -131.04     1      1      0       0      1.189e+06    36.897     0       1       12     -0.36672    -76.2    201     7     0      1       0        0        0        0        0      27.704    25.018    19.645    22.332     2      1      0      3.7082     1      2.5     1       0       1       1       1      72.266    72.41     0     2.75     0       1        2       1       1       1       11      -2000      0        0       0      1  

We use geoplot to plot the flight path.

geoplot(tcsv.LATP,tcsv.LONP,'Color','b','LineWidth',3);

geobasemap colorterrain

Looks like this particular flight flew from Norfolk, VA to Minneapolis, MN. To find flights that flew over Massachusetts, we have to compare each flight's path (from each of those 4578 files!) against Massachusetts' coordinates. We will assume the following as a rough estimate of Massachusetts latitude and longitude.

MALatMin = 42.03;

MALatMax = 42.72;

MALonMin = -73.37;

MALonMax = -70.06;

MACoordinates = [MALatMin, MALatMax, MALonMin, MALonMax];

By the way, this is a true "needle in a haystack" problem. Only 301 of the 22 million flight records (about 0.001%), spread over these 4578 files, have a latitude and longitude inside the Massachusetts bounds. Our exercise will be to find these 301 rows as efficiently as possible.
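The bounding-box test and the size of the "needle" can be sketched in a few lines. The helper name inBox and the Boston coordinates are assumptions added for illustration; the other two sample points are the first positions shown in the table previews in this post.

```matlab
% Bounding-box test; c = [latMin, latMax, lonMin, lonMax].
% inBox is a hypothetical helper name, not part of the original code.
inBox = @(lat, lon, c) lat > c(1) & lat < c(2) & lon > c(3) & lon < c(4);

MACoordinates = [42.03, 42.72, -73.37, -70.06];

% Sample points: Boston (approximate), then two positions from the previews.
lat = [42.36; 36.897; 44.882];
lon = [-71.06; -76.2; -93.203];

overMA = inBox(lat, lon, MACoordinates);
disp(overMA')

% Share of matching rows, using the approximate counts quoted above.
needlePercent = 301 / 22e6 * 100;
fprintf("Matching rows: %.4f%% of the dataset\n", needlePercent);
```

Note that a rectangular box is only a rough stand-in for the state's outline, which is why the text calls these values a rough estimate.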

CSV vs. Parquet: Single file

Let's start by understanding the differences between csv and Parquet a little bit. We have already looked at how to read the csv file into a MATLAB table. Now let's try the same with Parquet. First, let's write out this flight information in a Parquet file. The simplest way to write to a Parquet file is by calling the parquetwrite function. We can then check the information of the Parquet file using parquetinfo.

exampleParquetFile = "exampleparquet.parquet";

parquetwrite(exampleParquetFile, tcsv); 

parquetinfo(exampleParquetFile)

ans = 
  ParquetInfo with properties:

               Filename: "C:\Blogs\Introduction to Parquet\exampleparquet.parquet"
               FileSize: 703457
           NumRowGroups: 1
        RowGroupHeights: 14136
          VariableNames: ["TripNo"    "Time"    "ABRK"    "ACMT"    "AIL_1"    "AIL_2"    "ALTS"    "APFD"    "ATEN"    "A_T"    "BLV"    "BPGR_1"    "BPGR_2"    "BPYR_1"    "BPYR_2"    "CALT"    "CASS"    "CRSS"    "DFGS"    "DWPT"    "EAI"    …    ]
          VariableTypes: ["double"    "datetime"    "double"    "double"    "double"    "double"    "double"    "double"    "double"    "double"    "double"    "double"    "double"    "double"    "double"    "double"    "double"    "double"    …    ]
    VariableCompression: ["snappy"    "snappy"    "snappy"    "snappy"    "snappy"    "snappy"    "snappy"    "snappy"    "snappy"    "snappy"    "snappy"    "snappy"    "snappy"    "snappy"    "snappy"    "snappy"    "snappy"    "snappy"    …    ]
       VariableEncoding: ["dictionary"    "dictionary"    "dictionary"    "dictionary"    "dictionary"    "dictionary"    "dictionary"    "dictionary"    "dictionary"    "dictionary"    "dictionary"    "dictionary"    "dictionary"    …    ]
                Version: "2.0"

Let's read the file back. We can use parquetread function to read data from Parquet files into a MATLAB table.

tparquet = parquetread(exampleParquetFile);

head(tparquet)

     TripNo              Time             ABRK     ACMT    AIL_1     AIL_2     ALTS    APFD    ATEN    A_T    BLV    BPGR_1    BPGR_2    BPYR_1    BPYR_2    CALT    CASS     CRSS      DFGS    DWPT     EAI    ELEV_1    ELEV_2    EVNT    FADF    FADS    FGC3    FIRE_1    FIRE_2    FIRE_3    FIRE_4    FLAP    FQTY_1    FQTY_2    FQTY_3    FQTY_4      GLS       GPWS     HDGS      HF1    HF2    HYDG    HYDY      ILSF        LATP     LGDN    LGUP    LMOD      LOC       LONP     MNS    MRK    MW    N1CO    OIPL    OIP_1    OIP_2    OIP_3    OIP_4    OIT_1     OIT_2     OIT_3     OIT_4     PACK    PH    POVT     PTRM     PUSH    SAT    SMKB    SMOK    SNAP    SPLG    SPLY    SPL_1     SPL_2    TAI    TAT     TCAS    TMAG    TMODE    VHF1    VHF2    VHF3    VMODE    VSPS     WAI_1    WAI_2    WOW    WSHR
    _________    ____________________    ______    ____    ______    ______    ____    ____    ____    ___    ___    ______    ______    ______    ______    ____    ____    _______    ____    _____    ___    ______    ______    ____    ____    ____    ____    ______    ______    ______    ______    ____    ______    ______    ______    ______    ________    ____    _______    ___    ___    ____    ____    _________    ______    ____    ____    ____    ________    _____    ___    ___    __    ____    ____    _____    _____    _____    _____    ______    ______    ______    ______    ____    __    ____    ______    ____    ___    ____    ____    ____    ____    ____    ______    _____    ___    ____    ____    ____    _____    ____    ____    ____    _____    _____    _____    _____    ___    ____

    6.602e+14    19-Oct-2001 16:13:18    119.32     60     80.422    80.668    4000     2       0       1      0       0        34.18    58.594      0        0      134     -70.048     1      32924     0     17.551    65.979     1       15      15     120       0         0         0         0        98      8216      4464       0        8048     -0.00195     1      -131.04     1      1      0       0      1.189e+06    36.897     0       1       12     -0.36868    -76.2    201     7     0      1       0        0        0        0        0      26.361    25.018    18.302    20.989     2      1      0      3.7082     1      2.5     1       0       1       1       1      72.275    72.41     0        3     0       1        2       1       1       1       11      -2000      0        0       0      1  
    6.602e+14    19-Oct-2001 16:13:19    119.32     59     80.402    80.688    4000     2       0       1      0       0       29.297    39.063      0        0      134     -70.048     1      32924     0     17.571    65.959     1       15      15     120       0         0         0         0        98      8216      4464       0        8048     -0.03939     1      -131.04     1      1      0       0      1.189e+06    36.897     0       1       12     -0.37181    -76.2    201     7     0      1       0        0        0        0        0      26.361    26.361    19.645    20.989     2      1      0      3.7082     1      2.5     1       0       1       1       1      72.275    72.41     0        3     0       1        2       1       1       1       11      -2000      0        0       0      1  
    6.602e+14    19-Oct-2001 16:13:20    119.31     60     80.402    80.668    4000     2       0       1      0       0        34.18    43.945      0        0      134     -70.048     1      32924     0     17.571    65.979     1       15      15     120       0         0         0         0        98      8216      4464       0        8048     -0.00819     1      -131.04     1      1      0       0      1.189e+06    36.897     0       1       12     -0.37162    -76.2    201     7     0      1       0        0        0        0        0      26.361    26.361    19.645    20.989     2      1      0      3.7287     1      2.5     1       0       1       1       1      72.275    72.41     0        3     0       1        2       1       1       1       11      -2000      0        0       0      1  
    6.602e+14    19-Oct-2001 16:13:21    119.32     60     80.402    80.668    4000     2       0       1      0       0        34.18    43.945      0        0      134     -70.048     1      32924     0     17.551    65.959     1       15      15     120       0         0         0         0        98      8216      4464       0        8048     -0.00156     1      -131.04     1      1      0       0      1.189e+06    36.897     0       1       12     -0.37024    -76.2    201     7     0      1       0        0        0        0        0      26.361    26.361    19.645    22.332     2      1      0      3.7287     1      2.5     1       0       1       1       1      72.275     72.4     0     2.75     0       1        2       1       1       1       11      -2000      0        0       0      1  
    6.602e+14    19-Oct-2001 16:13:22    119.31     60     80.402    80.688    4000     2       0       1      0       0        34.18    43.945      0        0      134     -70.048     1      32924     0     17.571    65.959     1       15      15     120       0         0         0         0        98      8216      4464       0        8048     -0.00156     1      -131.04     1      1      0       0      1.189e+06    36.897     0       1       12     -0.36593    -76.2    201     7     0      1       0        0        0        0        0      26.361    26.361    19.645    22.332     2      1      0      3.7082     1      2.5     1       0       1       1       1      72.275     72.4     0     2.75     0       1        2       1       1       1       11      -2000      0        0       0      1  
    6.602e+14    19-Oct-2001 16:13:23    119.31     60     80.402    80.668    4000     2       0       1      0       0       29.297    43.945      0        0      134     -70.048     1      32924     0     17.571    65.959     1       15      15     120       0         0         0         0        98      8216      4464       0        8048      0.00585     1      -131.04     1      1      0       0      1.189e+06    36.897     0       1       12     -0.36907    -76.2    201     7     0      1       0        0        0        0        0      26.361    26.361    19.645    22.332     2      1      0      3.7287     1      2.5     1       0       1       1       1      72.275     72.4     0     2.75     0       1        2       1       1       1       11      -2000      0        0       0      1  
    6.602e+14    19-Oct-2001 16:13:24    119.32     59     80.402    80.668    4000     2       0       1      0       0        34.18    43.945      0        0      134     -70.048     1          0     0     17.571    65.959     1       15      15     120       0         0         0         0        98      8216      4464       0        8048       0.0039     1      -131.04     1      1      0       0      1.189e+06    36.897     0       1       12     -0.36907    -76.2    201     7     0      1       0        0        0        0        0      26.361    26.361    19.645    22.332     2      1      0      3.7287     1      2.5     1       0       1       1       1      72.275    72.41     0     2.75     0       1        2       1       1       1       11      -2000      0        0       0      1  
    6.602e+14    19-Oct-2001 16:13:25    119.31     60     80.402    80.668    4000     2       0       1      0       0       29.297    43.945      0        0      134     -70.048     1      32928     0     17.592    65.959     1       15      15     120       0         0         0         0        98      8216      4464       0        8048      0.01092     1      -131.04     1      1      0       0      1.189e+06    36.897     0       1       12     -0.36672    -76.2    201     7     0      1       0        0        0        0        0      27.704    25.018    19.645    22.332     2      1      0      3.7082     1      2.5     1       0       1       1       1      72.266    72.41     0     2.75     0       1        2       1       1       1       11      -2000      0        0       0      1  

The CSV file is over 8 MB!

s = dir(exampleCSVFile); 

disp("CSV file size: " + s.bytes/1024/1024 + "MB")

CSV file size: 8.3831MB

The Parquet file, in contrast, is much smaller at only 0.67 MB.

s = dir(exampleParquetFile); 

disp("Parquet file size: " + s.bytes/1024/1024 + "MB")

Parquet file size: 0.67087MB

We can immediately observe one of the benefits of using Parquet. Simply writing this dataset as a Parquet file reduces the file size on disk by approximately 12x compared to CSV. Let's also compare how long each file takes to read.

tstart = tic;readtable(exampleCSVFile);tend = toc(tstart)

tend = 0.4944

tstart = tic;parquetread(exampleParquetFile);tend = toc(tstart)

tend = 0.0814

The read is also about 6x faster from Parquet than from CSV. Note that the exact speedup and file size will depend on many factors, including the kind of data contained in the files and how many row groups are formed.
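A single tic/toc measurement can be noisy, so one option is to repeat the comparison with timeit, which calls the function several times and reports a typical time. The sketch below uses a small random table as a stand-in (an assumption; random numbers compress poorly, so the size and speed ratios will not match the flight data).

```matlab
% Synthetic numeric table (hypothetical data).
T = array2table(rand(1000, 5), 'VariableNames', ["A","B","C","D","E"]);

% Write the same table in both formats to temporary files.
csvFile = string(tempname) + ".csv";
pqFile  = string(tempname) + ".parquet";
writetable(T, csvFile);
parquetwrite(pqFile, T);

% timeit runs each function handle multiple times for a more stable estimate.
tCsv     = timeit(@() readtable(csvFile));
tParquet = timeit(@() parquetread(pqFile));

% Compare sizes on disk as well.
csvInfo = dir(csvFile);
pqInfo  = dir(pqFile);
fprintf("CSV:     %d bytes, %.4f s\n", csvInfo.bytes, tCsv);
fprintf("Parquet: %d bytes, %.4f s\n", pqInfo.bytes, tParquet);

% Clean up the temporary files.
delete(csvFile);
delete(pqFile);
```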

Datastore, Tall and Rowfilter

While reading data from a single Parquet file was faster, the real benefit of using Parquet files is realized when the dataset is big. But we cannot use readtable and parquetread when working with data spread over multiple files like this one. Instead, we will use the following functionalities in MATLAB that can help in working with big and distributed data.

1. Datastores are a way to read big and distributed data in MATLAB. A datastore does not directly read the data. Instead, it holds information about which files to read and how to read them when requested. This is very helpful in iteratively reading a large collection of files instead of loading all of it at once. MATLAB has datastores for various file formats, including text, spreadsheet, image, Parquet, and even custom format files. For a full list of types of datastores, see documentation on available datastores. In this exercise we will use tabularTextDatastore for processing the csv files and parquetDatastore for the parquet files.

2. Tall is a lazy evaluation framework in MATLAB that is built on top of a datastore. The advantage of using tall is that we can take the same code that runs on regular MATLAB tables and run it on a tall table backed by the datastore. We don't need to learn a lot of new syntax and functions to work with bigger data. But, because of its lazy evaluation, tall code is not executed immediately. Instead, when the gather function is called, tall reads the data from the underlying datastore and runs the operations as the data is read. Tall supports hundreds of data analysis and machine learning functions that work with MATLAB tables and timetables. Tall tables can also be indexed like regular tables to find the relevant rows and columns.

3. Rowfilter is a relatively new concept in MATLAB, introduced in R2022a. A rowfilter is a MATLAB object that helps to specify which rows to import. For Parquet files, rowfilter takes advantage of predicate pushdown and filters the rows at the file level using Parquet's row group information. Rowfilters can be specified in parquetread and parquetDatastore. But the nice thing about using tall is that we don't need to specify a rowfilter explicitly. If we index tall tables to find relevant rows and columns (like a regular MATLAB table), tall automatically uses rowfilter to perform read-time filtering on Parquet data without us having to write any additional code.
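The three pieces above can be sketched together on a tiny, self-contained example. The folder, file names, and values below are assumptions made up for illustration; the real dataset is far larger.

```matlab
% Create a temporary folder with two small Parquet files (hypothetical data).
demoDir = string(tempname);
mkdir(demoDir);
for k = 1:2
    T = table(repmat(k, 3, 1), [41.0; 42.2; 43.5], [-71.0; -71.5; -72.0], ...
        'VariableNames', ["TripNo","LATP","LONP"]);
    parquetwrite(fullfile(demoDir, "flight" + k + ".parquet"), T);
end

% The datastore only records which files to read; no data is loaded yet.
pds = parquetDatastore(demoDir);

% Index the tall table exactly like a regular table. Evaluation is deferred.
tt = tall(pds);
inBand = tt(tt.LATP > 42.03 & tt.LATP < 42.72, ["TripNo","LATP"]);

% gather triggers the read and returns an ordinary in-memory table.
result = gather(inBand);
disp(result)

% Clean up the temporary folder and its files.
rmdir(demoDir, "s");
```

Nothing in this code mentions rowfilter; the row and column selection is written as plain table indexing on the tall table.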

Data reading strategies

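For this exercise, we will consider 3 different data reading strategies. For each of them, we will first create a datastore (ds) and then a tall table (tds) on the datastore. We will then use tds to index and read the data (shown in the following code examples). In these snippets, coordinates holds the bounds in the same order as MACoordinates: minimum latitude, maximum latitude, minimum longitude, maximum longitude.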

Read strategy 1: First read the entire data into memory and then find the relevant flights over MA. (Of course, this will be slow and ill-advised, but it is included for comparison's sake.)

allData = gather(tds);

flightsOverMA = allData(allData.LATP>coordinates(1) & allData.LATP<coordinates(2) & allData.LONP>coordinates(3) & allData.LONP<coordinates(4),["TripNo","LATP","LONP"]);

Read strategy 2: Filter the data for the matching latitudes and longitudes at read-time and read all columns for the filtered rows. Then choose the relevant columns.

flightsOverMAtall = tds(tds.LATP>coordinates(1) & tds.LATP<coordinates(2) & tds.LONP>coordinates(3) & tds.LONP<coordinates(4),:); 

flightsOverMAAllColumns = gather(flightsOverMAtall);

flightsOverMA = flightsOverMAAllColumns(:,["TripNo","LATP","LONP"])

Read strategy 3: Read only relevant columns for matching latitudes and longitudes (all filtering happens at read-time)

flightsOverMAtall = tds(tds.LATP>coordinates(1) & tds.LATP<coordinates(2) & tds.LONP>coordinates(3) & tds.LONP<coordinates(4),["TripNo","LATP","LONP"]); 

flightsOverMA = gather(flightsOverMAtall);

Since we have to do the same reads twice (CSV and Parquet), the read code above is wrapped in convenient helper functions at the end of this blog.

Also note that when tall tables are evaluated (using gather), MATLAB displays the time taken by the evaluation passes. However, the actual data analysis time for each of the above strategies is slightly more than that of the gather command alone. It is captured using tic/toc around the complete read and filtering code.
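As a rough sketch of that timing pattern, the snippet below wraps strategy 3 in tic/toc. To keep it self-contained, the tall table is built from a small in-memory table with made-up values rather than from a datastore, so the measured time says nothing about the real dataset.

```matlab
% In-memory stand-in for the flight data (hypothetical values).
T = table((1:6)', ...
    [41.5; 42.1; 42.5; 43.0; 42.3; 40.0], ...
    [-71.0; -71.5; -72.0; -71.2; -69.0; -71.1], ...
    'VariableNames', ["TripNo","LATP","LONP"]);
tds = tall(T);

% Bounds in the order [latMin, latMax, lonMin, lonMax].
coordinates = [42.03, 42.72, -73.37, -70.06];

% Time the complete filter-and-gather sequence, not just gather.
tstart = tic;
inBox = tds.LATP > coordinates(1) & tds.LATP < coordinates(2) & ...
        tds.LONP > coordinates(3) & tds.LONP < coordinates(4);
flightsOverMA = gather(tds(inBox, ["TripNo","LATP","LONP"]));
elapsed = toc(tstart);

disp(flightsOverMA)
fprintf("Elapsed time: %.4f s\n", elapsed);
```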

CSV

First, let's try solving our problem using the CSV files. To read data from the dataset, we will use tall on a tabularTextDatastore. Note that neither of these objects actually reads the entire dataset. The datastore just stores information on how to read the files, and the tall table reads only the first few lines to show us what the data looks like.

ds = tabularTextDatastore("data\csv\*.csv")

ds = 
  TabularTextDatastore with properties:

                      Files: {
                             'C:\Blogs\Introduction to Parquet\data\csv\660200104100743_1HZ.csv';
                             'C:\Blogs\Introduction to Parquet\data\csv\660200104100915_1HZ.csv';
                             'C:\Blogs\Introduction to Parquet\data\csv\660200104101125_1HZ.csv'
                              ... and 4575 more
                             }
                    Folders: {
                             'C:\Blogs\Introduction to Parquet\data\csv'
                             }
               FileEncoding: 'UTF-8'
   AlternateFileSystemRoots: {}
         VariableNamingRule: 'modify'
          ReadVariableNames: true
              VariableNames: {'TripNo', 'Time', 'ABRK' ... and 87 more}
             DatetimeLocale: en_US

  Text Format Properties:
             NumHeaderLines: 0
                  Delimiter: ','
               RowDelimiter: '\r\n'
             TreatAsMissing: ''
               MissingValue: NaN

  Advanced Text Format Properties:
            TextscanFormats: {'%f', '%{dd-MMM-uuuu HH:mm:ss.SSS}D', '%f' ... and 87 more}
                   TextType: 'char'
         ExponentCharacters: 'eEdD'
               CommentStyle: ''
                 Whitespace: ' \b\t'
    MultipleDelimitersAsOne: false

  Properties that control the table returned by preview, read, readall:
      SelectedVariableNames: {'TripNo', 'Time', 'ABRK' ... and 87 more}
            SelectedFormats: {'%f', '%{dd-MMM-uuuu HH:mm:ss.SSS}D', '%f' ... and 87 more}
                   ReadSize: 20000 rows
                 OutputType: 'table'
                   RowTimes: []

  Write-specific Properties:
     SupportedOutputFormats: ["txt"    "csv"    "xlsx"    "xls"    "parquet"    "parq"]
        DefaultOutputFormat: "txt"
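The display above lists SelectedVariableNames among the properties that control what the datastore returns. As a small illustration on made-up data, the sketch below creates two tiny CSV files, restricts a datastore to three of their four columns, and then reads from it. The names demoDs and csvDir are hypothetical and chosen so they do not clash with ds.

```matlab
% Create a temporary folder with two small CSV files (hypothetical data).
csvDir = string(tempname);
mkdir(csvDir);
for k = 1:2
    T = table(repmat(k, 3, 1), [36.9; 40.1; 44.9], [-76.2; -85.0; -93.2], ...
        [100; 200; 300], 'VariableNames', ["TripNo","LATP","LONP","ALTS"]);
    writetable(T, fullfile(csvDir, "flight" + k + ".csv"));
end

% Creating the datastore inspects the files but does not read all the data.
demoDs = tabularTextDatastore(csvDir);

% Restrict what read, preview and readall return to three columns.
demoDs.SelectedVariableNames = {'TripNo','LATP','LONP'};

% preview returns a few rows without reading the whole collection.
firstRows = preview(demoDs);
disp(firstRows)

% readall reads every file; only do this when the result fits in memory.
allRows = readall(demoDs);

% Clean up the temporary folder and its files.
rmdir(csvDir, "s");
```

Unlike Parquet, a CSV file still has to be parsed line by line even when only some columns are kept, which is part of why the format choice matters for the comparison in this post.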

tds = tall(ds)

tds =

  M×90 tall table

     TripNo                Time               ABRK     ACMT    AIL_1     AIL_2     ALTS    APFD    ATEN    A_T    BLV    BPGR_1    BPGR_2    BPYR_1    BPYR_2    CALT    CASS     CRSS     DFGS       DWPT       EAI    ELEV_1    ELEV_2    EVNT    FADF    FADS    FGC3    FIRE_1    FIRE_2    FIRE_3    FIRE_4    FLAP    FQTY_1    FQTY_2    FQTY_3    FQTY_4      GLS       GPWS     HDGS     HF1    HF2    HYDG    HYDY       ILSF        LATP     LGDN    LGUP    LMOD      LOC        LONP      MNS    MRK    MW    N1CO    OIPL    OIP_1    OIP_2    OIP_3    OIP_4    OIT_1     OIT_2     OIT_3     OIT_4     PACK    PH    POVT     PTRM     PUSH    SAT    SMKB    SMOK    SNAP    SPLG    SPLY    SPL_1     SPL_2     TAI     TAT     TCAS    TMAG    TMODE    VHF1    VHF2    VHF3    VMODE    VSPS    WAI_1    WAI_2    WOW    WSHR
    _________    ________________________    ______    ____    ______    ______    ____    ____    ____    ___    ___    ______    ______    ______    ______    ____    ____    ______    ____    __________    ___    ______    ______    ____    ____    ____    ____    ______    ______    ______    ______    ____    ______    ______    ______    ______    ________    ____    ______    ___    ___    ____    ____    __________    ______    ____    ____    ____    ________    _______    ___    ___    __    ____    ____    _____    _____    _____    _____    ______    ______    ______    ______    ____    __    ____    ______    ____    ___    ____    ____    ____    ____    ____    ______    ______    ___    _____    ____    ____    _____    ____    ____    ____    _____    ____    _____    _____    ___    ____

    6.602e+14    10-Apr-2001 07:43:24.000    119.98     59     72.239    73.671    7000     2       0       1      0       0       24.414    24.414      0        0      134     118.92     1      2.0132e+05     0     19.208    64.179     1       15      15     120       0         0         0         0        98      8112       0         0        7952      0.03861     1      118.92     1      1      0       0      2.1577e+06    44.882     0       1       12     -0.35064    -93.203    204     7     0      1       0        0        0        0        0      27.704    26.361    27.704    27.704     2      1      0      3.1354     1      19      1       0       1       1       1      72.439     72.65     0     19.75     0       1        2       1       1       1       11       0        0        0       0      1  
    6.602e+14    10-Apr-2001 07:43:25.000    119.98     60     72.239    74.142    7000     2       0       1      0       0       24.414    24.414      0        0      134     118.92     1      2.0132e+05     0     19.188    64.179     1       15      15     120       0         0         0         0        98      8112       0         0        7952     -0.01053     1      118.92     1      1      0       0      2.1577e+06    44.882     0       1       12     -0.32301    -93.203    204     7     0      1       0        0        0        0        0      27.704    26.361    27.704    27.704     2      1      0       3.074     1      19      1       0       1       1       1      72.439     72.65     0     19.75     0       1        2       1       1       1       11       0        0        0       0      1  
    6.602e+14    10-Apr-2001 07:43:26.000    119.98     59     72.362    74.285    7000     2       0       1      0       0       24.414    24.414      0        0      134     118.92     1      2.0132e+05     0     19.188    64.199     1       15      15     120       0         0         0         0        98      8112       0         0        7952      0.05772     1      118.92     1      1      0       0      2.1577e+06    44.882     0       1       12      -0.3677    -93.203    204     7     0      1       0        0        0        0        0      27.704    26.361    26.361    27.704     2      1      0      3.0536     1      19      1       0       1       1       1      72.429     72.65     0     19.75     0       1        2       1       1       1       11       0        0        0       0      1  
    6.602e+14    10-Apr-2001 07:43:27.000    119.98     59     72.423    74.367    7000     2       0       1      0       0       24.414    24.414      0        0      134     118.92     1      2.0132e+05     0     19.208    64.179     1       15      15     120       0         0         0         0        99      8112       0         0        7952       0.0078     1      118.92     1      1      0       0      2.1577e+06    44.882     0       1       12     -0.32948    -93.203    204     7     0      1       0        0        0        0        0      27.704    26.361    26.361    27.704     2      1      0      3.0536     1      19      1       0       1       1       1      72.429     72.65     0     19.75     0       1        2       1       1       1       11       0        0        0       0      1  
    6.602e+14    10-Apr-2001 07:43:28.000    119.98     60     72.444    74.326    7000     2       0       1      0       0       24.414    24.414      0        0      134     118.92     1      2.0132e+05     0     19.208    64.179     1       15      15     120       0         0         0         0        98      8112       0         0        7952      0.02301     1      118.92     1      1      0       0      2.1577e+06    44.882     0       1       12     -0.39396    -93.203    204     7     0      1       0        0        0        0        0      27.704    26.361    27.704    27.704     2      1      0       3.074     1      19      1       0       1       1       1      72.439     72.65     0     19.75     0       1        2       1       1       1       11       0        0        0       0      1  
    6.602e+14    10-Apr-2001 07:43:29.000    119.98     59     72.403    74.305    7000     2       0       1      0       0       24.414    24.414      0        0      134     118.92     1      2.0132e+05     0     19.188     64.22     1       15      15     120       0         0         0         0        98      8104       0         0        7952     -0.13689     1      118.92     1      1      0       0      2.1577e+06    44.882     0       1       12     -0.34672    -93.203    204     7     0      1       0        0        0        0        0      27.704    26.361    27.704    27.704     0      1      0      3.0536     1      19      1       0       1       1       1      72.439     72.65     0      19.5     0       1        2       1       1       1       11       0        0        0       0      1  
    6.602e+14    10-Apr-2001 07:43:30.000    119.98     60     72.403    74.326    7000     2       0       1      0       0       24.414    24.414      0        0      134     118.92     1      2.0132e+05     0     19.167    64.261     1       15      15     120       0         0         0         0        98      8112       0         0        7952       0.1794     1      118.92     1      1      0       0      2.1577e+06    44.882     0       1       12     -0.38416    -93.203    204     7     0      1       0        0        0        0        0      27.704    27.704    27.704    27.704     0      1      0       3.074     1      19      1       0       1       1       1      72.439     72.65     0     19.75     0       1        2       1       1       1       11       0        0        0       0      1  
    6.602e+14    10-Apr-2001 07:43:31.000    119.98     59     72.239    74.224    7000     2       0       1      0       0       24.414    19.531      0        0      134     118.92     1      2.0132e+05     0     19.208    64.179     1       15      15     120       0         0         0         0        98      8112       0         0        7952      0.02769     1      118.92     1      1      0       0      2.1577e+06    44.882     0       1       12     -0.36632    -93.203    204     7     0      1       0        0        0        0        0      27.704    26.361    27.704    27.704     0      1      0      3.0536     1      19      1       0       1       1       1      72.439    72.641     0     19.75     0       1        2       1       1       1       11       0        0        0       0      1  
        :                   :                  :        :        :         :        :       :       :       :      :       :         :         :         :        :       :        :        :          :          :       :         :        :       :       :       :        :         :         :         :        :        :         :         :         :          :         :        :        :      :      :       :          :           :        :       :       :         :           :        :      :     :      :       :        :        :        :        :        :         :         :         :        :      :      :        :        :       :      :       :       :       :       :        :         :        :       :       :       :        :       :       :       :        :       :        :        :       :      :
        :                   :                  :        :        :         :        :       :       :       :      :       :         :         :         :        :       :        :        :          :          :       :         :        :       :       :       :        :         :         :         :        :        :         :         :         :          :         :        :        :      :      :       :          :           :        :       :       :         :           :        :      :     :      :       :        :        :        :        :        :         :         :         :        :      :      :        :        :       :      :       :       :       :       :        :         :        :       :       :       :        :       :       :       :        :       :        :        :       :      :

Let's try our 3 read strategies next.

CSV Read Strategy 1 - Read the entire data into memory

[flightsOverMA, csvTime1] = findFlightsAllData(tds, MACoordinates);

Evaluating tall expression using the Local MATLAB Session:
- Pass 1 of 1: Completed in 9 min 42 sec
Evaluation completed in 9 min 42 sec

head(flightsOverMA)

     TripNo       LATP      LONP  
    _________    ______    _______

    6.602e+14    42.032    -73.323
    6.602e+14    42.033    -73.323
    6.602e+14    42.035    -73.323
    6.602e+14    42.037    -73.323
    6.602e+14    42.039    -73.323
    6.602e+14     42.04    -73.323
    6.602e+14    42.042    -73.323
    6.602e+14    42.044    -73.323

height(flightsOverMA)

ans = 301

csvTime1

csvTime1 = 582.7319

CSV Read Strategy 2 - Read all columns for matching latitudes and longitudes

[flightsOverMA, csvTime2] = findFlightsFilteredRowsAllColumns(tds, MACoordinates);

Evaluating tall expression using the Local MATLAB Session:
- Pass 1 of 1: Completed in 8 min 33 sec
Evaluation completed in 8 min 33 sec

height(flightsOverMA)

ans = 301

csvTime2

csvTime2 = 514.1776

CSV Read Strategy 3 - Read only relevant columns for matching latitudes and longitudes

[flightsOverMA, csvTime3] = findFlightsFilteredRowsAndColumns(tds,MACoordinates);

Evaluating tall expression using the Local MATLAB Session:
- Pass 1 of 1: Completed in 4 min 3 sec
Evaluation completed in 4 min 3 sec

height(flightsOverMA)

ans = 301

csvTime3

csvTime3 = 242.9099

We were able to find the 301 rows with Massachusetts coordinates with each read strategy, but it can take close to 10 minutes to read the csv dataset and run this search.

Parquet

Analyzing this data directly from csv files was painstakingly slow. Let's explore how using Parquet files instead can make a significant difference in our analysis time. First, we will rewrite the data into Parquet files. There are different ways to write Parquet files in MATLAB. We can use the writeall method of datastore to write all the data that are being read into new file types. However, this will write a Parquet file for each individual csv file. Parquet files of such small sizes are not going to be very helpful. Instead we would like to coalesce the flight information into larger Parquet files that can still be read efficiently into MATLAB. This can be done using the write method for tall and using parquetwrite as a custom write function. 

write("data/parquet/", tds, "WriteFcn", @dataWriter);

function dataWriter(info, data)
    filename = info.SuggestedFilename;
    parquetwrite(filename, data)
end

d = dir("data/parquet/");

struct2table(d)

ans = 14×6 table 

	name	folder	date	bytes	isdir	datenum
1	'.'	'C:\Blogs\Introduction to Parquet\data\parquet'	'09-Apr-2023 13:00:15'	0	1	7.3899e+05
2	'..'	'C:\Blogs\Introduction to Parquet\data\parquet'	'09-Apr-2023 12:37:14'	0	1	7.3899e+05
3	'data_1_000001.parquet'	'C:\Blogs\Introduction to Parquet\data\parquet'	'09-Apr-2023 12:46:28'	63133595	0	7.3899e+05
4	'data_1_000002.parquet'	'C:\Blogs\Introduction to Parquet\data\parquet'	'09-Apr-2023 12:48:08'	64111087	0	7.3899e+05
5	'data_1_000003.parquet'	'C:\Blogs\Introduction to Parquet\data\parquet'	'09-Apr-2023 12:49:38'	64799996	0	7.3899e+05
6	'data_1_000004.parquet'	'C:\Blogs\Introduction to Parquet\data\parquet'	'09-Apr-2023 12:51:02'	64212887	0	7.3899e+05
7	'data_1_000005.parquet'	'C:\Blogs\Introduction to Parquet\data\parquet'	'09-Apr-2023 12:52:23'	62082870	0	7.3899e+05
8	'data_1_000006.parquet'	'C:\Blogs\Introduction to Parquet\data\parquet'	'09-Apr-2023 12:53:22'	45843227	0	7.3899e+05
9	'data_2_000001.parquet'	'C:\Blogs\Introduction to Parquet\data\parquet'	'09-Apr-2023 12:54:35'	63200246	0	7.3899e+05
10	'data_2_000002.parquet'	'C:\Blogs\Introduction to Parquet\data\parquet'	'09-Apr-2023 12:55:49'	63200250	0	7.3899e+05
11	'data_2_000003.parquet'	'C:\Blogs\Introduction to Parquet\data\parquet'	'09-Apr-2023 12:57:03'	64229179	0	7.3899e+05
12	'data_2_000004.parquet'	'C:\Blogs\Introduction to Parquet\data\parquet'	'09-Apr-2023 12:58:18'	62165017	0	7.3899e+05
13	'data_2_000005.parquet'	'C:\Blogs\Introduction to Parquet\data\parquet'	'09-Apr-2023 12:59:31'	61241659	0	7.3899e+05
14	'data_2_000006.parquet'	'C:\Blogs\Introduction to Parquet\data\parquet'	'09-Apr-2023 13:00:19'	40526588	0	7.3899e+05

When written this way, tall coalesced the entire dataset into 12 parquet files with a single rowgroup in each file. Let's now run our tests on the parquet files. 
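One way to check the rowgroup layout of a Parquet file is parquetinfo, which reads only the file metadata. The sketch below is self-contained and uses a small synthetic table rather than the flight data; the file name and values are made up for illustration. Pointing the same parquetinfo call at one of the data_*.parquet files above should show the rowgroup count for that file.

```matlab
% Minimal sketch with synthetic data; the file name and values are illustrative only.
demoFile = fullfile(tempdir, "rowgroup_demo.parquet");
T = table((1:1000)', rand(1000,1), 'VariableNames', {'TripNo','LATP'});
parquetwrite(demoFile, T);            % default settings, as in dataWriter

info = parquetinfo(demoFile);         % metadata only; no table data is read
numRowGroups    = info.NumRowGroups   % number of rowgroups in the file
rowGroupHeights = info.RowGroupHeights % rows stored in each rowgroup
```

The rowgroup heights always add up to the number of rows in the file, which is a quick sanity check on what was written.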

Just like we used tabularTextDatastore to read the csv files, we can use parquetDatastore to read the data from the Parquet files and then create the tall table.

ds = parquetDatastore("data\parquet\*.parquet")

ds = 
  ParquetDatastore with properties:

                       Files: {
                              'C:\Blogs\Introduction to Parquet\data\parquet\data_1_000001.parquet';
                              'C:\Blogs\Introduction to Parquet\data\parquet\data_1_000002.parquet';
                              'C:\Blogs\Introduction to Parquet\data\parquet\data_1_000003.parquet'
                               ... and 9 more
                              }
                     Folders: {
                              'C:\Blogs\Introduction to Parquet\data\parquet'
                              }
    AlternateFileSystemRoots: {}
                    ReadSize: 'rowgroup'

   Properties that control the table returned by preview, read, and readall:
               VariableNames: {1×90 cell}
       SelectedVariableNames: {1×90 cell}
          VariableNamingRule: 'modify'
                  OutputType: 'table'
                    RowTimes: []
                   RowFilter: <unconstrained>

   Properties that control output format when writing:
      SupportedOutputFormats: ["txt"    "csv"    "xlsx"    "xls"    "parquet"    "parq"]
         DefaultOutputFormat: "parquet"

tds = tall(ds)

tds =

  M×90 tall table

     TripNo              Time             ABRK     ACMT    AIL_1     AIL_2     ALTS    APFD    ATEN    A_T    BLV    BPGR_1    BPGR_2    BPYR_1    BPYR_2    CALT    CASS     CRSS     DFGS       DWPT       EAI    ELEV_1    ELEV_2    EVNT    FADF    FADS    FGC3    FIRE_1    FIRE_2    FIRE_3    FIRE_4    FLAP    FQTY_1    FQTY_2    FQTY_3    FQTY_4      GLS       GPWS     HDGS     HF1    HF2    HYDG    HYDY       ILSF        LATP     LGDN    LGUP    LMOD      LOC        LONP      MNS    MRK    MW    N1CO    OIPL    OIP_1    OIP_2    OIP_3    OIP_4    OIT_1     OIT_2     OIT_3     OIT_4     PACK    PH    POVT     PTRM     PUSH    SAT    SMKB    SMOK    SNAP    SPLG    SPLY    SPL_1     SPL_2     TAI     TAT     TCAS    TMAG    TMODE    VHF1    VHF2    VHF3    VMODE    VSPS    WAI_1    WAI_2    WOW    WSHR
    _________    ____________________    ______    ____    ______    ______    ____    ____    ____    ___    ___    ______    ______    ______    ______    ____    ____    ______    ____    __________    ___    ______    ______    ____    ____    ____    ____    ______    ______    ______    ______    ____    ______    ______    ______    ______    ________    ____    ______    ___    ___    ____    ____    __________    ______    ____    ____    ____    ________    _______    ___    ___    __    ____    ____    _____    _____    _____    _____    ______    ______    ______    ______    ____    __    ____    ______    ____    ___    ____    ____    ____    ____    ____    ______    ______    ___    _____    ____    ____    _____    ____    ____    ____    _____    ____    _____    _____    ___    ____

    6.602e+14    10-Apr-2001 07:43:24    119.98     59     72.239    73.671    7000     2       0       1      0       0       24.414    24.414      0        0      134     118.92     1      2.0132e+05     0     19.208    64.179     1       15      15     120       0         0         0         0        98      8112       0         0        7952      0.03861     1      118.92     1      1      0       0      2.1577e+06    44.882     0       1       12     -0.35064    -93.203    204     7     0      1       0        0        0        0        0      27.704    26.361    27.704    27.704     2      1      0      3.1354     1      19      1       0       1       1       1      72.439     72.65     0     19.75     0       1        2       1       1       1       11       0        0        0       0      1  
    6.602e+14    10-Apr-2001 07:43:25    119.98     60     72.239    74.142    7000     2       0       1      0       0       24.414    24.414      0        0      134     118.92     1      2.0132e+05     0     19.188    64.179     1       15      15     120       0         0         0         0        98      8112       0         0        7952     -0.01053     1      118.92     1      1      0       0      2.1577e+06    44.882     0       1       12     -0.32301    -93.203    204     7     0      1       0        0        0        0        0      27.704    26.361    27.704    27.704     2      1      0       3.074     1      19      1       0       1       1       1      72.439     72.65     0     19.75     0       1        2       1       1       1       11       0        0        0       0      1  
    6.602e+14    10-Apr-2001 07:43:26    119.98     59     72.362    74.285    7000     2       0       1      0       0       24.414    24.414      0        0      134     118.92     1      2.0132e+05     0     19.188    64.199     1       15      15     120       0         0         0         0        98      8112       0         0        7952      0.05772     1      118.92     1      1      0       0      2.1577e+06    44.882     0       1       12      -0.3677    -93.203    204     7     0      1       0        0        0        0        0      27.704    26.361    26.361    27.704     2      1      0      3.0536     1      19      1       0       1       1       1      72.429     72.65     0     19.75     0       1        2       1       1       1       11       0        0        0       0      1  
    6.602e+14    10-Apr-2001 07:43:27    119.98     59     72.423    74.367    7000     2       0       1      0       0       24.414    24.414      0        0      134     118.92     1      2.0132e+05     0     19.208    64.179     1       15      15     120       0         0         0         0        99      8112       0         0        7952       0.0078     1      118.92     1      1      0       0      2.1577e+06    44.882     0       1       12     -0.32948    -93.203    204     7     0      1       0        0        0        0        0      27.704    26.361    26.361    27.704     2      1      0      3.0536     1      19      1       0       1       1       1      72.429     72.65     0     19.75     0       1        2       1       1       1       11       0        0        0       0      1  
    6.602e+14    10-Apr-2001 07:43:28    119.98     60     72.444    74.326    7000     2       0       1      0       0       24.414    24.414      0        0      134     118.92     1      2.0132e+05     0     19.208    64.179     1       15      15     120       0         0         0         0        98      8112       0         0        7952      0.02301     1      118.92     1      1      0       0      2.1577e+06    44.882     0       1       12     -0.39396    -93.203    204     7     0      1       0        0        0        0        0      27.704    26.361    27.704    27.704     2      1      0       3.074     1      19      1       0       1       1       1      72.439     72.65     0     19.75     0       1        2       1       1       1       11       0        0        0       0      1  
    6.602e+14    10-Apr-2001 07:43:29    119.98     59     72.403    74.305    7000     2       0       1      0       0       24.414    24.414      0        0      134     118.92     1      2.0132e+05     0     19.188     64.22     1       15      15     120       0         0         0         0        98      8104       0         0        7952     -0.13689     1      118.92     1      1      0       0      2.1577e+06    44.882     0       1       12     -0.34672    -93.203    204     7     0      1       0        0        0        0        0      27.704    26.361    27.704    27.704     0      1      0      3.0536     1      19      1       0       1       1       1      72.439     72.65     0      19.5     0       1        2       1       1       1       11       0        0        0       0      1  
    6.602e+14    10-Apr-2001 07:43:30    119.98     60     72.403    74.326    7000     2       0       1      0       0       24.414    24.414      0        0      134     118.92     1      2.0132e+05     0     19.167    64.261     1       15      15     120       0         0         0         0        98      8112       0         0        7952       0.1794     1      118.92     1      1      0       0      2.1577e+06    44.882     0       1       12     -0.38416    -93.203    204     7     0      1       0        0        0        0        0      27.704    27.704    27.704    27.704     0      1      0       3.074     1      19      1       0       1       1       1      72.439     72.65     0     19.75     0       1        2       1       1       1       11       0        0        0       0      1  
    6.602e+14    10-Apr-2001 07:43:31    119.98     59     72.239    74.224    7000     2       0       1      0       0       24.414    19.531      0        0      134     118.92     1      2.0132e+05     0     19.208    64.179     1       15      15     120       0         0         0         0        98      8112       0         0        7952      0.02769     1      118.92     1      1      0       0      2.1577e+06    44.882     0       1       12     -0.36632    -93.203    204     7     0      1       0        0        0        0        0      27.704    26.361    27.704    27.704     0      1      0      3.0536     1      19      1       0       1       1       1      72.439    72.641     0     19.75     0       1        2       1       1       1       11       0        0        0       0      1  
        :                 :                :        :        :         :        :       :       :       :      :       :         :         :         :        :       :        :        :          :          :       :         :        :       :       :       :        :         :         :         :        :        :         :         :         :          :         :        :        :      :      :       :          :           :        :       :       :         :           :        :      :     :      :       :        :        :        :        :        :         :         :         :        :      :      :        :        :       :      :       :       :       :       :        :         :        :       :       :       :        :       :       :       :        :       :        :        :       :      :
        :                 :                :        :        :         :        :       :       :       :      :       :         :         :         :        :       :        :        :          :          :       :         :        :       :       :       :        :         :         :         :        :        :         :         :         :          :         :        :        :      :      :       :          :           :        :       :       :         :           :        :      :     :      :       :        :        :        :        :        :         :         :         :        :      :      :        :        :       :      :       :       :       :       :        :         :        :       :       :       :        :       :       :       :        :       :        :        :       :      :

Now let's try the same read strategies on Parquet files.

Parquet Read Strategy 1 - Read the entire data into memory

[flightsOverMA, parquetTime1] = findFlightsAllData(tds, MACoordinates);

Evaluating tall expression using the Local MATLAB Session:
- Pass 1 of 1: Completed in 46 sec
Evaluation completed in 46 sec

head(flightsOverMA)

     TripNo       LATP      LONP  
    _________    ______    _______

    6.602e+14    42.032    -73.323
    6.602e+14    42.033    -73.323
    6.602e+14    42.035    -73.323
    6.602e+14    42.037    -73.323
    6.602e+14    42.039    -73.323
    6.602e+14     42.04    -73.323
    6.602e+14    42.042    -73.323
    6.602e+14    42.044    -73.323

height(flightsOverMA)

ans = 301

parquetTime1

parquetTime1 = 48.3910

Parquet Read Strategy 2 - Read all columns for matching latitudes and longitudes

This is where Parquet's predicate pushdown comes into action. When we index into the tall array using the latitude and longitude values, tall uses that information to create the corresponding rowfilter and then reads from the Parquet files.

[flightsOverMA, parquetTime2] = findFlightsFilteredRowsAllColumns(tds, MACoordinates);

Evaluating tall expression using the Local MATLAB Session:
- Pass 1 of 1: Completed in 15 sec
Evaluation completed in 15 sec

height(flightsOverMA)

ans = 301

parquetTime2

parquetTime2 = 18.3050
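The row filter that tall builds for us here can also be written by hand. The following is a minimal sketch, not the code tall runs internally: it uses a small synthetic table in place of the flight data, and the file name, the ALTS values and the bounding box are all hypothetical. It passes a rowfilter to parquetread and compares the result with filtering after a full read.

```matlab
% Synthetic stand-in for the flight data; names, values and bounds are illustrative only.
rng(0)
n = 5000;
T = table(randi(20, n, 1), 30 + 20*rand(n, 1), -100 + 35*rand(n, 1), rand(n, 1), ...
    'VariableNames', {'TripNo','LATP','LONP','ALTS'});
demoFile = fullfile(tempdir, "rowfilter_demo.parquet");
parquetwrite(demoFile, T);

bounds = [41 43 -74 -70];   % hypothetical [latMin latMax lonMin lonMax]

% Build the filter from variable names, then hand it to the reader.
rf = rowfilter(["LATP","LONP"]);
rf = rf.LATP > bounds(1) & rf.LATP < bounds(2) & ...
     rf.LONP > bounds(3) & rf.LONP < bounds(4);
filtered = parquetread(demoFile, "RowFilter", rf);   % filtering happens at read time

% Same selection done in memory after reading everything, for comparison.
everything = parquetread(demoFile);
keep = everything.LATP > bounds(1) & everything.LATP < bounds(2) & ...
       everything.LONP > bounds(3) & everything.LONP < bounds(4);
inMemory = everything(keep, :);

sameRows = height(filtered) == height(inMemory)
```

Both routes select the same rows; the difference is that the first one lets the Parquet reader skip data that cannot match instead of loading it first.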

Parquet Read Strategy 3 - Read only relevant columns for matching latitudes and longitudes

Now, in addition to predicate pushdown, let's just read the relevant columns.

[flightsOverMA, parquetTime3] = findFlightsFilteredRowsAndColumns(tds,MACoordinates);

Evaluating tall expression using the Local MATLAB Session:
- Pass 1 of 1: Completed in 1.1 sec
Evaluation completed in 1.1 sec

height(flightsOverMA)

ans = 301

parquetTime3

parquetTime3 = 1.5900
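Row and column selection can also be set directly on a parquetDatastore through its RowFilter and SelectedVariableNames properties, which appear in the datastore display earlier. The sketch below assumes a small synthetic file; the file name, data and bounding box are hypothetical and only meant to show the two settings working together.

```matlab
% Synthetic stand-in for the flight data; names, values and bounds are illustrative only.
rng(0)
n = 5000;
T = table(randi(20, n, 1), 30 + 20*rand(n, 1), -100 + 35*rand(n, 1), rand(n, 1), ...
    'VariableNames', {'TripNo','LATP','LONP','ALTS'});
demoFile = fullfile(tempdir, "projection_demo.parquet");
parquetwrite(demoFile, T);

bounds = [41 43 -74 -70];   % hypothetical [latMin latMax lonMin lonMax]
rf = rowfilter(["LATP","LONP"]);
rf = rf.LATP > bounds(1) & rf.LATP < bounds(2) & ...
     rf.LONP > bounds(3) & rf.LONP < bounds(4);

ds = parquetDatastore(demoFile);
ds.SelectedVariableNames = {'TripNo','LATP','LONP'};   % projection: columns to read
ds.RowFilter = rf;                                      % predicate: rows to read
result = readall(ds);

size(result)
```

Only the three selected columns come back, and only for rows inside the bounding box.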

Results

Like the csv reads, we were able to find the 301 rows with Massachusetts coordinates from the Parquet files, but the reads were much faster. Let's review what we have achieved here by working with Parquet files instead of csv.

plotResults([csvTime1, parquetTime1; csvTime2, parquetTime2; csvTime3, parquetTime3])

Read Strategy 1: This was the heavy hammer scenario where we read the entire dataset into memory. Simply rewriting the files in Parquet gave us about 12x faster performance.
Read Strategy 2: A better way to work with this data is to read only the rows that are needed. We used the same indexing code for our tall table on Parquet as we did for csv. However, when working with Parquet files, tall used the indexing information for predicate pushdown to do read-time filtering. As a result, reading the Parquet files was 28x faster than reading csv. It was also about 2.6x faster than reading the whole Parquet dataset.
Read Strategy 3: This is where we fully used Parquet's predicate and projection pushdown capabilities to read data efficiently. In addition to the row filtering in Read Strategy 2, we also read only the relevant columns. Because Parquet files are column-oriented, the reader only needed to read the contiguous blocks that hold those columns instead of picking latitudes and longitudes out of every row in the file. As a result, we were able to read the data in under 2 seconds, a speedup of more than two orders of magnitude compared to doing the same with csv.
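The ratios quoted for the three read strategies can be recomputed from the timings printed earlier. This is just the arithmetic, with the six measured values copied in by hand:

```matlab
% Timings in seconds, copied from the csvTime* and parquetTime* outputs above.
csvTimes     = [582.7319 514.1776 242.9099];
parquetTimes = [ 48.3910  18.3050   1.5900];

% Parquet vs csv for each read strategy
speedup = round(csvTimes ./ parquetTimes, 1)                % 12.0  28.1  152.8

% Filtered Parquet read (strategy 2) vs full Parquet read (strategy 1)
filterGain = round(parquetTimes(1) / parquetTimes(2), 1)    % 2.6
```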

Let's also check the actual flight path. It turns out that only one flight (out of 4578) flew over Massachusetts. It just grazed the state's western border with New York state. We will extract the complete flight information from the tall table and then plot the flight path.

flights = unique(flightsOverMA.TripNo)

flights = 6.6020e+14

completeFlightPath = gather(tds(tds.TripNo == flights,["LATP","LONP"])); 

Evaluating tall expression using the Local MATLAB Session:
- Pass 1 of 1: Completed in 0.31 sec
Evaluation completed in 0.33 sec

geoplot(completeFlightPath.LATP,completeFlightPath.LONP,'Color','b','LineWidth',3);

geobasemap colorterrain

geolimits([34.3 50.8],[-97.6 -69.0])

Summary

This post gave a quick overview of how Parquet can help when working with big data. We saw how, by using Parquet files, we were able to achieve a more than 100x performance improvement without needing additional processing power or parallelization. This kind of data engineering and analysis technique is a good first step when working with big data before reaching for more power. However, here are a few things to keep in mind before we wrap up.

1. Parquet files store data in the form of a table. This can be good for most cases. But if the data is not tabular to begin with (e.g. multi-dimensional data), it may not be suitable for Parquet files. 

2. One disadvantage of using Parquet files over csv is that they cannot be opened with applications like Excel. This reduces readability and ease of use.

3. Parquet files do not provide the same level of efficiency for all data sizes. In fact, for smaller data sizes, there might not be much benefit at all. Also, to keep this blog introductory, our example showed tabular data written into Parquet files using default rowgroup sizing. We did not dig deeper into the impact of using different rowgroup sizes. We also did not talk about how parallelization can help make the process faster. These choices do impact the level of performance that can be achieved by using Parquet files. Let us know if you are interested in any of these topics (or other data-related topics) and we can cover them in future blogs.

4. Lastly, Parquet is not the only way to work with big data. While it's a promising file format, there can be good reasons to store and work with big data in other formats (csv, spreadsheet, mat etc.) and sources (cloud, databases, data platforms etc.). For more information on how MATLAB can work with different kinds of big data, please visit: https://www.mathworks.com/solutions/big-data-matlab.html
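On the rowgroup sizing mentioned in point 3: for readers who want to experiment, the sketch below shows one way to control rowgroup sizes when writing. It assumes a MATLAB release in which parquetwrite accepts the RowGroupHeights option, and it uses a tiny synthetic table with a made-up file name. It is a starting point for experiments, not a recommendation for any particular size.

```matlab
% Minimal sketch with synthetic data; the file name and values are illustrative only.
% Assumes parquetwrite supports the "RowGroupHeights" name-value argument.
T = table((1:10)', rand(10,1), 'VariableNames', {'TripNo','LATP'});
demoFile = fullfile(tempdir, "rowgroup_size_demo.parquet");

% Split the 10 rows into rowgroups of 4, 4 and 2 rows.
parquetwrite(demoFile, T, "RowGroupHeights", [4 4 2]);

info = parquetinfo(demoFile);
numRowGroups = info.NumRowGroups      % should report 3 rowgroups
```

Smaller rowgroups give read-time filtering more chances to skip data, at the cost of more metadata and smaller reads, so the best size depends on the data and the queries.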

Helper functions

function dataWriter(info, data)
    % Custom write function for tall write: save each block as a Parquet file
    filename = info.SuggestedFilename;
    parquetwrite(filename, data)
end

function [flightsOverMA, t] = findFlightsAllData(tds, coordinates)
    % Strategy 1: gather everything, then filter rows and columns in memory.
    % coordinates = [latMin latMax lonMin lonMax]
    tstart = tic;
    allData = gather(tds);
    flightsOverMA = allData(allData.LATP>coordinates(1) & allData.LATP<coordinates(2) & ...
        allData.LONP>coordinates(3) & allData.LONP<coordinates(4),["TripNo","LATP","LONP"]);
    t = toc(tstart);
end

function [flightsOverMA, t] = findFlightsFilteredRowsAllColumns(tds, coordinates)
    % Strategy 2: filter rows on the tall table, gather all columns, then pick columns.
    tstart = tic;
    flightsOverMAtall = tds(tds.LATP>coordinates(1) & tds.LATP<coordinates(2) & ...
        tds.LONP>coordinates(3) & tds.LONP<coordinates(4),:);
    flightsOverMAAllColumns = gather(flightsOverMAtall);
    flightsOverMA = flightsOverMAAllColumns(:,["TripNo","LATP","LONP"]);
    t = toc(tstart);
end

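function [flightsOverMA, t] = findFlightsFilteredRowsAndColumns(tds, coordinates)
    % Read strategy 3: filter rows and select only the relevant columns
    % before gathering.
    % coordinates = [minLat, maxLat, minLon, maxLon]
    tstart = tic;
    inBox = tds.LATP>coordinates(1) & tds.LATP<coordinates(2) & ...
            tds.LONP>coordinates(3) & tds.LONP<coordinates(4);
    flightsOverMAtall = tds(inBox,["TripNo","LATP","LONP"]);
    flightsOverMA = gather(flightsOverMAtall);
    t = toc(tstart);
end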

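function plotResults(data)
    % Grouped bar chart of read times in seconds: one group per read
    % strategy (rows of data), one bar per file format (columns of data).
    b = bar(data);
    ylabel("seconds")
    xticklabels(["Read strategy 1","Read strategy 2","Read strategy 3"])

    % Label the bars of the first series (csv)
    xtips1 = b(1).XEndPoints;
    ytips1 = b(1).YEndPoints;
    labels1 = string(round(b(1).YData,1))+"s";
    text(xtips1,ytips1,labels1,'HorizontalAlignment','center',...
        'VerticalAlignment','bottom')

    % Label the bars of the second series (parquet)
    xtips2 = b(2).XEndPoints;
    ytips2 = b(2).YEndPoints;
    labels2 = string(round(b(2).YData,1))+"s";
    text(xtips2,ytips2,labels2,'HorizontalAlignment','center',...
        'VerticalAlignment','bottom')

    legend(["csv","parquet"])

    % Leave 10% headroom above the csv bar for read strategy 1
    ylim([0, data(1,1)*1.1])
end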

Category: Big Data, Data Science, Guest posts, Open Source, performance
