Introduction to the New MATLAB Data Types in R2013b

作者 Loren Shure, September 10, 2013

15 次查看（过去 30 天） | 0 个赞 | 30 个评论

Today I’d like to introduce a fairly frequent guest blogger Sarah Wait Zaranek who works for the MATLAB Marketing team here at The MathWorks. She and I will be writing about the new capabilities for MATLAB in R2013b. In particular, there are two new data types in MATLAB in R2013b – table and categorical arrays.

What are Tables and Categorical Arrays?
Importing Data into a Table
Looking at Variable Names (Column Names)
Accessing the Data in Your Table
Converting Data to Categorical Arrays
Creating a New Table
Adding/Removing Variables
Removing Missing Data
Summarizing a Table
Sorting Data
Applying Functions to Table Variables
Joining (Merging) Tables
Plotting Data From the Final Table
Your Thoughts?

What are Tables and Categorical Arrays?

Table is a new data type suitable for holding heterogenous data and metadata. Specifically, tables are useful for mixed-type tabular data that are often stored as columns in a text file or in a spreadsheet. Tables consist of rows and column-oriented variables. Categorical arrays are useful for holding categorical data - which have values from a finite list of discrete categories.

One of the best ways to learn more about tables and categorical arrays is to see them in action. So, in this post, we will use tables and categoricals to examine some airplane flight delay data. The flight data is freely available from the Bureau of Transportation Statistics (BTS). You can download it yourself here. The weather data is from the National Climatic Data Center (NCDC) and is available here.

Importing Data into a Table

You can import your data into a table interactively using the Import Tool or you can do it programmatically, using readtable.

FlightData = readtable('Jan2010Flights.csv');

whos FlightData

  Name                Size              Bytes  Class    Attributes

  FlightData      17816x7             9007742  table

The entire contents of the file are now contained in a single variable – a table. Here you are reading in your data from a csv file. readtable also supports reading from .txt,.dat text files and Excel spreadsheet files. Tables can also be created directly from variables in your workspace.

Looking at Variable Names (Column Names)

You can see all the variable names (column names in our table) by looking at the VariableNames properties.

FlightData.Properties.VariableNames

ans = 
  Columns 1 through 5
    'FL_DATE'    'CARRIER'    'ORIGIN'    'DEST'    'CRS_DEP_TIME'
  Columns 6 through 7
    'DEP_TIME'    'DEP_DELAY'

This particular table does not contain any row names, but for a table with row names you can access the row names using the RowNames property.

Accessing the Data in Your Table

There are multiple ways to access the data in your table. You can use dot indexing to access or modify a single table variable, similar to how you use fieldnames in structures.

For example, using dot indexing you can plot a histogram of the departure delays (in minutes).

hist(FlightData.DEP_DELAY)
title('Histogram of Flight Delays in Minutes')

You can also display the first 5 departure delays.

FlightData.DEP_DELAY(1:5)

You can also extract data from one or more variables in the table using curly braces. Within the curly braces you can use numeric indexing or variable and row names. For example, you can extract the actual departure times and scheduled depature times for the first 5 flights.

SomeTimes = FlightData{1:5,{'DEP_TIME','CRS_DEP_TIME'}};
disp(SomeTimes)

        1149        1100
        1053        1100
        1055        1100
        1052        1100
        1050        1100

This is similar to indexing with cell arrays. However, unlike with cells, this concatenates the specified variables into a single array. Therefore, the data types of all the specified variables need to be compatible for concatenation.

Converting Data to Categorical Arrays

You can convert some of the variables in your table using categorical. Categorical arrays are more memory efficent than holding cell arrays of strings when you have repeated data. Categorical arrays store only one copy of each category name, reducing the amount of memory required to store the array. You can use whos to see the amount of memory you save by converting the data to a categorical array.

whos FlightData

  Name                Size              Bytes  Class    Attributes

  FlightData      17816x7             9007742  table

FlightData.ORIGIN = categorical(FlightData.ORIGIN);
FlightData.DEST = categorical(FlightData.DEST);
FlightData.CARRIER = categorical(FlightData.CARRIER);

whos FlightData

  Name                Size              Bytes  Class    Attributes

  FlightData      17816x7             2857264  table

Using categories, you can find all the distinct categories in your array. By default, categorical arrays do not define a definite order. If your data contains categories with a definite order, you can set the 'Ordinal' flag to true when creating your categorical array. The default the order will be alphabetical, but you can prescribe you own order instead.

categories(FlightData.CARRIER)

ans = 
    '9E'
    'AA'
    'AS'
    'B6'
    'CO'
    'DL'
    'F9'
    'FL'
    'MQ'
    'OH'
    'UA'
    'US'
    'WN'
    'XE'
    'YV'

Categorical arrays are also faster and more convenient than cell arrays of strings for indexing and searching. By converting to categorical arrays, you can then mathematically compare sets of strings just like you would do with numeric values. You can use this functionality to create a new table containing only the flights that left from Boston.

Creating a New Table

You can create a new table from a section of an existing table using parentheses with numerical indexing, variable names, or row names. Since the flight origin is now a categorical array, you can use logical indexing to find all flights that left from Boston.

idxBoston = FlightData.ORIGIN == 'BOS' ;
BostonFlights = FlightData(idxBoston,:);

height(FlightData)
height(BostonFlights)

ans =
       17816
ans =
        8904

Adding/Removing Variables

You can also modify your table by adding and removing variables and rows. All variables in a table must have the same number of rows, but they can be of different widths.

Let's add a new variable (DATE) to represent the serial date number for the various flight dates.

BostonFlights.DATE = datenum(BostonFlights.FL_DATE);

The origin now has all BOS values and you are not going to use destination information right now, so those variables can be removed. HOUR can also be calculated, as well as a LATE variable which indicates if the flight was 15 minutes late or more.

BostonFlights.ORIGIN = [];
BostonFlights.DEST = [];
BostonFlights.FL_DATE = [];

BostonFlights.HOUR = floor(BostonFlights.CRS_DEP_TIME./100);
BostonFlights{:,'LATE'} = BostonFlights.DEP_DELAY > 15;

Removing Missing Data

Tables have supported functions for finding and standardizing missing data. In this case, you can find any missing data using ismissing and remove it. You can use height, which gives you the number of table rows, to see how many flights were removed from the table. We exploit logical indexing to get only the flights that have no missing data.

height(BostonFlights)

ans =
        8904

TF = any(ismissing(BostonFlights),2);
BostonFlights = BostonFlights(~TF,:);
height(BostonFlights)

ans =
        8640

Summarizing a Table

You can then see descriptive statistics for each variable in this new table by using summary.

summary(BostonFlights)

Variables:
    CARRIER: 8640x1 categorical
        Values:
            9E      85     
            AA     913     
            AS      55     
            B6    1726     
            CO     321     
            DL    1215     
            F9      26     
            FL     536     
            MQ     751     
            OH     341     
            UA     643     
            US    1494     
            WN     384     
            XE      93     
            YV      57     
    CRS_DEP_TIME: 8640x1 double
        Values:
            min        500          
            median    1215          
            max       2359          
    DEP_TIME: 8640x1 double
        Values:
            min          2      
            median    1224      
            max       2400      
    DEP_DELAY: 8640x1 double
        Values:
            min       -25        
            median     -3        
            max       419        
    DATE: 8640x1 double
        Values:
            min       7.3414e+05
            median    7.3415e+05
            max       7.3417e+05
    HOUR: 8640x1 double
        Values:
            min        5    
            median    12    
            max       23    
    LATE: 8640x1 logical
        Values:
            true     1471  
            false    7169

Sorting Data

There are additional functions to sort tables, apply functions to table variables, and merge tables together. For example, you can sort your BostonFlights by departure delay.

BostonFlights = sortrows(BostonFlights,'DEP_DELAY','descend');
BostonFlights(1:10,:)

ans = 
    CARRIER    CRS_DEP_TIME    DEP_TIME    DEP_DELAY       DATE       HOUR
    _______    ____________    ________    _________    __________    ____
    DL         1830             129        419          7.3416e+05    18  
    DL         1850             111        381          7.3414e+05    18  
    CO         1755              10        375          7.3414e+05    17  
    AA         1710            2323        373          7.3416e+05    17  
    DL          630            1240        370          7.3416e+05     6  
    FL         1400            1951        351          7.3414e+05    14  
    FL         1741            2330        349          7.3414e+05    17  
    UA         1906              22        316          7.3414e+05    19  
    AA          840            1355        315          7.3416e+05     8  
    AA          905            1420        315          7.3414e+05     9  

    LATE 
    _____
    true 
    true 
    true 
    true 
    true 
    true 
    true 
    true 
    true 
    true

Applying Functions to Table Variables

You can apply functions to work with table variables, with varfun.

varfun has optional additional calling inputs such as 'InputVariables' and 'GroupingVariables'. 'InputVariables' lets you specific which variables you want to operate on instead of operating on all the variables in your table. 'GroupingVariables' let you define groups of rows on which to operate. varfun would then apply your function to each group of rows within each of the variables of your table, rather than to each entire variable.

You can use varfun to calculate the mean delay for all flights and the fraction of late flights for a given hour on a given day. The default output of varfun is a table.

ByHour = varfun(@mean, BostonFlights, ...
    'InputVariables', {'DEP_DELAY', 'LATE'},...
    'GroupingVariables',{'DATE','HOUR'});

disp(ByHour(1:5,:))

                   DATE       HOUR    GroupCount    mean_DEP_DELAY    mean_LATE
                __________    ____    __________    ______________    _________
    734139_5    7.3414e+05    5        5               3.8                  0  
    734139_6    7.3414e+05    6       19            10.579            0.31579  
    734139_7    7.3414e+05    7       18            6.7778            0.11111  
    734139_8    7.3414e+05    8       21            8.8571            0.33333  
    734139_9    7.3414e+05    9       17            5.0588            0.23529

Joining (Merging) Tables

Weather might have an important role in determining if a flight is delayed. For a given hour, you might want to know both the delayed flight information and the weather at the airport. So, you can start by reading in another table containing weather data for Boston Logan Airport. Then, you can merge that table with the existing ByHour table.

Since there are a lot of variables in this file, you can specify the input format when using readtable. This allows you to use * to skip variables that you aren't interested in loading into the table. For more information about specifying formating strings, see here in the documentation. Since this data uses 'M' to represent missing data, you can use 'TreatAsEmpty' to replace any instances of 'M' with the standard missing value indicator (NaN for numeric values).

FormatStr = ['%*s%s%f' repmat('%*s',1,9) '%f' repmat('%*s',1,7) '%f',...
             repmat('%*s',1,3),'%f', repmat('%*s',1,18)];

WeatherData = readtable('BostonWeather.txt','HeaderLines',6,...
                        'Format',FormatStr,'TreatAsEmpty','M');

WeatherData.Properties.VariableNames

ans = 
    'Date'    'Time'    'DryBulbCelsius'    'DewPointCelsius'    'WindSpeed'

WeatherData contains the date, time, dew point and dry bulb temperature in Celsius, and wind speed. Let's convert DATE to a serial date number and round to the hour for the time measurement.

WeatherData.DATE = datenum(WeatherData.Date,'yyyymmdd');
WeatherData.Date = [];
WeatherData.HOUR = floor(WeatherData.Time/100);

Since there are multiple weather measurements per hour, you can average the data by hour using varfun.

ByHourWeather = varfun(@mean, WeatherData, ...
    'InputVariables', {'DryBulbCelsius','DewPointCelsius','WindSpeed'},...
    'GroupingVariables',{'DATE','HOUR'});

Now, you can merge the two tables using join which matches rows using key variables (columns) common to both tables. join keeps all the variables from the first input table and appends the corresponding variables from the second input table. The table that join creates will use the key variable values as the row names. In this case, that means the row names will represent the date and hour data.

AllData = join(ByHour,ByHourWeather,'Keys',{'DATE','HOUR'});

Plotting Data From the Final Table

Let's now plot this final data set to get an idea of the effect of weather on the flight delays.

AllData.TDIFF =  ...
    abs(AllData.mean_DewPointCelsius - AllData.mean_DryBulbCelsius);

scatter(AllData.TDIFF, AllData.mean_DEP_DELAY,...
    [], AllData.mean_DEP_DELAY,'filled');
xlabel('abs(DewPoint-Temperature)')
ylabel('Average Departure Delay')

scatter3( AllData.HOUR,AllData.TDIFF,AllData.mean_DEP_DELAY,...
    [],AllData.mean_DEP_DELAY,'filled');
xlabel('Hour of Flight')
ylabel('abs(DewPoint-Temperature)')
zlabel('Average Departure Delay')

Qualitatively it looks having a temperature near the dew point (greater change of precipitation) effects the departure time. There are other factors at work as well, but it is nice to know that our intuition (flights later in the day and when it is cold and snowing might have a greater chance to be delayed) seems to work with the data.