Loren on the Art of MATLAB

Introduction to the New MATLAB Data Types in R2013b

Posted by Loren Shure,

Today I’d like to introduce a fairly frequent guest blogger, Sarah Wait Zaranek, who works for the MATLAB Marketing team here at The MathWorks. She and I will be writing about the new capabilities for MATLAB in R2013b. In particular, there are two new data types in MATLAB in R2013b – table and categorical arrays.

Contents

What are Tables and Categorical Arrays?

Table is a new data type suitable for holding heterogeneous data and metadata. Specifically, tables are useful for mixed-type tabular data that are often stored as columns in a text file or in a spreadsheet. Tables consist of rows and column-oriented variables. Categorical arrays are useful for holding categorical data - values drawn from a finite list of discrete categories.

One of the best ways to learn more about tables and categorical arrays is to see them in action. So, in this post, we will use tables and categoricals to examine some airplane flight delay data. The flight data is freely available from the Bureau of Transportation Statistics (BTS). You can download it yourself here. The weather data is from the National Climatic Data Center (NCDC) and is available here.

Importing Data into a Table

You can import your data into a table interactively using the Import Tool or you can do it programmatically, using readtable.

FlightData = readtable('Jan2010Flights.csv');

whos FlightData
  Name                Size              Bytes  Class    Attributes

  FlightData      17816x7             9007742  table              

The entire contents of the file are now contained in a single variable – a table. Here you are reading in your data from a CSV file. readtable also supports reading from .txt and .dat text files as well as Excel spreadsheet files. Tables can also be created directly from variables in your workspace.
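For instance, as a minimal sketch with made-up variables, you can pass same-height workspace variables straight to table, which takes its variable names from the input names:

```matlab
% Build a table directly from workspace variables (hypothetical data).
Carrier  = {'AA';'DL';'B6'};     % cell array of strings
DepDelay = [49; -7; -5];         % numeric column
T = table(Carrier, DepDelay);    % variable names taken from the inputs
```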

Looking at Variable Names (Column Names)

You can see all the variable names (column names in our table) by looking at the VariableNames property.

FlightData.Properties.VariableNames
ans = 
  Columns 1 through 5
    'FL_DATE'    'CARRIER'    'ORIGIN'    'DEST'    'CRS_DEP_TIME'
  Columns 6 through 7
    'DEP_TIME'    'DEP_DELAY'

This particular table does not contain any row names, but for a table with row names you can access the row names using the RowNames property.
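As a small sketch (with hypothetical row names), you could create such a table, read the names back through the RowNames property, and index a row by its name:

```matlab
% Create a table with row names, then use them for access.
T = table([49; -7],'VariableNames',{'DEP_DELAY'}, ...
          'RowNames',{'Flight1','Flight2'});
T.Properties.RowNames    % the row names of the table
T('Flight2',:)           % select a row by its name
```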

Accessing the Data in Your Table

There are multiple ways to access the data in your table. You can use dot indexing to access or modify a single table variable, similar to how you use field names in structures.

For example, using dot indexing you can plot a histogram of the departure delays (in minutes).

hist(FlightData.DEP_DELAY)
title('Histogram of Flight Delays in Minutes')

You can also display the first 5 departure delays.

FlightData.DEP_DELAY(1:5)
ans =
    49
    -7
    -5
    -8
   -10

You can also extract data from one or more variables in the table using curly braces. Within the curly braces you can use numeric indexing or variable and row names. For example, you can extract the actual departure times and scheduled departure times for the first 5 flights.

SomeTimes = FlightData{1:5,{'DEP_TIME','CRS_DEP_TIME'}};
disp(SomeTimes)
        1149        1100
        1053        1100
        1055        1100
        1052        1100
        1050        1100

This is similar to indexing with cell arrays. However, unlike with cells, this concatenates the specified variables into a single array. Therefore, the data types of all the specified variables need to be compatible for concatenation.

Converting Data to Categorical Arrays

You can convert some of the variables in your table using categorical. Categorical arrays are more memory efficient than cell arrays of strings when you have repeated data. Categorical arrays store only one copy of each category name, reducing the amount of memory required to store the array. You can use whos to see the amount of memory you save by converting the data to a categorical array.

whos FlightData
  Name                Size              Bytes  Class    Attributes

  FlightData      17816x7             9007742  table              

FlightData.ORIGIN = categorical(FlightData.ORIGIN);
FlightData.DEST = categorical(FlightData.DEST);
FlightData.CARRIER = categorical(FlightData.CARRIER);

whos FlightData
  Name                Size              Bytes  Class    Attributes

  FlightData      17816x7             2857264  table              

Using categories, you can find all the distinct categories in your array. By default, categorical arrays do not define a mathematical order among categories. If your data contains categories with a definite order, you can set the 'Ordinal' flag to true when creating your categorical array. By default the order is alphabetical, but you can prescribe your own order instead.
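As a sketch of an ordinal categorical (with made-up data): list the categories in the order you want, and set 'Ordinal' to true so that < and > compare values by that order.

```matlab
% Ordinal categorical: 'small' < 'medium' < 'large'.
sizes = categorical({'medium';'small';'large';'small'}, ...
                    {'small','medium','large'},'Ordinal',true);
sizes(1) > sizes(2)      % true, because 'medium' comes after 'small'
```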

categories(FlightData.CARRIER)
ans = 
    '9E'
    'AA'
    'AS'
    'B6'
    'CO'
    'DL'
    'F9'
    'FL'
    'MQ'
    'OH'
    'UA'
    'US'
    'WN'
    'XE'
    'YV'

Categorical arrays are also faster and more convenient than cell arrays of strings for indexing and searching. By converting to categorical arrays, you can then mathematically compare sets of strings just like you would do with numeric values. You can use this functionality to create a new table containing only the flights that left from Boston.

Creating a New Table

You can create a new table from a section of an existing table using parentheses with numerical indexing, variable names, or row names. Since the flight origin is now a categorical array, you can use logical indexing to find all flights that left from Boston.

idxBoston = FlightData.ORIGIN == 'BOS' ;
BostonFlights = FlightData(idxBoston,:);

height(FlightData)
height(BostonFlights)
ans =
       17816
ans =
        8904

Adding/Removing Variables

You can also modify your table by adding and removing variables and rows. All variables in a table must have the same number of rows, but they can be of different widths.
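To illustrate the widths point with a small sketch (hypothetical data): every variable must have the same number of rows, but a single table variable can itself be a matrix with several columns.

```matlab
% Three rows in every variable, but XY is itself 3-by-2.
T = table((1:3)', rand(3,2), 'VariableNames', {'ID','XY'});
size(T)       % 3 rows, 2 table variables
size(T.XY)    % the XY variable is 3-by-2
```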

Let's add a new variable (DATE) to represent the serial date number for the various flight dates.

BostonFlights.DATE = datenum(BostonFlights.FL_DATE);

The ORIGIN variable now contains only BOS values, and you are not going to use the destination information right now, so those variables can be removed. You can also calculate an HOUR variable, as well as a LATE variable that indicates whether the flight was more than 15 minutes late.

BostonFlights.ORIGIN = [];
BostonFlights.DEST = [];
BostonFlights.FL_DATE = [];

BostonFlights.HOUR = floor(BostonFlights.CRS_DEP_TIME./100);
BostonFlights{:,'LATE'} = BostonFlights.DEP_DELAY > 15;

Removing Missing Data

Tables have supporting functions for finding and standardizing missing data. In this case, you can find any missing data using ismissing and remove it. You can use height, which gives you the number of table rows, to see how many flights were removed from the table. We exploit logical indexing to keep only the flights that have no missing data.

height(BostonFlights)
ans =
        8904
TF = any(ismissing(BostonFlights),2);
BostonFlights = BostonFlights(~TF,:);
height(BostonFlights)
ans =
        8640

Summarizing a Table

You can then see descriptive statistics for each variable in this new table by using summary.

summary(BostonFlights)
Variables:
    CARRIER: 8640x1 categorical
        Values:
            9E      85     
            AA     913     
            AS      55     
            B6    1726     
            CO     321     
            DL    1215     
            F9      26     
            FL     536     
            MQ     751     
            OH     341     
            UA     643     
            US    1494     
            WN     384     
            XE      93     
            YV      57     
    CRS_DEP_TIME: 8640x1 double
        Values:
            min        500          
            median    1215          
            max       2359          
    DEP_TIME: 8640x1 double
        Values:
            min          2      
            median    1224      
            max       2400      
    DEP_DELAY: 8640x1 double
        Values:
            min       -25        
            median     -3        
            max       419        
    DATE: 8640x1 double
        Values:
            min       7.3414e+05
            median    7.3415e+05
            max       7.3417e+05
    HOUR: 8640x1 double
        Values:
            min        5    
            median    12    
            max       23    
    LATE: 8640x1 logical
        Values:
            true     1471  
            false    7169  

Sorting Data

There are additional functions to sort tables, apply functions to table variables, and merge tables together. For example, you can sort your BostonFlights by departure delay.

BostonFlights = sortrows(BostonFlights,'DEP_DELAY','descend');
BostonFlights(1:10,:)
ans = 
    CARRIER    CRS_DEP_TIME    DEP_TIME    DEP_DELAY       DATE       HOUR
    _______    ____________    ________    _________    __________    ____
    DL         1830             129        419          7.3416e+05    18  
    DL         1850             111        381          7.3414e+05    18  
    CO         1755              10        375          7.3414e+05    17  
    AA         1710            2323        373          7.3416e+05    17  
    DL          630            1240        370          7.3416e+05     6  
    FL         1400            1951        351          7.3414e+05    14  
    FL         1741            2330        349          7.3414e+05    17  
    UA         1906              22        316          7.3414e+05    19  
    AA          840            1355        315          7.3416e+05     8  
    AA          905            1420        315          7.3414e+05     9  

    LATE 
    _____
    true 
    true 
    true 
    true 
    true 
    true 
    true 
    true 
    true 
    true 

Applying Functions to Table Variables

You can apply functions to work with table variables, with varfun.

varfun has optional additional inputs such as 'InputVariables' and 'GroupingVariables'. 'InputVariables' lets you specify which variables you want to operate on instead of operating on all the variables in your table. 'GroupingVariables' lets you define groups of rows on which to operate. varfun then applies your function to each group of rows within each of the variables of your table, rather than to each entire variable.

You can use varfun to calculate the mean delay for all flights and the fraction of late flights for a given hour on a given day. The default output of varfun is a table.

ByHour = varfun(@mean, BostonFlights, ...
    'InputVariables', {'DEP_DELAY', 'LATE'},...
    'GroupingVariables',{'DATE','HOUR'});

disp(ByHour(1:5,:))
                   DATE       HOUR    GroupCount    mean_DEP_DELAY    mean_LATE
                __________    ____    __________    ______________    _________
    734139_5    7.3414e+05    5        5               3.8                  0  
    734139_6    7.3414e+05    6       19            10.579            0.31579  
    734139_7    7.3414e+05    7       18            6.7778            0.11111  
    734139_8    7.3414e+05    8       21            8.8571            0.33333  
    734139_9    7.3414e+05    9       17            5.0588            0.23529  

Joining (Merging) Tables

Weather might have an important role in determining if a flight is delayed. For a given hour, you might want to know both the delayed flight information and the weather at the airport. So, you can start by reading in another table containing weather data for Boston Logan Airport. Then, you can merge that table with the existing ByHour table.

Since there are a lot of variables in this file, you can specify the input format when using readtable. This allows you to use * to skip variables that you aren't interested in loading into the table. For more information about specifying formatting strings, see here in the documentation. Since this data uses 'M' to represent missing data, you can use 'TreatAsEmpty' to replace any instances of 'M' with the standard missing value indicator (NaN for numeric values).

FormatStr = ['%*s%s%f' repmat('%*s',1,9) '%f' repmat('%*s',1,7) '%f',...
             repmat('%*s',1,3),'%f', repmat('%*s',1,18)];

WeatherData = readtable('BostonWeather.txt','HeaderLines',6,...
                        'Format',FormatStr,'TreatAsEmpty','M');

WeatherData.Properties.VariableNames
ans = 
    'Date'    'Time'    'DryBulbCelsius'    'DewPointCelsius'    'WindSpeed'

WeatherData contains the date, time, dew point and dry bulb temperature in Celsius, and wind speed. Let's convert the date to a serial date number (stored in DATE) and round the time down to the hour.

WeatherData.DATE = datenum(WeatherData.Date,'yyyymmdd');
WeatherData.Date = [];
WeatherData.HOUR = floor(WeatherData.Time/100);

Since there are multiple weather measurements per hour, you can average the data by hour using varfun.

ByHourWeather = varfun(@mean, WeatherData, ...
    'InputVariables', {'DryBulbCelsius','DewPointCelsius','WindSpeed'},...
    'GroupingVariables',{'DATE','HOUR'});

Now, you can merge the two tables using join which matches rows using key variables (columns) common to both tables. join keeps all the variables from the first input table and appends the corresponding variables from the second input table. The table that join creates will use the key variable values as the row names. In this case, that means the row names will represent the date and hour data.

AllData = join(ByHour,ByHourWeather,'Keys',{'DATE','HOUR'});

Plotting Data From the Final Table

Let's now plot this final data set to get an idea of the effect of weather on the flight delays.

AllData.TDIFF =  ...
    abs(AllData.mean_DewPointCelsius - AllData.mean_DryBulbCelsius);

scatter(AllData.TDIFF, AllData.mean_DEP_DELAY,...
    [], AllData.mean_DEP_DELAY,'filled');
xlabel('abs(DewPoint-Temperature)')
ylabel('Average Departure Delay')
scatter3( AllData.HOUR,AllData.TDIFF,AllData.mean_DEP_DELAY,...
    [],AllData.mean_DEP_DELAY,'filled');
xlabel('Hour of Flight')
ylabel('abs(DewPoint-Temperature)')
zlabel('Average Departure Delay')

Qualitatively, it looks like having a temperature near the dew point (greater chance of precipitation) affects the departure delay. There are other factors at work as well, but it is nice to know that our intuition (flights later in the day, and when it is cold and snowing, might have a greater chance of being delayed) seems to agree with the data.

Your Thoughts?

Can you see yourself using tables and categorical arrays? Let us know what you think or if you have any questions by leaving a comment here.


Get the MATLAB code

Published with MATLAB® R2013b

30 Comments (Oldest to Newest)

@Cary —

I was preparing this comment as your question came in. Good timing.

These data types might seem familiar, especially if you are using the Statistics Toolbox. Tables might feel like datasets, and categorical arrays like ordinal/nominal arrays.

Table and categorical arrays are meant to replace dataset and nominal/ordinal arrays, providing access to this useful functionality in MATLAB without requiring a toolbox.

We recommend using table and categorical for all new work, and migrating existing usage of dataset, nominal, and ordinal to table and categorical over time. There are functions to help you with that transition (see table2dataset and dataset2table).

Great post as usual! I especially like how with tables we can still vectorize code seamlessly, as you did when calculating TDIFF, since this ability is harder to retain with other native types (e.g., structs). Anyway, the summary() function seems like it would also be useful for other general variables! (So you can get max/min/size/class etc at the prompt very quickly.)

For those of us with more CS backgrounds than statistics, would it be correct to say a Categorical is an Enumerated data type, or is there some nuance I’m missing?

I think these are great additions to the base MATLAB language and am looking forward to using them. I just wish that they were able to be ported to the last few releases, because it takes a while for our user base to fully migrate to a new release and we can’t make production use of new features until that happens :)

Tom, categorical is similar to (C, say) enumerations in some ways, but different in other ways.

* Categorical arrays allow you to choose meaningful names for discrete values, and avoid having to remember that the integer value 1 means ‘North America’, 2 means ‘Europe’, and so on.

* Categorical arrays allow you to choose whether or not your discrete values have a mathematical ordering where one category is “larger than” another.

* Categorical arrays _don’t_ require you to define the entire universe of possibilities in advance — you can add, remove, merge, or rename categories at any time. This is where their use in data exploration and cleaning becomes clear.

* This being MATLAB, categorical arrays are, well, _arrays_, and their relational operators are particularly useful for creating logical indices to pick out subsets of other data.

* “Enumerations”, to me, are often useful to define named constants to make code more readable or to strongly type a set of choices. Those are probably jobs better served by enumerations in MATLAB’s object system rather than categorical arrays. Categorical arrays are about data.

Hope this helps.

Nice article, thanks. Is there a recommended best practice for programmatically checking that the table class is supported in the currently running MATLAB version? For example,

>> exist('table','class')
ans = 8

looks like a good solution. However, in context I feel it’s not great to have to test against that literal numeric value. I would want to have a defined constant to test against, e.g. :

% Define constant somewhere prior..
ID_EXIST_CLASS = 8

if exist('table','class') ~= ID_EXIST_CLASS
    error('This version of MATLAB doesn''t support table');
end

The problem with this is having to repeat that ID_EXIST_CLASS definition in each table-dependent function. What would be good style in this case?

Any comments appreciated,
Thanks,
Brad

A couple of updates to Sarah and Loren’s blog post:

1) For those who want to download the data and try out tables for themselves, a caveat: If you go to the Bureau of Transportation Statistics website to get the flight data, you may find that the file you end up with is much larger than you expect, has a different set of fields, and has an unusual format. None of these prevent you from using it, but you’ll end up with a slightly different table than the one used in the post, and readtable will take longer than usual to read it in.

We’ll try to get simpler versions of the two files posted, but in the meantime, if you want to duplicate what you see in the post, I recommend the following:

* Set filters on the webform to download data only for Massachusetts, 2010, January.

* Use the checkboxes on the webform to select only the FlightDate, Carrier, Origin, Dest, CRSDepTime, DepTime, and DepDelay fields.

* Read the data into MATLAB using readtable with a format string that creates string variables for CRSDepTime and DepTime (they are quoted numbers in the file, which makes it difficult for readtable to infer the right format).

>> FlightData = readtable('Jan2010Flights.csv','Format','%s%q%q%q%q%q%f%f');

* Convert those two variables to numeric, and delete the spurious Var8 variable (which is due to the extra comma at the end of each line in the file).

>> FlightData.CRS_DEP_TIME = str2double(FlightData.CRS_DEP_TIME);
>> FlightData.DEP_TIME = str2double(FlightData.DEP_TIME);
>> FlightData.Var8 = [];

* For the weather data, you’ll want to select Boston Logan Airport (in Massachusetts) hourly data for January 2010 as CSV from the Unedited Local Climatological Data (ULCD). You may need to copy the data from your browser and paste to a text file.

2) There’s an option in readtable that allows you to convert those missing data codes (‘M’) in the weather data file as you read them into MATLAB, and create numeric variables directly: the ‘TreatAsEmpty’ parameter.

>> FormatStr = ['%*s%f%f' repmat('%*s',1,9) '%f' repmat('%*s',1,7) '%f',repmat('%*s',1,3),'%f', repmat('%*s',1,18)];
>> WeatherData = readtable('BostonWeather.txt','HeaderLines',6,'Format',FormatStr,'TreatAsEmpty','M');

‘Format’ specifies floating point for all five variables, and ‘TreatAsEmpty’ specifies that ‘M’ should be treated as an empty field in the file, which is then converted to NaN. This simplifies the process and avoids having to call standardizeMissing and str2double.

Just to say I am a big fan of the dataset & categorical class and have been making heavy use of it since its launch. Before today I saw pleas from users here at MATLAB Central for dataset to be included in base MATLAB, so tables should do the trick. However in my opinion it is disingenuous of TMW to launch tables as “new”, rather than a rebranding of datasets and an (entirely welcome) license change, while omitting to mention datasets in any of the release notes, videos or blogs to accompany the marketing.

As one example, the main doc page for dataset http://www.mathworks.co.uk/help/stats/dataset-arrays.html doesn’t have any reference to equivalent page for table. There is no discussion of when it is appropriate to use a dataset and when to use a table, or an equivalence / migration guide, or a mention of the future intention to deprecate datasets, which has a big impact on my code.

In future I would hope to see:
+ better cross-referencing between tables and datasets in the doc
+ forward guidance for Stats Toolbox users about the road-map for datasets now that we also have tables
+ no undocumented changes made to the class design for tables without remark in the release notes (as happened with datasets)
+ more functions in the Statistics Toolbox can work natively with datasets or tables, e.g. boxplot, parallelcoords just to name 2 off my head.

The only other point to ask is in these days of “Big Data” whether a handle version of table/dataset would be useful? My datasets tend to be large, but making a small change to one variable in a dataset, or just changing the metadata, means copying the entire table and passing large frames on the stack.

Can you comment on improvement in terms of features and performance relative to stat. box datasets for analyzing (multi-indexed) paneldata? Currently, stat. dataset arrays compare rather poorly to R DataFrames and pandas.DataFrame in terms of features and performance. An improvement in this regard would greatly boost our case for not abandoning Matlab, so any comment would be appreciated!

@Brad Stiritz

The easiest way that comes to mind is the function verLessThan. Since tables are new in MATLAB 8.2 (R2013b), you would use the following to determine if the current version supports tables or not:

verLessThan('MATLAB','8.2')

If it comes back true, then the current version is older than R2013b and does not support tables. There may be a way to directly check the availability of the class as you suggested, but nothing comes to mind at the moment.

HTH,
Adam

@Julian

Glad you are happy with the introduction of tables into MATLAB. However, we really didn’t mean to be disingenuous/sneaky. For new users and users who had never used MATLAB or the Statistics Toolbox, we thought it would be really confusing and unnecessary to explain about datasets. And they are new in MATLAB and new to a lot of people. We definitely aren’t trying to hide anything. We agree with your request for additional documentation, support for tables, etc. I am going to quote Peter since he gave a great answer to a related question on MATLAB Answers –

“Generally speaking, these new data types should look and feel very familiar to anyone who has used the ones in the Statistics Toolbox. One obvious difference is that they are included as part of core MATLAB, and you don’t need to install the Statistics Toolbox to use them. In addition, their design and terminology makes them a bit more accessible for non-statistical uses, though they remain just as useful for statistics.

Tables and categorical arrays are ultimately intended as replacements for dataset, nominal, and ordinal arrays, and we recommend that MATLAB users adopt them for new work. We also recommend that, over time, users update any of their existing code that uses dataset/nominal/ordinal, but we don’t expect that that changeover can happen immediately. Upcoming releases will provide more details and strategies for making the transition.

In R2013b, all of the Statistics Toolbox functionality that uses nominal and ordinal arrays also supports the new categorical arrays. In R2013b, you’ll still need to use dataset arrays in the Statistics Toolbox for things like LinearModel and (new in R2013b) LinearMixedModel, but you might consider creating tables and converting to dataset only when needed, using table2dataset.”

Hope this helps.

Cheers,
Sarah

@Peter

Thank you. We updated the post to show the easier way of dealing with the ‘M’s for missing data.

Also, I just wanted to let everyone know that I wasn’t trying to obscure anything, just trying to simplify the post enough to keep it from getting way too long and going into too many details. I will see if I can figure out a way to post the files that I used or create a script that downloads them for you automatically.

Cheers,
Sarah

This looks really useful, wondering if it supports dynamic field naming? Is it possible to convert a structure to this new type?

@Pieter

It’s hard to say anything very specific without knowing specifically what you are trying to do and what problems you’ve run into. Tables are very similar to datasets. Some important differences are the new curly brace subscripting behaviors, and the new varfun and rowfun methods for computing on tables in various ways. Often times people want to fit mixed effects models on panel data, and the Statistics Toolbox has that functionality (in the (very) short term, you’ll have to use table2dataset to call fitlme).

You can expect that functionality based around tables will grow in future releases, both in MATLAB itself, and in the Statistics Toolbox and other toolboxes.

If you haven’t done so already, it would be helpful (for you and for us) for you to contact MATLAB support to let us know what features you’re looking for and what performance issues you’ve run into.

Congratulations on the introduction of tables into MATLAB. I’m a longtime user and fan of the dataset array class. I haven’t tried out 2013B yet so have no first hand experience, but some suggestions/requests for features in the future, if they are not already implemented:

(1) A rewrite/efficiency improvement of the underlying subsref method would be fantastic. My biggest complaint with the dataset array class is the slowness associated with indexing using “.” notation in a for loop or the like. Not all operations can be vectorized in a loop and anytime I have to set values something like this:

for n = 2:N
ds.var1(n) = foo(n,ds.var1(n-1));
end

The performance hit is painful to the point that I have to use a local variable:

var1 = ds.var1;
for n = 2:N
var1(n) = foo(n,var1(n-1));
end
ds.var1 = var1;

Not a big deal, but it adds considerable bloat to my code. A small performance hit is inevitable and understandable; however, the hit in the current implementation (in dataset arrays as of R2012a, at least) is considerable.

(2) How are the new tables displayed when using something like cell mode publishing? For my own work, I store formatting information similar to what is used in sprintf (%s, %0.4f, etc.) in the dataset array properties.units. I have written a series of utilities like “dataset2html” that will take an arbitrary dataset array and convert it to an html table with each column formatted appropriately. This is very useful when one column represents a percentage (multiply values by 100 and add a % sign), dollar amounts (insert commas every 3rd digit and add a $ sign), etc. Default formats are used based on the class of each column if not specified. More generally, this could be quite useful for the “disp” method for the tables within the command window, variable editor, etc. It doesn’t need to be restricted to publishing.

(3) Please please please, allow and plan for the idea that users will subclass off of the table class to create their own custom types. I currently have built a fairly elaborate time series class on top of dataset arrays with dozens of custom methods. It is much better suited for my work than the MATLAB time series class. I plan on converting this to use tables as the base class upon upgrade.

Keep it coming!

Thanks

@angela, you might mean one of two things by “dynamic field naming”.

1) You can add and remove variables from a table, and rename them as shown in the blog post, all at run time. So in that sense, variables in a table, and their names, are completely dynamic.
2) You can subscript a table as

t.(varName)

where varName is a string containing a variable name. That’s the same thing as for structure arrays. In addition, tables allow

t.(varIndex)

where varIndex is the scalar containing the index of a variable.

Yes, there are struct2table, cell2table, and array2table functions. You can check out the on-line doc:

http://www.mathworks.com/help/matlab/tables.html

@owr, thanks for the kind words.

1) Consider doing this instead:

function x = vectorizedFoo(x)
for n = 2:length(x)
x(n) = foo(n,x(n-1));
end

and then

ds.var1 = vectorizedFoo(ds.var1);

This encapsulates the operation in a function, as if it were in a toolbox, and solves your performance problem. Obviously, vectorizedFoo is a stupid name.

2) Interesting idea, I will make a note to have the team responsible for publishing consider your use case. I encourage you to submit your function to the file exchange.

3) Initially, table is a sealed class:

http://www.mathworks.com/help/matlab/matlab_oop/control-allowed-subclasses.html

I know this is not what you wanted to hear, but it is not a permanent situation. It allows us to ship the “public interface” to tables while letting the internals bake for a while longer.

How can I simply add a new variable with all its entries empty (or NaN, or zeros, or whatever)?

All examples in the docs and in this post only involve initialising new variables based on old variables.

The new table datatype looks great. However, why didn’t we get an updated uitable that can accept this new datatype directly? It seems like it would have been fairly easy to implement. Uitable desperately needs some usability enhancements, but this would have been a start.

Sven, you would do that in the same way that you would initialize a variable in the workspace: create it using ZEROS or NAN or whatever, and assign it into the table. Whether or not you actually need to do that is another question. In my experience, hardly ever.

Shad, I’ve posted a reply to your similar question on MATLAB Answers:

http://www.mathworks.com/matlabcentral/answers/86924-r2013b-uitable-can-t-accept-data-from-new-table-data-type

In short, table2cell or brace subscripting lets you pass a table to a uitable.
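As a sketch (assuming a table t whose contents you want to display): pull the data out of the table and hand it to uitable, since uitable in R2013b does not accept a table directly.

```matlab
% Pass table contents to a uitable via table2cell.
f = figure;
uitable(f, 'Data', table2cell(t), ...
           'ColumnName', t.Properties.VariableNames);
% If all variables are numeric and compatible for concatenation,
% t{:,:} works as the 'Data' value as well.
```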

As to why not accept a table directly, there is never enough time to do everything. I’m sure you can relate to that in your own work. You can expect more functionality to grow around tables in future releases.

Thanks Peter. My use case is where you’re using the table as the storage for some new incrementally updating new variable.

So, step 1: make a table.

load tetmesh.mat
t = array2table(X,'VariableNames',{'x' 'y' 'z'});

Step 2: initialise a new variable in the table:

t.norm = nan % This errors
t.norm(:) = nan % This somehow sets all t.norm(i) entries to an empty double
[t.norm(:)] = deal(nan) % this does the same as above
t.norm = t.x; t.norm(:) = nan; % this does what I want but seems round-about
t.norm = nan(size(t.x)) % This also does what I want but like above it must reference an already-present variable

I was just wondering if there was a convenient syntax that basically takes two parameters (name of the new variable, default value for the new variable), and then assigns the default value to all rows of the table.

My personal preference would be the simplest:

t.newVariable = defaultValue

At the moment this syntax errors, but I think it would be more convenient (and more “MATLAB-like”) than something like:

t.newVariable = defaultValue * ones(size(t.oldVariable))
or
t.newVariable = defaultValue + zeros(height(t),1)
or
t.newVariable = t.oldVariable; t.newVariable(:) = defaultValue

Would there be any drawbacks with implementing that kind of syntax?

Oh, and I realise that in my example above I could do something in one line such as:

t.norm = sqrt(t.x.^2+t.y.^2+t.z.^2)

But my intention is really something that matches the situation where you’d want every row initialised to some default value, and then (i.e., in Step 3) you will loop through the rows to perform a more complex calculation in order to update all (or at least some) rows.

Sven, I’m guessing that your X is just an Nx3 matrix. So:

>> t = array2table(randn(5,3),'VariableNames',{'x' 'y' 'z'})
t =
x y z
_______ ________ _________
0.53767 -1.3077 -1.3499
1.8339 -0.43359 3.0349
-2.2588 0.34262 0.7254
0.86217 3.5784 -0.063055
0.31877 2.7694 0.71474

The following doesn’t work, because it tries to create a new variable that’s not the same height as the existing table:

>> t.norm = NaN
Error using table/subsasgnDot (line 136)
To assign to or create a variable in a table, the number of rows must match the height of the table.

It’s possible that this could “scalar expand” the NaN, and create a 5×1 column of NaNs, but there are likely to be unintended side effects of that. We’d have to consider all of those.

The following says, “create a new variable and assign NaN to all of its elements”:

>> t.norm(:) = NaN
t =
x y z norm
_______ ________ _________ ____________
0.53767 -1.3077 -1.3499 [1x0 double]
1.8339 -0.43359 3.0349 [1x0 double]
-2.2588 0.34262 0.7254 [1x0 double]
0.86217 3.5784 -0.063055 [1x0 double]
0.31877 2.7694 0.71474 [1x0 double]

What’s happened is that the new variable started out as 5×0, and NaN was assigned into all of its elements, but since it had no elements, it remained empty. The [1x0 double] you see is sort of a display artifact; there’s not much other way to display 5 rows of an empty matrix. But norm is a 5×0 array.

Here’s what you can do to get the scalar expansion that you’re looking for — assign to the first column of the new variable:

>> t.norm(:,1) = NaN
t =
x y z norm
_______ ________ _________ ____
0.53767 -1.3077 -1.3499 NaN
1.8339 -0.43359 3.0349 NaN
-2.2588 0.34262 0.7254 NaN
0.86217 3.5784 -0.063055 NaN
0.31877 2.7694 0.71474 NaN

In my experience, most (almost all?) of the time when you need to create a new variable like this, you’re going to want NaN or zero, and so the assignment

>> t.norm = NaN(height(t),1)

does the trick. I will still suggest that you’re better off doing the sort of thing you mentioned in your followup, but I recognize that that may not always be possible.

Hope this helps.

For large datasets using tables, is pre-allocation needed? Would it be like structures, where you allocate field names but not the actual data size?

Angela, there’s no way to answer that question without knowing what you’re doing. If you would have preallocated workspace variables, then you should consider doing the same thing in a table. Normally you’d do that only if you are doing a lot of scalar assignments to grow a variable incrementally, and tables are going to work best if instead you do large vectorized operations. For that, you likely do not need to create anything in advance.

You can create a 0xN table that “preallocates” the variable names and types, but not sizes. You can create an MxN table that “preallocates” all three. Whether you really need to do that is another question.
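For instance, a minimal sketch of the MxN case (the variable names and row count here are made up for illustration):

```matlab
% "Preallocate" a 5-row table: names, types, and sizes all fixed up front.
n = 5;
t = table(nan(n,1), zeros(n,1,'int32'), false(n,1), ...
          'VariableNames', {'score' 'count' 'flag'});
% Rows can then be filled in place, e.g. t.score(3) = 1.5;
```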

“You can create a 0xN table that “preallocates” the variable names and types, but not sizes.”

How is this possible? A bit more on preallocation in the documentation would be helpful. These datatypes may be useful as class properties for databases; however, they need to be initialised with ‘empty’ variables.

Cheers

Jack, it’s the same as if you were creating empty variables in the workspace. So, for example:

>> t = table(zeros(0,1),zeros(0,1,'int32'),false(0,1))
t =
empty 0-by-3 table
>> summary(t)
Variables:
Var1: 0×1 double
Var2: 0×1 int32
Var3: 0×1 logical

HOWEVER, in my experience, needing to do something like this is an indication that you may be doing something the wrong way. It’s never a good idea to grow variables in MATLAB; here you’d be growing three variables.

Not entirely sure what you mean by “class properties for databases”, so you may be doing something perfectly valid.

These postings are the author's and don't necessarily represent the opinions of MathWorks.