Loren on the Art of MATLAB

Turn ideas into MATLAB

This is machine translation

Translated by Microsoft
Mouseover text to see original. Click the button below to return to the English version of the page.

New Ways to Arrange and Plot Data in Tables 6

Posted by Loren Shure,

Today I'd like to introduce a guest blogger, Stephen Doe, who works for the MATLAB Documentation team here at MathWorks. In today's post, Stephen shows us new functions for displaying, arranging, and plotting data in tables and timetables.

Contents

Tables, Then and Now

In R2013b, MATLAB® introduced the table data type, as a convenient container for column-oriented data. And in R2016b, MATLAB introduced the timetable data type, which is a table that has timestamped rows.

From the beginning, these data types offered advantages over cell arrays and structures. But over the course of several releases, the table and graphics development teams have added many new functions for tables and timetables. These functions add convenient ways to display and arrange tabular data. Also, they offer new ways to make plots or charts directly from tables, without the intermediate step of peeling out variables. As of R2018b, MATLAB boasts many new functions to help you make more effective use of tables and timetables.

Read Table and Display First Few Rows

To begin, I will use the readtable function to read data from a sample file that ships with MATLAB. The file outages.csv contains simulated data for electric power outages over a period of 12 years in the United States. The call to readtable returns a table, T, with six variables and 1468 rows, so I will suppress the output using a semicolon.

T = readtable('outages.csv');

One typical way to examine the data in a large table is to display the first few rows of the table. You can use indexing to access a subset of rows (and/or a subset of variables, for that matter). For example, this syntax returns the first three rows of T.

T(1:3,:)
ans =
  3×6 table
      Region          OutageTime        Loss     Customers     RestorationTime         Cause     
    ___________    ________________    ______    __________    ________________    ______________
    'SouthWest'    2002-02-01 12:18    458.98    1.8202e+06    2002-02-07 16:50    'winter storm'
    'SouthEast'    2003-01-23 00:49    530.14    2.1204e+05                 NaT    'winter storm'
    'SouthEast'    2003-02-07 21:15     289.4    1.4294e+05    2003-02-17 08:14    'winter storm'

I have a confession to make: I have written many table examples, using that syntax. And occasionally, I still catch myself starting with code like T(3,:), which accesses only one row.

Happily, in R2016b we added the head function to return the top rows of a table. Here's the call to return the first three rows using the head function.

head(T,3)
ans =
  3×6 table
      Region          OutageTime        Loss     Customers     RestorationTime         Cause     
    ___________    ________________    ______    __________    ________________    ______________
    'SouthWest'    2002-02-01 12:18    458.98    1.8202e+06    2002-02-07 16:50    'winter storm'
    'SouthEast'    2003-01-23 00:49    530.14    2.1204e+05                 NaT    'winter storm'
    'SouthEast'    2003-02-07 21:15     289.4    1.4294e+05    2003-02-17 08:14    'winter storm'

Similarly, tail returns the bottom rows of a table. (If you do not specify the number of rows, then head and tail return eight rows.)

Move, Add, and Delete Table Variables

After examining your table, you might find that you want to organize your table by moving related variables next to each other. For example, in T you might want to move Region and Cause so that they are together.

One way to move table variables is by indexing. But if you use indexing and want to keep all the variables, then you must specify them all in order, as shown in this syntax.

T = T(:,{'OutageTime','Loss','Customers','RestorationTime','Region','Cause'})

You also can use numeric indices. While more compact, this syntax is less readable.

T = T(:,[2:5 1 6])

When your table has many variables, it is awkward to move variables using indexing. Starting in R2018a, you can use the movevars function instead. Using movevars, you only have to specify the variables of interest. Move the Region variable so it is before Cause.

T = movevars(T,'Region','Before','Cause');
head(T,3)
ans =
  3×6 table
       OutageTime        Loss     Customers     RestorationTime       Region           Cause     
    ________________    ______    __________    ________________    ___________    ______________
    2002-02-01 12:18    458.98    1.8202e+06    2002-02-07 16:50    'SouthWest'    'winter storm'
    2003-01-23 00:49    530.14    2.1204e+05                 NaT    'SouthEast'    'winter storm'
    2003-02-07 21:15     289.4    1.4294e+05    2003-02-17 08:14    'SouthEast'    'winter storm'

It is also likely that you want to add data to your table. For example, let's calculate the duration of the power outages in T. Specify the format to display the duration in days.

OutageDuration = T.RestorationTime - T.OutageTime;
OutageDuration.Format = 'dd:hh:mm:ss';

It is easy to add OutageDuration to the end of a table using dot notation.

T.OutageDuration = OutageDuration;

However, you might want to add it at another location in T. In R2018a, you can use the addvars function. Add OutageDuration so that it is after OutageTime.

T = addvars(T,OutageDuration,'After','OutageTime');
head(T,3)
ans =
  3×7 table
       OutageTime       OutageDuration     Loss     Customers     RestorationTime       Region           Cause     
    ________________    ______________    ______    __________    ________________    ___________    ______________
    2002-02-01 12:18     06:04:32:00      458.98    1.8202e+06    2002-02-07 16:50    'SouthWest'    'winter storm'
    2003-01-23 00:49             NaN      530.14    2.1204e+05                 NaT    'SouthEast'    'winter storm'
    2003-02-07 21:15     09:10:59:00       289.4    1.4294e+05    2003-02-17 08:14    'SouthEast'    'winter storm'

Now, let's remove RestorationTime. You can easily remove variables using dot notation and an empty array.

T.RestorationTime = [];

However, in R2018a there is also a function to remove table variables. To remove RestorationTime, use the removevars function.

T = removevars(T,'RestorationTime');
head(T,3)
ans =
  3×6 table
       OutageTime       OutageDuration     Loss     Customers       Region           Cause     
    ________________    ______________    ______    __________    ___________    ______________
    2002-02-01 12:18     06:04:32:00      458.98    1.8202e+06    'SouthWest'    'winter storm'
    2003-01-23 00:49             NaN      530.14    2.1204e+05    'SouthEast'    'winter storm'
    2003-02-07 21:15     09:10:59:00       289.4    1.4294e+05    'SouthEast'    'winter storm'

Convert to Timetable

If your table contains dates and times in a datetime array, you can easily convert it to a timetable using the table2timetable function. In this example, table2timetable converts the values in OutageTime to row times. Row times are time stamps that label the rows of a timetable.

TT = table2timetable(T);
head(TT,3)
ans =
  3×5 timetable
       OutageTime       OutageDuration     Loss     Customers       Region           Cause     
    ________________    ______________    ______    __________    ___________    ______________
    2002-02-01 12:18     06:04:32:00      458.98    1.8202e+06    'SouthWest'    'winter storm'
    2003-01-23 00:49             NaN      530.14    2.1204e+05    'SouthEast'    'winter storm'
    2003-02-07 21:15     09:10:59:00       289.4    1.4294e+05    'SouthEast'    'winter storm'

When you display a timetable, it looks very similar to a table. One important difference is that a timetable has fewer variables than you might expect by glancing at the display. TT has five variables, not six. The vector of row times, OutageTime, is not considered a timetable variable, since its values label the rows. However, you can still access the row times using dot notation, as in T.OutageTime. You can use the vector of row times as an input argument to a function. For example, you can use it as the x-axis of a plot.

The row times of a timetable do not have to be ordered. If you want to be sure that the rows of a timetable are sorted by the row times, use the sortrows function.

TT = sortrows(TT);
head(TT,3)
ans =
  3×5 timetable
       OutageTime       OutageDuration     Loss     Customers       Region           Cause     
    ________________    ______________    ______    __________    ___________    ______________
    2002-02-01 12:18     06:04:32:00      458.98    1.8202e+06    'SouthWest'    'winter storm'
    2002-03-05 17:53     04:20:48:00      96.563    2.8666e+05    'MidWest'      'wind'        
    2002-03-16 06:18     02:17:05:00      186.44    2.1275e+05    'MidWest'      'severe storm'

Make Stacked Plot of Variables

Now I will show you why I converted T to a timetable. Starting in R2018b, you can plot the variables of a table or timetable in a stacked plot. In a stacked plot, the variables are plotted in separate y-axes, but using a common x-axis. And if you make a stacked plot from a timetable, the x-values are the row times.

To plot the variables of TT, use the stackedplot function. The function plots variables that can be plotted (such as numeric, datetime, and categorical arrays) and ignores variables that cannot be plotted. stackedplot also returns properties of the stacked plot as an object that allows customization of the stacked plot.

s = stackedplot(TT)
s = 
  StackedLineChart with properties:

         SourceTable: [1468×5 timetable]
    DisplayVariables: {'OutageDuration'  'Loss'  'Customers'}
               Color: [0 0.4470 0.7410]
           LineStyle: '-'
           LineWidth: 0.5000
              Marker: 'none'
          MarkerSize: 6

  Use GET to show all properties

One thing you can tell right away from this plot is that there must be a few timetable rows with bad data. There is one point for a power outage that supposedly lasted for over 9,000 days (or 24 years), which would mean it ended some time in the 2040s.

Convert Variables in Place

The stackedplot function ignored the Region and Cause variables, because these variables are cell arrays of character vectors. You might want to convert these variables to a different, and more useful, data type. While you can convert variables one at a time, there is now a more convenient way to convert all table variables of a specified data type.

Starting in R2018b, you can convert table variables in place using the convertvars function. For example, identify all the cell arrays of character vectors in TT (using iscellstr) and convert them to categorical arrays. Now Region and Cause contain discrete values assigned to categories. Categorical values are displayed without any quotation marks.

TT = convertvars(TT,@iscellstr,'categorical');
head(TT,3)
ans =
  3×5 timetable
       OutageTime       OutageDuration     Loss     Customers      Region         Cause    
    ________________    ______________    ______    __________    _________    ____________
    2002-02-01 12:18     06:04:32:00      458.98    1.8202e+06    SouthWest    winter storm
    2002-03-05 17:53     04:20:48:00      96.563    2.8666e+05    MidWest      wind        
    2002-03-16 06:18     02:17:05:00      186.44    2.1275e+05    MidWest      severe storm

Plots of Discrete Data

If your table or timetable has variables with values that belong to a finite set of discrete categories, then there are other interesting plots that you can make. Starting in R2017a, you can make a heat map of any two variables that contain discrete values using the heatmap function. For example, make a heat map of the Region and Cause variables to visualize where and why outages occur. Again, heatmap returns an object so you can customize the plot.

h = heatmap(TT,'Region','Cause')
h = 
  HeatmapChart (Count of Cause vs. Region) with properties:

      SourceTable: [1468×5 timetable]
        XVariable: 'Region'
        YVariable: 'Cause'
    ColorVariable: ''
      ColorMethod: 'count'

  Use GET to show all properties

You also can make a pie chart of any categorical variable (as of R2014b), using the pie function. However, you cannot call pie on a table. So, to make a pie chart of the power outages by region, use dot notation to access the Region variable.

pie(TT.Region)

Other Functions to Rearrange or Join Tables

MATLAB also has other functions to reorganize variables in more complex ways, and to join tables. I won't show them all in action, but I will describe some of them briefly. All these functions work with both tables and timetables.

R2018a includes functions to:

  • Reorient rows to become variables (rows2vars)

And from the original release of tables in R2013b, there are functions to:

Tabled for Discussion

Let's table discussion of these new functions for now. But we are eager to hear about your reactions to them. Do they help you make more effective use of tables and timetables? Please let us know here.


Get the MATLAB code

Published with MATLAB® R2018b

6 CommentsOldest to Newest

Sam Boulton replied on : 1 of 6
I'm using tables all the time in my work, but I'm curious as whether there are any rules of thumb to decide upon when to make a variable categorical or not. At what stage does the ratio of independent entries to length of categorical array have an impact on performance? Ideally, I would like to code up some rules to translate a table's variable to categorical or not to optimise performance and memory footprint. Do you have any insights? Thanks, Sam
@Sam- You get memory benefits very quickly with Categorical. The cases where you may not are if you've got a huge number of categories and many of them single instances only. Anything you do with the categorical should be faster than doing it with the equivalent character array. I think the performance conditions where you may not want to use categorical are fairly pathological. Perhaps I'm missing something you are considering? --Loren
Stephen Doe replied on : 3 of 6
@Sam – To answer your question, I think it’s useful to take a step back and first think about the data types that are suitable for storing your data. I think if you pick the right data type, then it is probably unnecessary to craft a rule of thumb for converting to categorical when you have “enough” repeated values in your data. Since I am a doc writer, let me paraphrase the documentation: categorical arrays store values that belong to a finite set of discrete categories. (Though keep in mind that "finite" does not mean fixed–you can add and delete categories from the array.) So categorical is a very useful data type when values can belong only to specified categories and no others. For example, suppose one variable in your table is a set of colors, and those colors can only be “red”, “green”, or “blue”. Then categorical is an excellent choice. For one thing, it prevents you from entering values that are misspelled, or just don’t belong, such as “yellow” – you have to add a new category before you can insert an element whose category is “yellow”. And if your categorical array is large and the number of categories is small, then it also lets you store and manipulate elements of the array efficiently. Much more efficiently than a cell array of character vectors. But suppose you have an array whose elements don’t belong to categories, but do happen to contain many repeated values. For example, suppose one of your variables is a list of surnames. It might happen that some surnames are frequently repeated in that list. But as you add more rows to your table, there’s no reason why you couldn’t add a row that contains a new surname, one that has not yet appeared anywhere in your table. In that case, I don’t think categorical is a particularly good choice. It’s going to be a pain to add a new category just to capture a new surname. If you have R2016b or later, then consider using a string array instead. The string data type is a homogeneous data type, designed to store multiple pieces of text. If the list of surnames has many repeated names, then a string array also gets you memory benefits. It is more efficient than a cell array. And you can also use relational operators to compare strings. You can store a string array as a table variable. It might give you everything you were looking for in a categorical array. To sum up, I’d be reluctant to create a rule to convert data to categorical (or any other type) only in particular circumstances. Instead, I think your code will be simpler if you pick the data types that are right for your data, and then use those types consistently in your table and code. I hope that answer is helpful. Please get in touch if you have any other comments or questions. – Stephen
bushra raza replied on : 4 of 6
Hi, i am a big fan of utilizing timetables in Matlab. very good work done Matlab Experts. i am eager to learn would these new ways be helpful for my project. please let me tell you something of my university project. i have two data sets of high water level values : Observed data set in the form of uni-variate time series. (1961-2016) and Simulated time series data set in nc files. (1971-2098) i have successfully extracted the two time series in Timetables. now i have two time tables, each having one time series data. [ each table has only timestamp and one value variable column] i need to check how well the simulated data represents the observed data. or whether the simulated data represent the past values of observed data well? along with the trend in these time series. please guide me how can i extract these findings from the timetables ? looking forward to your reply eagerly.. Mrs.Raza
bushra raza replied on : 5 of 6
Hi, i want to ask about stackedplot function. if i have different timetables extracted using timerange from one big timetable , and i need to plot these extracted data , then could stackedplot be used and how? Mrs.Raza
Stephen Doe replied on : 6 of 6
@Mrs. Raza – From your comments, I take it that you have two timetables with different time vectors. One timetable has observed data points. The other timetable has simulated data points – simulated at times that are different from the times in the first timetable. You want to compare simulated data points to observed data points. My advice is to use the synchronize function to put both sets of data points into one timetable. Since the two input timetables have different time vectors, you will have to modify at least one set of data points. My advice is to modify the simulated data points. The observed data points are actual measurements, so I would not modify them. Suppose the two timetables are TTobs (for observed) and TTsim (for simulated). Here's how you can call synchronize, using the times from TTobs as the time vector that goes in the output timetable. You can specify linear interpolation as the method used to synchronize the data from the input timetables to one time vector.
TT = synchronize(TTobs,TTsim,TTobs.Time,'linear')
If linear interpolation won't do, then you can specify some other method, such as spline or cubic interpolation. This call to synchronize should modify only your simulated data. The data points that were in TTobs should be unchanged, because those points already corresponded to the times in TTobs.Time. If you want to be absolutely sure that the data in TTobs remains untouched, then another way to do this is to call the retime function on TTsim only, using the times in TTobs.Time. Then you can horizontally concatenate the timetables together.
TTsim2 = retime(TTsim,TTobs.Time,'linear');
TT = [TTobs TTsim2]
Here I am doing the exact same thing that I did in my previous call to the synchronize function. I'm just doing it in two steps instead of one. Now there is one timetable that contains observed and simulated data points corresponding to the same time vector. This is when I would call some statistical function to check whether the simulated data points match the observed data points well. Without knowing more about your data, I'm not sure which function would be appropriate. MATLAB comes with some useful functions you will find described in the documentation for Descriptive Statistics. Or it might be appropriate to use the Statistics and Machine Learning Toolbox. Finally, you can use the stackedplot function to display observed and simulated data together.
TT = synchronize(TTobs,TTsim,TTobs.Time,'linear');
s = stackedplot(TT)
The stackedplot function plots on a common x-axis. But that x-axis does not have to be the times from TTobs.Time. Suppose you don't want to modify the simulated data points, and you want all the times from both TTobs and TTsim. To do that, call synchronize so that the output timetable has a time vector that is the union of the times from TTobs and TTsim. And instead of specifying an interpolation method, you can let synchronize fill gaps in the output timetable with NaNs. When you plot the resulting timetable, stackedplot ignores the NaNs. (You also can change the marker and line style to show points, instead of connecting them with lines.)
TT = synchronize(TTobs,TTsim,'union');
s = stackedplot(TT);
s.LineStyle = 'none';
s.Marker = '+'
I hope these small examples are helpful. Please get back in touch if you have any other comments or questions. – Stephen

Add A Comment

Your email address will not be published. Required fields are marked *

Preview: hide