Loren on the Art of MATLAB

Turn ideas into MATLAB

Ways to Perform Calculations on Groups of Data in Tables

Posted by Loren Shure,

Today I'd like to introduce a guest blogger, Stephen Doe, who works for the MATLAB Documentation team here at MathWorks. Stephen will discuss different methods for grouping data, and performing calculations on those groups, using Tables.

Contents

Create Table

Tables are convenient containers for column-oriented data. The variables in a table can have different data types, but must have the same number of rows. You can access table data by row, by variable, or by variable name. And you can specify groups within table variables, to perform calculations on those groups. Today I will show several different ways to apply functions to groups in tables.

First, let's create a table. You can create a table using the table function, the Import Tool, or the readtable function. I'm going to use readtable to read data from a sample file that ships with MATLAB. The file outages.csv contains simulated data for electric power outages over a period of 12 years in the United States. Read the table. To display the first five rows, use table indexing.

T = readtable('outages.csv','Format','%C%D%f%f%D%C');
T(1:5,:)
ans =
  5×6 table
     Region         OutageTime        Loss     Customers     RestorationTime          Cause     
    _________    ________________    ______    __________    ________________    _______________
    SouthWest    2002-02-01 12:18    458.98    1.8202e+06    2002-02-07 16:50    winter storm   
    SouthEast    2003-01-23 00:49    530.14    2.1204e+05    NaT                 winter storm   
    SouthEast    2003-02-07 21:15     289.4    1.4294e+05    2003-02-17 08:14    winter storm   
    West         2004-04-06 05:44    434.81    3.4037e+05    2004-04-06 06:10    equipment fault
    MidWest      2002-03-16 06:18    186.44    2.1275e+05    2002-03-18 23:23    severe storm   

There are six columns in the file. Since I want to specify the data types of the table variables, I use the Format name-value pair argument. T has two categorical variables (specified with %C), two datetime variables (%D) and two numeric variables (%f).

Perform Calculation on Each Numeric Variable

Once data is in a table, you can perform calculations on different table variables. One way is to use dot notation to refer to table variables by name and perform calculations. For example, calculate the total number of customers affected by power outages using the sum function, omitting any NaN values that might be in T.Customers.

totalCustomers = sum(T.Customers,'omitnan')
totalCustomers =
    1.903e+08

You can also apply a function to every table variable using the varfun function. varfun is particularly useful for applying a function and returning the results in another table.
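
As a quick aside (a minimal sketch, not from the original post), you can apply a type-agnostic function such as numel to every variable; the result is a one-row table whose entries all equal the number of rows in T.

% Apply numel to every variable in T. The output variables are
% named numel_Region, numel_OutageTime, and so on.
counts = varfun(@numel,T)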

However, you might want to apply a function only to variables of a particular type. You can specify the type of variable with the vartype function. Here I'm going to subscript into T for the numeric variables and calculate the mean of each one. And since the variables have some NaN values, I'm going to wrap mean in an anonymous function so I can use the 'omitnan' flag.

T2 = T(:,vartype('numeric'));
omean = @(x) mean(x,'omitnan');
meanT = varfun(omean,T2)
meanT =
  1×2 table
    Fun_Loss    Fun_Customers
    ________    _____________
    563.89      1.6693e+05   

Perform Calculation on Groups in Numeric Variables

A more interesting view of the table is to group the data, and start looking for differences between the groups. For example, what is the difference in the mean power loss between the Northeast and Southwest regions?

To answer this question, you can start by specifying one or more table variables as grouping variables. The grouping variables define groups into which you split the other table variables. Then you can apply a function to each group within each table variable and collect the results. For more on grouping variables, see Grouping Variables to Split Data.

You can use grouping variables with the varfun function I described above. Specify Region as the grouping variable in this table. It's a categorical variable, making it especially convenient to use as a grouping variable. Specify Loss and Customers as the input variables, and calculate the mean values by region for these two variables. I'm going to use the same omean function I specified above.

meanByRegion = varfun(omean,T,'GroupingVariables','Region',...
                      'InputVariables',{'Loss','Customers'})
meanByRegion =
  5×4 table
     Region      GroupCount    Fun_Loss    Fun_Customers
    _________    __________    ________    _____________
    MidWest      142           1137.7      2.4015e+05   
    NorthEast    557           551.65      1.4917e+05   
    SouthEast    389           495.35      1.6776e+05   
    SouthWest     26           493.88      2.6975e+05   
    West         354           433.37      1.5201e+05   

Another way to apply a grouping variable to the data in a table variable is to use the accumarray function. If I convert T.Region to a numeric array, and extract T.Loss from T, then I can calculate mean power loss per region using accumarray, just as I did with varfun.

regions = double(T.Region);
A = accumarray(regions,T.Loss,[],omean)
A =
       1137.7
       551.65
       495.35
       493.88
       433.37

accumarray operates on arrays. So to use accumarray on a table variable, you must extract it from the table. Also, the grouping variable must be numeric, and the output argument is an array, not a table. While you can use accumarray on table variables, it lacks many of the conveniences that varfun provides.
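
If you want the region names alongside the accumarray output, one way (a sketch, relying on the fact that double applied to a categorical array numbers the groups in the order returned by categories) is to build a small table around A.

% Pair each mean loss with its region name. double(T.Region) numbers
% the groups in the order given by categories(T.Region), so the rows
% of A line up with regionNames.
regionNames = categories(T.Region);
lossByRegion = table(regionNames,A,'VariableNames',{'Region','MeanLoss'})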

You can specify more than one grouping variable using varfun. For example, I'll specify Region and Cause as grouping variables. The groups are defined as combinations of regions and causes.

meanByRegionAndCause = varfun(omean,T,'GroupingVariables',{'Region','Cause'},...
                              'InputVariables',{'Loss','Customers'});
meanByRegionAndCause(1:5,:)
ans =
  5×5 table
    Region          Cause          GroupCount    Fun_Loss    Fun_Customers
    _______    ________________    __________    ________    _____________
    MidWest    attack              12                 0               0   
    MidWest    energy emergency    19            536.09           57603   
    MidWest    equipment fault      9            343.37           29704   
    MidWest    severe storm        31            1055.7      4.3584e+05   
    MidWest    thunder storm       32            941.07      1.3301e+05   

Multiple Variables as Input Arguments to Function

Another simple calculation is the mean duration of the power outages by region. The basis of this calculation is the difference between the OutageTime and RestorationTime table variables.

You might think to use varfun for this calculation too. The difficulty is that varfun applies a function to each table variable separately. But this calculation requires both variables. In other words, we want to apply a function that requires two input arguments.

The solution is to use the rowfun function. With rowfun, the table variables are passed to the applied function as separate input arguments. By default it passes all the variables, but you can select a subset with the 'InputVariables' argument. If rowfun passes N variables, then the function you apply must accept N input arguments.

To perform this calculation, I'll start by defining an anonymous function that subtracts one input argument from the other, and then calculates the mean. Then I will calculate the mean duration of power outages by region using rowfun.

meanDiff = @(a,b) mean(abs(a-b),'omitnan');
meanOutageTimes = rowfun(meanDiff,T,'GroupingVariables','Region',...
                         'InputVariables',{'OutageTime','RestorationTime'})
meanOutageTimes =
  5×3 table
     Region      GroupCount      Var3   
    _________    __________    _________
    MidWest      142           819:14:44
    NorthEast    557           581:02:18
    SouthEast    389            40:49:49
    SouthWest     26            59:31:07
    West         354           673:27:12

Note that OutageTime and RestorationTime are datetime variables. You can perform arithmetic with datetime variables, just as you can with numeric variables.
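
For example (a small aside, not part of the original analysis), subtracting one datetime array from another produces a duration array, which is what rowfun displays above in hours:minutes:seconds form.

% Subtracting two datetime arrays yields a duration array.
outageDuration = T.RestorationTime - T.OutageTime;
class(outageDuration)   % 'duration'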

Tables store metadata, such as variable names and descriptions, in a property called Properties. To rename the Var3 variable, access the VariableNames property through Properties and change the name there.

meanOutageTimes.Properties.VariableNames{'Var3'} = 'mean_OutageTime';
meanOutageTimes(1:5,:)
ans =
  5×3 table
     Region      GroupCount    mean_OutageTime
    _________    __________    _______________
    MidWest      142           819:14:44      
    NorthEast    557           581:02:18      
    SouthEast    389            40:49:49      
    SouthWest     26            59:31:07      
    West         354           673:27:12      

Use Groups in Multiple Calculations and Tabulate Results

varfun and rowfun are suitable when you want to perform the same calculation with the table variables. But what if you want to calculate different quantities? For example, you might want to calculate the mean power loss, but also the minimum duration of power outages, by region. In that case, you can perform the calculations and tabulate the results using the findgroups and splitapply functions.

These functions work a little differently from varfun and rowfun. The findgroups function returns numeric indices that correspond to the grouping variables you specify. As a second output, it also returns a table of the groups. Then you can call splitapply to apply a function to each group within the variables.

For example, I'll calculate the mean power loss, the minimum outage time, and the maximum number of customers affected, by region. I will specify the groups once using findgroups. Then I can use the groups in multiple calls to splitapply.

I'm also going to switch gears a little bit, by removing all rows with NaN or missing values. I'll use the rmmissing function to remove these rows, so that I don't have to specify the 'omitnan' flag in every function call.

T = rmmissing(T);
[G,results] = findgroups(T(:,'Region'));
meanLoss = splitapply(@mean,T.Loss,G);
minOutageTime = splitapply(@min,T.RestorationTime - T.OutageTime,G);
maxCustomers = splitapply(@max,T.Customers,G)
maxCustomers =
    3.295e+06
   5.9689e+06
   2.2249e+06
   1.8202e+06
     4.26e+06

The output of splitapply is an array. maxCustomers is a 5-by-1 numeric array. However, you can tabulate the results by adding them as variables to the table results, the second output from findgroups.

results.meanLoss = meanLoss;
results.minOutageTime = minOutageTime;
results.maxCustomers = maxCustomers
results =
  5×4 table
     Region      meanLoss    minOutageTime    maxCustomers
    _________    ________    _____________    ____________
    MidWest      907.19      00:00:00          3.295e+06  
    NorthEast    383.86      00:00:00         5.9689e+06  
    SouthEast    508.34      00:00:00         2.2249e+06  
    SouthWest    541.66      00:28:00         1.8202e+06  
    West         429.73      00:00:00           4.26e+06  

Perform Calculations on Tall Table

So far, I've worked with a table that is small enough to fit into memory. But what if there is so much data that it won't all fit into memory? I might still want to treat it as a table and use the same functions.

That can easily be done by creating a tall table. A tall table is a kind of tall array in MATLAB. Tall arrays and tall tables know how to read the data one chunk at a time, perform the calculations you want, and then gather up the output at the end. The syntax for working with a tall table is very similar to the syntax for working with an ordinary table.

For more information on tall arrays and tall tables, see Tall Arrays.

To show how this works, I'll create a tall table out of outages.csv. Of course it's a bit silly to create a tall table out of a file that fits in memory, but this illustrates how to work with a much larger table.

Instead of calling readtable, I'll start by calling datastore to read outages.csv into a datastore.

Then I'll call tall on ds, to create a tall table out of the datastore. Now I have a tall table that looks and acts very much like the table I worked with in the previous sections. One difference you see is that the tall table is displayed as an M-by-6 table, showing that the number of rows is not yet known.

ds = datastore('outages.csv');
T = tall(ds)
Starting parallel pool (parpool) using the 'local' profile ...
connected to 2 workers.
T =
  M×6 tall table
      Region          OutageTime        Loss     Customers     RestorationTime           Cause      
    ___________    ________________    ______    __________    ________________    _________________
    'SouthWest'    2002-02-01 12:18    458.98    1.8202e+06    2002-02-07 16:50    'winter storm'   
    'SouthEast'    2003-01-23 00:49    530.14    2.1204e+05    NaT                 'winter storm'   
    'SouthEast'    2003-02-07 21:15     289.4    1.4294e+05    2003-02-17 08:14    'winter storm'   
    'West'         2004-04-06 05:44    434.81    3.4037e+05    2004-04-06 06:10    'equipment fault'
    'MidWest'      2002-03-16 06:18    186.44    2.1275e+05    2002-03-18 23:23    'severe storm'   
    'West'         2003-06-18 02:49         0             0    2003-06-18 10:54    'attack'         
    'West'         2004-06-20 14:39    231.29           NaN    2004-06-20 19:16    'equipment fault'
    'West'         2002-06-06 19:28    311.86           NaN    2002-06-07 00:51    'equipment fault'
    :              :                   :         :             :                   :
    :              :                   :         :             :                   :

Now I can use findgroups and splitapply to perform some of the same calculations as before. Again I will wrap mean in an anonymous function so I can ignore NaN values, as I did above.

omean = @(x) mean(x,'omitnan');
T.Region = categorical(T.Region);
T.Cause = categorical(T.Cause);
G = findgroups(T.Region);
meanLoss = splitapply(omean,T.Loss,G);
meanCustomers = splitapply(omean,T.Customers,G)
meanCustomers =
  M×1 tall double column vector
    ?
    ?
    ?
    :
    :

Instead of seeing the output, we see a trail of question marks. What does this mean? Simply that I have not performed any calculations yet. At this point, I have only defined the calculations I want to perform on the tall table. No calculations are performed until I gather the results, by calling the gather function. The output of gather is not a tall array and must fit into memory.

meanLoss = gather(meanLoss);
meanCustomers = gather(meanCustomers)
Evaluating tall expression using the Parallel Pool 'local':
- Pass 1 of 4: Completed in 4 sec
- Pass 2 of 4: Completed in 2 sec
- Pass 3 of 4: Completed in 2 sec
- Pass 4 of 4: Completed in 2 sec
Evaluation completed in 25 sec
Evaluating tall expression using the Parallel Pool 'local':
- Pass 1 of 1: Completed in 2 sec
Evaluation completed in 5 sec
meanCustomers =
   2.4015e+05
   1.4917e+05
   1.6776e+05
   2.6975e+05
   1.5201e+05

Now let's tabulate the results in a table. Since meanLoss and meanCustomers are not tall arrays, results is not a tall table. Add the regions as another table variable.

results = table(meanLoss,meanCustomers)
region = unique(T.Region);
results.region = gather(region)
results =
  5×2 table
    meanLoss    meanCustomers
    ________    _____________
    1137.7      2.4015e+05   
    551.65      1.4917e+05   
    495.35      1.6776e+05   
    493.88      2.6975e+05   
    433.37      1.5201e+05   
Evaluating tall expression using the Parallel Pool 'local':
- Pass 1 of 1: Completed in 1 sec
Evaluation completed in 2 sec
results =
  5×3 table
    meanLoss    meanCustomers     region  
    ________    _____________    _________
    1137.7      2.4015e+05       MidWest  
    551.65      1.4917e+05       NorthEast
    495.35      1.6776e+05       SouthEast
    493.88      2.6975e+05       SouthWest
    433.37      1.5201e+05       West     

Turning the Tables on You

Here I have shown you several ways you can group data in tables for calculations. To summarize, you can use one or more grouping variables to specify groups within the other table variables. Then you can apply functions to the groups in the table variables.

To apply:

  • A function to a single table variable and return an array, use accumarray.
  • The same function to each table variable and return a table, use varfun.
  • A function that requires all the table variables as input arguments and return a table, use rowfun.
  • Different functions to different table variables and build a table of results, use findgroups and splitapply.

How about you? Have you done analysis with tables where accumarray, varfun, rowfun, or findgroups and splitapply might have helped? Do you have cases where you need tall tables? Let us know here.



Published with MATLAB® R2017a


4 Comments

Richard Oberdieck replied on : 1 of 4

Hi,
Thank you very much for your post on tables! I am using them to some extent right now and find them very useful! One question I have, though, is about performance for very large data sets: is there a noticeable difference between rowfun/varfun approaches with tables and bsxfun/arrayfun approaches for simple arrays/matrices?
Thanks
Richard

Stephen Doe replied on : 2 of 4

Hi Richard,

Thank you for your comment! I am glad you find tables to be useful.

It is difficult to make a definitive statement about differences in performance, since the functions were implemented with different goals in mind. For example, arrayfun applies a function to the elements of an input array, one element at a time. When performing ungrouped calculations, varfun and rowfun act on entire table variables. For grouped calculations, the performance of varfun and rowfun also depends on the number of groups you specify. Performance will depend on the calculations you plan to make, the size of the data set, and the number of groups.

If you want to perform grouped calculations on numeric arrays, then findgroups and splitapply, or accumarray, are the most suitable functions, and other things being equal, performance is likely to be comparable to varfun or rowfun.

To perform calculations on table variables, probably your best route is to write your code using varfun or rowfun (whichever is more appropriate), and then test your code on moderately-sized data sets. If you have a case where varfun or rowfun seem to cause poor performance, please contact technical support at MathWorks for assistance.
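
As a rough illustration of that advice (just a sketch; timings depend on your data, your machine, and the number of groups), you can compare grouped-mean approaches on the in-memory outages table with timeit, reusing T and omean from the post above.

% Compare three ways to compute the grouped mean of Loss by Region.
% timeit runs each function handle several times and returns a typical time.
G = findgroups(T.Region);
tVarfun = timeit(@() varfun(omean,T,'GroupingVariables','Region', ...
                            'InputVariables','Loss'));
tSplit  = timeit(@() splitapply(omean,T.Loss,G));
tAccum  = timeit(@() accumarray(double(T.Region),T.Loss,[],omean));
[tVarfun tSplit tAccum]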

Best Regards,

Stephen Doe

Jamie Hetherington replied on : 3 of 4

Great article, team! It seems like there are lots of great ways to manipulate and perform calculations on data in tables with these functions!

I hope it is ok to post a question here, as I have a question about attempting to use varfun on a 101×9 table. I am trying to use the function 'FindMin' on a table, specifying the input variables, to get the min value and index of the specific variables, which are column vectors. I know you can get a summary of the min, max and median, but I'm interested in where the value occurs as well.

FindMin = @(M,I) min(x);
AllMins = varfun(FindMin,NormData,'InputVariables','LLatTrunkFlexion2Norm','RLatTrunkFlexion2Norm','MPV1Norm','MPV2Norm});

I can't pass the input variables from the table into varfun, as I get the error below. I'm sure it's got something to do with the FindMin function, but I'm relatively new to MATLAB. I'm sure there might be an easy fix and would welcome any suggestions!

Error using tabular/varfun>dfltErrHandler (line 412)
Applying the function '@(M,I)min(x)' to the variable 'LLatTrunkFlexion2Norm' generated the
following error:

Invalid data type. First argument must be numeric or logical.

Error in tabular/varfun>@(s,varargin)dfltErrHandler(grouped,funName,s,varargin{:})
(line 191)
errHandler = @(s,varargin) dfltErrHandler(grouped,funName,s,varargin{:});

Error in tabular/varfun (line 354)
b_data{jvar} = errHandler(s,a_data{jj});

smd137 replied on : 4 of 4

Hi Jamie,

Thanks for the feedback, and I’m glad you found this post useful.

I have a couple of observations to make about your use of the varfun function. First, the error message you cite means that the variable 'LLatTrunkFlexion2Norm' is not a numeric variable. Since your FindMin function calls the min function, you need to have varfun apply FindMin to the numeric variables only.

You can use varfun and isnumeric to apply FindMin to the numeric variables, as in this call:

AllMins = varfun(FindMin,NormData,'InputVariables',@isnumeric);

If you have table variables that should be numeric but are some other data type, then you can convert them to numeric variables using the appropriate function. For example, if the variable 'LLatTrunkFlexion2Norm' is text, you can use the str2double function to convert it to a numeric variable. Then you can call varfun on the table that contains the converted variables.
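
For instance (hypothetical; this assumes the variable really is stored as text):

% Hypothetical conversion, assuming LLatTrunkFlexion2Norm contains text.
NormData.LLatTrunkFlexion2Norm = str2double(NormData.LLatTrunkFlexion2Norm);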

Second, as you have written it in your comment, FindMin accepts two input arguments. But from your description, I think you want the two output arguments that the min function can return. However, you cannot instruct varfun to return multiple output arguments from the applied function. So one way to approach this problem is to call varfun twice. The first time, call it using min, to get the minima of the table variables. The second time, call it using a function that finds the locations of the minima. In all likelihood, you would need to write a function of your own that gets the second output from min and returns just that. The result is two tables that you can horizontally concatenate if you want the minima and locations in one table, as in the sketch below.
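
Here is a sketch of that two-call approach (hypothetical; it uses find in an anonymous function instead of min's second output, since an anonymous function cannot capture multiple outputs, and it reuses the variable names from your comment):

% Minimum of each numeric variable, and the row where it first occurs.
minIdx  = @(x) find(x == min(x),1);
minVals = varfun(@min,NormData,'InputVariables',@isnumeric);
minLocs = varfun(minIdx,NormData,'InputVariables',@isnumeric);
% varfun prefixes the output names differently (min_* versus Fun_*),
% so the two one-row tables can be concatenated side by side.
AllMins = [minVals minLocs]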

An alternative approach might be to try the rowfun function, which can return multiple outputs from the applied function. In that case, you would write FindMin as a function that accepts N inputs (as you have N table variables), and that returns 2*N outputs (the minimum for each variable, and its location).

A final note of caution: if you group the table variables, you will get misleading results from either approach. The reason is that if you group the variables, then you will find minima and their locations within the groups, not within the table as a whole. If you wish to index back into the table using locations, then the locations within groups are of no use. Keep this information in mind if you later decide that you need to group your data.

I hope my comments give you some helpful guidance. Please contact MathWorks if you find you need further assistance.

Best Regards,

Stephen Doe