{"id":2365,"date":"2017-08-09T08:38:46","date_gmt":"2017-08-09T13:38:46","guid":{"rendered":"https:\/\/blogs.mathworks.com\/loren\/?p=2365"},"modified":"2017-07-18T12:07:48","modified_gmt":"2017-07-18T17:07:48","slug":"ways-to-perform-calculations-on-groups-of-data-in-tables","status":"publish","type":"post","link":"https:\/\/blogs.mathworks.com\/loren\/2017\/08\/09\/ways-to-perform-calculations-on-groups-of-data-in-tables\/","title":{"rendered":"Ways to Perform Calculations on Groups of Data in Tables"},"content":{"rendered":"\r\n\r\n<div class=\"content\"><!--introduction--><p>Today I'd like to introduce a guest blogger, Stephen Doe, who works for the MATLAB Documentation team here at MathWorks. Stephen will discuss different methods for grouping data, and performing calculations on those groups, using <a href=\"https:\/\/www.mathworks.com\/help\/matlab\/tables.html\">Tables<\/a>.<\/p><!--\/introduction--><h3>Contents<\/h3><div><ul><li><a href=\"#fcb46ba3-17ed-4617-aa8a-497e381acfb4\">Create Table<\/a><\/li><li><a href=\"#b448eb87-f56d-4071-8bfc-77ae644cf463\">Perform Calculation on Each Numeric Variable<\/a><\/li><li><a href=\"#e506c6e7-cdcf-4e1d-829b-6cd36ee9e23b\">Perform Calculation on Groups in Numeric Variables<\/a><\/li><li><a href=\"#2648e355-23d8-42e9-ba4d-1392062b5bd9\">Multiple Variables as Input Arguments to Function<\/a><\/li><li><a href=\"#ea16dba3-9059-45b1-94d0-5ec240d1ecb9\">Use Groups in Multiple Calculations and Tabulate Results<\/a><\/li><li><a href=\"#8107ab57-c003-46ce-a17e-1a5674335742\">Perform Calculations on Tall Table<\/a><\/li><li><a href=\"#f926ecee-77fd-435f-b9c5-a61001a1cdbd\">Turning the Tables on You<\/a><\/li><\/ul><\/div><h4>Create Table<a name=\"fcb46ba3-17ed-4617-aa8a-497e381acfb4\"><\/a><\/h4><p>Tables are convenient containers for column-oriented data. The variables in a table can have different data types, but must have the same number of rows. You can access table data by row, by variable, or by variable name. And you can specify groups within table variables, to perform calculations on those groups. Today I will show several different ways to apply functions to groups in tables.<\/p><p>First, let's create a table. You can create a table using the <tt><a href=\"https:\/\/www.mathworks.com\/help\/matlab\/ref\/table.html\">table<\/a><\/tt> function, the <a href=\"https:\/\/www.mathworks.com\/help\/matlab\/ref\/importtool-app.html\">Import Tool<\/a>, or the <tt><a href=\"https:\/\/www.mathworks.com\/help\/matlab\/ref\/readtable.html\">readtable<\/a><\/tt> function. I'm going to use <tt>readtable<\/tt> to read data from a sample file that ships with MATLAB. The file <tt>outages.csv<\/tt> contains simulated data for electric power outages over a period of 12 years in the United States. Read the table. To display the first five rows, use table indexing.<\/p><pre class=\"codeinput\">T = readtable(<span class=\"string\">'outages.csv'<\/span>,<span class=\"string\">'Format'<\/span>,<span class=\"string\">'%C%D%f%f%D%C'<\/span>);\r\nT(1:5,:)\r\n<\/pre><pre class=\"codeoutput\">ans =\r\n  5&times;6 table\r\n     Region         OutageTime        Loss     Customers     RestorationTime          Cause     \r\n    _________    ________________    ______    __________    ________________    _______________\r\n    SouthWest    2002-02-01 12:18    458.98    1.8202e+06    2002-02-07 16:50    winter storm   \r\n    SouthEast    2003-01-23 00:49    530.14    2.1204e+05    NaT                 winter storm   \r\n    SouthEast    2003-02-07 21:15     289.4    1.4294e+05    2003-02-17 08:14    winter storm   \r\n    West         2004-04-06 05:44    434.81    3.4037e+05    2004-04-06 06:10    equipment fault\r\n    MidWest      2002-03-16 06:18    186.44    2.1275e+05    2002-03-18 23:23    severe storm   \r\n<\/pre><p>There are six columns in the file. Since I want to specify the data types of the table variables, I use the <tt>Format<\/tt> name-value pair argument. <tt>T<\/tt> has two categorical variables (specified with <tt>%C<\/tt>), two datetime variables (<tt>%D<\/tt>) and two numeric variables (<tt>%f<\/tt>).<\/p><h4>Perform Calculation on Each Numeric Variable<a name=\"b448eb87-f56d-4071-8bfc-77ae644cf463\"><\/a><\/h4><p>Once data is in a table, you can perform calculations on different table variables. One way is to use dot notation to refer to table variables by name and perform calculations. For example, calculate the total number of customers affected by power outages using the <tt>sum<\/tt> function, omitting any <tt>NaN<\/tt> values that might be in <tt>T.Customers<\/tt>.<\/p><pre class=\"codeinput\">totalCustomers = sum(T.Customers,<span class=\"string\">'omitnan'<\/span>)\r\n<\/pre><pre class=\"codeoutput\">totalCustomers =\r\n    1.903e+08\r\n<\/pre><p>You also can apply a function to every table variable, using the <tt>varfun<\/tt> function. <tt><a href=\"https:\/\/www.mathworks.com\/help\/matlab\/ref\/varfun.html\">varfun<\/a><\/tt> is particularly useful for applying a function and returning the results in another table.<\/p><p>However, you might want to apply a function only to variables of a particular type. You can specify the type of variable with the <tt><a href=\"https:\/\/www.mathworks.com\/help\/matlab\/ref\/vartype.html\">vartype<\/a><\/tt> function. Here I'm going to subscript into <tt>T<\/tt> for the numeric variables and calculate the mean of each one. And since the variables have some <tt>NaN<\/tt> values, I'm going to wrap <tt>mean<\/tt> in an anonymous function so I can use the <tt>'omitnan'<\/tt> flag.<\/p><pre class=\"codeinput\">T2 = T(:,vartype(<span class=\"string\">'numeric'<\/span>));\r\nomean = @(x) mean(x,<span class=\"string\">'omitnan'<\/span>);\r\nmeanT = varfun(omean,T2)\r\n<\/pre><pre class=\"codeoutput\">meanT =\r\n  1&times;2 table\r\n    Fun_Loss    Fun_Customers\r\n    ________    _____________\r\n    563.89      1.6693e+05   \r\n<\/pre><h4>Perform Calculation on Groups in Numeric Variables<a name=\"e506c6e7-cdcf-4e1d-829b-6cd36ee9e23b\"><\/a><\/h4><p>A more interesting view of the table is to group the data, and start looking for differences between the groups. For example, what is the difference in the mean power loss between the Northeast and Southwest regions?<\/p><p>To answer this question, you can start by specifying one or more table variables as <i>grouping variables<\/i>. The grouping variables define groups into which you split the other table variables. Then you can apply a function to each group within each table variable and the results. For more on grouping variables, see <a href=\"https:\/\/www.mathworks.com\/help\/matlab\/matlab_prog\/grouping-variables-for-splitting-data.html\">Grouping Variables to Split Data<\/a>.<\/p><p>You can use grouping variables with the <tt>varfun<\/tt> function I described above. Specify <tt>Region<\/tt> as the grouping variable in this table. It's a categorical variable, making it especially convenient to use as a grouping variable. Specify <tt>Loss<\/tt> and <tt>Customers<\/tt> as the input variables, and calculate the mean values by region for these two variables. I'm going to use the same <tt>omean<\/tt> function I specified above.<\/p><pre class=\"codeinput\">meanByRegion = varfun(omean,T,<span class=\"string\">'GroupingVariables'<\/span>,<span class=\"string\">'Region'<\/span>,<span class=\"keyword\">...<\/span>\r\n                      <span class=\"string\">'InputVariables'<\/span>,{<span class=\"string\">'Loss'<\/span>,<span class=\"string\">'Customers'<\/span>})\r\n<\/pre><pre class=\"codeoutput\">meanByRegion =\r\n  5&times;4 table\r\n     Region      GroupCount    Fun_Loss    Fun_Customers\r\n    _________    __________    ________    _____________\r\n    MidWest      142           1137.7      2.4015e+05   \r\n    NorthEast    557           551.65      1.4917e+05   \r\n    SouthEast    389           495.35      1.6776e+05   \r\n    SouthWest     26           493.88      2.6975e+05   \r\n    West         354           433.37      1.5201e+05   \r\n<\/pre><p>Another way to apply a grouping variable to the data in a table variable is to use the <tt><a href=\"https:\/\/www.mathworks.com\/help\/matlab\/ref\/accumarray.html\">accumarray<\/a><\/tt> function. If I convert <tt>T.Region<\/tt> to a numeric array, and extract <tt>T.Loss<\/tt> from <tt>T<\/tt>, then I can calculate mean power loss per region using <tt>accumarray<\/tt>, just as I did with <tt>varfun<\/tt>.<\/p><pre class=\"codeinput\">regions = double(T.Region);\r\nA = accumarray(regions,T.Loss,[],omean)\r\n<\/pre><pre class=\"codeoutput\">A =\r\n       1137.7\r\n       551.65\r\n       495.35\r\n       493.88\r\n       433.37\r\n<\/pre><p><tt>accumarray<\/tt> operates on arrays. So to use <tt>accumarray<\/tt> on a table variable, you must extract it from the table. Also, the grouping variable must be numeric, and the output argument is an array, not a table. While you can use <tt>accumarray<\/tt> on table variables, it lacks many of the conveniences that <tt>varfun<\/tt> provides.<\/p><p>You can specify more than one grouping variable using <tt>varfun<\/tt>. For example, I'll specify <tt>Region<\/tt> and <tt>Cause<\/tt> as grouping variables. The groups are defined as combinations of regions and causes.<\/p><pre class=\"codeinput\">meanByRegionAndCause = varfun(omean,T,<span class=\"string\">'GroupingVariables'<\/span>,{<span class=\"string\">'Region'<\/span>,<span class=\"string\">'Cause'<\/span>},<span class=\"keyword\">...<\/span>\r\n                              <span class=\"string\">'InputVariables'<\/span>,{<span class=\"string\">'Loss'<\/span>,<span class=\"string\">'Customers'<\/span>});\r\nmeanByRegionAndCause(1:5,:)\r\n<\/pre><pre class=\"codeoutput\">ans =\r\n  5&times;5 table\r\n    Region          Cause          GroupCount    Fun_Loss    Fun_Customers\r\n    _______    ________________    __________    ________    _____________\r\n    MidWest    attack              12                 0               0   \r\n    MidWest    energy emergency    19            536.09           57603   \r\n    MidWest    equipment fault      9            343.37           29704   \r\n    MidWest    severe storm        31            1055.7      4.3584e+05   \r\n    MidWest    thunder storm       32            941.07      1.3301e+05   \r\n<\/pre><h4>Multiple Variables as Input Arguments to Function<a name=\"2648e355-23d8-42e9-ba4d-1392062b5bd9\"><\/a><\/h4><p>Another simple calculation is the mean duration of the power outages by region. The basis of this calculation is the difference between the <tt>OutageTime<\/tt> and <tt>RestorationTime<\/tt> table variables.<\/p><p>You might think to use <tt>varfun<\/tt> for this calculation too. The difficulty is that <tt>varfun<\/tt> applies a function to each table variable separately. But this calculation requires both variables. In other words, we want to apply a function that requires two input arguments.<\/p><p>The solution is to use the <tt><a href=\"https:\/\/www.mathworks.com\/help\/matlab\/ref\/rowfun.html\">rowfun<\/a><\/tt> function. With <tt>rowfun<\/tt>, all the table variables are taken as input arguments. If the table has <tt>N<\/tt> variables, then the function you apply must accept <tt>N<\/tt> input arguments.<\/p><p>To perform this calculation, I'll start by defining an anonymous function that subtracts one input argument from the other, and then calculates the mean. Then I will calculate the mean duration of power outages by region using <tt>rowfun<\/tt>.<\/p><pre class=\"codeinput\">meanDiff = @(a,b) mean(abs(a-b),<span class=\"string\">'omitnan'<\/span>);\r\nmeanOutageTimes = rowfun(meanDiff,T,<span class=\"string\">'GroupingVariables'<\/span>,<span class=\"string\">'Region'<\/span>,<span class=\"keyword\">...<\/span>\r\n                         <span class=\"string\">'InputVariables'<\/span>,{<span class=\"string\">'OutageTime'<\/span>,<span class=\"string\">'RestorationTime'<\/span>})\r\n<\/pre><pre class=\"codeoutput\">meanOutageTimes =\r\n  5&times;3 table\r\n     Region      GroupCount      Var3   \r\n    _________    __________    _________\r\n    MidWest      142           819:14:44\r\n    NorthEast    557           581:02:18\r\n    SouthEast    389            40:49:49\r\n    SouthWest     26            59:31:07\r\n    West         354           673:27:12\r\n<\/pre><p>Note that <tt>OutageTime<\/tt> and <tt>RestorationTime<\/tt> are datetime variables. You can perform arithmetic with datetime variables, just as you can with numeric variables.<\/p><p>To rename a variable, you can access the <tt>VariableNames<\/tt> property of the table and change its name there. Tables contain metadata, such as variable names and descriptions, in a property called <tt>Properties<\/tt>.<\/p><pre class=\"codeinput\">meanOutageTimes.Properties.VariableNames{<span class=\"string\">'Var3'<\/span>} = <span class=\"string\">'mean_OutageTime'<\/span>;\r\nmeanOutageTimes(1:5,:)\r\n<\/pre><pre class=\"codeoutput\">ans =\r\n  5&times;3 table\r\n     Region      GroupCount    mean_OutageTime\r\n    _________    __________    _______________\r\n    MidWest      142           819:14:44      \r\n    NorthEast    557           581:02:18      \r\n    SouthEast    389            40:49:49      \r\n    SouthWest     26            59:31:07      \r\n    West         354           673:27:12      \r\n<\/pre><h4>Use Groups in Multiple Calculations and Tabulate Results<a name=\"ea16dba3-9059-45b1-94d0-5ec240d1ecb9\"><\/a><\/h4><p><tt>varfun<\/tt> and <tt>rowfun<\/tt> are suitable when you want to perform the same calculation with the table variables. But what if you want to calculate different quantities? For example, you might want to calculate the mean power loss, but also the minimum duration of power outages, by region. In that case, you can perform the calculations and tabulate the results using the <tt><a href=\"https:\/\/www.mathworks.com\/help\/matlab\/ref\/findgroups.html\">findgroups<\/a><\/tt> and <tt><a href=\"https:\/\/www.mathworks.com\/help\/matlab\/ref\/splitapply.html\">splitapply<\/a><\/tt> functions.<\/p><p>These functions work a little differently from <tt>varfun<\/tt> and <tt>rowfun<\/tt>. The <tt>findgroups<\/tt> function returns numeric indices that correspond to the grouping variables you specify. As a second output, it also returns a table of the groups. Then you can call <tt>splitapply<\/tt> to apply a function to each group within the variables.<\/p><p>For example, I'll calculate the mean power loss, the minimum outage time, and the maximum number of customers affected, by region. I will specify the groups once using <tt>findgroups<\/tt>. Then I can use the groups in multiple calls to <tt>splitapply<\/tt>.<\/p><p>I'm also going to switch gears a little bit, by removing all rows with <tt>NaN<\/tt> or missing values. I'll use the <tt><a href=\"https:\/\/www.mathworks.com\/help\/matlab\/ref\/rmmissing.html\">rmmissing<\/a><\/tt> function to remove these rows, so that I don't have to specify the <tt>'omitnan'<\/tt> flag in every function call.<\/p><pre class=\"codeinput\">T = rmmissing(T);\r\n[G,results] = findgroups(T(:,<span class=\"string\">'Region'<\/span>));\r\nmeanLoss = splitapply(@mean,T.Loss,G);\r\nminOutageTime = splitapply(@min,T.RestorationTime - T.OutageTime,G);\r\nmaxCustomers = splitapply(@max,T.Customers,G)\r\n<\/pre><pre class=\"codeoutput\">maxCustomers =\r\n    3.295e+06\r\n   5.9689e+06\r\n   2.2249e+06\r\n   1.8202e+06\r\n     4.26e+06\r\n<\/pre><p>The output of <tt>splitapply<\/tt> is an array. <tt>maxCustomers<\/tt> is a 5-by-1 numeric array. However, you can tabulate the results by adding them as variables to the table <tt>results<\/tt>, the second output from <tt>findgroups<\/tt>.<\/p><pre class=\"codeinput\">results.meanLoss = meanLoss;\r\nresults.minOutageTime = minOutageTime;\r\nresults.maxCustomers = maxCustomers\r\n<\/pre><pre class=\"codeoutput\">results =\r\n  5&times;4 table\r\n     Region      meanLoss    minOutageTime    maxCustomers\r\n    _________    ________    _____________    ____________\r\n    MidWest      907.19      00:00:00          3.295e+06  \r\n    NorthEast    383.86      00:00:00         5.9689e+06  \r\n    SouthEast    508.34      00:00:00         2.2249e+06  \r\n    SouthWest    541.66      00:28:00         1.8202e+06  \r\n    West         429.73      00:00:00           4.26e+06  \r\n<\/pre><h4>Perform Calculations on Tall Table<a name=\"8107ab57-c003-46ce-a17e-1a5674335742\"><\/a><\/h4><p>So far, I've worked with a table that is small enough to fit into memory. But what if there is so much data that it won't all fit into memory? I might still want to treat it as a table and use the same functions.<\/p><p>That can easily be done by creating a <i>tall table<\/i>. A tall table is a kind of a tall array in MATLAB. Tall arrays and tall tables know how to read the data in one chunk at a time, perform the calculations you want, and then gather up the output at the end. The syntax for working with a tall table is very similar to that for working with a table.<\/p><p>For more information on tall arrays and tall tables, see <a href=\"https:\/\/www.mathworks.com\/help\/matlab\/tall-arrays.html\">Tall Arrays<\/a>.<\/p><p>To show how this works, I'll create a tall table out of <tt>outages.csv<\/tt>. Of course it's a bit silly to create a tall table out of a file that fits in memory, but this illustrates how to work with a much larger table.<\/p><p>Instead of calling <tt>readtable<\/tt>, I'll start by calling <tt><a href=\"https:\/\/www.mathworks.com\/help\/matlab\/ref\/datastore.html\">datastore<\/a><\/tt> to read <tt>outages.csv<\/tt> into a datastore.<\/p><p>Then I'll call <tt><a href=\"https:\/\/www.mathworks.com\/help\/matlab\/ref\/tall.html\">tall<\/a><\/tt> on <tt>ds<\/tt>, to create a tall table out of the datastore. Now I have a tall table that looks and acts very much like the table I worked with in the previous sections. One difference you see is that the tall table is displayed as an M-by-6 table, showing that the number of rows is not yet known.<\/p><pre class=\"codeinput\">ds = datastore(<span class=\"string\">'outages.csv'<\/span>);\r\nT = tall(ds)\r\n<\/pre><pre class=\"codeoutput\">Starting parallel pool (parpool) using the 'local' profile ...\r\nconnected to 2 workers.\r\nT =\r\n  M&times;6 tall table\r\n      Region          OutageTime        Loss     Customers     RestorationTime           Cause      \r\n    ___________    ________________    ______    __________    ________________    _________________\r\n    'SouthWest'    2002-02-01 12:18    458.98    1.8202e+06    2002-02-07 16:50    'winter storm'   \r\n    'SouthEast'    2003-01-23 00:49    530.14    2.1204e+05    NaT                 'winter storm'   \r\n    'SouthEast'    2003-02-07 21:15     289.4    1.4294e+05    2003-02-17 08:14    'winter storm'   \r\n    'West'         2004-04-06 05:44    434.81    3.4037e+05    2004-04-06 06:10    'equipment fault'\r\n    'MidWest'      2002-03-16 06:18    186.44    2.1275e+05    2002-03-18 23:23    'severe storm'   \r\n    'West'         2003-06-18 02:49         0             0    2003-06-18 10:54    'attack'         \r\n    'West'         2004-06-20 14:39    231.29           NaN    2004-06-20 19:16    'equipment fault'\r\n    'West'         2002-06-06 19:28    311.86           NaN    2002-06-07 00:51    'equipment fault'\r\n    :              :                   :         :             :                   :\r\n    :              :                   :         :             :                   :\r\n<\/pre><p>Now I can use <tt>findgroups<\/tt> and <tt>splitapply<\/tt> to perform some of the same calculations as before. Again I will wrap <tt>mean<\/tt> in an anonymous function so I can ignore <tt>NaN<\/tt> values, as I did above.<\/p><pre class=\"codeinput\">omean = @(x) mean(x,<span class=\"string\">'omitnan'<\/span>);\r\nT.Region = categorical(T.Region);\r\nT.Cause = categorical(T.Cause);\r\nG = findgroups(T.Region);\r\nmeanLoss = splitapply(omean,T.Loss,G);\r\nmeanCustomers = splitapply(omean,T.Customers,G)\r\n<\/pre><pre class=\"codeoutput\">meanCustomers =\r\n  M&times;1 tall double column vector\r\n    ?\r\n    ?\r\n    ?\r\n    :\r\n    :\r\n<\/pre><p>Instead of seeing the output, we see a trail of question marks. What does this mean? Simply that I have not performed any calculations yet. At this point, I have only defined the calculations I want to perform on the tall table. No calculations are performed until I <i>gather<\/i> the results, by calling the <tt><a href=\"https:\/\/www.mathworks.com\/help\/matlab\/ref\/gather.html\">gather<\/a><\/tt> function. The output of <tt>gather<\/tt> is not a tall array and must fit into memory.<\/p><pre class=\"codeinput\">meanLoss = gather(meanLoss);\r\nmeanCustomers = gather(meanCustomers)\r\n<\/pre><pre class=\"codeoutput\">Evaluating tall expression using the Parallel Pool 'local':\r\n- Pass 1 of 4: Completed in 4 sec\r\n- Pass 2 of 4: Completed in 2 sec\r\n- Pass 3 of 4: Completed in 2 sec\r\n- Pass 4 of 4: Completed in 2 sec\r\nEvaluation completed in 25 sec\r\nEvaluating tall expression using the Parallel Pool 'local':\r\n- Pass 1 of 1: Completed in 2 sec\r\nEvaluation completed in 5 sec\r\nmeanCustomers =\r\n   2.4015e+05\r\n   1.4917e+05\r\n   1.6776e+05\r\n   2.6975e+05\r\n   1.5201e+05\r\n<\/pre><p>Now let's tabulate the results in a table. Since <tt>meanLoss<\/tt> and <tt>meanCustomers<\/tt> are not tall arrays, <tt>results<\/tt> is not a tall table. Add the regions as another table variable.<\/p><pre class=\"codeinput\">results = table(meanLoss,meanCustomers)\r\nregion = unique(T.Region);\r\nresults.region = gather(region)\r\n<\/pre><pre class=\"codeoutput\">results =\r\n  5&times;2 table\r\n    meanLoss    meanCustomers\r\n    ________    _____________\r\n    1137.7      2.4015e+05   \r\n    551.65      1.4917e+05   \r\n    495.35      1.6776e+05   \r\n    493.88      2.6975e+05   \r\n    433.37      1.5201e+05   \r\nEvaluating tall expression using the Parallel Pool 'local':\r\n- Pass 1 of 1: Completed in 1 sec\r\nEvaluation completed in 2 sec\r\nresults =\r\n  5&times;3 table\r\n    meanLoss    meanCustomers     region  \r\n    ________    _____________    _________\r\n    1137.7      2.4015e+05       MidWest  \r\n    551.65      1.4917e+05       NorthEast\r\n    495.35      1.6776e+05       SouthEast\r\n    493.88      2.6975e+05       SouthWest\r\n    433.37      1.5201e+05       West     \r\n<\/pre><h4>Turning the Tables on You<a name=\"f926ecee-77fd-435f-b9c5-a61001a1cdbd\"><\/a><\/h4><p>Here I have shown you several ways you can group data in tables for calculations. To summarize, you can use one or more grouping variables to specify groups within the other table variables. Then you can apply functions to the groups in the table variables.<\/p><p>To apply:<\/p><div><ul><li>A function to a single table variable and return an array, use <tt>accumarray<\/tt>.<\/li><li>The same function to each table variable and return a table, use <tt>varfun<\/tt>.<\/li><li>A function that requires all the table variables as input arguments and return a table, use <tt>rowfun<\/tt>.<\/li><li>Different functions to different table variables and build a table of results, use <tt>findgroups<\/tt> and <tt>splitapply<\/tt>.<\/li><\/ul><\/div><p>How about you? Have done analysis with tables where <tt>accumarray<\/tt>, <tt>varfun<\/tt>, <tt>rowfun<\/tt>, <tt>findgroups<\/tt> and <tt>splitapply<\/tt> might have helped? Do you have cases where you need tall tables? Let us know <a href=\"https:\/\/blogs.mathworks.com\/loren\/?p=2365#respond\">here<\/a>.<\/p><script language=\"JavaScript\"> <!-- \r\n    function grabCode_071ed87a60174cbb840b0d2516dd71c8() {\r\n        \/\/ Remember the title so we can use it in the new page\r\n        title = document.title;\r\n\r\n        \/\/ Break up these strings so that their presence\r\n        \/\/ in the Javascript doesn't mess up the search for\r\n        \/\/ the MATLAB code.\r\n        t1='071ed87a60174cbb840b0d2516dd71c8 ' + '##### ' + 'SOURCE BEGIN' + ' #####';\r\n        t2='##### ' + 'SOURCE END' + ' #####' + ' 071ed87a60174cbb840b0d2516dd71c8';\r\n    \r\n        b=document.getElementsByTagName('body')[0];\r\n        i1=b.innerHTML.indexOf(t1)+t1.length;\r\n        i2=b.innerHTML.indexOf(t2);\r\n \r\n        code_string = b.innerHTML.substring(i1, i2);\r\n        code_string = code_string.replace(\/REPLACE_WITH_DASH_DASH\/g,'--');\r\n\r\n        \/\/ Use \/x3C\/g instead of the less-than character to avoid errors \r\n        \/\/ in the XML parser.\r\n        \/\/ Use '\\x26#60;' instead of '<' so that the XML parser\r\n        \/\/ doesn't go ahead and substitute the less-than character. \r\n        code_string = code_string.replace(\/\\x3C\/g, '\\x26#60;');\r\n\r\n        copyright = 'Copyright 2017 The MathWorks, Inc.';\r\n\r\n        w = window.open();\r\n        d = w.document;\r\n        d.write('<pre>\\n');\r\n        d.write(code_string);\r\n\r\n        \/\/ Add copyright line at the bottom if specified.\r\n        if (copyright.length > 0) {\r\n            d.writeln('');\r\n            d.writeln('%%');\r\n            if (copyright.length > 0) {\r\n                d.writeln('% _' + copyright + '_');\r\n            }\r\n        }\r\n\r\n        d.write('<\/pre>\\n');\r\n\r\n        d.title = title + ' (MATLAB code)';\r\n        d.close();\r\n    }   \r\n     --> <\/script><p style=\"text-align: right; font-size: xx-small; font-weight:lighter;   font-style: italic; color: gray\"><br><a href=\"javascript:grabCode_071ed87a60174cbb840b0d2516dd71c8()\"><span style=\"font-size: x-small;        font-style: italic;\">Get \r\n      the MATLAB code <noscript>(requires JavaScript)<\/noscript><\/span><\/a><br><br>\r\n      Published with MATLAB&reg; R2017a<br><\/p><\/div><!--\r\n071ed87a60174cbb840b0d2516dd71c8 ##### SOURCE BEGIN #####\r\n%% Ways to Perform Calculations on Groups of Data in Tables\r\n% Today I'd like to introduce a guest blogger, Stephen Doe, who works for\r\n% the MATLAB Documentation team here at MathWorks. Stephen will discuss\r\n% different methods for grouping data, and performing calculations on those\r\n% groups, using <https:\/\/www.mathworks.com\/help\/matlab\/tables.html Tables>.\r\n%\r\n%% Create Table\r\n% Tables are convenient containers for column-oriented data. The variables\r\n% in a table can have different data types, but must have the same number\r\n% of rows. You can access table data by row, by variable, or by variable\r\n% name. And you can specify groups within table variables, to perform\r\n% calculations on those groups. Today I will show several different ways to\r\n% apply functions to groups in tables.\r\n%\r\n% First, let's create a table. You can create a table using the\r\n% |<https:\/\/www.mathworks.com\/help\/matlab\/ref\/table.html table>| function,\r\n% the <https:\/\/www.mathworks.com\/help\/matlab\/ref\/importtool-app.html Import\r\n% Tool>, or the |<https:\/\/www.mathworks.com\/help\/matlab\/ref\/readtable.html\r\n% readtable>| function. I'm going to use |readtable| to read data from a\r\n% sample file that ships with MATLAB. The file |outages.csv| contains\r\n% simulated data for electric power outages over a period of 12 years in\r\n% the United States. Read the table. To display the first five rows, use\r\n% table indexing.\r\n\r\nT = readtable('outages.csv','Format','%C%D%f%f%D%C');\r\nT(1:5,:)\r\n\r\n%%\r\n% There are six columns in the file. Since I want to specify the data types\r\n% of the table variables, I use the |Format| name-value pair argument. |T|\r\n% has two categorical variables (specified with |%C|), two datetime\r\n% variables (|%D|) and two numeric variables (|%f|).\r\n\r\n%% Perform Calculation on Each Numeric Variable\r\n% Once data is in a table, you can perform calculations on different\r\n% table variables. One way is to use dot notation to refer to table\r\n% variables by name and perform calculations. For example, calculate the\r\n% total number of customers affected by power outages using the |sum|\r\n% function, omitting any |NaN| values that might be in |T.Customers|.\r\ntotalCustomers = sum(T.Customers,'omitnan')\r\n\r\n%%\r\n% You also can apply a function to every table variable, using the |varfun|\r\n% function. |<https:\/\/www.mathworks.com\/help\/matlab\/ref\/varfun.html\r\n% varfun>| is particularly useful for applying a function and returning the\r\n% results in another table.\r\n%\r\n% However, you might want to apply a function only to variables of a\r\n% particular type. You can specify the type of variable with the\r\n% |<https:\/\/www.mathworks.com\/help\/matlab\/ref\/vartype.html vartype>| \r\n% function. Here I'm going to subscript into |T| for the numeric variables\r\n% and calculate the mean of each one. And since the variables have some\r\n% |NaN| values, I'm going to wrap |mean| in an anonymous function so I can\r\n% use the |'omitnan'| flag.\r\nT2 = T(:,vartype('numeric'));\r\nomean = @(x) mean(x,'omitnan');\r\nmeanT = varfun(omean,T2)\r\n\r\n%% Perform Calculation on Groups in Numeric Variables\r\n% A more interesting view of the table is to group the data, and start\r\n% looking for differences between the groups. For example, what is the\r\n% difference in the mean power loss between the Northeast and Southwest\r\n% regions?\r\n%\r\n% To answer this question, you can start by specifying one or more table\r\n% variables as _grouping variables_. The grouping variables define groups\r\n% into which you split the other table variables. Then you can apply a\r\n% function to each group within each table variable and the results. For\r\n% more on grouping variables, see\r\n% <https:\/\/www.mathworks.com\/help\/matlab\/matlab_prog\/grouping-variables-for-splitting-data.html\r\n% Grouping Variables to Split Data>.\r\n%\r\n% You can use grouping variables with the |varfun| function I described\r\n% above. Specify |Region| as the grouping variable in this table. It's a\r\n% categorical variable, making it especially convenient to use as a\r\n% grouping variable. Specify |Loss| and |Customers| as the input variables,\r\n% and calculate the mean values by region for these two variables. I'm\r\n% going to use the same |omean| function I specified above.\r\nmeanByRegion = varfun(omean,T,'GroupingVariables','Region',...\r\n                      'InputVariables',{'Loss','Customers'})\r\n%%\r\n% Another way to apply a grouping variable to the data in a table variable\r\n% is to use the |<https:\/\/www.mathworks.com\/help\/matlab\/ref\/accumarray.html\r\n% accumarray>| function. If I convert |T.Region| to a numeric array, and\r\n% extract |T.Loss| from |T|, then I can calculate mean power loss per\r\n% region using |accumarray|, just as I did with |varfun|.\r\nregions = double(T.Region);\r\nA = accumarray(regions,T.Loss,[],omean)\r\n\r\n%%\r\n% |accumarray| operates on arrays. So to use |accumarray| on a table\r\n% variable, you must extract it from the table. Also, the grouping variable\r\n% must be numeric, and the output argument is an array, not a table. While\r\n% you can use |accumarray| on table variables, it lacks many of the\r\n% conveniences that |varfun| provides.\r\n%\r\n% You can specify more than one grouping variable using |varfun|. For\r\n% example, I'll specify |Region| and |Cause| as grouping variables. The\r\n% groups are defined as combinations of regions and causes.\r\nmeanByRegionAndCause = varfun(omean,T,'GroupingVariables',{'Region','Cause'},...\r\n                              'InputVariables',{'Loss','Customers'});\r\nmeanByRegionAndCause(1:5,:)\r\n\r\n%% Multiple Variables as Input Arguments to Function\r\n% Another simple calculation is the mean duration of the power outages by\r\n% region. The basis of this calculation is the difference between the\r\n% |OutageTime| and |RestorationTime| table variables.\r\n%\r\n% You might think to use |varfun| for this calculation too. The difficulty\r\n% is that |varfun| applies a function to each table variable separately.\r\n% But this calculation requires both variables. In other words, we want to\r\n% apply a function that requires two input arguments.\r\n% \r\n% The solution is to use the\r\n% |<https:\/\/www.mathworks.com\/help\/matlab\/ref\/rowfun.html rowfun>| function.\r\n% With |rowfun|, all the table variables are taken as input arguments. If\r\n% the table has |N| variables, then the function you apply must accept |N|\r\n% input arguments.\r\n%\r\n% To perform this calculation, I'll start by defining an anonymous function\r\n% that subtracts one input argument from the other, and then calculates the\r\n% mean. Then I will calculate the mean duration of power outages by region\r\n% using |rowfun|.\r\nmeanDiff = @(a,b) mean(abs(a-b),'omitnan');\r\nmeanOutageTimes = rowfun(meanDiff,T,'GroupingVariables','Region',...\r\n                         'InputVariables',{'OutageTime','RestorationTime'})\r\n\r\n%%\r\n% Note that |OutageTime| and |RestorationTime| are datetime variables.\r\n% You can perform arithmetic with datetime variables, just as you can\r\n% with numeric variables.\r\n%\r\n% To rename a variable, you can access the |VariableNames| property of the\r\n% table and change its name there. Tables contain metadata, such as\r\n% variable names and descriptions, in a property called |Properties|.\r\nmeanOutageTimes.Properties.VariableNames{'Var3'} = 'mean_OutageTime';\r\nmeanOutageTimes(1:5,:)\r\n\r\n%% Use Groups in Multiple Calculations and Tabulate Results\r\n% |varfun| and |rowfun| are suitable when you want to perform the same\r\n% calculation with the table variables. But what if you want to calculate\r\n% different quantities? For example, you might want to calculate the mean\r\n% power loss, but also the minimum duration of power outages, by region. In\r\n% that case, you can perform the calculations and tabulate the results\r\n% using the |<https:\/\/www.mathworks.com\/help\/matlab\/ref\/findgroups.html\r\n% findgroups>| and\r\n% |<https:\/\/www.mathworks.com\/help\/matlab\/ref\/splitapply.html splitapply>|\r\n% functions.\r\n%\r\n% These functions work a little differently from |varfun| and |rowfun|. The\r\n% |findgroups| function returns numeric indices that correspond to the\r\n% grouping variables you specify. As a second output, it also returns a\r\n% table of the groups. Then you can call |splitapply| to apply a function\r\n% to each group within the variables.\r\n%\r\n% For example, I'll calculate the mean power loss, the minimum outage time, and\r\n% the maximum number of customers affected, by region. I will specify the groups\r\n% once using |findgroups|. Then I can use the groups in multiple calls to\r\n% |splitapply|.\r\n%\r\n% I'm also going to switch gears a little bit, by removing all rows with\r\n% |NaN| or missing values. I'll use the\r\n% |<https:\/\/www.mathworks.com\/help\/matlab\/ref\/rmmissing.html rmmissing>|\r\n% function to remove these rows, so that I don't have to specify the\r\n% |'omitnan'| flag in every function call.\r\nT = rmmissing(T);\r\n[G,results] = findgroups(T(:,'Region'));\r\nmeanLoss = splitapply(@mean,T.Loss,G);\r\nminOutageTime = splitapply(@min,T.RestorationTime - T.OutageTime,G);\r\nmaxCustomers = splitapply(@max,T.Customers,G)\r\n\r\n%%\r\n% The output of |splitapply| is an array. |maxCustomers| is a 5-by-1\r\n% numeric array. However, you can tabulate the results by adding them as\r\n% variables to the table |results|, the second output from |findgroups|.\r\nresults.meanLoss = meanLoss;\r\nresults.minOutageTime = minOutageTime;\r\nresults.maxCustomers = maxCustomers\r\n\r\n%% Perform Calculations on Tall Table\r\n% So far, I've worked with a table that is small enough to fit into memory.\r\n% But what if there is so much data that it won't all fit into memory? I\r\n% might still want to treat it as a table and use the same functions.\r\n%\r\n% That can easily be done by creating a _tall table_. A tall table is a\r\n% kind of a tall array in MATLAB. Tall arrays and tall tables know how to\r\n% read the data in one chunk at a time, perform the calculations you want,\r\n% and then gather up the output at the end. The syntax for working with a\r\n% tall table is very similar to that for working with a table.\r\n%\r\n% For more information on tall arrays and tall tables, see\r\n% <https:\/\/www.mathworks.com\/help\/matlab\/tall-arrays.html Tall Arrays>.\r\n%\r\n% To show how this works, I'll create a tall table out of |outages.csv|. Of\r\n% course it's a bit silly to create a tall table out of a file that fits in\r\n% memory, but this illustrates how to work with a much larger table.\r\n%\r\n% Instead of calling |readtable|, I'll start by calling\r\n% |<https:\/\/www.mathworks.com\/help\/matlab\/ref\/datastore.html datastore>| to\r\n% read |outages.csv| into a datastore. \r\n%\r\n% Then I'll call |<https:\/\/www.mathworks.com\/help\/matlab\/ref\/tall.html\r\n% tall>| on |ds|, to create a tall table out of the datastore. Now I have a\r\n% tall table that looks and acts very much like the table I worked with in\r\n% the previous sections. One difference you see is that the tall table is\r\n% displayed as an M-by-6 table, showing that the number of rows is not yet\r\n% known.\r\nds = datastore('outages.csv');\r\nT = tall(ds)\r\n\r\n%%\r\n% Now I can use |findgroups| and |splitapply| to perform some of the same\r\n% calculations as before. Again I will wrap |mean| in an anonymous function\r\n% so I can ignore |NaN| values, as I did above.\r\n%\r\nomean = @(x) mean(x,'omitnan');\r\nT.Region = categorical(T.Region);\r\nT.Cause = categorical(T.Cause);\r\nG = findgroups(T.Region);\r\nmeanLoss = splitapply(omean,T.Loss,G);\r\nmeanCustomers = splitapply(omean,T.Customers,G)\r\n\r\n%% \r\n% Instead of seeing the output, we see a trail of question marks. What does\r\n% this mean? Simply that I have not performed any calculations yet. At this\r\n% point, I have only defined the calculations I want to perform on the tall\r\n% table. No calculations are performed until I _gather_ the results, by\r\n% calling the |<https:\/\/www.mathworks.com\/help\/matlab\/ref\/gather.html\r\n% gather>| function. The output of |gather| is not a tall array and must fit\r\n% into memory.\r\n\r\nmeanLoss = gather(meanLoss);\r\nmeanCustomers = gather(meanCustomers)\r\n\r\n%%\r\n% Now let's tabulate the results in a table. Since |meanLoss| and\r\n% |meanCustomers| are not tall arrays, |results| is not a tall table. Add\r\n% the regions as another table variable.\r\nresults = table(meanLoss,meanCustomers)\r\nregion = unique(T.Region);\r\nresults.region = gather(region)\r\n\r\n%% Turning the Tables on You\r\n% Here I have shown you several ways you can group data in tables for\r\n% calculations. To summarize, you can use one or more grouping variables to\r\n% specify groups within the other table variables. Then you can apply\r\n% functions to the groups in the table variables.\r\n%\r\n% To apply:\r\n%\r\n% * A function to a single table variable and return an array, use\r\n% |accumarray|.\r\n% * The same function to each table variable and return a table, use |varfun|.\r\n% * A function that requires all the table variables as input\r\n% arguments and return a table, use |rowfun|.\r\n% * Different functions to different table variables and build a\r\n% table of results, use |findgroups| and |splitapply|.\r\n%\r\n% How about you? Have done analysis with tables where |accumarray|,\r\n% |varfun|, |rowfun|, |findgroups| and |splitapply| might have helped? Do\r\n% you have cases where you need tall tables? Let us know\r\n% <https:\/\/blogs.mathworks.com\/loren\/?p=2365#respond here>.\r\n\r\n##### SOURCE END ##### 071ed87a60174cbb840b0d2516dd71c8\r\n-->","protected":false},"excerpt":{"rendered":"<!--introduction--><p>Today I'd like to introduce a guest blogger, Stephen Doe, who works for the MATLAB Documentation team here at MathWorks. Stephen will discuss different methods for grouping data, and performing calculations on those groups, using <a href=\"https:\/\/www.mathworks.com\/help\/matlab\/tables.html\">Tables<\/a>.... <a class=\"read-more\" href=\"https:\/\/blogs.mathworks.com\/loren\/2017\/08\/09\/ways-to-perform-calculations-on-groups-of-data-in-tables\/\">read more >><\/a><\/p>","protected":false},"author":39,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":[],"categories":[57,48],"tags":[],"_links":{"self":[{"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/posts\/2365"}],"collection":[{"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/users\/39"}],"replies":[{"embeddable":true,"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/comments?post=2365"}],"version-history":[{"count":4,"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/posts\/2365\/revisions"}],"predecessor-version":[{"id":2385,"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/posts\/2365\/revisions\/2385"}],"wp:attachment":[{"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/media?parent=2365"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/categories?post=2365"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/tags?post=2365"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}