Debugging Grouped Operations
Today's guest post comes from Sean de Wolski, one of Loren's fellow Application Engineers. You might recognize him from MATLAB Answers and the Pick of the Week blog!
One of my colleagues approached me last month and asked for help debugging an error with splitapply. The splitapply function takes group information and applies a function to each group in the data (sort of like a pivot table). Note that everything here also applies to the lower-level but more powerful function accumarray.
The documentation provides numerous simple examples for what splitapply does so check them out if you're not familiar with it.
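If you'd like a quick refresher first, here is a minimal sketch of the findgroups/splitapply pattern. The variables species and weight are hypothetical toy data that I made up for illustration; they are not part of the Bears data set.

```matlab
% Hypothetical toy data (not the Bears data set): one label and one value per row.
species = categorical(["Polar"; "Sloth"; "Polar"; "Sun"; "Sloth"]);
weight  = [450; 100; 500; 60; 120];

% findgroups assigns an integer group number to each row and returns
% the unique labels in the same order as the group numbers.
[g, label] = findgroups(species);

% splitapply calls the function once per group and stacks the results.
meanWeight = splitapply(@mean, weight, g);

disp(table(label, meanWeight))
```

The function is evaluated once per group, and the per-group results are then concatenated vertically. That concatenation step is where things can go wrong.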
Here is an anonymized version of the data set my colleague had. The first three variables are categorical identifiers and the fourth is a three-column numeric variable associated with them.
load('Bears.mat')
disp(Bears)
     Candy        Animal        Sports                   Bytes
    ________    __________    ___________    ____________________________

    Cinnamon    Black         Brown          -0.8095       0.40391     2
    Cinnamon    Polar         Brown          -2.9443       0.096455    3
    Cinnamon    Polar         Brown           1.4384       0.13197     9
    Cinnamon    Sloth         Chicago         0.32519      0.94205     1
    Cinnamon    Sloth         Chicago        -0.75493      0.95613     5
    Cinnamon    Sloth         Chicago         1.3703       0.57521     2
    Cinnamon    Sloth         Chicago        -1.7115       0.05978     10
    Cinnamon    Sloth         Chicago        -0.10224      0.23478     8
    Cinnamon    Sun           Baylor         -0.24145      0.35316     6
    Cinnamon    Sun           Baylor          0.31921      0.82119     5
    Cinnamon    Polar         Brown           0.31286      0.015403    1
    Cinnamon    Polar         Brown          -0.86488      0.043024    7
    Cinnamon    Black         Brown          -0.030051     0.16899     1
    Cinnamon    Black         Brown          -0.16488      0.64912     1
    Cinnamon    Polar         Brown           0.62771      0.73172     6
    Cinnamon    Polar         Brown           1.0933       0.64775     1
    Cinnamon    Spectacled    Coast Guard     1.1093       0.45092     9
    Cinnamon    Sloth         Chicago        -0.86365      0.54701     9
    Gummy       Sun           Baylor          0.077359     0.29632     8
    Gummy       Sun           Baylor         -1.2141       0.74469     2
    Gummy       Sun           Baylor         -1.1135       0.18896     7
    Gummy       Sun           Baylor         -0.0068493    0.68678     6
    Gummy       Sun           Baylor          1.5326       0.18351     10
    Gummy       Sloth         Chicago        -0.76967      0.36848     7
    Gummy       Sloth         Chicago         0.37138      0.62562     9
    Gummy       Spectacled    Coast Guard    -0.22558      0.78023     5
    Gummy       Spectacled    Coast Guard     1.1174       0.081126    5
    Gummy       Spectacled    Coast Guard    -1.0891       0.92939     9
    Gummy       Sloth         Chicago         0.032557     0.77571     1
    Gummy       Sloth         Chicago         0.55253      0.48679     2
    Gummy       Spectacled    Coast Guard     1.1006       0.43586     2
    Gummy       Spectacled    Coast Guard     1.5442       0.44678     4
    Gummy       Spectacled    Coast Guard     0.085931     0.30635     9
    Gummy       Spectacled    Coast Guard    -1.4916       0.50851     9
    Gummy       Spectacled    Coast Guard    -0.7423       0.51077     1
    Gummy       Spectacled    Coast Guard    -1.0616       0.81763     4
    Gummy       Spectacled    Coast Guard     2.3505       0.79483     6
    Gummy       Sloth         Chicago        -0.6156       0.64432     5
    Gummy       Spectacled    Coast Guard     0.74808      0.37861     7
    Gummy       Sloth         Chicago        -0.19242      0.81158     7
    Gummy       Spectacled    Coast Guard     0.88861      0.53283     3
    Gummy       Spectacled    Coast Guard    -0.76485      0.35073     5
    Gummy       Spectacled    Coast Guard    -1.4023       0.939       1
    Gummy       Black         Brown          -1.4224       0.87594     10
    Gummy       Sloth         Chicago         0.48819      0.55016     2
    Gummy       Sloth         Chicago        -0.17738      0.62248     2
    Gummy       Sloth         Chicago        -0.19605      0.58704     4
    Gummy       Sloth         Chicago         1.4193       0.20774     2
    Gummy       Sloth         Chicago         0.29158      0.30125     5
    Gummy       Sloth         Chicago         0.19781      0.47092     4
    Gummy       Polar         Brown           1.5877       0.23049     10
    Gummy       Polar         Brown          -0.80447      0.84431     10
    Gummy       Sloth         Chicago         0.69662      0.19476     1
    Gummy       Black         Brown           0.83509      0.22592     8
    Gummy       Black         Brown          -0.24372      0.17071     3
    Gummy       Sloth         Chicago         0.21567      0.22766     5
    Gummy       Black         Brown          -1.1658       0.4357      6
    Gummy       Sloth         Chicago        -1.148        0.3111      10
    Gummy       Sloth         Chicago         0.10487      0.92338     5
    Gummy       Sloth         Chicago         0.72225      0.43021     10
    Gummy       Sloth         Chicago         2.5855       0.18482     4
    Gummy       Sloth         Chicago        -0.66689      0.90488     8
    Gummy       Sloth         Chicago         0.18733      0.97975     7
    Gummy       Sloth         Chicago        -0.082494     0.43887     6
    Gummy       Sloth         Chicago        -1.933        0.11112     7
    Gummy       Sloth         Chicago        -0.43897      0.25806     7
    Gummy       Sloth         Chicago        -1.7947       0.40872     2
    Gummy       Sloth         Chicago         0.84038      0.5949      2
    Gummy       Sloth         Chicago        -0.88803      0.26221     10
    Gummy       Sloth         Chicago         0.10009      0.60284     2
    Gummy       Sloth         Chicago        -0.54453      0.71122     1
    Gummy       Sloth         Chicago         0.30352      0.22175     6
    Gummy       Sloth         Chicago        -0.60033      0.11742     9
    Gummy       Sloth         Chicago         0.48997      0.29668     7
    Gummy       Sloth         Chicago         0.73936      0.31878     2
    Gummy       Sloth         Chicago         1.7119       0.42417     4
    Gummy       Sloth         Chicago        -0.19412      0.50786     5
    Gummy       Sloth         Chicago        -2.1384       0.085516    10
    Gummy       Sloth         Chicago        -0.83959      0.26248     2
    Gummy       Sloth         Chicago         1.3546       0.80101     9
    Gummy       Sloth         Chicago        -1.0722       0.02922     7
    Gummy       Sloth         Chicago         0.96095      0.92885     4
    Gummy       Sloth         Chicago         0.12405      0.73033     2
    Gummy       Sloth         Chicago         1.4367       0.48861     5
    Gummy       Sloth         Chicago        -1.9609       0.57853     5
    Gummy       Black         Brown          -0.1977       0.23728     2
    Gummy       Sloth         Chicago        -1.2078       0.45885     6
    Gummy       Sloth         Chicago         2.908        0.96309     3
    Gummy       Sloth         Chicago         0.82522      0.54681     4
    Gummy       Sloth         Chicago         1.379        0.52114     6
    Gummy       Sloth         Chicago        -1.0582       0.23159     3
    Gummy       Sloth         Chicago        -0.46862      0.4889      3
    Gummy       Black         Brown          -0.27247      0.62406     7
    Gummy       Polar         Brown           1.0984       0.67914     3
    Gummy       Sloth         Chicago        -0.27787      0.39552     9
    Gummy       Sloth         Chicago         0.70154      0.36744     10
    Gummy       Sloth         Chicago        -2.0518       0.98798     8
    Gummy       Black         Brown          -0.35385      0.037739    4
    Gummy       Sloth         Chicago        -0.82359      0.88517     6
    Gummy       Sloth         Chicago        -1.5771       0.91329     2
    Gummy       Sloth         Chicago         0.50797      0.79618     10
    Gummy       Sloth         Chicago         0.28198      0.098712    9
    Gummy       Sloth         Chicago         0.03348      0.26187     9
    Gummy       Sloth         Chicago        -1.3337       0.33536     3
    Gummy       Sloth         Chicago         1.1275       0.67973     6
    Gummy       Sloth         Chicago         0.35018      0.13655     1
    Gummy       Sloth         Chicago        -0.29907      0.72123     5
    Gummy       Sloth         Chicago         0.02289      0.10676     4
    Gummy       Sloth         Chicago        -0.262        0.65376     2
    Gummy       Sloth         Chicago        -1.7502       0.49417     2
    Gummy       Sloth         Chicago        -0.28565      0.77905     5
    Gummy       Black         Brown          -0.83137      0.71504     1
    Gummy       Sloth         Chicago        -0.97921      0.90372     6
    Gummy       Sloth         Chicago        -1.1564       0.89092     5
    Gummy       Sloth         Chicago        -0.53356      0.33416     7
    Gummy       Sloth         Chicago        -2.0026       0.69875     7
    Gummy       Sloth         Chicago         0.96423      0.19781     7
    Gummy       Sloth         Chicago         0.52006      0.030541    1
    Gummy       Sloth         Chicago        -0.020028     0.74407     1
    Gummy       Sloth         Chicago        -0.034771     0.50002     4
    Gummy       Sloth         Chicago        -0.79816      0.47992     6
    Gummy       Sloth         Chicago         1.0187       0.90472     7
    Gummy       Sloth         Chicago        -0.13322      0.60987     5
    Gummy       Sloth         Chicago        -0.71453      0.61767     9
    Gummy       Sloth         Chicago         1.3514       0.85944     8
    Gummy       Sloth         Chicago        -0.22477      0.80549     10
    Cinnamon    Polar         Brown          -0.58903      0.57672     6
He was trying to calculate the mean of Bytes, omitting NaNs, for each combination of two of the categorical variables.
[animalcandy, animal, candy] = findgroups(Bears.Animal,Bears.Candy);
meanbyte = splitapply(@(x)mean(x, 'omitnan'), Bears.Bytes, animalcandy);
Error using vertcat
Dimensions of arrays being concatenated are not consistent.

Error in splitapply>localapply (line 257)
    finalOut{curVar} = vertcat(funOut{:,curVar});

Error in splitapply (line 132)
    varargout = localapply(fun,splitData,gdim,nargout);

Error in mainDebuggingGroupedOps (line 32)
meanbyte = splitapply(@(x)mean(x, 'omitnan'), Bears.Bytes, animalcandy);
Hmm, I've seen that error before, but what does it have to do with this? How do we debug this? One could put a breakpoint at the anonymous function @(x)mean(x, 'omitnan') and then step with the debugger until the error occurs.
This would likely work for a small number of groups, but as the number of groups gets larger, it would take lots of steps, one for each function evaluation. You'd likely have to do it twice, too, stepping through a second time once you know where the error occurs. Setting the debugger to stop on errors may also work for splitapply, though it may not stop you in a useful spot, and it won't work for accumarray, which is a built-in.
A trick I like to use is to replace the function with one that just wraps its input in {}, that is, @(x){x}. This takes whatever is provided and packs it into a cell so you can see exactly what is being passed into each function evaluation for each group.
bytecell = splitapply(@(x){x}, Bears.Bytes, animalcandy);
disp(bytecell)
    [14×3 double]
    [ 1×3 double]
    [ 5×3 double]
    [ 2×3 double]
    [78×3 double]
    [ 6×3 double]
    [ 8×3 double]
    [ 3×3 double]
    [ 3×3 double]
    [ 7×3 double]
From here we can see that the second cell has only one row. Since mean operates along the first non-singleton dimension, it reduces that single row to a scalar by averaging across the row, whereas every other group yields a 1×3 row vector of column means. A scalar can't be vertically concatenated with 1×3 row vectors, so we get the error.
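To see the effect in isolation, here is a small sketch with two hypothetical matrices (twoRows and oneRow are made-up stand-ins for a multi-row group and a single-row group):

```matlab
% Hypothetical group data: one group with two rows, one with a single row.
twoRows = [1 2 3; 3 4 5];
oneRow  = [1 2 3];

% With no dimension given, mean works along the first non-singleton dimension.
a = mean(twoRows);   % dimension 1 -> 1x3 row of column means
b = mean(oneRow);    % dimension 2 -> scalar mean of the single row

% Stacking the two results fails the same way splitapply did.
try
    stacked = [a; b];
catch err
    disp(err.message)
end
```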
The fix for this is simple: pass the dimension to mean to force it to always take the column mean. Then rebuild the table with the labels.
meanbyte = splitapply(@(x)mean(x, 1, 'omitnan'), Bears.Bytes, animalcandy);
disp(table(animal, candy, meanbyte))
      animal       candy                  meanbyte
    __________    ________    _________________________________

    Spectacled    Gummy        0.075572    0.55805    5
    Spectacled    Cinnamon     1.1093      0.45092    9
    Sun           Gummy       -0.1449      0.42005    6.6
    Sun           Cinnamon     0.03888     0.58718    5.5
    Sloth         Gummy       -0.081113    0.51523    5.2949
    Sloth         Cinnamon    -0.28948     0.55249    5.8333
    Black         Gummy       -0.45653     0.4153     5.125
    Black         Cinnamon    -0.33481     0.40734    1.3333
    Polar         Gummy        0.62722     0.58464    7.6667
    Polar         Cinnamon    -0.13228     0.32043    4.7143
In this case, the fix was fairly discernible from a quick inspection. If it were not, we could loop over the cell and evaluate the function on each element to see where the error occurs. If the error occurs on a specific cell's data, the loop will stop there and we can investigate. If it's on the concatenation step, that'll be obvious at the end.
fun = @(x)mean(x, 'omitnan');
meanbytecell = cell(size(bytecell));
for ii = 1:numel(bytecell)
    meanbytecell{ii} = fun(bytecell{ii});
end
disp(meanbytecell)
    [1×3 double]
    [    3.5201]
    [1×3 double]
    [1×3 double]
    [1×3 double]
    [1×3 double]
    [1×3 double]
    [1×3 double]
    [1×3 double]
    [1×3 double]
And now it is obvious why these won't concatenate.
I find looping over the cell in this manner to be much easier than looping over the original data set and trying to identify which elements are in which groups and indexing correctly.
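With many groups, eyeballing the display gets tedious. One possible variation, sketched here under my own assumptions, records the output size for each group (or the error message, if the function throws) and flags the groups that disagree with the first one. The groupData cell is a hypothetical stand-in for bytecell, and outSize and isOdd are names I made up for illustration.

```matlab
% Hypothetical stand-in for the cell produced by splitapply(@(x){x}, ...).
groupData = {rand(4, 3), rand(1, 3), rand(2, 3)};
fun = @(x) mean(x, 'omitnan');

% Record the output size for every group, or the error message if fun fails.
outSize = cell(size(groupData));
for ii = 1:numel(groupData)
    try
        outSize{ii} = size(fun(groupData{ii}));
    catch err
        outSize{ii} = err.message;
    end
end

% Groups whose result differs in size from the first group's result
% are the ones that will not concatenate.
isOdd = ~cellfun(@(s) isequal(s, outSize{1}), outSize);
disp(find(isOdd))
```

Comparing against the first group is a simplification; if the first group is itself the odd one out, every other group gets flagged instead, which still points you to the mismatch.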
I'm also a big fan of using splitapply/accumarray with cell output for making objects or plots based on grouped data where the object can't be returned directly. Continuing this example, we'll draw a histogram for each group of the original data, wrapping histogram in {}.
figure
axes('ColorOrder', parula(numel(animal)))
hold on
h = splitapply(@(x){histogram(x)}, Bears.Bytes, animalcandy);
legend([h{:}], compose("%s/%s", animal, candy));
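The cell-output trick carries over to accumarray as well, with one wrinkle: accumarray only accepts a vector of values. A common workaround, sketched below on hypothetical toy data (g and data are made up, not taken from Bears), is to accumulate row indices and index into the matrix inside the function.

```matlab
% Hypothetical toy data: three groups, two data columns.
g    = [1; 2; 1; 3; 1];
data = [10 1; 20 2; 30 3; 40 4; 50 5];

% accumarray takes a vector of values, so pass the row indices and pull
% the matching rows of data inside the function. The sort keeps the rows
% in their original order, which accumarray does not otherwise promise.
rows  = (1:numel(g))';
cells = accumarray(g, rows, [], @(idx){data(sort(idx), :)});

% Show the size of the block each group would receive.
disp(cellfun(@(c) mat2str(size(c)), cells, 'UniformOutput', false))
```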
As an aside, the development team has been working to make grouped operations easier over the last few releases with a collection of new functions:
- varfun - Apply function to table input variables based on grouping ones
- groupsummary - Group summary statistics
- grouptransform - Transform based on groups
- retime - Aggregate based on time
Doing this same operation with varfun would look like this:
meanbytetable = varfun(@(x)mean(x, 1, 'omitnan'), Bears, ...
    'GroupingVariables', {'Animal', 'Candy'}, ...
    'InputVariables', {'Bytes'});
disp(meanbytetable)
      Animal       Candy      GroupCount                Fun_Bytes
    __________    ________    __________    _________________________________

    Spectacled    Gummy       14             0.075572    0.55805    5
    Spectacled    Cinnamon    1              1.1093      0.45092    9
    Sun           Gummy       5             -0.1449      0.42005    6.6
    Sun           Cinnamon    2              0.03888     0.58718    5.5
    Sloth         Gummy       78            -0.081113    0.51523    5.2949
    Sloth         Cinnamon    6             -0.28948     0.55249    5.8333
    Black         Gummy       8             -0.45653     0.4153     5.125
    Black         Cinnamon    3             -0.33481     0.40734    1.3333
    Polar         Gummy       3              0.62722     0.58464    7.6667
    Polar         Cinnamon    7             -0.13228     0.32043    4.7143
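For comparison, here is a minimal sketch of groupsummary on a hypothetical toy table (T, Animal, and Value are made up and are not the Bears data). I pass a function handle as the method so that the NaN handling and the dimension are explicit rather than relying on the defaults.

```matlab
% Hypothetical toy table with one grouping variable and one data variable.
Animal = categorical(["Polar"; "Sloth"; "Polar"; "Sloth"]);
Value  = [1; 2; NaN; 4];
T = table(Animal, Value);

% Apply a NaN-omitting column mean to Value within each Animal group.
G = groupsummary(T, 'Animal', @(x) mean(x, 1, 'omitnan'), 'Value');
disp(G)
```

Like varfun, the result includes a GroupCount variable alongside the computed values.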
Do you work with grouped data functions? Let us know here.