Loren on the Art of MATLAB

Under-appreciated accumarray

Posted by Loren Shure,

The MATLAB function accumarray seems to be under-appreciated. accumarray allows you to aggregate items in an array in the way that you specify.

Newsgroup Statistics

Since accumarray has been in MATLAB (7.0, R14), there have been over 100 threads in the MATLAB newsgroup where accumarray arose as a solution.

Recent Questions

One of the more recent threads asks how to aggregate values in one list based on another list. Suppose the lists are

group = [1 2 2 2 3 3]'
data = [6 43 3 4 2 5]'
group =
     1
     2
     2
     2
     3
     3
data =
     6
    43
     3
     4
     2
     5

and the goal is to sum the data in each group. Let's first create the first input argument. accumarray wants an array of subscripts that says, for each data value, which output element it belongs to. Since we're just producing a column vector with 3 values, we just append a column of ones to the group vector.

indices = [group ones(size(group))]
indices =
     1     1
     2     1
     2     1
     2     1
     3     1
     3     1

Next We Accumulate

Since the default function for accumulation is sum, we can use the simplest form of accumarray to get the desired results.

sums = accumarray(indices, data)
sums =
     6
    50
     7

Another Way to Accumulate

We can instead accumulate the results by adding 2 input arguments to the function call. These are a size vector for the output array and a function handle specifying the accumulating function.

sums1 = accumarray(indices, data, [numel(unique(group)) 1], @sum)
sums1 =
     6
    50
     7

It's easy to see that the results from the two function calls are the same.

isequal(sums, sums1)
ans =
     1
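
As a minimal sketch (the names sumsA, sumsB, sumsC, and allSame are introduced here for illustration only), the same sums can also be computed by passing group directly as the subscripts, or by passing [] as the size so that accumarray takes the output size from the largest subscript:

```matlab
group = [1 2 2 2 3 3]';
data  = [6 43 3 4 2 5]';

% Two-column subscripts, as in the post: (row, column) of each output element
sumsA = accumarray([group ones(size(group))], data);

% Column-vector subscripts: each entry is the output row directly
sumsB = accumarray(group, data);

% Empty size argument: output size is taken from the largest subscript
sumsC = accumarray(group, data, [], @sum);

allSame = isequal(sumsA, sumsB, sumsC)
```

The second column of ones only makes the (row, column) structure of the output explicit; for a column-vector result it is optional.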

Other Accumulation Functions

Sometimes, summing the results isn't what I'm looking for. Having puzzled out the 4 input call syntax, I can now simply replace the accumulation function. To find the maximum values in each group, I use this code.

maxData = accumarray(indices, data, [numel(unique(group)) 1], @max)
maxData =
     6
    43
     5
maxData = accumarray(indices, data, [numel(unique(group)) 1], ...
    @(x)~any(isfinite(x)))
maxData =
     0
     0
     0
data(end) = Inf
maxData = accumarray(indices, data, [numel(unique(group)) 1], ...
    @(x)~any(isfinite(x)))
data =
     6
    43
     3
     4
     2
   Inf
maxData =
     0
     0
     0
maxData = accumarray(indices, data, [numel(unique(group)) 1], ...
    @(x)all(isfinite(x)))
maxData =
     1
     1
     0

Derivative Work

John D'Errico made a more general function consolidator, found on the MathWorks File Exchange to allow you to do some extra aggregation. For example, consolidator allows the aggregation of elements when they are within a specified tolerance and not just identical.

Do You accum?

Some other obvious accumulation functions you might use include sum, max, min, prod. What functions do you use in situations when you aggregate with accumarray? Let me know here.
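
A short sketch of a few of these, assuming the original group and data vectors from the top of the post (before data(end) was set to Inf); the output names are chosen here for illustration:

```matlab
group = [1 2 2 2 3 3]';
data  = [6 43 3 4 2 5]';

minData  = accumarray(group, data, [], @min);    % smallest value per group
prodData = accumarray(group, data, [], @prod);   % product of values per group
counts   = accumarray(group, 1);                 % number of elements per group
```

Passing the scalar 1 as the data is a common way to count how many elements fall into each group, since the default accumulation is a sum.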


Published with MATLAB® 7.5

59 Comments

Yesterday I built a matrix of ones and zeros and multiplied by it to perform accumulation. accumarray seems a better solution, but the first option also works if data has multiple columns.

How can I use accumarray if data has multiple columns?

Thanks,
Dani.

Hello Loren,
regarding my last post,
tried using anonymous function:

f= @(x) accumarray(indices,data(:,x));
sum=arrayfun(f,1:size(data,2),'UniformOutput','false');

but it is not working,
Dani.
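
For the multi-column case, here is a hedged sketch of two possible approaches. It assumes a hypothetical 6-by-2 data matrix and a column vector group. One likely problem in the attempt above is that false is passed as a quoted string rather than as a logical value; naming the result sum also shadows the built-in function. The variable names below are illustrative.

```matlab
group = [1 2 2 2 3 3]';
data  = [6 43 3 4 2 5; 1 1 1 1 1 1]';   % hypothetical 6-by-2 data matrix

% Option 1: one accumarray call per column, collected in a cell array
f = @(c) accumarray(group, data(:, c));
perColumn = arrayfun(f, 1:size(data, 2), 'UniformOutput', false);
sums1 = [perColumn{:}];

% Option 2: a single call with (row, column) subscripts for every element
[r, c] = ndgrid(group, 1:size(data, 2));
sums2 = accumarray([r(:) c(:)], data(:));
```

Both options give one row per group and one column per data column. Option 2 avoids the cell array by telling accumarray which output column each element belongs to.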

Hi.
This was a really nifty trick I didn’t know about. Thank you and keep them coming.

However, it did take me a few minutes and some studying of the documentation to realize that the line “indices = [group ones(size(group))]” only serves to produce more visually pleasant output. On my first read I thought there was some deep and important meaning to that second column.

Just wanted to mention that if the next reader gets confused as well.

Sincerely
Daniel Armyr

Thanks for the comments, folks.

Dani,

I recommend you look at the reference page for accumarray. There are many examples there, including ones with matrices and not just vectors.

–Loren

This is another example of very poor documentation. I’ve read the FRP for “accumarray” several times now as well as your column and I still can’t figure out what the function is doing.

Since the documentation has this brief comment: “accumarray sums values from val using the default behavior of sum,” I suspect that the accumulating and aggregating you are referring to is actually summing. If so, you need to clearly say so.

Countering my above suspicion are the 3 syntax options that allow the specification of an alternate function. So, what does it mean to accumulate on one hand using “sum” and on the other using “sin?” Is it effectively doing something like “sum(sin(x))?” In your column, you provided 4 examples of substituting alternate functions: “max,” “any” and “all.” But, I’m not following any of them.

Most importantly, I can’t follow the data flow to understand how the vector of indices controls the accumulation. The first example on the FRP does provide some help in this. But, how would one ever construct a meaningful or useful index vector?

The FRP and your column don’t show us a problem where this function is really useful. The way it is currently explained, it appears to be a solution in search of a problem. I’ve similarly criticized function handles. It took me the better part of a year to understand what they did and I still can’t imagine a case where they would be useful since they seem to obscure the data flow.

Finally, with 6 syntax options, the FRP requires in excess of 20 examples that systematically lead the user from trivial to sophisticated for each of the syntax options.

Oliver-

You can’t use sin with accumarray. The reason is stated in this part of the description:

“A = accumarray(subs,val,sz,fun) applies function fun to each subset of elements of val. You must specify the fun input using the @ symbol (e.g., @sin). The function fun must accept a column vector and return a numeric, logical, or character scalar, or a scalar cell.”

The problems accumarray helps to solve are ones posed by users such as the examples from the newsgroup — where they want to aggregate contents in a collection subject to some criteria that they have for their particular problem. That’s why hist alone was not enough.

Nonetheless, I do hear that you are unhappy with the documentation.

–Loren

Loren,

It appears we aren’t communicating well.

You say that you can’t use sin with accumarray and then you quote the documentation where they do exactly that. So, clearly, accumarray can be used with sin. But, I do not understand the data flow, e.g., what the details of this accumulation are.

I’ve pretty much concluded that accumarray is not doing something like (using the syntax from the FRP):

for ii = subs
A(ii) = sum(val(subs(1 : ii)));
end

Before I wrote my last comment, I did review some of the entries in the newsgroup. Yet, I didn’t find one where I could understand the data flow.

Maybe you could review the specification for accumarray and that would provide some text that would help explain what the function is doing.

By the way, MatLab has the best documentation of any software that I’ve ever used and I tell my bosses that several times a year. But, it isn’t perfect and the shortcomings in documentation are a productivity issue.

Like I’ve done before, I’ll tell this to Scott when he visits us next week.

Thanks.

Oliver-

Thanks for your explanation.

I see @sin as how to specify a function handle, but not in one of the examples. I stand by the documentation I previously quoted:

“A = accumarray(subs,val,sz,fun) applies function fun to each subset of elements of val. You must specify the fun input using the @ symbol (e.g., @sin). The function fun must accept a column vector and return a numeric, logical, or character scalar, or a scalar cell.”

which says vector in -> scalar out, something that sin does not do.

–Loren

Loren,
I think I’m understanding the frustration which Oliver is expressing. accumarray seems to represent a very non-intuitive process. As best I can determine, what it is doing is the following.

% subs is an array of indices
% vals is the data you want to work with

For the first row of subs:
Find all of the rows of subs that match that row and call them subset (pretend that rows 1, 3, and 5 all match with [2 1]).
Take all of the values in vals from that subset (e.g. vals([1,3,5])) and call that tempvals.
Apply whatever your function is to tempvals and assign that to the output at the index defined by the row of subs which you are working with (i.e. out(2,1) = sum(vals([1,3,5]))).
Keep going through until you’ve found all of the unique rows of subs.

Is that an accurate description of what this is doing?

Thanks,
Dan
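
The procedure Dan describes can be sketched as a loop over the unique rows of subs. The example data below is hypothetical, and the loop is only meant to mirror the description for the default sum; it is not how accumarray is implemented internally.

```matlab
subs = [2 1; 1 1; 2 1; 1 2; 2 1];   % rows 1, 3, and 5 all match [2 1]
vals = [10; 20; 30; 40; 50];

% Each unique row of subs is one output position; rowId says which one
[rows, ~, rowId] = unique(subs, 'rows');

out = zeros(max(subs));             % output sized by the largest subscripts
for k = 1:size(rows, 1)
    tempvals = vals(rowId == k);                  % values sharing this row
    out(rows(k, 1), rows(k, 2)) = sum(tempvals);  % apply the function
end

agree = isequal(out, accumarray(subs, vals))
```

Output positions that no row of subs refers to, such as out(2,2) here, stay at zero.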

I think that ACCUMARRAY has one of the worst documentation pages in MATLAB. I also find ACCUMARRAY to be probably the number one most confusing function in MATLAB – there might be causation behind this correlation ;)

I also agree with Oliver that using @sin as an example for the function handle input is bad form, since it is not valid for the function. Furthermore, the ‘issparse’ input argument description is confusing – it’s not testing sparsity but asking for it, so it should be described by something like ‘createsparse’ instead. And also, the forms that use ‘sz’ but not ‘fillval’ don’t specify what will then be used to fill – it turns out by experimentation to be zero, but this should be called out.

For your example, why did you choose to make it slightly more confusing by adding the column of ones to ‘group’ to make ‘indices’? I believe that ‘accumarray(group,data)’ gives the same result, and is clearer in its intent. Also clearer may have been to use [] for the size input argument, since you want the default behavior and are not specifying an explicitly sized output array.

Finally, I believe that consolidator-like tolerance functionality should be added to accumarray.

Thanks,
Eric – former TMWer.

Eric-

After Oliver noted the @sin instance, it was entered into the bug database at MathWorks and will be fixed.

I created the inputs deliberately so people could see (but maybe didn’t) how to work with that indices input more easily than I felt the documentation showed. I may have guessed incorrectly, at least for some of you.

–Loren

I still say that you should take the 1st example from the FRP and show us the equivalent code that produces the same result.

I spent 2 hours last night trying to hack out such equivalent code and I never got close.

And, this is exactly why I always say that documentation is a productivity issue. If The MathWorks can write a fancy routine like accumarray, they certainly can write a FRP that explains it for the novice user.

Oliver-

I am not sure if this is what you are looking for, but try this:

%% Here's the "Data" for Example 1
val = 101:105;
subs = [1; 2; 4; 2; 4]
%% Here's the accumarray Solution
% We are accumulating values from val based on like-values
% in subs
A = accumarray(subs, val)
%% Another Method
% Here's the outline of perhaps a more "standard" way to
% think of this.
%
% * First find out how many unique indices there are in
% subs.  The length
% of this array corresponds to the maximum index value in
% subs.  This is the size of the output array.
% * Pre-allocate the output array to be the correct size.
% * Loop through the *values* in subs, which range from
% 1:max(subs).
%
%% Find How Many Times Each Index is Repeated
% For indices, find out how many of each value.
n = hist(subs,max(subs))
%%
% Verify that the maximum value in subs is indeed the number
% of bins from calling hist.
max(subs) == length(n)
%%
% Create output array.
Aother = zeros(length(n),1);
%%
% Add each entry in val to the "appropriate" output entry
for k = 1:length(n);
    % Find correct subscript for adding to Aother
    % logical index for val containing n(k) nonzeros
    ind = (subs == k)  
    % verify that the number of nonzeros is the expected
    % amount
    tf = n(k) == nnz(ind) 
    % here are the values in the data from the 
    % corresponding indices
    valk = val(ind) 
    % sum up these values
    Aother(k) = sum(val(ind));  
end
%% Compare Output Values
agree = isequal(Aother,A)

–Loren

Loren,

That is just what I was looking for. Thank you very much.

Now I’m on the way to understanding what it does. Now I’ve got to understand why this would be useful.

A more compact version of the code for the vector only form of subs, is:

for ii = 1 : length(hist(subs, max(subs)));
A(ii) = sum(val((subs == ii)));
end

Glad that helped, Oliver. You are right for the vector case. I put extra statements in as a way of explanation.

–Loren

I find accumarray extremely useful and use it quite often. However, I would concur with other commenters that the documentation could be better. When this function first appeared it took me a while to understand what it does and how it works.

After reading through all the comments, I think I understand the concept of accumarray. It will be pretty useful. However, I still don’t understand the syntax.

>> sums = accumarray(indices, data);
>> sum1 = accumarray(group, data);
>> isequal(sums, sum1)
ans =
1

I don’t understand the meaning of the indices or the illustration of it. This is how I can explain the syntax. The indices or subscripts is to shape the output. That’s why “data” has to be 1-d array. At first, I thought the indices to be (row, col) of data. It is in fact the (row, col) of “sums”.

>> indices = [ones(size(group)) group];
>> sums = accumarray(indices, data)
sums =
6 50 7

I’m also confused about the size input. what is the difference of using “numel” and “size”? I read the documentation and I couldn’t understand the difference.

>> sum1 = accumarray(indices, data, [numel(unique(group)) 1], @sum)
>> sum2 = accumarray(indices, data, size(unique(group)), @sum)
>> isequal(sum1, sum2)
ans =
1

Also what does the size input do if indices already shape the output? Or maybe I misunderstand what “indices” does.

ip

Ivan-

Since the data in this case is a column vector, we can either index with just group (1-d indexing) or using subscripts (row, col, where column is always one here). The indices or group is which output in sums does the corresponding data belong to.

size produces m x n or more output values — one for each dimension. numel totals up all the elements and is equal to prod(size(…))

–Loren
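
A small sketch of both points, using the group and data vectors from the post (the other variable names are illustrative). It also shows one thing the size input can do that the subscripts alone cannot: request an output larger than the largest subscript.

```matlab
group = [1 2 2 2 3 3]';
data  = [6 43 3 4 2 5]';
u = unique(group);

sizeOfU  = size(u);      % [3 1]: one entry per dimension
numelOfU = numel(u);     % 3: total number of elements

% For a column vector, both size arguments describe the same 3-by-1 output
sameSize = isequal([numelOfU 1], sizeOfU);

% A larger size pads the output; rows with no data are filled with zero
padded = accumarray(group, data, [5 1]);
```

Here padded has five rows, with the last two equal to zero because no element of group refers to them.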

Hi Loren

I use the accumarray function very often since it is another way to efficiently make neighborhood operations besides sparse matrices. I’d like to see two things in the future:

1. accumarray should be able to take various functions so that it is possible to replace following statements

A = accumarray(ind,val,[n 1],@max);
B = accumarray(ind,val,[n 1],@min);

by something like this

[A,B] = accumarray(ind,val,[n 1],{@min @max});

2. accumarray should be able to make some kind of reference to the indices in val. A few months ago I posted a question on the newsgroup
http://www.mathworks.com/matlabcentral/newsreader/view_thread/155890
Peter Perkins came up with a neat but not very elegant idea of how to handle this. But this could be perhaps treated much better.

Best regards,
Wolfgang

You can accomplish (1) by creating a scalar cell as output from an anonymous function that combines your two functions: min and max.

f = @(x) { [min(x) max(x)] } % take note of the brackets

C = accumarray(ind,val,[n 1], f );

C{1} will have both the min and max in it for the 1st subgroup.
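
As a sketch of how the cell output might be unpacked into the two arrays Wolfgang asked for (the sample ind and val are hypothetical, and minmax is a name introduced here):

```matlab
ind = [1 2 3 1 2 3 1 2 3]';
val = (0.1:0.1:0.9)';
n = max(ind);

f = @(x) {[min(x) max(x)]};          % scalar cell holding [min max]
C = accumarray(ind, val, [n 1], f);

minmax = vertcat(C{:});              % n-by-2: column 1 = min, column 2 = max
A = minmax(:, 2);                    % matches accumarray(ind,val,[n 1],@max)
B = minmax(:, 1);                    % matches accumarray(ind,val,[n 1],@min)
```

This assumes every group from 1 to n has at least one element; an empty group would leave an empty cell, and vertcat would then return fewer than n rows.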

You can achieve (2) using test.m below. I have not tested performance. But the functionality is there. A similar solution exists using an anonymous function.

—Bob.

function A = test

ind = [1 2 3 1 2 3 1 2 3]';
val = [0.1:0.1:0.9]';
A = accumarray(ind,val,[3 1],@findmax);

function ix = findmax(s);
ix = find(val == max(s),1);
end
end

- Stephen
Thanks a lot. Your suggestion actually does the job but is much more memory demanding. When the number of unique indices becomes large I would prefer calling accumarray twice.

- Bob
thanks, too. But your suggestion has the same problem as the one I suggested. Since accumarray works on the subsets of val it returns the indices inside these subsets as shown by Peter in the aforementioned post.

I tried to post it on the newsfeed but didn’t work for some reason.

Wolfgang,

How about this:

A = accumarray(ind,1:numel(val),[3 1], @(x) findmax(x,val))

where

function ix = findmax(indx, s)
[m,ix] = max(s(indx));
ix = indx(ix);

You can think of this function as an array of collectors, which is the result array. data is what gets processed, indices specify a collector in the collector matrix (the result), and fun designates the function of each collector, that is, what to do with everything you put in it.
So what the function does is roughly this: scan the data and put each element into a certain collector (specified by the first argument, the index); each collector receives what you put in and applies fun.

How do I find the maximum of two or more values in MATLAB?
To be precise, I need to use the max value for further calculations. Do we have a direct function to find the max value? Something like in an Excel worksheet: =max(value1, value2,…)

Satishs-

Check out the help for the function max. In MATLAB, type either of the following 2 statements:

doc max
help max

–loren

When I discovered accumarray it was a revelation. Since then I’ve used it in many different situations to do things very easily that at first seemed very tricky.

However, recently a couple of people have suggested faster alternatives for some common uses which accumarray should really excel at. See:
http://www.mathworks.com/matlabcentral/newsreader/view_thread/244207
for an example.

I find this slightly disappointing because I’ve been such a proponent of the function.

I guess I was one of those people. The ultimate gist of that thread was not so much that the original poster shouldn’t use ACCUMARRAY, but rather that sometimes there are ways to make the “obvious” ACCUMARRAY solution even faster. ACCUMARRAY seemed to be much faster than the OP’s code (which used loops and logical indexing), regardless. The “sparse trick” was a competitor, and in fact SPARSE is (literally) just another way of calling ACCUMARRAY to do summation.

The reason why the “obvious” ACCUMARRAY solution, involving calling it with @mean as the “accumulation function”, is slower than both the sparse trick and the solution I proposed in that thread is that ACCUMARRAY has “summation” built right in — to accumulate by summing, it doesn’t make any function calls at all (it uses “+=” in fact). If you call it with an “accumulation function”, then it has to call that function on each bin of data. That’s the difference.

Is the speed of my posted solution worth the obscurity? You get to decide.
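
To illustrate the general idea (this is a common pattern and not necessarily the exact solution posted in that thread), a grouped mean can be built from two default-summation calls instead of calling @mean on every bin. The sample data is taken from the post; the variable names are illustrative.

```matlab
group = [1 2 2 2 3 3]';
data  = [6 43 3 4 2 5]';

% "Obvious" version: accumarray calls mean once per group
meansSlow = accumarray(group, data, [], @mean);

% Summation-only version: both calls use the built-in sum path
groupSums   = accumarray(group, data);
groupCounts = accumarray(group, 1);
meansFast   = groupSums ./ groupCounts;

maxDiff = max(abs(meansSlow - meansFast))
```

One difference to keep in mind: a group with no elements gives 0 in the @mean version (the default fill value) but NaN (0/0) in the sum-and-count version.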

Tom-

I think accumarray was designed with numeric behavior in mind so cells probably weren’t considered. I recommend you enter this in as an enhancement request including an example of how/why you’d like it to work with cells. Thanks.

Support requests

–Loren

Hi Loren,

I was wondering if there is a way to use accumarray to bin values into multiple bins and then take a sum. For example if I have a vector [1 2 3 4 5] and I wanted to get the running sum of 2 elements i.e. [1+2 2+3 3+4 4+5] .. is there a simple way of doing this with accumarray? In a case like this I want the second index to be in bin 1 and bin2 and so on.

Anshul-

I know of no way to do this with accumarray. In your specific example, the function filter can help you out.

–Loren
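
A minimal sketch of the filter approach for the two-element running sum in the question (the variable names are illustrative):

```matlab
v = [1 2 3 4 5];

% filter([1 1], 1, v) computes s(k) = v(k) + v(k-1), with v(0) taken as 0
s = filter([1 1], 1, v);

% Drop the first entry, which only contains v(1)
runningSum = s(2:end)
```

For this input, runningSum is [3 5 7 9], that is [1+2 2+3 3+4 4+5].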

Anshul-

sum(reshape(vector,[2,length(vector)/2]));   % needs an even number of elements

or round off the indices:
accumarray(floor(((1:length(vector))-1)/2)'+1,vector(:));   % sums non-overlapping pairs

hello there

can I use accumarray to add histograms with different number of bins but same bin widths ?

Say I have the following two *histograms*

dbins1 = [0.0250 0.0750 0.1250 0.1750 0.2250 0.2750 0.3250 0.3750 0.4250 0.4750];
histv1 = [4 2 10 3 20 50 31 11 22 66];

dbins2 = [0.2250 0.2750 0.3250 0.3750 0.4250 0.4750 0.5250 0.5750 0.6250 0.6750 0.7250 0.7750 0.8250];
histv2 = [6 55 10 45 10 2 14 7 6 20 45 8 4];

and I need to add them out making sure I add only histv values of the same dbins. Can this be achieved without loops ? using accumarray or other vectorized matlab code?

Thank you

I use it for this kind of histogram here

can accumarray be used for something like this :

http://www.rt-image.com/content=7304J05C4876948640969A7644A0B0441

Artful Dodger,

I don’t understand what output you want from your 4 arrays exactly. I might be tempted to use intersect and find where the overlapping bins are, and use those pieces of both histograms to add together and retain the first part of the first one, and the latter part of the second one. But I could well be misunderstanding what you are trying to achieve.

–Loren
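
Under one reading of the question (add the counts wherever the bin centers coincide, and keep the non-overlapping bins unchanged), a loop-free sketch with accumarray could look like this. The common bin width of 0.05 and the names binWidth, idx, histSum, and dbinsSum are assumptions made for illustration.

```matlab
dbins1 = [0.0250 0.0750 0.1250 0.1750 0.2250 0.2750 0.3250 0.3750 0.4250 0.4750];
histv1 = [4 2 10 3 20 50 31 11 22 66];
dbins2 = [0.2250 0.2750 0.3250 0.3750 0.4250 0.4750 0.5250 0.5750 0.6250 0.6750 0.7250 0.7750 0.8250];
histv2 = [6 55 10 45 10 2 14 7 6 20 45 8 4];

binWidth = 0.05;                                   % assumed common bin width
allBins  = [dbins1 dbins2];
allVals  = [histv1 histv2];

% Map every bin center to an integer position on a common grid;
% round guards against floating-point error in the division
idx = round((allBins - min(allBins)) / binWidth) + 1;

histSum  = accumarray(idx(:), allVals(:))';        % counts added bin by bin
dbinsSum = min(allBins) + binWidth * (0:numel(histSum) - 1);
```

Bins that appear in only one histogram keep their original counts, and the six overlapping bins (0.225 through 0.475) hold the sum of both.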

I noticed the accumarray() function being used in the most recent matlab contest (compressive sensing). I couldn’t tell how it is used from either the code or the documentation. The documentation for this function is atrocious (I’m currently running R2009b).

To better understand what it is doing, I wrote code to replicate the functionality (shown below).

This post is an attempt to improve the documentation (I aim to not confuse the reader).

I think what confused me the most was the indexing for the first input. The ‘subs’ input is of size [NxM], where the M is the dimension of the output array, and N should be equal to the number of elements in the array vals.

Each element in vals is moved into the output specified by the [1 x M] subscript in the corresponding position in subs.

So: All elements, such as “vals(i)” are moved into the output array positions “output(subs(i,:))” after transformation by the specified function (default is sum).

% Francis Esmonde-White, May 6, 2010

% example input.

val = 101:105;
subs = [1 1 1; 2 1 2; 2 3 2; 2 1 2; 2 3 2];

% equivalent functionality to basic accumarray

output_dimensions = max(subs);
output = zeros(output_dimensions);

for ix=1:numel(output_dimensions)
    subs2{ix} = subs(:,ix)';
end

ind = sub2ind(output_dimensions,subs2{:});

ind_list=unique(ind);
for ix = ind_list
    % note that the operation is done here. The sum function can be
    % substituted for (an)other function(s).
    output(ix)=sum(val(ind==ix));
end

Francis-

Thanks for your thoughtful post. I have passed it along to the documentation team. I expect it will provide a lot of insight. Thanks again.

–loren

Like Francis, I find the documentation for ACCUMARRAY difficult to understand. Most people I’ve talked to don’t use this function either because they haven’t heard of it or don’t understand it.

One thing that I can see as an issue is its name – I wish it was named something like AGGREGATEANDCALC or AGGREGATEVALUES, which to me is in line with what it actually does. Also, the name ACCUMARRAY sells the function short, since you can do so much more than sum the values – when searching the doc it can be hard to come up with it as the function to solve your problem, and its use can be confusing to people looking at your code as to why you’re using a function with that name to do something completely unrelated to accumulating…

Regards,
EBS

How about changing the first example in the doc to something like this (see the expanded comments), I hope the formatting holds up here:

Create a 5-by-1 vector and sum values for repeated 1-D subscripts:

val = 101:105;
subs = [1; 2; 4; 2; 4]
subs =
1
2
4
2
4

A = accumarray(subs, val)
A =
101 % A(1) = sum(val(subs==1)) = 101
206 % A(2) = sum(val(subs==2)) = val(2)+val(4) = 102+104 = 206
0 % A(3) = sum(val(subs==3)) = 0
208 % A(4) = sum(val(subs==4)) =val(3)+val(5) = 103+105 = 208

Loren,

Thanks for the quick reply! I agree with EBS’s comment and suggestion. Including additional comments in the documentation to better explain the output would help a lot.

The function name is definitely not intuitive, and a name such as “IndexedFEval()” or “ifeval()” might improve comprehension (short for indexed function evaluation- or something similar). Usage of this function seems more closely related to FEVAL, ARRAYFUN, or EVAL than accumulation, even though none of them are linked to each-other in the “See also” segments of the documentation.

In a similar vein, it might actually improve comprehension if there was no default function (currently ‘@sum’), so that it is immediately obvious that this is a generally-applicable function, and not only a sub-case of array accumulation that can be cleverly manipulated for other purposes. Perhaps leaving the default calling command for accumarray() intact and adding a more intuitive calling interface (without the default function option) such as ifeval() would help.

Thanks again,
Francis

A = accumarray(subs, val); % or, hopefully sometime soon, A = ifeval(subs, val, [1, 4], @sum);

% A(1) = sum(val(subs==1)) = 101
% A(2) = sum(val(subs==2)) = val(2)+val(4) = 102+104 = 206
% A(3) = sum(val(subs==3)) = 0
% A(4) = sum(val(subs==4)) =val(3)+val(5) = 103+105 = 208

I agree with Francis’ suggestion, I’ve also thought that it would be great to have a new function with a more general name like IFEVAL or IF_EVAL that could just be a wrapper for ACCUMARRAY.

I actually had intended for my example to be more like the following, where each line is very explicit:

A = ifeval(subs, val, [1, 4], @sum)

% A(1) = sum(val(subs==1)) = 0 + val(1) = 101
% A(2) = sum(val(subs==2)) = 0 + val(2)+val(4) = 102+104 = 206
% A(3) = sum(val(subs==3)) = 0 + [] = 0
% A(4) = sum(val(subs==4)) = 0 + val(3)+val(5) = 103+105 = 208

You could then show the same example in a more generalized context like you showed earlier:

% * First find out how many unique indices there are in
% subs.  
n = hist(subs,max(subs))
% The length of this array corresponds to the maximum
% index value in subs.  This is the size of the output array.
% * Pre-allocate the output array A2 to be the correct size
A2 = zeros(length(n),1);
%
% Add each entry in val to the "appropriate" output entry
for k = 1:length(n);
    % Find correct subscript for adding to A2
    % logical index for val containing n(k) nonzeros
    ind = (subs == k)
    % here are the values in the data from the
    % corresponding indices
    val_k = val(ind)
    % sum up these values
    A2(k) = sum(val_k)
end
% check to make sure both approaches give the same answer;
isequal(A,A2)

Regards,
Eric

Loren,

Thanks to the discussion above, I already found a solution for the problem I encountered with accumarray, but nevertheless, I would like to share it with you. For my simulation work I need to create large (sparse) matrices. At one point, accumarray gave me the error that I exceeded the maximum variable size allowed by my computer. An easy way to trigger this error is:

A=accumarray([4e5;4e5]*[1 1],1,[4e5 4e5],[],[],true);

which should give me a sparse array with only one element at (4e5,4e5) with value 2, but instead gave me the error. A way around this problem is to use

A=sparse([4e5;4e5],[4e5;4e5],1);

which gives me the result I want. Now A(4e5,4e5)=2. However, with this method you cannot use any other function than the summation to accumulate your results.

Accumarray is a very powerful function that I like to use, but it would be nice if the maximum value could be increased. By the way, I noticed that another computer (64 bits and a newer Matlab version) did allow a larger number, but even there the ‘sparse’ function can still surpass ‘accumarray’.

Regards,
Ezra

I’ve just spent hours trying to understand accumarray, despite reading the 2011b documentation several times – has it been improved since the original complaints?

One simple change would have helped me a lot. The documentation says “the value of an element in subs determines the position of the accumulated vector in the output”. This is highly misleading. If only it had stated that each *row* of the subs matrix represents the indices of one element of the output array, I’d have understood far more quickly.

Until I realised this (from looking at your first blog example), which took me a long time, I just couldn’t make sense of the second example in the documentation. It’s really disappointing that a useful function is let down by this truly dreadful documentation.

David, like you, I am not happy that the doc has let you down here. We take this sort of feedback to heart, so thank you. As an experiment, I’d be curious if what you get from “help accumarray” (quoted below) strikes you as easier to understand than what you get from “doc accumarray” (which is what I think you are citing).

- Peter Perkins
The MathWorks, Inc.

>> help accumarray
accumarray Construct an array by accumulation.
A = accumarray(SUBS,VAL) creates an array A by accumulating elements of the vector VAL using the subscripts in SUBS. Each row of the M-by-N matrix SUBS defines an N-dimensional subscript into the output A. Each element of VAL has a corresponding row in SUBS. accumarray collects all elements of VAL that correspond to identical subscripts in SUBS, sums those values, and stores the result in the element of A corresponding to the subscript. Elements of A that are not referred to by any row of SUBS contain zero.

Peter, thank you. Yes, I was looking at the doc, not the help, and yes, I agree that the help you quote is far better – in fact it exactly addresses the main difficulty I had in understanding the function.

David

Hi Loren,

I am a bit confused about the use of accumarray, and need to figure this out:

I have a large cell array of 2 columns. Column 1 has strings from a finite set (assume that we do not know beforehand how many unique elements are there in the set). I need to group this cell array based on the unique strings of column 1. Col 2 has numerical values. That is: all rows with string 1 in col 1 will go into a separate cell array and so on for as many unique strings as there are in col 1. Is this possible using accumarray?

Thanks!

Par, yes you can. Your data are perhaps not in the most convenient form, but I’ll say more in a moment.

First, set up some data like what you described.

>> strs = {'a' 'bb' 'ccc'}';
>> s = strs(randi(3,10,1));
>> x = randn(10,1);
>> c =  [s num2cell(x)]
c = 
    'bb'     [ 0.74808]
    'ccc'    [-0.19242]
    'ccc'    [ 0.88861]
    'ccc'    [-0.76485]
    'bb'     [ -1.4023]
    'a'      [ -1.4224]
    'a'      [ 0.48819]
    'a'      [-0.17738]
    'ccc'    [-0.19605]
    'a'      [  1.4193]

Now pretend we don’t know where c came from.

Create the group indices from the strings …

>> [u,~,i] = unique(c(:,1));

… and group the numeric data by the indices

>> groups = accumarray(i,cell2mat(c(:,2)),size(u),@(t) {t});
>> groups{:}
ans =
      -1.4224
     -0.17738
      0.48819
       1.4193
ans =
      0.74808
      -1.4023
ans =
     -0.19242
     -0.19605
      0.88861
     -0.76485

Or compute grouped means

>> groupMeans = accumarray(i,cell2mat(c(:,2)),size(u),@mean)
groupMeans =
     0.076938
      -0.3271
    -0.066178

Now, ACCUMARRAY requires the second input to be numeric, so in the above, you have to convert the second column of your cell array to a numeric column. I imagine you have this one cell array because you wanted to mix string and numeric data in a single array.

If you have access to the Statistics Toolbox, you may find that using a dataset array instead of a cell array makes your life easier. In fact, there’s a function called GRPSTATS that will do both of the above for you. You may also find that using a nominal (or ordinal) array for your string data, rather than a cell array makes your life easier too. The dataset array would then contain one nominal and one numeric column. For example:

>> s = nominal(s);
>> d = dataset(s,x)
d = 
    s      x       
    bb      0.74808
    ccc    -0.19242
    ccc     0.88861
    ccc    -0.76485
    bb      -1.4023
    a       -1.4224
    a       0.48819
    a      -0.17738
    ccc    -0.19605
    a        1.4193
>> groupMeans = grpstats(d,'s')
groupMeans = 
           s      GroupCount    mean_x   
    a      a      4              0.076938
    bb     bb     2               -0.3271
    ccc    ccc    4             -0.066178
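
For readers on newer MATLAB releases without the Statistics Toolbox, roughly the same grouped means can be sketched with the base table and categorical types and the groupsummary function. This is only a sketch under assumptions: the data below are made up for illustration (not the random values above), and groupsummary requires a release that includes it.

```matlab
% Hypothetical data, chosen so the grouped means are easy to check by hand.
s = categorical({'bb'; 'ccc'; 'ccc'; 'a'; 'a'});
x = [1; 2; 4; 3; 5];
T = table(s, x);

% Group by s and take the mean of x within each group.
G = groupsummary(T, 's', 'mean', 'x');
```

The result G has one row per category of s, with a GroupCount column and a mean_x column, much like the GRPSTATS output above.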

Hope this helps.

Thanks a lot, Peter. That really helps! Your example is also a great aid to understand how this works. Appreciate your help.

Just one more question: The elements in the subsets, i.e., in the “groups” cell array, do not seem to be in any particular order. I thought they might be sorted in some order, either in ascending/descending, or in the order of occurrence in the “c” array. But they seem to be randomly ordered. Is that so? Could you please throw some light on this? Of course, it is no big deal to use the sort command to sort them, but if I know they are already sorted, then I wouldn’t bother about sorting as the data I am dealing with is quite large, about 10-15 million elements.

By the way, another quick question: If we are unsure about the sort status of an array, then I know we can use the ‘issorted’ command to check. Now if the array happens to be already sorted, then does the redundant checking cost any computation time? Even if the array is very large?

Thanks!

Par, the help for ACCUMARRAY says, “Note: If the subscripts in SUBS are not sorted, FUN should not depend on the order of the values in its input data.” What that means is what you observed: for FUN == @(t) {t}, the order of the elements in the output is not predictable. HOWEVER, if the subscripts _are_ sorted, then the order of the elements will be what you expect. So, if the order of the elements in the output is critical, you can always sort SUBS, and reorder VAL in parallel to that using the second output from SORT.
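
A minimal sketch of that sort-then-accumulate step, using small made-up vectors (the names subs, vals, and order are just for illustration):

```matlab
% Hypothetical data: the subscripts arrive unsorted.
subs = [3 1 2 1 3 2]';
vals = [10 20 30 40 50 60]';

% Sort the subscripts, and reorder the values in parallel
% using the second output of SORT.
[subsSorted, order] = sort(subs);
valsSorted = vals(order);

% With sorted subscripts, each cell keeps its values in order of occurrence.
groups = accumarray(subsSorted, valsSorted, [], @(t) {t});
```

Because SORT keeps equal elements in their original relative order, groups{1} here should come out as [20; 40], groups{2} as [30; 60], and groups{3} as [10; 50].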

The following demonstrates that using ISSORTED is a win, at least for large arrays:

>> x = randn(10000000,1);
>> xs = sort(x);
>>
>> tic, y = sort(x); toc
Elapsed time is 0.429657 seconds.
>> tic, y = sort(xs); toc
Elapsed time is 0.112595 seconds.
>> tic, if ~issorted(x), y = sort(x); end, toc
Elapsed time is 0.420848 seconds.
>> tic, if ~issorted(xs), y = sort(xs); end, toc
Elapsed time is 0.021241 seconds.

Got it! Now I understand the utility of this function much more clearly. Thank you so much for your explanations.

About the unpredictable order of the resulting subset elements: I gather, then, that the inner workings of accumarray are not as simple as scanning the input array element by element and dropping each element into its appropriate bin as the code moves forward. I am guessing that would have been quite inefficient?

Sorry Peter. I completely overlooked another requirement in my problem description earlier. Actually, I have two more columns in the array ‘c’, for example, ‘c’ looks like this:


c = 
    'ccc'    [16]    [71]    'n'
    'ccc'    [98]    [ 4]    'p'
    'a'      [96]    [28]    'n'
    'ccc'    [49]    [ 5]    'n'
    'bb'     [81]    [10]    'n'
    'a'      [15]    [83]    'n'
    'a'      [43]    [70]    'n'
    'bb'     [92]    [32]    'p'
    'ccc'    [80]    [96]    'p'
    'ccc'    [96]    [ 4]    'p'

When I put the grouped elements in their respective bins, the corresponding elements in columns 3 and 4 also need to go into the bins in the same rows as the elements of col 2. Is this possible?

Thank you.

Hello Again Peter,

Continuing on my latest reply, I figure there are two ways to solve the problem. One is to first sort the original cell array based on its first column so that when I use ‘unique’ to get the ‘n’ vector, it will be sorted. Then, I can use 3 separate accumarray statements as follows:

groups2 = accumarray(n,cell2mat(c1(:,2)),size(u),@(t) {t});
groups3 = accumarray(n,cell2mat(c1(:,3)),size(u),@(t) {t});
groups4 = accumarray(n,cell2mat(c1(:,4)),size(u),@(t) {t});

The other solution I can think of is to first get the list of unique strings using ‘unique’ and then use a loop along with ‘strcmpi’ to find the indices for all the occurrences of each unique string in the main cell array and accumulate the rows manually in separate cells or the fields of a structure.

Other than these two solutions, is there a better way?

Thanks a lot!

Gursu, meanangle appears to be a submission to the MATLAB File Exchange. I know nothing about it, but since accumarray accepts a function handle, I can only assume that you could pass accumarray a function handle to meanangle.

Par -

1) You should not depend on the internal implementation of accumarray. Stick to what the help tells you.

2) There are probably several ways to do what you’re asking about, but here’s the way I’d approach it (I’ll leave it as an exercise to figure out how this works):

   [u,~,i] = unique(c(:,1));
   r = (1:size(c,1))';
   groups = accumarray(i,r,size(u),@(t) { c(t,2:4) });
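
One way to read the snippet above: accumarray is grouping row numbers rather than data values, and the function handle then uses each group's row numbers to pull the matching rows out of c. Here is a small sketch with a made-up three-row cell array. The sort(t) inside the handle is an addition, not part of the original suggestion; it is there only so that the rows in each bin appear in their original order even when the subscripts are unsorted.

```matlab
% Hypothetical data with the same layout as c above.
c = {'b' 1 10 'n';
     'a' 2 20 'p';
     'b' 3 30 'n'};

[u,~,i] = unique(c(:,1));   % u = {'a';'b'}, i = group number of each row
r = (1:size(c,1))';         % row numbers 1..3

% Each call to the handle receives the row numbers t belonging to one group
% and returns those rows of columns 2:4, wrapped in a cell.
groups = accumarray(i, r, size(u), @(t) { c(sort(t),2:4) });
```

Here groups{1} holds the single 'a' row and groups{2} holds both 'b' rows, with columns 2 through 4 kept together.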

Got it Peter, completed the exercise :) Your solution is quite elegant. I am happy that I finally understood this function. The fact that you can pass a function handle makes this function so powerful!

These postings are the author's and don't necessarily represent the opinions of MathWorks.