Partitioning a Vector9

Posted by Loren Shure,

Recently on the MATLAB newsgroup, there was a thread asking how to split up a vector into pieces which were each monotonically increasing by the value 1. The post got several answers which I did not read first. Here's my thinking.

Contents

Sample Data

Here's my sample data.

a=[2,3,4,7,9,12,13,14,15]

a =
2     3     4     7     9    12    13    14    15


Problem and Solution

I want to break this into 4 pieces: the runs 2:4 and 12:15, plus the scalars 7 and 9. Here's how I thought about the problem.

First I want to find the runs, which are elements that differ by 1. So I calculate the differences, making sure I include the final value in the array (which is why I append a 0 below).

ad = [diff(a) == 1 0]

ad =
1     1     0     0     0     1     1     1     0


Next, I figure out how many runs there are, by seeing how many 0 values are represented in the differences. This tells me how many arrays I will split my original array into.

numcells = sum(ad==0)
out = cell(1,numcells);

numcells =
4


Next I find the ending indices of the chunks by looking where ad is 0. Then I start to create contents for each cell in the cell array out, starting with index 1 in the original array.

indends = find(ad == 0);
ind = 1;
for k = 1:numcells
out{k} = a(ind:indends(k));
ind = indends(k)+1;
end


Here's what out looks like now.

out
out{:}

out =
[1x3 double]    [7]    [9]    [1x4 double]
ans =
2     3     4
ans =
7
ans =
9
ans =
12    13    14    15


File Exchange Function

On the File Exchange, you can get a more general purpose function called SplitVec on SplitVec by Bruno. It does this and a whole lot more.

Do You Need to Partition Data?

If you need to partition data, I'd like to understand why, and how you do so. Let me know here.

Get the MATLAB code

Published with MATLAB® R2012b

Jim Ross replied on : 1 of 9

I have run into a need for this type of functionality quite often – often when looking for significant steps in a set of data or splitting a single data file into multiple sets of data based on some criteria.

At times, my data sets have been very large or I have been processing many data sets. I have looked for a way to vectorize this so that I didn’t have to loop (or at least didn’t have to have nested loops). Do you have any thoughts?

Ed replied on : 2 of 9

A one-liner version of your code:

out = mat2cell(a,1,accumarray(1+cumsum([0,diff(a)~=1])’,ones(size(a))’));

Loren Shure replied on : 3 of 9

Jim and Ed-

I timed Ed’s one-liner vs. the code in the blog and they seemed, for SMALL datasets, fairly comparable. I found the for loop was faster more often. For my money, the accumarray solution, though clever, is much harder to read.

SplitVec from the FEX had similar performance though a tad slower usually – I believe because it does so much more inside. But it does have broader functionality.

Jim – Have you tried SplitVec? And you can try Ed’s accumarray solution perhaps.

–Loren

Andrey replied on : 4 of 9

I like Loren’s solution much more. While one-liners look smart and you have a certain feeling of achievement when you manage to get them right, I think that it is one of the worst software engineering practices. That is because it is hard to debug them (as the debugger favors multiple lines) and because they rarely convent the intent of the programmer to the one who reads the code.

Matt Fig replied on : 5 of 9

Here is another alternative solution.

 E = diff(a)~=1; S = a([true E]); % Starting points. E = a([E true]); % Ending points. % Now create the partition. for ii = length(E):-1:1 % Dynamic pre-allocation out{ii} = S(ii):E(ii); % Create vectors. end

I think a person could simply store S and E without actually partitioning the vector, depending on the application of course. All the information is in S and E anyway.

Loren Shure replied on : 6 of 9

Matt-

You are right that S and E might be all that a person needs. I hope people realize that you ARE preallocating the cell array out by filling the contents backwards.

–Loren

Julian replied on : 7 of 9

Interesting blog entry. This pattern often recurs in my work too — I usually construct a “break-point” matrix bp (named after the parameter for detrend) which is just [S;E] that Matt Fig gives. Matt’s approach pretty much identical to mine. A slight difference in mine shown below is to use arrayfun and check it scales down to empty…

E = diff(a)~=1;
S = a([~isempty(a) E]); % Starting points.
E = a([E ~isempty(a)]); % Ending points.
out = arrayfun(@(s,e) s:e, S, E, ‘uni’, 0)

Sometimes I go on to construct a struct array containing info about my groups, such as start, end, nr of points, what changed, text labels, and other meta-data.

Martin Afanasjew replied on : 8 of 9

Here’s another one-liner that is maybe more readable than Ed’s version. It certainly does less arithmetic. But the bottleneck seems to be “mat2cell” anyway (it does its job iteratively).

out = mat2cell(a, 1, diff([0, find(diff(a) ~= 1), length(a)]));

The code assumes it gets a row vector (as per the example). First, “diff” and “find” locate the interior breakpoints. Then, I prepend 0 and append the length of the vector to get the implied breakpoints (the bounds of the vector). Finally, “diff” converts those breakpoints into chunk lengths and leaves the splitting to “mat2cell”.

Andrew replied on : 9 of 9

I agree with Julian’s approach and use code very similar to his quite often. I rarely partition based on monotonically increasing values separated by one. A more realistic example would be noisy periodic data that I wish to partition into a new cell when a state meets some more general logical condition or threshold. Sometimes I’m not interested in keeping all of the original data, but may be just the index or value of the multidimensional array data at points where where the logical condition is met.