Loren on the Art of MATLAB

October 13th, 2008

Vectorizing the Notion of Colon (:)

The other day, one of my MathWorks buddies, Peter, asked me if I knew any tricks for vectorizing the notion of the colon (:) operator. For me, this sort of inquiry sometimes has an effect similar to waving a flag in front of a bull! The challenge was on.

I know people have posted methods in the past on the MathWorks newsgroup. I quickly had a few more ideas and implemented them.

Contents

Problem Statement

Suppose I have matching pairs of integers depicting the starting and stopping indices for selecting points from another vector. I've shown in other posts that vectorizing the indexing can have tremendous performance benefits for your code. What I'd like to do is create one full list of indices so I ultimately only index into the data vector once.

Solution Variants

Here are a few ways to solve the problem.

  • Build up the array of indices, looping (for) over each pair, one at a time.
  • Write the indices to a string and have MATLAB evaluate the various : expressions.
  • Use arrayfun (two algorithms) as the engine for creating the output.
  • Modify the idea of run-length decoding, inspired by Peter Acklam's run length decoder. MathWorks Peter used this as a basis for the rld-based algorithm.

Test the Solutions

Let's prove that the solutions all get the same answers first before delving into the algorithm details. I'll give a vector of starting indices followed by a vector of ending indices. If an ending value is smaller than a starting value, that pair is effectively skipped. Duplicate starting points are allowed.

starts = [1 7 7 15 17];
ends = [3 8 6 15 19];
expectedAnswer = [1 2 3 7 8 15 17 18 19];

Round up the files to run.

d = dir('coloncat*.m');
d.name
ans =
coloncatarrf1.m
ans =
coloncatarrf2.m
ans =
coloncatfor.m
ans =
coloncatrld.m
ans =
coloncatstr.m

Run each file with the test data.

for k = 1:length(d)
    func = str2func(d(k).name(1:end-2));
    outputs{k} = func(starts, ends);
end

Validate that all the methods calculated the expected results.

rightAnswers = isequal(expectedAnswer,outputs{:})
rightAnswers =
     1

Look at the Algorithms

Let's look at the variety of algorithms and coding techniques I used. If you look at the first chunk of code in each file, you will see they are algorithmically identical - checking for sequences that have length of at least one (1) and deleting the other entries. What we really want to compare are the lines that follow that section.

Use a for Loop

This is perhaps the most straightforward method.

dbtype coloncatfor
1     function x = coloncatfor(start, stop)
2     % COLONCAT Concatenate colon expressions
3     %    X = COLONCAT(START,STOP) returns a vector containing the values
4     %    [START(1):STOP(1) START(2):STOP(2) START(END):STOP(END)].
5     
6     len = stop - start + 1;
7     
8     % keep only sequences whose length is positive
9     pos = len > 0;
10    start = start(pos);
11    stop = stop(pos);
12    len = len(pos);
13    if isempty(len)
14        x = [];
15        return;
16    end
17    
18    % find end and beginning indices for each chunk
19    partialLength = cumsum(len);
20    cumStart = [1 partialLength(1:end-1)+1];
21    
22    % preallocate output
23    % then loop through start/stop pairs, in order, and fill
24    numtot = sum(len);
25    x = zeros(1,numtot);
26    for k = 1:length(start)
27        x(cumStart(k):partialLength(k)) = start(k):stop(k);
28    end

After culling the start and stop indices, I calculate where each successive sequence ends (partialLength) and where they each begin (cumStart). I next preallocate the output array to its final size. And finally, as I loop through the matched pairs of start and stop values, I generate the colon sequence and fill these values into the correct output locations.

Use sprintf and eval

This method uses sprintf which allows me to gather all of the index start and stop values into a string without looping. I use eval evaluate the string, thereby computing the output.

dbtype coloncatstr
1     function x = coloncatstr(start, stop)
2     % COLONCAT Concatenate colon expressions
3     %    X = COLONCAT(START,STOP) returns a vector containing the values
4     %    [START(1):STOP(1) START(2):STOP(2) START(END):STOP(END)].
5     
6     len = stop - start + 1;
7     
8     % keep only sequences whose length is positive
9     pos = len > 0;
10    start = start(pos);
11    stop = stop(pos);
12    len = len(pos);
13    if isempty(len)
14        x = [];
15        return;
16    end
17    
18    % create list of indices for : and evaluate
19    indices = [start; stop];
20    list = sprintf('%d:%d ',indices);
21    x = eval(['[', list, ']']);

Use arrayfun - Method 1

The next two methods use arrayfun to do the work. Here's the first approach.

dbtype coloncatarrf1
1     function x = coloncatarrf1(start, stop)
2     % COLONCAT Concatenate colon expressions
3     %    X = COLONCAT(START,STOP) returns a vector containing the values
4     %    [START(1):STOP(1) START(2):STOP(2) START(END):STOP(END)].
5     
6     len = stop - start + 1;
7     
8     % keep only sequences whose length is positive
9     pos = len > 0;
10    start = start(pos);
11    stop = stop(pos);
12    len = len(pos);
13    if isempty(len)
14        x = [];
15        return;
16    end
17    
18    % put entries into a cell array for later use by arrayfun
19    cstart = num2cell(start);
20    cstop = num2cell(stop);
21    vals = arrayfun(@(x,y){x{1}:y{1}},cstart,cstop);
22    x = [vals{:}];

Place the start and stop values into cell arrays to feed to arrayfun. The entries for the two arrays act in pairs, and the function that I have arrayfun apply places in a cell array the sequence of values between the numbers in the paired cells. Finally, take all the output cells in the cell array and string out the values into a single numeric vector.

Use arrayfun - Method 2

Here's the second approach using arrayfun. It involves a for loop as well.

dbtype coloncatarrf2
1     function x = coloncatarrf2(start, stop)
2     % COLONCAT Concatenate colon expressions
3     %    X = COLONCAT(START,STOP) returns a vector containing the values
4     %    [START(1):STOP(1) START(2):STOP(2) START(END):STOP(END)].
5     
6     len = stop - start + 1;
7     
8     % keep only sequences whose length is positive
9     pos = len > 0;
10    start = start(pos);
11    stop = stop(pos);
12    len = len(pos);
13    if isempty(len)
14        x = [];
15        return;
16    end
17    
18    % Find unique sequence lengths
19    uniqlens = unique(len);
20    
21    % preallocate space for cell outputs of arrayfun
22    chunks = cell(1,length(len));
23    
24    % Now loop over the lengths, being sure to place the 
25    % outputs in the right positions.
26    for ilen = uniqlens(:)'
27        idxlen = find(len==ilen);
28        chunks(idxlen) = arrayfun(@(xx,yy)xx:yy,...
29            start(idxlen),stop(idxlen),'UniformOutput',false);
30    end
31    % reassemble chunks into output
32    x = [chunks{:}];

The concept here is to loop over sequences that have the same length. At most, this for loop is the same length as the number of values in the initial start vector. If the length of the inputs is long, however, it might be advantageous to loop over sequences that generate the same length (commensurate size) output. Placing the output into cells allows me to not worry that I might be intermingling output vectors of different sizes. Once all the output sequences are collected in the cell array, I flatten it into a vector the same way I did with method 1.

Use Run-Length Decoding Idea

Here's an algorithm based on the idea of run-length decoding.

dbtype coloncatrld
1     function x = coloncatrld(start, stop)
2     % COLONCAT Concatenate colon expressions
3     %    X = COLONCAT(START,STOP) returns a vector containing the values
4     %    [START(1):STOP(1) START(2):STOP(2) START(END):STOP(END)].
5     
6     % Based on Peter Acklam's code for run length decoding.
7     len = stop - start + 1;
8     
9     % keep only sequences whose length is positive
10    pos = len > 0;
11    start = start(pos);
12    stop = stop(pos);
13    len = len(pos);
14    if isempty(len)
15        x = [];
16        return;
17    end
18    
19    % expand out the colon expressions
20    endlocs = cumsum(len);  
21    incr = ones(1, endlocs(end));  
22    jumps = start(2:end) - stop(1:end-1);  
23    incr(endlocs(1:end-1)+1) = jumps;
24    incr(1) = start(1);
25    x = cumsum(incr);

The idea is to expand the colon sequences out. I know the lengths of each sequence so I know the starting points in the output array. Fill the values after the start values with 1s. Then I figure out how much to jump from the end of one sequence to the beginning of the next one. If there are repeated start values, the jumps might be negative. Once this array is filled, the output is simply the cumulative sum or cumsum of the sequence.

Other Methods?

These methods I've shown, for vectorizing the notion of the colon operator, are ones that came very quickly. I haven't spent concerted time looking for other algorithms though I am positive some others, including some clever ones, probably exist. Can you help me and provide other possible algorithmic ideas? Please post them here.


Get the MATLAB code

Published with MATLAB® 7.7

14 Responses to “Vectorizing the Notion of Colon (:)”

  1. matt fig replied on :

    It’s not super fast, and I haven’t provided error checking, but just another method that comes to mind:

    function x = coloncatcell(start, stop)

    c = cellfun(@colon,num2cell(start),num2cell(stop),’UniformOutput’,0);
    x = cat(2,c{:});

  2. Jason Merrill replied on :

    Wow Matt, very slick. I hadn’t realized you could use cellfun to zip together more than one cell array like that. Following your lead, here it is as a one liner:

    coloncatshort = @(starts,ends) cell2mat(arrayfun(@(a,b) {a:b},starts,ends))

    isequal(coloncatshort(starts,ends),expectedAnswer)

    ans = 1

  3. Jason Merrill replied on :

    Note that Matt’s method doesn’t need to do any extra checks for positive length sequences. It will do the right thing with negative length sequences (i.e. ignore them).

  4. Jos replied on :

    Here is a vectorized fast solution using “core” matlab, similar to the run-length decoding algorithm. I was inspired by one of my submissions to the FEX (NONES). The drawback is that it creates a (possibly large) intermediate array:

    starts = [1 7 7 15 17];
    ends = [3 8 6 15 19];

    len = ends - starts + 1 ;
    A = [len(:) repmat(-1,numel(len),max(len)-1)].’ ;
    A = cumsum(A) > 0 ;
    B = double(A) ;
    B(1,:) = starts ;
    B = cumsum(B) ;
    C = B(A).’

  5. Stefan replied on :

    I’ve been trying to find out a bit more about the performance of cellfun and arrayfun. I use these functions quite regularly to step through large cell arrays and I’m curious how they compare to doing a standard for loop (as discussed in this article.)

    Does anyone have any thoughts pertaining to this?

    Thanks!

  6. Daniel Armyr replied on :

    Hi.
    When reading the article I realize i didn’t quite understand the actual problem. Now that I do understand it, I wonder if this means that there will be plans to update the colon() to allow array inputs? Right now it seems it takes the very first value fo any array input and ignores the rest.

    –DA

  7. Loren replied on :

    Nice additional algorithms, everyone!

    Daniel,
    This article does not imply any changes to the colon operator, or even the introduction of such a function into MATLAB. We will always consider requests along these lines if we hear enough from customers.

    –Loren

  8. Tom Richardson replied on :

    Note that the one line solutions easily generalize to variable increments, e.g.,

    >> coloncatshort = @(starts,steps,ends) cell2mat(arrayfun(@(a,b,c) {a:b:c},starts,steps,ends))
    coloncatshort =
    @(starts,steps,ends)cell2mat(arrayfun(@(a,b,c){a:b:c},starts,steps,ends))
    >> coloncatshort([1 3 2],[4 4 2],[5 7 8])
    ans =
    1 5 3 7 2 4 6 8

  9. Matt Fig replied on :

    I am often in need of vectorizing an operator where the inputs aren’t the same size. This seems more challenging to me. For instance: given an array MAT = M(IDX) and a vector of unique values M, find IDX. This involves the == operator with different length inputs. The for loop approach is straightforward:

    function [IDX] = getidx(MAT,M)
    % Given MAT = M(IDX) and M, this finds IDX.
    % Where length(unique(M)) = length(M).

    IDX = zeros(size(MAT));

    for ii = 1:length(M)
    IDX(MAT==M(ii)) = ii;
    end

    I am at a loss to vectorize this, perhaps I just need more sleep :)

  10. Loren replied on :

    Tom-

    Nice contribution!

    Matt-

    Check out bsxfun. I’ve talked about it in the blog once or twice.

    –Loren

  11. Matt Fig replied on :

    Rather than go too off topic (against Loren’s rules!), I created a thread on the newsreader that addresses my question. Anyone who is interested may find it here:

    http://www.mathworks.com/matlabcentral/newsreader/view_thread/237608

  12. Lucio replied on :

    I am not sure if this is bullet proof for all the cases, but I usually write something like this for this type of problems:

    len = max(max(starts),max(ends))+1;

    find(cumsum(accumarray(starts’,1,[len,1])-accumarray(ends’+1,1,[len,1])))

    Lucio

  13. Walter Roberson replied on :

    A problem with the given solutions is that they do not handle the ‘end’ operator — a strange operator that contextually knows what the appropriate limits are for each dimension.

    http://groups.google.ca/group/comp.soft-sys.matlab/browse_thread/thread/c1ef54d585340aea/00766f55e26eaf6c

    Doug Schwarz wrote:
    > b = a(:,setdiff(1:end,1));

    which unexpectedly works, and I explored why this is unexpected (and hence why no user-level routine can
    duplicate the meaning of ‘end’)

  14. Loren replied on :

    Walter-

    You can use “end” in some expressions, IF the expressions are used for indexing into an array as Doug’s expression does. In my tests up above, you can’t use “end” because I am simply generating the sequence without indexing. In addition, Doug’s works because the value of “end” in that context is possible for MATLAB to ascertain. I think many, perhaps all, of the other solutions don’t have the indexing context explicit enough at the call site for MATLAB to discern the meaning of “end”.

    –Loren

    –Loren

Leave a Reply

Wrap code fragments inside <pre> tags, like this:

<pre class="code">
a = magic(3);
sum(a)
</pre>

If you have a "<" character in your code, either follow it with a space or replace it with "&lt;" (including the semicolon).


Loren Shure works on design of the MATLAB language at The MathWorks. She writes here about once a week on MATLAB programming and related topics.

  • Loren: Paul- There *are* issues depending on the sizes of ii and jj. And it’s a bit complicated, but really...
  • Loren: Bob- You don’t say what happens when you run your code. Can you please explain more. It looks like you...
  • Loren: Kishore- It is not clear to me what you are trying to actually achieve. If you want to concatenate the 4...
  • Kishore: sorry, in the previous code mat2cell(c,[19 121],[19 134],[19 84],[19 107])
  • Kishore: Hi Loren, Why does the following not work? data_classwise = [19x121 double] [19x134 double] [19x84 double]...
  • Paul Jackson: Loren, Are there any aspects of empty matrices that may be tricky when they are used as indices into...
  • Bob: Hi Lori, Im trying to process Unicode text files from more than one different locales than the standard latin...
  • Loren: Ben- The reference link in my post documents the behavior of sum([]) and prod([]) (although the prod part only...
  • Ben: Loren/Andrey, A further advantage of having sum([])==0 and prod([])==1 is that it’s consistent with array...
  • Loren: OysterEngineer- I will SO take you up on that offer. Can’t wait for a good reason to visit now....

These postings are the author's and don't necessarily represent the opinions of The MathWorks.