Vectorizing the Notion of Colon (:)

저자 Loren Shure, October 13, 2008

10 회 조회 (최근 30일) | 0 좋아요 | 14 댓글

The other day, one of my MathWorks buddies, Peter, asked me if I knew any tricks for vectorizing the notion of the colon (:) operator. For me, this sort of inquiry sometimes has an effect similar to waving a flag in front of a bull! The challenge was on.

I know people have posted methods in the past on the MathWorks newsgroup. I quickly had a few more ideas and implemented them.

Problem Statement
Solution Variants
Test the Solutions
Look at the Algorithms
Use a for Loop
Use sprintf and eval
Use arrayfun - Method 1
Use arrayfun - Method 2
Use Run-Length Decoding Idea
Other Methods?

Problem Statement

Suppose I have matching pairs of integers depicting the starting and stopping indices for selecting points from another vector. I've shown in other posts that vectorizing the indexing can have tremendous performance benefits for your code. What I'd like to do is create one full list of indices so I ultimately only index into the data vector once.

Solution Variants

Here are a few ways to solve the problem.

Build up the array of indices, looping (for) over each pair, one at a time.
Write the indices to a string and have MATLAB evaluate the various : expressions.
Use arrayfun (two algorithms) as the engine for creating the output.
Modify the idea of run-length decoding, inspired by Peter Acklam's run length decoder. MathWorks Peter used this as a basis for the rld-based algorithm.

Test the Solutions

Let's prove that the solutions all get the same answers first before delving into the algorithm details. I'll give a vector of starting indices followed by a vector of ending indices. If an ending value is smaller than a starting value, that pair is effectively skipped. Duplicate starting points are allowed.

starts = [1 7 7 15 17];
ends = [3 8 6 15 19];
expectedAnswer = [1 2 3 7 8 15 17 18 19];

Round up the files to run.

d = dir('coloncat*.m');
d.name

ans =
coloncatarrf1.m
ans =
coloncatarrf2.m
ans =
coloncatfor.m
ans =
coloncatrld.m
ans =
coloncatstr.m

Run each file with the test data.

for k = 1:length(d)
    func = str2func(d(k).name(1:end-2));
    outputs{k} = func(starts, ends);
end

Validate that all the methods calculated the expected results.

rightAnswers = isequal(expectedAnswer,outputs{:})

rightAnswers =
     1

Look at the Algorithms

Let's look at the variety of algorithms and coding techniques I used. If you look at the first chunk of code in each file, you will see they are algorithmically identical - checking for sequences that have length of at least one (1) and deleting the other entries. What we really want to compare are the lines that follow that section.

Use a for Loop

This is perhaps the most straightforward method.

dbtype coloncatfor

1     function x = coloncatfor(start, stop)
2     % COLONCAT Concatenate colon expressions
3     %    X = COLONCAT(START,STOP) returns a vector containing the values
4     %    [START(1):STOP(1) START(2):STOP(2) START(END):STOP(END)].
5     
6     len = stop - start + 1;
7     
8     % keep only sequences whose length is positive
9     pos = len > 0;
10    start = start(pos);
11    stop = stop(pos);
12    len = len(pos);
13    if isempty(len)
14        x = [];
15        return;
16    end
17    
18    % find end and beginning indices for each chunk
19    partialLength = cumsum(len);
20    cumStart = [1 partialLength(1:end-1)+1];
21    
22    % preallocate output
23    % then loop through start/stop pairs, in order, and fill
24    numtot = sum(len);
25    x = zeros(1,numtot);
26    for k = 1:length(start)
27        x(cumStart(k):partialLength(k)) = start(k):stop(k);
28    end

After culling the start and stop indices, I calculate where each successive sequence ends (partialLength) and where they each begin (cumStart). I next preallocate the output array to its final size. And finally, as I loop through the matched pairs of start and stop values, I generate the colon sequence and fill these values into the correct output locations.

Use sprintf and eval

This method uses sprintf which allows me to gather all of the index start and stop values into a string without looping. I use eval evaluate the string, thereby computing the output.

dbtype coloncatstr

1     function x = coloncatstr(start, stop)
2     % COLONCAT Concatenate colon expressions
3     %    X = COLONCAT(START,STOP) returns a vector containing the values
4     %    [START(1):STOP(1) START(2):STOP(2) START(END):STOP(END)].
5     
6     len = stop - start + 1;
7     
8     % keep only sequences whose length is positive
9     pos = len > 0;
10    start = start(pos);
11    stop = stop(pos);
12    len = len(pos);
13    if isempty(len)
14        x = [];
15        return;
16    end
17    
18    % create list of indices for : and evaluate
19    indices = [start; stop];
20    list = sprintf('%d:%d ',indices);
21    x = eval(['[', list, ']']);

Use arrayfun - Method 1

The next two methods use arrayfun to do the work. Here's the first approach.

dbtype coloncatarrf1

1     function x = coloncatarrf1(start, stop)
2     % COLONCAT Concatenate colon expressions
3     %    X = COLONCAT(START,STOP) returns a vector containing the values
4     %    [START(1):STOP(1) START(2):STOP(2) START(END):STOP(END)].
5     
6     len = stop - start + 1;
7     
8     % keep only sequences whose length is positive
9     pos = len > 0;
10    start = start(pos);
11    stop = stop(pos);
12    len = len(pos);
13    if isempty(len)
14        x = [];
15        return;
16    end
17    
18    % put entries into a cell array for later use by arrayfun
19    cstart = num2cell(start);
20    cstop = num2cell(stop);
21    vals = arrayfun(@(x,y){x{1}:y{1}},cstart,cstop);
22    x = [vals{:}];

Place the start and stop values into cell arrays to feed to arrayfun. The entries for the two arrays act in pairs, and the function that I have arrayfun apply places in a cell array the sequence of values between the numbers in the paired cells. Finally, take all the output cells in the cell array and string out the values into a single numeric vector.

Use arrayfun - Method 2

Here's the second approach using arrayfun. It involves a for loop as well.

dbtype coloncatarrf2

1     function x = coloncatarrf2(start, stop)
2     % COLONCAT Concatenate colon expressions
3     %    X = COLONCAT(START,STOP) returns a vector containing the values
4     %    [START(1):STOP(1) START(2):STOP(2) START(END):STOP(END)].
5     
6     len = stop - start + 1;
7     
8     % keep only sequences whose length is positive
9     pos = len > 0;
10    start = start(pos);
11    stop = stop(pos);
12    len = len(pos);
13    if isempty(len)
14        x = [];
15        return;
16    end
17    
18    % Find unique sequence lengths
19    uniqlens = unique(len);
20    
21    % preallocate space for cell outputs of arrayfun
22    chunks = cell(1,length(len));
23    
24    % Now loop over the lengths, being sure to place the 
25    % outputs in the right positions.
26    for ilen = uniqlens(:)'
27        idxlen = find(len==ilen);
28        chunks(idxlen) = arrayfun(@(xx,yy)xx:yy,...
29            start(idxlen),stop(idxlen),'UniformOutput',false);
30    end
31    % reassemble chunks into output
32    x = [chunks{:}];

The concept here is to loop over sequences that have the same length. At most, this for loop is the same length as the number of values in the initial start vector. If the length of the inputs is long, however, it might be advantageous to loop over sequences that generate the same length (commensurate size) output. Placing the output into cells allows me to not worry that I might be intermingling output vectors of different sizes. Once all the output sequences are collected in the cell array, I flatten it into a vector the same way I did with method 1.

Use Run-Length Decoding Idea

Here's an algorithm based on the idea of run-length decoding.

dbtype coloncatrld

1     function x = coloncatrld(start, stop)
2     % COLONCAT Concatenate colon expressions
3     %    X = COLONCAT(START,STOP) returns a vector containing the values
4     %    [START(1):STOP(1) START(2):STOP(2) START(END):STOP(END)].
5     
6     % Based on Peter Acklam's code for run length decoding.
7     len = stop - start + 1;
8     
9     % keep only sequences whose length is positive
10    pos = len > 0;
11    start = start(pos);
12    stop = stop(pos);
13    len = len(pos);
14    if isempty(len)
15        x = [];
16        return;
17    end
18    
19    % expand out the colon expressions
20    endlocs = cumsum(len);  
21    incr = ones(1, endlocs(end));  
22    jumps = start(2:end) - stop(1:end-1);  
23    incr(endlocs(1:end-1)+1) = jumps;
24    incr(1) = start(1);
25    x = cumsum(incr);

The idea is to expand the colon sequences out. I know the lengths of each sequence so I know the starting points in the output array. Fill the values after the start values with 1s. Then I figure out how much to jump from the end of one sequence to the beginning of the next one. If there are repeated start values, the jumps might be negative. Once this array is filled, the output is simply the cumulative sum or cumsum of the sequence.

Other Methods?

These methods I've shown, for vectorizing the notion of the colon operator, are ones that came very quickly. I haven't spent concerted time looking for other algorithms though I am positive some others, including some clever ones, probably exist. Can you help me and provide other possible algorithmic ideas? Please post them here.

Published with MATLAB® 7.7