Vectorizing the Notion of Colon (:)
The other day, one of my MathWorks buddies, Peter, asked me if I knew any tricks for vectorizing the notion of the colon (:) operator. For me, this sort of inquiry sometimes has an effect similar to waving a flag in front of a bull! The challenge was on.
I know people have posted methods in the past on the MathWorks newsgroup. I quickly had a few more ideas and implemented them.
Contents
Problem Statement
Suppose I have matching pairs of integers depicting the starting and stopping indices for selecting points from another vector. I've shown in other posts that vectorizing the indexing can have tremendous performance benefits for your code. What I'd like to do is create one full list of indices so I ultimately only index into the data vector once.
Solution Variants
Here are a few ways to solve the problem.
- Build up the array of indices, looping (for) over each pair, one at a time.
- Write the indices to a string and have MATLAB evaluate the various : expressions.
- Use arrayfun (two algorithms) as the engine for creating the output.
- Modify the idea of run-length decoding, inspired by Peter Acklam's run length decoder. MathWorks Peter used this as a basis for the rld-based algorithm.
Test the Solutions
Let's prove that the solutions all get the same answers first before delving into the algorithm details. I'll give a vector of starting indices followed by a vector of ending indices. If an ending value is smaller than a starting value, that pair is effectively skipped. Duplicate starting points are allowed.
starts = [1 7 7 15 17]; ends = [3 8 6 15 19]; expectedAnswer = [1 2 3 7 8 15 17 18 19];
Round up the files to run.
d = dir('coloncat*.m');
d.name
ans = coloncatarrf1.m ans = coloncatarrf2.m ans = coloncatfor.m ans = coloncatrld.m ans = coloncatstr.m
Run each file with the test data.
for k = 1:length(d) func = str2func(d(k).name(1:end-2)); outputs{k} = func(starts, ends); end
Validate that all the methods calculated the expected results.
rightAnswers = isequal(expectedAnswer,outputs{:})
rightAnswers = 1
Look at the Algorithms
Let's look at the variety of algorithms and coding techniques I used. If you look at the first chunk of code in each file, you will see they are algorithmically identical - checking for sequences that have length of at least one (1) and deleting the other entries. What we really want to compare are the lines that follow that section.
Use a for Loop
This is perhaps the most straightforward method.
dbtype coloncatfor
1 function x = coloncatfor(start, stop) 2 % COLONCAT Concatenate colon expressions 3 % X = COLONCAT(START,STOP) returns a vector containing the values 4 % [START(1):STOP(1) START(2):STOP(2) START(END):STOP(END)]. 5 6 len = stop - start + 1; 7 8 % keep only sequences whose length is positive 9 pos = len > 0; 10 start = start(pos); 11 stop = stop(pos); 12 len = len(pos); 13 if isempty(len) 14 x = []; 15 return; 16 end 17 18 % find end and beginning indices for each chunk 19 partialLength = cumsum(len); 20 cumStart = [1 partialLength(1:end-1)+1]; 21 22 % preallocate output 23 % then loop through start/stop pairs, in order, and fill 24 numtot = sum(len); 25 x = zeros(1,numtot); 26 for k = 1:length(start) 27 x(cumStart(k):partialLength(k)) = start(k):stop(k); 28 end
After culling the start and stop indices, I calculate where each successive sequence ends (partialLength) and where they each begin (cumStart). I next preallocate the output array to its final size. And finally, as I loop through the matched pairs of start and stop values, I generate the colon sequence and fill these values into the correct output locations.
Use sprintf and eval
This method uses sprintf which allows me to gather all of the index start and stop values into a string without looping. I use eval evaluate the string, thereby computing the output.
dbtype coloncatstr
1 function x = coloncatstr(start, stop) 2 % COLONCAT Concatenate colon expressions 3 % X = COLONCAT(START,STOP) returns a vector containing the values 4 % [START(1):STOP(1) START(2):STOP(2) START(END):STOP(END)]. 5 6 len = stop - start + 1; 7 8 % keep only sequences whose length is positive 9 pos = len > 0; 10 start = start(pos); 11 stop = stop(pos); 12 len = len(pos); 13 if isempty(len) 14 x = []; 15 return; 16 end 17 18 % create list of indices for : and evaluate 19 indices = [start; stop]; 20 list = sprintf('%d:%d ',indices); 21 x = eval(['[', list, ']']);
Use arrayfun - Method 1
The next two methods use arrayfun to do the work. Here's the first approach.
dbtype coloncatarrf1
1 function x = coloncatarrf1(start, stop) 2 % COLONCAT Concatenate colon expressions 3 % X = COLONCAT(START,STOP) returns a vector containing the values 4 % [START(1):STOP(1) START(2):STOP(2) START(END):STOP(END)]. 5 6 len = stop - start + 1; 7 8 % keep only sequences whose length is positive 9 pos = len > 0; 10 start = start(pos); 11 stop = stop(pos); 12 len = len(pos); 13 if isempty(len) 14 x = []; 15 return; 16 end 17 18 % put entries into a cell array for later use by arrayfun 19 cstart = num2cell(start); 20 cstop = num2cell(stop); 21 vals = arrayfun(@(x,y){x{1}:y{1}},cstart,cstop); 22 x = [vals{:}];
Place the start and stop values into cell arrays to feed to arrayfun. The entries for the two arrays act in pairs, and the function that I have arrayfun apply places in a cell array the sequence of values between the numbers in the paired cells. Finally, take all the output cells in the cell array and string out the values into a single numeric vector.
Use arrayfun - Method 2
Here's the second approach using arrayfun. It involves a for loop as well.
dbtype coloncatarrf2
1 function x = coloncatarrf2(start, stop) 2 % COLONCAT Concatenate colon expressions 3 % X = COLONCAT(START,STOP) returns a vector containing the values 4 % [START(1):STOP(1) START(2):STOP(2) START(END):STOP(END)]. 5 6 len = stop - start + 1; 7 8 % keep only sequences whose length is positive 9 pos = len > 0; 10 start = start(pos); 11 stop = stop(pos); 12 len = len(pos); 13 if isempty(len) 14 x = []; 15 return; 16 end 17 18 % Find unique sequence lengths 19 uniqlens = unique(len); 20 21 % preallocate space for cell outputs of arrayfun 22 chunks = cell(1,length(len)); 23 24 % Now loop over the lengths, being sure to place the 25 % outputs in the right positions. 26 for ilen = uniqlens(:)' 27 idxlen = find(len==ilen); 28 chunks(idxlen) = arrayfun(@(xx,yy)xx:yy,... 29 start(idxlen),stop(idxlen),'UniformOutput',false); 30 end 31 % reassemble chunks into output 32 x = [chunks{:}];
The concept here is to loop over sequences that have the same length. At most, this for loop is the same length as the number of values in the initial start vector. If the length of the inputs is long, however, it might be advantageous to loop over sequences that generate the same length (commensurate size) output. Placing the output into cells allows me to not worry that I might be intermingling output vectors of different sizes. Once all the output sequences are collected in the cell array, I flatten it into a vector the same way I did with method 1.
Use Run-Length Decoding Idea
Here's an algorithm based on the idea of run-length decoding.
dbtype coloncatrld
1 function x = coloncatrld(start, stop) 2 % COLONCAT Concatenate colon expressions 3 % X = COLONCAT(START,STOP) returns a vector containing the values 4 % [START(1):STOP(1) START(2):STOP(2) START(END):STOP(END)]. 5 6 % Based on Peter Acklam's code for run length decoding. 7 len = stop - start + 1; 8 9 % keep only sequences whose length is positive 10 pos = len > 0; 11 start = start(pos); 12 stop = stop(pos); 13 len = len(pos); 14 if isempty(len) 15 x = []; 16 return; 17 end 18 19 % expand out the colon expressions 20 endlocs = cumsum(len); 21 incr = ones(1, endlocs(end)); 22 jumps = start(2:end) - stop(1:end-1); 23 incr(endlocs(1:end-1)+1) = jumps; 24 incr(1) = start(1); 25 x = cumsum(incr);
The idea is to expand the colon sequences out. I know the lengths of each sequence so I know the starting points in the output array. Fill the values after the start values with 1s. Then I figure out how much to jump from the end of one sequence to the beginning of the next one. If there are repeated start values, the jumps might be negative. Once this array is filled, the output is simply the cumulative sum or cumsum of the sequence.
Other Methods?
These methods I've shown, for vectorizing the notion of the colon operator, are ones that came very quickly. I haven't spent concerted time looking for other algorithms though I am positive some others, including some clever ones, probably exist. Can you help me and provide other possible algorithmic ideas? Please post them here.