Recently someone asked me to explain the difference in speed between doing a calculation with array indexing inside a loop and extracting the subarray first.
Suppose I have a function of two inputs, the first input being a column (of a square array), the second, a scalar, and the output, a scalar.
myfun = @(x,z) x'*x+z;
And even though this may be calculated in a fully vectorized manner, let's explore what happens when we work on subarrays from the array input.
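As a quick aside on that vectorized route, here is a sketch of my own (not code from the post): since x'*x is a scalar when x is a column, adding a whole vector of z values broadcasts, so the inner loop over z can be collapsed into one call.

```matlab
% Vectorized form for one column: x(:,k)'*x(:,k) is a scalar, so adding
% the column (1:n)' of z values produces the whole result vector at once.
myfun = @(x,z) x'*x + z;
n = 500;
x = randn(n,n);
k = 1;
resultVec = myfun(x(:,k), (1:n)');  % n-by-1, same values as looping over z
```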
I am now creating the input array x and the output arrays for the results of doing the calculation two ways, one of which includes an additional intermediate step.
n = 500;
x = randn(n,n);
result1 = zeros(n,1);
result2 = zeros(n,1);
Here we see and time the first method. In this one, we create a temporary array for x(:,k) on every pass through the inner loop, n times for each of the n columns.
tic
for k = 1:n
    for z = 1:n
        result1(z) = myfun(x(:,k), z);
    end
    result1 = result1 + x(:,k);
end
runtime(1) = toc;
In this method, we extract the column of interest first in the outer loop, and reuse that temporary array each time through the inner loop. Again we see and time the results.
tic
for k = 1:n
    xt = x(:,k);
    for z = 1:n
        result2(z) = myfun(xt, z);
    end
    result2 = result2 + x(:,k);
end
runtime(2) = toc;
First, let's make sure we get the same answer both ways. You can see that we do.
theSame = isequal(result1,result2)
theSame = 1
Next, let's compare the times. I want to remind you that doing timing from a script generally has more overhead than when the same code is run inside a function. We just want to see the relative behavior so we should get some insight from this exercise.
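For more reliable measurements than tic/toc in a script, the two methods can be wrapped in functions and timed with timeit (shipped with MATLAB from R2013b; before that, a timeit was available on the File Exchange). The following is a sketch of mine under those assumptions, not code from the post:

```matlab
function runtimes = compareMethods(n)
% Compare repeated indexing vs. hoisting the column extraction,
% timed inside functions to reduce script overhead. (Sketch only.)
x = randn(n,n);
runtimes(1) = timeit(@() method1(x, n));
runtimes(2) = timeit(@() method2(x, n));
end

function result = method1(x, n)
myfun = @(x,z) x'*x + z;
result = zeros(n,1);
for k = 1:n
    for z = 1:n
        result(z) = myfun(x(:,k), z);   % extracts x(:,k) on every inner pass
    end
    result = result + x(:,k);
end
end

function result = method2(x, n)
myfun = @(x,z) x'*x + z;
result = zeros(n,1);
for k = 1:n
    xt = x(:,k);                        % hoisted: extracted once per column
    for z = 1:n
        result(z) = myfun(xt, z);
    end
    result = result + x(:,k);
end
end
```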
disp(['Run times are: ',num2str(runtime)])
Run times are: 2.3936 1.9558
Here's what's going on. In the first method, we extract the column into a temporary array on every pass through the inner loop, even though that array is constant for a fixed column, so the same extraction is repeated n times per column. In the second method, we extract the relevant column once per outer iteration, and reuse that temporary n times through the inner loop.
Be thoughtful if you do play around with this. Depending on the details of your function, if the calculations you do each time are large compared to the time to extract a column vector, you may not see much difference between the two methods. However, if the calculations are sufficiently short in duration, then the repeated creation of the temporary variable can add a tremendous amount of overhead to the calculation. In general, you should be no worse off extracting the temporary array the fewest number of times possible.
Have you noticed similar timing "puzzles" when analyzing one of your algorithms? I'd love to hear more here.
Published with MATLAB® R2013a
8 Comments
This technique is often called “loop-invariant hoisting” or “loop-invariant code-motion” (http://en.wikipedia.org/wiki/Loop-invariant_code_motion)
Some modern compilers are able to automatically identify and optimize such cases, but apparently Matlab’s JIT/accelerator is not as smart. Maybe one day…
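For readers who haven't met the term, here is a minimal hand-hoisted illustration (my own sketch, not from the post): the call to norm(v) is loop-invariant, so it can be moved out of the loop.

```matlab
v = randn(1000,1);

% Before hoisting: norm(v) is recomputed on every iteration,
% even though v never changes inside the loop.
s = zeros(1000,1);
for i = 1:1000
    s(i) = i / norm(v);
end

% After hoisting: compute the invariant once, reuse it in the loop.
nv = norm(v);
s2 = zeros(1000,1);
for i = 1:1000
    s2(i) = i / nv;
end
```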
Just learned something, thanks!
There was recently quite a bit of discussion regarding some puzzling observations with indexing on Stackoverflow: (http://stackoverflow.com/questions/13382155/is-indexing-vectors-in-matlab-inefficient).
Thanks for mentioning the other post. I believe this situation is completely different (but I could be wrong). Here it’s a question of how many times you create the submatrix, 1 or n.
I often think of array indexing as effectively being a function call. In general you want to minimize the number of function calls. Just don't overdo it and let Matlab's JIT do its thing for the most part.
This paves the way for a question I have been wanting to ask you:
Because the JIT has gotten so good at accelerating loops, do you think that there should be less of an emphasis on vectorization of code? Recently I have been writing some low-level functions that need to be very fast, and repeatedly I have found that a looped implementation (assuming you can get all "if statements" out of the loop) is the fastest solution.
Also, can you comment on the speed of subarray code which uses contiguous sub-indexes? I have found that vec(:)=vec(:).*3 is much faster than vec(1:10)=vec(1:10).*3 even if vec has 100+ elements. This comes up when you preallocate array space without knowing exactly how much you will use. And while vec(:)=vec(:).*3 works, it just seems wasteful to me when I only need to do it to the first x elements.
We have made a concerted effort to improve the loop performance in MATLAB over the years. So I don’t think it’s necessary to always avoid them.
I do think you should be careful about programming to the performance details of any given MATLAB release. We are always looking for opportunities to improve performance, and you may well find that something that used to be the fastest way, while not degraded in performance, becomes less compelling if we manage to speed up a different strategy. I would always emphasize writing code that is readable, maintainable, and does what you mean for it to do.
Nick – perhaps this is due to vec(:) being processed entirely in-place, whereas sub-indexing requires allocating a new memory block, and later deallocating it. It is also possible that JIT is able to vectorize the first but not the second. Both of these are educated guesses, and do not rely on anything official – I may well be wrong here.
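Nick's observation is easy to probe with a rough micro-benchmark; whether the gap appears at all will depend on the MATLAB release and machine, so treat this as a sketch only (it multiplies by a factor near 1 rather than 3 so the values do not overflow across iterations):

```matlab
N = 1e4;

% Update the whole vector in place.
vec = randn(1000,1);
tic
for i = 1:N
    vec(:) = vec(:)*1.000001;
end
t_full = toc;

% Update only the first 10 elements via sub-indexing.
vec = randn(1000,1);
tic
for i = 1:N
    vec(1:10) = vec(1:10)*1.000001;
end
t_prefix = toc;

disp([t_full t_prefix])
```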