Recently someone asked me to explain the speed difference between doing a calculation in a loop with array indexing and extracting the subarray first.
Suppose I have a function of two inputs: the first is a column of a square array, the second is a scalar, and the output is a scalar (x'*x is a scalar for a column vector x).
myfun = @(x,z) x'*x+z;
Even though this could be computed in a fully vectorized manner, let's explore what happens when we work on subarrays of the input array.
Next, I create the input array x and the output arrays for computing the results two ways. One of the methods includes an additional intermediate step.
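For comparison, here is a minimal sketch of the vectorized alternative for a single column. Because x'*x is a scalar, passing every value of z at once as a column vector lets MATLAB's scalar expansion replace the inner loop. The choice of column k = 1 is arbitrary and only serves as a check.

```matlab
% Vectorizing the inner loop over z for one fixed column
myfun = @(x,z) x'*x+z;
n = 500;
x = randn(n,n);
k = 1;                          % any fixed column works for this check

% Loop form: one scalar result per value of z
rLoop = zeros(n,1);
for z = 1:n
    rLoop(z) = myfun(x(:,k), z);
end

% Vectorized form: scalar x'*x plus the column vector (1:n)'
rVec = myfun(x(:,k), (1:n)');

maxDiff = max(abs(rLoop - rVec))
```

Only the inner loop is removed here. The outer loop over columns and the result = result + x(:,k) step from the methods below are left out.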
n = 500; x = randn(n,n); result1 = zeros(n,1); result2 = zeros(n,1);
Here we time the first method. In this one, we create a temporary array for x(:,k) on every pass through the inner loop, so n times for each column and n^2 times in all.
tic
for k = 1:n
    for z = 1:n
        % x(:,k) creates a new temporary column on every inner-loop pass
        result1(z) = myfun(x(:,k), z);
    end
    result1 = result1+x(:,k);
end
runtime(1) = toc;
In this method, we extract the column of interest once in the outer loop and reuse that temporary array each time through the inner loop. Again, we time the results.
tic
for k = 1:n
    % Extract the column once per outer-loop pass and reuse it
    xt = x(:,k);
    for z = 1:n
        result2(z) = myfun(xt, z);
    end
    result2 = result2+x(:,k);
end
runtime(2) = toc;
First, let's make sure we get the same answer both ways. You can see that we do.
theSame = isequal(result1,result2)
theSame = 1
Next, let's compare the times. Keep in mind that timing code from a script generally incurs more overhead than running the same code inside a function. We only care about the relative behavior here, so this exercise should still give us some insight.
disp(['Run times are: ',num2str(runtime)])
Run times are: 2.3936 1.9558
Here's what's going on. In the first method, we create a temporary variable on every pass through the inner loop (n times per column), even though that array is constant for a fixed column. In the second method, we extract the relevant column once per pass through the outer loop and reuse it n times in the inner loop.
If you experiment with this, be thoughtful. Depending on the details of your function, if the calculation done on each iteration is large compared with the time to extract a column vector, you may not see much difference between the two methods. However, if the calculation is sufficiently short, repeatedly creating the temporary variable can add significant overhead. In general, you should be no worse off creating the temporary array as few times as possible.
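To make the difference concrete, this sketch counts how many column extractions each method performs. It reuses the loop structure of the two methods but does no arithmetic, and it assumes that each x(:,k) expression corresponds to one temporary copy.

```matlab
% Count column extractions in each method without doing any work
n = 500;
count1 = 0;          % method 1: x(:,k) indexed inside the inner loop
count2 = 0;          % method 2: x(:,k) copied to xt once per outer pass
for k = 1:n
    for z = 1:n
        count1 = count1 + 1;    % corresponds to myfun(x(:,k), z)
    end
    count1 = count1 + 1;        % corresponds to result1+x(:,k)
    count2 = count2 + 1;        % corresponds to xt = x(:,k)
    count2 = count2 + 1;        % corresponds to result2+x(:,k)
end
disp([count1 count2])
```

For n = 500, count1 is n^2 + n = 250500 and count2 is 2n = 1000.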
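One way to explore the point about relative cost is to time both loop styles for a cheap function and a heavier one. This is a sketch under assumptions. The heavier function, sum(sort(exp(x))) + z, is a hypothetical stand-in chosen only to make each call more expensive. n is reduced to 300 to keep the run short, and the result = result + x(:,k) step is omitted. Timings vary by machine and MATLAB release, so no particular numbers are implied.

```matlab
% Compare method 1 / method 2 time ratios for a cheap and a heavier function
n = 300;
x = randn(n,n);
funs = {@(x,z) x'*x + z, ...                 % cheap: one dot product
        @(x,z) sum(sort(exp(x))) + z};       % heavier (hypothetical): exp, sort, sum
ratio = zeros(1,numel(funs));
for f = 1:numel(funs)
    myfun = funs{f};
    r1 = zeros(n,1);
    r2 = zeros(n,1);

    tic
    for k = 1:n
        for z = 1:n
            r1(z) = myfun(x(:,k), z);        % index inside the inner loop
        end
    end
    t1 = toc;

    tic
    for k = 1:n
        xt = x(:,k);                         % extract once per column
        for z = 1:n
            r2(z) = myfun(xt, z);
        end
    end
    t2 = toc;

    ratio(f) = t1/t2;
end
disp(['Method 1 / method 2 time ratio (cheap, heavy): ', num2str(ratio)])
```

If the reasoning above holds, the ratio for the heavier function should be closer to 1 than the ratio for the cheap one, because extracting the column becomes a smaller share of each iteration.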
Have you noticed similar timing "puzzles" when analyzing one of your algorithms? I'd love to hear more here.