Preparing to Run Code in Parallel
In a recent post, I talked about for-loops in MATLAB and how to optimize their use knowing how MATLAB stores arrays in memory. Today I want to talk about getting ready for parallel computation, specifically using parallel for-loops, via parfor.
En route to creating code suitable for running in parallel, sometimes we take code with a for-loop and simply replace it with a parallel loop, using parfor. That is, if we can't vectorize the code well first. This transformation from for->parfor works really well sometimes, but does not always work, and for very good reasons.
For example, you can't simply replace for with parfor if loop iterations are not completely independent. You can find out more conditions here. There is a notable exception to this rule, reduction variables, (a reduction variable accumulates a value that depends on all the iterations together, but is independent of the iteration order). The documentation for this is great. Here's a quick pseudo-code that would parallelize just fine, even though we reuse the variable s on both the right and left-hand sides. As long as the result is not numerically sensitive to the order in which the accumulation occurs, you will get the "right" answer here whether you use a for or a parfor loop.
s = 0;
parfor ind = 1:100
s = s + fun(data(ind));
end
Testing for Parallelizability
I doubt parallelizability is word, but I think you know what I mean!
One simple way to see if your loop might be suitable for parallelization is to try running the loop in another order to see if you get the same result. If so, it is a candidate for parfor. The easiest version of this might be running the loop backwards. If you get the same results as the standard sweep, you should be good to go. Instead of backwards, you can also try replacing the loop with the counter in random order; this can be achieved easily using randperm.
Note: clearly this calculation can be vectorized easily. That's not my focus here.
k = randn(100,1);
meank = mean(k);
tol = sqrt(eps);
qforward = zeros(size(k));
for ind = 1:100
qforward(ind) = k(ind) - meank;
end
meank = mean(k);
qbackward = zeros(size(k));
for ind = 100:-1:1
qbackward(ind) = k(ind) - meank;
end
meank = mean(k);
qrandom = zeros(size(k));
for ind = randperm(100)
qrandom(ind) = k(ind) - meank;
end
aretheyequal = isequal(qforward, qbackward, qrandom)
In case the answers aren't exactly equal, you could compare them in a way similar to this.
nearlyfb = norm(qforward-qbackward) < tol
nearlyfr = norm(qforward-qrandom) < tol
nearlybr = norm(qbackward-qrandom) < tol
So this calculation is ready to try with parfor.
qpar = zeros(size(k));
for ind = 1:100
qpar(ind) = k(ind) - meank;
end
aretheyequal = isequal(qpar, qforward)
What if I Have an Expression with Iteration Dependency
You may not always be able to parallelize your code. However, there are some conditions under which you can convert your code to take advantage of parallelization by refactoring your calculations a little Let me show you an example. In this case, I will do some of the accumulation after the loop is complete.
%% The Code Analyzer gives this first chunk of code two red error messages.
x = [17 zeros(1,99)];
k = [0 randn(1,99)];
parfor ind = 2:100
x(ind) = x(ind-1) + k(ind)^2;
end
%% the error messages from above code:
The PARFOR loop cannot run due to the way variable 'x' is used.
In a PARFOR loop, variable 'x' is indexed in different ways, potentially
causing dependencies between iterations.
Here's the for loop equivalent, no warning.
x = [17 zeros(1,99)];
k = [0 randn(1,99)];
for ind = 2:100
x(ind) = x(ind-1) + k(ind)^2;
end
xx = x;
And here's how you might parallelize it
x = [17 zeros(1,99)];
tmp = zeros(1,100);
parfor ind = 2:100
tmp(ind) = k(ind)^2;
end
x = cumsum(x+tmp);
aretheyequal = isequal(x,xx)
Thoughts?
Do you often work with large datasets or simulations where getting access to hardware suitable for running code in parallel helps your work scale? Did your code fall into any of the simple categories I mentioned in this post? I would love to hear from you about any issues you have wrangling your code into a form suitable for parallelization. Post your thoughts here.