Loren on the Art of MATLAB

parfor the Course

Posted by Loren Shure,

Starting with release R2007b, there are multiple ways to take advantage of newer hardware in MATLAB. In MATLAB alone, you can benefit from using multithreading, depending on what kind of calculations you do. If you have access to Distributed Computing Toolbox, you have an additional set of possibilities.

Problem Set Up

Let's compute the rank of magic square matrices of various sizes. Each of these rank computations is independent of the others.

n = 400;
ranksSingle = zeros(1,n);

Because I want to compare some speeds, and I have a dual-core laptop, I will restrict MATLAB to a single computational thread for now, using the new function maxNumCompThreads.

maxNumCompThreads(1);
tic
for ind = 1:n
    ranksSingle(ind) = rank(magic(ind));
end
toc
plot(1:n,ranksSingle, 'b-o', 1:n, 1:n, 'm--')
Elapsed time is 22.641646 seconds.

Zooming in, you can see a pattern with the odd order magic squares having full rank.

axis([250 280 0 280])
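
The pattern can also be checked numerically rather than read off the plot. The following is a minimal sketch; the range of orders (3 to 31) is chosen only for illustration and is not from the post.

```matlab
% Check that odd-order magic squares have full rank over a small range.
orders = 3:2:31;                      % odd orders only (illustrative range)
isFullRank = false(size(orders));
for k = 1:numel(orders)
    m = orders(k);
    isFullRank(k) = (rank(magic(m)) == m);
end
fprintf('%d of %d odd orders tested have full rank\n', ...
    nnz(isFullRank), numel(orders));
```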

Since each of the rank calculations is independent from the others, we could have distributed these calculations to lots of processors all at once.

Parallel Version

With Distributed Computing Toolbox, you can use up to 4 local workers to prototype a parallel algorithm. Here's the algorithm for the rank calculations. parMagic uses parfor, a new construct for executing independent passes through a loop. It is part of the MATLAB language, but behaves essentially like a regular for loop if you do not have access to Distributed Computing Toolbox.

dbtype parMagic
1     function ranks = parMagic(n)
2     
3     ranks = zeros(1,n);
4     parfor (ind = 1:n)
5         ranks(ind) = rank(magic(ind));  % last index could be ind,not n-ind+1
6     end

Run Parallel Algorithm and Compare

Let's run the parallel version of the algorithm from parMagic, still using a single process, and compare results with the original for loop version.

tic
   ranksSingle2 = parMagic(n);
toc
isequal(ranksSingle, ranksSingle2)
Elapsed time is 22.733663 seconds.
ans =
     1

Run in Parallel Locally

Now let's take advantage of the two cores in my laptop, by creating a pool of workers on which to do the calculations using the matlabpool command.

matlabpool local 2
tic
   ranksPar = parMagic(n);
toc
To learn more about the capabilities and limitations of matlabpool, distributed
arrays, and associated parallel algorithms, use   doc matlabpool

We are very interested in your feedback regarding these capabilities.
Please send it to parallel_feedback@mathworks.com.

Submitted parallel job to the scheduler, waiting for it to start.
Connected to a matlabpool session with 2 labs.
Elapsed time is 13.836088 seconds.

Comparison

Did we get the same answer?

isequal(ranksSingle, ranksPar)
ans =
     1

In fact, we did! And the wall clock time sped up pretty decently as well, though not a full factor of 2.
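
To put a number on "pretty decently", here is a small sketch that computes the speedup and the parallel efficiency from the two elapsed times reported above. The variable names are mine; the timings are the ones printed by tic/toc in this post.

```matlab
% Speedup and efficiency from the timings reported in this post.
tSerial    = 22.641646;   % seconds, for loop on a single thread
tParallel  = 13.836088;   % seconds, parfor with 2 local workers
numWorkers = 2;

speedup    = tSerial / tParallel;     % how many times faster
efficiency = speedup / numWorkers;    % fraction of the ideal factor of 2

fprintf('speedup %.2fx, parallel efficiency %.0f%%\n', ...
    speedup, 100 * efficiency);
```

With these numbers the speedup comes out at roughly 1.64, or about 82% of the ideal factor of 2.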

Let me close the matlabpool to finish off the example.

matlabpool close
Sending a stop signal to all the labs...
Waiting for parallel job to finish...
Performing parallel job cleanup...
Done.

Local Workers

With Distributed Computing Toolbox, I can use up to 4 local workers. So why did I choose to use just 2? Because on a dual-core machine, using more than that just doesn't make a lot of sense. However, running without a pool, then with a pool of size 1, then perhaps 2, and maybe 4, helps me ensure that my algorithm is ready to run in parallel, perhaps on a larger cluster next. Running on a cluster additionally requires MATLAB Distributed Computing Engine.

parfor and matlabpool

matlabpool started 2 local MATLAB workers in the background. parfor in the current "regular" MATLAB session decided how to divide the parfor range among the 2 matlabpool workers as the workers performed the calculations. To learn a bit more about the constraints of code that works in a parfor loop, I recommend you read the portion of documentation on variable classifications.
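
As a rough illustration of what the variable classifications mean in practice, here is a minimal sketch of a parfor loop that uses a loop variable, a sliced output, a broadcast variable, a temporary, and a reduction. The names scale and total are hypothetical additions for the example; they are not part of parMagic.

```matlab
% Sketch of the main parfor variable classes (names are illustrative).
n     = 12;
ranks = zeros(1, n);   % sliced output: each iteration writes ranks(ind) only
scale = 2;             % broadcast: defined before the loop, read-only inside
total = 0;             % reduction: accumulated with + in any order

parfor ind = 1:n                   % ind is the loop variable
    r = rank(magic(ind));          % temporary: created fresh each iteration
    ranks(ind) = scale * r;
    total = total + r;
end
```

Because each iteration touches only its own slice of ranks and the reduction uses an associative operation, the iterations can run in any order on any worker.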

Do You Have Access to a Cluster?

I wonder if you have access to a cluster. Can you see places in your code that could take advantage of some parallelism if you had access to the right hardware? Let me know here.


Published with MATLAB® 7.5

34 Comments

Loren,
One area on which I am unclear is whether one benefits from multiple cores when one doesn’t have the Distributed Computing Toolbox. I have read the Cleve’s Corner article on the subject of multithreading, but I find it difficult to determine how much benefit is achieved without the toolbox. The other question I have is whether running 64-bit MATLAB actually accelerates any computations significantly, or whether the primary difference is the increased memory space.

Thanks,
Dan

Dan-

You can turn multithreading on in the preferences panel starting in R2007a. If you have a multicore machine, multithreading should affect pointwise calculations (but some will not improve over the JIT performance MATLAB already has) and those using the BLAS. You might try running the demo entitled “Multithreaded Computation” (multithreadedcomputations.m) to get more of a feel for that on your system.

I think 64-bit performance depends on the details of what you are doing, but the main benefit is the increased memory space and size of arrays that can be used.

–Loren
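
For readers who want a quick feel for the multithreading effect described in the reply above, here is a minimal sketch that times a BLAS-backed matrix multiply with one computational thread and then with the previous thread setting. The matrix size is an arbitrary choice for illustration, and the timings will vary by machine.

```matlab
% Compare a matrix multiply with 1 thread vs. the previous thread setting.
A = rand(800);                          % size chosen only for illustration

previous = maxNumCompThreads(1);        % returns the previous setting
tic; B1 = A * A; tSingle = toc;

maxNumCompThreads(previous);            % restore the earlier setting
tic; BN = A * A; tMulti = toc;

% Results may differ in the last bits because the summation order can change.
relDiff = norm(B1 - BN, 'fro') / norm(B1, 'fro');
fprintf('1 thread: %.3f s, %d thread(s): %.3f s, relative difference %.1e\n', ...
    tSingle, previous, tMulti, relDiff);
```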

Many of our computers as well as the computers of our customers are either multi-core or multi-processors with multi-cores.

For our specific situation, if we were able to compile this ability with the MATLAB Compiler, I would purchase the toolbox. Giving only the developers this ability would not benefit us, since ultimately we need to distribute our software developed with MATLAB.

– Stephen

Shouldn’t the start up time of the matlabpool be included?
Ignoring the overhead is almost like cheating?

Should be:

tic;
matlabpool local 2
ranksPar = parMagic(n);
toc

Abel—I think a short-running, one-time-only calculation is not a very interesting application of parallel or cluster computing. Ignoring the pool startup time gives you a better idea of the potential benefit for long-running calculations, or for running many calculations over the course of a single session.
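
The trade-off in this exchange can be made concrete with a little arithmetic. The sketch below uses the two timings from the post; the pool startup time is a HYPOTHETICAL value, since the post does not report one.

```matlab
% How many runs until the pool startup cost is paid back?
tSerial     = 22.641646;   % seconds, from the post
tParallel   = 13.836088;   % seconds, from the post
poolStartup = 10;          % seconds; HYPOTHETICAL, not measured in the post

savedPerRun     = tSerial - tParallel;               % time saved by each run
runsToBreakEven = ceil(poolStartup / savedPerRun);   % whole runs needed

fprintf('Each run saves %.2f s; startup is repaid after %d run(s)\n', ...
    savedPerRun, runsToBreakEven);
```

Under that assumed startup cost, a single short run loses time overall, while repeated or longer runs in one session come out ahead.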

Hi,

I have the Distributed Computing Toolbox up and running on my quad-core machine. For the moment this is all the parallel or distributed computing power I require. I will however be looking to increase my computation and will be getting a dual quad-core machine sometime in the near future. The local scheduler bundled with the toolbox can handle up to 4 local workers. How can I take advantage of the other 4 processors in a dual quad-core system?

Do I need the MATLAB Distributed Computing Engine, and will this then automatically recognise the 8 processors?

Or is this something that will come in an update to the distribution?

My current set-up
MATLAB Version 7.5 (R2007b)
Distributed Computing Toolbox Version 3.2 (R2007b)
Optimization Toolbox Version 3.1.2 (R2007b)
Statistics Toolbox Version 6.1 (R2007b)

System:
Intel core 2 quad Q6600 @ 2.40 GHz
4 GB RAM

many thanks in advance

Andrew

Andrew,

To access the additional 4 cores, you will need the MATLAB Distributed Computing Engine. You will have to set up a configuration file for the extra cores to be used. The documentation for setting things up should be enough to get you going, though you can certainly contact technical support if you have issues. Once you set things up, you won’t need to repeat that task again.

–Loren

Instead of MATLAB Distributed Computing Engine, can one use “MPI” to take advantage of 8 cores on a single machine?

On that note, I found some coverage on MPI as a scheduler on mathworks.com. Wonder what’s the difference between “parallel” and “distributed” jobs as in the following?

“The mpiexec scheduler is intended as a launcher for parallel jobs and supports only parallel jobs. It does not support distributed jobs.”
http://www.mathworks.com/products/distribtb/supported/sched/mpiexec.html

Haoz,

The Distributed Computing Toolbox uses MPI for some of its functionality, such as the parallel math, and it also exposes MPI-like operations: labindex, numlabs, labSend, labReceive, labSendReceive, labBroadcast and labProbe. As Loren explained in the article, the toolbox allows one to use these operations with up to 4 MATLAB workers on a single machine.

The MATLAB Distributed Computing Engine allows one to scale higher than just 4 workers, and allows the use of more than just one machine. It also supports multiple schedulers, including what might better be called “process launchers”, such as the mpiexec command.

Your conjecture is correct: You could use MPI directly to take advantage of 8 cores on a single machine without using either the Toolbox or the Engine, but then you would have to write a lot of parallel C or Fortran code from scratch. If you want to use any of the higher level constructs such as parfor, distributed arrays, or labSend/labReceive, you need to use the Toolbox, optionally with the Engine.

The difference between “distributed” vs “parallel” jobs has also been described as “embarrassingly parallel” vs “parallel”. I.e. the former consists of multiple tasks that have no dependencies between them and can be executed in any order. The latter require inter-process communication.

Narfi

Thanks, Narfi. Following up..
Is it possible to use MPI with the Distributed Computing Toolbox (but not the Engine) to distribute jobs to (a) more than 4 cores on a single machine, or (b) more than 1 machine?

Haoz,

No, neither is possible without the use of the engine. Do you have a desktop machine with more than 4 cores?

Narfi

I’m a newcomer to parallel computing. I develop MATLAB codes with a lot of for loops (> 1000 triple fors with matrix interrelations within loops and functions). I optimised and vectorised the code as well as I could… no more. The code is really complicated and requires (on an old 3.4GHz Intel single core) almost 15 hours to complete, so my productivity is low. Would it be wise to purchase a Core 2 Quad and expect a significant runtime improvement [due to multicore] without totally altering my code? At least some minor changes to introduce parallelisation would be OK, but that’s all. What MATLAB version and toolboxes should I purchase to take advantage of the new hardware? Many thanks in advance for your help!
Than

Than-

Without the details of YOUR code, it is impossible to give good advice. A core 2 quad will help some operations and not others. I mention the products above in the article. In addition, you might read this portion of the documentation:

http://www.mathworks.com/access/helpdesk/help/techdoc/rn/bq08o1n-1.html

and this portion:

http://www.mathworks.com/access/helpdesk/help/techdoc/matlab_prog/brdo29n-1.html

After that, you might contact technical support.

–Loren

I am very interested in parallel matlab. Parfor is very beautiful and handy! I hope you can give us more instructions on the use of the parallel toolbox 3.2.

By the way, could you make the articles on your blog available as PDF files? Then we can download them and print them out for careful study.

Thanks for your parallel matlab!

Jiangming-

Thanks for the comments. There are no plans to make the blog available in other formats.

–Loren

I was wondering how MATLAB’s new parallel capabilities interact with external programming interfaces.
Suppose I am running a loop in matlab and inside the loop I call a mex file containing some C code.
Would I be able to do this if I parallelize the MATLAB loop with matlabpool+parfor on a multicore computer or cluster?

Thanks for your insights.

Juan,

There is no difference between the constraints that parfor puts on MATLAB files and MEX files. The only thing to be aware of is that the MEX file needs to be compiled for the OS type of the workers.

In particular, if the MATLAB client and workers are running the same OS, there is no difference between calling regular MATLAB code and MEX files inside parfor.

Best,

Narfi

I am interested in the issue of separating i) parfor overhead from ii) network costs. Let me first state that my initial experience with the parallel toolbox is quite positive. However it is clear that fine-grained problems are unsuitable for parfor implementation.

I have a data-parallel algorithm for which the maximum theoretical speedup (call this MTS) can easily be calculated for a well-defined test problem. I run this on an Intel Xeon+Infiniband cluster with a Fortran and MPI implementation and the observed speedup is close to the MTS.

Next, I implement this in MATLAB. The algorithm is quite simple and essentially requires one parfor loop. I run it on an Opteron 275 with 4 cores. It turns out that there is essentially no speedup. Next I therefore increase the problem size, i.e. I make the problem more coarse-grained (more work inside the parfor loop). Now the MATLAB implementation can reach the MTS.

Playing around with the code, I find that parfor overhead can be measured in hundredths of a second (somewhere between 0.05 and 0.15 seconds). Does this appear reasonable?

Now, first I am interested in separating the parfor and Opteron/network effects on scalability. Is there anything written on the issue of parfor overhead? Or benchmark code for testing the system performance?

Second, is there an alternative way to do this in Matlab, e.g. using MPI instead of parfor and would it help? Clearly if parfor can be used MPI can as well, but does it help performance?

Best regards,
Ingvar

Ingvar,

Your quoted overhead numbers of between 50ms and 150ms for a parfor loop seem reasonable. Given the code analysis and data transfer that needs to happen we felt this was a reasonable trade-off for loops that we expected to take between seconds and hours to run.

To separate the parfor overhead from the network overhead you can try a few different things. Firstly, run the code with no matlabpool open, which will run the code in the local MATLAB session. Next run the same parfor with a matlabpool of size 1 on the local machine. The difference between these 2 times will be the parfor overhead. If you are on a multi-cored machine, you can try the same parfor on local matlabpools of sizes 2 – 4. This will show the best possible scaling for the particular parfor. Next you should move to using a remote matlabpool with varying sizes, which will give you the overheads for parfor and the network.
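
The measurement steps above can be sketched as follows. This is only an outline of the first step (same loop body as a for loop and as a parfor loop in whatever session is current); the loop body and problem size are placeholders, and the same script would be rerun after opening pools of different sizes to fill in the remaining measurements.

```matlab
% Time the same loop body with for and with parfor, and compare the results.
n = 60;                               % placeholder problem size
serialOut = zeros(1, n);
parOut    = zeros(1, n);

tic
for k = 1:n
    serialOut(k) = rank(magic(k));    % placeholder loop body
end
tFor = toc;

tic
parfor k = 1:n
    parOut(k) = rank(magic(k));       % identical body, run through parfor
end
tParfor = toc;

fprintf('for: %.3f s, parfor: %.3f s, difference: %.3f s\n', ...
    tFor, tParfor, tParfor - tFor);
```

Recording the difference for each pool configuration (no pool, local pool of size 1, local pools of size 2 to 4, then a remote pool) gives one row per configuration to compare.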

I’m afraid there isn’t a document on the overheads of parfor as the numbers vary wildly depending on the hardware and network behaviour.

To answer your question about doing this another way, I need to know if the individual iterates in the parfor loop are all of similar computational complexity. It should be noted that the parfor language construct is designed to deal with loops of varying complexity and is thus a dynamic task-sharing loop. There is another construct, written for i = drange(1,N),…,end, which when run inside a parallel job will partition the iterates 1:N statically between the available labs in the parallel job. There is no communication needed to carry out this loop, and each lab in the parallel job will end up with its particular set of iterates filled in. If used in conjunction with distributed arrays, this construct can be used to carry out this sort of parallel programming.
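
To illustrate what a static partition of the iterates looks like, here is a plain MATLAB sketch that splits 1:N into contiguous blocks, one per lab. The block boundaries used here are an assumption for illustration only; this thread does not describe the exact split that drange produces, and the sketch does not call drange itself.

```matlab
% Illustrative static partition of iterates 1:N into contiguous blocks.
N          = 10;    % number of iterates (illustrative)
numWorkers = 4;     % number of labs (illustrative)

% Block boundaries: lab w gets iterates edges(w)+1 through edges(w+1).
edges  = floor((0:numWorkers) * N / numWorkers);
blocks = cell(1, numWorkers);
for w = 1:numWorkers
    blocks{w} = (edges(w) + 1):edges(w + 1);
    fprintf('lab %d gets iterates %s\n', w, mat2str(blocks{w}));
end
```

A fixed split like this needs no communication while the loop runs, but it balances well only when every iterate costs about the same, which is why the question about computational complexity matters.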

What if I don’t have a proper cluster but I have a few machines on my network that I could use to run simulations overnight? How can I set up workers on those computers?

Yannick,

The MATLAB Distributed Computing Server allows one to use more than just one machine. The server product does not have to run on dedicated machines, but it does simplify the management. It also avoids the loss of computations that occur in the morning when you have to kill the simulations that are running on the desktop machines.

Best,

Narfi

The information details that

- “The job manager is intended primarily for groups working with small- to medium-sized clusters.” Can you define small and medium for me? I want to use my company network overnight with all those sleeping computers, is 128 computers too large?

- “By default, the job manager runs jobs in the order in which they are submitted”. Can this default be changed to identify an “alpha” submitter whose jobs should take priority over anyone else’s? (This would help me a lot in selling this internally to the simulation guy).

- What other options are possible with the scheduler? Times when some workers can be used and not used? Can a computer being used as a worker decide to disconnect itself from the cluster (for example, somebody coming back to his/her computer and not wanting to be running sims while they draft an email!)?

- Is it possible to get a demo version of the Distributed Computing Server to test out all these things?

The job manager fully supports 128 nodes. However, it does not have any of the other features that you mention, namely user-based priority and time-based access and/or idle-detection. You may find these features in the other schedulers that we support:
http://www.mathworks.com/products/distribtb/supported/sched/

If you are interested, the MATLAB Distributed Computing Server page has a “Downloads & Trials” link, so you can indeed test these things in your environment.

Best,

Narfi

Thanks Narfi, but what happens if someone kills the process on a worker, how does the server handle that? Does it realize that and resends the iteration to another worker or just ignores the fact that there is no response and no value is assigned for this iteration or does it hang waiting for a response from the worker?

I presume that by “iteration”, you mean a task in a job. Currently, the job manager does not re-run tasks that fail due to workers being killed or other system errors like that, and instead simply marks those tasks as failed. However, we have been actively working on this, so you can expect improvements in this area fairly soon.

Best,

Narfi

Ok, thanks… more questions! I received an evaluation license for the Parallel Computing Toolbox and the Distributed Computing Server. I’m attempting to test the functions thoroughly and I have the following questions on the parfor loop (the Product Help didn’t provide an answer):

a) I substituted parfor for a for loop in a script. A function is called in this loop. Somehow parfor does not look for the function in the same path as the for loop and was calling another version of that function in another directory. How is parfor looking for a function differently than for?

b) I have a struct that I declare outside of the for loop but one field is assigned within the loop. example
s.f1=1;
s.f2=2;
for i=1:10
s.f3=i
output(i)=myfunction(s);
end

This works fine in a for loop, but the parfor loop seems to erase the complete s struct instead of just the field f3. Is this normal?

c) I can’t seem to pass in a cell array of strings which is not assigned inside the loop, it seems to get erased as well. Example
Mystringarray; %cell array of strings
for n=1:length(Mystringarray)
s.f3=Mystringarray(n);
end

Now when I use parfor, it seems like Mystringarray does not exist after the first iteration of the parfor.

Thanks!!!

The MATLAB path problem you are experiencing is a little bit tricky because it depends on your exact setup, so I recommend you ask our technical support for help with these and other questions that you may have.
The simplest general solution for problems with parfor and the way it manipulates variables is to move the loop body into a separate function. E.g. in case b), the function would accept i and s as input arguments. If this doesn’t work for you, I recommend you contact technical support.

Best,

Narfi
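
The suggestion for case b) can be sketched as below, assuming the loop body needs only i and s. The function names structLoopDemo, loopBody, and myfunctionStandIn are hypothetical; myfunctionStandIn stands in for the commenter's myfunction, whose contents are not shown in the thread.

```matlab
function output = structLoopDemo()
% Sketch: move the parfor body into a separate function (case b above).
s.f1 = 1;
s.f2 = 2;
output = zeros(1, 10);
parfor i = 1:10
    % s is only read inside the loop, so it is sent to the workers as is.
    output(i) = loopBody(i, s);
end
end

function out = loopBody(i, s)
% The field assignment now happens on a local copy of s.
s.f3 = i;
out = myfunctionStandIn(s);
end

function out = myfunctionStandIn(s)
% HYPOTHETICAL stand-in for the commenter's myfunction.
out = s.f1 + s.f2 + s.f3;
end
```

Saved as structLoopDemo.m, this can be called as output = structLoopDemo(). Because the struct is modified only inside loopBody, parfor never has to classify a partially assigned struct.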

I have successfully computed the svd(…, ‘econ’) of a matrix on a quad core machine by using the interactive pmode. The instructions to do this were:
>> A = rand(1000);
>> pmode start local 4;
>> pmode client2lab A 1:4;

And in the parallel command window:
P>> A_dist = distribute(A);
P>> [u, s, v] = svd(A_dist, 'econ');
P>> u_comb = gather(u);

Back in the matlab command window
>> pmode lab2client u_comb;

I obtained a significant speedup (6.8 secs compared to 17 secs). However, I would like to do this non-interactively from an M-file.

Ps: Doing more or less the same with the matlabpool command clearly indicates that no speedup is obtained. So I concluded that matlabpool only works for the parfor loop, is this correct?

Can anybody guide me in the right direction?

Thank you,
Paul

However, I now want to

In reply to my own question: I’ve managed to send commands to the (interactive) pmode, thus making it non-interactive. Example code below:

function par_svd(input_variable)
% Executes a distributed calculation of the svd

NR_OF_LABS = 4;

% opening pmode on 4 labs
pmode('start', 'local', NR_OF_LABS);

% Distributing the array A over the labs
% Copy A to the labs
pmode('client2lab', input_variable, strcat('1:', int2str(NR_OF_LABS)));
% Distribute on the labs
iRunCmdOnLabs(strcat('A_dist = distribute(', input_variable, ');'));

% Calculate the svd in a distributed manner
iRunCmdOnLabs('[u_dist, s_dist, v_dist] = svd(A_dist , ''econ'');');

% Gather the data back into every lab
iRunCmdOnLabs('u = gather(u_dist); s = gather(s_dist); v = gather(v_dist);');

% Clear intermediary results
iRunCmdOnLabs('clear u_dist; clear s_dist; clear v_dist; clear A_dist');

% Send results back to the client
pmode lab2client u 1;
pmode lab2client s 1;
pmode lab2client v 1;

% Clean up the parallel labs
pmode cleanup;

end

function labs = iGetLabs()
%iGetLabs Can be called on the client to get the labs object.
if ~iIsOnClient()
    error('Cannot execute par_svd from labs');
    return;
end
try
    session = com.mathworks.toolbox.distcomp.pmode.SessionFactory.getCurrentSession;
    labs = session.getLabs();
catch
    error('distcomp:pmode:NotRunning, cannot execute par_svd');
end
end

function iRunCmdOnLabs(cmd)
%iRunCmdOnLabs Send a command to the labs
session = com.mathworks.toolbox.distcomp.pmode.SessionFactory.getCurrentSession;
if isempty(session)
    error('distcomp:pmode:NotRunning', ...
        'Cannot execute par_svd when pmode is not running.');
end
% Error messages will only be displayed in the main MATLAB command window, and
% the command will only be executed in the MATLAB client when it is idle.
fprintf('Sending command %s to the MATLAB labs for evaluation.', cmd);
labs = iGetLabs();
labs.evalConsoleOutput(cmd);
end

function onclient = iIsOnClient()
onclient = ~system_dependent('isdmlworker');
end

If anybody knows a more elegant solution, please let me know!

Ps: Use the code at your own risk. One method (labs.evalConsoleOutput(cmd);) on the labs is undocumented, but seems to work. Have fun, and save time!

Best regards,
Paul

Paul:

Using undocumented API is a surefire way to get into massive trouble.

A much easier way is to use the documented “Parallel Job” API. See documentation: http://www.mathworks.com/access/helpdesk/help/toolbox/distcomp/bqur73g.html

Here is a quick recipe for going from interactive to non-interactive way:
1. Once you have tested your algorithm in “pmode”, create a MATLAB function file using the commands/function calls that worked.

1a. An easy way to do this is to enable command history in “pmode” window, select the commands that worked (CTRL+Click for multiple commands on Windows), right click and choose “Create M-File”.

2. Make it a MATLAB function, i.e., add “function” etc. at the top of the file .

3. Create a parallel job (createParallelJob), add a task to it (createTask) and supply the MATLAB function you created as an input to the task creation function.

4. The createParallelJob function returns a job object which you can submit and later retrieve results, check status etc.

4a. Add FileDependencies to this job object if your function uses other functions created by you.

I noticed you are using the ‘local’ configuration. So here is what you can do:

jm = findResource('scheduler', 'configuration', 'local'); 

and then follow the regular steps for a parallel job. For example:

pjob = jm.createParallelJob();
t    = pjob.createTask(@myfunction, numOutputs, {input1, input2});
pjob.submit;
%optionally: pjob.waitForState('Finished'); o = pjob.getAllOutputArguments; celldisp(o); 

The last few lines in my previous post should actually read:

pjob = jm.createParallelJob();
t    = pjob.createTask(@myfunction, numOutputs, {input1, input2});
pjob.submit;

%% optionally: pjob.waitForState('Finished'); 
o = pjob.getAllOutputArguments; 
celldisp(o); 

Dear Gaurav,

Thank you for your swift reply. And I surely don’t like the trouble ahead… certainly when it comes to my research :)

However, I’ve followed your instructions, and benchmarked an svd(A, ‘econ’), with A = rand(1000);

The single core (directly in command window) takes 29 secs. The “non-interactive pmode” approach (my example) takes 13 secs. The “Parallel jobs” approach (your example) takes 32 secs (probably around 3 secs startup time).

Maybe I haven’t distributed my data in the correct manner, so I’ve included my test_par_svd function:

function u = test_par_svd (A)
A_dist = distribute(A);
[u_dist, s_dist, v_dist] = svd(A_dist, ‘econ’);
u = gather(u_dist);
s = gather(s_dist);
v = gather(v_dist);
clear u_dist; clear s_dist; clear v_dist;
clear A_dist;
end

The results “o” also contains 4 times the “u” result from the function above, which kind of indicates that the labs cannot distribute the task among themselves, and every lab calculates the svd of the whole matrix in parallel, but not via a distributed parallel LAPACK SVD. Since my example uses less time, it surely is possible on my Intel MKL lib (version 9).

Please also note that I find it difficult to translate the easy example on parallel jobs (matlab docs) towards the use of an svd function, for which I can’t control the calculation (function of MKL lib).

I’m now struggling with getting the “svds” function to work. MKL doesn’t support this, so I’ve reworked my calculation via the “eigs” function (of the A*A’ matrix), which gives the same results. But apparently this eigs function isn’t supported in MKL 9 (it is in MKL 10). Does anybody know how to change the libs to MKL 10? (I’ve tried a couple of ways described in the MATLAB docs, but I guess I really need the specific changes to blas.spec (or env. vars.) to get this to work.)

Best regards, and thanks a lot already for your time and effort,
Paul

Hi there, I’ve recently invested in the parallel computing toolbox and a quadcore cpu in the hope of being able to quickly speed up the execution of ‘embarrassingly parallel’ code that repeatedly calls a hand written mex function with different input each time. Response to reply 17 indicates this should work. I am trying to use parfor. Everything works fine using a for loop in local mode, but I get segmentation violation errors when using parfor (I get no pre-run time parfor related errors issued in the editor). Two questions:

(i) how do I debug code executing within the parfor loop, preferably both in Matlab and in C (I use ms visual studio 2005)? There does not seem to be a discussion of this in the relevant literature, but maybe I missed it.

(ii) Is it possible that the problem stems from the use of low level gateway routines? e.g. mexGetVariablePtr() to read Matlab memory locations as I’m not certain how the mex code knows to look for worker copies as opposed to the originally stored copy (and I’m not certain this should make a difference without debugging). I avoid using the mexPut… routines but have too many inputs (>50) for the fetching of data, and prefer to pass pointers to large constant arrays rather than copying data to save time and memory.

Cheers,
Steve

Steve W:

It should be possible to debug local workers by simply locating the MATLAB worker process in Task Manager, right-clicking on it, and selecting the “Debug” menu item. This will bring up the Visual Studio “just in time” debugger dialog. If you build your mex file with “-g”, you should then be able to set breakpoints within your mex function source code. (I just verified that I could do this successfully, so let me know if you have problems here.)

Having said that, I would not expect mexGetVariablePtr() to work as expected, as that violates workspace transparency – in other words, it attempts to read stuff from another workspace. In the same way, “evalin(‘caller’, …)” is disallowed.

When executing the body of a parfor loop, we only transmit from the client to the workers those variables that we can see (by analysing the code) are read from within the loop. This allows us to avoid sending the whole contents of the workspace. So, to work effectively with parfor, the variables used inside the body of the loop must be passed as arguments to your mex file.

One possible approach to avoiding sending large constant data repeatedly is to have the workers cache the data inside a persistent function workspace.

Cheers,

Edric.
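
The caching idea in the reply above can be sketched with a persistent variable. This is a minimal sketch, not a recommended implementation: the function name cachedConstants is hypothetical, and magic(4) stands in for whatever large constant data would really be loaded or computed once.

```matlab
function [data, loadedNow] = cachedConstants()
% Sketch: keep large constant data in a persistent variable so that each
% MATLAB process builds it once and reuses it on later calls.
persistent cache
loadedNow = isempty(cache);     % true only on the first call in this process
if loadedNow
    cache = magic(4);           % HYPOTHETICAL stand-in for the real data
end
data = cache;
end
```

Called from inside a parfor body, each worker would fill its own copy of the cache the first time it runs an iteration and reuse it afterwards, so the data is not sent from the client for every iteration. Running clear cachedConstants resets the cache in the process where it is run.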

These postings are the author's and don't necessarily represent the opinions of MathWorks.