Practical Advice for People on the Leading Edge

MATLAB’s High Performance Computing (HPC) and ‘Big Data’ datatypes

I'm currently preparing for SC22, my first supercomputing conference as a MathWorker. While making plans with old friends from other High Performance Computing organisations who don't know MATLAB so well, I'm often asked 'HPC in MATLAB....What else is there besides parfor?'
parfor is great, allowing users to go from a serial loop to a parallel one with the addition of only 3 letters, but it's just the beginning of the possibilities for parallel and scalable programming in MATLAB.
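As a minimal sketch of that three-letter change (the loop body and variable names here are illustrative assumptions, not a real workload), the same computation can be written serially and in parallel:

```matlab
n = 8;

% Serial version
serialResult = zeros(1, n);
for k = 1:n
    serialResult(k) = k^2;
end

% Parallel version: the only change to the loop is for -> parfor.
% Without Parallel Computing Toolbox this still runs, just serially.
parallelResult = zeros(1, n);
parfor k = 1:n
    parallelResult(k) = k^2;
end

sameAnswer = isequal(serialResult, parallelResult);
```

Each iteration has to be independent of the others for this to work, which is why parfor suits loops like this one where every pass writes to its own element of the output.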
Today, I wanted to give a brief survey of what I think of as MATLAB's HPC datatypes, which allow users to make use of GPUs and distributed memory, and to work with out-of-core data.

gpuArrays - GPU programming made easy

GPUs, or Graphics Processing Units, can drastically accelerate certain types of computation. MATLAB currently only supports NVIDIA GPUs, which are traditionally programmed using a language called CUDA. You can code in CUDA if you want to, and interface to MATLAB using CUDA MEX functions... but there's an easier way.
Say you have an array, X
X = [1,2,3];
transfer this to the GPU with
G = gpuArray(X);
Any calculation you subsequently do on the gpuArray G will be automatically performed on the GPU.
Gsq = G.^2; % Performed on the GPU
Gsin = sin(G); % Performed on the GPU
As with the other 'HPC datatypes', you use gather to transfer a result back to a 'normal' MATLAB matrix
Xsq = gather(Gsq);
Xsin = gather(Gsin);
gpuArrays require Parallel Computing Toolbox.
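Because the same code runs on ordinary arrays and gpuArrays, one pattern is to fall back to the CPU when no supported GPU is present. The sketch below assumes that fallback is what you want; canUseGPU and gather are existing MATLAB functions, while the variable names and the calculation are illustrative.

```matlab
X = rand(100);

% Use the GPU if a supported one is available, otherwise stay on the CPU
if canUseGPU
    G = gpuArray(X);
else
    G = X;
end

% Identical code either way; runs on the GPU when G is a gpuArray
total = sum(G.^2, 'all');

% gather returns an ordinary array in both cases
total = gather(total);
```

Calling gather on an array that is already in normal memory simply returns it unchanged, so the last line is safe on machines without a GPU.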

dlarray - Specialised objects for Deep Learning training

dlarray is an array type for doing deep learning. It supports Automatic Differentiation and can contain 'normal' arrays that get evaluated by the CPU as well as gpuArrays. How to use dlarrays for deep learning is beyond the scope of this article so I'll just show how to create them and link to the documentation.
X = randn(3,5);
dlX = dlarray(X) % Anything you do with dlX will be evaluated using the CPU
dlX =
  3×5 dlarray
   -0.1472   -0.5046    0.6487   -0.4711    0.3018
    1.0078   -1.2706    0.8257    0.1370    0.3999
   -2.1237   -0.3826   -1.0149   -0.2919   -0.9300
and if we want our dlarray to be created on the GPU we do
gX = randn(3,5,'gpuArray');
dlgX = dlarray(gX) % Anything you do with dlgX will be evaluated using the GPU
dlgX =
  3×5 gpuArray dlarray
   -0.7891    1.2915    0.8958   -1.6692    1.1301
   -1.0132    0.2066   -0.3653    0.8798    2.0214
    1.2300   -1.1879   -1.2931   -0.8332    0.5199
You need Deep Learning Toolbox to make use of dlarray.
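To give a flavour of the Automatic Differentiation support mentioned above, here is a minimal sketch that assumes Deep Learning Toolbox is installed. dlfeval and dlgradient are the toolbox functions for this; the toy function sum(x.^2) and the variable names are purely illustrative and have nothing to do with a real network.

```matlab
x = dlarray([1 2 3]);

% dlgradient must be called inside a function evaluated by dlfeval,
% which is what switches on the tracing needed for differentiation
gradFcn = @(x) dlgradient(sum(x.^2, 'all'), x);
dydx = dlfeval(gradFcn, x);

% Convert the dlarray result back to an ordinary numeric array.
% Analytically, the gradient of sum(x.^2) is 2*x.
g = extractdata(dydx);
```

If x had been built from a gpuArray, the same code would evaluate on the GPU.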

Tall arrays - For when you've got more rows than memory

Tall arrays are useful for when you have tabular data with millions, or even billions of rows. So many rows, in fact, that you could never load them all into memory at once.
Imagine that all of your data is stored in .csv files in a folder called mydata. We start off by creating a datastore that points to all of these files.
ds = datastore("mydata\*.csv");
Note that we have not loaded all of the data into memory; we have merely created an object, ds, that represents and points to the data. We can see its type with
class(ds)
To operate on this datastore we now create a tall array from it. Since I have Parallel Computing Toolbox installed, I'll also start a local pool so all tall array operations are performed in parallel. Without a Parallel Computing Toolbox install everything would be performed serially.
delete(gcp("nocreate")) % Ensure that there is no existing parallel pool
Starting parallel pool (parpool) using the 'processes' profile ...
Connected to the parallel pool (number of workers: 8).
tt = tall(ds)
tt =
  M×4 tall table
    Things     OtherThings    BigThings    SmallThings
    _______    ___________    _________    ___________
    0.85071    0.61257        7.3864       0.076903
    0.56056    0.98995        5.8599       0.058145
    0.92961    0.52768        2.4673       0.092831
    0.69667    0.47952        6.6642       0.058009
    0.58279    0.80135        0.83483      0.0016983
    0.8154     0.22784        6.2596       0.012086
    0.87901    0.49809        6.6094       0.086271
    0.98891    0.90085        7.2975       0.04843
    :          :              :            :
    :          :              :            :
This shows a preview of the data. Note that the height of the tall array is unknown (MATLAB just tells us that there are 'M' rows) because we have not yet parsed through all of the files. We can operate on this tall array as if it were any other table in MATLAB.
Let's get the mean of the BigThings column
BigThingsMean = mean(tt.BigThings)
BigThingsMean =
  tall double
    ?
Preview deferred. Learn more.
This demonstrates another difference between tall arrays and normal, in-memory MATLAB arrays: deferred evaluation. Everything remains unevaluated until you explicitly request that the calculation is performed. When you make multiple operations on tall arrays, this behaviour allows MATLAB to minimise the number of passes through the data when asked to evaluate the final results.
We request evaluation using the gather command. Since I have Parallel Computing Toolbox, it is automatically done in parallel using the local process pool.
gather(BigThingsMean)
Evaluating tall expression using the Parallel Pool 'processes':
- Pass 1 of 1: Completed in 4.3 sec
Evaluation completed in 5.2 sec
ans = 4.6679
Tall arrays work by only ever having a reasonably sized chunk of the data in memory at once and can also make use of parallel infrastructure such as parallel workers on your local computer, traditional HPC clusters or even Spark clusters.
Tall arrays are available in base MATLAB but can be accelerated and scaled using Parallel Computing Toolbox and scaled even further with MATLAB Parallel Server.
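If you don't have a folder of .csv files to hand, the following self-contained sketch builds a small one and runs through the same workflow. The file names, the temporary folder and the random data are all assumptions made for illustration; the column names are borrowed from the example above. It also shows deferred evaluation at work by queuing two calculations and evaluating them with a single call to gather.

```matlab
% Write a few small CSV files to a temporary folder (illustrative data only)
folder = tempname;
mkdir(folder);
for k = 1:3
    T = table(rand(100,1), 10*rand(100,1), ...
        'VariableNames', {'Things', 'BigThings'});
    writetable(T, fullfile(folder, "part" + k + ".csv"));
end

% Point a datastore at all of the files and wrap it in a tall table
ds = datastore(fullfile(folder, "*.csv"));
tt = tall(ds);

% Nothing is computed yet: both results are unevaluated tall arrays
m = mean(tt.BigThings);
s = std(tt.BigThings);

% Evaluate both at once so MATLAB can share passes through the data
[m, s] = gather(m, s);
```

Requesting both results in one gather call, rather than two separate calls, is what gives MATLAB the chance to combine the work.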

Distributed arrays - Spread your arrays across multiple nodes of a HPC cluster

If your matrix is too large to fit onto one node of a HPC cluster, use a distributed array to spread it across multiple nodes.
Just as you can use tall(foo) to create tall arrays, you can use distributed(foo) to create distributed arrays. With a distributed array, the data is held in memory but that memory is spread across multiple workers. Ensure that those workers are on multiple nodes of a HPC cluster and fully distributed, parallel computing is yours.
Our first example creates a distributed array A from a normal array X, which isn't particularly useful since if the array fits into memory of a single machine, there isn't much point in using distributed arrays. We'll see more sophisticated, and useful, methods of creating distributed arrays later.
% Created distributed array from normal array X
X = rand(4);
A = distributed(X);
You can work with this variable A as one single entity, without having to worry about its distributed nature. The code for computing a singular value decomposition, for example, is identical to how we'd do it for a normal array.
distres = svd(A); % Computing the svd of the distributed array, A
The result is still a distributed array
class(distres)
ans = 'distributed'
When you are ready to bring the results back into client memory, use the gather function, just as you would for a tall array. Be careful here, though, to request only results that will fit into the client machine's memory.
% Bring back from cluster
res = gather(distres)
res = 4×1
    1.8891
    0.7206
    0.6445
    0.2742
This is now a normal MATLAB array.
class(res)
ans = 'double'
Here, I've shown the creation of a distributed array from a normal MATLAB array which, when you think about it, is a bit pointless other than for prototyping purposes. If you can fit X into memory, there is nothing to be gained from using distributed arrays instead of normal arrays.
We can also create distributed arrays directly on the workers using functions such as rand, randn and zeros. This way, the overall matrix never has to fit into memory of the client. For example,
distRandom = rand(5,'distributed');
class(distRandom)
ans = 'distributed'
Often, the creation of a distributed array is the most difficult part of a distributed computing workflow! Other methods of creation include using spmd, codistributed arrays (see next section) and datastore. Refer to the documentation for more details.
Over 500 functions are supported for distributed arrays including the vast majority of linear algebra operations (Note for the HPC specialists: we use an implementation of the ScaLAPACK library behind the scenes for some of this).
Distributed arrays require Parallel Computing Toolbox as a minimum but you also need MATLAB Parallel Server to make use of multiple node operation.
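Putting those pieces together, here is a minimal sketch of a distributed workflow in which the large arrays are created directly on the workers and only a scalar comes back to the client. It assumes a parallel pool is available (MATLAB starts one automatically with default preferences); the matrix size and the linear system itself are illustrative assumptions.

```matlab
n = 200;

% Build the system on the workers; nothing large is created on the client.
% Adding n to the diagonal keeps the random matrix well conditioned.
A = rand(n, 'distributed') + n*eye(n, 'distributed');
b = ones(n, 1, 'distributed');

% The solve uses exactly the same syntax as for in-memory arrays
x = A \ b;

% Only the scalar residual norm is brought back to the client
residual = gather(norm(A*x - b));
```

Gathering a single number rather than x itself follows the advice above about only requesting results that fit comfortably in client memory.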

Codistributed arrays - Advanced moves for distributed array users

Distributed arrays are actually a convenient user interface to a lower-level data type called codistributed arrays. The way to think about it is this: A distributed array is an abstraction referring to the entire distributed matrix that you can use without worrying about its distributed nature. The portion of the array that exists on each worker is a codistributed array.
The explicit use of codistributed arrays is a rather advanced move as they are almost always used within spmd blocks, which are themselves a more advanced parallel construct than many users ever need.
delete(gcp("nocreate")) % Ensure that there is no existing parallel pool
parpool("Processes",2); % Create a pool with 2 workers
Starting parallel pool (parpool) using the 'Processes' profile ...
Connected to the parallel pool (number of workers: 2).
Now we have a pool, we can run an spmd block
spmd % Every worker runs the contents of the spmd block
    format compact
    A = zeros(80,1000);
    D = codistributed(A)
    fprintf("The class of D on this worker is %s",class(D))
end
Worker 1:
  This worker stores D(:,1:500).
  LocalPart: [80x500 double]
  Codistributor: [1x1 codistributor1d]
Worker 2:
  This worker stores D(:,501:1000).
  LocalPart: [80x500 double]
  Codistributor: [1x1 codistributor1d]
Worker 1: The class of D on this worker is codistributed
Worker 2: The class of D on this worker is codistributed
fprintf("The class of D outside the spmd block is %s",class(D))
The class of D outside the spmd block is distributed
The above is a somewhat silly example since each worker creates the entire array A before using the codistributed function to spread it across workers. However, it serves the purpose of showing what's going on behind the scenes. Outside of the spmd block on our client MATLAB, D is a distributed array that refers to the entire matrix. Inside the spmd block, and hence on each worker, D is a codistributed array. We could get at the portion of D stored on each worker using the getLocalPart() function inside an spmd block.
Further discussion of codistributed arrays is beyond the scope of this article but essentially you use them for finer control over your distributed arrays. You could, for example, do something unique with the part stored on each worker or change how a distributed array is distributed across workers using codistributor.
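As a small sketch of doing something unique with each worker's part, the code below creates the array directly in codistributed form, modifies the local part and then rebuilds a codistributed array from the pieces. It assumes a parallel pool is available and R2022b or later for spmdIndex (earlier releases use labindex instead); the variable names and the choice of filling each part with the worker's index are illustrative.

```matlab
spmd
    % Create the array already spread across workers, so no worker
    % ever holds the whole thing
    D = zeros(80, 1000, 'codistributed');

    % Work on just the piece stored on this worker
    localD = getLocalPart(D);
    localD(:) = spmdIndex;   % fill with this worker's index

    % Reassemble using the same distribution scheme as D
    D2 = codistributed.build(localD, getCodistributor(D));
end

% Outside the spmd block D2 is a distributed array; gather it to inspect
M = gather(D2);
```

With the default distribution the columns are split across workers, so the leftmost block of M holds 1, the next block 2, and so on.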
So there you have it, a quick survey of some datatypes in MATLAB that facilitate High Performance Computing.