Direct submission to HPC clusters from MATLAB
More Compute, More Problems
When it comes to High Performance Computing resources, I'm a lucky guy. I've got a fairly decent 8-core desktop along with guest access to a reasonable number of academic HPC clusters around the world. Since I try to be a courteous guest, I also make use of a healthy cloud budget when I need to do some heavy computation, and I use all of the major cloud providers.
In the old days, access to such a diverse array of compute was as much a curse as it was a blessing. Many systems use different schedulers, for example. Even when two systems have the same scheduler (usually SLURM), they have different module files, different authentication procedures and different file system layouts, among other things. Making use of these computational riches can be a burden indeed.
Then you have a wide variety of access methods, from the classic ssh/sftp command line interface to more modern options like Open OnDemand, ThinLinc or Jupyter notebooks. It's all good stuff, but it's a lot of messing around with an array of (admittedly fascinating) technologies when often all you want to do is get your results more quickly with as little fuss as possible.
Wouldn't it be nice if I could interact with all of these HPC and cloud machines without ever leaving MATLAB? No Linux, no schedulers and no shell scripting -- just MATLAB? Well you can, and I'll show you how.
Submit to all the things, direct from MATLAB
Imagine I have a function:
[result1,result2] = bigComputation(N)
It's a computation and it's big! The single input argument is an integer that defines just how big, and it returns two output arrays. Obviously it also runs in parallel because this is an HPC post.
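The details of bigComputation don't matter for this post, but for concreteness here's a minimal sketch of what such a function might look like. Only the name and signature come from above; everything inside the body is made up for illustration.
function [result1,result2] = bigComputation(N)
% Hypothetical stand-in for the real work: split the N samples into
% chunks and compute per-chunk statistics on the workers in parallel.
nChunks = 100;
chunkSize = ceil(N/nChunks);
result1 = zeros(1,nChunks); % per-chunk means
result2 = zeros(1,nChunks); % per-chunk maxima
parfor c = 1:nChunks
    x = rand(1,chunkSize);
    result1(c) = mean(x);
    result2(c) = max(x);
end
end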
OK, so here's how I might run that function on my various machines using MATLAB. First, I define the problem and create objects representing my compute resources.
% Define how big a computation I want to do
N = 1e8;
% Create parallel cluster objects representing my various computational resources
myMachine = parcluster("Processes"); % My local PC
onPremHPC = parcluster("University of Bantshire HPC"); % A SLURM based cluster
cloudHPC = parcluster("MyAWS Cluster"); % A MATLAB Cloud Center Cluster running on AWS
- The Processes profile is one of the MATLAB defaults and has been there since I installed MATLAB. It points to my local machine, while the other two profiles required some configuration.
- The onPremHPC cluster profile was given to me by the system administrator of the SLURM-based cluster, who had configured their machine using the MATLAB SLURM plugin on GitHub, along with a little help from our HPC Support team.
- I was able to import the cloudHPC profile after configuring an AWS cloud cluster using MathWorks Cloud Center. The import itself is a one-liner, as sketched below.
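For the curious, here's roughly what that import looks like. The filename is hypothetical; use whatever profile file you were given or downloaded.
parallel.importProfile("MyAWSCluster.mlsettings"); % import a saved cluster profile file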
The profile configuration only had to be done once. One day I'll tell you how it's done, but I did it so long ago that I've forgotten the steps, and I need this post published in time for International Supercomputing 2023 (ISC23), so today I'm just showing you the day-to-day workflow.
All that remains is to submit the function using the MATLAB batch command, the structure of which is:
batch(clusterObject,@functionToRun,numberOfOutputArguments,{inputArguments},options)
The only option I'm going to use today is Pool, which requests a specific number of workers on either my local machine or the remote resource.
% Submit to local machine
localJob = batch(myMachine,@bigComputation,2,{N},Pool=4); % Submit to a pool of 4 workers on my local machine
% Submit to SLURM cluster
onPremJob = batch(onPremHPC,@bigComputation,2,{N},Pool=31); % Submit to a pool of 31 workers on a SLURM HPC cluster
Since I'm running the same computation on both machines, these two calls to batch are almost identical, but what's happening behind the scenes is very different!
The submission to my local machine simply runs the function on a local parpool, almost exactly as if I had run it directly at the command line. The main benefit of running parallel jobs locally this way is that it is non-blocking: while 4 cores service the job in the background, I can use my remaining cores to work with MATLAB interactively.
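Because batch returns immediately, I can poke at the job object from the command line whenever I like. For example (the State values shown are just the ones you'll see most often):
localJob.State  % e.g. 'queued', 'running' or 'finished'
diary(localJob) % display any command window output the job has produced so far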
The submission to the SLURM cluster is a very different beast. Behind the scenes, MATLAB first logs into the HPC cluster, presenting me with any authentication challenges you might expect (two-factor authentication is not a problem). Then it compresses my code, transfers it to the cluster via sftp, and creates and submits a SLURM job to the system, all according to the policies laid out by the system administrator. I didn't need to worry about any of this; I just used the line above.
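Depending on how the sysadmin set things up, you can often pass scheduler-specific options through the cluster object's AdditionalProperties. The property names below are typical of the MathWorks SLURM plugin scripts, but they vary from site to site, so treat this as a sketch rather than gospel.
onPremHPC.AdditionalProperties.WallTime = "1:00:00";      % ask SLURM for a 1 hour time limit
onPremHPC.AdditionalProperties.AccountName = "myproject"; % charge the job to a project account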
The submission to the AWS cluster has an extra step. I pay for it by the minute, so I keep it switched off to save money. As such, I have to start it first!
start(cloudHPC)
This kicks off the start-up sequence of the cloud cluster and returns almost immediately, allowing you to continue interactive work on your local machine. If you want to ensure that cloudHPC has completed its setup before running the next line of code, you'll need to do
wait(cloudHPC) % wait until cloudHPC is ready to accept jobs
Other than that, the structure of the batch command is identical again.
cloudJob = batch(cloudHPC,@bigComputation,2,{N},Pool=63); % Submit to 63 workers on the cloud cluster
In the configuration of cloudHPC (not discussed here), I've set it up to shut down automatically once the cluster is idle. As such, I don't need to worry about accidentally running up a huge cloud bill.
If I wanted to be sure, though, I could first wait for the job to finish
wait(cloudJob)
and then shutdown the cluster explicitly
shutdown(cloudHPC)
Getting the results
The jobs are running on my machines and I could have monitored their state using the Job Monitor if I'd liked. Instead, I just went for a walk in the sunshine and now it's time to get my results. I can fetch the outputs of any job with the fetchOutputs command. Here's the result from the run on my local machine.
fetchOutputs(localJob)
Recall that my function bigComputation(N) has two outputs, and here they are! Hardly seems worth it for several hours of compute, but that's often the way with big HPC jobs. Let's hope that those numbers really mean something scientifically!
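A small detail worth knowing: fetchOutputs returns a cell array with one element per output argument, so I can unpack the results into named variables like this.
outputs = fetchOutputs(localJob); % 1x2 cell array, one cell per output argument
[result1,result2] = outputs{:};   % unpack into named variables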
Fetching the outputs from the SLURM and Cloud machines is just as easy. Here's the result from the SLURM cluster.
fetchOutputs(onPremJob)
Same calculation, same result! No surprise there, but think about what actually happened here: MATLAB connected to the cluster, fetched the results and returned them to this Live Script with just a single command. No ssh, no sftp, just MATLAB.
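One more thing worth knowing: jobs live with the cluster, not with your MATLAB session, so even if I had closed MATLAB after submitting, I could still retrieve everything later. A minimal sketch, assuming the job I want is the most recent one in the cluster's list:
onPremHPC = parcluster("University of Bantshire HPC"); % recreate the cluster object
onPremJob = onPremHPC.Jobs(end); % assumption: my job is the most recently created one
fetchOutputs(onPremJob)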
Want this on your HPC Cluster?
I have omitted so many details it's almost criminal! There is so much more that can be done; we could (and do!) run entire day-long tutorials on this stuff. The aim of this post is to introduce the skeleton of this workflow, show how easy HPC in MATLAB can be, and hopefully start some conversations with MATLAB users and HPC sysadmins around the world. If you want to learn more, message me on Twitter or LinkedIn, or head over to MATLAB Parallel Server - MATLAB (mathworks.com) and click the link at the bottom right.
I'll also be at International Supercomputing 2023 in Hamburg next week wearing a MATLAB cap, for anyone who wants to talk in person.