The MATLAB Blog

Practical Advice for People on the Leading Edge

How to make a GPU version of this MATLAB program by changing two lines

In his article, A short game of Life, Steve Eddins showed us the following few lines of code that implement Conway's Game of Life. Steve's version used a 750 x 750 gameboard, whereas mine uses 2000 x 2000 because I want something meaty to compute.
clear % Clear all variables
 
tic
im = rand(2000,2000)>0.8;
for k=1:500
t = conv2(im(:,:,k),[2,2,2;2,1,2;2,2,2],"same");
im(:,:,1,k+1) = (t > 4) & (t < 8);
end
OriginalTime = toc
OriginalTime = 39.4715
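One way to see why the kernel [2,2,2;2,1,2;2,2,2] together with the test (t > 4) & (t < 8) plays Life: each neighbour contributes 2 to t and the cell itself contributes 1, so t = 2*n + s, where n is the number of live neighbours and s is the cell's own state. The following is a minimal sketch that checks this against the standard Life rules for every possible combination. The names n, s, kernelRule and conwayRule are illustrative and do not appear in the original code.

```matlab
% Enumerate every combination of live-neighbour count (0..8)
% and current cell state (0 = dead, 1 = alive).
[n, s] = meshgrid(0:8, 0:1);

% Value conv2 would produce at a cell with the kernel [2,2,2;2,1,2;2,2,2]:
% 2 per live neighbour plus 1 if the cell itself is alive.
t = 2*n + s;

% Rule used in the loop above.
kernelRule = (t > 4) & (t < 8);

% Standard Life rules: a live cell survives with 2 or 3 live neighbours,
% a dead cell becomes alive with exactly 3 live neighbours.
conwayRule = (s == 1 & (n == 2 | n == 3)) | (s == 0 & n == 3);

% True if the two formulations agree in all 18 cases.
rulesAgree = isequal(kernelRule, conwayRule)
```

The only values of t that pass the test are 5, 6 and 7, which correspond to a live cell with 2 neighbours, a dead cell with 3 neighbours, and a live cell with 3 neighbours.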
As Steve showed us, this is extremely easy to turn into an animated gif. Since my gameboard is so big, I'll just zoom in on a small section of it.
imwrite(~im(1:200,1:200,1,:),"Life.gif","DelayTime",0.02,"LoopCount",Inf)
[Animation: Life.gif, a 200 x 200 section of the gameboard]
The first step to faster code: Preallocation
I'm obsessed with speed in the MATLAB programming language and wondered if there is anything that can be done with these few lines to speed them up without ruining their elegance too much.
Let's start with something easy: preallocation of the output array, im.
clear % Clear all variables
 
tic
im = zeros(2000,2000,1,501,"logical");
im(:,:,1,1) = rand(2000,2000) > 0.8;
for k=1:500
t = conv2(im(:,:,k),[2,2,2;2,1,2;2,2,2],"same");
im(:,:,1,k+1) = (t > 4) & (t < 8);
end
PreallocatedTime = toc
PreallocatedTime = 14.2942
That's a speedup of more than 2.7x for one extra line and one modified line. Not bad, but can I go any further?
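Part of the reason preallocation pays off here is the sheer size of im. As a rough back-of-the-envelope check, assuming the usual one byte per logical element in MATLAB (the variable names are illustrative):

```matlab
boardSize = 2000;   % gameboard is boardSize x boardSize
numFrames = 501;    % initial state plus 500 generations

% A logical array uses 1 byte per element.
bytesNeeded = boardSize^2 * numFrames;

% Express the footprint in GiB (2^30 bytes); this comes to a little under 2.
gibNeeded = bytesNeeded / 2^30
```

When the array is grown one frame at a time inside the loop, MATLAB may repeatedly have to find a larger block of memory and copy the existing frames into it. That is a plausible explanation for much of the difference between the two timings, although the exact behaviour depends on the MATLAB version and the machine.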
Moving from a CPU implementation to a GPU implementation of many MATLAB codes is very easy!
My computer has quite a nice NVIDIA GPU, which I can access using Parallel Computing Toolbox.
gpuDevice()
ans =
  CUDADevice with properties:

                      Name: 'NVIDIA GeForce RTX 3070'
                     Index: 1
         ComputeCapability: '8.6'
            SupportsDouble: 1
             DriverVersion: 11.6000
            ToolkitVersion: 11.2000
        MaxThreadsPerBlock: 1024
          MaxShmemPerBlock: 49152
        MaxThreadBlockSize: [1024 1024 64]
               MaxGridSize: [2.1475e+09 65535 65535]
                 SIMDWidth: 32
               TotalMemory: 8.5894e+09
           AvailableMemory: 7.2939e+09
       MultiprocessorCount: 46
              ClockRateKHz: 1725000
               ComputeMode: 'Default'
      GPUOverlapsTransfers: 1
    KernelExecutionTimeout: 1
          CanMapHostMemory: 1
           DeviceSupported: 1
           DeviceAvailable: 1
            DeviceSelected: 1
GPUs are perfectly suited to this kind of thing, but writing a GPU version of this would be difficult, right? Late nights of hardcore CUDA coding await.
But there's another way!
Well over 1000 MATLAB functions (including those in the official Toolboxes) are 'overloaded' with the gpuArray data type. What this means in practice is that whereas this code runs on the CPU:
A = rand(3);
B = rand(4);
Cfull = conv2(A,B)
Cfull = 6×6
    0.2357    0.1997    0.7748    0.4111    0.5196    0.0080
    0.2237    0.4145    1.3158    1.0064    1.2236    0.0201
    0.6943    0.7811    2.3386    1.8777    1.5679    0.0412
    0.3726    0.9340    2.2665    2.3125    1.7186    0.5522
    0.5210    0.8183    1.7277    1.9970    1.4909    0.8490
    0.2133    0.8088    1.2790    1.8410    1.2599    0.5397
This code runs on the GPU:
gpuA = gpuArray(A); % Transfer A to the GPU and call it gpuA
gpuB = gpuArray(B); % Transfer B to the GPU and call it gpuB
Cfull_gpu = conv2(gpuA,gpuB) % This now runs on the GPU
Cfull_gpu =
    0.2357    0.1997    0.7748    0.4111    0.5196    0.0080
    0.2237    0.4145    1.3158    1.0064    1.2236    0.0201
    0.6943    0.7811    2.3386    1.8777    1.5679    0.0412
    0.3726    0.9340    2.2665    2.3125    1.7186    0.5522
    0.5210    0.8183    1.7277    1.9970    1.4909    0.8490
    0.2133    0.8088    1.2790    1.8410    1.2599    0.5397
All you need in order to run over 1000 MATLAB functions on an NVIDIA GPU is Parallel Computing Toolbox, which lets you pass them gpuArrays instead of normal arrays.
Whether or not you'll actually get a speedup depends on many factors, but if you simply want to get things running on the GPU instead of the CPU, this is it!
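To make the round trip concrete, here is a minimal sketch that repeats the small conv2 example, confirms the result lives on the GPU, and copies it back to host memory with gather. It assumes Parallel Computing Toolbox and a supported GPU; the canUseGPU guard lets the script run (CPU part only) on machines without one. The names Ccpu, Cgpu, Chost and maxDiff are illustrative.

```matlab
A = rand(3);
B = rand(4);

% Reference result computed on the CPU.
Ccpu = conv2(A, B);

if canUseGPU
    % Same call, but with gpuArray inputs, so the work happens on the GPU.
    Cgpu = conv2(gpuArray(A), gpuArray(B));

    % The result stays on the GPU until you ask for it back.
    resultClass = class(Cgpu)   % 'gpuArray'

    % gather copies the data back into ordinary host memory.
    Chost = gather(Cgpu);

    % CPU and GPU results should agree up to floating-point rounding.
    maxDiff = max(abs(Chost(:) - Ccpu(:)))
end
```

Each transfer between host and GPU memory costs time, which is one of the factors that decides whether a GPU version ends up faster. Keeping data on the GPU across many operations, as the Life loop does, tends to help.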
Back to Steve's code. What I need to do is change the initialisation of im to a gpuArray and everything will automagically run on the GPU. That is, I change
im = zeros(2000,2000,1,501,"logical");
im(:,:,1,1) = rand(2000,2000) > 0.8;
to
im = zeros(2000,2000,1,501,"logical","gpuArray");
im(:,:,1,1) = rand(2000,2000,"gpuArray") > 0.8;
Let's give it a try.
clear % Clear all variables
 
dev = gpuDevice();
tic
im = zeros(2000,2000,1,501,"logical","gpuArray");
im(:,:,1,1) = rand(2000,2000,"gpuArray") > 0.8;
for k=1:500
t = conv2(im(:,:,k),[2,2,2;2,1,2;2,2,2],"same");
im(:,:,1,k+1) = (t > 4) & (t < 8);
end
wait(dev);
GpuTime = toc
GpuTime = 5.3013
Almost 3x faster than the CPU version that used preallocated arrays and around 7x faster than the original! Not bad considering I only changed 2 lines of code. Furthermore, this has to be the easiest GPU 'port' of a simulation I've ever written.
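For reference, the speedup factors follow directly from the three timings reported above. The values below are copied from those runs; the speedup variable names are illustrative.

```matlab
% Timings reported earlier in this post, in seconds.
OriginalTime     = 39.4715;
PreallocatedTime = 14.2942;
GpuTime          = 5.3013;

% Speedup = time before / time after.
speedupPrealloc      = OriginalTime / PreallocatedTime;   % roughly 2.8x
speedupGpuVsPrealloc = PreallocatedTime / GpuTime;        % roughly 2.7x
speedupGpuVsOriginal = OriginalTime / GpuTime;            % roughly 7.4x

fprintf("Preallocated vs original: %.1fx\n", speedupPrealloc);
fprintf("GPU vs preallocated:      %.1fx\n", speedupGpuVsPrealloc);
fprintf("GPU vs original:          %.1fx\n", speedupGpuVsOriginal);
```

These numbers are specific to the machine listed under System details and will differ on other hardware.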
Now, I am sure that there are CUDA experts out there who could do better than this -- squeezing every last drop of performance possible from the poor overworked GPU -- but 3x speedup for so little work is pretty good going and there are several options in the MATLAB ecosystem that allow you to go deeper and explore other approaches.

What's going on with wait(dev)?

The eagle-eyed among you might have noticed that my GPU version has an extra couple of lines in it that I haven't mentioned yet. As we speak, I can feel you reaching for the comment button to tell me that I lied to you... I changed four lines, not two! Here are the two lines I conveniently chose not to mention to you:
dev = gpuDevice();
wait(dev);
The reason for these lines is all about timing. You see, when you ask for a computation to be done on the GPU, MATLAB kicks things off and moves to the next line without waiting for the GPU to finish the calculation. This can be used for some very nifty interleaving of code, where you have things running on the CPU and GPU simultaneously, but it can also mess up timing if you use tic/toc. Calling wait(dev) blocks MATLAB until the GPU has finished all of its pending work, so toc measures the whole computation. Timing GPU code can be tricky, which is why MathWorks also gives you the gputimeit function.
If all I wanted to do was run the code, and not time it, then I wouldn't need to bother with these two extra lines. So what I told you is true... from a certain point of view.
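As an alternative to tic/toc plus wait(dev), here is a minimal sketch of timing a single Life update with gputimeit. It assumes Parallel Computing Toolbox and a supported GPU; board, kernel and stepFcn are illustrative names that are not part of the code above.

```matlab
kernel = [2,2,2;2,1,2;2,2,2];

if canUseGPU
    % One random gameboard created directly on the GPU.
    board = rand(2000,2000,"gpuArray") > 0.8;

    % Wrap the operation to be timed in a function handle with no inputs.
    stepFcn = @() conv2(board, kernel, "same");

    % gputimeit calls the function several times and makes sure the GPU
    % has finished before stopping the clock, so no explicit wait is needed.
    tStep = gputimeit(stepFcn)
end
```

This times only the convolution for one generation, not the whole 500-step loop, so the result is not directly comparable with GpuTime above.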

System details

  • MATLAB R2022a
  • CPU: 11th Gen Intel(R) Core(TM) i7-11700 @ 2.50GHz
  • GPU: NVIDIA GeForce RTX 3070
  • OS: Windows 11

Over to you

Have you got some code that can be easily made to run on a GPU like this and show a performance boost?