Today I'd like to introduce James Lebak. James is a developer who works on GPU support in the Parallel Computing Toolbox.
One common use of multi-GPU systems is to perform Monte Carlo simulations. The use of more GPUs can increase the number of samples that can be simulated, resulting in a more accurate simulation.
Let's begin by looking at the basic option pricing simulation from the Parallel Computing Toolbox example "Exotic Option Pricing on a GPU using a Monte-Carlo Method". In this example, we run many concurrent simulations of the evolution of the stock price. The mean and distribution of the simulation outputs gives us a sense of the ultimate value of the stock.
The function simulateStockPrice that describes the stock price is a discretization of a stochastic differential equation. The equation assumes that prices evolve according to a log-normal distribution related to the risk-free interest rate, the dividend yield (if any), and the volatility in the market.
function finalStockPrice = simulateStockPrice(stockPrice, rate,... dividend, volatility,... Tfinal, dT) % Discrete simulation of stock price t = 0; while t < Tfinal t = t + dT; dr = (rate - dividend - volatility*volatility/2)*dT; perturbation = volatility*sqrt(dT)*randn(); stockPrice = stockPrice*exp(dr + perturbation); end finalStockPrice = stockPrice; end
Achieving good performance on a GPU requires executing many operations in parallel. The GPU has thousands of individual computational units, and you get best performance from them when you execute hundreds of thousands or millions of parallel operations. This is similar to the advice we often give to vectorize functions in MATLAB for better performance. The function simulateStockPrice produces a single, scalar output, so if we run it as it is, it won't achieve good performance.
We can execute the simulation many times on the GPU using ARRAYFUN. The arrayfun function on the GPU takes a function handle and a gpuArray input, and executes the function on each element of the input gpuArray. This is an alternative to vectorizing the function, which would require changing more code. It returns one output for each element in the input. This is exactly what we want to do in a Monte Carlo simulation: we want to produce many independent outputs.
The runSimulationOnOneGPU function uses arrayfun to perform many simulations in parallel and returns the mean price.
function mfp = runSimulationOnOneGPU(Nsamples) % Run a single stock simulation on the GPU and return the % mean file price on the CPU. stockPrice = 100; % Stock price starts at $100. dividend = 0.01; % 1% annual dividend yield. riskFreeRate = 0.005; % 0.5 percent. timeToExpiry = 2; % Lifetime of the option in years. sampleRate = 1/250; % Assume 250 working days per year. volatility = 0.20; % 20% volatility. % Create the input data. Any scalar inputs are expanded to the size of the % array inputs. In this case, the starting stock prices is a vector whose % length is the number of simulations to perform on the GPU at once, and % all of the other inputs are scalars. startPrices = stockPrice*gpuArray.ones(Nsamples, 1); % Run all Nsamples simulations in parallel with one call to arrayfun. finalPrices = arrayfun( @simulateStockPrice, ... startPrices, riskFreeRate, dividend, volatility, ... timeToExpiry, sampleRate ); mfp = gather(mean(finalPrices)); end
We execute this function with a relatively small number of samples and display the computed mean.
nSamples = 1e6; meanFinalPrice = runSimulationOnOneGPU(nSamples); disp(['Calculated mean final price of ', num2str(meanFinalPrice)]);
Calculated mean final price of 98.96
Assume that we want to run this simulation on a desktop machine with two GPUs. This will allow us to have more samples in the same amount of time, which by the law of large numbers should give us a better approximation of the mean final stock price.
With Parallel Computing Toolbox, MATLAB can perform calculations on multiple GPUs. In order to do so, we need to open a MATLAB pool with one worker for each GPU. One MATLAB worker is needed to communicate with each GPU.
The workers in the pool will do all the calculations, so the client doesn't need to use a gpuDevice. Deselect the device on the client, so that all the memory on all the devices is completely available to the workers.
Determine how many GPUs we have on this machine and open a local MATLAB pool of that size. This gives us one worker for each GPU on this machine.
nGPUs = gpuDeviceCount(); matlabpool('local', nGPUs);
Starting matlabpool using the 'local' profile ... connected to 2 workers.
The runSimulationOnManyGPUs function contains a parfor loop that executes nIter times. Each iteration of the parfor loop executes nSamples independent Monte Carlo simulations on one GPU and returns the mean final price over all those simulations. The output of the overall simulation is the mean of the individual means, because each parfor iteration executes the same number of independent simulations. When the loop finishes we have executed nIter*nSamples simulations.
function [tout, mfp] = runSimulationOnManyGPUs(nSamples, nIter) % Executes a total of nSamples*nIter simulations, distributing the % simulations over the workers in a matlabpool in groups of nSamples each. tic; parfor ix = 1:nIter meanFinalPrice(ix) = runSimulationOnOneGPU(nSamples); end mfp = mean(meanFinalPrice); tout = toc;
To run nSamples iterations on each GPU, we call runSimulationOnManyGPUs with nIter equal to the number of GPUs.
[tout, meanFinalPrice] = runSimulationOnManyGPUs(nSamples, nGPUs); disp(['Performing simulations on the GPU took ', num2str(tout), ' s']); disp(['Calculated mean final price of ', num2str(meanFinalPrice)]);
Performing simulations on the GPU took 8.2226 s Calculated mean final price of 98.9855
It is important that the results are returned as regular MATLAB arrays rather than as gpuArrays. If the result is returned as a gpuArray then the client will share a GPU with one of the workers, which would unnecessarily use more memory on the device and transfer data over the PCI bus.
The recommended setup for MATLAB when using multiple GPUs is, as we discussed above, to open one worker for each GPU. Let's dig in a little more to understand the details of how the workers interact with GPUs.
When workers share a single machine with multiple GPUs, MATLAB automatically assigns each worker in a parallel pool to use a different GPU by default. You can see this by using the spmd command to examine the index of the device used by each worker.
spmd gd = gpuDevice; idx = gd.Index; disp(['Using GPU ',num2str(idx)]); end
Lab 1: Using GPU 1 Lab 2: Using GPU 2
NVIDIA GPUs can be operated in one of four compute modes: default, exclusive thread, exclusive process, or prohibited. The compute mode is shown in the 'ComputeMode' field of the structure returned by gpuDevice. If a GPU is in prohibited mode, no worker will be assigned to use that GPU. If a GPU is in exclusive process or exclusive thread mode, only one MATLAB worker will attempt to access that GPU.
It is possible for multiple workers to share the same GPU, if the GPU is in 'Default' compute mode. When this is done, the GPU driver serializes accesses to the device. You should be aware that there will be a performance penalty because of the serialization. There will also be less memory available on the GPU for each worker, because two or more workers will be sharing the same GPU memory space. In general it will be hard to achieve good performance when multiple workers share the same GPU.
You can customize the way that MATLAB assigns workers to GPUs to suit your own needs by overriding the MATLAB function selectGPU. See the help for selectGPU for more details.
MATLAB initializes each worker to use a different random number stream on the GPU by default. Doing so ensures that the values obtained by each worker are different. In this example, we have for simplicity elected to use the default random number generation stream on the GPU, the well-known MRG32K3A stream. While MRG32K3A is a highly regarded algorithm with good support for parallelism, there are other streams that you can select that may provide better performance. The documentation page Using GPUArray describes the options available for controlling random number generation on the GPU, and lists the different streams that you can select.
With all of these details in place, it's time to consider how to put multiple GPUs to use when hundreds of millions or even billions of simulations are desired. Such a situation might arise, for example, when we want different values of key parameters such as volatility or the dividend in each simulation.
Let's assume we want to run 800 million simulations, and that we have a cluster with 16 NVIDIA C2050 GPUs available. The problem is that a single GPU has a limited amount of memory. A single NVIDIA C2050 compute card has about 3 GB of memory and can compute on the order of $5\times10^7$ points in runSimulationOnOneGPU without running out of memory. We can call runSimulationOnOneGPU 16 times to complete all the desired simulations. On a single machine, doing so takes around 10 minutes.
We measured the time that it takes to run all $8\times 10^8$ simulations with up to 16 GPUs. As you can see, we got nearly-perfect linear scaling in this experiment.
How many GPUs do you use in your application, and how do they interact? Let me know in the comments.