{"id":709,"date":"2013-06-24T07:51:44","date_gmt":"2013-06-24T12:51:44","guid":{"rendered":"https:\/\/blogs.mathworks.com\/loren\/?p=709"},"modified":"2016-08-03T14:28:44","modified_gmt":"2016-08-03T19:28:44","slug":"running-monte-carlo-simulations-on-multiple-gpus","status":"publish","type":"post","link":"https:\/\/blogs.mathworks.com\/loren\/2013\/06\/24\/running-monte-carlo-simulations-on-multiple-gpus\/","title":{"rendered":"Running Monte Carlo Simulations on Multiple GPUs"},"content":{"rendered":"<div class=\"content\"><!--introduction--><p>Today I'd like to introduce James Lebak. James is a developer who works on GPU support in the Parallel Computing Toolbox.<\/p><!--\/introduction--><h3>Contents<\/h3><div><ul><li><a href=\"#0848b1ee-dfb7-461a-8d19-411f2853d07a\">Basic Option Pricing on the GPU<\/a><\/li><li><a href=\"#fef40a35-8ef4-40a0-b79e-bb55c06bb1de\">Using Multiple GPUs on One Machine<\/a><\/li><li><a href=\"#ccbaa1c2-50c7-4bc9-969c-5ec4dd4b0ace\">Multi-GPU Execution Details<\/a><\/li><li><a href=\"#426e83df-6735-4dc4-aa5d-6423450d7615\">Using Multiple GPUs in a Cluster<\/a><\/li><\/ul><\/div><h4>Basic Option Pricing on the GPU<a name=\"0848b1ee-dfb7-461a-8d19-411f2853d07a\"><\/a><\/h4><p>One common use of multi-GPU systems is to perform Monte Carlo simulations. The use of more GPUs can increase the number of samples that can be simulated, resulting in a more accurate simulation.<\/p><p>Let's begin by looking at the basic option pricing simulation from the <a href=\"https:\/\/www.mathworks.com\/products\/parallel-computing\">Parallel Computing Toolbox<\/a> example <a href=\"https:\/\/www.mathworks.com\/help\/distcomp\/examples\/using-gpu-arrayfun-for-monte-carlo-simulations.html\">\"Exotic Option Pricing on a GPU using a Monte-Carlo Method\"<\/a>. In this example, we run many concurrent simulations of the evolution of the stock price. 
The mean and distribution of the simulation outputs give us a sense of the ultimate value of the stock.<\/p><p>The function <tt>simulateStockPrice<\/tt> that describes the stock price is a discretization of a stochastic differential equation. The equation assumes that prices evolve according to a log-normal distribution related to the risk-free interest rate, the dividend yield (if any), and the volatility in the market.<\/p><pre class=\"codeinput\">type <span class=\"string\">simulateStockPrice<\/span>;\r\n<\/pre><pre class=\"codeoutput\">\r\nfunction finalStockPrice = simulateStockPrice(stockPrice, rate,...\r\n                                              dividend, volatility,...\r\n                                              Tfinal, dT)\r\n% Discrete simulation of stock price\r\nt = 0;\r\nwhile t &lt; Tfinal\r\n    t = t + dT;\r\n    dr = (rate - dividend - volatility*volatility\/2)*dT;\r\n    perturbation = volatility*sqrt(dT)*randn();\r\n    stockPrice = stockPrice*exp(dr + perturbation);\r\nend\r\nfinalStockPrice = stockPrice;\r\nend\r\n\r\n<\/pre><p>Achieving good performance on a GPU requires executing many operations in parallel. The GPU has thousands of individual computational units, and you get best performance from them when you execute hundreds of thousands or millions of parallel operations. This is similar to the advice we often give to <a href=\"https:\/\/www.mathworks.com\/help\/matlab\/matlab_prog\/vectorization.html\">vectorize<\/a> functions in MATLAB for better performance. The function <tt>simulateStockPrice<\/tt> produces a single, scalar output, so if we run it as it is, it won't achieve good performance.<\/p><p>We can execute the simulation many times on the GPU using <a title=\"https:\/\/www.mathworks.com\/help\/distcomp\/arrayfun.html (link no longer works)\">ARRAYFUN<\/a>. 
The <tt>arrayfun<\/tt> function on the GPU takes a function handle and a <tt>gpuArray<\/tt> input, and executes the function on each element of the input <tt>gpuArray<\/tt>. This is an alternative to vectorizing the function, which would require changing more code. It returns one output for each element in the input. This is exactly what we want to do in a Monte Carlo simulation: we want to produce many independent outputs.<\/p><p>The <tt>runSimulationOnOneGPU<\/tt> function uses <tt>arrayfun<\/tt> to perform many simulations in parallel and returns the mean price.<\/p><pre class=\"codeinput\">type <span class=\"string\">runSimulationOnOneGPU.m<\/span>;\r\n<\/pre><pre class=\"codeoutput\">\r\nfunction mfp = runSimulationOnOneGPU(Nsamples)\r\n% Run a single stock simulation on the GPU and return the \r\n% mean final price on the CPU.\r\n\r\nstockPrice   = 100;   % Stock price starts at $100.\r\ndividend     = 0.01;  % 1% annual dividend yield.\r\nriskFreeRate = 0.005; % 0.5 percent.\r\ntimeToExpiry = 2;     % Lifetime of the option in years.\r\nsampleRate   = 1\/250; % Assume 250 working days per year.\r\nvolatility   = 0.20;  % 20% volatility.\r\n\r\n% Create the input data. Any scalar inputs are expanded to the size of the\r\n% array inputs. 
In this case, the starting stock prices form a vector whose\r\n% length is the number of simulations to perform on the GPU at once, and\r\n% all of the other inputs are scalars.\r\nstartPrices = stockPrice*gpuArray.ones(Nsamples, 1);\r\n\r\n% Run all Nsamples simulations in parallel with one call to arrayfun.\r\nfinalPrices = arrayfun( @simulateStockPrice, ...\r\n    startPrices, riskFreeRate, dividend, volatility, ...\r\n    timeToExpiry, sampleRate );\r\nmfp = gather(mean(finalPrices));\r\nend\r\n\r\n<\/pre><p>We execute this function with a relatively small number of samples and display the computed mean.<\/p><pre class=\"codeinput\">nSamples = 1e6;\r\nmeanFinalPrice = runSimulationOnOneGPU(nSamples);\r\ndisp([<span class=\"string\">'Calculated mean final price of '<\/span>, num2str(meanFinalPrice)]);\r\n<\/pre><pre class=\"codeoutput\">Calculated mean final price of 98.96\r\n<\/pre><h4>Using Multiple GPUs on One Machine<a name=\"fef40a35-8ef4-40a0-b79e-bb55c06bb1de\"><\/a><\/h4><p>Assume that we want to run this simulation on a desktop machine with two GPUs. This will allow us to have more samples in the same amount of time, which by the law of large numbers should give us a better approximation of the mean final stock price.<\/p><p>With Parallel Computing Toolbox, MATLAB can perform calculations on multiple GPUs. In order to do so, we need to open a MATLAB pool with one worker for each GPU. One MATLAB worker is needed to communicate with each GPU.<\/p><p>The workers in the pool will do all the calculations, so the client doesn't need to use a <tt>gpuDevice<\/tt>. Deselect the device on the client, so that all the memory on all the devices is completely available to the workers.<\/p><pre class=\"codeinput\">gpuDevice([]);\r\n<\/pre><p>Determine how many GPUs we have on this machine and open a local MATLAB pool of that size. 
This gives us one worker for each GPU on this machine.<\/p><pre class=\"codeinput\">nGPUs = gpuDeviceCount();\r\nmatlabpool(<span class=\"string\">'local'<\/span>, nGPUs);\r\n<\/pre><pre class=\"codeoutput\">Starting matlabpool using the 'local' profile ... connected to 2 workers.\r\n<\/pre><p>The <tt>runSimulationOnManyGPUs<\/tt> function contains a <tt>parfor<\/tt> loop that executes <tt>nIter<\/tt> times. Each iteration of the <tt>parfor<\/tt> loop executes <tt>nSamples<\/tt> independent Monte Carlo simulations on one GPU and returns the mean final price over all those simulations. The output of the overall simulation is the mean of the individual means, because each <tt>parfor<\/tt> iteration executes the same number of independent simulations. When the loop finishes we have executed <tt>nIter*nSamples<\/tt> simulations.<\/p><pre class=\"codeinput\">type <span class=\"string\">runSimulationOnManyGPUs.m<\/span>;\r\n<\/pre><pre class=\"codeoutput\">\r\nfunction [tout, mfp] = runSimulationOnManyGPUs(nSamples, nIter)\r\n% Executes a total of nSamples*nIter simulations, distributing the\r\n% simulations over the workers in a matlabpool in groups of nSamples each.\r\ntic;\r\nparfor ix = 1:nIter\r\n    meanFinalPrice(ix) = runSimulationOnOneGPU(nSamples);\r\nend\r\nmfp = mean(meanFinalPrice);\r\ntout = toc;\r\n\r\n<\/pre><p>To run <tt>nSamples<\/tt> iterations on each GPU, we call <tt>runSimulationOnManyGPUs<\/tt> with <tt>nIter<\/tt> equal to the number of GPUs.<\/p><pre class=\"codeinput\">[tout, meanFinalPrice] = runSimulationOnManyGPUs(nSamples, nGPUs);\r\ndisp([<span class=\"string\">'Performing simulations on the GPU took '<\/span>, num2str(tout), <span class=\"string\">' s'<\/span>]);\r\ndisp([<span class=\"string\">'Calculated mean final price of '<\/span>, num2str(meanFinalPrice)]);\r\n<\/pre><pre class=\"codeoutput\">Performing simulations on the GPU took 8.2226 s\r\nCalculated mean final price of 98.9855\r\n<\/pre><p>It is important that the results are 
returned as regular MATLAB arrays rather than as <tt>gpuArrays<\/tt>. If the result is returned as a <tt>gpuArray<\/tt> then the client will share a GPU with one of the workers, which would unnecessarily use more memory on the device and transfer data over the PCI bus.<\/p><h4>Multi-GPU Execution Details<a name=\"ccbaa1c2-50c7-4bc9-969c-5ec4dd4b0ace\"><\/a><\/h4><p>The recommended setup for MATLAB when using multiple GPUs is, as we discussed above, to open one worker for each GPU. Let's dig in a little more to understand the details of how the workers interact with GPUs.<\/p><p>When workers share a single machine with multiple GPUs, MATLAB automatically assigns each worker in a parallel pool to use a different GPU by default. You can see this by using the spmd command to examine the index of the device used by each worker.<\/p><pre class=\"codeinput\"><span class=\"keyword\">spmd<\/span>\r\n    gd = gpuDevice;\r\n    idx = gd.Index;\r\n    disp([<span class=\"string\">'Using GPU '<\/span>,num2str(idx)]);\r\n<span class=\"keyword\">end<\/span>\r\n<\/pre><pre class=\"codeoutput\">Lab 1: \r\n  Using GPU 1\r\nLab 2: \r\n  Using GPU 2\r\n<\/pre><p>NVIDIA GPUs can be operated in one of four compute modes: default, exclusive thread, exclusive process, or prohibited. The compute mode is shown in the 'ComputeMode' field of the structure returned by gpuDevice. If a GPU is in prohibited mode, no worker will be assigned to use that GPU. If a GPU is in exclusive process or exclusive thread mode, only one MATLAB worker will attempt to access that GPU.<\/p><p>It is possible for multiple workers to share the same GPU, if the GPU is in 'Default' compute mode. When this is done, the GPU driver serializes accesses to the device. You should be aware that there will be a performance penalty because of the serialization. There will also be less memory available on the GPU for each worker, because two or more workers will be sharing the same GPU memory space. 
In general it will be hard to achieve good performance when multiple workers share the same GPU.<\/p><p>You can customize the way that MATLAB assigns workers to GPUs to suit your own needs by overriding the MATLAB function <tt>selectGPU<\/tt>. See the help for <tt>selectGPU<\/tt> for more details.<\/p><p>MATLAB initializes each worker to use a different random number stream on the GPU by default. Doing so ensures that the values obtained by each worker are different. In this example, we have for simplicity elected to use the default random number generation stream on the GPU, the well-known MRG32K3A stream. While MRG32K3A is a highly regarded algorithm with good support for parallelism, there are other streams that you can select that may provide better performance. The documentation page Using GPUArray describes the options available for controlling random number generation on the GPU, and lists the different streams that you can select.<\/p><h4>Using Multiple GPUs in a Cluster<a name=\"426e83df-6735-4dc4-aa5d-6423450d7615\"><\/a><\/h4><p>With all of these details in place, it's time to consider how to put multiple GPUs to use when hundreds of millions or even billions of simulations are desired. Such a situation might arise, for example, when we want different values of key parameters such as volatility or the dividend in each simulation.<\/p><p>Let's assume we want to run 800 million simulations, and that we have a cluster with 16 NVIDIA C2050 GPUs available. The problem is that a single GPU has a limited amount of memory. A single NVIDIA C2050 compute card has about 3 GB of memory and can compute on the order of $5\\times10^7$ points in <tt>runSimulationOnOneGPU<\/tt> without running out of memory. We can call <tt>runSimulationOnOneGPU<\/tt> 16 times to complete all the desired simulations. On a single machine, doing so takes around 10 minutes.<\/p><p>We measured the time that it takes to run all $8\\times 10^8$ simulations with up to 16 GPUs. 
As you can see, we got nearly-perfect linear scaling in this experiment.<\/p><p><img decoding=\"async\" vspace=\"5\" hspace=\"5\" src=\"https:\/\/blogs.mathworks.com\/images\/loren\/2013\/ClusterOptionPricing.png\" alt=\"\"> <\/p><p>How many GPUs do you use in your application, and how do they interact? Let me know in the <a href=\"https:\/\/blogs.mathworks.com\/loren\/?p=709#respond\">comments<\/a>.<\/p><script language=\"JavaScript\"> <!-- \r\n    function grabCode_dd972ea92b1841e493c7bb678a7568e9() {\r\n        \/\/ Remember the title so we can use it in the new page\r\n        title = document.title;\r\n\r\n        \/\/ Break up these strings so that their presence\r\n        \/\/ in the Javascript doesn't mess up the search for\r\n        \/\/ the MATLAB code.\r\n        t1='dd972ea92b1841e493c7bb678a7568e9 ' + '##### ' + 'SOURCE BEGIN' + ' #####';\r\n        t2='##### ' + 'SOURCE END' + ' #####' + ' dd972ea92b1841e493c7bb678a7568e9';\r\n    \r\n        b=document.getElementsByTagName('body')[0];\r\n        i1=b.innerHTML.indexOf(t1)+t1.length;\r\n        i2=b.innerHTML.indexOf(t2);\r\n \r\n        code_string = b.innerHTML.substring(i1, i2);\r\n        code_string = code_string.replace(\/REPLACE_WITH_DASH_DASH\/g,'--');\r\n\r\n        \/\/ Use \/x3C\/g instead of the less-than character to avoid errors \r\n        \/\/ in the XML parser.\r\n        \/\/ Use '\\x26#60;' instead of '<' so that the XML parser\r\n        \/\/ doesn't go ahead and substitute the less-than character. 
\r\n        code_string = code_string.replace(\/\\x3C\/g, '\\x26#60;');\r\n\r\n        copyright = 'Copyright 2013 The MathWorks, Inc.';\r\n\r\n        w = window.open();\r\n        d = w.document;\r\n        d.write('<pre>\\n');\r\n        d.write(code_string);\r\n\r\n        \/\/ Add copyright line at the bottom if specified.\r\n        if (copyright.length > 0) {\r\n            d.writeln('');\r\n            d.writeln('%%');\r\n            if (copyright.length > 0) {\r\n                d.writeln('% _' + copyright + '_');\r\n            }\r\n        }\r\n\r\n        d.write('<\/pre>\\n');\r\n\r\n        d.title = title + ' (MATLAB code)';\r\n        d.close();\r\n    }   \r\n     --> <\/script><p style=\"text-align: right; font-size: xx-small; font-weight:lighter;   font-style: italic; color: gray\"><br><a href=\"javascript:grabCode_dd972ea92b1841e493c7bb678a7568e9()\"><span style=\"font-size: x-small;        font-style: italic;\">Get \r\n      the MATLAB code <noscript>(requires JavaScript)<\/noscript><\/span><\/a><br><br>\r\n      Published with MATLAB&reg; R2013a<br><\/p><\/div><!--\r\ndd972ea92b1841e493c7bb678a7568e9 ##### SOURCE BEGIN #####\r\n%% Running Monte Carlo Simulations on Multiple GPUs\r\n% Today I'd like to introduce James Lebak. James is a developer who works\r\n% on GPU support in the Parallel Computing Toolbox.\r\n\r\n%% Basic Option Pricing on the GPU\r\n% One common use of multi-GPU systems is to perform Monte Carlo\r\n% simulations. The use of more GPUs can increase the number of samples that\r\n% can be simulated, resulting in a more accurate simulation. 
\r\n%\r\n% Let's begin by looking at the basic option pricing simulation from the\r\n% <https:\/\/www.mathworks.com\/products\/parallel-computing Parallel Computing Toolbox> \r\n% example \r\n% <https:\/\/www.mathworks.com\/help\/distcomp\/examples\/exotic-option-pricing-on-a-gpu-using-a-monte-carlo-method.html \"Exotic Option Pricing on a GPU using a Monte-Carlo Method\">. \r\n% In this example, we run many concurrent simulations of the evolution of\r\n% the stock price. The mean and distribution of the simulation outputs\r\n% gives us a sense of the ultimate value of the stock.\r\n%\r\n% The function |simulateStockPrice| that describes the stock price is a\r\n% discretization of a stochastic differential equation. The equation\r\n% assumes that prices evolve according to a log-normal distribution related\r\n% to the risk-free interest rate, the dividend yield (if any), and the\r\n% volatility in the market.\r\ntype simulateStockPrice;\r\n\r\n%%\r\n% Achieving good performance on a GPU requires executing many operations in\r\n% parallel. The GPU has thousands of individual computational units, and\r\n% you get best performance from them when you execute hundreds of thousands\r\n% or millions of parallel operations. This is similar to the advice we\r\n% often give to\r\n% <https:\/\/www.mathworks.com\/help\/matlab\/matlab_prog\/vectorization.html vectorize> \r\n% functions in MATLAB for better performance. The function\r\n% |simulateStockPrice| produces a single, scalar output, so if we run it as\r\n% it is, it won't achieve good performance.\r\n\r\n%%\r\n% We can execute the simulation many times on the GPU using\r\n% <https:\/\/www.mathworks.com\/help\/distcomp\/arrayfun.html ARRAYFUN>. The\r\n% |arrayfun| function on the GPU takes a function handle and a |gpuArray|\r\n% input, and executes the function on each element of the input |gpuArray|.\r\n% This is an alternative to vectorizing the function, which would require\r\n% changing more code. 
It returns one output for each element in the input.\r\n% This is exactly what we want to do in a Monte Carlo simulation: we want\r\n% to produce many independent outputs.\r\n\r\n%%\r\n% The |runSimulationOnOneGPU| function uses |arrayfun| to perform many\r\n% simulations in parallel and returns the mean price.\r\ntype runSimulationOnOneGPU.m;\r\n\r\n%%\r\n% We execute this function with a relatively small number of samples and\r\n% display the computed mean.\r\nnSamples = 1e6;\r\nmeanFinalPrice = runSimulationOnOneGPU(nSamples);\r\ndisp(['Calculated mean final price of ', num2str(meanFinalPrice)]);\r\n\r\n%% Using Multiple GPUs on One Machine\r\n% Assume that we want to run this simulation on a desktop machine with two\r\n% GPUs. This will allow us to have more samples in the same amount of time,\r\n% which by the law of large numbers should give us a better approximation\r\n% of the mean final stock price.\r\n%\r\n% With Parallel Computing Toolbox, MATLAB can perform calculations on\r\n% multiple GPUs. In order to do so, we need to open a MATLAB pool with one\r\n% worker for each GPU. One MATLAB worker is needed to communicate with each\r\n% GPU.\r\n\r\n%%\r\n% The workers in the pool will do all the calculations, so the client\r\n% doesn't need to use a |gpuDevice|. Deselect the device on the client, so\r\n% that all the memory on all the devices is completely available to the\r\n% workers.\r\ngpuDevice([]);\r\n\r\n%%\r\n% Determine how many GPUs we have on this machine and open a local MATLAB\r\n% pool of that size. This gives us one worker for each GPU on this machine.\r\nnGPUs = gpuDeviceCount();\r\nmatlabpool('local', nGPUs);\r\n\r\n%%\r\n% The |runSimulationOnManyGPUs| function contains a |parfor| loop that\r\n% executes |nIter| times. Each iteration of the |parfor| loop executes\r\n% |nSamples| independent Monte Carlo simulations on one GPU and returns the\r\n% mean final price over all those simulations. 
The output of the overall\r\n% simulation is the mean of the individual means, because each |parfor|\r\n% iteration executes the same number of independent simulations. When the\r\n% loop finishes we have executed |nIter*nSamples| simulations.\r\ntype runSimulationOnManyGPUs.m;\r\n\r\n%% \r\n% To run |nSamples| iterations on each GPU, we call\r\n% |runSimulationOnManyGPUs| with |nIter| equal to the number of GPUs.\r\n[tout, meanFinalPrice] = runSimulationOnManyGPUs(nSamples, nGPUs);\r\ndisp(['Performing simulations on the GPU took ', num2str(tout), ' s']);\r\ndisp(['Calculated mean final price of ', num2str(meanFinalPrice)]);\r\n\r\n%%\r\n% It is important that the results are returned as regular MATLAB arrays\r\n% rather than as |gpuArrays|. If the result is returned as a |gpuArray|\r\n% then the client will share a GPU with one of the workers, which would\r\n% unnecessarily use more memory on the device and transfer data over the\r\n% PCI bus.\r\n\r\n%% Multi-GPU Execution Details\r\n% The recommended setup for MATLAB when using multiple GPUs is, as we\r\n% discussed above, to open one worker for each GPU. Let's dig in a little\r\n% more to understand the details of how the workers interact with GPUs.\r\n%\r\n% When workers share a single machine with multiple GPUs, MATLAB\r\n% automatically assigns each worker in a parallel pool to use a different\r\n% GPU by default. You can see this by using the spmd command to examine the\r\n% index of the device used by each worker.\r\n\r\nspmd\r\n    gd = gpuDevice;\r\n    idx = gd.Index;\r\n    disp(['Using GPU ',num2str(idx)]);\r\nend\r\n\r\n%%\r\n% NVIDIA GPUs can be operated in one of four compute modes: default,\r\n% exclusive thread, exclusive process, or prohibited. The compute mode is\r\n% shown in the 'ComputeMode' field of the structure returned by gpuDevice.\r\n% If a GPU is in prohibited mode, no worker will be assigned to use that\r\n% GPU. 
If a GPU is in exclusive process or exclusive thread mode, only one\r\n% MATLAB worker will attempt to access that GPU.\r\n%\r\n% It is possible for multiple workers to share the same GPU, if the GPU is\r\n% in 'Default' compute mode. When this is done, the GPU driver serializes\r\n% accesses to the device. You should be aware that there will be a\r\n% performance penalty because of the serialization. There will also be less\r\n% memory available on the GPU for each worker, because two or more workers\r\n% will be sharing the same GPU memory space. In general it will be hard to\r\n% achieve good performance when multiple workers share the same GPU.\r\n% \r\n% You can customize the way that MATLAB assigns workers to GPUs to suit\r\n% your own needs by overriding the MATLAB function |selectGPU|. See the\r\n% help for |selectGPU| for more details.\r\n\r\n%%\r\n% MATLAB initializes each worker to use a different random number stream on\r\n% the GPU by default. Doing so ensures that the values obtained by each\r\n% worker are different. In this example, we have for simplicity elected to\r\n% use the default random number generation stream on the GPU, the\r\n% well-known MRG32K3A stream. While MRG32K3A is a highly regarded algorithm\r\n% with good support for parallelism, there are other streams that you can\r\n% select that may provide better performance. The documentation page\r\n% <https:\/\/www.mathworks.com\/help\/distcomp\/using-gpuarray.html Using GPUArray> \r\n% describes the options available for controlling random number generation\r\n% on the GPU, and lists the different streams that you can select. \r\n\r\n%% Using Multiple GPUs in a Cluster\r\n% With all of these details in place, it's time to consider how to put\r\n% multiple GPUs to use when hundreds of millions or even billions of\r\n% simulations are desired. 
Such a situation might arise, for example, when\r\n% we want different values of key parameters such as volatility or the\r\n% dividend in each simulation. \r\n\r\n%%\r\n% Let's assume we want to run 800 million simulations, and that we have a\r\n% cluster with 16 NVIDIA C2050 GPUs available. The problem is that a single\r\n% GPU has a limited amount of memory. A single NVIDIA C2050 compute card\r\n% has about 3 GB of memory and can compute on the order of $5\\times10^7$\r\n% points in |runSimulationOnOneGPU| without running out of memory. We can\r\n% call |runSimulationOnOneGPU| 16 times to complete all the desired\r\n% simulations. On a single machine, doing so takes around 10 minutes.\r\n\r\n%%\r\n% We measured the time that it takes to run all $8\\times 10^8$ simulations\r\n% with up to 16 GPUs. As you can see, we got nearly-perfect linear scaling\r\n% in this experiment.\r\n%\r\n% <<ClusterOptionPricing.png>>\r\n%\r\n\r\n%%\r\n% How many GPUs do you use in your application, and how do they interact?\r\n% Let me know in the <https:\/\/blogs.mathworks.com\/loren\/?p=709#respond comments>.\r\n\r\n\r\n##### SOURCE END ##### dd972ea92b1841e493c7bb678a7568e9\r\n-->\r\n","protected":false},"excerpt":{"rendered":"<div class=\"overview-image\"><img decoding=\"async\"  class=\"img-responsive\" src=\"https:\/\/blogs.mathworks.com\/images\/loren\/2013\/ClusterOptionPricing.png\" onError=\"this.style.display ='none';\" \/><\/div><!--introduction--><p>Today I'd like to introduce James Lebak. James is a developer who works on GPU support in the Parallel Computing Toolbox.... 
<a class=\"read-more\" href=\"https:\/\/blogs.mathworks.com\/loren\/2013\/06\/24\/running-monte-carlo-simulations-on-multiple-gpus\/\">read more >><\/a><\/p>","protected":false},"author":39,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":[],"categories":[55,34],"tags":[],"_links":{"self":[{"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/posts\/709"}],"collection":[{"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/users\/39"}],"replies":[{"embeddable":true,"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/comments?post=709"}],"version-history":[{"count":10,"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/posts\/709\/revisions"}],"predecessor-version":[{"id":1908,"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/posts\/709\/revisions\/1908"}],"wp:attachment":[{"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/media?parent=709"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/categories?post=709"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/tags?post=709"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}