{"id":346,"date":"2012-02-06T11:41:38","date_gmt":"2012-02-06T16:41:38","guid":{"rendered":"https:\/\/blogs.mathworks.com\/loren\/?p=346"},"modified":"2018-10-02T10:46:45","modified_gmt":"2018-10-02T15:46:45","slug":"using-gpus-in-matlab","status":"publish","type":"post","link":"https:\/\/blogs.mathworks.com\/loren\/2012\/02\/06\/using-gpus-in-matlab\/","title":{"rendered":"Using GPUs in MATLAB"},"content":{"rendered":"<div xmlns:mwsh=\"https:\/\/www.mathworks.com\/namespace\/mcode\/v1\/syntaxhighlight.dtd\" class=\"content\">\r\n   <introduction>\r\n      <p>Today I&#8217;d like to introduce guest blogger <a href=\"mailto:sarah.zaranek@mathworks.com\">Sarah Wait Zaranek<\/a> who works for the MATLAB Marketing team. Sarah previously has <a href=\"https:\/\/blogs.mathworks.com\/loren\/2008\/06\/25\/speeding-up-matlab-applications\/\">written<\/a> about speeding up code from a customer to get acceptable performance. She will be discussing how to use GPUs to accelerate\r\n         your MATLAB applications.\r\n      <\/p>\r\n   <\/introduction>\r\n   <h3>Contents<\/h3>\r\n   <div>\r\n      <ul>\r\n         <li><a href=\"#1\">Overview<\/a><\/li>\r\n         <li><a href=\"#2\">GPU Background<\/a><\/li>\r\n         <li><a href=\"#5\">Learning About Our GPU<\/a><\/li>\r\n         <li><a href=\"#7\">A Simple Example using Overloaded Functions<\/a><\/li>\r\n         <li><a href=\"#12\">Speedup For Our Simple Example<\/a><\/li>\r\n         <li><a href=\"#14\">Understanding and Limiting Your Data Transfer Overhead<\/a><\/li>\r\n         <li><a href=\"#16\">Solving the Wave Equation<\/a><\/li>\r\n         <li><a href=\"#20\">Code Changes to Run Algorithm on GPU<\/a><\/li>\r\n         <li><a href=\"#21\">Code Changes to Transfer Data to the GPU and Back Again<\/a><\/li>\r\n         <li><a href=\"#24\">Comparing CPU and GPU Execution Speeds<\/a><\/li>\r\n         <li><a href=\"#27\">Other Ways to Use the GPU with MATLAB<\/a><\/li>\r\n         <li><a href=\"#28\">Your 
Thoughts?<\/a><\/li>\r\n      <\/ul>\r\n   <\/div>\r\n   <h3>Overview<a name=\"1\"><\/a><\/h3>\r\n   <p>In this post, we will first introduce the basics of using the GPU with MATLAB and then move on to solving a 2nd-order wave\r\n      equation using this GPU functionality. This blog post is inspired by a recent MATLAB Digest <a href=\"https:\/\/www.mathworks.com\/company\/newsletters\/articles\/gpu-programming-in-matlab.html\">article<\/a> on GPU Computing that I coauthored with one of our developers, Jill Reese. Since the original demo was made, the GPU functions\r\n      available in MATLAB have grown. If you compare the code below to the code in the paper, they are slightly different, reflecting\r\n      these new capabilities. Additionally, small changes were made to enable easier explanation of the code in this blog format.\r\n      The GPU functionality shown in this post requires the Parallel Computing Toolbox.\r\n   <\/p>\r\n   <h3>GPU Background<a name=\"2\"><\/a><\/h3>\r\n   <p>Originally used to accelerate graphics rendering, GPUs are now increasingly applied to scientific calculations. Unlike a traditional\r\n      CPU, which includes no more than a handful of cores, a GPU has a massively parallel array of integer and floating-point processors,\r\n      as well as dedicated, high-speed memory. A typical GPU comprises hundreds of these smaller processors.  These processors can\r\n      be used to greatly speed up particular types of applications.\r\n   <\/p>\r\n   <p>A good rule of thumb is that your problem may be a good fit for the GPU if it is:<\/p>\r\n   <div>\r\n      <ul>\r\n         <li>Massively parallel: The computations can be broken down into hundreds or thousands of independent units of work.  You will\r\n            see the best performance when all of the cores are kept busy, exploiting the inherent parallel nature of the GPU. 
Seemingly\r\n            simple, vectorized MATLAB calculations on arrays with hundreds of thousands of elements often can fit into this category.\r\n         <\/li>\r\n      <\/ul>\r\n   <\/div>\r\n   <div>\r\n      <ul>\r\n         <li>Computationally intensive: The time spent on computation significantly exceeds the time spent on transferring data to and\r\n            from GPU memory. Because a GPU is attached to the host CPU via the PCI Express bus, the memory access is slower than with\r\n            a traditional CPU. This means that your overall computational speedup is limited by the amount of data transfer that occurs\r\n            in your algorithm.\r\n         <\/li>\r\n      <\/ul>\r\n   <\/div>\r\n   <p>Applications that do not satisfy these criteria might actually run more slowly on a GPU than on a CPU.<\/p>\r\n   <h3>Learning About Our GPU<a name=\"5\"><\/a><\/h3>\r\n   <p>With that background, we can now start working with the GPU in MATLAB. Let's start first by querying our GPU to see just what\r\n      we are working with:\r\n   <\/p><pre style=\"background: #F9F7F3; padding: 10px; border: 1px solid rgb(200,200,200)\">gpuDevice<\/pre><pre style=\"font-style:oblique\">ans = \r\n  parallel.gpu.CUDADevice handle\r\n  Package: parallel.gpu\r\n\r\n  Properties:\r\n                      Name: 'Tesla C2050 \/ C2070'\r\n                     Index: 1\r\n         ComputeCapability: '2.0'\r\n            SupportsDouble: 1\r\n             DriverVersion: 4\r\n        MaxThreadsPerBlock: 1024\r\n          MaxShmemPerBlock: 49152\r\n        MaxThreadBlockSize: [1024 1024 64]\r\n               MaxGridSize: [65535 65535]\r\n                 SIMDWidth: 32\r\n               TotalMemory: 3.1820e+009\r\n                FreeMemory: 2.6005e+009\r\n       MultiprocessorCount: 14\r\n              ClockRateKHz: 1147000\r\n               ComputeMode: 'Default'\r\n      GPUOverlapsTransfers: 1\r\n    KernelExecutionTimeout: 1\r\n          CanMapHostMemory: 1\r\n           
DeviceSupported: 1\r\n            DeviceSelected: 1\r\n<\/pre><p>We are running on a Tesla C2050.  Currently, Parallel Computing Toolbox supports NVIDIA GPUs with Compute Capability 1.3 or\r\n      greater. <a href=\"http:\/\/www.nvidia.com\/object\/cuda_gpus.html\">Here<\/a> is a link to see if your card is supported.\r\n   <\/p>\r\n   <h3>A Simple Example using Overloaded Functions<a name=\"7\"><\/a><\/h3>\r\n   <p>Over 100 operations (e.g. <tt>fft<\/tt>, <tt>ifft<\/tt>, <tt>eig<\/tt>) are now available as built-in MATLAB functions that can be executed directly on the GPU by providing an input argument of\r\n      the type GPUArray. These GPU-enabled functions are overloaded&#8212;in other words, they operate differently depending on the data\r\n      type of the arguments passed to them. <a href=\"https:\/\/www.mathworks.com\/help\/releases\/R2011b\/toolbox\/distcomp\/bsic4fr-1.html\">Here<\/a> is a list of all the overloaded functions.\r\n   <\/p>\r\n   <p>Let's create a GPUArray and perform an <tt>fft<\/tt> using the GPU. However, let's first do this on the CPU so that we can see the difference in code and performance.\r\n   <\/p><pre style=\"background: #F9F7F3; padding: 10px; border: 1px solid rgb(200,200,200)\">A1 = rand(3000,3000);\r\ntic;\r\nB1 = fft(A1);\r\ntime1 = toc;<\/pre><p>To perform the same operation on the GPU, we first use <tt>gpuArray<\/tt> to transfer data from the MATLAB workspace to device memory. Then we can run <tt>fft<\/tt>, which is one of the overloaded functions, on that data:\r\n   <\/p><pre style=\"background: #F9F7F3; padding: 10px; border: 1px solid rgb(200,200,200)\">A2 = gpuArray(A1);\r\ntic;\r\nB2 = fft(A2);\r\ntime2 = toc;<\/pre><p>The second case is executed on the GPU rather than the CPU since its input (a GPUArray) is held on the GPU. The result, <tt>B2<\/tt>, is stored on the GPU. However, it is still visible in the MATLAB workspace. 
By running <tt>class(B2)<\/tt>, we can see that it is also a GPUArray.\r\n   <\/p><pre style=\"background: #F9F7F3; padding: 10px; border: 1px solid rgb(200,200,200)\">class(B2)<\/pre><pre style=\"font-style:oblique\">ans =\r\nparallel.gpu.GPUArray\r\n<\/pre><p>To bring the data back to the CPU, we run <tt>gather<\/tt>.\r\n   <\/p><pre style=\"background: #F9F7F3; padding: 10px; border: 1px solid rgb(200,200,200)\">B2 = gather(B2);\r\nclass(B2)<\/pre><pre style=\"font-style:oblique\">ans =\r\ndouble\r\n<\/pre><p><tt>B2<\/tt> is now on the CPU and has a class of double.\r\n   <\/p>\r\n   <h3>Speedup For Our Simple Example<a name=\"12\"><\/a><\/h3>\r\n   <p>In this simple example, we can calculate the speedup of performing the <tt>fft<\/tt> on our data.\r\n   <\/p><pre style=\"background: #F9F7F3; padding: 10px; border: 1px solid rgb(200,200,200)\">speedUp = time1\/time2;\r\ndisp(speedUp)<\/pre><pre style=\"font-style:oblique\">    3.6122\r\n<\/pre><p>It looks like our <tt>fft<\/tt> is running ~3.6x faster on the GPU. This is pretty good, especially since <tt>fft<\/tt> is multi-threaded in core MATLAB. However, to perform a realistic comparison, we should really include the time spent transferring\r\n      the data to and from the GPU. If we do that, we find that our acceleration is greatly reduced. 
Let's see what happens if\r\n      we include the time to do our data transfer.\r\n   <\/p><pre style=\"background: #F9F7F3; padding: 10px; border: 1px solid rgb(200,200,200)\">tic;\r\nA3 = gpuArray(A1);\r\nB3 = fft(A3);\r\nB3 = gather(B3);\r\ntime3 = toc;\r\n\r\nspeedUp = time1\/time3;\r\ndisp(speedUp)<\/pre><pre style=\"font-style:oblique\">    0.5676\r\n<\/pre><h3>Understanding and Limiting Your Data Transfer Overhead<a name=\"14\"><\/a><\/h3>\r\n   <p>Data transfer overhead can become so significant that it degrades the application's overall performance, especially if you\r\n      repeatedly exchange data between the CPU and GPU to execute relatively few computationally intensive operations. However, not\r\n      all hope is lost!  By limiting the data transfer between the GPU and the CPU, we can still achieve speedup.\r\n   <\/p>\r\n   <p>Instead of creating the data on the CPU and transferring it to the GPU, we can just create it on the GPU directly.  There are\r\n      several <a href=\"https:\/\/www.mathworks.com\/help\/releases\/R2011b\/toolbox\/distcomp\/bsic4fr-1.html\">constructors<\/a> available as well as constructor-like functions such as <tt>meshgrid<\/tt>. Let's see how that affects our time.  To be accurate, we should also retime the original serial code, including the time\r\n      it takes to generate the random matrix on the CPU.\r\n   <\/p><pre style=\"background: #F9F7F3; padding: 10px; border: 1px solid rgb(200,200,200)\">tic;\r\nA4 = rand(3000,3000);\r\nB4 = fft(A4);\r\ntime4 = toc;\r\n\r\ntic;\r\nA5 = parallel.gpu.GPUArray.rand(3000,3000);\r\nB5 = fft(A5);\r\nB5 = gather(B5);\r\ntime5 = toc;\r\n\r\nspeedUp = time4\/time5;\r\ndisp(speedUp);<\/pre><pre style=\"font-style:oblique\">    1.4100\r\n<\/pre><p>This is better, although we still do see the influence of gathering the data back from the GPU.  In this simple demo, the\r\n      effect is exaggerated because we are running a single, already fast operation on the GPU. 
It is more efficient to perform\r\n      several operations on the data while it is on the GPU, bringing the data back to the CPU only when required.\r\n   <\/p>\r\n   <h3>Solving the Wave Equation<a name=\"16\"><\/a><\/h3>\r\n   <p><img decoding=\"async\" vspace=\"5\" hspace=\"5\" src=\"https:\/\/blogs.mathworks.com\/images\/loren\/346\/WaveEquation.gif\"> <\/p>\r\n   <p>To put the above example into context, let's use the same GPU functionality on a more \"real-world\" problem. To do this, we\r\n      are going to solve a second-order wave equation:\r\n   <\/p>\r\n   <p><img decoding=\"async\" vspace=\"5\" hspace=\"5\" src=\"https:\/\/blogs.mathworks.com\/images\/loren\/346\/UsingGPUs_eq30596.png\"> <\/p>\r\n   <p>The algorithm we use to solve the wave equation combines a spectral method in space and a second-order central finite difference\r\n      method in time. Specifically, we apply the Chebyshev spectral method, which uses Chebyshev polynomials as the basis functions.\r\n      Our example is largely based on an example in Trefethen's book: <a title=\"https:\/\/www.mathworks.com\/academia\/books\/book48110.html (link no longer works)\">\"Spectral Methods in MATLAB\"<\/a>.<\/p>\r\n   <h3>Code Changes to Run Algorithm on GPU<a name=\"20\"><\/a><\/h3>\r\n   <p>When accelerating our algorithm, we focus on speeding up the code within the main time-stepping while-loop. The operations\r\n      in that part of the code (e.g. <tt>fft<\/tt> and <tt>ifft<\/tt>, matrix multiplication) are all overloaded functions that work with the GPU. As a result, we do not need to change the algorithm\r\n      in any way to execute it on a GPU. 
We simply transfer the data to the GPU and back to the CPU when finished.\r\n   <\/p><pre style=\"background: #F9F7F3; padding: 10px; border: 1px solid rgb(200,200,200)\">type(<span style=\"color: #A020F0\">'stepSolution.m'<\/span>)<\/pre><pre style=\"font-style:oblique\">\r\nfunction [vv,vvold] = stepsolution(vv,vvold,ii,N,dt,W1T,W2T,W3T,W4T,...\r\n                                        WuxxT1,WuxxT2,WuyyT1,WuyyT2)\r\n    V = [vv(ii,:) vv(ii,N:-1:2)];\r\n    U = real(fft(V.')).';\r\n    \r\n    W1test = (U.*W1T).';\r\n    W2test = (U.*W2T).';\r\n    W1 = (real(ifft(W1test))).';\r\n    W2 = (real(ifft(W2test))).';\r\n    \r\n    % Calculating 2nd derivative in x\r\n    uxx(ii,ii) = W2(:,ii).* WuxxT1 - W1(:,ii).*WuxxT2;\r\n    uxx([1,N+1],[1,N+1]) = 0;\r\n    \r\n    V = [vv(:,ii); vv((N:-1:2),ii)];\r\n    U = real(fft(V));\r\n    \r\n    W1 = real(ifft(U.*W3T));\r\n    W2 = real(ifft(U.*W4T));\r\n    \r\n    % Calculating 2nd derivative in y\r\n    uyy(ii,ii) = W2(ii,:).* WuyyT1 - W1(ii,:).*WuyyT2;\r\n    uyy([1,N+1],[1,N+1]) = 0;\r\n    \r\n    % Computing new value using 2nd order central finite difference in\r\n    % time\r\n    vvnew = 2*vv - vvold + dt*dt*(uxx+uyy);\r\n    vvold = vv; vv = vvnew;\r\n\r\nend\r\n\r\n\r\n<\/pre><h3>Code Changes to Transfer Data to the GPU and Back Again<a name=\"21\"><\/a><\/h3>\r\n   <p>The variables <tt>dt<\/tt>, <tt>x<\/tt>, <tt>index1<\/tt>, and <tt>index2<\/tt> are calculated on the CPU and transferred over to the GPU using <tt>gpuArray<\/tt>.\r\n   <\/p>\r\n   <p>For example:<\/p><pre style=\"background: #F9F7F3; padding: 10px; border: 1px solid rgb(200,200,200)\">N = 256;\r\nindex1 = 1i*[0:N-1 0 1-N:-1];\r\n\r\nindex1 = gpuArray(index1);<\/pre><p>For the rest of the variables, we create them directly on the GPU using the overloaded versions of the transpose operator\r\n      (<tt>'<\/tt>), <tt>repmat<\/tt>, and <tt>meshgrid<\/tt>.\r\n   <\/p>\r\n   <p>For example:<\/p><pre style=\"background: #F9F7F3; padding: 
10px; border: 1px solid rgb(200,200,200)\">W1T = repmat(index1,N-1,1);<\/pre><p>When we are done with all our calculations on the GPU, we use <tt>gather<\/tt> once to bring data back from the GPU. Note that we don't have to transfer our data back to the CPU between time steps. This\r\n      all comes together to look like this:\r\n   <\/p><pre style=\"background: #F9F7F3; padding: 10px; border: 1px solid rgb(200,200,200)\">type <span style=\"color: #A020F0\">WaveGPU.m<\/span><\/pre><pre style=\"font-style:oblique\">\r\nfunction vvg = WaveGPU(N,Nsteps)\r\n%% Solving 2nd Order Wave Equation Using Spectral Methods\r\n% This example solves a 2nd order wave equation: utt = uxx + uyy, with u =\r\n% 0 on the boundaries. It uses a 2nd order central finite difference in\r\n% time and a Chebyshev spectral method in space (using FFT). \r\n%\r\n% The code has been modified from an example in Spectral Methods in MATLAB\r\n% by Trefethen, Lloyd N.\r\n\r\n% Points in X and Y\r\nx = cos(pi*(0:N)\/N); % using Chebyshev points\r\n\r\n% Send x to the GPU\r\nx = gpuArray(x);\r\ny = x';\r\n\r\n% Calculating time step\r\ndt = 6\/N^2;\r\n\r\n% Setting up grid\r\n[xx,yy] = meshgrid(x,y);\r\n\r\n% Calculate initial values\r\nvv = exp(-40*((xx-.4).^2 + yy.^2));\r\nvvold = vv;\r\n\r\nii = 2:N;\r\nindex1 = 1i*[0:N-1 0 1-N:-1];\r\nindex2 = -[0:N 1-N:-1].^2;\r\n\r\n% Sending data to the GPU\r\ndt = gpuArray(dt);\r\nindex1 = gpuArray(index1);\r\nindex2 = gpuArray(index2);\r\n\r\n% Creating weights used for spectral differentiation \r\nW1T = repmat(index1,N-1,1);\r\nW2T = repmat(index2,N-1,1);\r\nW3T = repmat(index1.',1,N-1);\r\nW4T = repmat(index2.',1,N-1);\r\n\r\nWuxxT1 = repmat((1.\/(1-x(ii).^2)),N-1,1);\r\nWuxxT2 = repmat(x(ii).\/(1-x(ii).^2).^(3\/2),N-1,1);\r\nWuyyT1 = repmat(1.\/(1-y(ii).^2),1,N-1);\r\nWuyyT2 = repmat(y(ii).\/(1-y(ii).^2).^(3\/2),1,N-1);\r\n\r\n% Start time-stepping\r\nn = 0;\r\n\r\nwhile n &lt; Nsteps\r\n    [vv,vvold] = 
stepsolution(vv,vvold,ii,N,dt,W1T,W2T,W3T,W4T,...\r\n                                 WuxxT1,WuxxT2,WuyyT1,WuyyT2);\r\n    n = n + 1;\r\nend\r\n\r\n\r\n% Gather vvg back from GPU memory when done\r\nvvg = gather(vv);\r\n\r\n\r\n<\/pre><h3>Comparing CPU and GPU Execution Speeds<a name=\"24\"><\/a><\/h3>\r\n   <p>We ran a benchmark in which we measured the amount of time it took to execute 50 time steps for grid sizes of 64, 128, 512,\r\n      1024, and 2048 on an Intel Xeon Processor X5650 and then on an NVIDIA Tesla C2050 GPU.\r\n   <\/p>\r\n   <p><img decoding=\"async\" vspace=\"5\" hspace=\"5\" src=\"https:\/\/blogs.mathworks.com\/images\/loren\/346\/Benchmark.png\"> <\/p>\r\n   <p>For a grid size of 2048, the algorithm shows a 7.5x decrease in compute time from more than a minute on the CPU to less than\r\n      10 seconds on the GPU.  The log scale plot shows that the CPU is actually faster for small grid sizes. As the technology evolves\r\n      and matures, however, GPU solutions are increasingly able to handle smaller problems, a trend that we expect to continue.\r\n   <\/p>\r\n   <h3>Other Ways to Use the GPU with MATLAB<a name=\"27\"><\/a><\/h3>\r\n   <p>To accelerate an algorithm with multiple element-wise operations on a GPU, you can use <tt>arrayfun<\/tt>, which applies a function to each element of a GPUArray. Additionally, if you have your own CUDA code, you can use the CUDAKernel\r\n      interface to integrate this code with MATLAB.\r\n   <\/p>\r\n   <p>Ben Tordoff's previous guest blog post <a href=\"https:\/\/blogs.mathworks.com\/loren\/2011\/07\/18\/a-mandelbrot-set-on-the-gpu\/\">A Mandelbrot Set on the GPU<\/a> shows a great example using both of these capabilities.\r\n   <\/p>\r\n<h3>Your Thoughts?<a name=\"28\"><\/a><\/h3>\r\n   <p>Have you experimented with our new GPU functionality?  
Let us know what you think or if you have any questions by leaving\r\n      a comment for this post <a href=\"https:\/\/blogs.mathworks.com\/loren\/?p=346#respond\">here<\/a>.\r\n   <\/p><script language=\"JavaScript\">\r\n<!--\r\n\r\n    function grabCode_43a6565ff583451ea3dcbf9a94261aab() {\r\n        \/\/ Remember the title so we can use it in the new page\r\n        title = document.title;\r\n\r\n        \/\/ Break up these strings so that their presence\r\n        \/\/ in the Javascript doesn't mess up the search for\r\n        \/\/ the MATLAB code.\r\n        t1='43a6565ff583451ea3dcbf9a94261aab ' + '##### ' + 'SOURCE BEGIN' + ' #####';\r\n        t2='##### ' + 'SOURCE END' + ' #####' + ' 43a6565ff583451ea3dcbf9a94261aab';\r\n    \r\n        b=document.getElementsByTagName('body')[0];\r\n        i1=b.innerHTML.indexOf(t1)+t1.length;\r\n        i2=b.innerHTML.indexOf(t2);\r\n \r\n        code_string = b.innerHTML.substring(i1, i2);\r\n        code_string = code_string.replace(\/REPLACE_WITH_DASH_DASH\/g,'--');\r\n\r\n        \/\/ Use \/x3C\/g instead of the less-than character to avoid errors \r\n        \/\/ in the XML parser.\r\n        \/\/ Use '\\x26#60;' instead of '<' so that the XML parser\r\n        \/\/ doesn't go ahead and substitute the less-than character. 
\r\n        code_string = code_string.replace(\/\\x3C\/g, '\\x26#60;');\r\n\r\n        author = 'Sarah Wait Zaranek';\r\n        copyright = 'Copyright 2012 The MathWorks, Inc.';\r\n\r\n        w = window.open();\r\n        d = w.document;\r\n        d.write('<pre>\\n');\r\n        d.write(code_string);\r\n\r\n        \/\/ Add author and copyright lines at the bottom if specified.\r\n        if ((author.length > 0) || (copyright.length > 0)) {\r\n            d.writeln('');\r\n            d.writeln('%%');\r\n            if (author.length > 0) {\r\n                d.writeln('% _' + author + '_');\r\n            }\r\n            if (copyright.length > 0) {\r\n                d.writeln('% _' + copyright + '_');\r\n            }\r\n        }\r\n\r\n        d.write('<\/pre>\\n');\r\n      \r\n      d.title = title + ' (MATLAB code)';\r\n      d.close();\r\n      }   \r\n      \r\n-->\r\n<\/script><p style=\"text-align: right; font-size: xx-small; font-weight:lighter;   font-style: italic; color: gray\"><br><a href=\"javascript:grabCode_43a6565ff583451ea3dcbf9a94261aab()\"><span style=\"font-size: x-small;        font-style: italic;\">Get \r\n            the MATLAB code \r\n            <noscript>(requires JavaScript)<\/noscript><\/span><\/a><br><br>\r\n      Published with MATLAB&reg; 7.13<br><\/p>\r\n<\/div>\r\n<!--\r\n43a6565ff583451ea3dcbf9a94261aab ##### SOURCE BEGIN #####\r\n%% Using GPUs in MATLAB\r\n% Today I'd like to introduce guest blogger\r\n% <mailto:sarah.zaranek@mathworks.com Sarah Wait Zaranek> who works for the\r\n% MATLAB Marketing team. Sarah previously has\r\n% <https:\/\/blogs.mathworks.com\/loren\/2008\/06\/25\/speeding-up-matlab-applications\/\r\n% written> about speeding up code from a customer to get acceptable\r\n% performance. 
She will be discussing how to use GPUs to\r\n% accelerate your MATLAB applications.\r\n\r\n%% Overview  \r\n% In this post, we will first introduce the basics of using the GPU with\r\n% MATLAB and then move on to solving a 2nd-order wave equation using this\r\n% GPU functionality. This blog post is inspired by a recent MATLAB Digest\r\n% <https:\/\/www.mathworks.com\/company\/newsletters\/articles\/gpu-programming-in-matlab.html\r\n% article> on GPU Computing that I coauthored with one of our developers,\r\n% Jill Reese. Since the original demo was made, the GPU functions available\r\n% in MATLAB have grown. If you compare the code below to the code in the\r\n% paper, they are slightly different, reflecting these new capabilities.\r\n% Additionally, small changes were made to enable easier explanation of the\r\n% code in this blog format. The GPU functionality shown in this\r\n% post requires the Parallel Computing Toolbox.\r\n%\r\n%% GPU Background\r\n% Originally used to accelerate graphics rendering, GPUs are now\r\n% increasingly applied to scientific calculations. Unlike a traditional\r\n% CPU, which includes no more than a handful of cores, a GPU has a\r\n% massively parallel array of integer and floating-point processors, as\r\n% well as dedicated, high-speed memory. A typical GPU comprises hundreds of\r\n% these smaller processors.  These processors can be used to greatly\r\n% speed up particular types of applications.\r\n%\r\n% A good rule of thumb is that your problem may be a good fit for the GPU\r\n% if it is:\r\n%%\r\n% * Massively parallel: \r\n% The computations can be broken down into hundreds or thousands of\r\n% independent units of work.  You will see the best performance when all of\r\n% the cores are kept busy, exploiting the inherent parallel nature of the\r\n% GPU. Seemingly simple, vectorized MATLAB calculations on arrays with \r\n% hundreds of thousands of elements often can fit into this category. 
\r\n%%\r\n% * Computationally intensive: \r\n% The time spent on computation significantly exceeds the time spent on\r\n% transferring data to and from GPU memory. Because a GPU is attached to\r\n% the host CPU via the PCI Express bus, the memory access is slower than\r\n% with a traditional CPU. This means that your overall computational\r\n% speedup is limited by the amount of data transfer that occurs in your\r\n% algorithm.\r\n%\r\n% Applications that do not satisfy these criteria might actually run more\r\n% slowly on a GPU than on a CPU.\r\n\r\n%% Learning About Our GPU\r\n% With that background, we can now start working with the GPU in MATLAB.\r\n% Let's start first by querying our GPU to see just what we are working\r\n% with:\r\n\r\ngpuDevice\r\n\r\n%%\r\n% We are running on a Tesla C2050.  Currently, Parallel Computing Toolbox\r\n% supports NVIDIA GPUs with Compute Capability 1.3 or greater. <http:\/\/www.nvidia.com\/object\/cuda_gpus.html\r\n% Here> is a link to see if your card is supported. \r\n\r\n%% A Simple Example using Overloaded Functions \r\n% Over 100 operations (e.g. |fft|, |ifft|, |eig|) are now available as\r\n% built-in MATLAB functions that can be executed directly on the GPU by\r\n% providing an input argument of the type GPUArray. These GPU-enabled\r\n% functions are overloaded\u2014in other words, they operate differently\r\n% depending on the data type of the arguments passed to them.\r\n% <https:\/\/www.mathworks.com\/help\/releases\/R2011b\/toolbox\/distcomp\/bsic4fr-1.html\r\n% Here> is a list of all the overloaded functions.\r\n%\r\n% Let's create a GPUArray and perform an |fft| using the GPU. However,\r\n% let's first do this on the CPU so that we can see the difference in code\r\n% and performance.\r\n\r\nA1 = rand(3000,3000); \r\ntic;\r\nB1 = fft(A1);\r\ntime1 = toc;\r\n\r\n%%\r\n% To perform the same operation on the GPU, we first use |gpuArray|\r\n% to transfer data from the MATLAB workspace to device memory. 
Then\r\n% we can run |fft|, which is one of the overloaded functions, on that data:\r\n\r\nA2 = gpuArray(A1); \r\ntic;\r\nB2 = fft(A2);\r\ntime2 = toc;\r\n\r\n%%\r\n% The second case is executed on the GPU rather than the CPU since its\r\n% input (a GPUArray) is held on the GPU. The result, |B2|, is stored on the \r\n% GPU. However, it is still visible in the MATLAB workspace. By running \r\n% |class(B2)|, we can see that it is also a GPUArray.\r\n\r\nclass(B2)\r\n\r\n%%\r\n% To bring the data back to the CPU, we run |gather|.\r\n\r\nB2 = gather(B2);\r\nclass(B2)\r\n\r\n%%\r\n% |B2| is now on the CPU and has a class of double. \r\n\r\n%% Speedup For Our Simple Example\r\n%\r\n% In this simple example, we can calculate the speedup of performing the\r\n% |fft| on our data.  \r\n%\r\nspeedUp = time1\/time2;\r\ndisp(speedUp)\r\n\r\n%%\r\n% It looks like our |fft| is running ~3.6x faster on the GPU. This is\r\n% pretty good, especially since |fft| is multi-threaded in core MATLAB.\r\n% However, to perform a realistic comparison, we should really include the\r\n% time spent transferring the data to and from the GPU. If we do that,\r\n% we find that our acceleration is greatly reduced. Let's see what\r\n% happens if we include the time to do our data transfer.\r\n\r\ntic;\r\nA3 = gpuArray(A1); \r\nB3 = fft(A3);\r\nB3 = gather(B3);\r\ntime3 = toc;\r\n\r\nspeedUp = time1\/time3;\r\ndisp(speedUp)\r\n\r\n%% Understanding and Limiting Your Data Transfer Overhead \r\n%\r\n% Data transfer overhead can become so significant that it degrades the\r\n% application's overall performance, especially if you repeatedly exchange\r\n% data between the CPU and GPU to execute relatively few computationally\r\n% intensive operations. However, not all hope is lost!  By limiting the data\r\n% transfer between the GPU and the CPU, we can still achieve speedup.\r\n%\r\n% Instead of creating the data on the CPU and transferring it to the GPU,\r\n% we can just create it on the GPU directly.  
There are several \r\n% <https:\/\/www.mathworks.com\/help\/releases\/R2011b\/toolbox\/distcomp\/bsic4fr-1.html\r\n% constructors> available as well as constructor-like functions such as |meshgrid|.\r\n% Let's see how that affects our time.  To be accurate, we should also\r\n% retime the original serial code, including the time it takes to generate\r\n% the random matrix on the CPU.\r\n\r\ntic;\r\nA4 = rand(3000,3000); \r\nB4 = fft(A4);\r\ntime4 = toc;\r\n\r\ntic;\r\nA5 = parallel.gpu.GPUArray.rand(3000,3000); \r\nB5 = fft(A5);\r\nB5 = gather(B5);\r\ntime5 = toc;\r\n\r\nspeedUp = time4\/time5;\r\ndisp(speedUp);\r\n\r\n%%\r\n% This is better, although we still do see the influence of gathering the\r\n% data back from the GPU.  In this simple demo, the effect is exaggerated\r\n% because we are running a single, already fast operation on the GPU. It is\r\n% more efficient to perform several operations on the data while it is on\r\n% the GPU, bringing the data back to the CPU only when required. \r\n\r\n%% Solving the Wave Equation \r\n% <<WaveEquation.gif>>\r\n%%\r\n% To put the above example into context, let's use the same GPU\r\n% functionality on a more \"real-world\" problem. To do this, we are going\r\n% to solve a second-order wave equation:\r\n%%\r\n% $$ \\frac{\\partial^2 u}{\\partial t^2} = \\frac{\\partial^2 u}{\\partial x^2} + \\frac{\\partial^2 u}{\\partial y^2} $$\r\n\r\n%%\r\n% The algorithm we use to solve the wave equation combines a spectral\r\n% method in space and a second-order central finite difference method in\r\n% time. Specifically, we apply the Chebyshev spectral method, which uses\r\n% Chebyshev polynomials as the basis functions. 
Our example is largely\r\n% based on an example in Trefethen's book:\r\n% <https:\/\/www.mathworks.com\/academia\/books\/book48110.html \"Spectral Methods\r\n% in MATLAB\">\r\n%\r\n%% Code Changes to Run Algorithm on GPU\r\n% When accelerating our algorithm, we focus on speeding up the code within\r\n% the main time-stepping while-loop. The operations in that part of the\r\n% code (e.g. |fft| and |ifft|, matrix multiplication) are all overloaded\r\n% functions that work with the GPU. As a result, we do not need to change\r\n% the algorithm in any way to execute it on a GPU. We simply\r\n% transfer the data to the GPU and back to the CPU when finished.\r\n\r\ntype('stepSolution.m')\r\n\r\n%% Code Changes to Transfer Data to the GPU and Back Again\r\n% The variables |dt|, |x|, |index1|, and |index2| are\r\n% calculated on the CPU and transferred over to the GPU using |gpuArray|.\r\n%\r\n% For example:\r\n\r\nN = 256;  \r\nindex1 = 1i*[0:N-1 0 1-N:-1];\r\n\r\nindex1 = gpuArray(index1);\r\n\r\n%%\r\n% For the rest of the variables, we create them directly on the GPU using\r\n% the overloaded versions of the transpose operator (|'|), |repmat|, and\r\n% |meshgrid|. \r\n%\r\n% For example:\r\n\r\nW1T = repmat(index1,N-1,1);\r\n\r\n%%\r\n% When we are done with all our calculations on the GPU, we use |gather|\r\n% once to bring data back from the GPU. Note that we don't have to\r\n% transfer our data back to the CPU between time steps. 
This all comes\r\n% together to look like this:\r\n\r\ntype WaveGPU.m\r\n\r\n%% Comparing CPU and GPU Execution Speeds\r\n% We ran a benchmark in which we measured the amount of time it took to\r\n% execute 50 time steps for grid sizes of 64, 128, 512, 1024, and 2048 on\r\n% an Intel Xeon Processor X5650 and then on an NVIDIA Tesla C2050 GPU.\r\n%\r\n%%\r\n% <<Benchmark.png>>\r\n\r\n%% \r\n% For a grid size of 2048, the algorithm shows a 7.5x decrease in compute\r\n% time from more than a minute on the CPU to less than 10 seconds on the\r\n% GPU.  The log scale plot shows that the CPU is actually faster\r\n% for small grid sizes. As the technology evolves and matures, however, GPU\r\n% solutions are increasingly able to handle smaller problems, a trend that\r\n% we expect to continue.\r\n% \r\n%% Other Ways to Use the GPU with MATLAB \r\n% To accelerate an algorithm with multiple element-wise operations on a\r\n% GPU, you can use |arrayfun|, which applies a function to each element of\r\n% a GPUArray. Additionally, if you have your own CUDA code, you can use the\r\n% CUDAKernel interface to integrate this code with MATLAB.\r\n% \r\n% Ben Tordoff's previous guest blog post\r\n% <https:\/\/blogs.mathworks.com\/loren\/2011\/07\/18\/a-mandelbrot-set-on-the-gpu\/\r\n% A Mandelbrot Set on the GPU> shows a great example using both of these\r\n% capabilities.\r\n\r\n%% Your Thoughts?\r\n% Have you experimented with our new GPU functionality?  Let us know what\r\n% you think or if you have any questions by leaving a comment for this\r\n% post <https:\/\/blogs.mathworks.com\/loren\/?p=346#respond here>.\r\n\r\n##### SOURCE END ##### 43a6565ff583451ea3dcbf9a94261aab\r\n-->\r\n","protected":false},"excerpt":{"rendered":"<p>\r\n   \r\n      Today I&#8217;d like to introduce guest blogger Sarah Wait Zaranek who works for the MATLAB Marketing team. Sarah previously has written about speeding up code from a customer to get... 
<a class=\"read-more\" href=\"https:\/\/blogs.mathworks.com\/loren\/2012\/02\/06\/using-gpus-in-matlab\/\">read more >><\/a><\/p>","protected":false},"author":39,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":[],"categories":[55,6,34],"tags":[],"_links":{"self":[{"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/posts\/346"}],"collection":[{"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/users\/39"}],"replies":[{"embeddable":true,"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/comments?post=346"}],"version-history":[{"count":10,"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/posts\/346\/revisions"}],"predecessor-version":[{"id":3095,"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/posts\/346\/revisions\/3095"}],"wp:attachment":[{"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/media?parent=346"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/categories?post=346"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/tags?post=346"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}