{"id":282,"date":"2011-07-18T16:36:19","date_gmt":"2011-07-18T16:36:19","guid":{"rendered":"https:\/\/blogs.mathworks.com\/loren\/2011\/07\/18\/a-mandelbrot-set-on-the-gpu\/"},"modified":"2013-06-17T13:49:51","modified_gmt":"2013-06-17T18:49:51","slug":"a-mandelbrot-set-on-the-gpu","status":"publish","type":"post","link":"https:\/\/blogs.mathworks.com\/loren\/2011\/07\/18\/a-mandelbrot-set-on-the-gpu\/","title":{"rendered":"A Mandelbrot Set on the GPU"},"content":{"rendered":"<div xmlns:mwsh=\"https:\/\/www.mathworks.com\/namespace\/mcode\/v1\/syntaxhighlight.dtd\" class=\"content\">\r\n   <introduction>\r\n      <p>Today I welcome back guest blogger <a href=\"https:\/\/www.mathworks.com\/matlabcentral\/fileexchange\/authors\/80363\">Ben Tordoff<\/a> who last blogged here on how to <a href=\"https:\/\/blogs.mathworks.com\/loren\/2009\/12\/16\/carving-a-dinosaur\">Carve a Dinosaur<\/a>. These days Ben works on the <a href=\"https:\/\/www.mathworks.com\/products\/parallel-computing\">Parallel Computing Toolbox&#8482;<\/a> with particular focus on GPUs.\r\n      <\/p>\r\n   <\/introduction>\r\n   <h3>Contents<\/h3>\r\n   <div>\r\n      <ul>\r\n         <li><a href=\"#1\">About this demo<\/a><\/li>\r\n         <li><a href=\"#2\">Setup<\/a><\/li>\r\n         <li><a href=\"#3\">The Mandelbrot Set in MATLAB<\/a><\/li>\r\n         <li><a href=\"#4\">Using parallel.gpu.GPUArray - 16 Times Faster<\/a><\/li>\r\n         <li><a href=\"#5\">Using Element-wise Operation - 164 Times Faster<\/a><\/li>\r\n         <li><a href=\"#6\">Working in CUDA - 340 Times Faster<\/a><\/li>\r\n         <li><a href=\"#7\">Summary<\/a><\/li>\r\n         <li><a href=\"#8\">References<\/a><\/li>\r\n      <\/ul>\r\n   <\/div>\r\n   <h3>About this demo<a name=\"1\"><\/a><\/h3>\r\n   <p>Benoit Mandelbrot died last Autumn, and despite inventing the term \"fractal\" and being one of the pioneers of fractal geometry,\r\n      his lasting legacy for most people will undoubtedly be the fractal that bears 
his name.  There can be few mathematicians or\r\n      mathematical algorithms that have adorned as many teenage bedroom walls as Mandelbrot's set did in the late '80s and early\r\n      '90s.\r\n   <\/p>\r\n   <p>I also suspect there is no other fractal that has been implemented in quite as many different languages on different platforms.\r\n      My first version was a blocky 32x24 version on an aging ZX Spectrum with seven glorious colours.  These days we have a bit\r\n      more graphical power available, and a few more colours.  In fact, as we shall see, an algorithm like the Mandelbrot Set is\r\n      ideally suited to running on that GPU (Graphics Processing Unit) which mostly sits idle in your PC.\r\n   <\/p>\r\n   <p>As a starting point I will use a version of the Mandelbrot set taken loosely from Cleve Moler's <a href=\"https:\/\/www.mathworks.com\/moler\/exm\/chapters.html\">Experiments with MATLAB<\/a> e-book.  We will then look at the three different ways that the Parallel Computing Toolbox&#8482; provides for speeding this up\r\n      using the GPU:\r\n   <\/p>\r\n   <ol>\r\n   <li>Using the existing algorithm but with GPU data as input<\/li>\r\n   <li>Using <tt>arrayfun<\/tt> to perform the algorithm on each element independently\r\n   <\/li>\r\n   <li>Using the MATLAB\/CUDA interface to run some existing CUDA\/C++ code<\/li>\r\n   <\/ol>\r\n   <p>As you will see, these three methods are of increasing difficulty but repay the effort with hugely increased speed. 
Choosing\r\n      a method to trade off difficulty and speed is typical of parallel problems.\r\n   <\/p>\r\n   <h3>Setup<a name=\"2\"><\/a><\/h3>\r\n   <p>The values below specify a highly zoomed part of the Mandelbrot Set in the valley between the main cardioid and the p\/q bulb\r\n      to its left.\r\n   <\/p>\r\n   <p><img decoding=\"async\" vspace=\"5\" hspace=\"5\" src=\"https:\/\/blogs.mathworks.com\/images\/loren\/282\/paralleldemo_gpu_mandelbrot_location.png\"> <\/p>\r\n   <p>The following code forms the set of starting points for each of the calculations by creating a 1000x1000 grid of real parts\r\n      (X) and imaginary parts (Y) between these limits. For this particular location I happen to know that 500 iterations will be\r\n      enough to draw a nice picture.\r\n   <\/p><pre style=\"background: #F9F7F3; padding: 10px; border: 1px solid rgb(200,200,200)\">maxIterations = 500;\r\ngridSize = 1000;\r\nxlim = [-0.748766713922161, -0.748766707771757];\r\nylim = [ 0.123640844894862,  0.123640851045266];<\/pre><h3>The Mandelbrot Set in MATLAB<a name=\"3\"><\/a><\/h3>\r\n   <p>Below is an implementation of the Mandelbrot Set using standard MATLAB commands running on the CPU. 
This calculation is vectorized\r\n      such that every location is updated at once.\r\n   <\/p><pre style=\"background: #F9F7F3; padding: 10px; border: 1px solid rgb(200,200,200)\"><span style=\"color: #228B22\">% Setup<\/span>\r\nt = tic();\r\nx = linspace( xlim(1), xlim(2), gridSize );\r\ny = linspace( ylim(1), ylim(2), gridSize );\r\n[xGrid,yGrid] = meshgrid( x, y );\r\nz0 = xGrid + 1i*yGrid;\r\ncount = zeros( size(z0) );\r\n\r\n<span style=\"color: #228B22\">% Calculate<\/span>\r\nz = z0;\r\n<span style=\"color: #0000FF\">for<\/span> n = 0:maxIterations\r\n    z = z.*z + z0;\r\n    inside = abs( z )&lt;=2;\r\n    count = count + inside;\r\n<span style=\"color: #0000FF\">end<\/span>\r\ncount = log( count+1 );\r\n\r\n<span style=\"color: #228B22\">% Show<\/span>\r\ncpuTime = toc( t );\r\nset( gcf, <span style=\"color: #A020F0\">'Position'<\/span>, [200 200 600 600] );\r\nimagesc( x, y, count );\r\naxis <span style=\"color: #A020F0\">image<\/span>\r\ncolormap( [jet();flipud( jet() );0 0 0] );\r\ntitle( sprintf( <span style=\"color: #A020F0\">'%1.2fsecs (without GPU)'<\/span>, cpuTime ) );<\/pre><img decoding=\"async\" vspace=\"5\" hspace=\"5\" src=\"https:\/\/blogs.mathworks.com\/images\/loren\/282\/gpu_mandelbrot_blog_01.png\"> <h3>Using parallel.gpu.GPUArray - 16 Times Faster<a name=\"4\"><\/a><\/h3>\r\n   <p><a href=\"https:\/\/www.mathworks.com\/help\/releases\/R2011a\/toolbox\/distcomp\/gpuarray.html\"><tt>parallel.gpu.GPUArray<\/tt><\/a> provides GPU versions of many functions that can be used to create data arrays, including the <a href=\"https:\/\/www.mathworks.com\/help\/releases\/R2011a\/techdoc\/\/ref\/linspace.html\"><tt>linspace<\/tt><\/a>, <a href=\"https:\/\/www.mathworks.com\/help\/releases\/R2011a\/techdoc\/\/ref\/logspace.html\"><tt>logspace<\/tt><\/a>, and <a href=\"https:\/\/www.mathworks.com\/help\/releases\/R2011a\/techdoc\/\/ref\/meshgrid.html\"><tt>meshgrid<\/tt><\/a> functions needed here.  
Similarly, the <tt>count<\/tt> array is initialized directly on the GPU using the function <a href=\"https:\/\/www.mathworks.com\/help\/releases\/R2011a\/techdoc\/\/ref\/zeros.html\"><tt>parallel.gpu.GPUArray.zeros<\/tt><\/a>.\r\n   <\/p>\r\n   <p>Other than these simple changes to the data initialization, the algorithm is unchanged (and &gt;16 times faster):<\/p><pre style=\"background: #F9F7F3; padding: 10px; border: 1px solid rgb(200,200,200)\"><span style=\"color: #228B22\">% Setup<\/span>\r\nt = tic();\r\nx = parallel.gpu.GPUArray.linspace( xlim(1), xlim(2), gridSize );\r\ny = parallel.gpu.GPUArray.linspace( ylim(1), ylim(2), gridSize );\r\n[xGrid,yGrid] = meshgrid( x, y );\r\nz0 = complex( xGrid, yGrid );\r\ncount = parallel.gpu.GPUArray.zeros( size(z0) );\r\n\r\n<span style=\"color: #228B22\">% Calculate<\/span>\r\nz = z0;\r\n<span style=\"color: #0000FF\">for<\/span> n = 0:maxIterations\r\n    z = z.*z + z0;\r\n    inside = abs( z )&lt;=2;\r\n    count = count + inside;\r\n<span style=\"color: #0000FF\">end<\/span>\r\ncount = log( count+1 );\r\n\r\n<span style=\"color: #228B22\">% Show<\/span>\r\nnaiveGPUTime = toc( t );\r\nimagesc( x, y, count )\r\naxis <span style=\"color: #A020F0\">image<\/span>\r\ntitle( sprintf( <span style=\"color: #A020F0\">'%1.2fsecs (naive GPU) = %1.1fx faster'<\/span>, <span style=\"color: #0000FF\">...<\/span>\r\n    naiveGPUTime, cpuTime\/naiveGPUTime ) )<\/pre><img decoding=\"async\" vspace=\"5\" hspace=\"5\" src=\"https:\/\/blogs.mathworks.com\/images\/loren\/282\/gpu_mandelbrot_blog_02.png\"> <h3>Using Element-wise Operation - 164 Times Faster<a name=\"5\"><\/a><\/h3>\r\n   <p>Noticing that the algorithm is operating equally on every element of the input, we can place the code in a helper function\r\n      and call it using <a href=\"https:\/\/www.mathworks.com\/help\/releases\/R2011a\/toolbox\/distcomp\/arrayfun.html\"><tt>arrayfun<\/tt><\/a>.  
For GPU array inputs, the function used with arrayfun gets compiled into native GPU code.  In this case we placed the loop\r\n      in <tt>processMandelbrotElement.m<\/tt> which looks as follows:\r\n   <\/p><pre style=\"background: #F9F7F3; padding: 10px; border: 1px solid rgb(200,200,200)\">function count = processMandelbrotElement(x0,y0,maxIterations)\r\nz0 = complex(x0,y0);\r\nz = z0;\r\ncount = 1;\r\nwhile (count &lt;= maxIterations) ...\r\n      &amp;&amp; ((real(z)*real(z) + imag(z)*imag(z)) &lt;= 4)\r\n    count = count + 1;\r\n    z = z*z + z0;\r\nend\r\ncount = log(count);<\/pre><p>Note that an early abort has been introduced since this function only processes a single element. For most views of the Mandelbrot\r\n      Set a significant number of elements stop very early and this can save a lot of processing. The <tt>for<\/tt> loop has also been replaced by a <tt>while<\/tt> loop because they are usually more efficient. This function makes no mention of the GPU and uses no GPU-specific features\r\n      - it is standard MATLAB code.\r\n   <\/p>\r\n   <p>Using <tt>arrayfun<\/tt> means that instead of many thousands of calls to separate GPU-optimized operations (at least 6 per iteration), we make one\r\n      call to a parallelized GPU operation that performs the whole calculation. 
This significantly reduces overhead - 164 times\r\n      faster.\r\n   <\/p><pre style=\"background: #F9F7F3; padding: 10px; border: 1px solid rgb(200,200,200)\"><span style=\"color: #228B22\">% Setup<\/span>\r\nt = tic();\r\nx = parallel.gpu.GPUArray.linspace( xlim(1), xlim(2), gridSize );\r\ny = parallel.gpu.GPUArray.linspace( ylim(1), ylim(2), gridSize );\r\n[xGrid,yGrid] = meshgrid( x, y );\r\n\r\n<span style=\"color: #228B22\">% Calculate<\/span>\r\ncount = arrayfun( @processMandelbrotElement, xGrid, yGrid, maxIterations );\r\n\r\n<span style=\"color: #228B22\">% Show<\/span>\r\ngpuArrayfunTime = toc( t );\r\nimagesc( x, y, count )\r\naxis <span style=\"color: #A020F0\">image<\/span>\r\ntitle( sprintf( <span style=\"color: #A020F0\">'%1.2fsecs (GPU arrayfun) = %1.1fx faster'<\/span>, <span style=\"color: #0000FF\">...<\/span>\r\n    gpuArrayfunTime, cpuTime\/gpuArrayfunTime ) );<\/pre><img decoding=\"async\" vspace=\"5\" hspace=\"5\" src=\"https:\/\/blogs.mathworks.com\/images\/loren\/282\/gpu_mandelbrot_blog_03.png\"> <h3>Working in CUDA - 340 Times Faster<a name=\"6\"><\/a><\/h3>\r\n   <p>In <a href=\"https:\/\/www.mathworks.com\/moler\/exm\/index.html\">Experiments with MATLAB<\/a>, improved performance is achieved by converting the basic algorithm to a C-Mex function. If you're willing to do some work\r\n      in C\/C++, then you can use the Parallel Computing Toolbox to call pre-written CUDA kernels using MATLAB data. This is done\r\n      using the toolbox's <a href=\"https:\/\/www.mathworks.com\/help\/releases\/R2011a\/toolbox\/distcomp\/parallel.gpu.cudakernel.html\"><tt>parallel.gpu.CUDAKernel<\/tt><\/a> feature.\r\n   <\/p>\r\n   <p>A CUDA\/C++ implementation of the element processing algorithm has been hand-written in <tt>processMandelbrotElement.cu<\/tt>. 
This must then be manually compiled using nVidia's NVCC compiler to produce the assembly-level <tt>processMandelbrotElement.ptx<\/tt> (<tt>.ptx<\/tt> stands for \"Parallel Thread eXecution language\").\r\n   <\/p>\r\n   <p>The CUDA\/C++ code is a little more involved than the MATLAB versions we've seen so far due to the lack of complex numbers\r\n      in C++. However, the essence of the algorithm is unchanged:\r\n   <\/p><pre style=\"background: #F9F7F3; padding: 10px; border: 1px solid rgb(200,200,200)\">__device__\r\nunsigned int doIterations( double const realPart0,\r\n                           double const imagPart0,\r\n                           unsigned int const maxIters ) {\r\n   \/\/ Initialize: z = z0\r\n   double realPart = realPart0;\r\n   double imagPart = imagPart0;\r\n   unsigned int count = 0;\r\n   \/\/ Loop until escape\r\n   while ( ( count &lt;= maxIters )\r\n          &amp;&amp; ((realPart*realPart + imagPart*imagPart) &lt;= 4.0) ) {\r\n      ++count;\r\n      \/\/ Update: z = z*z + z0;\r\n      double const oldRealPart = realPart;\r\n      realPart = realPart*realPart - imagPart*imagPart + realPart0;\r\n      imagPart = 2.0*oldRealPart*imagPart + imagPart0;\r\n   }\r\n   return count;\r\n}<\/pre><p>(the full source-code is available in the file-exchange submission linked at the end.)<\/p>\r\n   <p>One GPU thread is required for each element in the array, divided into blocks. 
The kernel indicates how big a thread-block\r\n      is, and this is used to calculate the number of blocks required.\r\n   <\/p><pre style=\"background: #F9F7F3; padding: 10px; border: 1px solid rgb(200,200,200)\"><span style=\"color: #228B22\">% Setup<\/span>\r\nt = tic();\r\nx = parallel.gpu.GPUArray.linspace( xlim(1), xlim(2), gridSize );\r\ny = parallel.gpu.GPUArray.linspace( ylim(1), ylim(2), gridSize );\r\n[xGrid,yGrid] = meshgrid( x, y );\r\n\r\n<span style=\"color: #228B22\">% Load the kernel<\/span>\r\nkernel = parallel.gpu.CUDAKernel( <span style=\"color: #A020F0\">'processMandelbrotElement.ptx'<\/span>, <span style=\"color: #0000FF\">...<\/span>\r\n                                  <span style=\"color: #A020F0\">'processMandelbrotElement.cu'<\/span> );\r\n\r\n<span style=\"color: #228B22\">% Make sure we have sufficient blocks to cover the whole array<\/span>\r\nnumElements = numel( xGrid );\r\nkernel.ThreadBlockSize = [kernel.MaxThreadsPerBlock,1,1];\r\nkernel.GridSize = [ceil(numElements\/kernel.MaxThreadsPerBlock),1];\r\n\r\n<span style=\"color: #228B22\">% Call the kernel<\/span>\r\ncount = parallel.gpu.GPUArray.zeros( size(xGrid) );\r\ncount = feval( kernel, count, xGrid, yGrid, maxIterations, numElements );\r\n\r\n<span style=\"color: #228B22\">% Show<\/span>\r\ngpuCUDAKernelTime = toc( t );\r\nimagesc( x, y, count )\r\naxis <span style=\"color: #A020F0\">image<\/span>\r\ntitle( sprintf( <span style=\"color: #A020F0\">'%1.2fsecs (GPU CUDAKernel) = %1.1fx faster'<\/span>, <span style=\"color: #0000FF\">...<\/span>\r\n    gpuCUDAKernelTime, cpuTime\/gpuCUDAKernelTime ) );<\/pre><img decoding=\"async\" vspace=\"5\" hspace=\"5\" src=\"https:\/\/blogs.mathworks.com\/images\/loren\/282\/gpu_mandelbrot_blog_04.png\"> <h3>Summary<a name=\"7\"><\/a><\/h3>\r\n   <p>When my brothers and I sat down and coded up our first Mandelbrot set it took over a minute to render a 32x24 image.  
Here\r\n      we have seen how some simple steps and some nice hardware can now render 1000x1000 images at several frames per second. How\r\n      times change!\r\n   <\/p>\r\n   <p>We have looked at three ways of speeding up the basic MATLAB implementation:<\/p>\r\n   <ol>\r\n   <li>Convert the input data to be on the GPU using   <a href=\"https:\/\/www.mathworks.com\/help\/releases\/R2011a\/toolbox\/distcomp\/gpuarray.html\"><tt>gpuArray<\/tt><\/a>,   leaving the algorithm unchanged\r\n   <\/li>\r\n   <li>Use <a href=\"https:\/\/www.mathworks.com\/help\/releases\/R2011a\/toolbox\/distcomp\/arrayfun.html\"><tt>arrayfun<\/tt><\/a> on a <tt>GPUArray<\/tt> input to perform the algorithm on each element of the input independently\r\n   <\/li>\r\n   <li>Use <a href=\"https:\/\/www.mathworks.com\/help\/releases\/R2011a\/toolbox\/distcomp\/parallel.gpu.cudakernel.html\"><tt>parallel.gpu.CUDAKernel<\/tt><\/a> to run some existing CUDA\/C++ code using MATLAB data\r\n   <\/li>\r\n   <\/ol>\r\n   <p>I've also created a graphical application that lets you explore the Mandelbrot Set using each of these approaches. 
You can\r\n      download this from the File Exchange:\r\n   <\/p>\r\n   <div>\r\n      <ul>\r\n         <li><a href=\"https:\/\/www.mathworks.com\/matlabcentral\/fileexchange\/30988-a-gpu-mandelbrot-set\">https:\/\/www.mathworks.com\/matlabcentral\/fileexchange\/30988-a-gpu-mandelbrot-set<\/a><\/li>\r\n      <\/ul>\r\n   <\/div>\r\n   <p>If you have any ideas for creating Mandelbrot Sets at even greater speed or can think of other algorithms that might make\r\n      good use of the GPU, please leave your comments and questions <a href=\"https:\/\/blogs.mathworks.com\/loren\/?p=282#respond\">below<\/a>.\r\n   <\/p>\r\n   <h3>References<a name=\"8\"><\/a><\/h3>\r\n   <p>Read a bit more about the life and times of Benoit Mandelbrot:<\/p>\r\n   <div>\r\n      <ul>\r\n         <li><a href=\"http:\/\/www.guardian.co.uk\/science\/2010\/oct\/17\/benoit-mandelbrot-obituary\">http:\/\/www.guardian.co.uk\/science\/2010\/oct\/17\/benoit-mandelbrot-obituary<\/a><\/li>\r\n         <li><a href=\"http:\/\/en.wikipedia.org\/wiki\/Benoit_Mandelbrot\">http:\/\/en.wikipedia.org\/wiki\/Benoit_Mandelbrot<\/a><\/li>\r\n      <\/ul>\r\n   <\/div>\r\n   <p>See his recent talk at TED:<\/p>\r\n   <div>\r\n      <ul>\r\n         <li><a href=\"https:\/\/www.ted.com\/talks\/benoit_mandelbrot_fractals_and_the_art_of_roughness \">https:\/\/www.ted.com\/talks\/benoit_mandelbrot_fractals_and_the_art_of_roughness <\/a><\/li>\r\n      <\/ul>\r\n   <\/div>\r\n   <p>Have a look at Cleve Moler's chapter on the Mandelbrot Set:<\/p>\r\n   <div>\r\n      <ul>\r\n         <li><a href=\"https:\/\/www.mathworks.com\/moler\/exm\/chapters.html\">https:\/\/www.mathworks.com\/moler\/exm\/chapters.html<\/a><\/li>\r\n      <\/ul>\r\n   <\/div><script language=\"JavaScript\">\r\n<!--\r\n\r\n    function grabCode_923c47972e2b4a42b1adb787cf3277a5() {\r\n        \/\/ Remember the title so we can use it in the new page\r\n        title = document.title;\r\n\r\n        \/\/ Break up these strings so that their 
presence\r\n        \/\/ in the Javascript doesn't mess up the search for\r\n        \/\/ the MATLAB code.\r\n        t1='923c47972e2b4a42b1adb787cf3277a5 ' + '##### ' + 'SOURCE BEGIN' + ' #####';\r\n        t2='##### ' + 'SOURCE END' + ' #####' + ' 923c47972e2b4a42b1adb787cf3277a5';\r\n    \r\n        b=document.getElementsByTagName('body')[0];\r\n        i1=b.innerHTML.indexOf(t1)+t1.length;\r\n        i2=b.innerHTML.indexOf(t2);\r\n \r\n        code_string = b.innerHTML.substring(i1, i2);\r\n        code_string = code_string.replace(\/REPLACE_WITH_DASH_DASH\/g,'--');\r\n\r\n        \/\/ Use \/x3C\/g instead of the less-than character to avoid errors \r\n        \/\/ in the XML parser.\r\n        \/\/ Use '\\x26#60;' instead of '<' so that the XML parser\r\n        \/\/ doesn't go ahead and substitute the less-than character. \r\n        code_string = code_string.replace(\/\\x3C\/g, '\\x26#60;');\r\n\r\n        author = 'Ben Tordoff';\r\n        copyright = 'Copyright 2011 The MathWorks, Inc.';\r\n\r\n        w = window.open();\r\n        d = w.document;\r\n        d.write('<pre>\\n');\r\n        d.write(code_string);\r\n\r\n        \/\/ Add author and copyright lines at the bottom if specified.\r\n        if ((author.length > 0) || (copyright.length > 0)) {\r\n            d.writeln('');\r\n            d.writeln('%%');\r\n            if (author.length > 0) {\r\n                d.writeln('% _' + author + '_');\r\n            }\r\n            if (copyright.length > 0) {\r\n                d.writeln('% _' + copyright + '_');\r\n            }\r\n        }\r\n\r\n        d.write('<\/pre>\\n');\r\n      \r\n      d.title = title + ' (MATLAB code)';\r\n      d.close();\r\n      }   \r\n      \r\n-->\r\n<\/script><p style=\"text-align: right; font-size: xx-small; font-weight:lighter;   font-style: italic; color: gray\">Copyright 2011 The MathWorks, Inc.<br><a href=\"javascript:grabCode_923c47972e2b4a42b1adb787cf3277a5()\"><span style=\"font-size: x-small;        
font-style: italic;\">Get \r\n            the MATLAB code \r\n            <noscript>(requires JavaScript)<\/noscript><\/span><\/a><br><br>\r\n      Published with MATLAB&reg; 7.12<br><\/p>\r\n<\/div>\r\n<!--\r\n923c47972e2b4a42b1adb787cf3277a5 ##### SOURCE BEGIN #####\r\n%% A Mandelbrot Set on the GPU\r\n%\r\n% Today I welcome back guest blogger\r\n% <https:\/\/www.mathworks.com\/matlabcentral\/fileexchange\/authors\/80363 Ben Tordoff>\r\n% who last blogged here on how to \r\n% <https:\/\/blogs.mathworks.com\/loren\/2009\/12\/16\/carving-a-dinosaur Carve a Dinosaur>.\r\n% These days Ben works on the \r\n% <https:\/\/www.mathworks.com\/products\/parallel-computing Parallel Computing Toolbox(TM)>\r\n% with particular focus on GPUs.\r\n\r\n%% About this demo\r\n% Benoit Mandelbrot died last Autumn, and despite inventing the term \r\n% \"fractal\" and being one of the pioneers of fractal geometry, his lasting\r\n% legacy for most people will undoubtedly be the fractal that bears his \r\n% name.  There can be few mathematicians or mathematical algorithms that \r\n% have adorned as many teenage bedroom walls as Mandelbrot's set did in the\r\n% late '80s and early '90s.\r\n%\r\n% I also suspect there is no other fractal that has been\r\n% implemented in quite as many different languages on different platforms.\r\n% My first version was a blocky 32x24 version on an aging ZX Spectrum with\r\n% seven glorious colours.  These days we have a bit more graphical power\r\n% available, and a few more colours.  In fact, as we shall see, an\r\n% algorithm like the Mandelbrot Set is ideally suited to running on that\r\n% GPU (Graphics Processing Unit) which mostly sits idle in your PC.\r\n%\r\n% As a starting point I will use a version of the Mandelbrot set taken\r\n% loosely from Cleve Moler's \r\n% <https:\/\/www.mathworks.com\/moler\/exm\/chapters.html Experiments with MATLAB>\r\n% e-book.  
We will\r\n% then look at the three different ways that the Parallel Computing\r\n% Toolbox(TM) provides for speeding this up using the GPU:\r\n%\r\n% # Using the existing algorithm but with GPU data as input\r\n% # Using |arrayfun| to perform the algorithm on each element independently\r\n% # Using the MATLAB\/CUDA interface to run some existing CUDA\/C++ code\r\n%\r\n% As you will see, these three methods are of increasing difficulty but\r\n% repay the effort with hugely increased speed. Choosing a method to trade\r\n% off difficulty and speed is typical of parallel problems.\r\n\r\n% Copyright 2011 The MathWorks, Inc.\r\n\r\n\r\n%% Setup\r\n% The values below specify a highly zoomed part of the Mandelbrot Set in\r\n% the valley between the main cardioid and the p\/q bulb to its left.  \r\n%\r\n% <<paralleldemo_gpu_mandelbrot_location.png>>\r\n%\r\n%\r\n% The following code forms the set of starting points for each of the\r\n% calculations by creating a 1000x1000 grid of real parts (X) and imaginary\r\n% parts (Y) between these limits. For this particular location I happen to\r\n% know that 500 iterations will be enough to draw a nice picture.\r\n\r\nmaxIterations = 500;\r\ngridSize = 1000;\r\nxlim = [-0.748766713922161, -0.748766707771757];\r\nylim = [ 0.123640844894862,  0.123640851045266];\r\n\r\n\r\n%% The Mandelbrot Set in MATLAB\r\n% Below is an implementation of the Mandelbrot Set using\r\n% standard MATLAB commands running on the CPU. This calculation is\r\n% vectorized such that every location is updated at once. 
\r\n\r\n% Setup\r\nt = tic();\r\nx = linspace( xlim(1), xlim(2), gridSize );\r\ny = linspace( ylim(1), ylim(2), gridSize );\r\n[xGrid,yGrid] = meshgrid( x, y );\r\nz0 = xGrid + 1i*yGrid;\r\ncount = zeros( size(z0) );\r\n\r\n% Calculate\r\nz = z0;\r\nfor n = 0:maxIterations\r\n    z = z.*z + z0;\r\n    inside = abs( z )<=2;\r\n    count = count + inside;\r\nend\r\ncount = log( count+1 );\r\n\r\n% Show\r\ncpuTime = toc( t );\r\nset( gcf, 'Position', [200 200 600 600] );\r\nimagesc( x, y, count );\r\naxis image\r\ncolormap( [jet();flipud( jet() );0 0 0] );\r\ntitle( sprintf( '%1.2fsecs (without GPU)', cpuTime ) );\r\n\r\n\r\n%% Using parallel.gpu.GPUArray - 16 Times Faster\r\n% <https:\/\/www.mathworks.com\/help\/releases\/R2011a\/toolbox\/distcomp\/gpuarray.html |parallel.gpu.GPUArray|>\r\n% provides GPU versions of many functions that can\r\n% be used to create data arrays, including the \r\n% <https:\/\/www.mathworks.com\/help\/releases\/R2011a\/techdoc\/\/ref\/linspace.html |linspace|>, \r\n% <https:\/\/www.mathworks.com\/help\/releases\/R2011a\/techdoc\/\/ref\/logspace.html |logspace|>, and\r\n% <https:\/\/www.mathworks.com\/help\/releases\/R2011a\/techdoc\/\/ref\/meshgrid.html |meshgrid|> functions\r\n% needed here.  
Similarly, the |count| array is initialized directly on the\r\n% GPU using the function \r\n% <https:\/\/www.mathworks.com\/help\/releases\/R2011a\/techdoc\/\/ref\/zeros.html |parallel.gpu.GPUArray.zeros|>.\r\n%\r\n% Other than these simple changes to the data initialization, the algorithm\r\n% is unchanged (and >16 times faster):\r\n\r\n% Setup\r\nt = tic();\r\nx = parallel.gpu.GPUArray.linspace( xlim(1), xlim(2), gridSize );\r\ny = parallel.gpu.GPUArray.linspace( ylim(1), ylim(2), gridSize );\r\n[xGrid,yGrid] = meshgrid( x, y );\r\nz0 = complex( xGrid, yGrid );\r\ncount = parallel.gpu.GPUArray.zeros( size(z0) );\r\n\r\n% Calculate\r\nz = z0;\r\nfor n = 0:maxIterations\r\n    z = z.*z + z0;\r\n    inside = abs( z )<=2;\r\n    count = count + inside;\r\nend\r\ncount = log( count+1 );\r\n\r\n% Show\r\nnaiveGPUTime = toc( t );\r\nimagesc( x, y, count )\r\naxis image\r\ntitle( sprintf( '%1.2fsecs (naive GPU) = %1.1fx faster', ...\r\n    naiveGPUTime, cpuTime\/naiveGPUTime ) )\r\n\r\n\r\n%% Using Element-wise Operation - 164 Times Faster\r\n% Noticing that the algorithm is operating equally on every element of the\r\n% input, we can place the code in a helper function and call it using\r\n% <https:\/\/www.mathworks.com\/help\/releases\/R2011a\/toolbox\/distcomp\/arrayfun.html |arrayfun|>.  For GPU array\r\n% inputs, the function used with arrayfun gets compiled into native GPU\r\n% code.  In this case we placed the loop in\r\n% |processMandelbrotElement.m| which looks as follows:\r\n%\r\n%  function count = processMandelbrotElement(x0,y0,maxIterations)\r\n%  z0 = complex(x0,y0);\r\n%  z = z0;\r\n%  count = 1;\r\n%  while (count <= maxIterations) ...\r\n%        && ((real(z)*real(z) + imag(z)*imag(z)) <= 4)\r\n%      count = count + 1;\r\n%      z = z*z + z0;\r\n%  end\r\n%  count = log(count);\r\n%\r\n% Note that an early abort has been introduced since this function only\r\n% processes a single element. 
For most views of the Mandelbrot Set a\r\n% significant number of elements stop very early and this can save a lot of\r\n% processing. The |for| loop has also been replaced by a |while| loop\r\n% because they are usually more efficient. This function makes no mention\r\n% of the GPU and uses no GPU-specific features - it is standard MATLAB\r\n% code.\r\n%\r\n% Using |arrayfun| means that instead of many thousands of calls to\r\n% separate GPU-optimized operations (at least 6 per iteration), we make one\r\n% call to a parallelized GPU operation that performs the whole calculation.\r\n% This significantly reduces overhead - 164 times faster.\r\n\r\n% Setup\r\nt = tic();\r\nx = parallel.gpu.GPUArray.linspace( xlim(1), xlim(2), gridSize );\r\ny = parallel.gpu.GPUArray.linspace( ylim(1), ylim(2), gridSize );\r\n[xGrid,yGrid] = meshgrid( x, y );\r\n\r\n% Calculate\r\ncount = arrayfun( @processMandelbrotElement, xGrid, yGrid, maxIterations );\r\n\r\n% Show\r\ngpuArrayfunTime = toc( t );\r\nimagesc( x, y, count )\r\naxis image\r\ntitle( sprintf( '%1.2fsecs (GPU arrayfun) = %1.1fx faster', ...\r\n    gpuArrayfunTime, cpuTime\/gpuArrayfunTime ) );\r\n\r\n\r\n%% Working in CUDA - 340 Times Faster\r\n% In <https:\/\/www.mathworks.com\/moler\/exm\/index.html Experiments with MATLAB> \r\n% improved performance is achieved by\r\n% converting the basic algorithm to a C-Mex function. If you're willing to\r\n% do some work in C\/C++, then you can use the Parallel Computing Toolbox \r\n% to call pre-written CUDA kernels using MATLAB data. This is done\r\n% using the toolbox's\r\n% <https:\/\/www.mathworks.com\/help\/releases\/R2011a\/toolbox\/distcomp\/parallel.gpu.cudakernel.html |parallel.gpu.CUDAKernel|>\r\n% feature.\r\n%\r\n% A CUDA\/C++\r\n% implementation of the element processing algorithm has been hand-written\r\n% in |processMandelbrotElement.cu|. 
This must then be manually compiled\r\n% using nVidia's NVCC compiler to produce the assembly-level\r\n% |processMandelbrotElement.ptx| (|.ptx| stands for \"Parallel Thread\r\n% eXecution language\").\r\n%\r\n% The CUDA\/C++ code is a little more involved than the MATLAB versions \r\n% we've seen so far due to the lack of complex numbers in C++. However, the\r\n% essence of the algorithm is unchanged:\r\n%\r\n%  __device__\r\n%  unsigned int doIterations( double const realPart0, \r\n%                             double const imagPart0, \r\n%                             unsigned int const maxIters ) {\r\n%     \/\/ Initialize: z = z0\r\n%     double realPart = realPart0;\r\n%     double imagPart = imagPart0;\r\n%     unsigned int count = 0;\r\n%     \/\/ Loop until escape\r\n%     while ( ( count <= maxIters )\r\n%            && ((realPart*realPart + imagPart*imagPart) <= 4.0) ) {\r\n%        ++count;\r\n%        \/\/ Update: z = z*z + z0;\r\n%        double const oldRealPart = realPart;\r\n%        realPart = realPart*realPart - imagPart*imagPart + realPart0;\r\n%        imagPart = 2.0*oldRealPart*imagPart + imagPart0;\r\n%     }\r\n%     return count;\r\n%  }\r\n%\r\n% (the full source-code is available in the file-exchange submission linked\r\n% at the end.)\r\n%\r\n% One GPU thread is required for each element in the array, divided\r\n% into blocks. 
The kernel indicates how big a thread-block is, and this is\r\n% used to calculate the number of blocks required.\r\n\r\n% Setup\r\nt = tic();\r\nx = parallel.gpu.GPUArray.linspace( xlim(1), xlim(2), gridSize );\r\ny = parallel.gpu.GPUArray.linspace( ylim(1), ylim(2), gridSize );\r\n[xGrid,yGrid] = meshgrid( x, y );\r\n\r\n% Load the kernel\r\nkernel = parallel.gpu.CUDAKernel( 'processMandelbrotElement.ptx', ...\r\n                                  'processMandelbrotElement.cu' );\r\n\r\n% Make sure we have sufficient blocks to cover the whole array\r\nnumElements = numel( xGrid );\r\nkernel.ThreadBlockSize = [kernel.MaxThreadsPerBlock,1,1];\r\nkernel.GridSize = [ceil(numElements\/kernel.MaxThreadsPerBlock),1];\r\n                              \r\n% Call the kernel\r\ncount = parallel.gpu.GPUArray.zeros( size(xGrid) );\r\ncount = feval( kernel, count, xGrid, yGrid, maxIterations, numElements );\r\n  \r\n% Show\r\ngpuCUDAKernelTime = toc( t );\r\nimagesc( x, y, count )\r\naxis image\r\ntitle( sprintf( '%1.2fsecs (GPU CUDAKernel) = %1.1fx faster', ...\r\n    gpuCUDAKernelTime, cpuTime\/gpuCUDAKernelTime ) );\r\n                      \r\n\r\n%% Summary\r\n% When my brothers and I sat down and coded up our first Mandelbrot set it\r\n% took over a minute to render a 32x24 image.  Here we have seen how some\r\n% simple steps and some nice hardware can now render 1000x1000 images\r\n% at several frames per second. 
How times change!\r\n%\r\n% We have looked at three ways of speeding up the basic MATLAB implementation:\r\n%\r\n% # Convert the input data to be on the GPU using \r\n%   <https:\/\/www.mathworks.com\/help\/releases\/R2011a\/toolbox\/distcomp\/gpuarray.html |gpuArray|>,\r\n%   leaving the algorithm unchanged\r\n% # Use <https:\/\/www.mathworks.com\/help\/releases\/R2011a\/toolbox\/distcomp\/arrayfun.html |arrayfun|> on a |GPUArray| input to perform the algorithm on each\r\n% element of the input independently\r\n% # Use <https:\/\/www.mathworks.com\/help\/releases\/R2011a\/toolbox\/distcomp\/parallel.gpu.cudakernel.html |parallel.gpu.CUDAKernel|> to run some existing CUDA\/C++ code using\r\n% MATLAB data\r\n%\r\n% I've also created a graphical application that lets you explore the \r\n% Mandelbrot Set using each of these approaches. You can download this from\r\n% the File Exchange:\r\n%\r\n% * https:\/\/www.mathworks.com\/matlabcentral\/fileexchange\/30988-a-gpu-mandelbrot-set\r\n%\r\n% If you have any ideas for creating Mandelbrot Sets at even greater speed\r\n% or can think of other algorithms that might make good use of the GPU,\r\n% please leave your comments and questions\r\n% <https:\/\/blogs.mathworks.com\/loren\/?p=282#respond below>.\r\n\r\n%% References\r\n% Read a bit more about the life and times of Benoit Mandelbrot:\r\n%\r\n% * http:\/\/www.guardian.co.uk\/science\/2010\/oct\/17\/benoit-mandelbrot-obituary\r\n% * http:\/\/en.wikipedia.org\/wiki\/Benoit_Mandelbrot\r\n%\r\n% See his recent talk at TED:\r\n%\r\n% * https:\/\/www.ted.com\/talks\/benoit_mandelbrot_fractals_and_the_art_of_roughness\r\n%\r\n% Have a look at Cleve Moler's chapter on the Mandelbrot Set:\r\n%\r\n% * https:\/\/www.mathworks.com\/moler\/exm\/chapters.html\r\n##### SOURCE END ##### 923c47972e2b4a42b1adb787cf3277a5\r\n-->\r\n","protected":false},"excerpt":{"rendered":"<p>\r\n   \r\n      Today I welcome back guest blogger Ben Tordoff who last blogged here on how to Carve a 
Dinosaur. These days Ben works on the Parallel Computing Toolbox&#8482; with particular focus on... <a class=\"read-more\" href=\"https:\/\/blogs.mathworks.com\/loren\/2011\/07\/18\/a-mandelbrot-set-on-the-gpu\/\">read more >><\/a><\/p>","protected":false},"author":39,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":[],"categories":[55,6,34],"tags":[],"_links":{"self":[{"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/posts\/282"}],"collection":[{"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/users\/39"}],"replies":[{"embeddable":true,"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/comments?post=282"}],"version-history":[{"count":1,"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/posts\/282\/revisions"}],"predecessor-version":[{"id":717,"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/posts\/282\/revisions\/717"}],"wp:attachment":[{"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/media?parent=282"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/categories?post=282"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/tags?post=282"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}