{"id":588,"date":"2012-12-14T12:48:03","date_gmt":"2012-12-14T17:48:03","guid":{"rendered":"https:\/\/blogs.mathworks.com\/loren\/?p=588"},"modified":"2013-06-17T13:49:10","modified_gmt":"2013-06-17T18:49:10","slug":"measuring-gpu-performance","status":"publish","type":"post","link":"https:\/\/blogs.mathworks.com\/loren\/2012\/12\/14\/measuring-gpu-performance\/","title":{"rendered":"Measuring GPU Performance"},"content":{"rendered":"<!DOCTYPE html\r\n  PUBLIC \"-\/\/W3C\/\/DTD HTML 4.01 Transitional\/\/EN\">\r\n<style type=\"text\/css\">\r\n\r\nh1 { font-size:18pt; }\r\nh2.titlebg { font-size:13pt; }\r\nh3 { color:#4A4F55; padding:0px; margin:5px 0px 5px; font-family:Arial, Helvetica, sans-serif; font-size:11pt; font-weight:bold; line-height:140%; border-bottom:1px solid #d6d4d4; display:block; }\r\nh4 { color:#4A4F55; padding:0px; margin:0px 0px 5px; font-family:Arial, Helvetica, sans-serif; font-size:10pt; font-weight:bold; line-height:140%; border-bottom:1px solid #d6d4d4; display:block; }\r\n   \r\np { padding:0px; margin:0px 0px 20px; }\r\nimg { padding:0px; margin:0px 0px 20px; border:none; }\r\np img, pre img, tt img, li img { margin-bottom:0px; } \r\n\r\nul { padding:0px; margin:0px 0px 20px 23px; list-style:square; }\r\nul li { padding:0px; margin:0px 0px 7px 0px; background:none; }\r\nul li ul { padding:5px 0px 0px; margin:0px 0px 7px 23px; }\r\nul li ol li { list-style:decimal; }\r\nol { padding:0px; margin:0px 0px 20px 0px; list-style:decimal; }\r\nol li { padding:0px; margin:0px 0px 7px 23px; list-style-type:decimal; }\r\nol li ol { padding:5px 0px 0px; margin:0px 0px 7px 0px; }\r\nol li ol li { list-style-type:lower-alpha; }\r\nol li ul { padding-top:7px; }\r\nol li ul li { list-style:square; }\r\n\r\npre, tt, code { font-size:12px; }\r\npre { margin:0px 0px 20px; }\r\npre.error { color:red; }\r\npre.codeinput { padding:10px; border:1px solid #d3d3d3; background:#f7f7f7; }\r\npre.codeoutput { padding:10px 11px; margin:0px 0px 20px; color:#4c4c4c; 
}\r\n\r\n@media print { pre.codeinput, pre.codeoutput { word-wrap:break-word; width:100%; } }\r\n\r\nspan.keyword { color:#0000FF }\r\nspan.comment { color:#228B22 }\r\nspan.string { color:#A020F0 }\r\nspan.untermstring { color:#B20000 }\r\nspan.syscmd { color:#B28C00 }\r\n\r\n.footer { width:auto; padding:10px 0px; margin:25px 0px 0px; border-top:1px dotted #878787; font-size:0.8em; line-height:140%; font-style:italic; color:#878787; text-align:left; float:none; }\r\n.footer p { margin:0px; }\r\n\r\n  <\/style><div class=\"content\"><!--introduction--><p>Today I welcome back guest blogger <a href=\"https:\/\/www.mathworks.com\/matlabcentral\/fileexchange\/authors\/80363\">Ben Tordoff<\/a> who previously wrote here on how to <a href=\"https:\/\/blogs.mathworks.com\/loren\/2011\/07\/18\/a-mandelbrot-set-on-the-gpu\">generate a fractal on a GPU<\/a>. He is going to continue this GPU theme below, looking at how to measure the performance of a GPU.<\/p><!--\/introduction--><h3>Contents<\/h3><div><ul><li><a href=\"#e4af56c5-a95b-4cf0-bb6e-25549223f062\">Measuring GPU Performance<\/a><\/li><li><a href=\"#af2f0a60-8f03-4279-9ac2-ce6b9466aba4\">How Timing Is Measured<\/a><\/li><li><a href=\"#1fa09fa2-c99c-4bb0-8b11-eb805fdd7040\">Test Host\/GPU Bandwidth<\/a><\/li><li><a href=\"#2d66586d-e246-40f9-ac8b-821eaa83055b\">Test Memory-Intensive Operations<\/a><\/li><li><a href=\"#b99b5df4-cdeb-456b-bef9-5236a9082ec7\">Test Computation-Intensive Calculations<\/a><\/li><li><a href=\"#3bb2db18-e8e7-43c6-8e60-66d0633602cc\">Comparing GPUs<\/a><\/li><li><a href=\"#ce5dce69-129b-44a7-87ea-8b7c8a9ac158\">Conclusions<\/a><\/li><\/ul><\/div><h4>Measuring GPU Performance<a name=\"e4af56c5-a95b-4cf0-bb6e-25549223f062\"><\/a><\/h4><p>Whether you are thinking about buying yourself a new beefy GPU or have just splashed-out on one, you may well be asking yourself how fast it is. In this article I will describe and attempt to measure some of the key performance characteristics of a GPU. 
This should give you some insight into the relative merits of using the GPU over the CPU and also some idea of how different GPUs compare to each other.<\/p><p>There is a vast array of benchmarks to choose from, so I have narrowed this down to three tests:<\/p><div><ul><li>How quickly can we send data to the GPU or read it back again?<\/li><li>How fast can the GPU kernel read and write data?<\/li><li>How fast can the GPU do computations?<\/li><\/ul><\/div><p>After measuring each of these, I can compare my GPU with other GPUs.<\/p><h4>How Timing Is Measured<a name=\"af2f0a60-8f03-4279-9ac2-ce6b9466aba4\"><\/a><\/h4><p>In the following sections, each test is repeated many times to allow for other activity on my PC and for first-call overheads. I keep the minimum of the results, because external factors can only ever slow down execution.<\/p><p>To get accurate timing figures I use <tt>wait(gpu)<\/tt> to ensure the GPU has finished working before stopping the timer. You should not do this in normal code. For best performance you want to let the GPU carry on working whilst the CPU gets on with other things. MATLAB automatically takes care of any synchronisation that is required.<\/p><p>I have put the code into a function so that variables are scoped. This can make a big difference in terms of memory performance since MATLAB is better able to re-use arrays.<\/p><pre class=\"codeinput\"><span class=\"keyword\">function<\/span> gpu_benchmarking\r\n<\/pre><pre class=\"codeinput\">gpu = gpuDevice();\r\nfprintf(<span class=\"string\">'I have a %s GPU.\\n'<\/span>, gpu.Name)\r\n<\/pre><pre class=\"codeoutput\">I have a Tesla C2075 GPU.\r\n<\/pre><h4>Test Host\/GPU Bandwidth<a name=\"1fa09fa2-c99c-4bb0-8b11-eb805fdd7040\"><\/a><\/h4><p>The first test tries to measure how quickly data can be sent to and read from the GPU. Since the GPU is plugged into the PCI bus, this largely depends on how good your PCI bus is and how many other things are using it. 
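<\/p><p>The minimum-of-repeats timing pattern described in the previous section can be captured in a small helper function. This is only an illustrative sketch; the name <tt>besttime<\/tt> and its interface are my own invention for this article, not part of MATLAB:<\/p><pre class=\"codeinput\"><span class=\"comment\">% Run f() several times and keep the fastest time, waiting for the<\/span>\r\n<span class=\"comment\">% GPU to finish before stopping the timer so the measurement is accurate.<\/span>\r\n<span class=\"keyword\">function<\/span> t = besttime(f, gpu, repeats)\r\nt = inf;\r\n<span class=\"keyword\">for<\/span> rr=1:repeats\r\n    timer = tic();\r\n    f();\r\n    wait(gpu);\r\n    t = min(t, toc(timer));\r\n<span class=\"keyword\">end<\/span>\r\n<\/pre><p>For example, <tt>besttime(@() gpuArray(data), gpu, 10)<\/tt> would return the fastest of ten transfers of <tt>data<\/tt> to the GPU.<\/p><p>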
However, there are also some overheads that are included in the measurements, particularly the function call overhead and the array allocation time. Since these are present in any \"real world\" use of the GPU it is reasonable to include these.<\/p><p>In the following tests, data is allocated\/sent to the GPU using the <tt>gpuArray<\/tt> function and allocated\/returned to host memory using <tt>gather<\/tt>. The arrays are created using <tt>uint8<\/tt> so that each element is a single byte.<\/p><p>Note that PCI express v2, as used in this test, has a theoretical bandwidth of 0.5GB\/s per lane. For the 16-lane slots (PCIe2 x16) used by NVIDIA's Tesla cards this gives a theoretical 8GB\/s.<\/p><pre class=\"codeinput\">sizes = power(2, 12:26);\r\nrepeats = 10;\r\n\r\nsendTimes = inf(size(sizes));\r\ngatherTimes = inf(size(sizes));\r\n<span class=\"keyword\">for<\/span> ii=1:numel(sizes)\r\n    data = randi([0 255], sizes(ii), 1, <span class=\"string\">'uint8'<\/span>);\r\n    <span class=\"keyword\">for<\/span> rr=1:repeats\r\n        timer = tic();\r\n        gdata = gpuArray(data);\r\n        wait(gpu);\r\n        sendTimes(ii) = min(sendTimes(ii), toc(timer));\r\n\r\n        timer = tic();\r\n        data2 = gather(gdata); <span class=\"comment\">%#ok&lt;NASGU&gt;<\/span>\r\n        gatherTimes(ii) = min(gatherTimes(ii), toc(timer));\r\n    <span class=\"keyword\">end<\/span>\r\n<span class=\"keyword\">end<\/span>\r\nsendBandwidth = (sizes.\/sendTimes)\/1e9;\r\n[maxSendBandwidth,maxSendIdx] = max(sendBandwidth);\r\nfprintf(<span class=\"string\">'Peak send speed is %g GB\/s\\n'<\/span>,maxSendBandwidth)\r\ngatherBandwidth = (sizes.\/gatherTimes)\/1e9;\r\n[maxGatherBandwidth,maxGatherIdx] = max(gatherBandwidth);\r\nfprintf(<span class=\"string\">'Peak gather speed is %g GB\/s\\n'<\/span>,max(gatherBandwidth))\r\n<\/pre><pre class=\"codeoutput\">Peak send speed is 5.70217 GB\/s\r\nPeak gather speed is 3.99077 GB\/s\r\n<\/pre><p>On the plot, you can see where the peak 
was achieved in each case (circled). At small sizes, the bandwidth of the PCI bus is irrelevant since the overheads dominate. At larger sizes the PCI bus is the limiting factor and the curves flatten out. Since the PC and all of the GPUs I have use the same PCI v2, there is little merit in comparing different GPUs. PCI v3 hardware is starting to appear though, so maybe this will become more interesting in future.<\/p><pre class=\"codeinput\">hold <span class=\"string\">off<\/span>\r\nsemilogx(sizes, sendBandwidth, <span class=\"string\">'b.-'<\/span>, sizes, gatherBandwidth, <span class=\"string\">'r.-'<\/span>)\r\nhold <span class=\"string\">on<\/span>\r\nsemilogx(sizes(maxSendIdx), maxSendBandwidth, <span class=\"string\">'bo-'<\/span>, <span class=\"string\">'MarkerSize'<\/span>, 10);\r\nsemilogx(sizes(maxGatherIdx), maxGatherBandwidth, <span class=\"string\">'ro-'<\/span>, <span class=\"string\">'MarkerSize'<\/span>, 10);\r\ngrid <span class=\"string\">on<\/span>\r\ntitle(<span class=\"string\">'Data Transfer Bandwidth'<\/span>)\r\nxlabel(<span class=\"string\">'Array size (bytes)'<\/span>)\r\nylabel(<span class=\"string\">'Transfer speed (GB\/s)'<\/span>)\r\nlegend(<span class=\"string\">'Send'<\/span>,<span class=\"string\">'Gather'<\/span>,<span class=\"string\">'Location'<\/span>,<span class=\"string\">'NorthWest'<\/span>)\r\n<\/pre><img decoding=\"async\" vspace=\"5\" hspace=\"5\" src=\"https:\/\/blogs.mathworks.com\/images\/loren\/2012\/gpu_benchmarking_01.png\" alt=\"\"> <h4>Test Memory-Intensive Operations<a name=\"2d66586d-e246-40f9-ac8b-821eaa83055b\"><\/a><\/h4><p>Many operations you might want to perform do very little computation with each element of an array and are therefore dominated by the time taken to fetch the data from memory or write it back. Functions such as ONES, ZEROS, NAN, TRUE only write their output, whereas functions like TRANSPOSE, TRIL\/TRIU both read and write but do no computation. 
Even simple operators like PLUS, MINUS, MTIMES do so little computation per element that they are bound only by the memory access speed.<\/p><p>I can use a simple PLUS operation to measure how fast my machine can read and write memory. This involves reading each double precision number (i.e., 8 bytes per element of the input), adding one and then writing it out again (i.e., another 8 bytes per element).<\/p><pre class=\"codeinput\">sizeOfDouble = 8;\r\nreadWritesPerElement = 2;\r\nmemoryTimesGPU = inf(size(sizes));\r\n<span class=\"keyword\">for<\/span> ii=1:numel(sizes)\r\n    numElements = sizes(ii)\/sizeOfDouble;\r\n    data = gpuArray.zeros(numElements, 1, <span class=\"string\">'double'<\/span>);\r\n    <span class=\"keyword\">for<\/span> rr=1:repeats\r\n        timer = tic();\r\n        <span class=\"keyword\">for<\/span> jj=1:100\r\n            data = data + 1;\r\n        <span class=\"keyword\">end<\/span>\r\n        wait(gpu);\r\n        memoryTimesGPU(ii) = min(memoryTimesGPU(ii), toc(timer)\/100);\r\n    <span class=\"keyword\">end<\/span>\r\n<span class=\"keyword\">end<\/span>\r\nmemoryBandwidth = readWritesPerElement*(sizes.\/memoryTimesGPU)\/1e9;\r\n[maxBWGPU, maxBWIdxGPU] = max(memoryBandwidth);\r\nfprintf(<span class=\"string\">'Peak read\/write speed on the GPU is %g GB\/s\\n'<\/span>,maxBWGPU)\r\n<\/pre><pre class=\"codeoutput\">Peak read\/write speed on the GPU is 110.993 GB\/s\r\n<\/pre><p>To know whether this is fast or not, I compare it with the same code running on the CPU. Note, however, that the CPU has several levels of caching and some oddities like \"read before write\" that can make the results look a little odd. 
For my PC the theoretical bandwidth of main memory is 32GB\/s, so anything above this is likely to be due to efficient caching.<\/p><pre class=\"codeinput\">memoryTimesHost = inf(size(sizes));\r\n<span class=\"keyword\">for<\/span> ii=1:numel(sizes)\r\n    numElements = sizes(ii)\/sizeOfDouble;\r\n    <span class=\"keyword\">for<\/span> rr=1:repeats\r\n        hostData = zeros(numElements,1);\r\n        timer = tic();\r\n        <span class=\"keyword\">for<\/span> jj=1:100\r\n            hostData = hostData + 1;\r\n        <span class=\"keyword\">end<\/span>\r\n        memoryTimesHost(ii) = min(memoryTimesHost(ii), toc(timer)\/100);\r\n    <span class=\"keyword\">end<\/span>\r\n<span class=\"keyword\">end<\/span>\r\nmemoryBandwidthHost = 2*(sizes.\/memoryTimesHost)\/1e9;\r\n[maxBWHost, maxBWHostIdx] = max(memoryBandwidthHost);\r\nfprintf(<span class=\"string\">'Peak write speed on the host is %g GB\/s\\n'<\/span>,maxBWHost)\r\n\r\n<span class=\"comment\">% Plot CPU and GPU results.<\/span>\r\nhold <span class=\"string\">off<\/span>\r\nsemilogx(sizes, memoryBandwidth, <span class=\"string\">'b.-'<\/span>, <span class=\"keyword\">...<\/span>\r\n    sizes, memoryBandwidthHost, <span class=\"string\">'k.-'<\/span>)\r\nhold <span class=\"string\">on<\/span>\r\nsemilogx(sizes(maxBWIdxGPU), maxBWGPU, <span class=\"string\">'bo-'<\/span>, <span class=\"string\">'MarkerSize'<\/span>, 10);\r\nsemilogx(sizes(maxBWHostIdx), maxBWHost, <span class=\"string\">'ko-'<\/span>, <span class=\"string\">'MarkerSize'<\/span>, 10);\r\ngrid <span class=\"string\">on<\/span>\r\ntitle(<span class=\"string\">'Read\/Write Bandwidth'<\/span>)\r\nxlabel(<span class=\"string\">'Array size (bytes)'<\/span>)\r\nylabel(<span class=\"string\">'Speed (GB\/s)'<\/span>)\r\nlegend(<span class=\"string\">'Read+Write (GPU)'<\/span>,<span class=\"string\">'Read+Write (host)'<\/span>,<span class=\"string\">'Location'<\/span>,<span class=\"string\">'NorthWest'<\/span>)\r\n<\/pre><pre 
class=\"codeoutput\">Peak write speed on the host is 44.6868 GB\/s\r\n<\/pre><img decoding=\"async\" vspace=\"5\" hspace=\"5\" src=\"https:\/\/blogs.mathworks.com\/images\/loren\/2012\/gpu_benchmarking_02.png\" alt=\"\"> <p>It is clear that GPUs can read and write their memory much faster than they can get data from the host. Therefore, when writing code you must minimize the number of host-GPU or GPU-host transfers. You must transfer the data to the GPU, then do as much with it as possible whilst on the GPU, and only bring it back to the host when you absolutely need to. Even better, create the data on the GPU to start with if you can.<\/p><h4>Test Computation-Intensive Calculations<a name=\"b99b5df4-cdeb-456b-bef9-5236a9082ec7\"><\/a><\/h4><p>For operations where computation dominates, the memory speed is much less important. In this case you are probably more interested in how fast the computations are performed. A good test of computational performance is a matrix-matrix multiply. For multiplying two NxN matrices, the total number of floating-point calculations is<\/p><p>$FLOPS(N) = 2N^3 - N^2$<\/p><p>As above, I time this operation on both the host PC and the GPU to see their relative processing power:<\/p><pre class=\"codeinput\">sizes = power(2, 12:2:24);\r\nN = sqrt(sizes);\r\nmmTimesHost = inf(size(sizes));\r\nmmTimesGPU = inf(size(sizes));\r\n<span class=\"keyword\">for<\/span> ii=1:numel(sizes)\r\n    A = rand( N(ii), N(ii) );\r\n    B = rand( N(ii), N(ii) );\r\n    <span class=\"comment\">% First do it on the host<\/span>\r\n    <span class=\"keyword\">for<\/span> rr=1:repeats\r\n        timer = tic();\r\n        C = A*B; <span class=\"comment\">%#ok&lt;NASGU&gt;<\/span>\r\n        mmTimesHost(ii) = min( mmTimesHost(ii), toc(timer));\r\n    <span class=\"keyword\">end<\/span>\r\n    <span class=\"comment\">% Now on the GPU<\/span>\r\n    A = gpuArray(A);\r\n    B = gpuArray(B);\r\n    <span class=\"keyword\">for<\/span> rr=1:repeats\r\n        timer = 
tic();\r\n        C = A*B; <span class=\"comment\">%#ok&lt;NASGU&gt;<\/span>\r\n        wait(gpu);\r\n        mmTimesGPU(ii) = min( mmTimesGPU(ii), toc(timer));\r\n    <span class=\"keyword\">end<\/span>\r\n<span class=\"keyword\">end<\/span>\r\nmmGFlopsHost = (2*N.^3 - N.^2).\/mmTimesHost\/1e9;\r\n[maxGFlopsHost,maxGFlopsHostIdx] = max(mmGFlopsHost);\r\nmmGFlopsGPU = (2*N.^3 - N.^2).\/mmTimesGPU\/1e9;\r\n[maxGFlopsGPU,maxGFlopsGPUIdx] = max(mmGFlopsGPU);\r\nfprintf(<span class=\"string\">'Peak calculation rate: %1.1f GFLOPS (host), %1.1f GFLOPS (GPU)\\n'<\/span>, <span class=\"keyword\">...<\/span>\r\n    maxGFlopsHost, maxGFlopsGPU)\r\n<\/pre><pre class=\"codeoutput\">Peak calculation rate: 73.7 GFLOPS (host), 330.9 GFLOPS (GPU)\r\n<\/pre><p>Now plot it to see where the peak was achieved.<\/p><pre class=\"codeinput\">hold <span class=\"string\">off<\/span>\r\nsemilogx(sizes, mmGFlopsGPU, <span class=\"string\">'b.-'<\/span>, sizes, mmGFlopsHost, <span class=\"string\">'k.-'<\/span>)\r\nhold <span class=\"string\">on<\/span>\r\nsemilogx(sizes(maxGFlopsGPUIdx), maxGFlopsGPU, <span class=\"string\">'bo-'<\/span>, <span class=\"string\">'MarkerSize'<\/span>, 10);\r\nsemilogx(sizes(maxGFlopsHostIdx), maxGFlopsHost, <span class=\"string\">'ko-'<\/span>, <span class=\"string\">'MarkerSize'<\/span>, 10);\r\ngrid <span class=\"string\">on<\/span>\r\ntitle(<span class=\"string\">'Matrix-multiply calculation rate'<\/span>)\r\nxlabel(<span class=\"string\">'Matrix size (edge length)'<\/span>)\r\nylabel(<span class=\"string\">'Calculation Rate (GFLOPS)'<\/span>)\r\nlegend(<span class=\"string\">'GPU'<\/span>,<span class=\"string\">'Host'<\/span>,<span class=\"string\">'Location'<\/span>,<span class=\"string\">'NorthWest'<\/span>)\r\n<\/pre><img decoding=\"async\" vspace=\"5\" hspace=\"5\" src=\"https:\/\/blogs.mathworks.com\/images\/loren\/2012\/gpu_benchmarking_03.png\" alt=\"\"> <h4>Comparing GPUs<a name=\"3bb2db18-e8e7-43c6-8e60-66d0633602cc\"><\/a><\/h4><p>After measuring 
both the memory bandwidth and calculation performance, I can now compare my GPU to others.  Previously I ran these tests on a couple of different GPUs and stored the results in a data file.<\/p><pre class=\"codeinput\">offline = load(<span class=\"string\">'gpuBenchmarkResults.mat'<\/span>);\r\nnames = [<span class=\"string\">'This GPU'<\/span> <span class=\"string\">'This host'<\/span> offline.names];\r\nioData = [maxBWGPU maxBWHost offline.memoryBandwidth];\r\ncalcData = [maxGFlopsGPU maxGFlopsHost offline.mmGFlops];\r\n\r\nsubplot(1,2,1)\r\nbar( [ioData(:),nan(numel(ioData),1)]', <span class=\"string\">'grouped'<\/span> );\r\nset( gca, <span class=\"string\">'Xlim'<\/span>, [0.6 1.4], <span class=\"string\">'XTick'<\/span>, [] );\r\nlegend(names{:})\r\ntitle(<span class=\"string\">'Memory Bandwidth'<\/span>), ylabel(<span class=\"string\">'GB\/sec'<\/span>)\r\n\r\nsubplot(1,2,2)\r\nbar( [calcData(:),nan(numel(calcData),1)]', <span class=\"string\">'grouped'<\/span> );\r\nset( gca, <span class=\"string\">'Xlim'<\/span>, [0.6 1.4], <span class=\"string\">'XTick'<\/span>, [] );\r\ntitle(<span class=\"string\">'Calculation Speed'<\/span>), ylabel(<span class=\"string\">'GFLOPS'<\/span>)\r\n\r\nset(gcf, <span class=\"string\">'Position'<\/span>, get(gcf,<span class=\"string\">'Position'<\/span>)+[0 0 300 0]);\r\n<\/pre><img decoding=\"async\" vspace=\"5\" hspace=\"5\" src=\"https:\/\/blogs.mathworks.com\/images\/loren\/2012\/gpu_benchmarking_04.png\" alt=\"\"> <h4>Conclusions<a name=\"ce5dce69-129b-44a7-87ea-8b7c8a9ac158\"><\/a><\/h4><p>These tests reveal a few things about how GPUs behave:<\/p><div><ul><li>Transfers from host memory to GPU memory and back are relatively slow, &lt;6GB\/s in my case.<\/li><li>A good GPU can read\/write its memory much faster than the host PC can read\/write its memory.<\/li><li>Given large enough data, GPUs can perform calculations much faster than the host PC, more than four times faster in my case.<\/li><\/ul><\/div><p>Noticeable in 
each test is that you need quite large arrays to fully saturate your GPU, whether limited by memory or by computation. You get the most from your GPU when working with millions of elements at once.<\/p><p>If you are interested in a more detailed benchmark of your GPU's performance, have a look at <a href=\"https:\/\/www.mathworks.com\/matlabcentral\/fileexchange\/340-saveppt\">GPUBench<\/a> on the <a href=\"https:\/\/www.mathworks.com\/matlabcentral\/fileexchange\">MATLAB Central File Exchange<\/a>.<\/p><p>If you have questions about these measurements or spot something I've done wrong or that could be improved, leave me a comment <a href=\"https:\/\/blogs.mathworks.com\/loren\/?p=588#respond\">here<\/a>.<\/p><script language=\"JavaScript\"> <!-- \r\n    function grabCode_05a6c77039c348ad8c824fc0561ee777() {\r\n        \/\/ Remember the title so we can use it in the new page\r\n        title = document.title;\r\n\r\n        \/\/ Break up these strings so that their presence\r\n        \/\/ in the Javascript doesn't mess up the search for\r\n        \/\/ the MATLAB code.\r\n        t1='05a6c77039c348ad8c824fc0561ee777 ' + '##### ' + 'SOURCE BEGIN' + ' #####';\r\n        t2='##### ' + 'SOURCE END' + ' #####' + ' 05a6c77039c348ad8c824fc0561ee777';\r\n    \r\n        b=document.getElementsByTagName('body')[0];\r\n        i1=b.innerHTML.indexOf(t1)+t1.length;\r\n        i2=b.innerHTML.indexOf(t2);\r\n \r\n        code_string = b.innerHTML.substring(i1, i2);\r\n        code_string = code_string.replace(\/REPLACE_WITH_DASH_DASH\/g,'--');\r\n\r\n        \/\/ Use \/x3C\/g instead of the less-than character to avoid errors \r\n        \/\/ in the XML parser.\r\n        \/\/ Use '\\x26#60;' instead of '<' so that the XML parser\r\n        \/\/ doesn't go ahead and substitute the less-than character. 
\r\n        code_string = code_string.replace(\/\\x3C\/g, '\\x26#60;');\r\n\r\n        copyright = 'Copyright 2012 The MathWorks, Inc.';\r\n\r\n        w = window.open();\r\n        d = w.document;\r\n        d.write('<pre>\\n');\r\n        d.write(code_string);\r\n\r\n        \/\/ Add copyright line at the bottom if specified.\r\n        if (copyright.length > 0) {\r\n            d.writeln('');\r\n            d.writeln('%%');\r\n            if (copyright.length > 0) {\r\n                d.writeln('% _' + copyright + '_');\r\n            }\r\n        }\r\n\r\n        d.write('<\/pre>\\n');\r\n\r\n        d.title = title + ' (MATLAB code)';\r\n        d.close();\r\n    }   \r\n     --> <\/script><p style=\"text-align: right; font-size: xx-small; font-weight:lighter;   font-style: italic; color: gray\"><br><a href=\"javascript:grabCode_05a6c77039c348ad8c824fc0561ee777()\"><span style=\"font-size: x-small;        font-style: italic;\">Get \r\n      the MATLAB code <noscript>(requires JavaScript)<\/noscript><\/span><\/a><br><br>\r\n      Published with MATLAB&reg; R2012b<br><\/p><p class=\"footer\"><br>\r\n      Published with MATLAB&reg; R2012b<br><\/p><\/div><!--\r\n05a6c77039c348ad8c824fc0561ee777 ##### SOURCE BEGIN #####\r\n%% Measuring GPU Performance\r\n% Today I welcome back guest blogger\r\n% <https:\/\/www.mathworks.com\/matlabcentral\/fileexchange\/authors\/80363 Ben Tordoff>\r\n% who previously wrote here on how to\r\n% <https:\/\/blogs.mathworks.com\/loren\/2011\/07\/18\/a-mandelbrot-set-on-the-gpu\r\n% generate a fractal on a GPU>. He is going to continue this GPU\r\n% theme below, looking at how to measure the performance of a GPU.\r\n\r\n%% Measuring GPU Performance\r\n% Whether you are thinking about buying yourself a new beefy GPU or have\r\n% just splashed-out on one, you may well be asking yourself how fast it is.\r\n% In this article I will describe and attempt to measure some of the key\r\n% performance characteristics of a GPU. 
This should give you some insight\r\n% into the relative merits of using the GPU over the CPU and also some idea\r\n% of how different GPUs compare to each other.\r\n%\r\n% There are a vast array of benchmarks to choose from, so I have\r\n% narrowed this down to three tests.:\r\n%\r\n% * How quickly can we send data to the GPU or read it back again?\r\n% * How fast can the GPU kernel read and write data?\r\n% * How fast can the GPU do computations?\r\n%\r\n% After measuring each of these, I can compare my GPU with other\r\n% GPUs.\r\n\r\n%% How Timing Is Measured\r\n% In the following sections, each test is repeated many times to allow for\r\n% other activities going on on my PC and the first-call overheads. I keep\r\n% the minimum of the results, because external factors can only ever slow\r\n% down execution.\r\n%\r\n% To get accurate timing figures I use |wait(gpu)| to ensure the GPU has\r\n% finished working before stopping the timer. You should not do this in\r\n% normal code. For best performance you want to let the GPU carry on\r\n% working whilst the CPU gets on with other things. MATLAB automatically\r\n% takes care of any synchronisation that is required.\r\n%\r\n% I have put the code into a function so that variables are scoped. This\r\n% can make a big difference in terms of memory performance since MATLAB is\r\n% better able to re-use arrays.\r\nfunction gpu_benchmarking\r\ngpu = gpuDevice();\r\nfprintf('I have a %s GPU.\\n', gpu.Name)\r\n\r\n%% Test Host\/GPU Bandwidth\r\n% The first test tries to measure how quickly data can be sent-to and\r\n% read-from the GPU. Since the GPU is plugged into the PCI bus, this\r\n% largely depends on how good your PCI bus is and how many other things are\r\n% using it. However, there are also some overheads that are included in the\r\n% measurements, particularly the function call overhead and the array\r\n% allocation time. 
Since these are present in any \"real world\" use of the\r\n% GPU it is reasonable to include these.\r\n%\r\n% In the following tests, data is allocated\/sent to the GPU\r\n% using the |gpuArray| function and allocated\/returned to host memory\r\n% using |gather|. The arrays are created using |uint8| so that each element\r\n% is a single byte.\r\n%\r\n% Note that PCI express v2, as used in this test, has a theoretical\r\n% bandwidth of 0.5GB\/s per lane. For the 16-lane slots (PCIe2 x16) used by\r\n% NVIDIA's Tesla cards this gives a theoretical 8GB\/s.\r\nsizes = power(2, 12:26);\r\nrepeats = 10;\r\n\r\nsendTimes = inf(size(sizes));\r\ngatherTimes = inf(size(sizes));\r\nfor ii=1:numel(sizes)\r\n    data = randi([0 255], sizes(ii), 1, 'uint8');\r\n    for rr=1:repeats\r\n        timer = tic();\r\n        gdata = gpuArray(data);\r\n        wait(gpu);\r\n        sendTimes(ii) = min(sendTimes(ii), toc(timer));\r\n        \r\n        timer = tic();\r\n        data2 = gather(gdata); %#ok<NASGU>\r\n        gatherTimes(ii) = min(gatherTimes(ii), toc(timer));\r\n    end\r\nend\r\nsendBandwidth = (sizes.\/sendTimes)\/1e9;\r\n[maxSendBandwidth,maxSendIdx] = max(sendBandwidth);\r\nfprintf('Peak send speed is %g GB\/s\\n',maxSendBandwidth)\r\ngatherBandwidth = (sizes.\/gatherTimes)\/1e9;\r\n[maxGatherBandwidth,maxGatherIdx] = max(gatherBandwidth);\r\nfprintf('Peak gather speed is %g GB\/s\\n',max(gatherBandwidth))\r\n\r\n\r\n%%\r\n% On the plot, you can see where the peak was achieved in each case\r\n% (circled). At small sizes, the bandwidth of the PCI bus is irrelevant\r\n% since the overheads dominate. At larger sizes the PCI bus is the limiting\r\n% factor and the curves flatten out. Since the PC and all of the GPUs I\r\n% have use the same PCI v2, there is little merit in comparing different\r\n% GPUs. 
PCI v3 hardware is starting to appear though, so maybe this will\r\n% become more interesting in future.\r\nhold off\r\nsemilogx(sizes, sendBandwidth, 'b.-', sizes, gatherBandwidth, 'r.-')\r\nhold on\r\nsemilogx(sizes(maxSendIdx), maxSendBandwidth, 'bo-', 'MarkerSize', 10);\r\nsemilogx(sizes(maxGatherIdx), maxGatherBandwidth, 'ro-', 'MarkerSize', 10);\r\ngrid on\r\ntitle('Data Transfer Bandwidth')\r\nxlabel('Array size (bytes)')\r\nylabel('Transfer speed (GB\/s)')\r\nlegend('Send','Gather','Location','NorthWest')\r\n\r\n%% Test Memory-Intensive Operations\r\n% Many operations you might want to perform do very little computation with\r\n% each element of an array and are therefore dominated by the time taken to\r\n% fetch the data from memory or write it back. Functions such as ONES,\r\n% ZEROS, NAN, TRUE only write their output, whereas functions like\r\n% TRANSPOSE, TRIL\/TRIU both read and write but do no computation. Even\r\n% simple operators like PLUS, MINUS, MTIMES do so little computation\r\n% per element that they are bound only by the memory access speed.\r\n%\r\n% I can use a simple PLUS operation to measure how fast my machine can read\r\n% and write memory. 
This involves reading each double precision number\r\n% (i.e., 8 bytes per element of the input), adding one and then writing it\r\n% out again (i.e., another 8 bytes per element).\r\nsizeOfDouble = 8;\r\nreadWritesPerElement = 2;\r\nmemoryTimesGPU = inf(size(sizes));\r\nfor ii=1:numel(sizes)\r\n    numElements = sizes(ii)\/sizeOfDouble;\r\n    data = gpuArray.zeros(numElements, 1, 'double');\r\n    for rr=1:repeats\r\n        timer = tic();\r\n        for jj=1:100\r\n            data = data + 1;\r\n        end\r\n        wait(gpu);\r\n        memoryTimesGPU(ii) = min(memoryTimesGPU(ii), toc(timer)\/100);\r\n    end\r\nend\r\nmemoryBandwidth = readWritesPerElement*(sizes.\/memoryTimesGPU)\/1e9;\r\n[maxBWGPU, maxBWIdxGPU] = max(memoryBandwidth);\r\nfprintf('Peak read\/write speed on the GPU is %g GB\/s\\n',maxBWGPU)\r\n\r\n%%\r\n% To know whether this is fast or not, I compare it with the same code\r\n% running on the CPU. Note, however, that the CPU has several levels of\r\n% caching and some oddities like \"read before write\" that can make the\r\n% results look a little odd. For my PC the theoretical bandwidth of main\r\n% memory is 32GB\/s, so anything above this is likely to be due to efficient\r\n% caching. 
\r\nmemoryTimesHost = inf(size(sizes));\r\nfor ii=1:numel(sizes)\r\n    numElements = sizes(ii)\/sizeOfDouble;\r\n    for rr=1:repeats\r\n        hostData = zeros(numElements,1);\r\n        timer = tic();\r\n        for jj=1:100\r\n            hostData = hostData + 1;\r\n        end\r\n        memoryTimesHost(ii) = min(memoryTimesHost(ii), toc(timer)\/100);\r\n    end\r\nend\r\nmemoryBandwidthHost = 2*(sizes.\/memoryTimesHost)\/1e9;\r\n[maxBWHost, maxBWHostIdx] = max(memoryBandwidthHost);\r\nfprintf('Peak write speed on the host is %g GB\/s\\n',maxBWHost)\r\n\r\n% Plot CPU and GPU results.\r\nhold off\r\nsemilogx(sizes, memoryBandwidth, 'b.-', ...\r\n    sizes, memoryBandwidthHost, 'k.-')\r\nhold on\r\nsemilogx(sizes(maxBWIdxGPU), maxBWGPU, 'bo-', 'MarkerSize', 10);\r\nsemilogx(sizes(maxBWHostIdx), maxBWHost, 'ko-', 'MarkerSize', 10);\r\ngrid on\r\ntitle('Read\/Write Bandwidth')\r\nxlabel('Array size (bytes)')\r\nylabel('Speed (GB\/s)')\r\nlegend('Read+Write (GPU)','Read+Write (host)','Location','NorthWest')\r\n\r\n%%\r\n% It is clear that GPUs can read and write their memory much faster\r\n% than they can get data from the host. Therefore, when writing code you\r\n% must minimize the number of host-GPU or GPU-host transfers. You must\r\n% transfer the data to the GPU, then do as much with it as possible whilst\r\n% on the GPU, and only bring it back to the host when you absolutely need\r\n% to. Even better, create the data on the GPU to start with if you can.\r\n\r\n\r\n%% Test Computation-Intensive Calculations\r\n% For operations where computation dominates, the memory speed is much less\r\n% important. In this case you are probably more interested in how fast the\r\n% computations are performed. A good test of computational performance is a\r\n% matrix-matrix multiply. 
For multiplying two NxN matrices, the total\r\n% number of floating-point calculations is\r\n%\r\n% $FLOPS(N) = 2N^3 - N^2$\r\n%\r\n% As above, I time this operation on both the host PC and the GPU to see\r\n% their relative processing power: \r\nsizes = power(2, 12:2:24);\r\nN = sqrt(sizes);\r\nmmTimesHost = inf(size(sizes));\r\nmmTimesGPU = inf(size(sizes));\r\nfor ii=1:numel(sizes)\r\n    A = rand( N(ii), N(ii) );\r\n    B = rand( N(ii), N(ii) );\r\n    % First do it on the host\r\n    for rr=1:repeats\r\n        timer = tic();\r\n        C = A*B; %#ok<NASGU>\r\n        mmTimesHost(ii) = min( mmTimesHost(ii), toc(timer));\r\n    end\r\n    % Now on the GPU\r\n    A = gpuArray(A);\r\n    B = gpuArray(B);\r\n    for rr=1:repeats\r\n        timer = tic();\r\n        C = A*B; %#ok<NASGU>\r\n        wait(gpu);\r\n        mmTimesGPU(ii) = min( mmTimesGPU(ii), toc(timer));\r\n    end\r\nend\r\nmmGFlopsHost = (2*N.^3 - N.^2).\/mmTimesHost\/1e9;\r\n[maxGFlopsHost,maxGFlopsHostIdx] = max(mmGFlopsHost);\r\nmmGFlopsGPU = (2*N.^3 - N.^2).\/mmTimesGPU\/1e9;\r\n[maxGFlopsGPU,maxGFlopsGPUIdx] = max(mmGFlopsGPU);\r\nfprintf('Peak calculation rate: %1.1f GFLOPS (host), %1.1f GFLOPS (GPU)\\n', ...\r\n    maxGFlopsHost, maxGFlopsGPU)\r\n\r\n%%\r\n% Now plot it to see where the peak was achieved.\r\nhold off\r\nsemilogx(sizes, mmGFlopsGPU, 'b.-', sizes, mmGFlopsHost, 'k.-')\r\nhold on\r\nsemilogx(sizes(maxGFlopsGPUIdx), maxGFlopsGPU, 'bo-', 'MarkerSize', 10);\r\nsemilogx(sizes(maxGFlopsHostIdx), maxGFlopsHost, 'ko-', 'MarkerSize', 10);\r\ngrid on\r\ntitle('Matrix-multiply calculation rate')\r\nxlabel('Matrix size (edge length)')\r\nylabel('Calculation Rate (GFLOPS)')\r\nlegend('GPU','Host','Location','NorthWest')\r\n\r\n\r\n%% Comparing GPUs\r\n% After measuring both the memory bandwidth and calculation performance, I\r\n% can now compare my GPU to others.  
Previously I ran these tests on a\r\n% couple of different GPUs and stored the results in a data-file.\r\noffline = load('gpuBenchmarkResults.mat');\r\nnames = ['This GPU' 'This host' offline.names];\r\nioData = [maxBWGPU maxBWHost offline.memoryBandwidth];\r\ncalcData = [maxGFlopsGPU maxGFlopsHost offline.mmGFlops];\r\n\r\nsubplot(1,2,1)\r\nbar( [ioData(:),nan(numel(ioData),1)]', 'grouped' );\r\nset( gca, 'Xlim', [0.6 1.4], 'XTick', [] );\r\nlegend(names{:})\r\ntitle('Memory Bandwidth'), ylabel('GB\/sec')\r\n\r\nsubplot(1,2,2)\r\nbar( [calcData(:),nan(numel(calcData),1)]', 'grouped' );\r\nset( gca, 'Xlim', [0.6 1.4], 'XTick', [] );\r\ntitle('Calculation Speed'), ylabel('GFLOPS')\r\n\r\nset(gcf, 'Position', get(gcf,'Position')+[0 0 300 0]);\r\n\r\n\r\n%% Conclusions\r\n% These tests reveal a few things about how GPUs behave:\r\n%\r\n% * Transfers from host memory to GPU memory and back are relatively slow,\r\n% <6 GB\/s in my case.\r\n% * A good GPU can read\/write its memory much faster than the host PC can\r\n% read\/write its memory.\r\n% * Given large enough data, GPUs can perform calculations much faster than\r\n% the host PC, more than four times faster in my case.\r\n%\r\n% Noticeable in each test is that you need quite large arrays to fully\r\n% saturate your GPU, whether limited by memory or by computation. 
You get\r\n% the most from your GPU when working with millions of elements at once.\r\n%\r\n% If you are interested in a more detailed benchmark of your GPU's\r\n% performance, have a look at\r\n% <https:\/\/www.mathworks.com\/matlabcentral\/fileexchange\/34080-gpubench GPUBench> on\r\n% the <https:\/\/www.mathworks.com\/matlabcentral\/fileexchange MATLAB Central\r\n% File Exchange>.\r\n%\r\n% If you have questions about these measurements or spot something I've\r\n% done wrong or that could be improved, leave me a comment\r\n% <https:\/\/blogs.mathworks.com\/loren\/?p=588#respond here>.\r\n##### SOURCE END ##### 05a6c77039c348ad8c824fc0561ee777\r\n-->","protected":false},"excerpt":{"rendered":"<!--introduction--><p>Today I welcome back guest blogger <a href=\"https:\/\/www.mathworks.com\/matlabcentral\/fileexchange\/authors\/80363\">Ben Tordoff<\/a> who previously wrote here on how to <a href=\"https:\/\/blogs.mathworks.com\/loren\/2011\/07\/18\/a-mandelbrot-set-on-the-gpu\">generate a fractal on a GPU<\/a>. He is going to continue this GPU theme below, looking at how to measure the performance of a GPU.... 
<a class=\"read-more\" href=\"https:\/\/blogs.mathworks.com\/loren\/2012\/12\/14\/measuring-gpu-performance\/\">read more >><\/a><\/p>","protected":false},"author":39,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":[],"categories":[55,34],"tags":[],"_links":{"self":[{"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/posts\/588"}],"collection":[{"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/users\/39"}],"replies":[{"embeddable":true,"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/comments?post=588"}],"version-history":[{"count":14,"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/posts\/588\/revisions"}],"predecessor-version":[{"id":716,"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/posts\/588\/revisions\/716"}],"wp:attachment":[{"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/media?parent=588"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/categories?post=588"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blogs.mathworks.com\/loren\/wp-json\/wp\/v2\/tags?post=588"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}