# Measuring GPU Performance

Posted by Loren Shure,

Today I welcome back guest blogger Ben Tordoff who previously wrote here on how to generate a fractal on a GPU. He is going to continue this GPU theme below, looking at how to measure the performance of a GPU.


Whether you are thinking about buying yourself a new beefy GPU or have just splashed out on one, you may well be asking yourself how fast it is. In this article I will describe and attempt to measure some of the key performance characteristics of a GPU. This should give you some insight into the relative merits of using the GPU over the CPU, and also some idea of how different GPUs compare to each other.

There is a vast array of benchmarks to choose from, so I have narrowed it down to three tests:

• How quickly can we send data to the GPU or read it back again?
• How fast can the GPU kernel read and write data?
• How fast can the GPU do computations?

After measuring each of these, I can compare my GPU with other GPUs.

#### How Timing Is Measured

In the following sections, each test is repeated many times to allow for other activity on my PC and for first-call overheads. I keep the minimum of the measured times, because external factors can only ever slow down execution.

To get accurate timing figures I use wait(gpu) to ensure the GPU has finished working before stopping the timer. You should not do this in normal code. For best performance you want to let the GPU carry on working whilst the CPU gets on with other things. MATLAB automatically takes care of any synchronisation that is required.

I have put the code into a function so that variables are scoped. This can make a big difference in terms of memory performance since MATLAB is better able to re-use arrays.

function gpu_benchmarking

gpu = gpuDevice();
fprintf('I have a %s GPU.\n', gpu.Name)

I have a Tesla C2075 GPU.


#### Test Host/GPU Bandwidth

The first test tries to measure how quickly data can be sent to and read from the GPU. Since the GPU is plugged into the PCI bus, this largely depends on how good your PCI bus is and how many other things are using it. However, the measurements also include some overheads, particularly the function-call overhead and the array-allocation time. Since these are present in any "real world" use of the GPU, it is reasonable to include them.

In the following tests, data is allocated/sent to the GPU using the gpuArray function and allocated/returned to host memory using gather. The arrays are created using uint8 so that each element is a single byte.

Note that PCI express v2, as used in this test, has a theoretical bandwidth of 0.5GB/s per lane. For the 16-lane slots (PCIe2 x16) used by NVIDIA's Tesla cards this gives a theoretical 8GB/s.

sizes = power(2, 12:26);
repeats = 10;

sendTimes = inf(size(sizes));
gatherTimes = inf(size(sizes));
for ii=1:numel(sizes)
data = randi([0 255], sizes(ii), 1, 'uint8');
for rr=1:repeats
timer = tic();
gdata = gpuArray(data);
wait(gpu);
sendTimes(ii) = min(sendTimes(ii), toc(timer));

timer = tic();
data2 = gather(gdata); %#ok<NASGU>
gatherTimes(ii) = min(gatherTimes(ii), toc(timer));
end
end
sendBandwidth = (sizes./sendTimes)/1e9;
[maxSendBandwidth,maxSendIdx] = max(sendBandwidth);
fprintf('Peak send speed is %g GB/s\n',maxSendBandwidth)
gatherBandwidth = (sizes./gatherTimes)/1e9;
[maxGatherBandwidth,maxGatherIdx] = max(gatherBandwidth);
fprintf('Peak gather speed is %g GB/s\n',maxGatherBandwidth)

Peak send speed is 5.70217 GB/s
Peak gather speed is 3.99077 GB/s


On the plot, you can see where the peak was achieved in each case (circled). At small sizes the bandwidth of the PCI bus is irrelevant, since the overheads dominate. At larger sizes the PCI bus is the limiting factor and the curves flatten out. Since my PC and all of the GPUs I have use the same PCIe v2 bus, there is little merit in comparing different GPUs here. PCIe v3 hardware is starting to appear, though, so maybe this will become more interesting in future.

hold off
semilogx(sizes, sendBandwidth, 'b.-', sizes, gatherBandwidth, 'r.-')
hold on
semilogx(sizes(maxSendIdx), maxSendBandwidth, 'bo-', 'MarkerSize', 10);
semilogx(sizes(maxGatherIdx), maxGatherBandwidth, 'ro-', 'MarkerSize', 10);
grid on
title('Data Transfer Bandwidth')
xlabel('Array size (bytes)')
ylabel('Transfer speed (GB/s)')
legend('Send','Gather','Location','NorthWest')


#### Test Memory-Intensive Operations

Many operations you might want to perform do very little computation with each element of an array and are therefore dominated by the time taken to fetch the data from memory or write it back. Functions such as ONES, ZEROS, NAN and TRUE only write their output, whereas functions like TRANSPOSE and TRIL/TRIU both read and write but do no computation. Even simple operators like PLUS, MINUS and TIMES (element-wise multiplication) do so little computation per element that they are bound only by the memory access speed.

I can use a simple PLUS operation to measure how fast my machine can read and write memory. This involves reading each double precision number (i.e., 8 bytes per element of the input), adding one and then writing it out again (i.e., another 8 bytes per element).

sizeOfDouble = 8;
memoryTimesGPU = inf(size(sizes));
for ii=1:numel(sizes)
numElements = sizes(ii)/sizeOfDouble;
data = gpuArray.zeros(numElements, 1, 'double');
for rr=1:repeats
timer = tic();
for jj=1:100
data = data + 1;
end
wait(gpu);
memoryTimesGPU(ii) = min(memoryTimesGPU(ii), toc(timer)/100);
end
end
memoryBandwidthGPU = 2*(sizes./memoryTimesGPU)/1e9; % read + write, in GB/s
[maxBWGPU, maxBWIdxGPU] = max(memoryBandwidthGPU);
fprintf('Peak read/write speed on the GPU is %g GB/s\n',maxBWGPU)

Peak read/write speed on the GPU is 110.993 GB/s


To know whether this is fast or not, I compare it with the same code running on the CPU. Note, however, that the CPU has several levels of caching and some oddities like "read before write" that can make the results look a little odd. For my PC the theoretical bandwidth of main memory is 32GB/s, so anything above this is likely to be due to efficient caching.

memoryTimesHost = inf(size(sizes));
for ii=1:numel(sizes)
numElements = sizes(ii)/sizeOfDouble;
for rr=1:repeats
hostData = zeros(numElements,1);
timer = tic();
for jj=1:100
hostData = hostData + 1;
end
memoryTimesHost(ii) = min(memoryTimesHost(ii), toc(timer)/100);
end
end
memoryBandwidthHost = 2*(sizes./memoryTimesHost)/1e9;
[maxBWHost, maxBWHostIdx] = max(memoryBandwidthHost);
fprintf('Peak read/write speed on the host is %g GB/s\n',maxBWHost)

% Plot CPU and GPU results.
hold off
semilogx(sizes, memoryBandwidthGPU, 'b.-', ...
sizes, memoryBandwidthHost, 'k.-')
hold on
semilogx(sizes(maxBWIdxGPU), maxBWGPU, 'bo-', 'MarkerSize', 10);
semilogx(sizes(maxBWHostIdx), maxBWHost, 'ko-', 'MarkerSize', 10);
grid on
title('Read/Write Bandwidth')
xlabel('Array size (bytes)')
ylabel('Speed (GB/s)')
legend('GPU','Host','Location','NorthWest')

Peak read/write speed on the host is 44.6868 GB/s


It is clear that GPUs can read and write their memory much faster than they can get data from the host. Therefore, when writing code you should minimize the number of host-to-GPU or GPU-to-host transfers: send the data to the GPU, do as much with it as possible while it is there, and only bring it back to the host when you absolutely need to. Even better, create the data on the GPU in the first place if you can.

#### Test Computation-Intensive Calculations

For operations where computation dominates, the memory speed is much less important. In this case you are probably more interested in how fast the computations are performed. A good test of computational performance is a matrix-matrix multiply. For multiplying two N-by-N matrices, the total number of floating-point operations is

$\mathrm{FLOPs}(N) = 2N^3 - N^2$
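This count follows because each of the $N^2$ entries of the result is a dot product of length $N$, costing $N$ multiplications and $N-1$ additions:

$N^2\,\bigl(N + (N-1)\bigr) = 2N^3 - N^2$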

As above, I time this operation on both the host PC and the GPU to see their relative processing power:

sizes = power(2, 12:2:24);
N = sqrt(sizes);
mmTimesHost = inf(size(sizes));
mmTimesGPU = inf(size(sizes));
for ii=1:numel(sizes)
A = rand( N(ii), N(ii) );
B = rand( N(ii), N(ii) );
% First do it on the host
for rr=1:repeats
timer = tic();
C = A*B; %#ok<NASGU>
mmTimesHost(ii) = min( mmTimesHost(ii), toc(timer));
end
% Now on the GPU
A = gpuArray(A);
B = gpuArray(B);
for rr=1:repeats
timer = tic();
C = A*B; %#ok<NASGU>
wait(gpu);
mmTimesGPU(ii) = min( mmTimesGPU(ii), toc(timer));
end
end
mmGFlopsHost = (2*N.^3 - N.^2)./mmTimesHost/1e9;
[maxGFlopsHost,maxGFlopsHostIdx] = max(mmGFlopsHost);
mmGFlopsGPU = (2*N.^3 - N.^2)./mmTimesGPU/1e9;
[maxGFlopsGPU,maxGFlopsGPUIdx] = max(mmGFlopsGPU);
fprintf('Peak calculation rate: %1.1f GFLOPS (host), %1.1f GFLOPS (GPU)\n', ...
maxGFlopsHost, maxGFlopsGPU)

Peak calculation rate: 73.7 GFLOPS (host), 330.9 GFLOPS (GPU)


Now plot it to see where the peak was achieved.

hold off
semilogx(sizes, mmGFlopsGPU, 'b.-', sizes, mmGFlopsHost, 'k.-')
hold on
semilogx(sizes(maxGFlopsGPUIdx), maxGFlopsGPU, 'bo-', 'MarkerSize', 10);
semilogx(sizes(maxGFlopsHostIdx), maxGFlopsHost, 'ko-', 'MarkerSize', 10);
grid on
title('Matrix-multiply calculation rate')
xlabel('Matrix size (edge length)')
ylabel('Calculation Rate (GFLOPS)')
legend('GPU','Host','Location','NorthWest')


#### Comparing GPUs

After measuring both the memory bandwidth and the calculation performance, I can now compare my GPU to others. Previously I ran these tests on a couple of different GPUs and stored the results in a data file.

offline = load('gpuBenchmarkResults.mat');
names = [{'This GPU', 'This host'}, offline.names];
ioData = [maxBWGPU maxBWHost offline.memoryBandwidth];
calcData = [maxGFlopsGPU maxGFlopsHost offline.mmGFlops];

subplot(1,2,1)
bar( [ioData(:),nan(numel(ioData),1)]', 'grouped' );
set( gca, 'Xlim', [0.6 1.4], 'XTick', [] );
legend(names{:})
title('Memory Bandwidth'), ylabel('GB/sec')

subplot(1,2,2)
bar( [calcData(:),nan(numel(calcData),1)]', 'grouped' );
set( gca, 'Xlim', [0.6 1.4], 'XTick', [] );
title('Calculation Speed'), ylabel('GFLOPS')

set(gcf, 'Position', get(gcf,'Position')+[0 0 300 0]);


#### Conclusions

These tests reveal a few things about how GPUs behave:

• Transfers from host memory to GPU memory and back are relatively slow, <6GB/s in my case.
• A good GPU can read/write its memory much faster than the host PC can read/write its memory.
• Given large enough data, GPUs can perform calculations much faster than the host PC, more than four times faster in my case.

Noticeable in each test is that you need quite large arrays to fully saturate your GPU, whether limited by memory access or by computation. You get the most from your GPU when working with millions of elements at once.

If you are interested in a more detailed benchmark of your GPU's performance, have a look at GPUBench on the MATLAB Central File Exchange.

If you have questions about these measurements or spot something I've done wrong or that could be improved, leave me a comment here.


Published with MATLAB® R2012b

### Comments

Walter Reade replied on : 1 of 12
Looks like the host/gpu label on the GFLOPS graph is switched. Enjoyed the post otherwise!
Aditya replied on : 2 of 12
wait(gpu) This gives the following error: ??? Undefined function or method 'wait' for input arguments of type 'parallel.gpu.CUDADevice'. Error in ==> gpu_benchmarking at 74 wait(gpu);
Ben replied on : 3 of 12
Hi Aditya, the "wait" method was introduced in MATLAB R2012a. If you're using an older version, just comment that line out.
Ben replied on : 4 of 12
Hi Walter, thanks for spotting that - I'll get it fixed right away.
Sebastian replied on : 5 of 12
Very interesting. I ran the tests on my computer with an i7-3770K CPU and the new GTX-680 GPU. On the host/GPU read/write and memory-intensive parts, I get numbers that are comparable with the ones in the article. For the computation-intensive part, it seems like the GTX-680 is not as fast, where I only achieve a peak of about 111 GFLOPS on the GPU (78 on the CPU). When I switch to single precision (by converting A and B in the last test to singles) things change dramatically. The CPU's performance increases from 78 to 170 GFLOPS, but the GPU's performance skyrockets from 111 GFLOPS to almost 1 TFLOP! (979 GFLOPS) Seems like the new Kepler architecture does not do so well with doubles but performs really well at single-precision arithmetic. Do you know how this compares with the Tesla? Maybe if one has an application where single precision is OK, the GTX 680 is the better choice? It would be helpful if MathWorks published some benchmark tests for different GPUs.
Sebastian replied on : 6 of 12
Sorry about the last comment - I missed the gpuBench utility where the information is presented!
sam replied on : 7 of 12
I'm getting the following error, any idea why? tnx Undefined variable "gpuArray" or class "gpuArray.zeros". Error in GPUtest (line 127) data = gpuArray.zeros(numElements, 1, 'double');
Grzegorz Knor replied on : 8 of 12
Nice article ;) I've run test on my GPU, the result is as follow: I have a Tesla C2050 GPU. Peak send speed is 5.33924 GB/s Peak gather speed is 3.52982 GB/s Peak read/write speed on the GPU is 110.814 GB/s Peak calculation rate: 10.0 GFLOPS (host), 331.7 GFLOPS (GPU)
Ben replied on : 9 of 12
Hi Sam, there are three things you need to run this code: 1. Parallel Computing Toolbox 2. MATLAB R2012b or newer 3. A supported GPU If you have (1) then the most likely cause is an older version of MATLAB (i.e. 2). The names of the methods used to build GPU arrays were changed slightly in R2012b. If you have R2012a, replace things like gpuArray.zeros(...) with parallel.gpu.GPUArray.zeros(...) You should then be able to run this without problems. If you're using an even older release then you may have to make a few other changes too (see Aditya's question above).
Ben replied on : 10 of 12
Hi Sebastian, thanks for posting your results. As you've noticed, not all GPUs are equal. You can think of NVIDIA's cards as being divided into two groups: graphics cards and compute cards. The graphics cards (the GTX range and others) target huge single-precision performance, as this is what OpenGL, DirectX etc. need to power the latest games. They provide some limited double-precision support, but it is a fraction of the single-precision performance. They also lack any error checking/correction on the memory, so you may get the occasional bit-flip with extended use. The compute cards (mostly the Tesla range) target computation, including double-precision performance. They have error-correcting memory. The latest Tesla K20 cards are capable of around a teraflop in double precision. Both the GTX 680 and the Tesla K20 are "Kepler architecture" cards, but the former is a graphics card and the latter a compute card. The double-precision performance differs accordingly. I will be posting an update to gpuBench that includes K20 results in the next month or two.
Eric ANTERRIEU replied on : 11 of 12
I read your article on Measuring GPU Performance with MATLAB with great interest. I did a similar job for a Tesla 1060 within MATLAB R2012b and with CUDA standalone code. With regards to memory bandwidth, CPU->GPU and GPU->CPU, I observed better performance with CUDA than within MATLAB in both directions. Did you observe the same behavior? I guess it is the price to pay for working with a very high-level language, i.e. MATLAB, rather than working with CUDA... Could you confirm that the interface between MATLAB and the GPU is lowering the memory transfer bandwidth? Do you have any references (papers/articles) on that? Best regards, Eric
Andras replied on : 12 of 12
A very good code and article! I'd suggest adding a single-precision computational-power evaluation too. Thanks!