Benchmarking a GPU
My Laptop
The laptop where I do most of my work is a ThinkPad T480s with an Intel Core i7-8650U processor. The CPU has four cores, a base operating frequency of 1.90 GHz, and a maximum Turbo frequency of 4.20 GHz. There is 16GB of main memory. I run the 64-bit Microsoft Windows 10 operating system. The retail price is around $1500. MATLAB uses the Intel Math Kernel Library, which is multithreaded, so it can take advantage of all four cores.
GPU
The GPU is an NVIDIA Titan V operating at 1.455 GHz. It has 12GB of memory, 640 Tensor cores and 5120 CUDA cores. It is housed in a separate Thunderbolt peripheral box which is several times larger than the laptop itself. This box has its own 240W power supply and a couple of fans. It retails for $2999.
Benchmarking
This has been a frustrating project. Trying to measure the speed of modern computers is very difficult. I can bring up the Task Manager performance meter on my machine. I can switch on airplane mode so that I am no longer connected to the Internet. I can shut down MATLAB and close all other windows, except the perf meter itself. I can wait until all transient activity is gone. But I still don't get consistent times. Here is a snapshot of the Task Manager.
[Figure: Task Manager performance snapshot]
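One way to tame some of this run-to-run noise is MATLAB's timeit function, which calls a function handle several times and returns the median of the measurements. This is just a sketch of that approach, not the method used for the measurements below; the 4000 is an arbitrary example size.

   % Time a single backslash solve with timeit.
   % timeit runs the function handle several times and
   % returns the median time for one call.
   n = 4000;
   A = randn(n,n);
   b = randn(n,1);
   tm = timeit(@() A\b)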
Flops
I'm going to be measuring gigaflops. That's 10^9 flops. What is a flop? The inner loops of most matrix computations are either dot products,

   s = s + x(i)*y(i)

or "daxpys", double a x plus y,

   y(i) = a*x(i) + y(i)

In either case, that's one multiplication, one addition, and a handful of indexing, load and store operations. Taken altogether, that's two floating point operations, or two flops. Each time through an inner loop counts as two flops. If we ignore cache and memory effects, the time required for a matrix computation is roughly proportional to the number of flops.
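To make the counting concrete, here is a small illustration of my own (not part of the original benchmark): a matrix-vector product y = A*x written as a column-oriented daxpy loop. The inner statement does one multiplication and one addition, so the whole computation costs 2mn flops.

   % y = A*x, one daxpy per column of A.
   m = 3; n = 4;
   A = randn(m,n);
   x = randn(n,1);
   y = zeros(m,1);
   for j = 1:n
      for i = 1:m
         y(i) = y(i) + A(i,j)*x(j);   % one multiply + one add = 2 flops
      end
   end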
CPU, A\b
Here is a typical benchmark run for timing the iconic x = A\b.

   t = zeros(1,30);
   m = zeros(1,30);
   n = 500:500:15000;
   tspan = 1.0;
   while ~get(stop,'value')
      k = randi(30);
      nk = n(k);
      A = randn(nk,nk);
      b = randn(nk,1);
      cnt = 0;
      tok = 0;
      tic
      while tok < tspan
         x = A\b;
         cnt = cnt + 1;
         tok = toc;
      end
      t(k) = t(k) + tok/cnt;
      m(k) = m(k) + 1;
   end
   t = t./m;
   gigaflops = 2/3*n.^3./t/1.e9;

This randomly varies the size of the matrix between 500-by-500 and 15,000-by-15,000. If it takes less than one second to compute A\b, the computation is repeated. It turns out that systems of order 3000 and less need to be repeated. At the two extremes, systems of order 500 need to be repeated over 100 times, while a single system of order 15,000 requires over 30 seconds. If I run this for more than an hour, each value of n is encountered several times and the average times settle down.
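The loop keeps going until I press a button. The variable stop is not defined in the snippet above; it can be a toggle button created before the loop starts. A minimal sketch:

   % Toggle button tested by the while-loop.
   % Pressing it sets its 'value' to 1 and ends the run.
   stop = uicontrol('style','togglebutton','string','stop','value',0);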
After adding some annotation to

   plot(n,t,'o')

I get

[Figure: execution time t versus matrix order n]

The corresponding gigaflop rates are

   giga = 2/3*n.^3./t/1.e9

[Figure: gigaflops versus matrix order n]
CPU, A*B
Multiplication of two n-by-n matrices requires 2n^3 flops, three times as many as solving one linear system.
[Figure: CPU gigaflops for A*B versus matrix order n]
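The measurement is the same loop as before, with the backslash solve replaced by a matrix product and the flop count changed from 2/3*n^3 to 2*n^3. A sketch:

   % Same benchmark skeleton, timing C = A*B instead of x = A\b.
   t = zeros(1,30);
   m = zeros(1,30);
   n = 500:500:15000;
   tspan = 1.0;
   stop = uicontrol('style','togglebutton','string','stop','value',0);
   while ~get(stop,'value')
      k = randi(30);
      nk = n(k);
      A = randn(nk,nk);
      B = randn(nk,nk);
      cnt = 0;
      tok = 0;
      tic
      while tok < tspan
         C = A*B;
         cnt = cnt + 1;
         tok = toc;
      end
      t(k) = t(k) + tok/cnt;
      m(k) = m(k) + 1;
   end
   t = t./m;
   gigaflops = 2*n.^3./t/1.e9;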
GPU, A\b
Now to the primary motivation for this project, the GPU. First, solving a linear system. Here we reach a significant obstacle -- there is only enough memory on this GPU to handle linear systems of order 21,000 or so. For such systems the performance can reach one teraflop (10^12 flops). That is nowhere near the speed that could be reached if there were more memory for the GPU.
[Figure: GPU gigaflops for A\b versus matrix order n]
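Here is a sketch of the GPU version of the measurement, assuming the Parallel Computing Toolbox. The essential changes are generating the data as gpuArrays and synchronizing with wait(gpuDevice) before reading the clock, since GPU operations run asynchronously.

   % Time x = A\b on the GPU for one matrix size.
   nk = 20000;
   A = gpuArray.randn(nk,nk);
   b = gpuArray.randn(nk,1);
   d = gpuDevice;
   tic
   x = A\b;
   wait(d)   % make sure the GPU has finished before reading toc
   tok = toc;
   gigaflops = 2/3*nk^3/tok/1.e9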
GPU, A*B
It is easy to break matrix multiplication into many small, parallel tasks that the GPU can handle.
[Figure: GPU gigaflops for A*B versus matrix order n]
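For timing individual GPU operations, the Parallel Computing Toolbox also provides gputimeit, which handles the synchronization itself. A sketch:

   % Time C = A*B on the GPU with gputimeit.
   nk = 10000;
   A = gpuArray.randn(nk,nk);
   B = gpuArray.randn(nk,nk);
   tok = gputimeit(@() A*B);
   gigaflops = 2*nk^3/tok/1.e9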
A\b, Comparison
How does the GPU compare to the CPU on A\b?
[Figure: CPU versus GPU gigaflops for A\b]
A*B, Comparison
The story is very different for matrix multiplication.
[Figure: CPU versus GPU gigaflops for A*B]
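One way to produce such comparison plots, assuming the CPU and GPU gigaflop rates from the runs above have been saved in vectors with the hypothetical names gf_cpu and gf_gpu:

   % Overlay CPU and GPU gigaflop rates for the same matrix orders n.
   % gf_cpu and gf_gpu are hypothetical vectors saved from the earlier runs.
   plot(n,gf_cpu,'o-',n,gf_gpu,'s-')
   legend('CPU','GPU','location','northwest')
   xlabel('matrix order n')
   ylabel('gigaflops')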
Supercomputers, Then and Now
Over 40 years ago, at the beginning of the LINPACK Benchmark, the fastest supercomputer in the world was the newly installed Cray-1 at NCAR, the National Center for Atmospheric Research in Boulder. That machine could solve a 100-by-100 linear system in 50 milliseconds. That's 14 megaflops (10^6 flops), about one-tenth of its top speed of 160 megaflops. At 6 teraflops, the machine on my desk is 37,500 times faster. Both the Cray-1 and the NVIDIA Titan V could profit from more memory. On the other hand, the fastest supercomputer in the world today is Summit, at Oak Ridge National Laboratory in Tennessee. Its top speed of 200 petaflops (10^15 flops) is roughly 33,000 times faster than my laptop plus GPU. And Summit has lots of memory.

Published with MATLAB® R2018b