# Timing the FFT 3

Posted by **Steve Eddins**,

I've seen two questions recently about the speed of the `fft` function in MATLAB. First, a tech support question was forwarded to development. The user wanted to know how to predict the computation time for an FFT of a given length, N. This user was interested in values of N in the neighborhood of 4096 (2^12).

The second was a post in the MATLAB newsgroup comp.soft-sys.matlab. This user wondered why padding to the next power of two wasn't always the fastest way to compute the FFT.

Inspired by these questions, I want to show you today how to do some FFT benchmarking in MATLAB.

It turns out that, in general, the time required to execute an N-point FFT is proportional to N*log(N). For any particular value of N, though, the execution time can be hard to predict and depends on the number of prime factors of N (very roughly speaking). The variation in time between two close values of N can be as much as an order of magnitude.

Whenever I do FFT benchmarking, I find it very helpful to look at three sets of numbers:

- Powers of 2
- Composite numbers that are not powers of 2
- Prime numbers

Also, I have learned to look at plots that are log scale in N and that have roughly the same number of test values within each octave (or doubling) of N.

Constructing sets of N values along these lines takes a little thought. Here's some code.

First, how many powers of 2 do we want to examine? Based on the customer questions I saw, I want to examine the range from 1024 (2^10) to 131072 (2^17).

low2 = 10; high2 = 17; n2 = 2.^(low2:high2);

Next, I want to pick 10 composite numbers and 10 prime numbers in between successive powers of 2. I'd like to pick the numbers "randomly," but I also want my experiment to be repeatable. To satisfy these seemingly contradictory constraints, I'll reset the MATLAB random number generator before beginning.

rng('default'); % Initialize the vectors holding the prime N's and composite N's. np = []; nc = []; for m = low2:high2 k = (2^m):(2^(m+1)); kp = k(2:end-1); isp = isprime(kp); primes = kp(isp); composites = kp(~isp); % Use randperm to pick out 10 values from the vector of primes and 10 % values from the vector of composites. new_np = primes(randperm(length(primes),10)); new_nc = composites(randperm(length(composites),10)); np = [np new_np]; nc = [nc new_nc]; end

Now let's use the function `timeit` to measure the execution time required to compute FFTs for all these values of N. (If you don't have a recent version of MATLAB that has `timeit`, you can get a version of it from the File Exchange.)

t2 = zeros(size(n2)); for k = 1:length(n2) x = rand(n2(k),1); t2(k) = timeit(@() fft(x)); end

tp = zeros(size(np)); for k = 1:length(np) x = rand(np(k),1); tp(k) = timeit(@() fft(x)); end

tc = zeros(size(np)); for k = 1:length(nc) x = rand(nc(k),1); tc(k) = timeit(@() fft(x)); end

Now do a loglog plot of all these times.

loglog(n2,t2,'o') set(gca,'xtick',2.^(10:17)) xlim([2^10 2^17]) hold on loglog(nc,tc,'+') loglog(np,tp,'*') hold off legend({'Powers of 2','Composite numbers','Prime numbers'}, ... 'Location','NorthWest') xlabel('N') ylabel('Execution time (s)') title('FFT execution time as a function of N')

You can see that there's a wide spread of execution times for the values of N that are not powers of 2.

One thing I'm not seeing is what the MATLAB Newsgroup poster reported. That is, I don't see a non-power-of-2 time that's faster than the next highest power of 2.

So let's look a little harder for composite numbers that are faster than what we've seen so far. Specifically, I'm going to look for values of N with prime factors no bigger than 3.

nn = []; for m = low2:high2 k = (2^m):(2^(m+1)); kp = k(2:end-1); kp = kp(randperm(length(kp))); nn_m = []; for q = 1:length(kp) if max(factor(kp(q))) <= 3 nn_m = [nn_m kp(q)]; if length(nn_m) >= 4 % We've found enough in this part of the range. break end end end nn = [nn nn_m]; end

Measure execution times for these "highly composite" numbers.

tn = zeros(length(nn),1); for k = 1:length(nn) x = rand(nn(k),1); tn(k) = timeit(@() fft(x)); end

Add the times to the plot.

hold on loglog(nn,tn,'d') hold off legend({'Powers of 2','Composite numbers','Prime numbers', ... 'Highly composite numbers'},'Location','NorthWest')

You can see that *sometimes* a non-power-of-2 can be computed very fast, faster than the next higher power of 2. In this experiment we found one such value of N: 39366. This number has 10 prime factors:

factor(39366)

ans = 2 3 3 3 3 3 3 3 3 3

I hope you enjoyed these experiments with FFT benchmarking. I can tell you from personal experience that it can turn into almost a full-time hobby!

Get the MATLAB code

Published with MATLAB® R2014a

## 3 CommentsOldest to Newest

**1**of 3

Very thorough and informative, thank you! I learned from experience that the “next power of 2” trick often doesn’t help, especially with fft2 (not fft). The fact is that the additional number of elements under consideration can trump the benefits of low prime factors.

This is even more true for the 2D case, since it is done essentially by running an fft for each column and afterwards for each row. In the N-D case you can easily realize that the complexity grows exponentially with the dimension size (which is expected, since the algorithm must fill an exponentially large number of elements).

**2**of 3

I have taken the liberty to expand on your script to do the following:

1. Run the FFTW planner

2. Compare performance to peak FLOPs of your cpu (assuming the arithmetic count of split-radix)

function fft_speed() optimize = 0;%1 if optimize fftw('planner', 'measure'); else fftw('planner', 'estimate'); load str fftw('wisdom', str); end low2 = 7; high2 = 24; n2 = 2.^(low2:high2); rng('default'); % Initialize the vectors holding the prime N's and composite N's. np = []; nc = []; for m = low2:high2 k = (2^m):(2^(m+1)); kp = k(2:end-1); isp = isprime(kp); primes = kp(isp); composites = kp(~isp); % Use randperm to pick out 10 values from the vector of primes and 10 % values from the vector of composites. new_np = primes(randperm(length(primes),10)); new_nc = composites(randperm(length(composites),10)); np = [np new_np]; nc = [nc new_nc]; end t2 = zeros(size(n2)); for k = 1:length(n2) x = rand(n2(k),1); if optimize fft(x); else t2(k) = timeit(@() fft(x)); end end tp = zeros(size(np)); for k = 1:length(np) x = rand(np(k),1); if optimize fft(x); else tp(k) = timeit(@() fft(x)); end end tc = zeros(size(np)); for k = 1:length(nc) x = rand(nc(k),1); if optimize fft(x); else tc(k) = timeit(@() fft(x)); end end nn = []; for m = low2:high2 k = (2^m):(2^(m+1)); kp = k(2:end-1); kp = kp(randperm(length(kp))); nn_m = []; for q = 1:length(kp) if max(factor(kp(q))) = 4 % We've found enough in this part of the range. break end end end nn = [nn nn_m]; end tn = zeros(length(nn),1); for k = 1:length(nn) x = rand(nn(k),1); if optimize fft(x); else tn(k) = timeit(@() fft(x)); end end [~,i] = sort(n2); n2 = n2(i); t2 = t2(i); [~,i] = sort(nc); nc = nc(i); tc = tc(i); [~,i] = sort(np); np = np(i); tp = tp(i); [~,i] = sort(nn); nn = nn(i); tn = tn(i); figure(1) loglog(n2,t2,'o:') set(gca,'xtick',2.^(low2:high2)) xlim([2^low2 2^high2]) hold all loglog(nc,tc,'+:') loglog(np,tp,'*:') loglog(nn,tn,'d:') hold off xlabel('N') ylabel('Execution time (s)') title('FFT execution time as a function of N') legend({'Powers of 2','Composite numbers','Prime numbers', ... 'Highly composite numbers'},'Location','NorthWest') %% fill with specific info about cpu cache = [[32 256]*1e3 [6 128]*1e6];%l1,l2,... cache in byte cpu_clock = 2.3e9;%3.9e9;%cpu clock in Hz simd_width = [4 8];%doubles per register fusing = 2;%2 if fused multiply-add available latency = 1; simd_units = 2; cores = 4; haswell_core_flop_peak = cpu_clock*simd_units*simd_width*fusing/latency; disp([num2str(haswell_core_flop_peak*1e-9), ' TFLOP (DP/SP) (per core theoretical peak)']); haswell_cpu_flop_peak = cores*cpu_clock*simd_units*simd_width*fusing/latency; disp([num2str(haswell_cpu_flop_peak*1e-9), ' TFLOP (DP/SP) (cpu-wide theoretical peak)']); figure(2) clf semilogx(n2,(4*n2.*log2(n2)-6*n2+8)./t2,'o:') set(gca,'xtick',2.^(low2:high2)) xlim([2^low2 2^high2]) hold all %http://en.wikipedia.org/wiki/Split-radix_FFT_algorithm semilogx(nc,(4*nc.*log2(nc)-6*nc+8)./tc,'+:') semilogx(np,(4*np.*log2(np)-6*np+8)./tp,'*:') semilogx(nn,(4*nn.*log2(nn)-6*nn+8)./tn','d:') semilogx(n2,haswell_core_flop_peak(1)*ones(size(n2)),':') for c = 1:length(cache) semilogx((cache(c)/(2*8))*ones(2,1),[0 haswell_core_flop_peak(1)],'k-.') end hold off legend({'Powers of 2','Composite numbers','Prime numbers', 'Highly composite numbers', 'cpu theoretical peak', 'cache boundaries in complex doubles'}, ... 'Location','NorthWest') xlabel('N') ylabel('DP FLOPs') title('FFT single-threaded dFLOPS (assumed split-radix) as a function of N') if optimize str = fftw('wisdom'); save str end

**3**of 3

This was discussed 3 years ago in CSSM thread 304045. The conclusion was that in many cases, transform lengths of repeated small prime factors are faster than the next higher power-of-2 (for example, 81=3^4 would be faster than 128=2^7). The specific comparison between potential window sizes should be done on the target computer, since it is highly affected by hardware issues such as cache size and memory boundaries/alignment.

An interesting theoretical breakthrough in FFT performance was announced by MIT researchers in 2012. The resulting library, sFFT (sparse FFT), is available under an MIT open-source license. A version of sFFT that is hardware optimized (and 2x-5x faster than the generic reference sFFT) was developed in ETH and released under GPL license, as part of the Spiral open-source project. The basic idea of sFFT is that in any naturally-occurring signal, there are only a limited number of principle frequencies, that together account for 90%+ of the information. By analyzing only those key frequencies, we can significantly speed up data processing while sacrificing just a minimal amount of data. The result is that sFFT is significantly faster than FFTW for a wide range of natural signal sources: sound, images/video etc. Users who find that a large portion of their program time is spend on FFT analysis of natural signals, may benefit from directly utilizing sFFT rather than MATLAB’s builtin functions. sFFT is most effective for repeated FFTs having the same window length, since in this case the up-front overheads are amortized over multiple calls.

As a related suggestion, we could pass our data through a frequency filter (low-pass, high-pass, or band-pass filter), to remove noise due to irrelevant frequencies, before processing the data. Once we remove irrelevant frequencies, processing becomes both faster and more accurate.

## Recent Comments