Steve on Image Processing

Timing the FFT

Posted by Steve Eddins

I've seen two questions recently about the speed of the fft function in MATLAB. First, a tech support question was forwarded to development. The user wanted to know how to predict the computation time for an FFT of a given length, N. This user was interested in values of N in the neighborhood of 4096 (2^12).

The second was a post in the MATLAB newsgroup comp.soft-sys.matlab. This user wondered why padding to the next power of two wasn't always the fastest way to compute the FFT.

Inspired by these questions, I want to show you today how to do some FFT benchmarking in MATLAB.

It turns out that, in general, the time required to execute an N-point FFT is proportional to N*log(N). For any particular value of N, though, the execution time can be hard to predict and depends on the prime factorization of N (very roughly speaking, lengths made up of small prime factors are fast, and lengths containing large prime factors are slow). The variation in time between two close values of N can be as much as an order of magnitude.
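For example, here is a quick, illustrative comparison (not part of the benchmark below) using the timeit function discussed later; the exact numbers will vary from machine to machine:

N1 = 4096;                      % 2^12, highly composite
N2 = N1 + 1;
while ~isprime(N2)              % find the first prime above N1
    N2 = N2 + 1;
end
t_pow2  = timeit(@() fft(rand(N1,1)));
t_prime = timeit(@() fft(rand(N2,1)));
fprintf('N = %d: %.3g s\nN = %d (prime): %.3g s\n', N1, t_pow2, N2, t_prime);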

Whenever I do FFT benchmarking, I find it very helpful to look at three sets of numbers:

  • Powers of 2
  • Composite numbers that are not powers of 2
  • Prime numbers

Also, I have learned to look at plots that are log scale in N and that have roughly the same number of test values within each octave (or doubling) of N.

Constructing sets of N values along these lines takes a little thought. Here's some code.

First, how many powers of 2 do we want to examine? Based on the customer questions I saw, I want to examine the range from 1024 (2^10) to 131072 (2^17).

low2 = 10;
high2 = 17;
n2 = 2.^(low2:high2);

Next, I want to pick 10 composite numbers and 10 prime numbers in between successive powers of 2. I'd like to pick the numbers "randomly," but I also want my experiment to be repeatable. To satisfy these seemingly contradictory constraints, I'll reset the MATLAB random number generator before beginning.

rng('default');

% Initialize the vectors holding the prime N's and composite N's.
np = [];
nc = [];

for m = low2:high2
    k = (2^m):(2^(m+1));
    kp = k(2:end-1);
    isp = isprime(kp);
    primes = kp(isp);
    composites = kp(~isp);

    % Use randperm to pick out 10 values from the vector of primes and 10
    % values from the vector of composites.
    new_np = primes(randperm(length(primes),10));
    new_nc = composites(randperm(length(composites),10));

    np = [np new_np];
    nc = [nc new_nc];
end

Now let's use the function timeit to measure the execution time required to compute FFTs for all these values of N. (If you don't have a recent version of MATLAB that has timeit, you can get a version of it from the File Exchange.)

t2 = zeros(size(n2));
for k = 1:length(n2)
    x = rand(n2(k),1);
    t2(k) = timeit(@() fft(x));
end
tp = zeros(size(np));
for k = 1:length(np)
    x = rand(np(k),1);
    tp(k) = timeit(@() fft(x));
end
tc = zeros(size(nc));
for k = 1:length(nc)
    x = rand(nc(k),1);
    tc(k) = timeit(@() fft(x));
end

Now do a loglog plot of all these times.

loglog(n2,t2,'o')
set(gca,'xtick',2.^(10:17))
xlim([2^10 2^17])

hold on

loglog(nc,tc,'+')
loglog(np,tp,'*')

hold off

legend({'Powers of 2','Composite numbers','Prime numbers'}, ...
    'Location','NorthWest')
xlabel('N')
ylabel('Execution time (s)')
title('FFT execution time as a function of N')

You can see that there's a wide spread of execution times for the values of N that are not powers of 2.

One thing I'm not seeing is what the MATLAB Newsgroup poster reported. That is, I don't see a non-power-of-2 time that's faster than the time for the next higher power of 2.

So let's look a little harder for composite numbers that are faster than what we've seen so far. Specifically, I'm going to look for values of N with prime factors no bigger than 3.

nn = [];
for m = low2:high2
    k = (2^m):(2^(m+1));
    kp = k(2:end-1);

    kp = kp(randperm(length(kp)));
    nn_m = [];
    for q = 1:length(kp)
        if max(factor(kp(q))) <= 3
            nn_m = [nn_m kp(q)];

            if length(nn_m) >= 4
                % We've found enough in this part of the range.
                break
            end
        end
    end

    nn = [nn nn_m];
end

Measure execution times for these "highly composite" numbers.

tn = zeros(length(nn),1);
for k = 1:length(nn)
    x = rand(nn(k),1);
    tn(k) = timeit(@() fft(x));
end

Add the times to the plot.

hold on
loglog(nn,tn,'d')
hold off

legend({'Powers of 2','Composite numbers','Prime numbers', ...
    'Highly composite numbers'},'Location','NorthWest')

You can see that sometimes a non-power-of-2 can be computed very fast, faster than the next higher power of 2. In this experiment we found one such value of N: 39366. This number has 10 prime factors:

factor(39366)
ans =

     2     3     3     3     3     3     3     3     3     3
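For a quick side-by-side (a small addition of mine, not in the original code), we can pull the two relevant timings out of the vectors measured above:

t_hc  = tn(nn == 39366);    % highly composite length found above (2*3^9)
t_pow = t2(n2 == 65536);    % next higher power of 2 (2^16)
fprintf('N = 39366: %.3g s\nN = 65536: %.3g s\n', t_hc, t_pow);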

I hope you enjoyed these experiments with FFT benchmarking. I can tell you from personal experience that it can turn into almost a full-time hobby!



Published with MATLAB® R2014a

3 Comments

Very thorough and informative, thank you! I learned from experience that the "next power of 2" trick often doesn't help, especially with fft2 (not fft). The fact is that the additional number of elements under consideration can trump the benefits of low prime factors.

This is even more true in the 2D case, since fft2 essentially runs an FFT over each column and then over each row. In the N-D case, the cost grows exponentially with the number of dimensions (which is expected, since the number of elements the algorithm must process grows exponentially as well).
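As a small illustration of that point (a sketch, not part of the script below; the image size is arbitrary and timings depend on your machine), you can time fft2 at an awkward size directly and with zero-padding to the next power of 2:

A = rand(1300);                              % 1300 = 2^2 * 5^2 * 13
t_direct = timeit(@() fft2(A));              % transform at the original size
t_padded = timeit(@() fft2(A, 2048, 2048));  % zero-pad both dimensions to 2^11
fprintf('direct: %.3g s, padded: %.3g s\n', t_direct, t_padded);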

I have taken the liberty to expand on your script to do the following:
1. Run the FFTW planner
2. Compare performance to peak FLOPs of your cpu (assuming the arithmetic count of split-radix)

function fft_speed()
optimize = 0;   % set to 1 to run the FFTW planner and save the resulting wisdom
if optimize
    fftw('planner', 'measure');
else
    fftw('planner', 'estimate');
    load str                    % previously saved FFTW wisdom (str.mat)
    fftw('wisdom', str);
end

low2 = 7;
high2 = 24;
n2 = 2.^(low2:high2);
rng('default');
% Initialize the vectors holding the prime N's and composite N's.
np = [];
nc = [];
for m = low2:high2
    k = (2^m):(2^(m+1));
    kp = k(2:end-1);
    isp = isprime(kp);
    primes = kp(isp);
    composites = kp(~isp);

    % Use randperm to pick out 10 values from the vector of primes and 10
    % values from the vector of composites.
    new_np = primes(randperm(length(primes),10));
    new_nc = composites(randperm(length(composites),10));

    np = [np new_np];
    nc = [nc new_nc];
end

t2 = zeros(size(n2));
for k = 1:length(n2)
    x = rand(n2(k),1);
    if optimize
        fft(x);
    else
        t2(k) = timeit(@() fft(x));
    end
end

tp = zeros(size(np));
for k = 1:length(np)
    x = rand(np(k),1);
    if optimize
        fft(x);
    else
        tp(k) = timeit(@() fft(x));
    end
end

tc = zeros(size(nc));
for k = 1:length(nc)
    x = rand(nc(k),1);
    if optimize
        fft(x);
    else
        tc(k) = timeit(@() fft(x));
    end
end

nn = [];
for m = low2:high2
    k = (2^m):(2^(m+1));
    kp = k(2:end-1);

    kp = kp(randperm(length(kp)));
    nn_m = [];
    for q = 1:length(kp)
        if max(factor(kp(q))) <= 3
            nn_m = [nn_m kp(q)];

            if length(nn_m) >= 4
                % We've found enough in this part of the range.
                break
            end
        end
    end

    nn = [nn nn_m];
end

tn = zeros(length(nn),1);
for k = 1:length(nn)
    x = rand(nn(k),1);
    if optimize
        fft(x);
    else
        tn(k) = timeit(@() fft(x));
    end
end

[~,i] = sort(n2);
n2 = n2(i);
t2 = t2(i);
[~,i] = sort(nc);
nc = nc(i);
tc = tc(i);
[~,i] = sort(np);
np = np(i);
tp = tp(i);
[~,i] = sort(nn);
nn = nn(i);
tn = tn(i);

figure(1)
loglog(n2,t2,'o:')
set(gca,'xtick',2.^(low2:high2))
xlim([2^low2 2^high2])
hold all
loglog(nc,tc,'+:')
loglog(np,tp,'*:')
loglog(nn,tn,'d:')
hold off
xlabel('N')
ylabel('Execution time (s)')
title('FFT execution time as a function of N')
legend({'Powers of 2','Composite numbers','Prime numbers', ...
    'Highly composite numbers'},'Location','NorthWest')

%% fill with specific info about cpu
cache = [[32 256]*1e3 [6 128]*1e6];   % L1, L2, ... cache sizes in bytes
cpu_clock = 2.3e9;                    % CPU clock in Hz (alternative: 3.9e9)
simd_width = [4 8];                   % SIMD lanes per register (doubles, singles)
fusing = 2;                           % 2 if fused multiply-add is available
latency = 1;
simd_units = 2;
cores = 4;
haswell_core_flop_peak = cpu_clock*simd_units*simd_width*fusing/latency;
disp([num2str(haswell_core_flop_peak*1e-9), ' GFLOPS (DP/SP) (per-core theoretical peak)']);
haswell_cpu_flop_peak = cores*cpu_clock*simd_units*simd_width*fusing/latency;
disp([num2str(haswell_cpu_flop_peak*1e-9), ' GFLOPS (DP/SP) (cpu-wide theoretical peak)']);

figure(2)
clf
semilogx(n2,(4*n2.*log2(n2)-6*n2+8)./t2,'o:')
set(gca,'xtick',2.^(low2:high2))
xlim([2^low2 2^high2])
hold all
%http://en.wikipedia.org/wiki/Split-radix_FFT_algorithm
semilogx(nc,(4*nc.*log2(nc)-6*nc+8)./tc,'+:')
semilogx(np,(4*np.*log2(np)-6*np+8)./tp,'*:')
semilogx(nn,(4*nn.*log2(nn)-6*nn+8)./tn','d:')
semilogx(n2,haswell_core_flop_peak(1)*ones(size(n2)),':')
for c = 1:length(cache)
    semilogx((cache(c)/(2*8))*ones(2,1),[0 haswell_core_flop_peak(1)],'k-.')
end
hold off
legend({'Powers of 2','Composite numbers','Prime numbers','Highly composite numbers', ...
    'per-core theoretical peak','cache boundaries in complex doubles'}, ...
    'Location','NorthWest')
xlabel('N')
ylabel('DP FLOP/s')
title('FFT single-threaded DP FLOP/s (assuming the split-radix operation count) as a function of N')

if optimize
    str = fftw('wisdom');
    save str
end

This was discussed 3 years ago in CSSM thread 304045. The conclusion was that in many cases, transform lengths of repeated small prime factors are faster than the next higher power-of-2 (for example, 81=3^4 would be faster than 128=2^7). The specific comparison between potential window sizes should be done on the target computer, since it is highly affected by hardware issues such as cache size and memory boundaries/alignment.
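For example, a short helper along these lines (illustrative only; the candidate lengths are arbitrary) ranks candidate window sizes on the machine it runs on:

candidates = [59049 62208 65536];   % 3^10, 2^8*3^5, and 2^16, as arbitrary examples
t = zeros(size(candidates));
for k = 1:length(candidates)
    x = rand(candidates(k),1);
    t(k) = timeit(@() fft(x));
end
[t_sorted, idx] = sort(t);
disp([candidates(idx)' t_sorted'])  % fastest candidate listed first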

An interesting theoretical breakthrough in FFT performance was announced by MIT researchers in 2012. The resulting library, sFFT (sparse FFT), is available under an MIT open-source license. A version of sFFT that is hardware optimized (and 2x-5x faster than the generic reference sFFT) was developed at ETH and released under a GPL license as part of the Spiral open-source project. The basic idea of sFFT is that in any naturally occurring signal there are only a limited number of principal frequencies, which together account for 90%+ of the information. By analyzing only those key frequencies, we can significantly speed up data processing while sacrificing just a minimal amount of data. The result is that sFFT is significantly faster than FFTW for a wide range of natural signal sources: sound, images/video, etc. Users who find that a large portion of their program time is spent on FFT analysis of natural signals may benefit from directly using sFFT rather than MATLAB's builtin functions. sFFT is most effective for repeated FFTs of the same window length, since in this case the up-front overheads are amortized over multiple calls.

As a related suggestion, we could pass our data through a frequency filter (low-pass, high-pass, or band-pass) to remove noise from irrelevant frequencies before processing the data. Once the irrelevant frequencies are removed, processing becomes both faster and more accurate.
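For instance, a minimal sketch (assuming the Signal Processing Toolbox is available; the sample rate, cutoff frequency, and filter order here are arbitrary):

fs = 1000;                                  % sample rate in Hz (assumed)
t  = (0:1/fs:1)';
x  = sin(2*pi*50*t) + 0.5*randn(size(t));   % 50 Hz tone plus broadband noise
b  = fir1(64, 100/(fs/2));                  % 64th-order FIR low-pass, 100 Hz cutoff
y  = filter(b, 1, x);                       % filtered signal, ready for FFT analysis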

These postings are the author's and don't necessarily represent the opinions of MathWorks.