[OTFFT: High Speed FFT library]
command
top page
init page
font spec
font small
font large
toc on
toc off
print
cancel
FontName/Size/Weight

[OTFFT -- FFT library using AVX that is faster than FFTW]

In the page of "Introduction to the Stockham FFT", I led the Stockham algorithm that is a high-speed Fast Fourier Transform (FFT) algorithm. Therefore, in order to verify its performance, I tried to make a FFT library using the Stockham algorithm. The name is OTFFT.

OTFFT is a high-speed FFT library using Stockham algorithm and Intel AVX and Six-Step FFT algorithm and C++ template meta-programming. In addition, OTFFT is a parallel FFT using OpenMP. And, OTFFT has the interface for C++ and C99 and Fortran2003. OTFFT is faster than FFTW at sequence length \(2^1,2^2,...,2^{24}\).

The download page for the source code is here. The license for OTFFT is the MIT license.

I did the following benchmarks.

  1. Multi-Thread Benchmark
  2. Single-Thread Benchmark

First of all, I show the multi-thread benchmark. Conditions of the benchmark are as follows.

Machine:Amazon EC2 m5.2xlarge
OS:Ubuntu 18.04
Compiler:GCC 7.4
OpenMP Runtime:libgomp (GCC default)
cores/threads:8 cores/8 threads
FFTW Version:3.3.8
FFTW Mode:FFTW_MEASURE (AVX-512 enabled)
Floating point precision:double

Multi-Thread Benchmark is as follows.

------+-----------+-----------------+-----------------+-----------------+---
length|FFTW3[usec]|   OOURA   [usec]| SimpleFFT [usec]|   OTFFT   [usec]|err
------+-----------+-----------------+-----------------+-----------------+---
2^( 1)|       0.02|       0.02( 83%)|       0.02( 97%)|       0.01( 65%)| -
2^( 2)|       0.03|       0.03(115%)|       0.04(169%)|       0.02( 65%)| -
2^( 3)|       0.04|       0.05(137%)|       0.08(236%)|       0.02( 64%)|-28
2^( 4)|       0.07|       0.11(151%)|       0.15(215%)|       0.04( 61%)|-27
2^( 5)|       0.15|       0.32(214%)|       0.35(235%)|       0.09( 59%)|-27
2^( 6)|       0.26|       0.73(277%)|       0.68(259%)|       0.19( 71%)|-26
2^( 7)|       0.50|       1.58(314%)|       1.69(335%)|       0.39( 78%)|-25
2^( 8)|       1.05|       3.45(328%)|       3.27(311%)|       0.84( 80%)|-24
2^( 9)|       2.41|       7.59(314%)|       8.35(346%)|       2.12( 88%)|-23
2^(10)|       5.49|      16.46(300%)|      15.74(287%)|       4.44( 81%)|-23
2^(11)|      12.51|      36.42(291%)|      41.95(335%)|      10.49( 84%)|-22
2^(12)|      30.93|      79.78(258%)|      96.42(312%)|      28.92( 93%)|-21
2^(13)|      66.13|     263.41(398%)|     275.81(417%)|      44.65( 68%)|-20
2^(14)|     257.42|     470.51(183%)|     494.81(192%)|      71.23( 28%)|-20
2^(15)|     375.95|     810.45(216%)|     852.00(227%)|     130.95( 35%)|-19
2^(16)|     549.83|    1611.12(293%)|    1551.18(282%)|     283.60( 52%)|-19
2^(17)|    1056.07|    3197.87(303%)|    3044.32(288%)|     628.98( 60%)|-18
2^(18)|    1945.74|    6314.46(325%)|    5297.66(272%)|    1333.40( 69%)|-17
2^(19)|    5822.68|    9853.01(169%)|   10823.18(186%)|    2523.95( 43%)|-16
2^(20)|   14376.30|   24093.52(168%)|   23898.31(166%)|    6965.61( 48%)|-16
2^(21)|   32449.85|   51240.93(158%)|   59484.48(183%)|   18572.95( 57%)|-15
2^(22)|   69312.50|  124160.37(179%)|  160015.50(231%)|   44645.45( 64%)|-15
2^(23)|  141994.60|  226413.23(159%)|  367998.46(259%)|  105419.03( 74%)|-14
2^(24)|  297037.80|  509076.86(171%)|  832253.21(280%)|  201982.33( 68%)|-14
------+-----------+-----------------+-----------------+-----------------+---
 cost |  150344.96|  316015.09(210%)|  363115.84(242%)|   92014.39( 61%)|
------+-----------+-----------------+-----------------+-----------------+---

This benchmark represents the sum of the processing time of DFT and IDFT. The unit of time is microseconds. Column of "FFTW3" is the processing time of FFTW Vestion 3.3.8. Column of "OOURA" is the processing time of OOURA's FFT Package. Column of "SimpleFFT" is the processing time of Cooley-Tukey algorithm. Column of "OTFFT" is the processing time of OTFFT. Numbers in the parentheses are the percentage against processing time of "FFTW3". Row of "Cost" is the average of the speed measure. Smaller is faster.


Next, I show the single-thread benchmark. Conditions of the benchmark are as follows.

Machine:MacBook Pro 15 inch Late 2016 (Core i7 2.6GHz)
OS:macOS Mojave
Compiler:Clang 8.0
cores/threads:4 cores/1 thread
FFTW Version:3.3.8
FFTW Mode:FFTW_MEASURE (AVX2 enabled)
Floating point precision:double

Single-Thread Benchmark is as follows.

------+-----------+-----------------+-----------------+-----------------+---
length|FFTW3[usec]|   OOURA   [usec]| SimpleFFT [usec]|   OTFFT   [usec]|err
------+-----------+-----------------+-----------------+-----------------+---
2^( 1)|       0.02|       0.01( 61%)|       0.01( 62%)|       0.01( 40%)| -
2^( 2)|       0.03|       0.02( 77%)|       0.04(138%)|       0.01( 44%)| -
2^( 3)|       0.03|       0.04(104%)|       0.06(172%)|       0.02( 59%)|-28
2^( 4)|       0.06|       0.07(117%)|       0.13(233%)|       0.03( 58%)|-27
2^( 5)|       0.12|       0.23(199%)|       0.26(223%)|       0.07( 62%)|-26
2^( 6)|       0.22|       0.53(242%)|       0.64(292%)|       0.15( 67%)|-26
2^( 7)|       0.44|       1.27(288%)|       1.31(298%)|       0.37( 84%)|-25
2^( 8)|       1.18|       2.79(236%)|       3.19(270%)|       0.76( 64%)|-24
2^( 9)|       2.24|       6.49(290%)|       6.59(294%)|       1.73( 77%)|-23
2^(10)|       5.39|      14.20(263%)|      15.56(289%)|       4.10( 76%)|-23
2^(11)|      12.35|      32.51(263%)|      34.64(280%)|      10.73( 87%)|-22
2^(12)|      30.44|      70.45(231%)|      96.67(318%)|      27.51( 90%)|-21
2^(13)|      71.77|     158.31(221%)|     223.45(311%)|      60.48( 84%)|-21
2^(14)|     170.92|     339.09(198%)|     505.11(296%)|     142.77( 84%)|-20
2^(15)|     414.19|     761.44(184%)|    1184.86(286%)|     310.04( 75%)|-19
2^(16)|     813.94|    1635.32(201%)|    2718.50(334%)|     814.79(100%)|-19
2^(17)|    2092.09|    3733.19(178%)|    5830.90(279%)|    2230.98(107%)|-18
2^(18)|    7189.92|    8010.17(111%)|   18604.19(259%)|    6762.49( 94%)|-17
2^(19)|   19027.48|   19948.71(105%)|   56535.51(297%)|   15269.63( 80%)|-16
2^(20)|   39730.27|   49774.56(125%)|  154435.58(389%)|   33916.37( 85%)|-16
2^(21)|   79578.31|  105574.35(133%)|  364186.08(458%)|   74751.57( 94%)|-16
2^(22)|  179210.13|  224649.25(125%)|  767693.42(428%)|  158461.42( 88%)|-14
2^(23)|  370988.64|  464596.93(125%)| 1695200.08(457%)|  315307.83( 85%)|-14
2^(24)|  797424.80|  989652.20(124%)| 3728332.46(468%)|  655610.20( 82%)|-13
------+-----------+-----------------+-----------------+-----------------+---
 cost |  273401.52|  390609.18(143%)|  977864.93(358%)|  229296.61( 84%)|
------+-----------+-----------------+-----------------+-----------------+---

Because the processing is a single-thread, there is no influence of parallel processing. Stockham algorithm is best match to Intel AVX. As a result, OTFFT becomes superior to FFTW.