OTFFT -- FFT library using AVX that is faster than FFTW

[OTFFT -- FFT library using AVX that is faster than FFTW]

In the page of "Introduction to the Stockham FFT", I led the Stockham algorithm that is a high-speed Fast Fourier Transform (FFT) algorithm. Therefore, in order to verify its performance, I tried to make a FFT library using the Stockham algorithm. The name is OTFFT.

OTFFT is a high-speed FFT library using Stockham algorithm and Intel AVX and Six-Step FFT algorithm and C++ template meta-programming. In addition, OTFFT is a parallel FFT using OpenMP. And, OTFFT has the interface for C++ and C99 and Fortran2003. OTFFT is faster than FFTW at sequence length \(2^1,2^2,...,2^{24}\).

The download page for the source code is here. The license for OTFFT is the MIT license.

I did the following benchmarks.

Multi-Thread Benchmark
Single-Thread Benchmark

First of all, I show the multi-thread benchmark. Conditions of the benchmark are as follows.

Machine:	Surface Laptop 5 13.5 inch
OS:	Ubuntu 22.04.3 on WSL2
Compiler:	GCC 11.4.0
OpenMP Runtime:	libgomp (GCC default)
cores/threads:	10 cores/12 threads
OTFFT Version:	11.6
FFTW Version:	3.3.10
FFTW Mode:	FFTW_ESTIMATE (AVX2 enabled)
Floating point precision:	double

Multi-Thread Benchmark is as follows.

------+-----------+-----------------+-----------------+-----------------+---
length|FFTW3[usec]|   OOURA   [usec]| SimpleFFT [usec]|   OTFFT   [usec]|err
------+-----------+-----------------+-----------------+-----------------+---
2^( 1)|       0.02|       0.01( 68%)|       0.02( 85%)|       0.01( 39%)| -
2^( 2)|       0.02|       0.02( 77%)|       0.03(109%)|       0.01( 49%)| -
2^( 3)|       0.03|       0.04(108%)|       0.05(159%)|       0.02( 56%)|-28
2^( 4)|       0.05|       0.09(178%)|       0.10(205%)|       0.03( 66%)|-27
2^( 5)|       0.08|       0.18(226%)|       0.21(262%)|       0.06( 76%)|-27
2^( 6)|       0.14|       0.43(303%)|       0.55(384%)|       0.12( 86%)|-26
2^( 7)|       0.30|       0.94(311%)|       1.17(387%)|       0.27( 91%)|-25
2^( 8)|       0.64|       2.11(328%)|       2.77(431%)|       0.64( 99%)|-24
2^( 9)|       1.55|       4.73(306%)|       5.96(386%)|       1.41( 91%)|-23
2^(10)|       4.07|      10.35(255%)|      13.52(332%)|       3.31( 81%)|-22
2^(11)|       8.41|      22.90(272%)|      28.45(338%)|       7.47( 89%)|-22
2^(12)|      22.22|      50.46(227%)|      72.21(325%)|      20.53( 92%)|-21
2^(13)|      49.05|     291.59(594%)|     173.81(354%)|      41.77( 85%)|-21
2^(14)|     104.65|     379.71(363%)|     395.46(378%)|      76.19( 73%)|-20
2^(15)|     275.93|     525.62(190%)|     565.19(205%)|     136.22( 49%)|-19
2^(16)|    1199.67|     989.61( 82%)|    1074.55( 90%)|     279.28( 23%)|-18
2^(17)|    1776.86|    2716.19(153%)|    2405.51(135%)|     677.31( 38%)|-18
2^(18)|    3050.84|    7313.65(240%)|    5134.60(168%)|    1323.54( 43%)|-17
2^(19)|    6746.79|   10752.24(159%)|   12999.56(193%)|    3093.32( 46%)|-16
2^(20)|   13798.07|   20286.32(147%)|   27882.82(202%)|   11011.15( 80%)|-16
2^(21)|   25776.60|   51000.31(198%)|   76165.69(295%)|   22345.99( 87%)|-15
2^(22)|   54350.75|  104982.50(193%)|  177285.93(326%)|   50237.75( 92%)|-14
2^(23)|  117092.40|  225383.00(192%)|  388589.37(332%)|  100327.30( 86%)|-14
2^(24)|  247933.53|  464576.71(187%)|  927083.27(374%)|  206509.40( 83%)|-13
------+-----------+-----------------+-----------------+-----------------+---
 cost |  134169.67|  259137.43(193%)|  322765.86(241%)|   86923.47( 65%)|
------+-----------+-----------------+-----------------+-----------------+---

This benchmark represents the sum of the processing time of DFT and IDFT. The unit of time is microseconds. Column of "FFTW3" is the processing time of FFTW Vestion 3.3.10. Column of "OOURA" is the processing time of OOURA's FFT Package. Column of "SimpleFFT" is the processing time of Cooley-Tukey algorithm. Column of "OTFFT" is the processing time of OTFFT. Numbers in the parentheses are the percentage against processing time of "FFTW3". Row of "Cost" is the average of the speed measure. Smaller is faster.

Next, I show the single-thread benchmark. Conditions of the benchmark are as follows.

Machine:	MacBook Pro 15 inch Late 2016 (Core i7 2.6GHz)
OS:	macOS Mojave
Compiler:	Clang 8.0
cores/threads:	4 cores/1 thread
OTFFT Version:	11.5
FFTW Version:	3.3.8
FFTW Mode:	FFTW_MEASURE (AVX2 enabled)
Floating point precision:	double

Single-Thread Benchmark is as follows.

------+-----------+-----------------+-----------------+-----------------+---
length|FFTW3[usec]|   OOURA   [usec]| SimpleFFT [usec]|   OTFFT   [usec]|err
------+-----------+-----------------+-----------------+-----------------+---
2^( 1)|       0.02|       0.01( 61%)|       0.01( 62%)|       0.01( 40%)| -
2^( 2)|       0.03|       0.02( 77%)|       0.04(138%)|       0.01( 44%)| -
2^( 3)|       0.03|       0.04(104%)|       0.06(172%)|       0.02( 59%)|-28
2^( 4)|       0.06|       0.07(117%)|       0.13(233%)|       0.03( 58%)|-27
2^( 5)|       0.12|       0.23(199%)|       0.26(223%)|       0.07( 62%)|-26
2^( 6)|       0.22|       0.53(242%)|       0.64(292%)|       0.15( 67%)|-26
2^( 7)|       0.44|       1.27(288%)|       1.31(298%)|       0.37( 84%)|-25
2^( 8)|       1.18|       2.79(236%)|       3.19(270%)|       0.76( 64%)|-24
2^( 9)|       2.24|       6.49(290%)|       6.59(294%)|       1.73( 77%)|-23
2^(10)|       5.39|      14.20(263%)|      15.56(289%)|       4.10( 76%)|-23
2^(11)|      12.35|      32.51(263%)|      34.64(280%)|      10.73( 87%)|-22
2^(12)|      30.44|      70.45(231%)|      96.67(318%)|      27.51( 90%)|-21
2^(13)|      71.77|     158.31(221%)|     223.45(311%)|      60.48( 84%)|-21
2^(14)|     170.92|     339.09(198%)|     505.11(296%)|     142.77( 84%)|-20
2^(15)|     414.19|     761.44(184%)|    1184.86(286%)|     310.04( 75%)|-19
2^(16)|     813.94|    1635.32(201%)|    2718.50(334%)|     814.79(100%)|-19
2^(17)|    2092.09|    3733.19(178%)|    5830.90(279%)|    2230.98(107%)|-18
2^(18)|    7189.92|    8010.17(111%)|   18604.19(259%)|    6762.49( 94%)|-17
2^(19)|   19027.48|   19948.71(105%)|   56535.51(297%)|   15269.63( 80%)|-16
2^(20)|   39730.27|   49774.56(125%)|  154435.58(389%)|   33916.37( 85%)|-16
2^(21)|   79578.31|  105574.35(133%)|  364186.08(458%)|   74751.57( 94%)|-16
2^(22)|  179210.13|  224649.25(125%)|  767693.42(428%)|  158461.42( 88%)|-14
2^(23)|  370988.64|  464596.93(125%)| 1695200.08(457%)|  315307.83( 85%)|-14
2^(24)|  797424.80|  989652.20(124%)| 3728332.46(468%)|  655610.20( 82%)|-13
------+-----------+-----------------+-----------------+-----------------+---
 cost |  273401.52|  390609.18(143%)|  977864.93(358%)|  229296.61( 84%)|
------+-----------+-----------------+-----------------+-----------------+---

Because the processing is a single-thread, there is no influence of parallel processing. Stockham algorithm is best match to Intel AVX. As a result, OTFFT becomes superior to FFTW.