## [Optimization using the Six-Step FFT and Eight-Step FFT]

In this page, I explain a optimization of FFT using the Six-Step FFT algorithm.

First, we will define discrete Fourier transform (DFT). DFT \(F_N\) is defined as follows when we assume that \(x_p\) is an input sequence and assume that \(X_k\) is an output sequence and assume that the size is \(N = 2^L\), \(L=0,1,2,3,\ldots\)

Now, we express \(x_p\) with the array \(x[p]\), and if \(N\) can be factored like \(N = nm\), it can two-dimensionaly be expressed by using \(p_1\) and \(p_2\) as follows.

Here, \(p_1 = 0,1,\ldots,n-1\) and \(p_2 = 0,1,\ldots,m-1\).

Similarly, when we express \(X_k\) with the array \(X[k]\), it can two-dimensionaly be expressed by using \(k_1\) and \(k_2\) as follows.

Here, \(k_1 = 0,1,\ldots,m-1\) and \(k_2 = 0,1,\ldots,n-1\).

Now, if we transform \(F_N\) by using the following relations,

\(F_N\) can be transformed into a dual discrete Fourier transform \(G_N\) as follows.

Next, let's define a few terms. If we fix the \(p_2\) of \(x[p_1 + p_2n]\) to \(p_2 = q\), \(x\) is regarded as the array that has only the index \(p_1\). We call this array the \(q\)-line of \(x\). In addition, \(y\) which satisfies the next relation: \(x[p_1 + p_2n] = y[p_2 + p_1m]\) is called the transposed array of \(x\). And, to get \(y\) is called "transpose \(x\)". \(G_N\) is calculated with the following steps if we use these terms.

STEP-1: | transpose \(x\) | \(x[p_1 + p_2n] \longrightarrow a[p_2 + p_1m]\) |

STEP-2: | FFT all \(p_1\)-line of \(a\) by \(F_m\) | \(a[p_2 + p_1m] \longrightarrow b[k_1 + p_1m]\) |

STEP-3: | multiply twiddle factor \(W_N^{k_1p_1}\) | \(b[k_1 + p_1m] \longrightarrow b[k_1 + p_1m]W_N^{k_1p_1} = c[k_1 + p_1m]\) |

STEP-4: | transpose \(c\) | \(c[k_1 + p_1m] \longrightarrow d[p_1 + k_1n]\) |

STEP-5: | FFT all \(k_1\)-line of \(d\) by \(F_n\) | \(d[p_1 + k_1n] \longrightarrow e[k_2 + k_1n]\) |

STEP-6: | transpose \(e\) | \(e[k_2 + k_1n] \longrightarrow X[k_1 + k_2m]\) |

The reason why transposition is necessary is because we are assuming that the subroutines \(F_n, F_m\) cannot transform the input sequence well when its stride is not 1. In this way, because we can execute FFT with six steps, this algorithm is called the Six-Step FFT. Of course, \(F_n\) and \(F_m\) are implemented with FFT.

Even if we use the Six-Step FFT, the complexity is not changed from normal FFT. Well then, what is the advantage of the Six-Step FFT? In the case of normal FFT, the locality of memory access is decreased when the sequence length becomes larger. However, in the case of the Six-Step FFT, the locality of memory access is improved because it decomposes a large sequence to small sequences and executes each small FFT. Therefore, the performance of the Six-Step FFT is improved at large sequence.

Implementation of the Six-Step FFT is as follows.

#include <complex> #include <cmath> typedef std::complex<double> complex_t; void fft0(int n, int s, bool eo, complex_t* x, complex_t* y) // n : sequence length // s : stride // eo : x is output if eo == 0, y is output if eo == 1 // x : input sequence(or output sequence if eo == 0) // y : work area(or output sequence if eo == 1) { const int m = n/2; const double theta0 = 2*M_PI/n; if (n == 2) { complex_t* z = eo ? y : x; for (int q = 0; q < s; q++) { const complex_t a = x[q + 0]; const complex_t b = x[q + s]; z[q + 0] = a + b; z[q + s] = a - b; } } else if (n >= 4) { for (int p = 0; p < m; p++) { const complex_t wp = complex_t(cos(p*theta0), -sin(p*theta0)); for (int q = 0; q < s; q++) { const complex_t a = x[q + s*(p + 0)]; const complex_t b = x[q + s*(p + m)]; y[q + s*(2*p + 0)] = a + b; y[q + s*(2*p + 1)] = (a - b) * wp; } } fft0(n/2, 2*s, !eo, y, x); } } void sixstep_fft(int log_N, complex_t* x, complex_t* y) // log_N : logarithm of the sequence length(an even number of 2 or more) // x : input/output sequence // y : work area { const int N = 1 << log_N; const int n = 1 << (log_N/2); // N == n*n for (int k = 0; k < n; k++) { // transpose x for (int p = k+1; p < n; p++) { const complex_t a = x[p+k*n]; const complex_t b = x[k+p*n]; x[p+k*n] = b; x[k+p*n] = a; } } for (int p = 0; p < n; p++) // FFT all p-line of x fft0(n, 1, 0, x + p*n, y + p*n); for (int p = 0; p < n; p++) { // multiply twiddle factor and transpose x const double theta0 = 2*p*M_PI/N; for (int k = p; k < n; k++) { const double theta = k*theta0; const complex_t wkp = complex_t(cos(theta), -sin(theta)); if (k == p) x[p+p*n] *= wkp; else { const complex_t a = x[k+p*n] * wkp; const complex_t b = x[p+k*n] * wkp; x[k+p*n] = b; x[p+k*n] = a; } } } for (int k = 0; k < n; k++) // FFT all k-line of x fft0(n, 1, 0, x + k*n, y + k*n); for (int k = 0; k < n; k++) { // transpose x for (int p = k+1; p < n; p++) { const complex_t a = x[p+k*n]; const complex_t b = x[k+p*n]; x[p+k*n] = b; x[k+p*n] = a; } } } void eightstep_fft(int log_n, complex_t* x, complex_t* y) // log_n : logarithm of the sequence length(an odd number of 3 or more) // x : input/output sequence // y : work area { const int n = 1 << log_n; const int m = n/2; const double theta0 = 2*M_PI/n; for (int p = 0; p < m; p++) { const double theta = p*theta0; const complex_t wp = complex_t(cos(theta), -sin(theta)); const complex_t a = x[p+0]; const complex_t b = x[p+m]; y[p+0] = a + b; y[p+m] = (a - b) * wp; } sixstep_fft(log_n-1, y + 0, x + 0); sixstep_fft(log_n-1, y + m, x + m); for (int p = 0; p < m; p++) { x[2*p+0] = y[p+0]; x[2*p+1] = y[p+m]; } } void fft(int n, complex_t* x) // Fourier transform // n : sequence length // x : input/output sequence { int log_n = 0; for (int i = n; i > 1; i >>= 1) log_n++; complex_t* y = new complex_t[n]; if (n <= 1) {} else if (n == 2) fft0(n, 1, 0, x, y); else if (!(log_n & 1)) sixstep_fft(log_n, x, y); else eightstep_fft(log_n, x, y); delete[] y; for (int k = 0; k < n; k++) x[k] /= n; } void ifft(int n, complex_t* x) // Inverse Fourier transform // n : sequence length // x : input/output sequence { for (int p = 0; p < n; p++) x[p] = conj(x[p]); int log_n = 0; for (int i = n; i > 1; i >>= 1) log_n++; complex_t* y = new complex_t[n]; if (n <= 1) {} else if (n == 2) fft0(n, 1, 0, x, y); else if (!(log_n & 1)) sixstep_fft(log_n, x, y); else eightstep_fft(log_n, x, y); delete[] y; for (int k = 0; k < n; k++) x[k] = conj(x[k]); }

Above, sequence length \(N\) is assumed to be factored like \(N = nm\).
In this condition, when \(n = m\),
we can execute the transposition extremely effectively.
And, the number of reads of the coefficient \(W_N\) is halved of usual.
Hence, in the subroutine `sixstep_fft()`

of this code,
we are assuming that \(n = m\).

If \(n \neq m\), it can be decomposed by applying the Six-Step FFT algorithm
under this assumption: \(n = \frac{N}{2}, m = 2\).
Then, because \(n = uv\) and \(u = v\),
we can execute \(F_n\) with the subroutine `sixstep_fft()`

.
I call this algorithm the Eight-Step FFT algorithm.

By the way, I made the FFT library using the algorithm which we led here. If you are interested, please to this page.