[OTFFT: High Speed FFT library]
command
top page
init page
font spec
font small
font large
toc on
toc off
print
cancel
FontName/Size/Weight

## [Optimization using the Six-Step FFT and Eight-Step FFT]

In this page, I explain a optimization of FFT using the Six-Step FFT algorithm.

First, we will define discrete Fourier transform (DFT). DFT $$F_N$$ is defined as follows when we assume that $$x_p$$ is an input sequence and assume that $$X_k$$ is an output sequence and assume that the size is $$N = 2^L$$, $$L=0,1,2,3,\ldots$$

$X_k = F_N(x_p) = \frac{1}{N}\sum_{p=0}^{N-1}x_p W_N^{kp},~W_N = \exp\left(-j\frac{2\pi}{N}\right),~j = \sqrt{-1}$ $k = 0,1,2,3,\ldots,N-1$

Now, we express $$x_p$$ with the array $$x[p]$$, and if $$N$$ can be factored like $$N = nm$$, it can two-dimensionaly be expressed by using $$p_1$$ and $$p_2$$ as follows.

$x_p = x[p_1 + p_2n],~p = p_1 + p_2n$

Here, $$p_1 = 0,1,\ldots,n-1$$ and $$p_2 = 0,1,\ldots,m-1$$.

Similarly, when we express $$X_k$$ with the array $$X[k]$$, it can two-dimensionaly be expressed by using $$k_1$$ and $$k_2$$ as follows.

$X_k = X[k_1 + k_2m],~k = k_1 + k_2m$

Here, $$k_1 = 0,1,\ldots,m-1$$ and $$k_2 = 0,1,\ldots,n-1$$.

Now, if we transform $$F_N$$ by using the following relations,

$\begin{eqnarray} kp&=&(k_1 + k_2m)(p_1 + p_2n) = k_1p_1 + k_1p_2n + k_2p_1m + k_2p_2N \\ W_N^{kp}&=&W_N^{k_1p_1 + k_1p_2n + k_2p_1m + k_2p_2N} \\ W_N^{k_1p_2n}&=&W_m^{k_1p_2},~W_N^{k_2p_1m} = W_n^{k_2p_1},~W_N^{k_2p_2N} = 1 \end{eqnarray}$

$$F_N$$ can be transformed into a dual discrete Fourier transform $$G_N$$ as follows.

$X[k_1 + k_2m] = G_N(x[p_1 + p_2n]) = \frac{1}{n}\sum_{p_1=0}^{n-1}\left(\left(\frac{1}{m}\sum_{p_2=0}^{m-1}x[p_1 + p_2n]W_m^{k_1p_2}\right)W_N^{k_1p_1}\right)W_n^{k_2p_1}$

Next, let's define a few terms. If we fix the $$p_2$$ of $$x[p_1 + p_2n]$$ to $$p_2 = q$$, $$x$$ is regarded as the array that has only the index $$p_1$$. We call this array the $$q$$-line of $$x$$. In addition, $$y$$ which satisfies the next relation: $$x[p_1 + p_2n] = y[p_2 + p_1m]$$ is called the transposed array of $$x$$. And, to get $$y$$ is called "transpose $$x$$". $$G_N$$ is calculated with the following steps if we use these terms.

 STEP-1: transpose $$x$$ $$x[p_1 + p_2n] \longrightarrow a[p_2 + p_1m]$$ STEP-2: FFT all $$p_1$$-line of $$a$$ by $$F_m$$ $$a[p_2 + p_1m] \longrightarrow b[k_1 + p_1m]$$ STEP-3: multiply twiddle factor $$W_N^{k_1p_1}$$ $$b[k_1 + p_1m] \longrightarrow b[k_1 + p_1m]W_N^{k_1p_1} = c[k_1 + p_1m]$$ STEP-4: transpose $$c$$ $$c[k_1 + p_1m] \longrightarrow d[p_1 + k_1n]$$ STEP-5: FFT all $$k_1$$-line of $$d$$ by $$F_n$$ $$d[p_1 + k_1n] \longrightarrow e[k_2 + k_1n]$$ STEP-6: transpose $$e$$ $$e[k_2 + k_1n] \longrightarrow X[k_1 + k_2m]$$

The reason why transposition is necessary is because we are assuming that the subroutines $$F_n, F_m$$ cannot transform the input sequence well when its stride is not 1. In this way, because we can execute FFT with six steps, this algorithm is called the Six-Step FFT. Of course, $$F_n$$ and $$F_m$$ are implemented with FFT.

Even if we use the Six-Step FFT, the complexity is not changed from normal FFT. Well then, what is the advantage of the Six-Step FFT? In the case of normal FFT, the locality of memory access is decreased when the sequence length becomes larger. However, in the case of the Six-Step FFT, the locality of memory access is improved because it decomposes a large sequence to small sequences and executes each small FFT. Therefore, the performance of the Six-Step FFT is improved at large sequence.

Implementation of the Six-Step FFT is as follows.

List-11: Six-Step FFT
#include <complex>
#include <cmath>

typedef std::complex<double> complex_t;

void fft0(int n, int s, bool eo, complex_t* x, complex_t* y)
// n  : sequence length
// s  : stride
// eo : x is output if eo == 0, y is output if eo == 1
// x  : input sequence(or output sequence if eo == 0)
// y  : work area(or output sequence if eo == 1)
{
const int m = n/2;
const double theta0 = 2*M_PI/n;

if (n == 2) {
complex_t* z = eo ? y : x;
for (int q = 0; q < s; q++) {
const complex_t a = x[q + 0];
const complex_t b = x[q + s];
z[q + 0] = a + b;
z[q + s] = a - b;
}
}
else if (n >= 4) {
for (int p = 0; p < m; p++) {
const complex_t wp = complex_t(cos(p*theta0), -sin(p*theta0));
for (int q = 0; q < s; q++) {
const complex_t a = x[q + s*(p + 0)];
const complex_t b = x[q + s*(p + m)];
y[q + s*(2*p + 0)] =  a + b;
y[q + s*(2*p + 1)] = (a - b) * wp;
}
}
fft0(n/2, 2*s, !eo, y, x);
}
}

void sixstep_fft(int log_N, complex_t* x, complex_t* y)
// log_N : logarithm of the sequence length(an even number of 2 or more)
// x     : input/output sequence
// y     : work area
{
const int N = 1 << log_N;
const int n = 1 << (log_N/2); // N == n*n
for (int k = 0; k < n; k++) { // transpose x
for (int p = k+1; p < n; p++) {
const complex_t a = x[p+k*n];
const complex_t b = x[k+p*n];
x[p+k*n] = b;
x[k+p*n] = a;
}
}
for (int p = 0; p < n; p++) // FFT all p-line of x
fft0(n, 1, 0, x + p*n, y + p*n);
for (int p = 0; p < n; p++) { // multiply twiddle factor and transpose x
const double theta0 = 2*p*M_PI/N;
for (int k = p; k < n; k++) {
const double theta = k*theta0;
const complex_t wkp = complex_t(cos(theta), -sin(theta));
if (k == p)
x[p+p*n] *= wkp;
else {
const complex_t a = x[k+p*n] * wkp;
const complex_t b = x[p+k*n] * wkp;
x[k+p*n] = b;
x[p+k*n] = a;
}
}
}
for (int k = 0; k < n; k++) // FFT all k-line of x
fft0(n, 1, 0, x + k*n, y + k*n);
for (int k = 0; k < n; k++) { // transpose x
for (int p = k+1; p < n; p++) {
const complex_t a = x[p+k*n];
const complex_t b = x[k+p*n];
x[p+k*n] = b;
x[k+p*n] = a;
}
}
}

void eightstep_fft(int log_n, complex_t* x, complex_t* y)
// log_n : logarithm of the sequence length(an odd number of 3 or more)
// x     : input/output sequence
// y     : work area
{
const int n = 1 << log_n;
const int m = n/2;
const double theta0 = 2*M_PI/n;
for (int p = 0; p < m; p++) {
const double theta = p*theta0;
const complex_t wp = complex_t(cos(theta), -sin(theta));
const complex_t a = x[p+0];
const complex_t b = x[p+m];
y[p+0] =  a + b;
y[p+m] = (a - b) * wp;
}
sixstep_fft(log_n-1, y + 0, x + 0);
sixstep_fft(log_n-1, y + m, x + m);
for (int p = 0; p < m; p++) {
x[2*p+0] = y[p+0];
x[2*p+1] = y[p+m];
}
}

void fft(int n, complex_t* x) // Fourier transform
// n : sequence length
// x : input/output sequence
{
int log_n = 0;
for (int i = n; i > 1; i >>= 1) log_n++;
complex_t* y = new complex_t[n];
if (n <= 1) {}
else if (n == 2)       fft0(n, 1, 0, x, y);
else if (!(log_n & 1)) sixstep_fft(log_n, x, y);
else                   eightstep_fft(log_n, x, y);
delete[] y;
for (int k = 0; k < n; k++) x[k] /= n;
}

void ifft(int n, complex_t* x) // Inverse Fourier transform
// n : sequence length
// x : input/output sequence
{
for (int p = 0; p < n; p++) x[p] = conj(x[p]);
int log_n = 0;
for (int i = n; i > 1; i >>= 1) log_n++;
complex_t* y = new complex_t[n];
if (n <= 1) {}
else if (n == 2)       fft0(n, 1, 0, x, y);
else if (!(log_n & 1)) sixstep_fft(log_n, x, y);
else                   eightstep_fft(log_n, x, y);
delete[] y;
for (int k = 0; k < n; k++) x[k] = conj(x[k]);
}


Above, sequence length $$N$$ is assumed to be factored like $$N = nm$$. In this condition, when $$n = m$$, we can execute the transposition extremely effectively. And, the number of reads of the coefficient $$W_N$$ is halved of usual. Hence, in the subroutine sixstep_fft() of this code, we are assuming that $$n = m$$.

If $$n \neq m$$, it can be decomposed by applying the Six-Step FFT algorithm under this assumption: $$n = \frac{N}{2}, m = 2$$. Then, because $$n = uv$$ and $$u = v$$, we can execute $$F_n$$ with the subroutine sixstep_fft(). I call this algorithm the Eight-Step FFT algorithm.

By the way, I made the FFT library using the algorithm which we led here. If you are interested, please to this page.