[OTFFT: High Speed FFT library]
command
top page
init page
font spec
font small
font large
toc on
toc off
print
cancel
FontName/Size/Weight

[Optimization using the Six-Step FFT and Eight-Step FFT]

In this page, I explain a optimization of FFT using the Six-Step FFT algorithm.

First, we will define discrete Fourier transform (DFT). DFT \(F_N\) is defined as follows when we assume that \(x_p\) is an input sequence and assume that \(X_k\) is an output sequence and assume that the size is \(N = 2^L\), \(L=0,1,2,3,\ldots\)

\[ X_k = F_N(x_p) = \frac{1}{N}\sum_{p=0}^{N-1}x_p W_N^{kp},~W_N = \exp\left(-j\frac{2\pi}{N}\right),~j = \sqrt{-1} \] \[ k = 0,1,2,3,\ldots,N-1 \]

Now, we express \(x_p\) with the array \(x[p]\), and if \(N\) can be factored like \(N = nm\), it can two-dimensionaly be expressed by using \(p_1\) and \(p_2\) as follows.

\[ x_p = x[p_1 + p_2n],~p = p_1 + p_2n \]

Here, \(p_1 = 0,1,\ldots,n-1\) and \(p_2 = 0,1,\ldots,m-1\).

Similarly, when we express \(X_k\) with the array \(X[k]\), it can two-dimensionaly be expressed by using \(k_1\) and \(k_2\) as follows.

\[ X_k = X[k_1 + k_2m],~k = k_1 + k_2m \]

Here, \(k_1 = 0,1,\ldots,m-1\) and \(k_2 = 0,1,\ldots,n-1\).

Now, if we transform \(F_N\) by using the following relations,

\[ \begin{eqnarray} kp&=&(k_1 + k_2m)(p_1 + p_2n) = k_1p_1 + k_1p_2n + k_2p_1m + k_2p_2N \\ W_N^{kp}&=&W_N^{k_1p_1 + k_1p_2n + k_2p_1m + k_2p_2N} \\ W_N^{k_1p_2n}&=&W_m^{k_1p_2},~W_N^{k_2p_1m} = W_n^{k_2p_1},~W_N^{k_2p_2N} = 1 \end{eqnarray} \]

\(F_N\) can be transformed into a dual discrete Fourier transform \(G_N\) as follows.

\[ X[k_1 + k_2m] = G_N(x[p_1 + p_2n]) = \frac{1}{n}\sum_{p_1=0}^{n-1}\left(\left(\frac{1}{m}\sum_{p_2=0}^{m-1}x[p_1 + p_2n]W_m^{k_1p_2}\right)W_N^{k_1p_1}\right)W_n^{k_2p_1} \]

Next, let's define a few terms. If we fix the \(p_2\) of \(x[p_1 + p_2n]\) to \(p_2 = q\), \(x\) is regarded as the array that has only the index \(p_1\). We call this array the \(q\)-line of \(x\). In addition, \(y\) which satisfies the next relation: \(x[p_1 + p_2n] = y[p_2 + p_1m]\) is called the transposed array of \(x\). And, to get \(y\) is called "transpose \(x\)". \(G_N\) is calculated with the following steps if we use these terms.

STEP-1: transpose \(x\)\(x[p_1 + p_2n] \longrightarrow a[p_2 + p_1m]\)
STEP-2: FFT all \(p_1\)-line of \(a\) by \(F_m\) \(a[p_2 + p_1m] \longrightarrow b[k_1 + p_1m]\)
STEP-3: multiply twiddle factor \(W_N^{k_1p_1}\) \(b[k_1 + p_1m] \longrightarrow b[k_1 + p_1m]W_N^{k_1p_1} = c[k_1 + p_1m]\)
STEP-4: transpose \(c\)\(c[k_1 + p_1m] \longrightarrow d[p_1 + k_1n]\)
STEP-5: FFT all \(k_1\)-line of \(d\) by \(F_n\) \(d[p_1 + k_1n] \longrightarrow e[k_2 + k_1n]\)
STEP-6: transpose \(e\)\(e[k_2 + k_1n] \longrightarrow X[k_1 + k_2m]\)

The reason why transposition is necessary is because we are assuming that the subroutines \(F_n, F_m\) cannot transform the input sequence well when its stride is not 1. In this way, because we can execute FFT with six steps, this algorithm is called the Six-Step FFT. Of course, \(F_n\) and \(F_m\) are implemented with FFT.

Even if we use the Six-Step FFT, the complexity is not changed from normal FFT. Well then, what is the advantage of the Six-Step FFT? In the case of normal FFT, the locality of memory access is decreased when the sequence length becomes larger. However, in the case of the Six-Step FFT, the locality of memory access is improved because it decomposes a large sequence to small sequences and executes each small FFT. Therefore, the performance of the Six-Step FFT is improved at large sequence.

Implementation of the Six-Step FFT is as follows.

List-11: Six-Step FFT
#include <complex>
#include <cmath>

typedef std::complex<double> complex_t;

void fft0(int n, int s, bool eo, complex_t* x, complex_t* y)
// n  : sequence length
// s  : stride
// eo : x is output if eo == 0, y is output if eo == 1
// x  : input sequence(or output sequence if eo == 0)
// y  : work area(or output sequence if eo == 1)
{
    const int m = n/2;
    const double theta0 = 2*M_PI/n;

    if (n == 2) {
        complex_t* z = eo ? y : x;
        for (int q = 0; q < s; q++) {
            const complex_t a = x[q + 0];
            const complex_t b = x[q + s];
            z[q + 0] = a + b;
            z[q + s] = a - b;
        }
    }
    else if (n >= 4) {
        for (int p = 0; p < m; p++) {
            const complex_t wp = complex_t(cos(p*theta0), -sin(p*theta0));
            for (int q = 0; q < s; q++) {
                const complex_t a = x[q + s*(p + 0)];
                const complex_t b = x[q + s*(p + m)];
                y[q + s*(2*p + 0)] =  a + b;
                y[q + s*(2*p + 1)] = (a - b) * wp;
            }
        }
        fft0(n/2, 2*s, !eo, y, x);
    }
}

void sixstep_fft(int log_N, complex_t* x, complex_t* y)
// log_N : logarithm of the sequence length(an even number of 2 or more)
// x     : input/output sequence
// y     : work area
{
    const int N = 1 << log_N;
    const int n = 1 << (log_N/2); // N == n*n
    for (int k = 0; k < n; k++) { // transpose x
        for (int p = k+1; p < n; p++) {
            const complex_t a = x[p+k*n];
            const complex_t b = x[k+p*n];
            x[p+k*n] = b;
            x[k+p*n] = a;
        }
    }
    for (int p = 0; p < n; p++) // FFT all p-line of x
        fft0(n, 1, 0, x + p*n, y + p*n);
    for (int p = 0; p < n; p++) { // multiply twiddle factor and transpose x
        const double theta0 = 2*p*M_PI/N;
        for (int k = p; k < n; k++) {
            const double theta = k*theta0;
            const complex_t wkp = complex_t(cos(theta), -sin(theta));
            if (k == p)
                x[p+p*n] *= wkp;
            else {
                const complex_t a = x[k+p*n] * wkp;
                const complex_t b = x[p+k*n] * wkp;
                x[k+p*n] = b;
                x[p+k*n] = a;
            }
        }
    }
    for (int k = 0; k < n; k++) // FFT all k-line of x
        fft0(n, 1, 0, x + k*n, y + k*n);
    for (int k = 0; k < n; k++) { // transpose x
        for (int p = k+1; p < n; p++) {
            const complex_t a = x[p+k*n];
            const complex_t b = x[k+p*n];
            x[p+k*n] = b;
            x[k+p*n] = a;
        }
    }
}

void eightstep_fft(int log_n, complex_t* x, complex_t* y)
// log_n : logarithm of the sequence length(an odd number of 3 or more)
// x     : input/output sequence
// y     : work area
{
    const int n = 1 << log_n;
    const int m = n/2;
    const double theta0 = 2*M_PI/n;
    for (int p = 0; p < m; p++) {
        const double theta = p*theta0;
        const complex_t wp = complex_t(cos(theta), -sin(theta));
        const complex_t a = x[p+0];
        const complex_t b = x[p+m];
        y[p+0] =  a + b;
        y[p+m] = (a - b) * wp;
    }
    sixstep_fft(log_n-1, y + 0, x + 0);
    sixstep_fft(log_n-1, y + m, x + m);
    for (int p = 0; p < m; p++) {
        x[2*p+0] = y[p+0];
        x[2*p+1] = y[p+m];
    }
}

void fft(int n, complex_t* x) // Fourier transform
// n : sequence length
// x : input/output sequence
{
    int log_n = 0;
    for (int i = n; i > 1; i >>= 1) log_n++;
    complex_t* y = new complex_t[n];
    if (n <= 1) {}
    else if (n == 2)       fft0(n, 1, 0, x, y);
    else if (!(log_n & 1)) sixstep_fft(log_n, x, y);
    else                   eightstep_fft(log_n, x, y);
    delete[] y;
    for (int k = 0; k < n; k++) x[k] /= n;
}

void ifft(int n, complex_t* x) // Inverse Fourier transform
// n : sequence length
// x : input/output sequence
{
    for (int p = 0; p < n; p++) x[p] = conj(x[p]);
    int log_n = 0;
    for (int i = n; i > 1; i >>= 1) log_n++;
    complex_t* y = new complex_t[n];
    if (n <= 1) {}
    else if (n == 2)       fft0(n, 1, 0, x, y);
    else if (!(log_n & 1)) sixstep_fft(log_n, x, y);
    else                   eightstep_fft(log_n, x, y);
    delete[] y;
    for (int k = 0; k < n; k++) x[k] = conj(x[k]);
}

Above, sequence length \(N\) is assumed to be factored like \(N = nm\). In this condition, when \(n = m\), we can execute the transposition extremely effectively. And, the number of reads of the coefficient \(W_N\) is halved of usual. Hence, in the subroutine sixstep_fft() of this code, we are assuming that \(n = m\).

If \(n \neq m\), it can be decomposed by applying the Six-Step FFT algorithm under this assumption: \(n = \frac{N}{2}, m = 2\). Then, because \(n = uv\) and \(u = v\), we can execute \(F_n\) with the subroutine sixstep_fft(). I call this algorithm the Eight-Step FFT algorithm.

By the way, I made the FFT library using the algorithm which we led here. If you are interested, please to this page.