## [Iterative version of the Stockham FFT]

In the previous page, we led the Stockham algorithm from the Cooley-Tukey algorithm by simple transformation. However, this recursive algorithm does not have the advantage in speed against the Cooley-Tukey algorithm. Therefore, in this page, we further transform this algorithm and will lead high-speed iterative version of the Stockham algorithm.

First, let's examine the behavior of the
List-3.
When I illustrate the recursive-call state of
`fft0()`

and `fft1()`

,
it will be as follows.

[recursion stage-0] fft0(8,1,0,x,y) access to x[p] [recursion stage-1] fft1(4,2,0,x,y) fft1(4,2,1,x,y) access to x[2p+0] access to x[2p+1] [recursion stage-2] fft0(2,4,0,x,y) fft0(2,4,2,x,y) fft0(2,4,1,x,y) fft0(2,4,3,x,y) access to x[4p+0] access to x[4p+2] access to x[4p+1] access to x[4p+3]

`fft0(n,s,q,x,y)`

and `fft1(n,s,q,x,y)`

Here, I have assumed that sequence length is 8.
If we move `q`

of `fft0(n,s,q,x,y)`

from 0 to 3 (in other words, form 0 to `s-1`

),
this iteration becomes equivalent to recursion stage-2.
Based on this fact, when we transform the
List-3,
it will be as follows.

#include <complex> #include <cmath> typedef std::complex<double> complex_t; void fft1(int n, int s, complex_t* x, complex_t* y); void fft0(int n, int s, complex_t* x, complex_t* y) // n : sequence length // s : stride // x : input/output sequence // y : work area { const int m = n/2; const double theta0 = 2*M_PI/n; if (n == 1) {} else { for (int q = 0; q < s; q++) { // Iteration for recursive-call for (int p = 0; p < m; p++) { const complex_t wp = complex_t(cos(p*theta0), -sin(p*theta0)); const complex_t a = x[q + s*(p + 0)]; const complex_t b = x[q + s*(p + m)]; y[q + s*(2*p + 0)] = a + b; y[q + s*(2*p + 1)] = (a - b) * wp; } } fft1(n/2, 2*s, y, x); // The number of recursive-calls is one because we tie calls with an iteration. } } void fft1(int n, int s, complex_t* x, complex_t* y) // n : sequence length // s : stride // x : input sequence // y : output sequence { const int m = n/2; const double theta0 = 2*M_PI/n; if (n == 1) { for (int q = 0; q < s; q++) y[q] = x[q]; } else { for (int q = 0; q < s; q++) { // Iteration for recursive-call for (int p = 0; p < m; p++) { const complex_t wp = complex_t(cos(p*theta0), -sin(p*theta0)); const complex_t a = x[q + s*(p + 0)]; const complex_t b = x[q + s*(p + m)]; y[q + s*(2*p + 0)] = a + b; y[q + s*(2*p + 1)] = (a - b) * wp; } } fft0(n/2, 2*s, y, x); // The number of recursive-calls is one because we tie calls with an iteration. } } void fft(int n, complex_t* x) // Fourier transform // n : sequence length // x : input/output sequence { complex_t* y = new complex_t[n]; // Allocation of work area fft0(n, 1, x, y); delete[] y; for (int k = 0; k < n; k++) x[k] /= n; } void ifft(int n, complex_t* x) // Inverse Fourier transform // n : sequence length // x : input/output sequence { for (int p = 0; p < n; p++) x[p] = conj(x[p]); complex_t* y = new complex_t[n]; // Allocation of work area fft0(n, 1, x, y); delete[] y; for (int k = 0; k < n; k++) x[k] = conj(x[k]); }

The List-5 is still slow.
However, if we swap p-loop and q-loop here,
decisive difference of the speed appears.
The reason why List-5 is slow is
that the code accesses to an array with large intervals
when the stride `s`

becomes larger
(when the recursion stage becomes deeper).
The program will be as follows if we swap p-loop and q-loop.

#include <complex> #include <cmath> typedef std::complex<double> complex_t; void fft1(int n, int s, complex_t* x, complex_t* y); void fft0(int n, int s, complex_t* x, complex_t* y) // n : sequence length // s : stride // x : input/output sequence // y : work area { const int m = n/2; const double theta0 = 2*M_PI/n; if (n == 1) {} else { for (int p = 0; p < m; p++) { const complex_t wp = complex_t(cos(p*theta0), -sin(p*theta0)); for (int q = 0; q < s; q++) { const complex_t a = x[q + s*(p + 0)]; const complex_t b = x[q + s*(p + m)]; y[q + s*(2*p + 0)] = a + b; y[q + s*(2*p + 1)] = (a - b) * wp; } } fft1(n/2, 2*s, y, x); } } void fft1(int n, int s, complex_t* x, complex_t* y) // n : sequence length // s : stride // x : input sequence // y : output sequence { const int m = n/2; const double theta0 = 2*M_PI/n; if (n == 1) { for (int q = 0; q < s; q++) y[q] = x[q]; } else { for (int p = 0; p < m; p++) { const complex_t wp = complex_t(cos(p*theta0), -sin(p*theta0)); for (int q = 0; q < s; q++) { const complex_t a = x[q + s*(p + 0)]; const complex_t b = x[q + s*(p + m)]; y[q + s*(2*p + 0)] = a + b; y[q + s*(2*p + 1)] = (a - b) * wp; } } fft0(n/2, 2*s, y, x); } } void fft(int n, complex_t* x) // Fourier transform // n : sequence length // x : input/output sequence { complex_t* y = new complex_t[n]; fft0(n, 1, x, y); delete[] y; for (int k = 0; k < n; k++) x[k] /= n; } void ifft(int n, complex_t* x) // Inverse Fourier transform // n : sequence length // x : input/output sequence { for (int p = 0; p < n; p++) x[p] = conj(x[p]); complex_t* y = new complex_t[n]; fft0(n, 1, x, y); delete[] y; for (int k = 0; k < n; k++) x[k] = conj(x[k]); }

Then, the stride of array access becomes 1.
And it becomes further efficient
by placing `cos()`

and `sin()`

to outer loop.
The List-6 is fast.
This is so-called Stockham algorithm.

Well, it is wasteful to write twice about the almost same thing by a mutual recursion. So, it will be as follows if we delete the duplication.

#include <complex> #include <cmath> typedef std::complex<double> complex_t; void fft0(int n, int s, bool eo, complex_t* x, complex_t* y) // n : sequence length // s : stride // eo : x is output if eo == 0, y is output if eo == 1 // x : input sequence(or output sequence if eo == 0) // y : work area(or output sequence if eo == 1) { const int m = n/2; const double theta0 = 2*M_PI/n; if (n == 1) { if (eo) for (int q = 0; q < s; q++) y[q] = x[q]; } else { for (int p = 0; p < m; p++) { const complex_t wp = complex_t(cos(p*theta0), -sin(p*theta0)); for (int q = 0; q < s; q++) { const complex_t a = x[q + s*(p + 0)]; const complex_t b = x[q + s*(p + m)]; y[q + s*(2*p + 0)] = a + b; y[q + s*(2*p + 1)] = (a - b) * wp; } } fft0(n/2, 2*s, !eo, y, x); } } void fft(int n, complex_t* x) // Fourier transform // n : sequence length // x : input/output sequence { complex_t* y = new complex_t[n]; fft0(n, 1, 0, x, y); delete[] y; for (int k = 0; k < n; k++) x[k] /= n; } void ifft(int n, complex_t* x) // Inverse Fourier transform // n : sequence length // x : input/output sequence { for (int p = 0; p < n; p++) x[p] = conj(x[p]); complex_t* y = new complex_t[n]; fft0(n, 1, 0, x, y); delete[] y; for (int k = 0; k < n; k++) x[k] = conj(x[k]); }

By the way, I made the FFT library using the Stockham algorithm which we led here. If you are interested, please to this page. People who are interested in optimization that reduce the accesses to an array, please to this page. In addition, I organized about kinds of Stockham algorithms in the next page.