Documentation of OTFFT

[Documentation of OTFFT]

[What is the OTFFT]

OTFFT is a high-speed FFT library using the Stockham's algorithm and AVX. In addition, C++ template metaprogramming technique is used in OTFFT. And OTFFT is a mixed-radix FFT.

[Caution]

OTFFT is optimized for the environment used to compile. In other words, if we compile it on AVX-supported environment, it becomes the binary that uses AVX. If someone executes this binary on AVX-unsupported environment, of course, it crashes by causing the exception. For this reason, OTFFT is not suitable for the use for distributing the compiled binary to all people. It is being assumed that people use OTFFT to compile a program of numerical calculation from the source code.

[Necessary files]

If you expand the "otfft-11.6e.tar.gz", you get several files. Something necessary to you is "otfft" folder in these files.

[Setup]

In order to tune up this library, you need to execute the following commands in "otfft" folder.

    make ffttune
    ./ffttune
    make otfft.o

For the compilation, it takes a very long time. Please be patient. Some of linux will be very slow when it is run on 8 threads. In that case, please set the "OMP_NUM_THREADS" environment variable to 7.

[How to use]

After you add the "otfft" folder in the compiler include path, please execute the following commands. "hello.cpp" is your program.

    clang++-mp-4.0 -c -O3 hello.cpp
    clang++-mp-4.0 hello.o otfft/otfft.o -L/opt/local/lib/libomp -lomp -o hello

Here, I'm assuming that there is a "otfft" folder in the current folder. "clang++-mp-4.0" is MacPorts Clang compiler.

[Single Thread Mode]

To use the single thread mode, you need to compile the OTFFT with the condition that DO_SINGLE_THREAD macro is defined at "otfft_misc.h". And, please start from "make ffttune".

Following such transforms are provided.

[Complex-to-Complex FFT]

    #include "otfft/otfft.h"
    using OTFFT::complex_t;
    using OTFFT::simd_malloc;
    using OTFFT::simd_free;

    void f(int N)
    {
        complex_t* x = (complex_t*) simd_malloc(N*sizeof(complex_t));
        // do something
        OTFFT::FFT fft(N); // creation of FFT object. N is sequence length.
        fft.fwd(x);        // execution of transformation. x is input and output
        // do something
        simd_free(x);
    }

complex_t is defined as follows.

    struct complex_t
    {
        double Re, Im;

        complex_t() : Re(0), Im(0) {}
        complex_t(const double& x) : Re(x), Im(0) {}
        complex_t(const double& x, const double& y) : Re(x), Im(y) {}
        complex_t(const std::complex<double>& z) : Re(z.real()), Im(z.imag()) {}
        operator std::complex<double>() const { return std::complex(Re, Im); }

        // ...
    };

There are member functions, such as the following.

    fwd(x)  -- DFT(with 1/N normalization)  x:input/output
    fwd0(x) -- DFT(non normalization)       x:input/output
    fwdu(x) -- DFT(unitary transformation)  x:input/output
    fwdn(x) -- DFT(with 1/N normalization)  x:input/output

    inv(x)  -- IDFT(non normalization)      x:input/output
    inv0(x) -- IDFT(non normalization)      x:input/output
    invu(x) -- IDFT(unitary transformation) x:input/output
    invn(x) -- IDFT(with 1/N normalization) x:input/output

To change the FFT size, do the following.

    fft.setup(2 * N);

To use in a multi-threaded environment, we do as follows.

    #include "otfft/otfft.h"
    using OTFFT::complex_t;
    using OTFFT::simd_malloc;
    using OTFFT::simd_free;

    void f(int N)
    {
        complex_t* x = (complex_t*) simd_malloc(N*sizeof(complex_t));
        complex_t* y = (complex_t*) simd_malloc(N*sizeof(complex_t));
        // do someting
        OTFFT::FFT0 fft(N);
        fft.fwd(x, y); // x is input/output. y is work area
        // do something
        fft.inv(x, y); // x is input/output. y is work area
        // do someting
        simd_free(y);
        simd_free(x);
    }

Please note that "OTFFT::FFT" was changed to "OTFFT::FFT0".

[Real-to-Complex FFT]

    #include "otfft/otfft.h"
    using OTFFT::complex_t;
    using OTFFT::simd_malloc;
    using OTFFT::simd_free;

    void f(int N)
    {
        double*    x = (double*)    simd_malloc(N*sizeof(double));
        complex_t* y = (complex_t*) simd_malloc(N*sizeof(complex_t));
        // do something
        OTFFT::RFFT rfft(N);
        rfft.fwd(x, y); // x is input. y is outout
        // do something
        simd_free(y);
        simd_free(x);
    }

N must be an even number. There are member functions, such as the following.

    fwd(x, y)  -- DFT(with 1/N normalization)  x:input, y:output
    fwd0(x, y) -- DFT(non normalization)       x:input, y:output
    fwdu(x, y) -- DFT(unitary transformation)  x:input, y:output
    fwdn(x, y) -- DFT(with 1/N normalization)  x:input, y:output

    inv(y, x)  -- IDFT(non normalization)      y:input, x:output
    inv0(y, x) -- IDFT(non normalization)      y:input, x:output
    invu(y, x) -- IDFT(unitary transformation) y:input, x:output
    invn(y, x) -- IDFT(with 1/N normalization) y:input, x:output

inv,inv0,invu,invn will destroy the input.

[Discrete Cosine Transformation(DCT-II)]

This transformation, orthogonalization is not executed.

    #include "otfft/otfft.h"
    using OTFFT::complex_t;
    using OTFFT::simd_malloc;
    using OTFFT::simd_free;

    void f(int N)
    {
        double* x = (double*) simd_malloc(N*sizeof(double));
        // do something
        OTFFT::DCT dct(N);
        dct.fwd(x); // execution of DCT. x is input and output
        // do something
        simd_free(x);
    }

N must be an even number. There are member functions, such as the following.

    fwd(x)  -- DCT(with 1/N normalization) x:input/output
    fwd0(x) -- DCT(non normalization)      x:input/output
    fwdn(x) -- DCT(with 1/N normalization) x:input/output

    inv(x)  -- IDCT(non normalization)     x:input/output
    inv0(x) -- IDCT(non normalization)     x:input/output
    invn(x) -- IDCT(with 1/N nomalization) x:input/output

To use in a multi-threaded environment, we do as follows.

    #include "otfft/otfft.h"
    using OTFFT::complex_t;
    using OTFFT::simd_malloc;
    using OTFFT::simd_free;

    void f(int N)
    {
        double*    x = (double*)    simd_malloc(N*sizeof(double));
        double*    y = (double*)    simd_malloc(N*sizeof(double));
        complex_t* z = (complex_t*) simd_malloc(N*sizeof(complex_t));
        // do something
        OTFFT::DCT0 dct(N);
        dct.fwd(x, y, z); // x is input/output. y,z are work area
        // do something
        dct.inv(x, y, z); // x is input/output. y,z are work area
        // do somthing
        simd_free(z);
        simd_free(y);
        simd_free(x);
    }

Please note that "OTFFT::DCT" was changed to "OTFFT::DCT0".

[Bluestein's FFT]

Bluestein's FFT is the FFT of any sequence length. Even if the sequence length is a big prime number, the order of complexity is O(N log N).

    #include "otfft/otfft.h"
    using OTFFT::complex_t;
    using OTFFT::simd_malloc;
    using OTFFT::simd_free;

    void f(int N)
    {
        complex_t* x = (complex_t*) simd_malloc(N*sizeof(complex_t));
        // do something
        OTFFT::Bluestein bst(N);
        bst.fwd(x); // execution of Bluestein's FFT. x is input and output
        // do something
        simd_free(x);
    }

There are member functions, such as the following.

    fwd(x)  -- DFT(with 1/N normalization)  x:input/output
    fwd0(x) -- DFT(non normalization)       x:input/output
    fwdu(x) -- DFT(unitary transformation)  x:input/output
    fwdn(x) -- DFT(with 1/N normalization)  x:input/output

    inv(x)  -- IDFT(non normalization)      x:input/output
    inv0(x) -- IDFT(non normalization)      x:input/output
    invu(x) -- IDFT(unitary transformation) x:input/output
    invn(x) -- IDFT(with 1/N normalization) x:input/output

To use in a multi-threaded environmrnt, we need to create objects of the same number as the number of threads.

[Benchmark]

Benchmark programs are bundled with OTFFT. These benchmarks require FFTW and OOURA's FFT Package. Please refer to the following URL for OOURA's FFT Package.

    http://www.kurims.kyoto-u.ac.jp/~ooura/fft.html

Please install the FFTW and place OOURA's "fftsg.c" file into "otfft-11.5e" folder. Then, execute the following command.

    make fftbench1

And, execute the following command to execute the benchmark.

    ./fftbench1

[NEW: Benchmark2]

If you want to benchmark with sufficient performance of FFTW, you need to benchmark in FFTW_MEASURE mode. In that case, run

    make fftbench5

and then

    ./fftbench5

Of course, you need to install FFTW to make fftbench5. You also need to bring "fftsg.c" from Oura's FFT package. If you just want to compare FFTW and OTFFT, run

    make ibench wisdom="wisdom_name"

However, you will need to create the wisdom file described below by yourself.

FFTW wisdom file is required to run fftbench5. It will be created automatically when you run "make fftbench5", but it will take a very long time, so be prepared for this. After start it, please have a tea time. It will probably take about 1 hour and 30 minutes.

If you would like to create a wisdom file yourself, please run

    bash makewisdom.sh wisdom_name

"wisdom_name" is the filename of the wisdom file that will be created. To run fftbench5 specifying the filename of the wisdom file, run

    ./fftbench5 1 22 wisdom_name

If 22 is changed to 24, this command benchmark up to a size of 2^24.