HighSpeedFFT

HighSpeedFFT is a high-performance, lightweight C library for computing Fast Fourier Transforms (FFTs) on complex and real-valued signals. Designed as a drop-in alternative to FFTW, it supports mixed-radix FFTs for lengths factorable into small primes (2, 3, 4, 5, 7, 8, 11, 13, 16, 32) and Bluestein’s algorithm for arbitrary lengths. The library leverages hand-tuned SIMD optimizations (SSE2, AVX2, AVX-512) with FMA support, minimal dependencies, and a compact footprint (~25 KiB code).

Features

Mixed-Radix FFT:
- Decimation-In-Time (DIT) algorithm optimized for signal lengths factorable into primes 2, 3, 4, 5, 7, 8, 11, 13, 16, 32.
- Dedicated butterfly implementations with AVX2, AVX-512, SSE2, and scalar codepaths for maximum performance and portability.
- Radix-3, 5, 11, and 13 use Rader/Winograd-style optimizations; radix-16 and 32 use multi-stage decompositions (4×4 and 4×4×2).
Bluestein’s Algorithm:
- Handles arbitrary signal lengths via chirp z-transform, ideal for prime or non-factorable sizes.
- Precomputed chirp sequences for small lengths (≤64) stored contiguously for cache efficiency.
- Thread-safe initialization and cleanup using pthread_once.
SIMD Acceleration:
- Hand-tuned AVX2 and AVX-512 paths process 4 or 8 complex points per iteration.
- SSE2 tails handle 1–2 leftover samples; scalar tails cover 0–3 samples or non-SIMD builds.
- FMA instructions and configurable prefetching optimize performance across architectures.
- AoS (Array-of-Structures) for small radices; SoA (Structure-of-Arrays) for radix-11.
Memory Efficiency:
- No per-call allocations; recursion and twiddle factors use a preallocated scratch buffer.
- Precomputed twiddle tables for radices 2, 3, 4, 5, 7, 8, 11, 13 reduce runtime computation.
- Compact footprint (~25 KiB code) with minimal table sizes.
Real-to-Complex FFTs:
- Supports real-to-complex (R2C) and complex-to-real (C2R) transforms with half-complex storage (N/2 + 1 outputs).
Robustness:
- Comprehensive error checking for invalid inputs, insufficient scratch space, and alignment issues.
- Thread-safe Bluestein chirp handling and robust memory management.
- Unaligned load/store support ensures correctness on any input.
Testing:
- Extensive test suite with round-trip forward/inverse FFTs, radix-specific tests, and error condition checks.
- Mean Squared Error (MSE) validated against FFTW within 1e-10.
- Optional GoogleTest integration for detailed unit testing.

Comparison to FFTW

Criterion	HighSpeedFFT	FFTW
Algorithm Support	Mixed-radix (2, 3, 4, 5, 7, 8, 11, 13, 16, 32) + Bluestein	Cooley-Tukey, Rader, Bluestein, etc.
Vectorization	Hand-tuned SSE2, AVX2, AVX-512, FMA	Auto-SIMD (SSE, AVX, etc.)
Dependencies	`libm`, `pthread` (optional)	`pthread`, `libsimd`, others
Memory Layout	Contiguous real/imag buffers	Plan-specific allocators
Performance	Within 5–10% of FFTW for factorable N; ~30% slower for Bluestein	Highly tuned, wide radix support
Footprint	~25 KiB code	~300 KiB code + data

Benchmarks (Intel Xeon Gold 6226R, single-threaded):

Mixed-radix (N = 2¹² to 2²⁰): Within 5–10% of FFTW.
Bluestein (large primes): ~30% slower due to convolution overhead.
Real-to-complex: Comparable to FFTW for factorable lengths.

Installation

Prerequisites

C compiler with C99 support (e.g., GCC, Clang, MSVC).
Optional: pthread for thread-safe Bluestein initialization (POSIX systems).
Optional: GoogleTest for unit testing.
SIMD support: SSE2 (required), AVX2/FMA, AVX-512 (optional, auto-detected).

Build Instructions

git clone https://github.com/yourusername/HighSpeedFFT.git
cd HighSpeedFFT
mkdir build && cd build
cmake ..  # Optionally: -DENABLE_GTEST=ON for GoogleTest
make

To run tests:

# Standalone test harness
./test_mixedRadixFFT

# With GoogleTest
cmake -DENABLE_GTEST=ON ..
make test

CMake Options

-DENABLE_GTEST=ON: Enable GoogleTest-based test suite (default: OFF).
-DFFT_PREFETCH_DISTANCE=N: Set prefetch distance for cache optimization (default: 8; try 4, 8, 16).
SIMD flags are auto-detected:
- GCC/Clang: -O3 -mavx2 -mfma -msse2
- MSVC: /O2 /arch:AVX2 /fp:fast

Usage

Complex FFT

#include "highspeedFFT.h"

int main() {
    int N = 1024; // Signal length
    fft_data *in = malloc(N * sizeof(fft_data));
    fft_data *out = malloc(N * sizeof(fft_data));

    // Initialize input (example)
    for (int i = 0; i < N; i++) {
        in[i].re = /* real part */;
        in[i].im = /* imag part */;
    }

    // Forward FFT
    fft_object fwd = fft_init(N, +1);
    fft_exec(fwd, in, out);

    // Inverse FFT (normalize by 1/N)
    fft_object inv = fft_init(N, -1);
    fft_exec(inv, out, in);
    for (int i = 0; i < N; i++) {
        in[i].re /= N;
        in[i].im /= N;
    }

    free_fft(fwd);
    free_fft(inv);
    free(in);
    free(out);
    return 0;
}

Real-to-Complex FFT

#include "highspeedFFT.h"

int main() {
    int N = 1024; // Signal length
    double *real_in = malloc(N * sizeof(double));
    fft_data *hc = malloc((N/2 + 1) * sizeof(fft_data));

    // Initialize real input
    for (int i = 0; i < N; i++) {
        real_in[i] = /* real data */;
    }

    // Real-to-complex
    fft_real_object r2c = fft_real_init(N, +1);
    fft_r2c_exec(r2c, real_in, hc);

    // Complex-to-real (normalize by 1/N)
    fft_real_object c2r = fft_real_init(N, -1);
    double *real_out = malloc(N * sizeof(double));
    fft_c2r_exec(c2r, hc, real_out);
    for (int i = 0; i < N; i++) {
        real_out[i] /= N;
    }

    free_fft_real(r2c);
    free_fft_real(c2r);
    free(real_in);
    free(hc);
    free(real_out);
    return 0;
}

Note: The library produces unnormalized FFT outputs. For inverse transforms, divide by N to normalize.

Technical Details

Algorithm Overview

Mixed-Radix FFT:
- Uses Decimation-In-Time (DIT) algorithm with optimized butterflies for radices 2, 3, 4, 5, 7, 8, 11, 13, 16, 32.
- Precomputed twiddle tables reduce runtime computation for supported radices.
- General radix fallback for unsupported lengths.
- Scratch buffer handles recursion stages, avoiding dynamic allocations.
Bluestein’s Algorithm:
- Converts arbitrary-length DFTs into convolutions using chirp z-transform.
- Precomputed chirp sequences (N ≤ 64) stored in a contiguous all_chirps array.
- Thread-safe initialization with pthread_once and robust cleanup.
Memory Layout:
- Complex numbers stored as fft_data (struct with double re, im).
- Scratch buffer partitioned into sub-FFT outputs and optional twiddles.
- Aligned allocations (32-byte for AVX2, 64-byte for AVX-512) via _mm_malloc.

SIMD Optimizations

Hand-Tuned Vectorization:
- AVX2 and AVX-512 paths process 4 or 8 complex points per iteration, respectively, leveraging 256-bit and 512-bit wide registers for maximum throughput.
- SSE2 tails handle 1–2 leftover samples; scalar tails cover 0–3 samples or non-SIMD builds for full portability.
FMA Instructions:
- Uses fused multiply-add (FMA) instructions (FMADD, FMSUB) for complex arithmetic when available (e.g., on Intel Haswell or later, AMD Zen).
- Graceful fallbacks to separate multiply/add operations on older CPUs without FMA support.
AoS and SoA Layouts:
- Array-of-Structures (AoS) layout ([re0, im0, re1, im1, ...]) for small radices (2, 3, 4, 5, 7, 8, 16, 32) to simplify memory access and reduce overhead.
- Structure-of-Arrays (SoA) layout ([re0, re1, ...], [im0, im1, ...]) for radix-11 to optimize FMA-based computations in symmetric pair calculations.
- Efficient AoS-to-SoA conversions (e.g., deinterleave4_aos_to_soa) for radix-11 AVX2 path, minimizing shuffle overhead.
Loop Unrolling:
- Aggressive unrolling (16x for AVX-512, 8x/4x for AVX2, 2x for SSE2) exposes instruction-level parallelism (ILP), reducing loop overhead and maximizing CPU pipeline utilization.
- Tailored unrolling factors to balance register pressure and cache usage.
Prefetching:
- Configurable prefetch distance (FFT_PREFETCH_DISTANCE, default: 8 cache lines) to hide memory latency, tunable for different CPU architectures (e.g., Intel Skylake, AMD Zen).
- Uses _mm_prefetch with _MM_HINT_T0 to preload twiddles and outputs, improving cache efficiency in deeper radix stages.

Performance Optimizations

Twiddle Tables: Precomputed for radices 2, 3, 4, 5, 7, 8, 11, 13, minimizing trigonometric computations.
Cache Efficiency: Contiguous memory layouts and prefetching reduce cache misses.
No Dynamic Allocations: Scratch buffer handles all temporary storage, ensuring predictable performance.

Robustness

Error handling for invalid inputs, insufficient scratch space, and alignment issues.
Thread-safe Bluestein chirp initialization and cleanup using pthread_once.
Unaligned load/store support (_mm256_loadu_pd, _mm512_storeu_pd) ensures correctness on any input.

Acknowledgments

Inspired by FFTW’s performance and design principles.

Name		Name	Last commit message	Last commit date
Latest commit History 81 Commits
.vscode		.vscode
build		build
src		src
test		test
.gitattributes		.gitattributes
.gitignore		.gitignore
CMakeLists.txt		CMakeLists.txt
LICENSE		LICENSE
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

HighSpeedFFT

Features

Comparison to FFTW

Installation

Prerequisites

Build Instructions

CMake Options

Usage

Complex FFT

Real-to-Complex FFT

Technical Details

Algorithm Overview

SIMD Optimizations

Performance Optimizations

Robustness

Acknowledgments

About

Uh oh!

Releases

Packages

Languages

License

Tugbars/Mixed-Radix-Fast-Fourier-Transform

Folders and files

Latest commit

History

Repository files navigation

HighSpeedFFT

Features

Comparison to FFTW

Installation

Prerequisites

Build Instructions

CMake Options

Usage

Complex FFT

Real-to-Complex FFT

Technical Details

Algorithm Overview

SIMD Optimizations

Performance Optimizations

Robustness

Acknowledgments

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages