HighSpeedFFT is a high-performance, lightweight C library for computing Fast Fourier Transforms (FFTs) on complex and real-valued signals. Designed as a drop-in alternative to FFTW, it supports mixed-radix FFTs for lengths factorable into small primes (2, 3, 4, 5, 7, 8, 11, 13, 16, 32) and Bluestein’s algorithm for arbitrary lengths. The library leverages hand-tuned SIMD optimizations (SSE2, AVX2, AVX-512) with FMA support, minimal dependencies, and a compact footprint (~25 KiB code).
- Mixed-Radix FFT:
- Decimation-In-Time (DIT) algorithm optimized for signal lengths factorable into primes 2, 3, 4, 5, 7, 8, 11, 13, 16, 32.
- Dedicated butterfly implementations with AVX2, AVX-512, SSE2, and scalar codepaths for maximum performance and portability.
- Radix-3, 5, 11, and 13 use Rader/Winograd-style optimizations; radix-16 and 32 use multi-stage decompositions (4×4 and 4×4×2).
- Bluestein’s Algorithm:
- Handles arbitrary signal lengths via chirp z-transform, ideal for prime or non-factorable sizes.
- Precomputed chirp sequences for small lengths (≤64) stored contiguously for cache efficiency.
- Thread-safe initialization and cleanup using
pthread_once
.
- SIMD Acceleration:
- Hand-tuned AVX2 and AVX-512 paths process 4 or 8 complex points per iteration.
- SSE2 tails handle 1–2 leftover samples; scalar tails cover 0–3 samples or non-SIMD builds.
- FMA instructions and configurable prefetching optimize performance across architectures.
- AoS (Array-of-Structures) for small radices; SoA (Structure-of-Arrays) for radix-11.
- Memory Efficiency:
- No per-call allocations; recursion and twiddle factors use a preallocated scratch buffer.
- Precomputed twiddle tables for radices 2, 3, 4, 5, 7, 8, 11, 13 reduce runtime computation.
- Compact footprint (~25 KiB code) with minimal table sizes.
- Real-to-Complex FFTs:
- Supports real-to-complex (R2C) and complex-to-real (C2R) transforms with half-complex storage (N/2 + 1 outputs).
- Robustness:
- Comprehensive error checking for invalid inputs, insufficient scratch space, and alignment issues.
- Thread-safe Bluestein chirp handling and robust memory management.
- Unaligned load/store support ensures correctness on any input.
- Testing:
- Extensive test suite with round-trip forward/inverse FFTs, radix-specific tests, and error condition checks.
- Mean Squared Error (MSE) validated against FFTW within 1e-10.
- Optional GoogleTest integration for detailed unit testing.
Criterion | HighSpeedFFT | FFTW |
---|---|---|
Algorithm Support | Mixed-radix (2, 3, 4, 5, 7, 8, 11, 13, 16, 32) + Bluestein | Cooley-Tukey, Rader, Bluestein, etc. |
Vectorization | Hand-tuned SSE2, AVX2, AVX-512, FMA | Auto-SIMD (SSE, AVX, etc.) |
Dependencies | libm , pthread (optional) |
pthread , libsimd , others |
Memory Layout | Contiguous real/imag buffers | Plan-specific allocators |
Performance | Within 5–10% of FFTW for factorable N; ~30% slower for Bluestein | Highly tuned, wide radix support |
Footprint | ~25 KiB code | ~300 KiB code + data |
Benchmarks (Intel Xeon Gold 6226R, single-threaded):
- Mixed-radix (N = 2¹² to 2²⁰): Within 5–10% of FFTW.
- Bluestein (large primes): ~30% slower due to convolution overhead.
- Real-to-complex: Comparable to FFTW for factorable lengths.
- C compiler with C99 support (e.g., GCC, Clang, MSVC).
- Optional:
pthread
for thread-safe Bluestein initialization (POSIX systems). - Optional: GoogleTest for unit testing.
- SIMD support: SSE2 (required), AVX2/FMA, AVX-512 (optional, auto-detected).
git clone https://github.com/yourusername/HighSpeedFFT.git
cd HighSpeedFFT
mkdir build && cd build
cmake .. # Optionally: -DENABLE_GTEST=ON for GoogleTest
make
To run tests:
# Standalone test harness
./test_mixedRadixFFT
# With GoogleTest
cmake -DENABLE_GTEST=ON ..
make test
-DENABLE_GTEST=ON
: Enable GoogleTest-based test suite (default: OFF).-DFFT_PREFETCH_DISTANCE=N
: Set prefetch distance for cache optimization (default: 8; try 4, 8, 16).- SIMD flags are auto-detected:
- GCC/Clang:
-O3 -mavx2 -mfma -msse2
- MSVC:
/O2 /arch:AVX2 /fp:fast
- GCC/Clang:
#include "highspeedFFT.h"
int main() {
int N = 1024; // Signal length
fft_data *in = malloc(N * sizeof(fft_data));
fft_data *out = malloc(N * sizeof(fft_data));
// Initialize input (example)
for (int i = 0; i < N; i++) {
in[i].re = /* real part */;
in[i].im = /* imag part */;
}
// Forward FFT
fft_object fwd = fft_init(N, +1);
fft_exec(fwd, in, out);
// Inverse FFT (normalize by 1/N)
fft_object inv = fft_init(N, -1);
fft_exec(inv, out, in);
for (int i = 0; i < N; i++) {
in[i].re /= N;
in[i].im /= N;
}
free_fft(fwd);
free_fft(inv);
free(in);
free(out);
return 0;
}
#include "highspeedFFT.h"
int main() {
int N = 1024; // Signal length
double *real_in = malloc(N * sizeof(double));
fft_data *hc = malloc((N/2 + 1) * sizeof(fft_data));
// Initialize real input
for (int i = 0; i < N; i++) {
real_in[i] = /* real data */;
}
// Real-to-complex
fft_real_object r2c = fft_real_init(N, +1);
fft_r2c_exec(r2c, real_in, hc);
// Complex-to-real (normalize by 1/N)
fft_real_object c2r = fft_real_init(N, -1);
double *real_out = malloc(N * sizeof(double));
fft_c2r_exec(c2r, hc, real_out);
for (int i = 0; i < N; i++) {
real_out[i] /= N;
}
free_fft_real(r2c);
free_fft_real(c2r);
free(real_in);
free(hc);
free(real_out);
return 0;
}
Note: The library produces unnormalized FFT outputs. For inverse transforms, divide by N
to normalize.
- Mixed-Radix FFT:
- Uses Decimation-In-Time (DIT) algorithm with optimized butterflies for radices 2, 3, 4, 5, 7, 8, 11, 13, 16, 32.
- Precomputed twiddle tables reduce runtime computation for supported radices.
- General radix fallback for unsupported lengths.
- Scratch buffer handles recursion stages, avoiding dynamic allocations.
- Bluestein’s Algorithm:
- Converts arbitrary-length DFTs into convolutions using chirp z-transform.
- Precomputed chirp sequences (N ≤ 64) stored in a contiguous
all_chirps
array. - Thread-safe initialization with
pthread_once
and robust cleanup.
- Memory Layout:
- Complex numbers stored as
fft_data
(struct withdouble re, im
). - Scratch buffer partitioned into sub-FFT outputs and optional twiddles.
- Aligned allocations (32-byte for AVX2, 64-byte for AVX-512) via
_mm_malloc
.
- Complex numbers stored as
- Hand-Tuned Vectorization:
- AVX2 and AVX-512 paths process 4 or 8 complex points per iteration, respectively, leveraging 256-bit and 512-bit wide registers for maximum throughput.
- SSE2 tails handle 1–2 leftover samples; scalar tails cover 0–3 samples or non-SIMD builds for full portability.
- FMA Instructions:
- Uses fused multiply-add (FMA) instructions (
FMADD
,FMSUB
) for complex arithmetic when available (e.g., on Intel Haswell or later, AMD Zen). - Graceful fallbacks to separate multiply/add operations on older CPUs without FMA support.
- Uses fused multiply-add (FMA) instructions (
- AoS and SoA Layouts:
- Array-of-Structures (AoS) layout (
[re0, im0, re1, im1, ...]
) for small radices (2, 3, 4, 5, 7, 8, 16, 32) to simplify memory access and reduce overhead. - Structure-of-Arrays (SoA) layout (
[re0, re1, ...]
,[im0, im1, ...]
) for radix-11 to optimize FMA-based computations in symmetric pair calculations. - Efficient AoS-to-SoA conversions (e.g.,
deinterleave4_aos_to_soa
) for radix-11 AVX2 path, minimizing shuffle overhead.
- Array-of-Structures (AoS) layout (
- Loop Unrolling:
- Aggressive unrolling (16x for AVX-512, 8x/4x for AVX2, 2x for SSE2) exposes instruction-level parallelism (ILP), reducing loop overhead and maximizing CPU pipeline utilization.
- Tailored unrolling factors to balance register pressure and cache usage.
- Prefetching:
- Configurable prefetch distance (
FFT_PREFETCH_DISTANCE
, default: 8 cache lines) to hide memory latency, tunable for different CPU architectures (e.g., Intel Skylake, AMD Zen). - Uses
_mm_prefetch
with_MM_HINT_T0
to preload twiddles and outputs, improving cache efficiency in deeper radix stages.
- Configurable prefetch distance (
- Twiddle Tables: Precomputed for radices 2, 3, 4, 5, 7, 8, 11, 13, minimizing trigonometric computations.
- Cache Efficiency: Contiguous memory layouts and prefetching reduce cache misses.
- No Dynamic Allocations: Scratch buffer handles all temporary storage, ensuring predictable performance.
- Error handling for invalid inputs, insufficient scratch space, and alignment issues.
- Thread-safe Bluestein chirp initialization and cleanup using
pthread_once
. - Unaligned load/store support (
_mm256_loadu_pd
,_mm512_storeu_pd
) ensures correctness on any input.
- Inspired by FFTW’s performance and design principles.