This is an implementation of threadgroup-wide bitonic sort in HLSL.
Sometimes it is desirable to sort elements within a thread group on the GPU. The threadgroup_bitonic_sort.hlsli header file provides multiple variants of the bitonic sort that support any power-of-2 threadgroup size and up to 4096 sortable elements.
- It is agnostic of wave/warp sizes
- It automatically switches to sorting and shuffling within waves/warps by utilising wave intrinsics when the sizes of sorted/shuffled blocks become smaller than the size of waves/warps in a threadgroup (check out AMD RGA codegen on godbolt.org)
- It supports GPUs without wave intrinsic support
- It supports sorting of up to 4096 elements within a thread group (sorting 4096 elements requires the size of a thread group to be 1024 threads)
- For a thread group with `N` threads, it supports sorting of `N`, `N * 2` or `N * 4` elements
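
For readers unfamiliar with the algorithm, the compare-exchange network behind a bitonic sort can be sketched on the CPU as below. This is an illustrative C++ reference only, not the API of threadgroup_bitonic_sort.hlsli: the function name `bitonic_sort_reference` and the `waveSize` parameter are hypothetical, and the sketch assumes one element per thread. Under that assumption, an element at index `i` is compared with the element at index `i ^ j`, so whenever the stride `j` is smaller than the wave size both elements belong to the same wave. Those are the passes in which a GPU implementation could exchange values with wave intrinsics instead of going through groupshared memory.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical CPU reference of the bitonic sorting network (ascending order).
// data.size() must be a power of two, mirroring the power-of-2 restriction above.
// Returns the number of passes whose stride is smaller than waveSize, i.e. the
// passes in which both compared elements would sit in the same wave when each
// thread owns exactly one element (an assumption of this sketch).
inline std::size_t bitonic_sort_reference(std::vector<unsigned>& data,
                                          std::size_t waveSize = 32)
{
    const std::size_t n = data.size();
    assert(n != 0 && (n & (n - 1)) == 0);

    std::size_t intraWavePasses = 0;

    // k is the size of the bitonic blocks being merged in this stage.
    for (std::size_t k = 2; k <= n; k *= 2) {
        // j is the distance between the two elements of each compared pair.
        for (std::size_t j = k / 2; j > 0; j /= 2) {
            if (j < waveSize)
                ++intraWavePasses;

            // On the GPU every iteration of this loop would be one thread.
            for (std::size_t i = 0; i < n; ++i) {
                const std::size_t partner = i ^ j;
                if (partner > i) {
                    // Blocks alternate direction until the final merge (k == n).
                    const bool ascending = (i & k) == 0;
                    if ((data[i] > data[partner]) == ascending)
                        std::swap(data[i], data[partner]);
                }
            }
        }
    }
    return intraWavePasses;
}
```

Calling the function on a vector of 8 elements with `waveSize` set to 4 sorts the vector in 6 passes, 5 of which have a stride below 4. The `N * 2` and `N * 4` variants keep 2 or 4 elements per thread, which is how 1024 threads reach the 4096-element limit; how those elements map to threads is specific to the header and is not modelled here.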
To build demo.cpp, run build.bat from a Visual Studio Command Prompt. The batch file should automatically download the required packages (D3D12, DXC), then build and run all shader variants as benchmarks.
The header file can be compiled with the DX Compiler (DXC) release for February 2025 or earlier.
This header file is available to anybody free of charge, under the terms of the MIT License (see LICENSE.md).