Hstu attention n0loop fused unroll pr #2896
base: develop
Conversation
…uilding succeeded)
…oLocal to save vgprs for non-local situations
…st-K/last-V or last-K/first-V
…n when both causal=true and local=true
Readme needs work for clarity.
@@ -0,0 +1,64 @@
# HSTU attention operator

Suggested change:
HSTU-attention operator is an operator which takes tensor `q: [batches, seqlen, nhead, hdim_qk]`, `k: [batches, seqlen, nhead, hdim_qk`,
The HSTU-attention operator is an operator which takes as input three tensors `q: [batches, seqlen, nhead, hdim_qk]`, `k: [batches, seqlen, nhead, hdim_qk`,

Are the parameters for defining functional masking, or is the operator for functional masking? It's not clear from the sentence.
But if that's what it is, then this can be changed to:
Suggested change:
`v: [batches, seqlen, nhead, hdim_v]` and some parameters for defining the functional masking as inputs, and do the following:
`v: [batches, seqlen, nhead, hdim_v]`, as well as parameters that define functional masking to do the following:

Suggested change:
* Multiply `q: [batches, seqlen, nhead, hdim_qk]` with `k: [batches, seqlen, nhead, hdim_k]` to get temporary tensor `s: [batches, nhead, seqlen, seqlen]`
* Multiply `q: [batches, seqlen, nhead, hdim_qk]` with `k: [batches, seqlen, nhead, hdim_k]` to get the intermediate tensor `s: [batches, nhead, seqlen, seqlen]`

Suggested change:
* Update `s` by filtering its values according to a special functional mask, which includes the logics of lower-triangular and diagonal window causal mask
* Update `s` by filtering it with a functional mask that includes a lower-triangular mask, a diagonal window causal mask, and

Suggested change:
as well assequence mask
a sequence mask.

* Do element-wise SiLu on the `lower seqlen` dimension of `s` to get temporary tensor `p: [batches, nhead, seqlen, seqlen]`
Suggested change:
* Multiply `p : [batches, nhead, seqlen, seqlen]` with `v: [batches, seqlen, nhead, hdim_v]` to get final output `o: [batches, seqlen_q, nhead, headsz_v]`
* Multiply `p : [batches, nhead, seqlen, seqlen]` with `v: [batches, seqlen, nhead, hdim_v]` to get the final tensor `o: [batches, seqlen_q, nhead, headsz_v]`

Suggested change:
* Jagged inputs are also supported, where each batch has separate seqlen defined by the `sequence_offsets[]`
Jagged inputs are also supported, where each batch has separate seqlen defined by the `sequence_offsets[]`
This isn't a thing that the operator does, so it shouldn't be in the same bullet list
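
To make the quoted steps concrete, below is a minimal, single-threaded reference sketch of the computation (not the fused kernel). It assumes contiguous row-major layouts and shows only the plain lower-triangular causal mask; the diagonal-window mask, the sequence mask, jagged batches, and any scaling the real operator applies are left out, and the function name `hstu_attention_ref` is made up for illustration.

```cpp
// Illustrative reference only: a naive sketch of the steps listed in the
// README excerpt, assuming contiguous row-major layouts. Only the plain
// lower-triangular causal mask is shown; the diagonal-window mask, sequence
// mask, jagged batches, and any scaling used by the real kernel are omitted.
#include <cmath>
#include <cstddef>
#include <vector>

// SiLU(x) = x * sigmoid(x)
static float silu(float x) { return x / (1.0f + std::exp(-x)); }

// q, k: [batches, seqlen, nhead, hdim_qk]; v, o: [batches, seqlen, nhead, hdim_v]
void hstu_attention_ref(const std::vector<float>& q, const std::vector<float>& k,
                        const std::vector<float>& v, std::vector<float>& o,
                        int batches, int seqlen, int nhead, int hdim_qk, int hdim_v)
{
    auto idx = [&](int b, int s, int h, int d, int hdim) {
        return ((static_cast<std::size_t>(b) * seqlen + s) * nhead + h) * hdim + d;
    };
    o.assign(static_cast<std::size_t>(batches) * seqlen * nhead * hdim_v, 0.0f);

    for (int b = 0; b < batches; ++b)
        for (int h = 0; h < nhead; ++h)
            for (int i = 0; i < seqlen; ++i)   // row of s (query position)
                for (int j = 0; j <= i; ++j)   // lower-triangular causal mask
                {
                    // s[b, h, i, j] = sum_d q[b, i, h, d] * k[b, j, h, d]
                    float s = 0.0f;
                    for (int d = 0; d < hdim_qk; ++d)
                        s += q[idx(b, i, h, d, hdim_qk)] * k[idx(b, j, h, d, hdim_qk)];
                    // p[b, h, i, j] = SiLU(s); fold the p * v product directly
                    // into o instead of materializing p.
                    const float p = silu(s);
                    for (int d = 0; d < hdim_v; ++d)
                        o[idx(b, i, h, d, hdim_v)] += p * v[idx(b, j, h, d, hdim_v)];
                }
}
```

For jagged inputs, the same loops would run per batch over the range `sequence_offsets[b]` to `sequence_offsets[b+1]` instead of a fixed `seqlen`, so for example offsets `{0, 5, 12}` would give batch 0 a seqlen of 5 and batch 1 a seqlen of 7.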

## implementation

Suggested change:
The operator is implemented using a fused kernel in the example:
The operator is implemented using a fused kernel:

* Tensor S and Tensor P only exist in VGPRs as per-workgroup tiles, no global memory access is needed
This doesn't need to be a bullet point. You can combine it with the sentence above.

#> . example/ck_tile/07_hstu_attention/test_hstu_attention.sh

Suggested change:
Check the example file `example_hstu_attention.cpp` for an understanding of the command-line arguments. Which is like the following:
Check the example file `example_hstu_attention.cpp` for more information about the command-line arguments.
To be honest, I'd rather see the explanations here, or at least have the code snippet commented. It doesn't need to be everything but some of it.
This PR brings an implementation of HSTU attention on ck_tile. HSTU attention is very different from the fmha already implemented in ck_tile; for details, please refer to the HSTU paper. The implementation is well verified on MI300 for both functionality and targeted performance, but it does not include any optimization for MI350.
To build:
#> cd build; ../scripts/cmake-ck-dev.sh .. gfx942; make -j 128 tile_example_hstu_attention
To verify:
#> . examples/ck_tile/23_hstu_attention/scripts/test_hstu_attention.sh
The HSTU code is all located under the folder examples/ck_tile/23_hstu_attention, but this PR also makes some tiny changes to the core ck_tile code under include/ck_tile/core/tensor.