What's Changed
- fix: choose cuda architectures based on cuda version by @guocuimi in #463
- kernel: add grouped gemm support for moe by @guocuimi in #458
- kernel: added oob handling for grouped gemm kernel by @guocuimi in #465
- refactor: add _1 into stride for contiguous dim by @guocuimi in #466
- ci: set cuda arch to native for ci workflows by @guocuimi in #467
- refactor: move TileShape into launch_mha_kernel_sm80 by @guocuimi in #468
- refactor: split attention kernel into collective mainloop, collective epilogue and kernel by @guocuimi in #469
- fix: skip failing unittests for blackwell gpus by @guocuimi in #472
- feat: added single tile scheduler for attn kernel by @guocuimi in #473
- feat: add tile scheduler for grouped gemm and refactor gemm kernel by @guocuimi in #474
- refactor: split mla kernels into collective_mla and collective_epilogue by @guocuimi in #475
- feat: use global residue_mnk for oob handling by @guocuimi in #476
- feat: simplify mask logic to avoid manual index computation by @guocuimi in #477
- feat: added static persistent tile scheduler with swizzle and rasterize by @guocuimi in #478
- feat: added gtest_main with filters based on compute_capabilities by @guocuimi in #479
- ci: upgrade cutlass to v4.1 and switch to forked repo by @guocuimi in #481
- feat: add tma copy for paged kv by @guocuimi in #480
- feat: added gather tma copy to control smem box size by @guocuimi in #482
- feat: use aggressive compress-mode for fatbin by @guocuimi in #484
- feat: added fast StaticPersistentTileScheduler for 1d tma multicast by @guocuimi in #485
- feat: [1/n] added sm120 fmha using collective async copy by @guocuimi in #483
- feat: [2/n] added warp specialization kernel for sm120 fmha by @guocuimi in #486
- refactor: move kernel code into different folders by @guocuimi in #487
- feat: added KV multi-stages support for attn sm120 by @guocuimi in #489
- refactor: simplify mha block tiling logic by @guocuimi in #488
- feat: added smem and gmem layout selector for attn kernel by @guocuimi in #490
- feat: added args and params for attn kernels by @guocuimi in #491
- feat: added universal fmha runner by @guocuimi in #492
- feat: added kernel builder for attn by @guocuimi in #493
- refactor: change stride for Q/K/V to MNKL by @guocuimi in #494
- upgrade torch to 12.8 by @guocuimi in #496
- ci: fix nccl related build error by @guocuimi in #497
Full Changelog: v0.2.5...v0.2.6