v0.2.6

@github-actions released this 13 Sep 00:06
· 6 commits to main since this release

What's Changed

  • fix: choose cuda architectures based on cuda version by @guocuimi in #463
  • kernel: add grouped gemm support for moe by @guocuimi in #458
  • kernel: added oob handling for grouped gemm kernel by @guocuimi in #465
  • refactor: add _1 into stride for contiguous dim by @guocuimi in #466
  • ci: set cuda arch to native for ci workflows by @guocuimi in #467
  • refactor: move TileShape into launch_mha_kernel_sm80 by @guocuimi in #468
  • refactor: split attention kernel into collective mainloop, collective epilogue and kernel by @guocuimi in #469
  • fix: skip failed unittests for blackwell gpus by @guocuimi in #472
  • feat: added single tile scheduler for attn kernel by @guocuimi in #473
  • feat: add tile scheduler for grouped gemm and refactor gemm kernel by @guocuimi in #474
  • refactor: split mla kernels into collective_mla collective_epilogue by @guocuimi in #475
  • feat: use global residue_mnk for oob handling by @guocuimi in #476
  • feat: simplify mask logic to avoid manual index computation by @guocuimi in #477
  • feat: added static persistent tile scheduler with swizzle and rasterize by @guocuimi in #478
  • feat: added gtest_main with filters based on compute_capabilities by @guocuimi in #479
  • ci: upgrade cutlass to v4.1 and switch to forked repo by @guocuimi in #481
  • feat: add tma copy for paged kv by @guocuimi in #480
  • feat: added gather tma copy to control smem box size by @guocuimi in #482
  • feat: use aggressive compress-mode for fatbin by @guocuimi in #484
  • feat: added fast StaticPersistentTileScheduler for 1d tma multicast by @guocuimi in #485
  • feat: [1/n] added sm120 fmha using collective async copy by @guocuimi in #483
  • feat: [2/n] added warp specialization kernel for sm120 fmha by @guocuimi in #486
  • refactor: move kernel code into different folders by @guocuimi in #487
  • feat: added KV multi-stages support for attn sm120 by @guocuimi in #489
  • refactor: simplify mha block tiling logic by @guocuimi in #488
  • feat: added smem and gmem layout selector for attn kernel by @guocuimi in #490
  • feat: added args and params for attn kernels by @guocuimi in #491
  • feat: added universal fmha runner by @guocuimi in #492
  • feat: added kernel builder for attn by @guocuimi in #493
  • refactor: change stride for Q/K/V to MNKL by @guocuimi in #494
  • upgrade torch to 12.8 by @guocuimi in #496
  • ci: fix nccl related build error by @guocuimi in #497

Full Changelog: v0.2.5...v0.2.6