What's Changed
- fix: choose cuda architectures based on cuda version by @guocuimi in #463
- kernel: add grouped gemm support for moe by @guocuimi in #458
- kernel: added oob handling for grouped gemm kernel by @guocuimi in #465
- refactor: add _1 into stride for contiguous dim by @guocuimi in #466
- ci: set cuda arch to native for ci workflows by @guocuimi in #467
- refactor: move TileShape into launch_mha_kernel_sm80 by @guocuimi in #468
- refactor: split attention kernel into collective mainloop, collective epilogue and kernel by @guocuimi in #469
- fix: skip failing unittests for blackwell gpus by @guocuimi in #472
- feat: added single tile scheduler for attn kernel by @guocuimi in #473
- feat: add tile scheduler for grouped gemm and refactor gemm kernel by @guocuimi in #474
- refactor: split mla kernels into collective_mla and collective_epilogue by @guocuimi in #475
- feat: use global residue_mnk for oob handling by @guocuimi in #476
- feat: simplify mask logic to avoid manual index computation by @guocuimi in #477
- feat: added static persistent tile scheduler with swizzle and rasterize by @guocuimi in #478
- feat: added gtest_main with filters based on compute_capabilities by @guocuimi in #479
- ci: upgrade cutlass to v4.1 and switch to forked repo by @guocuimi in #481
- feat: add tma copy for paged kv by @guocuimi in #480
- feat: added gather tma copy to control smem box size by @guocuimi in #482
- feat: use aggressive compress-mode for fatbin by @guocuimi in #484
- feat: added fast StaticPersistentTileScheduler for 1d tma multicast by @guocuimi in #485
- feat: [1/n] added sm120 fmha using collective async copy by @guocuimi in #483
- feat: [2/n] added warp specialization kernel for sm120 fmha by @guocuimi in #486
- refactor: move kernel code into different folders by @guocuimi in #487
- feat: added KV multi-stages support for attn sm120 by @guocuimi in #489
- refactor: simplify mha block tiling logic by @guocuimi in #488
- feat: added smem and gmem layout selector for attn kernel by @guocuimi in #490
- feat: added args and params for attn kernels by @guocuimi in #491
- feat: added universal fmha runner by @guocuimi in #492
- feat: added kernel builder for attn by @guocuimi in #493
- refactor: change stride for Q/K/V to MNKL by @guocuimi in #494
- upgrade torch to 12.8 by @guocuimi in #496
- ci: fix nccl related build error by @guocuimi in #497
Full Changelog: v0.2.5...v0.2.6