I have a working context-parallel implementation, forked from this repo, that supports forward and backward passes. It required two modifications:
- padding each GPU's conv-layer input chunk with the last N_padding tokens of the previous GPU's chunk, then discarding the outputs at the padding indices
- transferring the final states in state passing point-to-point between GPUs, sequentially
The backward pass does the same in reverse. I believe I've also worked out a way to do this without the sequential point-to-point transfers.
Would this be useful to contribute? If so, I'd like to know the best way to go about it, since it requires modifying the core wrapper of the Mamba 2 Triton code.
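To make the first modification concrete, here is a toy, dependency-free sketch of the halo-padding idea (the real code operates on Triton/CUDA kernels and `torch.distributed` buffers; function names like `context_parallel_conv` and the choice N_padding = K-1 for a width-K causal conv are my assumptions, not the actual implementation):

```python
def valid_conv1d(x, w):
    # "Valid" 1D convolution: one output per fully-overlapping window.
    K = len(w)
    return [sum(w[k] * x[t + k] for k in range(K))
            for t in range(len(x) - K + 1)]

def context_parallel_conv(x, w, n_ranks):
    # Toy simulation of context parallelism for a causal conv layer.
    # Each "rank" pads its chunk with the last K-1 tokens of the
    # previous rank's chunk (zeros on rank 0, matching causal padding);
    # a valid conv on the padded chunk then yields exactly that rank's
    # slice of the full causal-conv output, so the padding-token
    # outputs are discarded implicitly.
    # Assumes len(x) is divisible by n_ranks, for simplicity.
    K = len(w)
    step = len(x) // n_ranks
    chunks = [x[r * step:(r + 1) * step] for r in range(n_ranks)]
    out = []
    for r, chunk in enumerate(chunks):
        halo = chunks[r - 1][-(K - 1):] if r > 0 else [0.0] * (K - 1)
        out.extend(valid_conv1d(halo + chunk, w))
    return out
```

Because the windows crossing a chunk boundary see the real tokens from the previous chunk (rather than zeros), the chunked result matches the single-device causal conv exactly.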
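And a similar toy sketch for the second modification, the sequential point-to-point state passing, using a scalar linear recurrence h[t] = a·h[t-1] + x[t] as a stand-in for the SSM state update (the loop over ranks stands in for the send/recv chain; names here are illustrative, not from the repo):

```python
def local_scan(x_chunk, a, h_in):
    # Run the recurrence h[t] = a * h[t-1] + x[t] over one chunk,
    # starting from the state handed over by the previous rank.
    # Returns the per-token outputs and the final state to pass on.
    h, ys = h_in, []
    for xt in x_chunk:
        h = a * h + xt
        ys.append(h)
    return ys, h

def sequential_state_passing(x, a, n_ranks):
    # Toy simulation of sequential point-to-point state passing:
    # rank r waits for rank r-1's final state, scans its own chunk,
    # then forwards its final state to rank r+1.
    # Assumes len(x) is divisible by n_ranks, for simplicity.
    step = len(x) // n_ranks
    h, out = 0.0, []
    for r in range(n_ranks):
        y, h = local_scan(x[r * step:(r + 1) * step], a, h)
        out.extend(y)
    return out
```

Since each rank performs exactly the operations the single-device scan would, the chunked output matches the full scan; the cost is that the ranks' scans are serialized along the chain, which is what the non-sequential variant mentioned above would avoid.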