[PERFORMANCE] Accelerating KV-Cache during Inference #141

@yiyousong

Description

First, congratulations on your outstanding work. However, I checked the paper and code and noticed a possible improvement (or perhaps I am missing something). From the code, I believe the scores are static, and hence the KV cache can be reduced to a fixed size, because an evicted KV entry will never be used again.

Below is a simple proof.

Denote the score (before TopK) as $S=f(V)$.
Denote the selected index set (mask) as $M_N=\mathrm{TopK}(S_{1:N})$.
Since any element outside $\mathrm{TopK}(S_{1:N-1})$ is dominated by all $k$ retained elements, it cannot appear in $\mathrm{TopK}(S_{1:N})$ either, so $M_N=\mathrm{TopK}(S_{1:N})=\mathrm{TopK}([\mathrm{TopK}(S_{1:N-1}),S_N])=\mathrm{TopK}([M_{N-1},S_N])$.
This means that at each step, at most one index is evicted, and an evicted token can never be selected again.

Therefore, during inference, you only need to maintain a KV cache of size window_size.
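To illustrate, here is a minimal sketch (plain Python, not the repository's code; `incremental_topk`, `scores`, and `k` are illustrative names) showing that maintaining only the current top-k entries while scores arrive one at a time reproduces a full top-k over the entire score sequence:

```python
# Hypothetical sketch of the argument above; not the paper's implementation.
import heapq

def incremental_topk(scores, k):
    """Maintain the top-k indices while consuming scores one at a time.

    Because scores are static, TopK over the first N scores equals TopK over
    (the previous top-k entries + the new score), so the cache never needs to
    hold more than k entries, and an evicted entry can never re-enter.
    """
    cache = []  # min-heap of (score, index); size never exceeds k
    for idx, s in enumerate(scores):
        if len(cache) < k:
            heapq.heappush(cache, (s, idx))
        elif s > cache[0][0]:
            heapq.heapreplace(cache, (s, idx))  # evict the lowest-scored entry
    return sorted(idx for _, idx in cache)

# The incremental result matches a full top-k computed over all scores at once:
scores = [0.9, 0.1, 0.5, 0.7, 0.3, 0.8]
full_topk = sorted(sorted(range(len(scores)), key=lambda i: -scores[i])[:3])
assert incremental_topk(scores, 3) == full_topk == [0, 3, 5]
```

The same invariant is what allows the KV cache itself (not just the index set) to stay at `window_size` entries: once a key/value pair is evicted it can be freed, since its score can never re-enter the top-k.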
