[PERFORMANCE] Accelerating KV-Cache during Inference #141

@yiyousong

Description

First, congratulations on your outstanding work. However, I checked the paper and code and noticed a possible improvement (or perhaps I am missing something). From the code, I believe the scores are static, and hence the KV cache can be reduced to a fixed size, because an evicted KV entry will never be used again.

Below is a simple proof.

Denote the score (before TopK) as $S=f(V)$.
Denote the selected index set (mask) as $M_N=\mathrm{TopK}(S_{1:N})$.
Since any element outside $\mathrm{TopK}(S_{1:N-1})$ is dominated by all $k$ retained elements, it cannot appear in $\mathrm{TopK}(S_{1:N})$ either, so $M_N=\mathrm{TopK}(S_{1:N})=\mathrm{TopK}([\mathrm{TopK}(S_{1:N-1}),S_N])=\mathrm{TopK}([M_{N-1},S_N])$.
This means that at each step, at most one index is evicted, and an evicted token can never be selected again.

Therefore, during inference, you only need to maintain a KV cache of size window_size.
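To illustrate, here is a minimal sketch (plain Python, not the repository's code; `incremental_topk`, `scores`, and `k` are illustrative names) showing that maintaining only the current top-k entries while scores arrive one at a time reproduces a full top-k over the entire score sequence:

```python
# Hypothetical sketch of the argument above; not the paper's implementation.
import heapq

def incremental_topk(scores, k):
    """Maintain the top-k indices while consuming scores one at a time.

    Because scores are static, TopK over the first N scores equals TopK over
    (the previous top-k entries + the new score), so the cache never needs to
    hold more than k entries, and an evicted entry can never re-enter.
    """
    cache = []  # min-heap of (score, index); size never exceeds k
    for idx, s in enumerate(scores):
        if len(cache) < k:
            heapq.heappush(cache, (s, idx))
        elif s > cache[0][0]:
            heapq.heapreplace(cache, (s, idx))  # evict the lowest-scored entry
    return sorted(idx for _, idx in cache)

# The incremental result matches a full top-k computed over all scores at once:
scores = [0.9, 0.1, 0.5, 0.7, 0.3, 0.8]
full_topk = sorted(sorted(range(len(scores)), key=lambda i: -scores[i])[:3])
assert incremental_topk(scores, 3) == full_topk == [0, 3, 5]
```

The same invariant is what allows the KV cache itself (not just the index set) to stay at `window_size` entries: once a key/value pair is evicted it can be freed, since its score can never re-enter the top-k.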
