Closed
Description
First, congratulations on your outstanding work. While reading the paper and the code, I noticed a possible minor improvement (unless I have made a mistake somewhere). From the code, I believe the scores are always static, and hence the KV cache can be reduced to linear, because an evicted KV pair will never be used again.
Below is a simple proof.
Denote Score (before topK) as
Denote Selected Index (Mask) as
Obviously,
This means that for each new input, at most one index can be evicted, and no evicted token will ever be selected again.
Therefore, during inference, you only need to maintain a KV cache of size window_size.
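To make the argument concrete, here is a minimal Python sketch, not the repository's actual code. It assumes that each token's score is fixed once computed (it does not depend on the current query position) and that selection is a plain top-k with k = window_size over all tokens seen so far. The helper names `full_topk`, `streaming_topk`, and `check` are hypothetical. The sketch compares recomputing top-k over the full history against a bounded cache that evicts at most one entry per step.

```python
import heapq
import random


def full_topk(scores, k):
    """Indices of the k largest scores among all tokens seen so far."""
    ranked = sorted(range(len(scores)), key=lambda i: (scores[i], i), reverse=True)
    return set(ranked[:k])


def streaming_topk(scores, k):
    """Yield the kept index set after each step, using a cache of at most k entries."""
    heap = []  # min-heap of (score, index) pairs
    for i, s in enumerate(scores):
        if len(heap) < k:
            heapq.heappush(heap, (s, i))
        else:
            # Push the new token, then drop the lowest-scoring entry:
            # at most one eviction per step, and evicted entries never return.
            heapq.heappushpop(heap, (s, i))
        yield {idx for _, idx in heap}


def check(num_tokens=200, window_size=16, seed=0):
    """Return True if the bounded cache matches full top-k at every step."""
    rng = random.Random(seed)
    # Assumption: static scores, one fixed value per token.
    scores = [rng.random() for _ in range(num_tokens)]
    for t, kept in enumerate(streaming_topk(scores, window_size)):
        if kept != full_topk(scores[: t + 1], window_size):
            return False
    return True


if __name__ == "__main__":
    print(check())
```

If the scores did depend on the query position, this equivalence would not hold in general, so the sketch only supports the claim under the static-score assumption.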
Copilot and LoserCheems