Description
🚀 The feature, motivation and pitch
This paper might be of interest: https://arxiv.org/pdf/2306.14048.pdf
The paper reports that evicting part of the KV cache has little effect on output quality while improving memory efficiency. With a 20% Heavy Hitter (H2) budget, the authors report up to 29× higher throughput than some existing inference systems and up to 1.9× lower latency at the same batch size.
When attention scores are computed, a small fraction of tokens accounts for most of the attention mass. The paper calls these tokens Heavy Hitters (H2) and proposes the Heavy Hitter Oracle (H2O), a KV cache eviction policy that dynamically balances retention between recent tokens and H2 tokens. The authors formulate KV cache eviction as a dynamic submodular problem.
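To make the idea concrete, here is a minimal sketch of the retention rule as I understand it from the paper. It is not the authors' code and not an existing vLLM API: the function name `h2o_keep_indices` and its parameters are hypothetical, and it assumes an accumulated attention score is already tracked for each cached token.

```python
from typing import List, Sequence


def h2o_keep_indices(acc_scores: Sequence[float],
                     num_heavy: int,
                     num_recent: int) -> List[int]:
    """Pick which cached token positions to keep (hypothetical helper).

    acc_scores[i] is the attention mass token i has accumulated so far,
    summed over the decoding steps in which it was attended to.
    """
    n = len(acc_scores)
    if n <= num_heavy + num_recent:
        return list(range(n))  # still within budget, nothing to evict

    # Always keep the most recent window of tokens.
    recent_start = n - num_recent
    recent = list(range(recent_start, n))

    # Among older tokens, keep the ones with the largest accumulated score.
    older = range(recent_start)
    heavy = sorted(older, key=lambda i: acc_scores[i], reverse=True)[:num_heavy]

    return sorted(heavy) + recent


# Six cached tokens, budget of 2 heavy hitters + 2 recent tokens.
scores = [0.9, 0.1, 0.05, 0.7, 0.2, 0.3]
kept = h2o_keep_indices(scores, num_heavy=2, num_recent=2)
```

In a real integration the scores would have to be updated at every decoding step, per layer and per head, and the eviction would have to interact with the paged KV cache blocks. Both are left out of this sketch.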
Trade-off:
Evicting KV cache entries that are deemed unimportant may raise accuracy concerns, but it reduces memory usage, which allows larger batches and therefore higher throughput.
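A rough back-of-the-envelope estimate of the memory side of this trade-off, as a sketch. The model dimensions below are illustrative assumptions, not measurements, and `kv_cache_bytes` is a hypothetical helper.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, batch_size: int,
                   bytes_per_elem: int = 2) -> int:
    """Approximate KV cache size: one K and one V tensor per layer."""
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)


# Illustrative values (assumed): 32 layers, 32 heads, head_dim 128, fp16.
full = kv_cache_bytes(32, 32, 128, seq_len=2048, batch_size=1)

# Same model, but only 20% of the token positions are retained.
budget = int(2048 * 0.2)
reduced = kv_cache_bytes(32, 32, 128, seq_len=budget, batch_size=1)

print(f"full cache:    {full / 2**20:.0f} MiB")
print(f"reduced cache: {reduced / 2**20:.0f} MiB")
```

The cache size scales linearly with the number of retained tokens, so the freed memory can in principle be spent on a larger batch. How much of that turns into actual throughput depends on the serving system.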
@simon-mo Is this a feature you'd like to see implemented?
Alternatives
No response
Additional context
No response