Reading https://arxiv.org/abs/2311.04934 and wondering if I would gain anything from a prompt cache.
My use case is prompts with overlapping prefixes (mostly a few big ones), and I already use vLLM's paged attention.
Assume I would only want to cache KV states for prefixes (not for segments positioned anywhere in the prompt, like in the paper).
Would there be any gain in caching prefix attention states, or is paged attention in vLLM indeed already doing this?
From the paper:

> Paged attention also demonstrates simple prefix sharing, where different prompts with an identical prefix share KV cache.
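For context on what that sharing looks like mechanically, here is a minimal toy sketch of refcounted block tables in the spirit of paged attention (the names `ref_count`, `block_tables`, `map_prompt` are made up for illustration, not vLLM internals):

```python
from typing import Dict, List

# Physical KV-cache blocks can be referenced by several requests; track refcounts.
ref_count: Dict[int, int] = {}
# Per-request logical -> physical block table.
block_tables: Dict[str, List[int]] = {}

def map_prompt(request_id: str, shared_prefix_blocks: List[int], own_blocks: List[int]) -> None:
    """Point a request's block table at the shared prefix blocks plus its own blocks."""
    block_tables[request_id] = shared_prefix_blocks + own_blocks
    for b in block_tables[request_id]:
        ref_count[b] = ref_count.get(b, 0) + 1

def free_request(request_id: str) -> None:
    """On completion, drop references; a block is only reclaimable once its refcount hits 0."""
    for b in block_tables.pop(request_id):
        ref_count[b] -= 1

# Two prompts with an identical prefix point at the same physical blocks 0-3.
map_prompt("request 1", shared_prefix_blocks=[0, 1, 2, 3], own_blocks=[4, 5])
map_prompt("request 2", shared_prefix_blocks=[0, 1, 2, 3], own_blocks=[6])
assert ref_count[0] == 2  # the prefix blocks are referenced by both requests
```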
Goal:
```
                                    shared inputs with prompt 1
                                                 |
                                                 |
+---------------------------------+       +-----+------+--------------------+
|                                 |  ...  | ////|///// |                    |
+---------------------------------+       +------------+--------------------+
             prompt 1                                 prompt 2
             request 1                                request 2
```
- store prefix -> kvs
- on request:
  - find shared inputs
  - assert_kv_cache(prefix-kvs) (see the sketch below)
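A toy Python sketch of those steps, with made-up names (`prefix_cache`, `run_prefill`, the block size, etc. are illustrative, not vLLM APIs):

```python
from typing import Dict, List, Tuple

BLOCK = 16                    # toy KV-cache block size in tokens
PrefixKey = Tuple[int, ...]   # token ids of a cached prefix
KVBlocks = List[int]          # ids of physical KV-cache blocks (paged-attention style)

prefix_cache: Dict[PrefixKey, KVBlocks] = {}
_next_block = 0

def run_prefill(suffix: List[int], past_kv_blocks: KVBlocks) -> KVBlocks:
    """Stand-in for real prefill: pretend to allocate one block per BLOCK suffix tokens."""
    global _next_block
    n = (len(suffix) + BLOCK - 1) // BLOCK
    blocks = list(range(_next_block, _next_block + n))
    _next_block += n
    return blocks

def handle_request(token_ids: List[int]) -> KVBlocks:
    # 1. find the longest cached prefix of this prompt (the "shared inputs")
    best_key: PrefixKey = ()
    for key in prefix_cache:
        if len(key) > len(best_key) and tuple(token_ids[: len(key)]) == key:
            best_key = key
    cached = prefix_cache.get(best_key, [])

    # 2. run prefill attention only over the un-cached suffix tokens
    suffix = token_ids[len(best_key):]
    all_blocks = cached + run_prefill(suffix, past_kv_blocks=cached)

    # 3. register full-block prefixes of this prompt for later requests
    for i in range(1, len(token_ids) // BLOCK + 1):
        prefix_cache[tuple(token_ids[: i * BLOCK])] = all_blocks[:i]
    return all_blocks

shared = [1, 2, 3, 4] * 8                # 32-token shared prefix
handle_request(shared + [99])            # request 1: full prefill
handle_request(shared + [77, 78])        # request 2: only the 2-token suffix is prefilled
```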
Any gain from this idea?
So does paged attention already skip recomputing the KVs for the shared inputs across requests, or is there anything to be gained from additionally caching prefix KVs?
If it already caches across requests, what is the mechanism that keeps KV-cache entries from being evicted?
Wondering if there are still potential tweaks to make sure certain prefixes stay in the KV cache.
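For the last point, the kind of tweak I have in mind is pinning: an eviction policy that never drops the blocks of a few selected prefixes. Purely illustrative sketch, not vLLM's actual eviction logic:

```python
from collections import OrderedDict
from typing import List, Optional, Set, Tuple

PrefixKey = Tuple[int, ...]

class PrefixLRU:
    """Toy prefix -> KV-block cache with LRU eviction that never drops pinned prefixes."""

    def __init__(self, capacity_blocks: int):
        self.capacity = capacity_blocks
        self.entries: OrderedDict = OrderedDict()   # prefix -> list of block ids
        self.pinned: Set[PrefixKey] = set()

    def put(self, prefix: PrefixKey, blocks: List[int], pin: bool = False) -> None:
        self.entries[prefix] = blocks
        self.entries.move_to_end(prefix)
        if pin:
            self.pinned.add(prefix)                 # e.g. the few big shared prefixes
        self._evict()

    def get(self, prefix: PrefixKey) -> Optional[List[int]]:
        if prefix in self.entries:
            self.entries.move_to_end(prefix)        # refresh recency on a hit
            return self.entries[prefix]
        return None

    def _used_blocks(self) -> int:
        return sum(len(b) for b in self.entries.values())

    def _evict(self) -> None:
        # Evict least-recently-used entries, skipping pinned prefixes.
        while self._used_blocks() > self.capacity:
            victim = next((k for k in self.entries if k not in self.pinned), None)
            if victim is None:
                break                               # only pinned prefixes remain
            del self.entries[victim]
```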