Question: Does paged attention demonstrate prefix sharing?  #2354

@bob-just-bob

Description

I'm reading https://arxiv.org/abs/2311.04934 (Prompt Cache) and wondering whether I would gain anything from a prompt cache.

My use case involves prompts with overlapping prefixes (mostly a few big ones), and I already use vLLM's paged attention.

Assume I only want to cache KV states for prefixes (not for segments at arbitrary positions, as in the paper).
Would there be any gain in caching prefix attention states, or do paged attention and vLLM already do this?

Paper:

> Paged attention also demonstrates simple prefix sharing, where different prompts with an identical prefix share KV Cache

Goal:

                                           shared inputs with prompt1
                                               |
                                               |
 +---------------------------------+     +-----+------+--------------------+
 |                                 | ... | ////|///// |                    |
 +---------------------------------+     +------------+--------------------+
  prompt 1                                           prompt 2
  request 1                                          request 2


- store prefix->kvs
- on request:
  - find shared inputs
  - assert_kv_cache(prefix-kvs) (a rough sketch of this flow follows below)
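
To make the question concrete, here is a minimal sketch of the bookkeeping I have in mind. Everything in it (`BLOCK_SIZE`, `PrefixStore`, the hashing scheme) is hypothetical pseudocode for the idea, not vLLM internals:

```python
import hashlib
from dataclasses import dataclass

BLOCK_SIZE = 16  # tokens per cached KV block (hypothetical granularity)


@dataclass
class Block:
    kv: object          # placeholder for the real key/value tensors
    ref_count: int = 0  # >0 while some live request maps this block


class PrefixStore:
    """Hypothetical prefix -> KV-block index keyed by full-prefix hashes."""

    def __init__(self) -> None:
        self.blocks: dict[bytes, Block] = {}

    @staticmethod
    def _hash(tokens: list[int]) -> bytes:
        # Hash all tokens up to and including this block, so a block is
        # only reused when the entire preceding prefix matches exactly.
        return hashlib.sha256(repr(tokens).encode()).digest()

    def lookup_prefix(self, tokens: list[int]) -> int:
        """Return how many leading tokens already have cached KVs,
        pinning (ref-counting) each matched block against eviction."""
        cached = 0
        for end in range(BLOCK_SIZE, len(tokens) + 1, BLOCK_SIZE):
            h = self._hash(tokens[:end])
            if h not in self.blocks:
                break
            self.blocks[h].ref_count += 1
            cached = end
        return cached

    def insert_prefix(self, tokens: list[int], kv_blocks: list[object]) -> None:
        """Register freshly computed KV blocks for a block-aligned prefix."""
        for i, kv in enumerate(kv_blocks):
            end = (i + 1) * BLOCK_SIZE
            self.blocks.setdefault(self._hash(tokens[:end]), Block(kv=kv))
```

Only tokens beyond the returned `cached` offset would need a prefill attention pass, and blocks with `ref_count > 0` would be safe from eviction.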


Any gain from this idea?

So does paged attention already skip the attention computation for the shared inputs, or is there anything to be gained from additionally caching prefix KVs?

If it already caches across requests, what mechanism keeps KV-cache entries from being evicted?
I'm wondering whether there are still tweaks that would ensure certain prefixes stay in the KV cache.
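
For reference, here is a minimal sketch of what this looks like with vLLM's automatic prefix caching flag; availability depends on your vLLM version, and the model name is just a placeholder:

```python
from vllm import LLM, SamplingParams

# With automatic prefix caching enabled, KV blocks for identical prefixes
# are hashed and reused across requests (flag in recent vLLM releases).
llm = LLM(model="meta-llama/Llama-2-7b-hf", enable_prefix_caching=True)

shared = "You are a helpful assistant. Here is the shared context: ..."
prompts = [shared + " Question A?", shared + " Question B?"]

# The second request's prefill should reuse the cached KVs for `shared`
# instead of recomputing attention over it.
outputs = llm.generate(prompts, SamplingParams(max_tokens=64))
for out in outputs:
    print(out.outputs[0].text)
```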
