
[RFC]: Reuse multimodal embeddings from encoder cache #21113

@knlnguyen1802

Description


🚀 The feature, motivation and pitch


Motivation

In real-world multimodal workflows (vision chat, RAG-with-images, agent loops) the same image or audio clip is often reused across many prompts.
Today, vLLM re-encodes identical media on every request, wasting:

  • Encoder + projector compute
  • GPU memory bandwidth
  • Scheduler slots

Reusing cached embeddings will cut end-to-end latency and boost throughput.


High-Level Plan

  1. Stable key per media asset
    Compute a deterministic mm_hash (e.g., a SHA-256 of the raw bytes or of the deterministic pre-processor output) for every image / audio / video frame; see the sketch after this list.

  2. LRU Multimodal Embedding Cache
    GPU-resident key → tensor store with configurable capacity and LRU eviction.

  3. Scheduler & EncoderCacheManager changes
    • On request arrival, Scheduler queries EncoderCacheManager with each mm_hash.
    • Cache hit ⇒ skip Encoder + Projector, fetch embeddings.
    • Cache miss ⇒ run Encoder + Projector and write back.

  4. Memory control & eviction
    EncoderCacheManager owns allocation, frees memory, and evicts least-recently-used entries under pressure.

  5. Backward compatibility
    Cache is opt-in (multimodal_cache.enabled=false by default); text-only and existing multimodal users see no change.
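
A minimal sketch of the key computation in step 1, assuming we hash the raw media bytes (the helper name compute_mm_hash is illustrative, not an existing vLLM API):

```python
import hashlib


def compute_mm_hash(raw_bytes: bytes, modality: str) -> str:
    """Illustrative helper: deterministic key for one media asset.

    Hashing the raw bytes together with a modality tag yields the same
    mm_hash for the same asset across requests, and avoids collisions
    between different modalities that happen to share identical bytes.
    """
    h = hashlib.sha256()
    h.update(modality.encode("utf-8"))
    h.update(raw_bytes)
    return h.hexdigest()
```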

Note: for requests containing multiple images or audio clips, each mm_data item is hashed and cached separately; the decoder retrieves the cached embeddings and reorders them to match the order the request expects, as sketched below.
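
A hedged sketch of the per-item hit/miss flow from steps 3–4 and the note above; cache, encoder, and the mm_hash attribute are illustrative stand-ins, not the actual scheduler interfaces:

```python
def gather_mm_embeddings(cache, encoder, mm_items):
    """Illustrative lookup: each mm_data item is keyed by its own mm_hash,
    so a request with several images reuses whatever subset is cached and
    re-encodes only the misses, then reassembles results in request order."""
    embeddings = []
    for item in mm_items:
        emb = cache.get(item.mm_hash)
        if emb is None:
            # Cache miss: run encoder + projector, then write back.
            emb = encoder.encode(item)
            cache.put(item.mm_hash, emb)
        embeddings.append(emb)  # preserves the order the request expects
    return embeddings
```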


Architecture Diagrams

1. Existing data flow (high-level): [diagram]
2. Proposed control flow with EncoderCacheManager: [diagram]

Implementation Plan

| Component | Change | Notes |
| --- | --- | --- |
| AsyncLLM/Processor | Emit mm_hashes alongside token IDs & features | Hash computed once during preprocessing |
| Scheduler | Look up mm_hashes; skip encoder on hit | Must handle intra- and inter-batch reuse |
| EncoderCacheManager | get_or_put(mm_hash, embeddings); LRU; memory guard | Config: max_mm_cache_bytes, device={cpu,gpu} |
| Encoder | Runs only on cache miss | No change, or minimal change when writing the MM cache |
| Decoder | Unchanged | No change, or minimal change when reading the MM cache |
| LRUMMCache | New LRU key→tensor store | Drop-in replacement for the current MM cache; backward-compatible, opt-in, handles eviction & the memory limit |
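
A minimal sketch of the proposed LRUMMCache, assuming an OrderedDict-backed key → tensor store with a byte budget (max_bytes stands in for the max_mm_cache_bytes config above; the final API may differ):

```python
from collections import OrderedDict
from typing import Optional

import torch


class LRUMMCache:
    """Sketch of a key -> tensor store with LRU eviction under a byte budget."""

    def __init__(self, max_bytes: int):
        self.max_bytes = max_bytes
        self.cur_bytes = 0
        self._store: "OrderedDict[str, torch.Tensor]" = OrderedDict()

    def get(self, mm_hash: str) -> Optional[torch.Tensor]:
        emb = self._store.get(mm_hash)
        if emb is not None:
            self._store.move_to_end(mm_hash)  # mark as most recently used
        return emb

    def put(self, mm_hash: str, emb: torch.Tensor) -> None:
        if mm_hash in self._store:
            self.cur_bytes -= self._nbytes(self._store.pop(mm_hash))
        size = self._nbytes(emb)
        # Evict least-recently-used entries until the new tensor fits.
        while self._store and self.cur_bytes + size > self.max_bytes:
            _, evicted = self._store.popitem(last=False)
            self.cur_bytes -= self._nbytes(evicted)
        self._store[mm_hash] = emb
        self.cur_bytes += size

    @staticmethod
    def _nbytes(t: torch.Tensor) -> int:
        return t.numel() * t.element_size()
```

A thin get_or_put wrapper over these two methods would provide the single entry point named in the table.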

cc @ywang96 for the feature request from #4194

References: https://docs.google.com/document/d/11_DFQTku6C2aV6ghK21P76ST6uAUVjMlEjs54prtb_g/edit?tab=t.0#heading=h.635zp481pbum

Alternatives

No response

Additional context

No response

