
[RFC]: Reuse multimodal embeddings from encoder cache #21113

@knlnguyen1802

Description


🚀 The feature, motivation and pitch


Motivation

In real-world multimodal workflows (vision chat, RAG-with-images, agent loops) the same image or audio clip is often reused across many prompts.
Today, vLLM re-encodes identical media on every request, wasting:

  • Encoder + projector compute
  • GPU memory bandwidth
  • Scheduler slots

Reusing cached embeddings will cut end-to-end latency and boost throughput.


High-Level Plan

  1. Stable key per media asset
    Compute a deterministic mm_hash (e.g., a SHA-256 of the raw bytes or of the deterministic pre-processor output) for every image / audio / video frame; see the sketch after this list.

  2. LRU Multimodal Embedding Cache
    GPU-resident key → tensor store with configurable capacity and LRU eviction.

  3. Scheduler & EncoderCacheManager changes
    • On request arrival, Scheduler queries EncoderCacheManager with each mm_hash.
    • Cache hit ⇒ skip Encoder + Projector, fetch embeddings.
    • Cache miss ⇒ run Encoder + Projector and write back.

  4. Memory control & eviction
    EncoderCacheManager owns allocation, frees memory, and evicts least-recently-used entries under pressure.

  5. Backward compatibility
    Cache is opt-in (multimodal_cache.enabled=false by default); text-only and existing multimodal users see no change.
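
A minimal sketch of the key computation in step 1, assuming we hash the raw media bytes (the helper name compute_mm_hash is illustrative, not an existing vLLM API):

```python
import hashlib


def compute_mm_hash(raw_bytes: bytes, modality: str) -> str:
    """Illustrative helper: deterministic key for one media asset.

    Hashing the raw bytes together with a modality tag yields the same
    mm_hash for the same asset across requests, and avoids collisions
    between different modalities that happen to share identical bytes.
    """
    h = hashlib.sha256()
    h.update(modality.encode("utf-8"))
    h.update(raw_bytes)
    return h.hexdigest()
```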

Note: for requests containing multiple images or audio clips, each mm_data item is hashed and cached separately; the decoder retrieves the cached embeddings and reorders them to match the order the request expects, as sketched below.
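
A hedged sketch of the per-item hit/miss flow from steps 3–4 and the note above; cache, encoder, and the mm_hash attribute are illustrative stand-ins, not the actual scheduler interfaces:

```python
def gather_mm_embeddings(cache, encoder, mm_items):
    """Illustrative lookup: each mm_data item is keyed by its own mm_hash,
    so a request with several images reuses whatever subset is cached and
    re-encodes only the misses, then reassembles results in request order."""
    embeddings = []
    for item in mm_items:
        emb = cache.get(item.mm_hash)
        if emb is None:
            # Cache miss: run encoder + projector, then write back.
            emb = encoder.encode(item)
            cache.put(item.mm_hash, emb)
        embeddings.append(emb)  # preserves the order the request expects
    return embeddings
```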


Architecture Diagrams

1. Existing data flow (high-level): [diagram]
2. Proposed control flow with EncoderCacheManager: [diagram]

Implementation Plan

| Component | Change | Notes |
| --- | --- | --- |
| AsyncLLM/Processor | Emit mm_hashes alongside token IDs & features | Hash computed once during preprocessing |
| Scheduler | Look up mm_hashes; skip encoder on hit | Must handle intra- and inter-batch reuse |
| EncoderCacheManager | get_or_put(mm_hash, embeddings); LRU; memory guard | Config: max_mm_cache_bytes, device={cpu,gpu} |
| Encoder | Runs only on cache miss | No change, or minimal change when writing the MM cache |
| Decoder | Unchanged | No change, or minimal change when reading the MM cache |
| LRUMMCache | New LRU key→tensor store | Drop-in replacement for the current MM cache; backward-compatible, opt-in, handles eviction & the memory limit |
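
A minimal sketch of the proposed LRUMMCache, assuming an OrderedDict-backed key → tensor store with a byte budget (max_bytes stands in for the max_mm_cache_bytes config above; the final API may differ):

```python
from collections import OrderedDict
from typing import Optional

import torch


class LRUMMCache:
    """Sketch of a key -> tensor store with LRU eviction under a byte budget."""

    def __init__(self, max_bytes: int):
        self.max_bytes = max_bytes
        self.cur_bytes = 0
        self._store: "OrderedDict[str, torch.Tensor]" = OrderedDict()

    def get(self, mm_hash: str) -> Optional[torch.Tensor]:
        emb = self._store.get(mm_hash)
        if emb is not None:
            self._store.move_to_end(mm_hash)  # mark as most recently used
        return emb

    def put(self, mm_hash: str, emb: torch.Tensor) -> None:
        if mm_hash in self._store:
            self.cur_bytes -= self._nbytes(self._store.pop(mm_hash))
        size = self._nbytes(emb)
        # Evict least-recently-used entries until the new tensor fits.
        while self._store and self.cur_bytes + size > self.max_bytes:
            _, evicted = self._store.popitem(last=False)
            self.cur_bytes -= self._nbytes(evicted)
        self._store[mm_hash] = emb
        self.cur_bytes += size

    @staticmethod
    def _nbytes(t: torch.Tensor) -> int:
        return t.numel() * t.element_size()
```

A thin get_or_put wrapper over these two methods would provide the single entry point named in the table.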

cc @ywang96 for the feature request from #4194

References: https://docs.google.com/document/d/11_DFQTku6C2aV6ghK21P76ST6uAUVjMlEjs54prtb_g/edit?tab=t.0#heading=h.635zp481pbum

Alternatives

No response

Additional context

No response

