🚀 The feature, motivation and pitch
Motivation
In real-world multimodal workflows (vision chat, RAG-with-images, agent loops) the same image or audio clip is often reused across many prompts.
Today, vLLM re-encodes identical media on every request that includes it, wasting:
- Encoder + projector compute
- GPU memory bandwidth
- Scheduler slots
Re-using cached embeddings will cut end-to-end latency and boost throughput.
High-Level Plan
- Stable key per media asset: compute a deterministic `mm_hash` (e.g., SHA-256 of the raw bytes or of the deterministic pre-processor output) for every image / audio / video frame (a hashing sketch follows below).
- LRU Multimodal Embedding Cache: a GPU-resident key → tensor store with configurable capacity and LRU eviction.
- Scheduler & EncoderCacheManager changes: on request arrival, the Scheduler queries the EncoderCacheManager with each `mm_hash`. Cache hit ⇒ skip Encoder + Projector and fetch the cached embeddings. Cache miss ⇒ run Encoder + Projector and write the result back.
- Memory control & eviction: the EncoderCacheManager owns allocation, frees memory, and evicts least-recently-used entries under pressure.
- Backward compatibility: the cache is opt-in (`multimodal_cache.enabled=false` by default); text-only and existing multimodal users see no change.
Note: for requests containing multiple images or audio clips, each mm_data item is hashed and cached separately; the decoder retrieves the cached embeddings and reorders them according to the order the request needs.
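A minimal sketch of the deterministic key computation described above, assuming the hash covers the raw media bytes plus a modality tag and a pre-processor version; the helper name `compute_mm_hash` is illustrative, not an existing vLLM API:

```python
import hashlib


def compute_mm_hash(media_bytes: bytes, modality: str, preproc_version: str = "v1") -> str:
    """Hypothetical helper: deterministic cache key for one media item.

    Hashing the raw bytes together with the modality and a pre-processor
    version tag means a change in preprocessing invalidates stale entries.
    """
    h = hashlib.sha256()
    h.update(modality.encode())
    h.update(preproc_version.encode())
    h.update(media_bytes)
    return h.hexdigest()


# Two requests that reuse the same image bytes map to the same cache key.
img_bytes = b"raw image bytes"  # stand-in for the actual decoded image data
assert compute_mm_hash(img_bytes, "image") == compute_mm_hash(img_bytes, "image")
```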
Architecture Diagrams
Implementation Plan
| Component | Change | Notes |
|---|---|---|
| AsyncLLM/Processor | Emit `mm_hashes` next to token IDs & features | Hash computed once during preprocessing |
| Scheduler | Look up `mm_hash`es; skip encoder on hit | Must handle intra- and inter-batch reuse |
| EncoderCacheManager | `get_or_put(mm_hash, embeddings)`; LRU; memory guard | Config: `max_mm_cache_bytes`, `device={cpu,gpu}` |
| Encoder | Runs only on cache miss | No change or minimal change when writing MM cache |
| Decoder | Unchanged | No change or minimal change when reading MM cache |
| LRUMMCache | New LRU key→tensor store | Drop-in replacement for current MM cache; backward-compatible, opt-in, handles eviction & memory limit (see sketch below) |
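A rough sketch of how the LRU key → tensor store and the `get_or_put` flow from the table could look. The class name `LRUMMCache` and the `max_mm_cache_bytes` / `device` knobs mirror the table above, but the exact signatures are assumptions rather than the final vLLM interface:

```python
from collections import OrderedDict
from typing import Optional

import torch


class LRUMMCache:
    """Illustrative LRU store mapping mm_hash -> embedding tensor.

    Evicts least-recently-used entries once the configured byte budget
    (max_mm_cache_bytes) would be exceeded.
    """

    def __init__(self, max_mm_cache_bytes: int, device: str = "cpu"):
        self.max_bytes = max_mm_cache_bytes
        self.device = device
        self.cur_bytes = 0
        self._store: OrderedDict[str, torch.Tensor] = OrderedDict()

    def get(self, mm_hash: str) -> Optional[torch.Tensor]:
        emb = self._store.get(mm_hash)
        if emb is not None:
            self._store.move_to_end(mm_hash)  # mark as most recently used
        return emb

    def put(self, mm_hash: str, embeddings: torch.Tensor) -> None:
        size = embeddings.element_size() * embeddings.nelement()
        # Evict LRU entries until the new tensor fits in the byte budget.
        while self._store and self.cur_bytes + size > self.max_bytes:
            _, evicted = self._store.popitem(last=False)
            self.cur_bytes -= evicted.element_size() * evicted.nelement()
        self._store[mm_hash] = embeddings.to(self.device)
        self.cur_bytes += size

    def get_or_put(self, mm_hash: str, embeddings: torch.Tensor) -> torch.Tensor:
        """Return cached embeddings if present; otherwise store and return the given ones."""
        cached = self.get(mm_hash)
        if cached is not None:
            return cached
        self.put(mm_hash, embeddings)
        return embeddings


# Scheduler-side flow (pseudocode):
#   emb = cache.get(mm_hash)
#   if emb is None:                   # cache miss -> run Encoder + Projector
#       emb = encoder_projector(mm_data)
#       cache.put(mm_hash, emb)
#   # cache hit -> Encoder + Projector are skipped entirely
```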
cc @ywang96 for the feature request from #4194
Alternatives
No response
Additional context
No response
Before submitting a new issue...
- Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.