```
KV_Cache_Optimization/
  analysis/                      # Analysis utilities
  colbert_index/                 # ColBERT/RAGatouille indices (generated)
  configs/                       # Optional configuration files
  data/                          # Optional sample data
  example/                       # Example usage artifacts
  inputs/                        # Provided datasets
  results/
    analysis/                    # Analysis outputs
    decoding/                    # Speculative decode traces
    retrieval/                   # Retrieval outputs (top-k indices)
    kv_caches/                   # Saved KV cache entries (per-chunk folders)
  scripts/
    analysis/
    retrieval/
      run_retrieval.sh           # Run RAG indexing + retrieval
    build_kv_cache.sh            # Build per-sample KV caches from retrieval JSON
    decoding/
      run_speculative_decode.sh  # Run speculative decode with promotions
      run_token_budget_test.sh
  src/
    build_kv_cache.py            # Build GPU/CPU KV caches from retrieval output
    config.py                    # Pipeline configuration utilities
    kv_cache_manager.py          # CPU/GPU cache manager with CacheBlend kernels
    rag_retrieval.py             # RAGatouille (ColBERT) indexing + retrieval
    run_pipeline.py              # End-to-end demo pipeline
    speculative_decode.py        # Speculative decode with proactive promotions
    token_budget_calculator.py
    utils/
  vllm_blend/
  requirements.txt
  README.md
```
- Python requirements (minimal): see `requirements.txt`.
- Additional dependencies:
  - Retrieval uses RAGatouille/ColBERT: `ragatouille` and its dependencies.
  - Transformers for model loading: `transformers`.
  - `tqdm` for progress bars.
Example installation:

```bash
cd KV_Cache_Optimization/vllm_blend
pip install -e .
cd ..
pip install -r requirements.txt
```

Ensure you have access to the target HF model (e.g., `meta-llama/Meta-Llama-3-8B`) and appropriate GPU/CPU memory.
Sample inputs are provided under `KV_Cache_Optimization/inputs/`, e.g. `musique_s.json`, `wikimqa_s.json`, `samsum.json`.
There are three main steps:
- Retrieval (build the RAG index and compute top-k per sample)
  - Script: `scripts/retrieval/run_retrieval.sh`
  - Writes: `results/retrieval/<dataset>_rag_both_k<k>.json`
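The exact schema of this JSON is defined by `src/rag_retrieval.py`; before wiring it into the next step, a schema-agnostic peek can confirm what the file contains. A minimal sketch (`summarize_retrieval` is our name for illustration, not part of the repo):

```python
import json
from pathlib import Path

def summarize_retrieval(path):
    """Load a retrieval JSON and report its top-level shape without
    assuming a particular schema."""
    data = json.loads(Path(path).read_text())
    if isinstance(data, list):
        keys = sorted(data[0]) if data and isinstance(data[0], dict) else []
        print(f"list of {len(data)} samples; first entry keys: {keys}")
    elif isinstance(data, dict):
        print(f"dict with {len(data)} top-level keys")
    return data
```

For example, `summarize_retrieval("results/retrieval/musique_s_rag_both_k5.json")` after running the step below.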
```bash
cd KV_Cache_Optimization
bash scripts/retrieval/run_retrieval.sh
# Default dataset: inputs/musique_s.json
# Output: results/retrieval/musique_s_rag_both_k5.json
```

- Build KV caches (prefill top-k on GPU, placeholders on CPU)
  - Script: `scripts/build_kv_cache.sh`
  - Reads: retrieval JSON from step 1
  - Writes:
    - Summary: `results/kv_caches/musique_s_kv_top5.json` (default)
    - Per-chunk KV folders under `results/kv_caches/` when `--save-cache-dir` is enabled (default in the script)
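When per-chunk saving is enabled, each chunk folder holds `keys.pt`, `values.pt`, `valid_mask.pt`, and `metadata.json`. A minimal sketch for sanity-checking such a folder (the helper name is ours; the tensors themselves would be loaded with `torch.load`):

```python
import json
from pathlib import Path

# Files written per chunk by build_kv_cache.py when --save-cache-dir is enabled.
EXPECTED = ("keys.pt", "values.pt", "valid_mask.pt", "metadata.json")

def inspect_chunk_dir(chunk_dir):
    """Return (metadata, missing_files) for one saved per-chunk KV folder.
    Tensors are left on disk; load them with torch.load(chunk_dir / "keys.pt")."""
    chunk_dir = Path(chunk_dir)
    missing = [name for name in EXPECTED if not (chunk_dir / name).exists()]
    meta_path = chunk_dir / "metadata.json"
    meta = json.loads(meta_path.read_text()) if meta_path.exists() else {}
    return meta, missing
```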
```bash
cd KV_Cache_Optimization
bash scripts/build_kv_cache.sh
# Output summary: results/kv_caches/musique_s_kv_top5.json
# Saved chunk KV: results/kv_caches/<sample_chunk_id>/{keys.pt,values.pt,valid_mask.pt,metadata.json}
```

- Speculative decode
  - Script: `scripts/decoding/run_speculative_decode.sh`
  - Reads:
    - Retrieval JSON (from step 1)
    - Optionally, cached KV from `results/kv_caches/` via `--load-cache-dir`
  - Writes: `results/decoding/speculative_trace.json`
```bash
cd KV_Cache_Optimization
bash scripts/decoding/run_speculative_decode.sh
# Output: results/decoding/speculative_trace.json
```

Outputs:

- Retrieval outputs: `KV_Cache_Optimization/results/retrieval/`, e.g. `musique_s_rag_both_k5.json`
- KV cache summary and per-chunk KV: `KV_Cache_Optimization/results/kv_caches/`, e.g. `musique_s_kv_top5.json`, plus per-chunk folders
- Speculative decode trace and answers: `KV_Cache_Optimization/results/decoding/`, e.g. `speculative_trace.json`
- Default values in the provided shell scripts can be overridden via environment variables.
- Ensure sufficient GPU/CPU memory for the chosen model and top-k.
- If CacheBlend kernels are required (`require_kernels=True` in `KVCacheManager`), ensure `vllm_blend` is importable.
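A quick way to check the last point before constructing `KVCacheManager` with `require_kernels=True` (a minimal sketch; the helper name and the printed hint are ours):

```python
import importlib.util

def cacheblend_kernels_available() -> bool:
    """True iff the vllm_blend package is importable in this environment."""
    return importlib.util.find_spec("vllm_blend") is not None

if not cacheblend_kernels_available():
    print("vllm_blend is not importable; install it first (see the "
          "`pip install -e .` step in the installation section)")
```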
Quick start:

```bash
cd KV_Cache_Optimization
# 1) Retrieval
bash scripts/retrieval/run_retrieval.sh
# 2) Build KV caches
bash scripts/build_kv_cache.sh
# 3) Speculative decode
bash scripts/decoding/run_speculative_decode.sh
```
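The same three steps can be driven from Python via `subprocess` (a sketch; which environment variables each script actually reads is defined in the scripts themselves, so any keys passed in `overrides` are assumptions to be checked against them):

```python
import os
import subprocess

# The three pipeline scripts, in execution order.
PIPELINE = (
    "scripts/retrieval/run_retrieval.sh",          # 1) retrieval
    "scripts/build_kv_cache.sh",                   # 2) build KV caches
    "scripts/decoding/run_speculative_decode.sh",  # 3) speculative decode
)

def run_pipeline(repo_root="KV_Cache_Optimization", overrides=None, dry_run=False):
    """Run the pipeline scripts in order, merging `overrides` into the
    environment. Returns the command lines (also when dry_run=True)."""
    env = {**os.environ, **(overrides or {})}
    cmds = [["bash", script] for script in PIPELINE]
    if not dry_run:
        for cmd in cmds:
            subprocess.run(cmd, cwd=repo_root, env=env, check=True)
    return cmds
```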