This project develops a high-performance KV-cache management framework for multi-document RAG tasks. It focuses on reducing time-per-output-token (TPOT) and improving throughput through adaptive cache scheduling, GPU–CPU offloading, and reuse of cross-document attention states.

Chelsi-create/KV_Cache_Optimization

Repository Structure

KV_Cache_Optimization/
  analysis/                # Analysis utilities
  colbert_index/           # ColBERT/RAGatouille indices (generated)
  configs/                 # Optional configuration files
  data/                    # Optional sample data
  example/                 # Example usage artifacts
  inputs/                  # Provided datasets
  results/
    analysis/              # Analysis outputs
    decoding/              # Speculative decode traces
    retrieval/             # Retrieval outputs (top-k indices)
    kv_caches/             # Saved KV cache entries (per-chunk folders)
  scripts/
    analysis/
    retrieval/
      run_retrieval.sh     # Run RAG indexing + retrieval
    build_kv_cache.sh      # Build per-sample KV caches from retrieval JSON
    decoding/
      run_speculative_decode.sh  # Run speculative decode with promotions
    run_token_budget_test.sh
  src/
    build_kv_cache.py      # Build GPU/CPU KV caches from retrieval output
    config.py              # Pipeline configuration utilities
    kv_cache_manager.py    # CPU/GPU cache manager with CacheBlend kernels
    rag_retrieval.py       # RAGatouille (ColBERT) indexing + retrieval
    run_pipeline.py        # End-to-end demo pipeline 
    speculative_decode.py  # Speculative decode with proactive promotions
    token_budget_calculator.py
  utils/
  vllm_blend/
  requirements.txt
  README.md

Environment Setup

  • Python requirements (minimal): see requirements.txt.
  • Additional dependencies:
    • Retrieval uses RAGatouille/ColBERT: ragatouille and its dependencies.
    • Transformers for model loading: transformers.
    • tqdm for progress bars.

Example installation:

cd KV_Cache_Optimization/vllm_blend
pip install -e .
cd ..
pip install -r requirements.txt

Ensure you have access to the target HF model (e.g., meta-llama/Meta-Llama-3-8B) and sufficient GPU/CPU memory.
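KV-cache memory is the main sizing concern. A rough per-token estimate, assuming Meta-Llama-3-8B's published shape (32 layers, 8 KV heads via GQA, head dim 128) and fp16 caches:

```python
def kv_cache_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128, dtype_bytes=2):
    """Per-token KV cache size: 2 (keys + values) x layers x kv-heads x head-dim x bytes."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes

per_token = kv_cache_bytes_per_token()    # 131072 bytes = 128 KiB/token for Llama-3-8B in fp16
context_gib = 32_768 * per_token / 2**30  # a 32k-token multi-document context needs ~4 GiB of KV cache
```

Estimates like this help decide how many chunks can stay resident on the GPU versus being offloaded to CPU.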

Datasets

Sample inputs are provided under KV_Cache_Optimization/inputs/, e.g.:

  • musique_s.json
  • wikimqa_s.json
  • samsum.json
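Each input file is a JSON collection of multi-document QA samples. A minimal sketch of iterating one sample's context chunks — the field names below are illustrative, not the actual schema; inspect the input files to confirm:

```python
# Hypothetical record shape -- "question"/"ctxs" are placeholder field
# names, not necessarily those used in inputs/musique_s.json.
sample = {
    "question": "Who founded the company that makes the iPhone?",
    "ctxs": [
        {"id": "doc0", "text": "Apple Inc. was founded by Steve Jobs ..."},
        {"id": "doc1", "text": "The iPhone is designed by Apple ..."},
    ],
}

def iter_chunks(record):
    """Yield (chunk_id, text) pairs for one sample's candidate documents."""
    for ctx in record["ctxs"]:
        yield ctx["id"], ctx["text"]

chunk_ids = [cid for cid, _ in iter_chunks(sample)]
```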

End-to-End Workflow

There are three main steps:

  1. Retrieval (build RAG index and compute top-k per sample)
  • Script: scripts/retrieval/run_retrieval.sh
  • Writes: results/retrieval/<dataset>_rag_both_k<k>.json
cd KV_Cache_Optimization
bash scripts/retrieval/run_retrieval.sh
# Default dataset: inputs/musique_s.json
# Output: results/retrieval/musique_s_rag_both_k5.json
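The retrieval JSON records, per sample, the indices of the top-k chunks. The selection itself is just an argsort over relevance scores; a minimal sketch with toy scores (the real pipeline gets scores from ColBERT late interaction via RAGatouille):

```python
def top_k_indices(scores, k=5):
    """Return indices of the k highest-scoring chunks, best first."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]

# Toy relevance scores for six candidate chunks of one sample.
scores = [0.12, 0.87, 0.45, 0.91, 0.30, 0.66]
top5 = top_k_indices(scores, k=5)  # [3, 1, 5, 2, 4]
```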
  2. Build KV Caches (prefill top-k on GPU, placeholders on CPU)
  • Script: scripts/build_kv_cache.sh
  • Reads: retrieval JSON from step 1
  • Writes:
    • Summary: results/kv_caches/musique_s_kv_top5.json (default)
    • Per-chunk KV folders under results/kv_caches/ when --save-cache-dir is enabled (default in script)
cd KV_Cache_Optimization
bash scripts/build_kv_cache.sh
# Output summary: results/kv_caches/musique_s_kv_top5.json
# Saved chunk KV: results/kv_caches/<sample_chunk_id>/{keys.pt,values.pt,valid_mask.pt,metadata.json}
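Each chunk's folder holds its tensors plus a metadata.json. A sketch of the metadata half of that layout (in the real pipeline, keys.pt / values.pt / valid_mask.pt are written alongside with torch.save; the metadata fields below are illustrative):

```python
import json
import tempfile
from pathlib import Path

def save_chunk_metadata(cache_dir, chunk_id, num_tokens, device):
    """Create a per-chunk KV folder and write its metadata.json.

    keys.pt / values.pt / valid_mask.pt would be written into the same
    folder via torch.save; only the metadata half is sketched here.
    """
    folder = Path(cache_dir) / chunk_id
    folder.mkdir(parents=True, exist_ok=True)
    meta = {"chunk_id": chunk_id, "num_tokens": num_tokens, "device": device}
    (folder / "metadata.json").write_text(json.dumps(meta))
    return folder / "metadata.json"

cache_dir = tempfile.mkdtemp()
meta_path = save_chunk_metadata(cache_dir, "sample0_chunk3", 412, "cpu")
loaded = json.loads(meta_path.read_text())
```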
  3. Speculative Decode
  • Script: scripts/decoding/run_speculative_decode.sh
  • Reads:
    • Retrieval JSON (from step 1)
    • Optionally loads cached KV from results/kv_caches/ via --load-cache-dir
  • Writes: results/decoding/speculative_trace.json
cd KV_Cache_Optimization
bash scripts/decoding/run_speculative_decode.sh
# Output: results/decoding/speculative_trace.json
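The trace records, per step, how many drafted tokens were accepted. A toy sketch of the greedy accept rule — exact token-match verification, a simplification of sampler-based acceptance:

```python
def accept_prefix(draft_tokens, target_tokens):
    """Accept draft tokens up to the first disagreement with the target
    model, then take the target's token at that position (the standard
    "accepted prefix + 1 correction" of greedy speculative decoding)."""
    accepted = []
    for d, t in zip(draft_tokens, target_tokens):
        if d == t:
            accepted.append(d)
        else:
            accepted.append(t)  # correction supplied by the target model
            break
    return accepted

# Draft proposes 4 tokens; the target agrees on the first two.
out = accept_prefix([11, 23, 42, 7], [11, 23, 99, 7])  # [11, 23, 99]
```

When all drafted tokens are accepted, the whole window is emitted in one target forward pass, which is what lowers TPOT.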

Where to Find Results

  • Retrieval outputs: KV_Cache_Optimization/results/retrieval/
    • e.g., musique_s_rag_both_k5.json
  • KV cache summary and per-chunk KV: KV_Cache_Optimization/results/kv_caches/
    • e.g., musique_s_kv_top5.json, plus per-chunk folders
  • Speculative decode trace and answers: KV_Cache_Optimization/results/decoding/
    • e.g., speculative_trace.json

Notes

  • Default values in the provided shell scripts can be overridden via environment variables.
  • Ensure sufficient GPU/CPU memory for the chosen model and top-k.
  • If CacheBlend kernels are required (require_kernels=True in KVCacheManager), ensure vllm_blend is importable.
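Whether the CacheBlend kernels are usable can be checked up front. A small sketch, assuming only that vllm_blend is an importable package once installed as above:

```python
import importlib.util

def cacheblend_kernels_available():
    """True if the vllm_blend package (CacheBlend kernels) is importable."""
    return importlib.util.find_spec("vllm_blend") is not None

# One option: gate the manager's kernel requirement on availability,
# so runs fail fast (or fall back) instead of erroring mid-pipeline.
require_kernels = cacheblend_kernels_available()
```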

Quickstart

cd KV_Cache_Optimization
# 1) Retrieval
bash scripts/retrieval/run_retrieval.sh
# 2) Build KV caches
bash scripts/build_kv_cache.sh
# 3) Speculative decode
bash scripts/decoding/run_speculative_decode.sh
