Skip to content

Conversation

@yuz207
Copy link

@yuz207 yuz207 commented Oct 15, 2025

Summary

  • Implement safe, per-segment CUDA-graph-based chunk capture with gating on CUDA-graph capture availability for NWOR/SCV. Graph mode is automatically disabled if CUDA graph capture isn’t available, with a log entry. A one-time notice is emitted when a graph capture is created for a key. SCV graph execution coordination now leverages a runner reference to coordinate enablement state and mask computation.
  • Added NWOR profiling harness and tooling to benchmark NWOR with different SCV modes, plus extensive profiling scripts and docs.
  • Updated tests to accommodate new internal API: _compute_nwor_acceptance(spec_decode_metadata, sampled_token_ids, return_mask=False) now returning (counts, mask) and per-segment graph write paths. Tests are guarded to work in CPU-only environments.
  • Backwards-compatible: environments without CUDA graph capture automatically fall back to the non-graph path.
  • Documentation and profiling: NWOR/SCV graph-capable paths and fallbacks documented; new profiling guides and validation results added.

Changes

GPUModelRunner

  • Gate SCV graph usage with _scv_capture_available and _scv_graph_notice_logged flags. Short-circuit validation if SCV mode is graph and graph capture isn’t available.
  • Introduce new API: _compute_nwor_acceptance(spec_decode_metadata, sampled_token_ids, return_mask=False) which returns (counts, mask) where counts is a per-draft-token list and mask is an optional boolean mask tensor when requested.
  • When initializing the SCV graph executor, pass the runner instance and device so the executor can access mask computation and runner state.
  • Expanded graph-capture gating and runner state logging to aid debugging in graph-enabled environments.

SCVGraphExecutor (integrations and caching)

  • Updated integration to coordinate with the GPUModelRunner via a runner reference, enabling enablement state and mask computation coordination.
  • Implement per-key graph caching with a dynamic key based on input characteristics; a warmup send and a single CUDA graph capture per key.
  • Graph execution replays the captured graph and returns the computed mask. Removed old per-entry Python buffers in favor of per-key graph buffers.

DeferredWriteManager (v1/kv_cache/deferred.py)

  • Added helper _slice_scale_segment and per-entry segment tracking to support chunk-based graph captures.
  • Introduced _req_start_offsets to map per-request tokens to their positions in the overall stream.
  • Commit API updated: accepts accepted_counts (a list of per-entry accepted token counts) instead of a boolean mask. The logic computes per-entry segments that can be written directly to the KV cache, enabling safe chunk capture with CUDA graphs.
  • Rework of the commit pipeline to support per-segment writes and to coordinate with graph-based replays for exact token slices.
  • Clean-up of internal state between commits to ensure proper graph-keyed behavior and avoid stale buffers.

DeferredWriteManager tests (v1 tests)

  • Updated tests to exercise _compute_nwor_acceptance(metadata, sampled, return_mask=True) API and per-segment graph write paths.
  • Guard GPUModelRunner import in tests to handle CPU-only environments.

New tooling

  • Added profiling script: tools/profiling/run_nwor_microbench.py to benchmark NWOR with different SCV modes.
  • Added profiling scripts for NWOR/SCV evaluation and additional benchmarking support.
  • Added post-processing script: tools/profiling/post_process_ncu.py for NSight Compute results.

Documentation & validation

  • NWOR/docs updated to reflect that SCV graph mode requires CUDA graph capture support; environments without capture will disable graph mode automatically.
  • Added NWOR Validation Results and SCV Phase 0 Summary docs to capture current status and outcomes.
  • Profiling guide added: PROFILING_GUIDE.md.

Tests

  • Updated tests to reflect new _compute_nwor_acceptance API and per-segment graph write paths.
  • Tests guard CPU-only environments forCUDA/GPU-specific paths.

Miscellaneous

  • Minor platform/interface adjustments to support device auto-selection under certain configurations. See changes in vllm/platforms/interface.py and related areas as part of broader integration work.

Notable API/Behavioral Notes

  • If CUDA graph capture is unavailable or graph mode is disabled at runtime, the system falls back to the non-graph path automatically with appropriate logs.
  • The _compute_nwor_acceptance API now returns (counts, mask) to support both per-entry token counts and optional mask computation for graph-based execution paths.
  • Graph captures are keyed per input characteristics (per-key) to enable reuse and safe chunked graph replay with minimal Python overhead.

…logging

- Introduce _scv_capture_available to check CUDA graph capture support
- Disable SCV graph mode if capture unavailable with info log
- Add _scv_graph_notice_logged to log SCV graph activation once
- Pass capture availability flag to SCVGraphExecutor
- Prevent SCV graph usage if unsupported to avoid errors

Co-authored-by: terragon-labs[bot] <terragon-labs[bot]@users.noreply.github.com>
@yuz207 yuz207 marked this pull request as ready for review October 15, 2025 21:57
@yuz207 yuz207 changed the title NWOR: Gate SCV graph mode on CUDA capture availability NWOR: Graph-based SCV execution gated by CUDA capture Oct 15, 2025
@yuz207 yuz207 changed the title NWOR: Graph-based SCV execution gated by CUDA capture Implement safe chunk capture with CUDA graphs Oct 16, 2025
@yuz207 yuz207 changed the title Implement safe chunk capture with CUDA graphs Implement per-key SCV CUDA graphs with gating and safe capture Oct 16, 2025
@yuz207 yuz207 changed the title Implement per-key SCV CUDA graphs with gating and safe capture Add per-key SCV CUDA graphs with gating on capture availability Oct 16, 2025
@yuz207 yuz207 changed the title Add per-key SCV CUDA graphs with gating on capture availability Implement safe chunk capture with CUDA graphs for SCV (NWOR) Oct 16, 2025
@yuz207 yuz207 changed the title Implement safe chunk capture with CUDA graphs for SCV (NWOR) Add per-key CUDA-graph safe chunk capture for SCV/NWOR (profiling) Oct 16, 2025
yuz207 and others added 13 commits October 16, 2025 16:53
- Replace AsyncLLMEngine with synchronous LLM for lower overhead
- Add configurable tensor_parallel_size parameter (default: 1)
- Fix sampling_params serialization (manual dict vs non-existent to_dict)
- Replace engine.shutdown() with explicit cleanup (del + gc.collect())
- Reduces async scheduling overhead for cleaner NWOR/SCV measurements
- Move shape validation from device to host side
- Add graceful fallback on invalid sampled_token_ids shape
- Log warning_once when clamping will be applied
- Remove redundant RuntimeError checks incompatible with graph mode
- Improve _scv_compute_mask documentation
- test_scv_mask_handles_oob_gracefully: reproduces OOB scenario
- test_scv_mask_all_oob: extreme case with empty sampled tensor
- test_scv_mask_invalid_shape_falls_back: validates fallback on bad shapes
- All tests pass with host-side validation + clamping fix
Baseline run with EAGLE spec decode on Llama-3.2-3B:
- All SCV modes (off/graph/adaptive) complete without errors
- No CUDA device asserts or crashes
- Host-side validation prevents OOB access
- Latency ranges 0.59-0.61s per batch (8 reqs, 32 tokens)

Note: Spec decode metrics are zero (configuration issue, not SCV bug).
The important result is stability across all modes with the clamping fix.
- Add VLLM_NWOR_DEBUG environment variable to enable verbose logging
- Log NWOR/SCV configuration on init when spec decode is enabled
- Trace window lifecycle: begin, finalize, commit, cancel
- Show acceptance counts and per-request breakdown
- All debug output guarded by VLLM_NWOR_DEBUG=1 flag

Usage:
  VLLM_NWOR_DEBUG=1 python tools/profiling/run_nwor_microbench.py ...
Summary:
- NWOR proven functional: 92 windows, 2024 draft tokens, 205 committed
- ~90% write savings from rejected tokens (1819 avoided writes)
- Zero metrics mystery solved: harness instrumentation artifact
- SCV vectorized path stable across all modes
- Phase 0 complete: production ready

Debug run proves end-to-end functionality with EAGLE spec decode.
Initial baseline zeros were due to metrics isolation between engine
instances, not implementation bugs.
@yuz207 yuz207 changed the title Safe per-key CUDA-graph chunk capture for SCV/NWOR with gating NWOR/SCV: safe per-key CUDA-graph capture with gating and profiling Oct 19, 2025
@yuz207 yuz207 force-pushed the performance-fixes branch 4 times, most recently from 5c61860 to 3d14814 Compare October 19, 2025 01:54
- Add fix_ncu_permissions.sh for NCU permission management
- Add tools/profiling/post_process_ncu.py for NCU data analysis
- Add vllm/v1/sample/random_utils.py for random sampling utilities
- Remove obsolete SCV baseline files
@yuz207 yuz207 force-pushed the performance-fixes branch from 3d14814 to 662e918 Compare October 19, 2025 02:08
Previously, newly captured graph entries would immediately call replay()
which could fail and cause the entry to be removed from the cache even
though capture succeeded. This left the cache empty.

Now newly captured entries use their mask buffer directly without replay,
while cached entries call replay() as expected. Also broadened exception
handling from RuntimeError to Exception to catch all graph failures.
@yuz207 yuz207 changed the title NWOR/SCV: safe per-key CUDA-graph capture with gating and profiling NWOR/SCV: safe chunked CUDA-graph capture with gating and profiling Oct 19, 2025
This commit implements five correctness-preserving optimizations that
reduce GPU-CPU synchronization overhead in speculative decoding paths
without changing behavior. Estimated total speedup: 5-11ms per decode step.

Optimization #1: Batch mask sum operations (⭐⭐⭐)
- Before: N GPU-CPU syncs (one per request) via .sum().item() in loop
- After: Single batched sync via torch.stack().cpu() for all requests
- Impact: Reduces 4-8ms overhead to ~0.5ms for typical batch sizes
- Locations: Lines 2712-2740 (SCV path), 2757-2829 (fallback path)
- Safety: Guards against empty sum_tensors to prevent stacking errors

Optimization #2: Eliminate CPU transfer in SCV cache key (⭐⭐⭐)
- Before: cu_int32.cpu().tolist() forces GPU->CPU sync on every SCV call
- After: Use itertools.accumulate() to compute cumsum directly on CPU
- Impact: Removes 0.5-2ms overhead per SCV call, even for cache hits
- Location: Lines 2893-2900
- Safety: Uses spec_decode_metadata.num_draft_tokens (already CPU list)

Optimization #3: Combine device/dtype conversions (⭐⭐)
- Before: Two sequential .to() calls launch two separate kernels
- After: Single .to(device=..., dtype=...) launches one combined kernel
- Impact: 2x faster conversions (~0.3ms saved)
- Locations: Lines 2749-2750, 2882-2883
- Safety: PyTorch API guarantees identical behavior for combined .to()

Optimization #4: Hoist device/dtype checks outside loop (⭐⭐)
- Before: Per-request device/dtype checks and conversions inside loop
- After: Single conversion before loop (tensor slices inherit properties)
- Impact: Eliminates 0.1-0.5ms per-request overhead
- Location: Lines 2771-2772 (moved from inside loop at 2782-2785)
- Safety: PyTorch guarantees all rows share parent tensor's device/dtype

Optimization #5: Cache _nwor_debug lookup (⭐)
- Before: Duplicate getattr() calls at lines 2640 and 2644
- After: Single lookup cached in local variable
- Impact: Negligible performance, cleaner code
- Location: Line 2639
- Safety: Trivial refactor with identical semantics

All optimizations maintain exact correctness while eliminating redundant
GPU-CPU synchronization points and duplicate kernel launches. No changes
to NWOR/SCV algorithms or numerical results.
…ensive cache check

Issue #1: Replace encoder cache assertion with explicit exception (line 2172)
- Before: assert encoder_output is not None, f"Encoder cache miss..."
- After: if encoder_output is None: raise ValueError(...)
- Rationale: Assertions can be disabled with python -O, making them
  unsuitable for runtime validation. Explicit exceptions ensure the
  cache miss is always caught, even in optimized mode.
- Impact: Improves robustness with zero behavior change in normal execution

Issue #2: Add defensive check to cache eviction (line 457)
- Before: if len(cache) < max_entries: return
- After: if not cache or len(cache) < max_entries: return
- Rationale: Prevents ValueError from min() when cache is empty and
  max_entries=0. Though current code always uses max_entries=32, this
  defensive check prevents potential edge case failures.
- Impact: Improves code robustness at zero runtime cost

Both fixes are purely defensive - they don't change behavior in normal
operation but prevent potential issues in edge cases or when assertions
are disabled.
@yuz207 yuz207 changed the title NWOR/SCV: safe chunked CUDA-graph capture with gating and profiling NWOR/SCV: gated CUDA-graph capture with per-key caching and profiling Oct 19, 2025
This commit addresses the remaining issues found in the comprehensive
end-to-end audit, preparing the code for vLLM PR submission.

## Correctness Fix

**Add input validation for draft_token_ids shape** (gpu_model_runner.py:2716-2724)
- Validates spec_decode_metadata.draft_token_ids.shape[0] == sum(num_draft_tokens)
- Prevents cryptic tensor shape errors if scheduler provides inconsistent metadata
- Returns all-zeros gracefully with clear error log instead of crashing mid-loop
- Defensive programming - should never trigger with correct scheduler

## Code Quality Improvements

**Remove duplicate import** (gpu_model_runner.py:2917)
- Removed inline `import itertools` (already imported at top of file)
- Follows PEP 8 import conventions

**Remove dead code** (gpu_model_runner.py:806)
- Removed unused `self._scv_graph_executor = None` leftover from refactoring
- Cleaner codebase

**Extract magic number to constant** (gpu_model_runner.py:465, 2941-2943)
- Defined `_SCV_GRAPH_CACHE_MAX_SIZE = 32` as class constant
- Self-documenting, easier to tune for different workloads

**Remove redundant defensive check** (gpu_model_runner.py:819-820)
- Removed `hasattr(self, "_scv_mode")` check in hot path
- `_scv_mode` is always set in __init__, check is unnecessary
- Micro-optimization in method called every decode step

**Fix metrics calculation** (deferred.py:415-428)
- Changed from counting writes (committed_total) to counting accepted tokens
- Before: rejected = expected - (writes across all layers) → often negative
- After: rejected = expected - sum(accepted_counts) → correct semantics
- Fixes misleading metrics without affecting correctness

## Documentation

**Add comprehensive docstring** (gpu_model_runner.py:2699-2710)
- Documents _compute_nwor_acceptance parameters, return values, and behavior
- Improves code maintainability for future contributors

---

All changes are correctness-preserving except the defensive validation guard,
which prevents crashes from malformed scheduler metadata. Code is now
production-ready for vLLM PR submission.
@yuz207 yuz207 changed the title NWOR/SCV: gated CUDA-graph capture with per-key caching and profiling NWOR/SCV: Safe chunk capture with CUDA graphs Oct 19, 2025
yuz207 and others added 15 commits October 20, 2025 03:08
…-vectorization-godcn2

Vectorize NWOR commit: per-layer staged tokens in DeferredWriteManager
Reduce per-window overhead through targeted optimizations:

1. Remove redundant dtype conversion in _slice_scale()
   - Caller guarantees int64 indices, eliminating 52 checks per window

2. Remove redundant _ensure_int32_slots() in full acceptance path
   - slot_mapping already ensured int32/contiguous during staging

3. Cache key_cache/value_cache storage check
   - All layers in same forward pass share cache properties
   - Check once per window instead of 52 times

4. Cache full_window flag
   - Compute during staging, avoiding 52 comparisons at commit

5. Cache os.getenv() result
   - Read debug flag once at initialization instead of per window

All optimizations preserve correctness and are based on verified
invariants. Expected reduction: ~1.1ms per window (~6% improvement).
The _in_restricted_context() check in stage_layer() is redundant because:
1. begin_window() already checks and returns False if in restricted context
2. stage_layer() guards with _window_active which can only be True if begin_window() succeeded
3. Main model CUDA graph is explicitly disabled when NWOR is active (gpu_model_runner.py:3421-3430)
4. SCV graph capture happens after forward pass completes, not during stage_layer() execution

This removes 26 redundant CUDA API calls per NWOR window, saving ~0.3-1.3ms overhead.
@yuz207 yuz207 merged commit 84ff352 into main Oct 20, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant