SGLang Roadmap — 2025 Q4
Contributions and feedback are welcome. Join Slack.
Focus
- Feature compatibility & reliability: Full compatibility and production-level reliability across P/D disaggregation, all parallelisms, speculative decoding, HiCache, and load balancing.
- Usability: Easy installation on NV/AMD/TPU/CPU; simple large-scale deployment (k8s, OME).
- Kernel optimization for next-gen hardware (GB300/GB200, B300/B200, MI350/MI355, TPU).
- Reinforcement learning framework integration and training-inference mismatch mitigation.
Base Engine Features
- Overlap scheduler compatibility with speculative decoding and all features
  PoC: @hnyls2002
  Slack: #spec-decoding
  Issue: [Feature] Overlap Spec Support #11762
- Prefill CUDA graph
  PoC: @Oasis-Git @ispobock @BBuf
  Slack: #piecewise-cuda-graph
  Issue: [Feature] Roadmap for Prefill (Piecewise) CUDA Graph #11490
- Memory cache v2 refactor
  PoC: @cctry @xiezhq-hermann
  Slack: #prefix-cache, #kv-cache-store
  Issue: [Feature] Memory Cache System Refactoring Road Map (Mem Cache V2) #12587
- Torch compile stack (needs PoC)
  Slack: #torch-compile
  PR: [WIP] Support torch compile based pass manager framework #10987
  Issue: [RFC] SGLang unified kernel fusion and torch compile optimisations #10118
- Mixed chunked prefill refactor
  PoC: @hzh0425 @yizhang2077
  Issue: (unify compatibility) coming soon
Parallelism
- Pipeline parallelism refactor for long-context prefill and high-throughput decoding
  PoC: @ShangmingCai
  Slack: #pipeline-parallel
  Issue: [Roadmap] Pipeline parallelism refactoring roadmap #11857
- Expert parallelism refactor
  PoC: @ch-wan
  Slack: #expert-parallel
  Issue: [Roadmap] MoE Refactor #8715
  Elastic parallel PRs: [1/N] Introduce Mooncake Backend and Mooncake EP to Support Elastic EP #10423, [4/N] Elastic EP support deepep backend #11837
- Context parallelism
  Candidate issues/PRs
- Compatibility goals
  - All parallelisms + speculative decoding
  - All parallelisms + PD disaggregation
  - Multiple load balancing strategies for DP attention/system (minimal tokens, shortest queue): DP: support piggyback server load report #11469
- GB200/GB300 NVL72 optimizations
  PoC: @Fridge003 @fzyzcjy
  Slack: #deepseek-large-scale-serving
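The two DP load-balancing strategies named above, "shortest queue" and "minimal tokens", can be sketched as simple selection policies. This is a minimal illustration only; the `Replica` type and its fields are assumptions for the example, not SGLang's actual router API.

```python
# Hedged sketch of two DP load-balancing policies: route by fewest queued
# requests ("shortest queue") or by least outstanding token work
# ("minimal tokens"). The Replica type is illustrative, not SGLang's API.
from dataclasses import dataclass


@dataclass
class Replica:
    name: str
    queued_requests: int   # requests waiting in this replica's queue
    in_flight_tokens: int  # tokens currently scheduled on this replica


def pick_shortest_queue(replicas):
    # Route to the replica with the fewest queued requests.
    return min(replicas, key=lambda r: r.queued_requests)


def pick_minimal_tokens(replicas):
    # Route to the replica with the least outstanding token work.
    return min(replicas, key=lambda r: r.in_flight_tokens)


replicas = [
    Replica("dp0", queued_requests=3, in_flight_tokens=9000),
    Replica("dp1", queued_requests=1, in_flight_tokens=12000),
]
print(pick_shortest_queue(replicas).name)  # dp1
print(pick_minimal_tokens(replicas).name)  # dp0
```

The two policies can disagree, as here: `dp1` has the shorter queue but more in-flight token work, which is why the roadmap tracks multiple strategies (and a piggybacked server load report, #11469) rather than a single heuristic.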
Server Reliability
- Illegal memory access fixes. [Bug] illegal memory access / illegal instruction / memory leak #11968
- Runtime memory/paging checker.
- Grammar crash fault tolerance.
- Server crash fault tolerance.
Kernel
- Integrate Blackwell fp4/fp8 attention, gemm, group gemm kernels (flashinfer)
  Slack: #flashinfer-kernels
- Tune FP8 gemm in Cutlass
  Slack: #kernel-dev
- Communication kernel work
  Slack: #kernel-dev
  - NCCL symmetric memory (PRs: Add support for NCCL symmetric memory for TP allreduces #8238, Register allgather/reducescatter buffers with symm memory #12572)
  - Overlap TP communication with compute (e.g., [WIP] Support TP overlap #9058)
  - Integrate additional A2A kernels (e.g., pplx)
- Automated nightly fusion detection
  Workflow: https://github.com/sgl-project/sglang/actions/runs/19004823026
  Slack: #ci-cd-build-release
Speculative Decoding
- General speculative algorithm abstraction to support multiple algorithms
- Hybrid algorithm combining Eagle and ngram
- Adaptive algorithm that adjusts speculative parameters during runtime
- Slack: #spec-decoding
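The adaptive item above can be illustrated with a toy controller that tunes the draft length from the observed acceptance rate. The update rule, thresholds, and class name here are assumptions for illustration, not SGLang's actual algorithm.

```python
# Hedged sketch: one way an adaptive speculative decoder could adjust its
# draft length at runtime based on acceptance rate. Thresholds and the
# update rule are illustrative, not SGLang's implementation.
class AdaptiveSpecConfig:
    def __init__(self, draft_len=4, min_len=1, max_len=8):
        self.draft_len = draft_len
        self.min_len = min_len
        self.max_len = max_len

    def update(self, accepted: int, drafted: int) -> None:
        # Acceptance rate over the last verification step.
        rate = accepted / max(drafted, 1)
        if rate > 0.8 and self.draft_len < self.max_len:
            self.draft_len += 1   # drafts mostly accepted: draft deeper
        elif rate < 0.4 and self.draft_len > self.min_len:
            self.draft_len -= 1   # drafts mostly rejected: draft less


cfg = AdaptiveSpecConfig()
cfg.update(accepted=4, drafted=4)  # high acceptance -> longer drafts
print(cfg.draft_len)  # 5
cfg.update(accepted=1, drafted=5)  # low acceptance -> shorter drafts
print(cfg.draft_len)  # 4
```

A real adaptive policy would likely smooth the acceptance rate over many steps and weigh verification cost, but the feedback loop is the same shape.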
PD Disaggregation
- Support radix cache on decode engines
- Refactor scheduler loop to reuse more code
- More plans: [Roadmap] Distributed Serving Enhancement on 2025 H2 #8210
- Auto scaling in OME
- Comprehensive NIXL and Dynamo integration
- Slack: #pd-disaggregation
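The "radix cache on decode engines" item rests on longest-prefix lookup in a prefix tree. The toy sketch below shows only that lookup; a real implementation keys on token IDs and stores KV block handles rather than plain nodes, and none of these names are from SGLang's codebase.

```python
# Hedged sketch: the longest-prefix lookup at the core of a radix
# (prefix-tree) KV cache. This toy version only counts how many leading
# tokens of a request are already cached; the types are illustrative.
class RadixNode:
    def __init__(self):
        self.children = {}


class RadixCache:
    def __init__(self):
        self.root = RadixNode()

    def insert(self, tokens):
        # Cache the KV path for a finished sequence of token IDs.
        node = self.root
        for t in tokens:
            node = node.children.setdefault(t, RadixNode())

    def match_prefix(self, tokens):
        # Length of the longest cached prefix of `tokens`.
        node, matched = self.root, 0
        for t in tokens:
            if t not in node.children:
                break
            node = node.children[t]
            matched += 1
        return matched


cache = RadixCache()
cache.insert([1, 2, 3, 4])
print(cache.match_prefix([1, 2, 3, 9]))  # 3 tokens reusable
print(cache.match_prefix([7, 8]))        # 0: no shared prefix
```

On a decode engine the matched prefix corresponds to KV blocks that need not be re-transferred from the prefill side, which is what makes this item a throughput win for PD disaggregation.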
KV Cache System & Memory Pool
- HiCache for hybrid and sparse LLMs
  PoC: @xiezhq-hermann
  Issue: [Feature] HiCache for Hybrid and Sparse LLMs #12826
  Slack: #kv-cache-store
- Sparse attention and KV cache scheduler for GPU/CPU
  PR: [Feature][WIP] Support Sparse Attention and KV cache scheduling between CPU and GPU for GQA/DSA. #11191
Diffusion (Multimodal Generation)
- PoC: @mickqian
- Roadmap: [Roadmap] Diffusion (2025 Q4) #12799
- Slack: #diffusion
Multimodal Models
- Day-0 support for major models; add more OCR models
  Contributors: @mick @JustinTong0323 @yuan-luo
- Performance improvements: better prefix & embedding cache
- Faster CUDA IPC in MQ for large video/images
  PR: [FEAT] Shared mem pool based cuda ipc for multi-modal data transport #11917
- Slack: #multi-modal
Quantization
- General support for various quantization formats
  Issue: [Roadmap] Quantization Support #8180
- ModelOpt support
  PoC: @Edwardf0t1
  Slack: #modelopt
- Communication quantization (fp4/fp8 allreduce/allgather/alltoall)
  Slack: #quantization
Multi-LoRA Serving
- Major roadmap: [Roadmap] Lora Support #2929
  PoC: @Fridge003
- OpenAI-compatible APIs
  PR: [FEATURE] Add OpenAI-Compatible LoRA Adapter Selection #11570
- LoRA for speculative decoding
  PR: Support spec decoding when LoRA is applied to target model #12903
  Contributors: @ConnorLi96 @lifuhuang
- Async LoRA prefetch
  Issue: [Feature] Asynchronous LoRA prefetch #8712
  Contributors: @ConnorLi96 @lifuhuang
- LoRA for MoE layers
  Issue: [Feature] Comprehensive LoRA Adapter Support for MOE Models #11894
  Contributors: @ConnorLi96 @Jonahcb
- Slack: #lora
RL Framework Integration
- AReaL, slime, verl integration (sorted alphabetically)
- Customized weight refitting from RDMA, etc @zhaochenyang20 @JD-ETH
- Open recipe of large-scale MoE training (Deepseek/Kimi/GLM) + GRPO training
- Systematic and algorithmic mitigation of training-inference mismatch: @zhaochenyang20 @fzyzcjy @Fridge003 @zyzshishui
- Support SGLang Gateway as the DP scheduler for rollout in RL frameworks
- Tinker-like serverless RL APIs: @zhaochenyang20
- Native NVFP8 training: @GeLee-Q @xieck13 @fy1214
- VLM RL with FSDP: @nanjiangwill @minleminzui
- Speculative training: @guapisolo
- Slack: #reinforcement-learning, #slime-rl-framework
Hardware
- AMD roadmap (2025 Q4): @HaiShaw
- TPU roadmap (2025 Q4)
- NPU roadmap (2025 Q4): @iforgetmyname @ZhengdQin (coming soon)
- Intel CPU/XPU roadmap (2025 Q4)
- Better multi-backend abstraction: @Alcanderian
Model Coverage
- Day-0 model support for all major models
PoC: @wisclmy0611 @JustinTong0323
Slack: #dev
Model Gateway & API Layer
- Support multimodality and the image processor in gRPC mode
- Support PII detection and a classify API for classifying the intent and complexity of the input
- Semantic routing support
- Allow the Gateway to actively listen to SGLang server KV cache events to make better routing decisions in gRPC mode
- Allow the SGLang server to start with both gRPC and HTTP servers
- Model Gateway terminal UI
- Reactive UI to launch workers remotely, supporting both local and remote machines
- Natively support the Anthropic Messages API in gRPC mode instead of wrapping around chat completions
- Gateway SDK supporting Go, Python, and Node.js for every Rust crate (policies, tokenizer, parsers, etc.)
- Metrics enhancements, including tracing and model-specific metrics (TTFT, TPOT, etc.)
- PoC: @slin1237 @CatherineSue
  Issue: SGLang Autonomous Model Gateway Roadmap #13098
  Slack: #router-sig
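The model-specific metrics named in the gateway list, TTFT (time to first token) and TPOT (time per output token), reduce to simple timestamp arithmetic. This is a minimal sketch assuming per-request timestamps are available; the function and field names are illustrative, not the gateway's metrics API.

```python
# Hedged sketch: computing TTFT and TPOT from request timestamps
# (seconds). Names are illustrative, not the gateway's metrics API.
def ttft(request_start: float, first_token_time: float) -> float:
    # Time to first token: prefill latency as seen by the client.
    return first_token_time - request_start


def tpot(first_token_time: float, last_token_time: float,
         num_output_tokens: int) -> float:
    # Average inter-token latency over the decode phase; the first
    # token is attributed to TTFT, so divide by (n - 1).
    if num_output_tokens <= 1:
        return 0.0
    return (last_token_time - first_token_time) / (num_output_tokens - 1)


print(ttft(0.0, 0.25))        # 0.25 s to first token
print(tpot(0.25, 2.25, 101))  # 0.02 s per output token
```

In practice the gateway would export these as histograms per model, since tail latency matters more than the mean for routing decisions.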
CI / Release / Maintenance
- Improve CI monitor workflow
  - Automatically track accuracy & performance metrics in a standard format
  - Regression detection & alerts
- Improve nightly tests
  - Add more models (Deepseek, GPT-OSS, Qwen3-next)
- Full feature-coverage CI with all combinations (every two days)
- Slack: #ci-cd-build-release, #help-desk