SGLang Roadmap — 2025 Q4
Contributions and feedback are welcome. Join Slack.
Focus
- Feature compatibility & reliability: Full compatibility and production-level reliability across P/D disaggregation, all parallelisms, speculative decoding, HiCache, and load balancing.
- Usability: Easy installation on NV/AMD/TPU/CPU; simple large-scale deployment (k8s, OME).
- Kernel optimization for next-gen hardware (GB300/GB200, B300/B200, MI350/MI355, TPU).
- Reinforcement learning framework integration and training-inference mismatch mitigation.
Base Engine Features
- Overlap scheduler compatibility with speculative decoding and all features
  PoC: @hnyls2002
  Slack: #spec-decoding
  Issue: [Feature] Overlap Spec Support #11762
- Prefill CUDA graph
  PoC: @Oasis-Git @ispobock @BBuf
  Slack: #piecewise-cuda-graph
  Issue: [Feature] Roadmap for Prefill (Piecewise) CUDA Graph #11490
- Memory cache v2 refactor
  PoC: @cctry @xiezhq-hermann
  Slack: #prefix-cache, #kv-cache-store
  Issue: [Feature] Memory Cache System Refactoring Road Map (Mem Cache V2) #12587
- Torch compile stack (needs PoC)
  Slack: #torch-compile
  PR: [WIP] Support torch compile based pass manager framework #10987
  Issue: [RFC] SGLang unified kernel fusion and torch compile optimisations #10118
- Mixed chunked prefill refactor
  PoC: @hzh0425 @yizhang2077
  Issue: (unify compatibility) coming soon
Parallelism
- Pipeline parallelism refactor for long-context prefill and high-throughput decoding
  PoC: @ShangmingCai
  Slack: #pipeline-parallel
  Issue: [Roadmap] Pipeline parallelism refactoring roadmap #11857
- Expert parallelism refactor
  PoC: @ch-wan
  Slack: #expert-parallel
  Issue: [Roadmap] MoE Refactor #8715
  Elastic parallel PRs: [1/N] Introduce Mooncake Backend and Mooncake EP to Support Elastic EP #10423, [4/N] Elastic EP support deepep backend #11837
- Context parallelism
  Candidate issues/PRs
- Compatibility goals
  - All parallelisms + speculative decoding
  - All parallelisms + PD disaggregation
  - Multiple load balancing strategies for DP attention/system (minimal tokens, shortest queue): DP: support piggyback server load report #11469
- GB200/GB300 NVL72 optimizations
  PoC: @Fridge003 @fzyzcjy
  Slack: #deepseek-large-scale-serving
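The two DP load-balancing strategies named above, "shortest queue" and "minimal tokens", can be sketched as simple selection policies. This is a minimal illustration only; the `Replica` type and its fields are assumptions for the example, not SGLang's actual router API.

```python
# Hedged sketch of two DP load-balancing policies: route by fewest queued
# requests ("shortest queue") or by least outstanding token work
# ("minimal tokens"). The Replica type is illustrative, not SGLang's API.
from dataclasses import dataclass


@dataclass
class Replica:
    name: str
    queued_requests: int   # requests waiting in this replica's queue
    in_flight_tokens: int  # tokens currently scheduled on this replica


def pick_shortest_queue(replicas):
    # Route to the replica with the fewest queued requests.
    return min(replicas, key=lambda r: r.queued_requests)


def pick_minimal_tokens(replicas):
    # Route to the replica with the least outstanding token work.
    return min(replicas, key=lambda r: r.in_flight_tokens)


replicas = [
    Replica("dp0", queued_requests=3, in_flight_tokens=9000),
    Replica("dp1", queued_requests=1, in_flight_tokens=12000),
]
print(pick_shortest_queue(replicas).name)  # dp1
print(pick_minimal_tokens(replicas).name)  # dp0
```

The two policies can disagree, as here: `dp1` has the shorter queue but more in-flight token work, which is why the roadmap tracks multiple strategies (and a piggybacked server load report, #11469) rather than a single heuristic.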
Server Reliability
- Illegal memory access fixes. [Bug] illegal memory access / illegal instruction / memory leak #11968
- Runtime memory/paging checker.
- Grammar crash fault tolerance.
- Server crash fault tolerance.
Kernel
- Integrate Blackwell fp4/fp8 attention, gemm, group gemm kernels (flashinfer)
  Slack: #flashinfer-kernels
- Tune FP8 gemm in Cutlass
  Slack: #kernel-dev
- Communication kernel work
  Slack: #kernel-dev
  - NCCL symmetric memory (PRs: Add support for NCCL symmetric memory for TP allreduces #8238, Register allgather/reducescatter buffers with symm memory #12572)
  - Overlap TP communication with compute (e.g., [WIP] Support TP overlap #9058)
  - Integrate additional A2A kernels (e.g., pplx)
- Automated nightly fusion detection
  Workflow: https://github.com/sgl-project/sglang/actions/runs/19004823026
  Slack: #ci-cd-build-release
Speculative Decoding
- General speculative algorithm abstraction to support multiple algorithms
- Hybrid algorithm combining Eagle and ngram
- Adaptive algorithm that adjusts speculative parameters during runtime
- Slack: #spec-decoding
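The adaptive item above can be illustrated with a toy controller that tunes the draft length from the observed acceptance rate. The update rule, thresholds, and class name here are assumptions for illustration, not SGLang's actual algorithm.

```python
# Hedged sketch: one way an adaptive speculative decoder could adjust its
# draft length at runtime based on acceptance rate. Thresholds and the
# update rule are illustrative, not SGLang's implementation.
class AdaptiveSpecConfig:
    def __init__(self, draft_len=4, min_len=1, max_len=8):
        self.draft_len = draft_len
        self.min_len = min_len
        self.max_len = max_len

    def update(self, accepted: int, drafted: int) -> None:
        # Acceptance rate over the last verification step.
        rate = accepted / max(drafted, 1)
        if rate > 0.8 and self.draft_len < self.max_len:
            self.draft_len += 1   # drafts mostly accepted: draft deeper
        elif rate < 0.4 and self.draft_len > self.min_len:
            self.draft_len -= 1   # drafts mostly rejected: draft less


cfg = AdaptiveSpecConfig()
cfg.update(accepted=4, drafted=4)  # high acceptance -> longer drafts
print(cfg.draft_len)  # 5
cfg.update(accepted=1, drafted=5)  # low acceptance -> shorter drafts
print(cfg.draft_len)  # 4
```

A real adaptive policy would likely smooth the acceptance rate over many steps and weigh verification cost, but the feedback loop is the same shape.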
PD Disaggregation
- Support radix cache on decode engines
- Refactor scheduler loop to reuse more code
- More plans: [Roadmap] Distributed Serving Enhancement on 2025 H2 #8210
- Auto scaling in OME
- Comprehensive NIXL and Dynamo integration
- Slack: #pd-disaggregation
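The "radix cache on decode engines" item rests on longest-prefix lookup in a prefix tree. The toy sketch below shows only that lookup; a real implementation keys on token IDs and stores KV block handles rather than plain nodes, and none of these names are from SGLang's codebase.

```python
# Hedged sketch: the longest-prefix lookup at the core of a radix
# (prefix-tree) KV cache. This toy version only counts how many leading
# tokens of a request are already cached; the types are illustrative.
class RadixNode:
    def __init__(self):
        self.children = {}


class RadixCache:
    def __init__(self):
        self.root = RadixNode()

    def insert(self, tokens):
        # Cache the KV path for a finished sequence of token IDs.
        node = self.root
        for t in tokens:
            node = node.children.setdefault(t, RadixNode())

    def match_prefix(self, tokens):
        # Length of the longest cached prefix of `tokens`.
        node, matched = self.root, 0
        for t in tokens:
            if t not in node.children:
                break
            node = node.children[t]
            matched += 1
        return matched


cache = RadixCache()
cache.insert([1, 2, 3, 4])
print(cache.match_prefix([1, 2, 3, 9]))  # 3 tokens reusable
print(cache.match_prefix([7, 8]))        # 0: no shared prefix
```

On a decode engine the matched prefix corresponds to KV blocks that need not be re-transferred from the prefill side, which is what makes this item a throughput win for PD disaggregation.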
KV Cache System & Memory Pool
- HiCache for hybrid and sparse LLMs
  PoC: @xiezhq-hermann
  Issue: [Feature] HiCache for Hybrid and Sparse LLMs #12826
  Slack: #kv-cache-store
- Sparse attention and KV cache scheduler for GPU/CPU
  PR: [Feature][WIP] Support Sparse Attention and KV cache scheduling between CPU and GPU for GQA/DSA. #11191
Diffusion (Multimodal Generation)
- PoC: @mickqian
- Roadmap: [Roadmap] Diffusion (2025 Q4) #12799
- Slack: #diffusion
Multimodal Models
- Day-0 support for major models; add more OCR models
  Contributors: @mick @JustinTong0323 @yuan-luo
- Performance improvements: better prefix & embedding cache
- Faster CUDA IPC in MQ for large video/images
  PR: [FEAT] Shared mem pool based cuda ipc for multi-modal data transport #11917
- Slack: #multi-modal
Quantization
- General support for various quantization formats
  Issue: [Roadmap] Quantization Support #8180
- ModelOpt support
  PoC: @Edwardf0t1
  Slack: #modelopt
- Communication quantization (fp4/fp8 allreduce/allgather/alltoall)
  Slack: #quantization
Multi-LoRA Serving
- Major roadmap: [Roadmap] Lora Support #2929
  PoC: @Fridge003
- OpenAI-compatible APIs
  PR: [FEATURE] Add OpenAI-Compatible LoRA Adapter Selection #11570
- LoRA for speculative decoding
  PR: Support spec decoding when LoRA is applied to target model #12903
  Contributors: @ConnorLi96 @lifuhuang
- Async LoRA prefetch
  Issue: [Feature] Asynchronous LoRA prefetch #8712
  Contributors: @ConnorLi96 @lifuhuang
- LoRA for MoE layers
  Issue: [Feature] Comprehensive LoRA Adapter Support for MOE Models #11894
  Contributors: @ConnorLi96 @Jonahcb
- Slack: #lora
RL Framework Integration
- AReaL, slime, verl integration (sorted alphabetically)
- Customized weight refitting from RDMA, etc @zhaochenyang20 @JD-ETH
- Open recipe of large-scale MoE training (Deepseek/Kimi/GLM) + GRPO training
- Systematic and algorithmic mitigation of training-inference mismatch: @zhaochenyang20 @fzyzcjy @Fridge003 @zyzshishui
- Support SGLang Gateway as the DP scheduler for rollout in RL frameworks
- Tinker-like serverless RL APIs: @zhaochenyang20
- Native NVFP8 training: @GeLee-Q @xieck13 @fy1214
- VLM RL with FSDP: @nanjiangwill @minleminzui
- Speculative training: @guapisolo
- Slack: #reinforcement-learning, #slime-rl-framework
Hardware
- AMD roadmap (2025 Q4): @HaiShaw
- TPU roadmap (2025 Q4)
- NPU roadmap (2025 Q4): @iforgetmyname @ZhengdQin (coming soon)
- Intel CPU/XPU roadmap (2025 Q4)
- Better multi-backend abstraction: @Alcanderian
Model Coverage
- Day-0 model support for all major models
PoC: @wisclmy0611 @JustinTong0323
Slack: #dev
Model Gateway & API Layer
- Support multimodality and the image processor in gRPC mode
- Support PII detection and a classify API for classifying the intent and complexity of the input
- Semantic routing support
- Allow the Gateway to actively listen to SGLang server KV cache events to make better routing decisions in gRPC mode
- Allow the SGLang server to start with both gRPC and HTTP servers
- Model Gateway terminal UI
- Reactive UI to launch workers remotely, supporting both local and remote machines
- Natively support the Anthropic Messages API in gRPC mode instead of wrapping around chat completions
- Gateway SDK supporting Go, Python, and Node.js for every Rust crate (policies, tokenizer, parsers, etc.)
- Metrics enhancements, including tracing and model-specific metrics (TTFT, TPOT, etc.)
- PoC: @slin1237 @CatherineSue
  Issue: SGLang Autonomous Model Gateway Roadmap #13098
  Slack: #router-sig
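The model-specific metrics named in the gateway list, TTFT (time to first token) and TPOT (time per output token), reduce to simple timestamp arithmetic. This is a minimal sketch assuming per-request timestamps are available; the function and field names are illustrative, not the gateway's metrics API.

```python
# Hedged sketch: computing TTFT and TPOT from request timestamps
# (seconds). Names are illustrative, not the gateway's metrics API.
def ttft(request_start: float, first_token_time: float) -> float:
    # Time to first token: prefill latency as seen by the client.
    return first_token_time - request_start


def tpot(first_token_time: float, last_token_time: float,
         num_output_tokens: int) -> float:
    # Average inter-token latency over the decode phase; the first
    # token is attributed to TTFT, so divide by (n - 1).
    if num_output_tokens <= 1:
        return 0.0
    return (last_token_time - first_token_time) / (num_output_tokens - 1)


print(ttft(0.0, 0.25))        # 0.25 s to first token
print(tpot(0.25, 2.25, 101))  # 0.02 s per output token
```

In practice the gateway would export these as histograms per model, since tail latency matters more than the mean for routing decisions.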
CI / Release / Maintenance
- Improve CI monitor workflow
  - Automatically track accuracy & performance metrics in a standard format
  - Regression detection & alerts
- Improve nightly tests
  - Add more models (Deepseek, GPT-OSS, Qwen3-next)
- Full feature-coverage CI with all combinations (every two days)
- Slack: #ci-cd-build-release, #help-desk