Big thanks to the DeepSeek team for their awesome work! Recently, large-scale fine-grained MoE models have been gaining popularity, but they also bring new optimization challenges (and opportunities) for LLM inference systems. One key technique to make models like DeepSeek V3/R1 run efficiently is large-scale EP (Expert Parallelism) – it not only leverages aggregated memory bandwidth to reduce latency but also helps maximize compute utilization.
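For readers less familiar with the idea, here is a minimal sketch of how expert parallelism shards one MoE layer's experts across GPUs; the expert count, EP size, static contiguous placement, and example routing below are illustrative assumptions, not TensorRT-LLM code:

```python
# Illustrative sketch only: expert parallelism shards the experts of one MoE
# layer across GPUs, and each token is sent to the ranks owning its experts.
NUM_EXPERTS = 256   # a fine-grained MoE layer in the spirit of DeepSeek V3/R1
EP_SIZE = 32        # GPUs participating in expert parallelism

# Static, contiguous placement: each rank holds NUM_EXPERTS / EP_SIZE experts,
# so expert weights (and their memory bandwidth) are spread over all GPUs.
EXPERTS_PER_RANK = NUM_EXPERTS // EP_SIZE

def owner_rank(expert_id: int) -> int:
    """Map a global expert id to the EP rank holding its weights."""
    return expert_id // EXPERTS_PER_RANK

# After top-k routing, each token knows which experts it needs; an all-to-all
# then carries its hidden state to the owning ranks.
example_routing = [(0, 3), (0, 200), (1, 17), (1, 64)]  # (token_id, expert_id)
for token_id, expert_id in example_routing:
    print(f"token {token_id} -> expert {expert_id} on EP rank {owner_rank(expert_id)}")
```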
Getting large-scale EP to work well isn't easy, and we really appreciate the DeepSeek team sharing their insights and optimization tricks through both their tech report and open-source code (DeepEP and EPLB). Shoutout to the SGLang team too, who recently did great work implementing large-scale EP using DeepSeek's components plus their own innovations!
On the TensorRT-LLM side, we've been working on large-scale EP support for a while. Our approach might differ slightly from other solutions – we're particularly focused on supporting NVIDIA's latest hardware (like GB200) as well as other architectures (B200, Hopper, etc.).
We're also putting extra effort into designing an end-to-end system that handles both large-scale EP execution and dynamic workload balancing to adapt to real-time traffic changes, making deployment smoother for users. To be clear, we don't think these ideas are unique to TensorRT-LLM – in fact, we're pretty sure teams like DeepSeek have already implemented similar approaches in their internal systems (judging from their published tech report). We've learned a ton from DeepSeek's paper and code, and we're grateful they've shared their work with the community!
Motivated by DeepSeek's work, and to make TensorRT-LLM's technical execution more transparent and give the community a channel to get engaged in TensorRT-LLM core development at an early stage, we are now sharing our concrete plan for supporting large-scale EP in TensorRT-LLM to gather early feedback. Your comments, suggestions, and contributions are highly appreciated:
- Communication component
  - Customized MoE A2A communication kernels for large-scale EP (see the dispatch/combine sketch after this list)
    - [Done] GB200 support @dongxuy04
    - [Ongoing] B200/Hopper support @Tailing Yuan @jhaotingc @meng Wang
  - Being investigated now: for this specific area there is great work from DeepSeek (DeepEP) and Perplexity (PPLX), and based on our current limited understanding both have pros and cons, so we are not rushing the integration; rather, we are doing more technical due diligence to figure out a reasonable technical solution.
    - [Ongoing] DeepEP integration @yuantailing
- EP balancer component (most of the work for this component can be applied to multiple GPU architectures)
  - [Done] Statistics and routing kernels @dongxuy04
  - [Done] Remapping and synchronization logic @dongxuy04
  - [Done] Replication and placement logic @dongxuy04 (see the placement sketch after this list)
  - [Done] FusedMoE module changes @wm2012011492
  - [Done] Experts loading and sharing @dongxuy04
- E2E workflow integration
  - [Done] Static EP load balancer @syuoni
  - [Ongoing] Static EP load balancer with offline statistics @syuoni
  - [Ongoing] Online EP load balancer and E2E validation @dongxuy04
  - [Ongoing] E2E validation with dis-agg serving @kaiyux
- Performance tuning/analysis/optimization
  - [Ongoing] E2E performance measurement/study @qiaoxj07
  - [Ongoing] Allgather communication (before the A2A communication) optimization @WeiHaocheng
  - [Ongoing] MoE related kernels optimization @syuoni
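To make the "Customized MoE A2A communication kernels" item above more concrete, below is a minimal single-process simulation of the dispatch/compute/combine pattern that MoE all-to-all communication implements; the tensor sizes, random routing, and per-expert "compute" are illustrative placeholders rather than the actual kernels:

```python
import numpy as np

np.random.seed(0)
NUM_TOKENS, HIDDEN, NUM_EXPERTS, TOP_K, EP_SIZE = 16, 8, 8, 2, 4
EXPERTS_PER_RANK = NUM_EXPERTS // EP_SIZE

tokens = np.random.randn(NUM_TOKENS, HIDDEN)
# Top-k routing: each token selects TOP_K distinct experts (uniform weights here).
topk_experts = np.stack([np.random.choice(NUM_EXPERTS, TOP_K, replace=False)
                         for _ in range(NUM_TOKENS)])
topk_weights = np.full((NUM_TOKENS, TOP_K), 1.0 / TOP_K)

# Dispatch: bucket (token, expert) pairs by the EP rank owning the expert.
# In the real system this bucketing is realized by the A2A send.
send_buckets = {rank: [] for rank in range(EP_SIZE)}
for t in range(NUM_TOKENS):
    for k in range(TOP_K):
        e = int(topk_experts[t, k])
        send_buckets[e // EXPERTS_PER_RANK].append((t, k, e))

# "Expert compute" on each rank: a per-expert scale stands in for the FFN.
expert_scale = np.arange(1, NUM_EXPERTS + 1, dtype=float)
partials = []  # (token, k, expert_output) tuples flowing back
for rank, bucket in send_buckets.items():
    for t, k, e in bucket:
        partials.append((t, k, expert_scale[e] * tokens[t]))

# Combine: the reverse A2A plus a weighted reduction per token.
output = np.zeros_like(tokens)
for t, k, y in partials:
    output[t] += topk_weights[t, k] * y

print("combined output shape:", output.shape)
```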
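And for the "Replication and placement logic" item under the EP balancer component, here is one possible way a balancer could turn per-expert token statistics into a replicated expert placement; the synthetic load, spare-slot budget, and greedy heuristic below are our own illustrative assumptions, not the algorithm actually used in TensorRT-LLM:

```python
import heapq
import numpy as np

np.random.seed(0)
NUM_EXPERTS, EP_SIZE, SLOTS_PER_RANK = 16, 4, 5  # 4 spare slots for replicas

# Pretend these per-expert token counts came from the statistics kernels.
expert_load = np.random.zipf(1.5, NUM_EXPERTS).astype(float)

# 1) Replication: hand the spare slots to the currently hottest experts,
#    splitting each expert's load evenly across its replicas.
total_slots = EP_SIZE * SLOTS_PER_RANK
replicas = np.ones(NUM_EXPERTS, dtype=int)
for _ in range(total_slots - NUM_EXPERTS):
    hottest = int(np.argmax(expert_load / replicas))
    replicas[hottest] += 1

# 2) Placement: assign each replica (heaviest first) to the least-loaded rank.
slots = [(expert_load[e] / replicas[e], e)
         for e in range(NUM_EXPERTS) for _ in range(replicas[e])]
slots.sort(reverse=True)
rank_heap = [(0.0, rank, []) for rank in range(EP_SIZE)]  # (load, rank, experts)
heapq.heapify(rank_heap)
for load, e in slots:
    rank_load, rank, assigned = heapq.heappop(rank_heap)
    assigned.append(e)
    heapq.heappush(rank_heap, (rank_load + load, rank, assigned))

for rank_load, rank, assigned in sorted(rank_heap, key=lambda x: x[1]):
    print(f"rank {rank}: load={rank_load:.1f} experts={sorted(assigned)}")
```

An online balancer additionally has to move already-loaded expert weights and keep every rank's routing view consistent while inference keeps running, which is presumably what the remapping/synchronization and expert loading/sharing items above cover.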
To make it easier for the community to understand what we are doing now and what we plan to do, here is the high-level design overview by @dongxuy04 (thanks to Dongxu for the great technical work behind the current design):
We are also considering initiating a detailed design review & discussion with the community if there is enough interest, to help the community better understand the current plan and to encourage community engagement.
Thanks
The TensorRT-LLM Engineering Team