Big thanks to the DeepSeek team for their awesome work! Recently, large-scale fine-grained MoE models have been gaining popularity, but they also bring new optimization challenges (and opportunities) for LLM inference systems. One key technique to make models like DeepSeek V3/R1 run efficiently is large-scale EP (Expert Parallelism) – it not only leverages aggregated memory bandwidth to reduce latency but also helps maximize compute utilization.
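For readers less familiar with the idea, here is a minimal sketch of how expert parallelism shards one MoE layer's experts across GPUs; the expert count, EP size, static contiguous placement, and example routing below are illustrative assumptions, not TensorRT-LLM code:

```python
# Illustrative sketch only: expert parallelism shards the experts of one MoE
# layer across GPUs, and each token is sent to the ranks owning its experts.
NUM_EXPERTS = 256   # a fine-grained MoE layer in the spirit of DeepSeek V3/R1
EP_SIZE = 32        # GPUs participating in expert parallelism

# Static, contiguous placement: each rank holds NUM_EXPERTS / EP_SIZE experts,
# so expert weights (and their memory bandwidth) are spread over all GPUs.
EXPERTS_PER_RANK = NUM_EXPERTS // EP_SIZE

def owner_rank(expert_id: int) -> int:
    """Map a global expert id to the EP rank holding its weights."""
    return expert_id // EXPERTS_PER_RANK

# After top-k routing, each token knows which experts it needs; an all-to-all
# then carries its hidden state to the owning ranks.
example_routing = [(0, 3), (0, 200), (1, 17), (1, 64)]  # (token_id, expert_id)
for token_id, expert_id in example_routing:
    print(f"token {token_id} -> expert {expert_id} on EP rank {owner_rank(expert_id)}")
```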
Getting large-scale EP to work well isn't easy, and we really appreciate the DeepSeek team sharing their insights and optimization tricks through both their tech report and open-source code (DeepEP and EPLB). Shoutout to the SGLang team too, who recently did great work implementing large-scale EP using DeepSeek's components plus their own innovations!
On the TensorRT-LLM side, we've been working on large-scale EP support for a while. Our approach might differ slightly from other solutions – we're particularly focused on supporting NVIDIA's latest hardware (like GB200) as well as other architectures (B200, Hopper, etc.).
We're also putting extra effort into designing an end-to-end system that handles both large-scale EP execution and dynamic workload balancing to adapt to real-time traffic changes, making deployment smoother for users. To be clear, we don't think these ideas are unique to TensorRT-LLM – in fact, we're pretty sure teams like DeepSeek have already implemented similar approaches in their internal systems (judging from their published tech report). We've learned a ton from DeepSeek's paper and code, and we're grateful they've shared their work with the community!
Motivated by DeepSeek's work, and to make TensorRT-LLM's technical execution more transparent and give the community a channel to get engaged in TensorRT-LLM core development at an early stage, we are now sharing our concrete plan for supporting large-scale EP in TensorRT-LLM to gather early feedback. Your comments, suggestions, and contributions are highly appreciated:
- Communication component
  - Customized MoE A2A communication kernels for large-scale EP (see the dispatch/combine sketch after this list)
    - [Done] GB200 support @dongxuy04
    - [Ongoing] B200/Hopper support @Tailing Yuan @jhaotingc @meng Wang
  - Being investigated now: for this specific area there is great work from DeepSeek (DeepEP) and Perplexity (PPLX), and based on our current limited understanding both have pros and cons, so we are not rushing the integration; rather, we are doing more technical due diligence to figure out a reasonable technical solution.
    - [Ongoing] DeepEP integration @yuantailing
- EP balancer component (most of the work for this component can be applied to multiple GPU architectures)
  - [Done] Statistics and routing kernels @dongxuy04
  - [Done] Remapping and synchronization logic @dongxuy04
  - [Done] Replication and placement logic @dongxuy04 (see the placement sketch after this list)
  - [Done] FusedMoE module changes @wm2012011492
  - [Done] Experts loading and sharing @dongxuy04
- E2E workflow integration
  - [Done] Static EP load balancer @syuoni
  - [Ongoing] Static EP load balancer with offline statistics @syuoni
  - [Ongoing] Online EP load balancer and E2E validation @dongxuy04
  - [Ongoing] E2E validation with dis-agg serving @kaiyux
- Performance tuning/analysis/optimization
  - [Ongoing] E2E performance measurement/study @qiaoxj07
  - [Ongoing] Allgather communication (before the A2A communication) optimization @WeiHaocheng
  - [Ongoing] MoE related kernels optimization @syuoni
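To make the "Customized MoE A2A communication kernels" item above more concrete, below is a minimal single-process simulation of the dispatch/compute/combine pattern that MoE all-to-all communication implements; the tensor sizes, random routing, and per-expert "compute" are illustrative placeholders rather than the actual kernels:

```python
import numpy as np

np.random.seed(0)
NUM_TOKENS, HIDDEN, NUM_EXPERTS, TOP_K, EP_SIZE = 16, 8, 8, 2, 4
EXPERTS_PER_RANK = NUM_EXPERTS // EP_SIZE

tokens = np.random.randn(NUM_TOKENS, HIDDEN)
# Top-k routing: each token selects TOP_K distinct experts (uniform weights here).
topk_experts = np.stack([np.random.choice(NUM_EXPERTS, TOP_K, replace=False)
                         for _ in range(NUM_TOKENS)])
topk_weights = np.full((NUM_TOKENS, TOP_K), 1.0 / TOP_K)

# Dispatch: bucket (token, expert) pairs by the EP rank owning the expert.
# In the real system this bucketing is realized by the A2A send.
send_buckets = {rank: [] for rank in range(EP_SIZE)}
for t in range(NUM_TOKENS):
    for k in range(TOP_K):
        e = int(topk_experts[t, k])
        send_buckets[e // EXPERTS_PER_RANK].append((t, k, e))

# "Expert compute" on each rank: a per-expert scale stands in for the FFN.
expert_scale = np.arange(1, NUM_EXPERTS + 1, dtype=float)
partials = []  # (token, k, expert_output) tuples flowing back
for rank, bucket in send_buckets.items():
    for t, k, e in bucket:
        partials.append((t, k, expert_scale[e] * tokens[t]))

# Combine: the reverse A2A plus a weighted reduction per token.
output = np.zeros_like(tokens)
for t, k, y in partials:
    output[t] += topk_weights[t, k] * y

print("combined output shape:", output.shape)
```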
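And for the "Replication and placement logic" item under the EP balancer component, here is one possible way a balancer could turn per-expert token statistics into a replicated expert placement; the synthetic load, spare-slot budget, and greedy heuristic below are our own illustrative assumptions, not the algorithm actually used in TensorRT-LLM:

```python
import heapq
import numpy as np

np.random.seed(0)
NUM_EXPERTS, EP_SIZE, SLOTS_PER_RANK = 16, 4, 5  # 4 spare slots for replicas

# Pretend these per-expert token counts came from the statistics kernels.
expert_load = np.random.zipf(1.5, NUM_EXPERTS).astype(float)

# 1) Replication: hand the spare slots to the currently hottest experts,
#    splitting each expert's load evenly across its replicas.
total_slots = EP_SIZE * SLOTS_PER_RANK
replicas = np.ones(NUM_EXPERTS, dtype=int)
for _ in range(total_slots - NUM_EXPERTS):
    hottest = int(np.argmax(expert_load / replicas))
    replicas[hottest] += 1

# 2) Placement: assign each replica (heaviest first) to the least-loaded rank.
slots = [(expert_load[e] / replicas[e], e)
         for e in range(NUM_EXPERTS) for _ in range(replicas[e])]
slots.sort(reverse=True)
rank_heap = [(0.0, rank, []) for rank in range(EP_SIZE)]  # (load, rank, experts)
heapq.heapify(rank_heap)
for load, e in slots:
    rank_load, rank, assigned = heapq.heappop(rank_heap)
    assigned.append(e)
    heapq.heappush(rank_heap, (rank_load + load, rank, assigned))

for rank_load, rank, assigned in sorted(rank_heap, key=lambda x: x[1]):
    print(f"rank {rank}: load={rank_load:.1f} experts={sorted(assigned)}")
```

An online balancer additionally has to move already-loaded expert weights and keep every rank's routing view consistent while inference keeps running, which is presumably what the remapping/synchronization and expert loading/sharing items above cover.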
To make it easier for the community to understand what we are doing now and what we plan to do, here is the high-level design overview by @dongxuy04 (thanks to Dongxu for the great technical work behind the current design):
We are also considering initiating a detailed design review & discussion with the community if there is enough interest, to help the community better understand the current plan and to encourage community engagement.
Thanks
The TensorRT-LLM Engineering Team