[Call for contributions]The development plan of large-scale EP support in TensorRT-LLM #4127

@juney-nvidia

Description

Big thanks to the DeepSeek team for their awesome work! Recently, large-scale fine-grained MoE models have been gaining popularity, but they also bring new optimization challenges (and opportunities) for LLM inference systems. One key technique to make models like DeepSeek V3/R1 run efficiently is large-scale EP (Expert Parallelism) – it not only leverages aggregated memory bandwidth to reduce latency but also helps maximize compute utilization.
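To make the EP idea above concrete, here is a minimal, illustrative sketch (not TensorRT-LLM code – all names are hypothetical) of how expert parallelism shards MoE experts across ranks and groups routed tokens by the rank that owns each expert, which is the per-rank send pattern an all-to-all dispatch would carry:

```python
# Hypothetical sketch of MoE expert-parallel (EP) token dispatch.
# Experts are sharded contiguously: with num_experts=8 and ep_size=4,
# rank 0 owns experts 0-1, rank 1 owns 2-3, and so on.

def expert_to_rank(expert_id: int, num_experts: int, ep_size: int) -> int:
    """Map an expert id to the EP rank that owns it (contiguous sharding)."""
    experts_per_rank = num_experts // ep_size
    return expert_id // experts_per_rank

def dispatch(token_topk_experts, num_experts: int, ep_size: int):
    """Group (token, expert) pairs by owning EP rank.

    token_topk_experts: per-token top-k expert ids, e.g. [[0, 5], [3, 7]].
    Returns {rank: [(token_idx, expert_id), ...]} - the per-rank send lists.
    """
    per_rank = {r: [] for r in range(ep_size)}
    for tok, experts in enumerate(token_topk_experts):
        for e in experts:
            per_rank[expert_to_rank(e, num_experts, ep_size)].append((tok, e))
    return per_rank

# Three tokens, top-2 routing, 8 experts over 4 EP ranks.
routes = dispatch([[0, 5], [3, 7], [1, 2]], num_experts=8, ep_size=4)
# routes[0] -> [(0, 0), (2, 1)]: tokens 0 and 2 hit experts owned by rank 0.
```

In a real system each rank would then run its local experts on the received tokens and a second all-to-all would return the results; this sketch only covers the routing step.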

Getting large-scale EP to work well isn't easy, and we really appreciate the DeepSeek team sharing their insights and optimization tricks through both their tech report and open-source code (DeepEP and EPLB). Shoutout to the SGLang team too, who recently did great work implementing large-scale EP using DeepSeek's components plus their own innovations!

On the TensorRT-LLM side, we've been working on large-scale EP support for a while. Our approach might differ slightly from other solutions – we're particularly focused on supporting NVIDIA's latest hardware (like GB200) as well as other architectures (B200, Hopper, etc.).

We're also putting extra effort into designing an end-to-end system that handles both large-scale EP execution and dynamic workload balancing to adapt to real-time traffic changes, making deployment smoother for users. To be clear, we don't think these ideas are unique to TensorRT-LLM – in fact, we're pretty sure teams like DeepSeek have already implemented similar approaches in their internal systems (judging from their published tech report). We've learned a ton from DeepSeek's paper and code, and we're grateful they've shared their work with the community!
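The dynamic workload balancing mentioned above can be sketched as a simple load-aware repacking problem. This is a hedged, hypothetical illustration (not the actual TensorRT-LLM or EPLB implementation): given observed per-expert token counts from recent traffic, greedily re-assign experts to EP ranks so per-rank load is roughly even.

```python
# Hypothetical sketch of dynamic expert-to-rank rebalancing:
# greedy longest-processing-time packing of experts onto EP ranks.
import heapq

def rebalance(expert_loads, ep_size):
    """Return {rank: [expert_id, ...]} balancing observed per-expert load.

    expert_loads: tokens routed to each expert in a recent window.
    Heaviest experts are placed first, each onto the currently lightest rank.
    """
    heap = [(0, r) for r in range(ep_size)]  # (accumulated load, rank)
    heapq.heapify(heap)
    placement = {r: [] for r in range(ep_size)}
    for e in sorted(range(len(expert_loads)), key=lambda e: -expert_loads[e]):
        load, r = heapq.heappop(heap)
        placement[r].append(e)
        heapq.heappush(heap, (load + expert_loads[e], r))
    return placement

# Skewed loads: expert 0 is hot. Two ranks end up with 100 tokens each.
placement = rebalance([90, 10, 40, 60], ep_size=2)
```

A production system layers much more on top of this (expert replication for hot experts, migration cost, online statistics collection), but the core objective – evening out per-rank work as traffic shifts – is the same.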

Motivated by DeepSeek's work, and to make TensorRT-LLM's technical execution more transparent while giving the community a channel to engage with core development at an early stage, we are sharing our concrete plan for supporting large-scale EP in TensorRT-LLM. Your comments, suggestions, and contributions are highly appreciated.

To make it easier for the community to understand what we are doing now and what we plan to do, here is a high-level design overview by @dongxuy04 (thanks to Dongxu for the great technical work behind the current design):

[Image: high-level design overview of large-scale EP support in TensorRT-LLM]

We are also considering a detailed design review and discussion with the community if there is enough interest, to help the community better understand the current plan and to encourage engagement.

Thanks

The TensorRT-LLM Engineering Team
