feat: large-scale EP(part 7: DeepEP integration) #4792
9c54709 to c3b20cd
| Notes from offline discussion |
| /bot -h | 
| GitHub Bot Help
 Provide a user friendly way for developers to interact with a Jenkins server. Run /bot [-h|--help] to print this help message. See details below for each supported subcommand.
 run
 Launch build/test pipelines. All previously running jobs will be killed.
 kill
 Kill all running builds associated with pull request.
 skip
 Skip testing for latest commit on pull request.
 reuse-pipeline
 Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break. |
| /bot run --stage-list "Build-Docker-Images" | 
| PR_Github #7071 [ run ] triggered by Bot | 
| PR_Github #7071 [ run ] completed with state  | 
| @yuantailing Please fix the style following the guidance: https://github.com/NVIDIA/TensorRT-LLM/blob/main/CONTRIBUTING.md#coding-style |
09f4f14 to ec39fda
    | /bot run --stage-list "Build-Docker-Images" | 
| PR_Github #7270 [ run ] triggered by Bot | 
| PR_Github #7270 [ run ] completed with state  | 
8995de6 to 9c5fc69
    | /bot run --disable-fail-fast --stage-list "DGX_H100-4_GPUs-PyTorch-DeepSeek-1" | 
| PR_Github #8785 [ run ] triggered by Bot | 
| PR_Github #8785 [ run ] completed with state  | 
| OOM on a test case that passed previously. The test case passed in PR_Github #8651, and there is no code change between these two CI runs. |
| /bot run --stage-list "DGX_H100-4_GPUs-PyTorch-DeepSeek-1" | 
| PR_Github #8822 [ run ] triggered by Bot | 
| Compare the environment of PR_Github #8651 and PR_Github #8785 Pipeline 8651 installed  Pipeline 8651: Pipeline 8785:  | 
| PR_Github #8822 [ run ] completed with state  | 
| Build timeout. Note that #5027 changed  | 
| Maybe the second build can reuse ccache. Run again. | 
| /bot run --stage-list "DGX_H100-4_GPUs-PyTorch-DeepSeek-1" | 
| PR_Github #8866 [ run ] triggered by Bot | 
| PR_Github #8866 [ run ] completed with state  | 
| ToT failure in the  Merge main and test again. | 
| /bot run --disable-fail-fast --stage-list "DGX_H100-4_GPUs-PyTorch-DeepSeek-1" | 
| PR_Github #8873 [ run ] triggered by Bot | 
| PR_Github #8873 [ run ] completed with state  | 
| The rerun test is  I noticed that PR #5140 reran  Both reruns happen in the same file and have the same call stack. So I believe the root cause is ToT high failure rate in  Appendix: call stack  | 
| /bot skip --comment "PR_Github #8541, PR_Github #8651, and PR_Github #8873 form a full test. The main branch grows 39 commits from the first test." | 
| PR_Github #8883 [ skip ] triggered by Bot | 
| PR_Github #8883 [ skip ] completed with state  | 
| @yuantailing Hi, I tried to enable DeepEP and found that num_nvl_peers and comm are not parameters of DeepEP's Buffer init function. So I guess you modified DeepEP's source code? Update: I've figured out how to install the modified DeepEP. Please see docker/common/install_deep_ep.sh |
| Hi @WanchaoYao , | 
DeepEP integration
Description
Support matrix:
Please refer to select_alltoall_method_type (in fused_moe_cutlass.py) for the condition of enabling DeepEP or DeepEPLowLatency. This is an experimental feature, so the environment variable TRTLLM_CAN_USE_DEEP_EP=1 is required.
One of the following lines will be printed at initialization:
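The opt-in requirement above can be illustrated with a minimal Python sketch. This is not the code in fused_moe_cutlass.py and it does not reproduce the initialization log lines; the helper name deep_ep_opted_in is hypothetical. Only the variable name TRTLLM_CAN_USE_DEEP_EP and the value 1 come from the description, and treating an unset variable as "off" is an assumption.

```python
import os


def deep_ep_opted_in(environ=None):
    """Return True only when TRTLLM_CAN_USE_DEEP_EP is exactly "1".

    Hypothetical helper for illustration. The real enabling condition is in
    select_alltoall_method_type and may depend on more than this variable.
    """
    env = os.environ if environ is None else environ
    # Assumption: an unset variable means the experimental path stays disabled.
    return env.get("TRTLLM_CAN_USE_DEEP_EP", "0") == "1"


if __name__ == "__main__":
    print("DeepEP opt-in set:", deep_ep_opted_in())
```

Passing a plain dict instead of os.environ keeps the check easy to exercise without touching the process environment.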
Known issues:
TRTLLM_MOE_POST_QUANT_ALLTOALLV=0 instead.
Test Coverage
GitHub Bot Help
/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...
Provide a user friendly way for developers to interact with a Jenkins server.
Run /bot [-h|--help] to print this help message.
See details below for each supported subcommand.
run [--disable-fail-fast --skip-test --stage-list "A10-1, xxx" --gpu-type "A30, H100_PCIe" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-[Post-Merge]-1, xxx"]
Launch build/test pipelines. All previously running jobs will be killed.
--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.
--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.
--stage-list "A10-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-1, xxx". Note: Does NOT update GitHub check status.
--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.
--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.
--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.
--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests. Will also run L0 pre-merge pipeline.
--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.
--extra-stage "H100_PCIe-[Post-Merge]-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-[Post-Merge]-1, xxx".
kill
kill
Kill all running builds associated with pull request.
skip
skip --comment COMMENT
Skip testing for latest commit on pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.
reuse-pipeline
reuse-pipeline
Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.
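As a small illustration of the run subcommand syntax above, the following Python sketch assembles a /bot run comment string from a list of stage names. The helper bot_run_comment is hypothetical and not part of the bot or of this repository; it only formats text using two of the flags documented in the help (--disable-fail-fast and --stage-list) and does not talk to Jenkins or GitHub.

```python
def bot_run_comment(stage_list=(), disable_fail_fast=False):
    """Format a '/bot run' PR comment (hypothetical helper, text only)."""
    parts = ["/bot", "run"]
    if disable_fail_fast:
        parts.append("--disable-fail-fast")
    if stage_list:
        # The help text shows stages as one quoted, comma-separated string.
        parts.append('--stage-list "{}"'.format(", ".join(stage_list)))
    return " ".join(parts)


if __name__ == "__main__":
    print(bot_run_comment(["DGX_H100-4_GPUs-PyTorch-DeepSeek-1"],
                          disable_fail_fast=True))
```

With the stage name used in this conversation, the helper produces the same comment text that was posted to trigger PR_Github #8785.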