darraghdog
Owner

PR title

Please write the PR title using the following template:

[JIRA ticket link/nvbug link/github issue link][fix/feat/doc/infra/...] <summary of this PR>

For example, a PR that adds a new cache manager feature tracked by Jira ticket TRTLLM-1000 would be titled:

[TRTLLM-1000][feat] Support a new feature about cache manager
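
As a rough illustration of the title template above, a title could be checked with a small regular expression. This is only a sketch: the function name `parse_pr_title` and the pattern are hypothetical, and the pattern assumes the change type is a single lowercase word, since the template's "..." leaves the full list of accepted types open.

```python
import re

# Hypothetical pattern for the "[ticket][type] summary" template.
# Assumption: the type is one lowercase word (fix, feat, doc, infra, ...).
TITLE_RE = re.compile(
    r"^\[(?P<ticket>[^\[\]]+)\]\[(?P<kind>[a-z]+)\] (?P<summary>\S.*)$"
)


def parse_pr_title(title):
    """Return the ticket, kind and summary of a title, or None if it does not fit."""
    match = TITLE_RE.match(title)
    if match is None:
        return None
    return match.groupdict()


# The example title from the template parses into its three parts.
print(parse_pr_title("[TRTLLM-1000][feat] Support a new feature about cache manager"))
```

A title that lacks either bracketed field, such as a plain summary with no ticket, makes the function return `None`.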

Description

Please explain the issue and the solution in short.

Test Coverage

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provides a user-friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

run [--disable-fail-fast --skip-test --stage-list "A10-1, xxx" --gpu-type "A30, H100_PCIe" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-[Post-Merge]-1, xxx"]

Launch build/test pipelines. All previously running jobs will be killed.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests. Will also run L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-[Post-Merge]-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-[Post-Merge]-1, xxx".
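
To make the `run` options above concrete, the following sketch assembles the text of a `/bot run` comment from a few of the documented flags. The helper `build_bot_run` is hypothetical and not part of the bot; it only formats a string, covers a subset of the options, and assumes list values are joined with ", " as in the help text's examples.

```python
def build_bot_run(disable_fail_fast=False, skip_test=False,
                  stage_list=None, gpu_type=None, post_merge=False):
    """Build a '/bot run' comment string (hypothetical helper, formatting only)."""
    parts = ["/bot", "run"]
    if disable_fail_fast:
        parts.append("--disable-fail-fast")
    if skip_test:
        parts.append("--skip-test")
    if stage_list:
        # Stage names are quoted and comma-separated, as in the help text.
        parts.append('--stage-list "{}"'.format(", ".join(stage_list)))
    if gpu_type:
        parts.append('--gpu-type "{}"'.format(", ".join(gpu_type)))
    if post_merge:
        parts.append("--post-merge")
    return " ".join(parts)


# Restrict test stages to two GPU types and keep going after failures.
print(build_bot_run(disable_fail_fast=True, gpu_type=["A30", "H100_PCIe"]))
```

The resulting string would be posted as a pull request comment. Per the help text, options such as `--gpu-type` do not update the GitHub check status.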

kill

kill

Kill all running builds associated with the pull request.

skip

skip --comment COMMENT

Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous, since a lack of user care and validation can break the top of tree.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate the current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous, since a lack of user care and validation can break the top of tree.

kevinch-nv and others added 30 commits May 21, 2025 21:10
* Add pytorch backend team

Signed-off-by: Kevin Chen 

* Update .github/CODEOWNERS

Co-authored-by: Yanchao Lu 
Signed-off-by: juney-nvidia <[email protected]>

---------

Signed-off-by: Kevin Chen 
Signed-off-by: juney-nvidia <[email protected]>
Co-authored-by: juney-nvidia <[email protected]>
Co-authored-by: Yanchao Lu
…rf-tests (cpp) (#4499)

add low concurrency perf tests

Signed-off-by: Venky <[email protected]>
* Adding two-shot allreduce kernel and mnnvl multicasting buffer

Signed-off-by: Shiyu Li <[email protected]>

Adding comments

Signed-off-by: Shiyu Li <[email protected]>

Add unittest of the twoshot kernel.

Signed-off-by: Shiyu Li <[email protected]>

Update dispatch logic

Signed-off-by: Shiyu Li <[email protected]>

Use cpu barrier instead of GPU at init

Signed-off-by: Shiyu Li <[email protected]>

Merge dispatch logic fix

Signed-off-by: Shiyu Li <[email protected]>

Update the kernel to use GPU-managed buffer

Signed-off-by: Shiyu Li <[email protected]>

* Refine

Signed-off-by: Zongfei Jing <[email protected]>

* Clean code

Signed-off-by: Zongfei Jing <[email protected]>

* Fix compile error

Signed-off-by: Zongfei Jing <[email protected]>

* Fix issue

Signed-off-by: Zongfei Jing <[email protected]>

* Clean up

Signed-off-by: Zongfei Jing <[email protected]>

* Simplify AllReduce interface

Signed-off-by: Zongfei Jing <[email protected]>

* Rename

Signed-off-by: Zongfei Jing <[email protected]>

* Fix warning

Signed-off-by: Zongfei Jing <[email protected]>

* Tidy code

Signed-off-by: Zongfei Jing <[email protected]>

* Rename

Signed-off-by: Zongfei Jing <[email protected]>

* Fix compile error

Signed-off-by: Zongfei Jing <[email protected]>

* Refine

Signed-off-by: Zongfei Jing <[email protected]>

* Skip ut for no_fusion

Signed-off-by: Zongfei Jing <[email protected]>

* Refine

Signed-off-by: Zongfei Jing <[email protected]>

---------

Signed-off-by: Shiyu Li <[email protected]>
Signed-off-by: Zongfei Jing <[email protected]>
Co-authored-by: Shiyu Li <[email protected]>
* agentConnection

Signed-off-by: Chuang Zhu <[email protected]>

recv

Signed-off-by: Chuang Zhu <[email protected]>

agentState

Signed-off-by: Chuang Zhu <[email protected]>

NIXL interfaces

Signed-off-by: Shixiaowei02 <[email protected]>

update cmakelists

Signed-off-by: Shixiaowei02 <[email protected]>

nixl improve

Signed-off-by: Chuang Zhu <[email protected]>

remove cppzmq

Signed-off-by: Chuang Zhu <[email protected]>

fix

Signed-off-by: Chuang Zhu <[email protected]>

transferAgent remove register

Signed-off-by: Chuang Zhu <[email protected]>

work for cache Test

Signed-off-by: Chuang Zhu <[email protected]>

reduce sleep time

Signed-off-by: Chuang Zhu <[email protected]>

fix test

Signed-off-by: Chuang Zhu <[email protected]>

integrate

Signed-off-by: Chuang Zhu <[email protected]>

nixl env

Signed-off-by: Chuang Zhu <[email protected]>

fix rebase error

Signed-off-by: Chuang Zhu <[email protected]>

cpp test

Signed-off-by: Chuang Zhu <[email protected]>

stash for send metaData

Signed-off-by: Chuang Zhu <[email protected]>

loadRemoteMD after fetchRemoteMD

Signed-off-by: Chuang Zhu <[email protected]>

workaround for mixed gen and context

Signed-off-by: Chuang Zhu <[email protected]>

test_env

Signed-off-by: Chuang Zhu <[email protected]>

avoid port conflict in test

Signed-off-by: Chuang Zhu <[email protected]>

* format

Signed-off-by: Chuang Zhu <[email protected]>

* use std::string

Signed-off-by: Chuang Zhu <[email protected]>

* typo

Signed-off-by: Chuang Zhu <[email protected]>

* fix transferAgentTest

Signed-off-by: Chuang Zhu <[email protected]>

---------

Signed-off-by: Chuang Zhu <[email protected]>
* partition LlmArgs

Signed-off-by: Superjomn <[email protected]>

* update backend

Signed-off-by: Superjomn <[email protected]>

---------

Signed-off-by: Superjomn <[email protected]>
Add all_reduce.py script to test

Signed-off-by: Kaiyu Xie <[email protected]>
* feat: add dataset support for benchmark_core_model with LLMAPI

Signed-off-by: Aurelien Chartier <[email protected]>
#3972)

* Remove waived cases
* Remove test cases of not supported feature

Signed-off-by: Hui Gao <[email protected]>
* Add tritonrelease container

Signed-off-by: Iman Tabrizian <[email protected]>

* Review comments

Signed-off-by: Iman Tabrizian <[email protected]>

* Update docker/Makefile

Co-authored-by: Martin Marciniszyn Mehringer <[email protected]>
Signed-off-by: Iman Tabrizian <[email protected]>

---------

Signed-off-by: Iman Tabrizian <[email protected]>
Signed-off-by: Iman Tabrizian <[email protected]>
Co-authored-by: Martin Marciniszyn Mehringer <[email protected]>
waive hanging cases

Signed-off-by: Ruodi <[email protected]>
* update waive list

Signed-off-by: xinhe-nv <[email protected]>

* fix test issues

Signed-off-by: xinhe-nv <[email protected]>

---------

Signed-off-by: xinhe-nv <[email protected]>
* clean up _merge_dummy_request method of PyExecutor

Signed-off-by: junq <[email protected]>

* fix ci

Signed-off-by: junq <[email protected]>

* clean

Signed-off-by: junq <[email protected]>

* update comment

Signed-off-by: junq <[email protected]>

---------

Signed-off-by: junq <[email protected]>
stash for debug broken promise

Signed-off-by: Chuang Zhu <[email protected]>
[fix] Fix chunked prefill + overlap scheduler

Signed-off-by: Mike Iovine <[email protected]>
* Integrate chunked attention kernels

Signed-off-by: Mike Iovine <[email protected]>

* Fix cache key

Signed-off-by: Mike Iovine <[email protected]>

* Fix lint

Signed-off-by: Mike Iovine <[email protected]>

---------

Signed-off-by: Mike Iovine <[email protected]>
clean up _gather_dp_requests_num method of PyExecutor

Signed-off-by: junq <[email protected]>
…er (#4573)

fix moe possible race cond and add bypass worker thread for no updates

Signed-off-by: Dongxu Yang <[email protected]>
* support mcp

# Conflicts:
#	tensorrt_llm/scaffolding/worker.py

Signed-off-by: wu1du2 <[email protected]>

* move all into contrib/mcp

# Conflicts:
#	examples/scaffolding/contrib/mcp/mcptest.py
#	tensorrt_llm/scaffolding/__init__.py
#	tensorrt_llm/scaffolding/contrib/__init__.py
#	tensorrt_llm/scaffolding/contrib/mcp/__init__.py
#	tensorrt_llm/scaffolding/contrib/mcp/mcp_controller.py
#	tensorrt_llm/scaffolding/task.py
#	tensorrt_llm/scaffolding/worker.py

Signed-off-by: wu1du2 <[email protected]>

* support sandbox, websearch

# Conflicts:
#	examples/scaffolding/contrib/mcp/mcptest.py
#	examples/scaffolding/contrib/mcp/weather/weather.py
#	tensorrt_llm/scaffolding/contrib/mcp/mcp_controller.py
#	tensorrt_llm/scaffolding/contrib/mcp/mcp_utils.py
#	tensorrt_llm/scaffolding/contrib/mcp/mcp_worker.py
#	tensorrt_llm/scaffolding/worker.py

Signed-off-by: wu1du2 <[email protected]>

* remove pics

Signed-off-by: wu1du2 <[email protected]>

* pre-commit fix

# Conflicts:
#	tensorrt_llm/scaffolding/contrib/mcp/__init__.py
#	tensorrt_llm/scaffolding/contrib/mcp/mcp_utils.py
#	tensorrt_llm/scaffolding/contrib/mcp/mcp_worker.py

Signed-off-by: wu1du2 <[email protected]>

* fix spell

Signed-off-by: wu1du2 <[email protected]>

* rebase

Signed-off-by: wu1du2 <[email protected]>

---------

Signed-off-by: wu1du2 <[email protected]>
* feat: Enabling disagg serving with TRT backend with Python runtime

Signed-off-by: Patrice Castonguay <[email protected]>

* Fixing formatting

Signed-off-by: Patrice Castonguay <[email protected]>

* Fixing disagg mtp test

Signed-off-by: Patrice Castonguay <[email protected]>

---------

Signed-off-by: Patrice Castonguay <[email protected]>
arthurrasmusson and others added 29 commits May 28, 2025 23:27
Signed-off-by: Arthur Rasmusson <[email protected].>
Co-authored-by: Robin Kobus <[email protected]>
Co-authored-by: Aurelien Chartier <[email protected]>
Signed-off-by: Yiqing Yan <[email protected]>
…image groovy and support NGC images (#4294)

Signed-off-by: ZhanruiSunCh <[email protected]>
Signed-off-by: Zhanrui Sun <[email protected]>
Co-authored-by: Yanchao Lu <[email protected]>
Signed-off-by: Jhao-Ting Chen <[email protected]>
Co-authored-by: Haohang Huang <[email protected]>
Signed-off-by: Chenfei Zhang <[email protected]>
Signed-off-by: Yilin Fan <[email protected]>
Co-authored-by: Chenfei Zhang <[email protected]>
Signed-off-by: Hao Lu <[email protected]@users.noreply.github.com>
Co-authored-by: Hao Lu <[email protected]@users.noreply.github.com>
@darraghdog darraghdog merged commit d57bb09 into darraghdog:main May 30, 2025