User/yuanjingx/slurm test tmp #7309
Conversation
          
📝 Walkthrough

Unifies SLURM test execution into a node-count–aware API, adds helpers to assemble pytest invocations and node args, switches multi-node launch to a generated script executed remotely, remaps runtime labels to TRT-LLM variants, and refactors the Jenkins and SLURM run scripts accordingly.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant Jenkins as Jenkins (L0_Test.groovy)
  participant Slurm as SlurmCluster
  participant Node as Remote Node (SSH)
  participant Launch as slurm_launch.sh
  participant PyTest as pytest
  Jenkins->>Jenkins: build pytestCommand via getPytestBaseCommandLine
  Jenkins->>Slurm: request allocation (nodeCount, gpu args via getNodeArgs)
  Jenkins->>Node: copy generated slurm_launch.sh and support files
  Jenkins->>Node: SSH -> execute slurm_launch.sh
  Node->>Launch: start, set env, inject placeholders via set_value_in_command
  Launch->>PyTest: eval pytestCommand
  PyTest-->>Launch: exit/status, produce coverage/artifacts
  Launch->>Launch: run optional perf checks, generate report
  Launch-->>Jenkins: exit status and artifacts
  Jenkins->>Slurm: cleanUpNodeResources (release)
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
/bot run --stage-list "DGX_B200-4_GPUs-PyTorch-Post-Merge-1" --disable-fail-fast

PR_Github #16742 [ run ] triggered by Bot
Actionable comments posted: 5
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️  Outside diff range comments (1)
jenkins/scripts/slurm_run.sh (1)
69-76: Trim values; don't mangle paths. Avoid replacing spaces with underscores; just read and prepend cleanly.

-containerPipLLMLibPath=$(pip3 show tensorrt_llm | grep "Location" | awk -F ":" '{ gsub(/ /, "", $2); print $2"/tensorrt_llm/libs"}')
-containerPipLLMLibPath=$(echo "$containerPipLLMLibPath" | sed 's/[[:space:]]+/_/g')
-containerLDLibPath=$LD_LIBRARY_PATH
-containerLDLibPath=$(echo "$containerLDLibPath" | sed 's/[[:space:]]+/_/g')
+containerPipLLMLibPath=$(pip3 show tensorrt_llm | awk -F': ' '/^Location/{gsub(/^[[:space:]]+|[[:space:]]+$/,"",$2); print $2"/tensorrt_llm/libs"}')
+containerLDLibPath="${LD_LIBRARY_PATH}"
 if [[ "$containerLDLibPath" != *"$containerPipLLMLibPath"* ]]; then
     containerLDLibPath="$containerPipLLMLibPath:$containerLDLibPath"
     containerLDLibPath="${containerLDLibPath%:}"
 fi
 export LD_LIBRARY_PATH=$containerLDLibPath
🧹 Nitpick comments (7)
jenkins/scripts/slurm_run.sh (2)
34-34: Install reliability: run apt-get update before installing. Precede install with update to reduce transient failures.

- apt-get install -y libffi-dev
+ apt-get update -y && apt-get install -y libffi-dev
81-81: Remove or fix unused variable in log. $testCase is never set.

-echo "Running: $testCase"
+echo "Running stage: $stageName"

jenkins/L0_Test.groovy (5)
159-170: Validate Slurm task/GPU mapping; consider adding --gpus-per-task=1. With ceil() on gpusPerNode and --ntasks=${gpuCount}, tasks/GPU mapping can drift on uneven splits. Make binding explicit.

 return nodeCount == 1 ? [
     "--nodes=${nodeCount}",
     "--gpus=${gpuCount}"
 ] : [
     "--nodes=${nodeCount}",
     "--ntasks=${gpuCount}",
     "--ntasks-per-node=${gpusPerNode}",
     "--gpus-per-node=${gpusPerNode}",
+    "--gpus-per-task=1",
 ]

If your Slurm defaults already bind GPUs per task, please confirm.
172-230: Command builder looks solid; minor nit on empty args. Joining includes empty strings when not detailed logging; harmless but can be filtered to keep logs cleaner.

- testFilter[(DETAILED_LOG)] ? "-s" : "",
+ *(testFilter[(DETAILED_LOG)] ? ["-s"] : []),
266-287: Ensure slurm_run.sh exists from the unpacked tree. You extract the tar locally to copy slurm_run.sh. If packaging ever changes, this path breaks. Add a quick existence check to fail early.

 def scriptRunLocalPath = "${llmSrcLocal}/jenkins/scripts/slurm_run.sh"
+if (!fileExists(scriptRunLocalPath)) {
+    error "slurm_run.sh not found at ${scriptRunLocalPath}"
+}
312-318: Quoting safety for exporting pytestCommand into the launch script.Long commands with quotes can break the here-doc export. Consider base64-encoding and decoding in slurm_run.sh, or at least escape embedded quotes.
Example (conceptual):
def cmdB64 = pytestCommand.bytes.encodeBase64().toString()
scriptContent = """#!/bin/bash
export pytestCommandB64="${cmdB64}"
...
"""
// In slurm_run.sh:
pytestCommand="$(echo "$pytestCommandB64" | base64 -d)"

Also applies to: 337-349
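The base64 idea above can be exercised end-to-end in plain bash; this is a minimal sketch where the command string and variable names are illustrative, not the real generated pytest invocation:

```shell
#!/bin/bash
# Round-trip a quote-heavy command string through an env var via base64.
# Embedded double quotes survive untouched because only base64 text is exported.
pytestCommand='__LUNOWUD="-thread_pool_size=8" pytest -v "some test"'

# Producer side (the Groovy side would use pytestCommand.bytes.encodeBase64()):
pytestCommandB64=$(printf '%s' "$pytestCommand" | base64 | tr -d '\n')
export pytestCommandB64

# Consumer side (what slurm_run.sh would do):
decoded=$(printf '%s' "$pytestCommandB64" | base64 -d)
echo "$decoded"
```

The decoded string is byte-for-byte identical to the original, so no quoting in the command can break the generated launch script.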
379-387: SSH execution path: add basic retry to handle transient SSH/SRUN flakiness.Occasional transient remote failures are common; a short retry loop improves robustness.
- """bash ${scriptLaunchPathNode}""" + """bash ${scriptLaunchPathNode}""" ,And wrap
Utils.execwith numRetries similar to earlier copy steps.
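A generic retry wrapper along these lines might look as follows; `retry` and `flaky` are hypothetical names used only for illustration, not existing helpers in the repo:

```shell
#!/bin/bash
# Hypothetical retry wrapper for flaky remote steps (SSH/srun):
# runs a command up to N times with a short pause between attempts.
retry() {
    local attempts="$1"; shift
    local i
    for ((i = 1; i <= attempts; i++)); do
        "$@" && return 0
        echo "attempt $i/$attempts failed: $*" >&2
        sleep 1
    done
    return 1
}

# Demo stand-in for a transiently failing command: fails twice, then succeeds.
tries=0
flaky() {
    tries=$((tries + 1))
    [ "$tries" -ge 3 ]
}
retry 5 flaky && echo "succeeded after $tries attempts"
```

In the Jenkins context the equivalent would be passing a retry count into Utils.exec, as the review suggests.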
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
 - Jira integration is disabled by default for public repositories
 - Linear integration is disabled by default for public repositories
 
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (2)
- jenkins/L0_Test.groovy (10 hunks)
- jenkins/scripts/slurm_run.sh (3 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
Applied to files:
jenkins/scripts/slurm_run.sh
🪛 Shellcheck (0.10.0)
jenkins/scripts/slurm_run.sh
[warning] 40-40: hostname is referenced but not assigned.
(SC2154)
[warning] 40-40: stageName is referenced but not assigned.
(SC2154)
[warning] 49-49: Use 'cd ... || exit' or 'cd ... || return' in case cd fails.
(SC2164)
[warning] 85-85: perfMode is referenced but not assigned.
(SC2154)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
 
🔇 Additional comments (3)
jenkins/L0_Test.groovy (3)
351-359: MPI flag only for multi-node: LGTM. Using --mpi=pmi2 conditionally makes sense with the containerized srun flow.
Also applies to: 360-368
1390-1417: Unified pytest command path: LGTM. The builder plus rerun logic integration looks correct and readable.
Also applies to: 1438-1446
1818-1822: Stage mappings updated to TRT-LLM variants: LGTM. New x86/SBSA Slurm configs and multi-node stages align with the new runner.
Also applies to: 1847-1851, 1854-1862
set_value_in_command() {
    # Parameters
    local key="$1"
    local value="$2"
    local command="$3"

    # Transform the key
    local placeholder="__PLACEHOLDER_${key}__"

    # Check if placeholder exists
    if [[ "$command" != *"$placeholder"* ]]; then
        echo "Error: placeholder '$placeholder' not found in the command" >&2
        return 1
    fi

    # Replace all occurrences
    local result="${command//${placeholder}/${value}}"

    # Return the result
    echo "$result"
}
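For reference, the helper above can be exercised standalone; this sketch copies the reviewed function verbatim and shows both the replacement path and the unchecked-failure path the review warns about:

```shell
#!/bin/bash
# Standalone copy of the reviewed helper: substitutes __PLACEHOLDER_<KEY>__
# occurrences in a command string, returning 1 if the placeholder is absent.
set_value_in_command() {
    local key="$1" value="$2" command="$3"
    local placeholder="__PLACEHOLDER_${key}__"
    if [[ "$command" != *"$placeholder"* ]]; then
        echo "Error: placeholder '$placeholder' not found in the command" >&2
        return 1
    fi
    echo "${command//${placeholder}/${value}}"
}

cmd='pytest --cov-config=__PLACEHOLDER_coverageConfigFile__'
resolved=$(set_value_in_command "coverageConfigFile" "/ws/.coveragerc" "$cmd")
echo "$resolved"   # pytest --cov-config=/ws/.coveragerc

# The failure path returns 1; callers must check it, or the result is empty.
if ! set_value_in_command "MISSING" "x" "$cmd" >/dev/null 2>&1; then
    echo "caller saw failure"
fi
```

This illustrates why the review asks for fail-fast handling: `$(...)` silently captures an empty string when the function returns 1.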
🛠️ Refactor suggestion
Fail fast on missing placeholders and surface errors.
set_value_in_command returns 1 if not found; callers don’t check, risking an empty pytestCommand. Either exit on error or handle status.
 set_value_in_command() {
@@
-    if [[ "$command" != *"$placeholder"* ]]; then
-        echo "Error: placeholder '$placeholder' not found in the command" >&2
-        return 1
-    fi
+    if [[ "$command" != *"$placeholder"* ]]; then
+        echo "Error: placeholder '$placeholder' not found in the command" >&2
+        exit 1
+    fi
 }

Optionally also enable strict mode near the top:
 #!/bin/bash
+set -euo pipefail

Also applies to: 55-55, 67-67
🤖 Prompt for AI Agents
In jenkins/scripts/slurm_run.sh around lines 5 to 25 (also apply same treatment
at lines 55 and 67), the helper set_value_in_command currently returns a
non-zero status when a placeholder is missing but callers do not check the
return code, which can lead to unset/empty command variables; change the
function to fail-fast by printing the error to stderr and then calling exit 1
(or alternatively propagate the error by returning non-zero and ensure callers
immediately check and exit on failure), and enable strict mode near the top of
the script (set -euo pipefail and optionally IFS=$'\n\t') so any unexpected
failure halts execution; apply the same error-handling change to the other
placeholder-replacement usages at lines 55 and 67 to ensure missing placeholders
abort the run.
 git config --global --add safe.directory "*"
 gpuUuids=$(nvidia-smi -q | grep "GPU UUID" | awk '{print $4}' | tr '\n' ',' || true)
-echo "HOST_NODE_NAME = $HOST_NODE_NAME ; GPU_UUIDS = =$gpuUuids ; STAGE_NAME = $stageName"
+echo "HOST_NODE_NAME = $hostname ; GPU_UUIDS = =$gpuUuids ; STAGE_NAME = $stageName"
Fix undefined hostname variable and extra '=' in log line.
Use command substitution; drop the duplicate '='.
-    echo "HOST_NODE_NAME = $hostname ; GPU_UUIDS = =$gpuUuids ; STAGE_NAME = $stageName"
+    echo "HOST_NODE_NAME = $(hostname) ; GPU_UUIDS = $gpuUuids ; STAGE_NAME = $stageName"

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| echo "HOST_NODE_NAME = $hostname ; GPU_UUIDS = =$gpuUuids ; STAGE_NAME = $stageName" | |
| echo "HOST_NODE_NAME = $(hostname) ; GPU_UUIDS = $gpuUuids ; STAGE_NAME = $stageName" | 
🧰 Tools
🪛 Shellcheck (0.10.0)
[warning] 40-40: hostname is referenced but not assigned.
(SC2154)
[warning] 40-40: stageName is referenced but not assigned.
(SC2154)
🤖 Prompt for AI Agents
In jenkins/scripts/slurm_run.sh around line 40, the echo uses an undefined
$hostname variable and includes an extra '=' before $gpuUuids; replace $hostname
with command substitution (e.g. $(hostname) or $(hostname -s) as appropriate),
remove the duplicated '=' so GPU_UUIDS = $gpuUuids, and optionally quote
variables in the echo to avoid word-splitting (e.g. "$stageName" and
"$gpuUuids").
export LLM_MODELS_ROOT=$MODEL_CACHE_DIR

export UCX_TLS=^gdr_copy
cd $llmSrcNode/tests/integration/defs
🛠️ Refactor suggestion
Ensure directory change failure stops the run.
Guard against a bad path to avoid running tests from the wrong dir.
-cd $llmSrcNode/tests/integration/defs
+cd "$llmSrcNode/tests/integration/defs" || { echo "cd failed: $llmSrcNode/tests/integration/defs" >&2; exit 1; }📝 Committable suggestion
🧰 Tools
🪛 Shellcheck (0.10.0)
[warning] 49-49: Use 'cd ... || exit' or 'cd ... || return' in case cd fails.
(SC2164)
🤖 Prompt for AI Agents
In jenkins/scripts/slurm_run.sh around line 49, the script does a plain cd to
$llmSrcNode/tests/integration/defs which will silently continue if the path is
invalid; change it to guard against failure by checking cd's exit status and
aborting the run on error (for example, replace the bare cd with an explicit
check such as attempting cd and on failure printing a clear error to stderr and
exiting with non‑zero status).
trtllmWhlPath=$(pip3 show tensorrt_llm | grep Location | cut -d ' ' -f 2)
trtllmWhlPath=$(echo "$trtllmWhlPath" | sed 's/[[:space:]]+/_/g')
echo "TRTLLM WHEEL PATH: $trtllmWhlPath"
sed -i "s|---wheel_path---|$trtllmWhlPath|g" "$coverageConfigFile"
testCmdLines+=(
    "--cov=$llmSrcNode/examples/"
    "--cov=$llmSrcNode/tensorrt_llm/"
    "--cov=$trtllmWhlPath/tensorrt_llm/"
    "--cov-report="
    "--cov-config=$coverageConfigFile"
)
pytestCommand=$(set_value_in_command "TRTLLM_WHL_PATH" "$trtllmWhlPath" "$pytestCommand")
🛠️ Refactor suggestion
Robustly resolve tensorrt_llm location; current sed regex is ineffective.
sed 's/[[:space:]]+/_/g' treats + literally; whitespace isn’t trimmed. Use awk to extract/trim and fail fast if empty.
-trtllmWhlPath=$(pip3 show tensorrt_llm | grep Location | cut -d ' ' -f 2)
-trtllmWhlPath=$(echo "$trtllmWhlPath" | sed 's/[[:space:]]+/_/g')
+trtllmWhlPath=$(pip3 show tensorrt_llm | awk -F': ' '/^Location/{gsub(/^[[:space:]]+|[[:space:]]+$/,"",$2); print $2}')
+[ -n "$trtllmWhlPath" ] || { echo "Failed to resolve tensorrt_llm wheel path" >&2; exit 1; }
 pytestCommand=$(set_value_in_command "TRTLLM_WHL_PATH" "$trtllmWhlPath" "$pytestCommand")
🤖 Prompt for AI Agents
In jenkins/scripts/slurm_run.sh around lines 52-56, the sed invocation is wrong
(the '+' is literal) and doesn't reliably trim or extract the Location; replace
it with an awk-based extraction and trimming pipeline that finds the "Location:"
field from pip3 show, prints the value, and optionally replaces internal
whitespace with underscores if needed, then immediately fail fast (exit non-zero
with an error log) if the resulting variable is empty before continuing to set
pytestCommand. Ensure the script logs the resolved path and exits with a clear
error message when tensorrt_llm Location cannot be determined.
| echo "Full Command: $fullCmd" | ||
| eval $fullCmd | ||
| echo "Full Command: $pytestCommand" | ||
| eval $pytestCommand | 
🛠️ Refactor suggestion
Avoid eval; run the command safely while preserving env assignments.
eval risks unintended expansion. bash -lc is safer here.
-eval $pytestCommand
+bash -lc "$pytestCommand"
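The behavior difference is easy to demonstrate with a throwaway command string; a minimal sketch using plain `bash -c` (the suggestion uses `-lc`, which additionally reads login-shell startup files):

```shell
#!/bin/bash
# A command string that starts with an env assignment, as pytestCommand does.
# Passing it to a child shell keeps the assignment scoped to that command
# instead of re-evaluating the string inside the current shell.
cmd='GREETING="hello world" sh -c '\''echo "$GREETING"'\'''

out=$(bash -c "$cmd")
echo "$out"                  # hello world

# The assignment did not leak into the invoking shell:
echo "${GREETING:-unset}"    # unset
```

With `eval`, by contrast, the string is expanded in the current shell's context, so metacharacters and assignments can have surprising side effects.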
🤖 Prompt for AI Agents
In jenkins/scripts/slurm_run.sh around line 83, avoid using eval to run the
pytestCommand because eval can cause unintended expansion; instead invoke a new
bash shell with -l -c and pass the entire pytestCommand string as a single
argument so environment variable assignments in the string are preserved and not
subject to eval expansion. Replace the eval usage with a call that runs the
command under bash -lc, ensuring the pytestCommand is passed as one quoted
argument and not re-evaluated by the current shell. Ensure any needed
environment assignments remain inside pytestCommand or are exported beforehand
so they are visible to the new shell.
PR_Github #16742 [ run ] completed with state

/bot run --stage-list "DGX_B200-4_GPUs-PyTorch-Post-Merge-1" --disable-fail-fast
Actionable comments posted: 0
🧹 Nitpick comments (1)
tests/integration/test_lists/test-db/l0_dgx_b200.yml (1)
97-101: Normalize GPTOSS backend tags to lowercase

Uppercase backend tags (CUTLASS, TRITON, TRTLLM) are still present in l0_dgx_b200.yml and will cause parameter mismatches in the updated runner. Please apply the following changes:

Files/locations to update:
- tests/integration/test_lists/test-db/l0_dgx_b200.yml, lines 97-101

- - accuracy/test_llm_api_pytorch.py::TestGPTOSS::test_w4_4gpus[tp4-CUTLASS]
- - accuracy/test_llm_api_pytorch.py::TestGPTOSS::test_w4_4gpus[tp4-TRITON]
- - accuracy/test_llm_api_pytorch.py::TestGPTOSS::test_w4_4gpus[ep4-TRTLLM]
- - accuracy/test_llm_api_pytorch.py::TestGPTOSS::test_w4_4gpus[dp4-CUTLASS]
- - accuracy/test_llm_api_pytorch.py::TestGPTOSS::test_w4_4gpus[dp4-TRITON]
+ - accuracy/test_llm_api_pytorch.py::TestGPTOSS::test_w4_4gpus[tp4-cutlass]
+ - accuracy/test_llm_api_pytorch.py::TestGPTOSS::test_w4_4gpus[tp4-triton]
+ - accuracy/test_llm_api_pytorch.py::TestGPTOSS::test_w4_4gpus[ep4-trtllm]
+ - accuracy/test_llm_api_pytorch.py::TestGPTOSS::test_w4_4gpus[dp4-cutlass]
+ - accuracy/test_llm_api_pytorch.py::TestGPTOSS::test_w4_4gpus[dp4-triton]
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
 - Jira integration is disabled by default for public repositories
 - Linear integration is disabled by default for public repositories
 
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (1)
- tests/integration/test_lists/test-db/l0_dgx_b200.yml (1 hunks)
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-08-26T09:49:04.956Z
Learnt from: pengbowang-nv
PR: NVIDIA/TensorRT-LLM#7192
File: tests/integration/test_lists/test-db/l0_dgx_b200.yml:56-72
Timestamp: 2025-08-26T09:49:04.956Z
Learning: In TensorRT-LLM test configuration files, the test scheduling system handles wildcard matching with special rules that prevent duplicate test execution even when the same tests appear in multiple yaml files with overlapping GPU wildcards (e.g., "*b200*" and "*gb200*").
Applied to files:
tests/integration/test_lists/test-db/l0_dgx_b200.yml
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
Applied to files:
tests/integration/test_lists/test-db/l0_dgx_b200.yml
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
 
🔇 Additional comments (1)
tests/integration/test_lists/test-db/l0_dgx_b200.yml (1)
44-47: Good normalization of backend tags to lowercase. The tp/ep/dp backends for TestGPTOSS were lowercased (trtllm, cutlass, triton), which aligns with recent runner naming unification and reduces case-sensitivity pitfalls.
PR_Github #16749 [ run ] triggered by Bot

PR_Github #16749 [ run ] completed with state

e69b735 to eeb92e1 (Compare)

/bot run --stage-list "DGX_B200-4_GPUs-PyTorch-Post-Merge-1" --disable-fail-fast

PR_Github #16818 [ run ] triggered by Bot

/bot run --stage-list "DGX_B200-4_GPUs-PyTorch-Post-Merge-1" --disable-fail-fast
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️  Outside diff range comments (1)
jenkins/scripts/slurm_run.sh (1)
69-79: Fix LD_LIBRARY_PATH augmentation; avoid broken sed and ensure idempotence.

-containerPipLLMLibPath=$(pip3 show tensorrt_llm | grep "Location" | awk -F ":" '{ gsub(/ /, "", $2); print $2"/tensorrt_llm/libs"}')
-containerPipLLMLibPath=$(echo "$containerPipLLMLibPath" | sed 's/[[:space:]]+/_/g')
-containerLDLibPath=$LD_LIBRARY_PATH
-containerLDLibPath=$(echo "$containerLDLibPath" | sed 's/[[:space:]]+/_/g')
-if [[ "$containerLDLibPath" != *"$containerPipLLMLibPath"* ]]; then
-    containerLDLibPath="$containerPipLLMLibPath:$containerLDLibPath"
-    containerLDLibPath="${containerLDLibPath%:}"
-fi
-export LD_LIBRARY_PATH=$containerLDLibPath
+containerPipLLMLibPath=$(pip3 show tensorrt_llm | awk -F': ' '/^Location/{gsub(/^[[:space:]]+|[[:space:]]+$/,"",$2); print $2"/tensorrt_llm/libs"}')
+containerLDLibPath="${LD_LIBRARY_PATH:-}"
+case ":$containerLDLibPath:" in
+  *":$containerPipLLMLibPath:"*) ;;
+  *) containerLDLibPath="$containerPipLLMLibPath${containerLDLibPath:+:$containerLDLibPath}";;
+esac
+export LD_LIBRARY_PATH="$containerLDLibPath"
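The case-pattern in that diff can be tested in isolation; `prepend_path` here is a hypothetical helper wrapping the same logic on a throwaway variable:

```shell
#!/bin/bash
# Idempotent prepend: add a directory to a colon-separated path variable
# only if it is not already present.
prepend_path() {
    local dir="$1" path="$2"
    case ":$path:" in
        *":$dir:"*) ;;                       # already present: no-op
        *) path="$dir${path:+:$path}" ;;     # prepend, handling empty path
    esac
    echo "$path"
}

p="/usr/lib"
p=$(prepend_path "/opt/libs" "$p")
p=$(prepend_path "/opt/libs" "$p")   # idempotent: second call changes nothing
echo "$p"                            # /opt/libs:/usr/lib
```

Wrapping the candidate path in leading and trailing colons avoids false matches on substrings of longer directory names, which the simple `*"$dir"*` glob in the original code could hit.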
♻️ Duplicate comments (5)
jenkins/scripts/slurm_run.sh (5)
1-4: Fail fast on placeholder errors and enable strict mode. Uncaught errors from set_value_in_command can silently corrupt pytestCommand. Also adopt strict mode.

 #!/bin/bash
+set -euo pipefail
+IFS=$'\n\t'

 set_value_in_command() {
@@
-    if [[ "$command" != *"$placeholder"* ]]; then
-        echo "Error: placeholder '$placeholder' not found in the command" >&2
-        return 1
-    fi
+    if [[ "$command" != *"$placeholder"* ]]; then
+        echo "Error: placeholder '$placeholder' not found in the command" >&2
+        exit 1
+    fi
 }

Also applies to: 5-25
40-40: Fix undefined $hostname and extra '='; quote values. Use command substitution and remove the duplicate equals.

-    echo "HOST_NODE_NAME = $hostname ; GPU_UUIDS = =$gpuUuids ; STAGE_NAME = $stageName"
+    echo "HOST_NODE_NAME = $(hostname) ; GPU_UUIDS = ${gpuUuids} ; STAGE_NAME = ${stageName:-unknown}"
49-49: Guard cd to fail early if path is wrong.

-cd $llmSrcNode/tests/integration/defs
+cd "$llmSrcNode/tests/integration/defs" || { echo "cd failed: $llmSrcNode/tests/integration/defs" >&2; exit 1; }
52-56: Robustly resolve TRT-LLM wheel location; current sed is ineffective.

-trtllmWhlPath=$(pip3 show tensorrt_llm | grep Location | cut -d ' ' -f 2)
-trtllmWhlPath=$(echo "$trtllmWhlPath" | sed 's/[[:space:]]+/_/g')
-echo "TRTLLM WHEEL PATH: $trtllmWhlPath"
+trtllmWhlPath=$(pip3 show tensorrt_llm | awk -F': ' '/^Location/{gsub(/^[[:space:]]+|[[:space:]]+$/,"",$2); print $2}')
+[ -n "$trtllmWhlPath" ] || { echo "Failed to resolve tensorrt_llm wheel path" >&2; exit 1; }
+echo "TRTLLM WHEEL PATH: ${trtllmWhlPath}"
 pytestCommand=$(set_value_in_command "TRTLLM_WHL_PATH" "$trtllmWhlPath" "$pytestCommand")
82-83: Avoid eval; execute safely preserving env assignments.

-echo "Full Command: $pytestCommand"
-eval $pytestCommand
+echo "Full Command: $pytestCommand"
+bash -lc "$pytestCommand"
🧹 Nitpick comments (5)
jenkins/scripts/slurm_run.sh (2)
57-68: Quoting and path safety for generated .coveragerc. Use quotes to avoid word-splitting; keep consistent placeholder casing.

-cat << EOF > $jobWorkspace/.coveragerc
+cat << EOF > "$jobWorkspace/.coveragerc"
@@
 data_file = $jobWorkspace/.coverage.$stageName
@@
     $llmSrcNode/tensorrt_llm/
     $trtllmWhlPath/tensorrt_llm/
 EOF
 pytestCommand=$(set_value_in_command "coverageConfigFile" "$jobWorkspace/.coveragerc" "$pytestCommand")
81-81: Drop or set unused $testCase. Variable is undefined; the log line adds noise.

-echo "Running: $testCase"
+# echo "Running test case: $testCase" # Define testCase if needed or remove this line.

jenkins/L0_Test.groovy (3)
172-199: Quoting in extraInternalEnv can break downstream export; keep env assigns simple. The token __LUNOWUD="...", when embedded into a double-quoted export later, breaks quoting. Safer to avoid inner quotes in env assigns or ensure the command is injected as a file (see below).

- extraInternalEnv = "__LUNOWUD=\"-thread_pool_size=${TESTER_CORES}\""
+ extraInternalEnv = "__LUNOWUD=-thread_pool_size=${TESTER_CORES}"

If you must keep the quotes, apply the "write command to file" fix in the launch script comment below.

Also applies to: 200-229
379-387: Run launcher with explicit shell and error propagation. Use bash -lc to preserve env and stop on failure of srun.

- """bash ${scriptLaunchPathNode}"""
+ """bash -lc '${scriptLaunchPathNode}'"""
160-170: Minor: prefer RoundingMode for BigDecimal. ROUND_CEILING is deprecated; not blocking but easy to modernize.

- int gpusPerNode = ((gpuCount / nodeCount) as BigDecimal).setScale(0, BigDecimal.ROUND_CEILING).intValue()
+ int gpusPerNode = ((gpuCount / nodeCount) as BigDecimal).setScale(0, java.math.RoundingMode.CEILING).intValue()
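As a side note, the same ceiling split can be expressed with plain integer arithmetic, which is handy if the calculation ever needs mirroring in shell; the counts below are arbitrary examples:

```shell
#!/bin/bash
# Ceiling division without floating point:
# gpusPerNode = ceil(gpuCount / nodeCount) == (gpuCount + nodeCount - 1) / nodeCount
gpuCount=7
nodeCount=2
gpusPerNode=$(( (gpuCount + nodeCount - 1) / nodeCount ))
echo "$gpusPerNode"   # 4
```

The `(a + b - 1) / b` idiom rounds up exactly when the division has a remainder and leaves exact divisions unchanged.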
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
 - Jira integration is disabled by default for public repositories
 - Linear integration is disabled by default for public repositories
 
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (2)
- jenkins/L0_Test.groovy (10 hunks)
- jenkins/scripts/slurm_run.sh (3 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
Applied to files:
jenkins/scripts/slurm_run.sh
🪛 Shellcheck (0.10.0)
jenkins/scripts/slurm_run.sh
[warning] 40-40: hostname is referenced but not assigned.
(SC2154)
[warning] 40-40: stageName is referenced but not assigned.
(SC2154)
[warning] 49-49: Use 'cd ... || exit' or 'cd ... || return' in case cd fails.
(SC2164)
[warning] 85-85: perfMode is referenced but not assigned.
(SC2154)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
 
🔇 Additional comments (3)
jenkins/L0_Test.groovy (3)
161-170: Node/gpu arg calculation looks good. Ceiling split per node and per-task settings make sense for uneven GPU distributions.
1406-1417: Good: local (non-SLURM) pytest command assembly and rerun flow are coherent. The coverage config, LD_LIBRARY_PATH setup, and rerun integration are consistent.
Also applies to: 1439-1446
1818-1822: Verify new Slurm stage keys in bot stage-list and CI mapping. Ensure the DGX_B200-4_GPUs-PyTorch-{1,2,Post-Merge-1,Post-Merge-2} entries appear in /bot stage-list and in your CI's test_to_stage_mapping; missing entries will trigger "stage not found."
def pytestCommand = getPytestBaseCommandLine(
    llmSrcNode,
    stageName,
    testListPathNode,
    splits,
    splitId,
    perfMode,
    jobWorkspace,
    "__PLACEHOLDER_TRTLLM_WHL_PATH__",
    "__PLACEHOLDER_coverageConfigFile__",
    pytestUtil
).join(" ")
🛠️ Refactor suggestion
Critical: exporting pytestCommand with embedded quotes corrupts the script; write command to a file and read it at runtime.
The generated command contains quotes (e.g., __LUNOWUD="..."), which terminates export pytestCommand="...". This is a likely cause of the CI failures (e.g., DGX_B200-4_GPUs-PyTorch-Post-Merge-1). Persist the command via here-doc and read it in slurm_run.sh.
-                def pytestCommand = getPytestBaseCommandLine(
+                def pytestCommand = getPytestBaseCommandLine(
                     llmSrcNode,
@@
                 ).join(" ")
@@
-                def scriptContent = """#!/bin/bash
+                def scriptContent = """#!/bin/bash
                     export jobWorkspace=$jobWorkspace
@@
-                    export nodeCount=$nodeCount
-                    export pytestCommand="$pytestCommand"
+                    export nodeCount=$nodeCount
+                    cat > "$jobWorkspace/.pytest_cmd" <<'PYTEST_CMD'
+${pytestCommand}
+PYTEST_CMD
+                    export pytestCommand="$(cat "$jobWorkspace/.pytest_cmd")"
                     export NVIDIA_IMEX_CHANNELS=0
                     chmod +x $scriptRunNode
                """

No change needed in slurm_run.sh beyond replacing eval with bash -lc (already suggested). Optional: slurm_run.sh can prefer reading from $jobWorkspace/.pytest_cmd directly.
Also applies to: 337-350, 351-359, 360-368, 369-371
🤖 Prompt for AI Agents
In jenkins/L0_Test.groovy around lines 324-336 (and similarly for ranges
337-350, 351-359, 360-368, 369-371), the script currently exports pytestCommand
as a quoted string which gets corrupted by embedded quotes; instead write the
full generated command to a file in the job workspace (e.g., create jobWorkspace
+ "/.pytest_cmd" using a here-doc or direct write) and export only the path or a
simple variable, then ensure slurm_run.sh reads and executes that file at
runtime (replace any eval usage with bash -lc "$(<$jobWorkspace/.pytest_cmd)" or
have slurm_run.sh source/exec the file); update the code that builds
pytestCommand to write the command to the file and not attempt to embed complex
quoted content into a single exported environment variable.
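The quoting failure and the file-based fix are easy to reproduce in plain shell. A minimal sketch follows; the temp path and the command text are illustrative stand-ins, not the real generated pytest line:

```shell
#!/bin/sh
# A command containing double quotes, like the generated pytest line.
pytestCommand='pytest -v __LUNOWUD="-x --timeout=3600" tests/'

workdir=$(mktemp -d)

# Persist the command verbatim with a quoted here-doc delimiter:
# no expansion or quote-stripping happens between the markers.
cat > "$workdir/.pytest_cmd" <<'PYTEST_CMD'
pytest -v __LUNOWUD="-x --timeout=3600" tests/
PYTEST_CMD

# Read it back at runtime; the embedded quotes survive intact.
roundtrip=$(cat "$workdir/.pytest_cmd")

if [ "$roundtrip" = "$pytestCommand" ]; then
    echo "quotes preserved"
fi
```

By contrast, splicing the same string into `export pytestCommand="..."` inside a generated script ends the double-quoted region at the first embedded quote, which is the corruption described above.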
if [ "$perfMode" = "true" ]; then
    if [[ "$stageName" == *PyTorch* ]]; then
        basePerfFilename="base_perf_pytorch.csv"
    else
        basePerfFilename="base_perf.csv"
    fi
    basePerfPath="$llmSrcNode/tests/integration/defs/perf/$basePerfFilename"
    echo "Check perf result"
    python3 $llmSrcNode/tests/integration/defs/perf/sanity_perf_check.py \
        $stageName/perf_script_test_results.csv \
        $basePerfPath
    echo "Check perf report"
    python3 $llmSrcNode/tests/integration/defs/perf/create_perf_comparison_report.py \
        --output_path $stageName/report.pdf \
        --files $stageName/perf_script_test_results.csv \
        $basePerfPath
fi
Fix perf paths: use jobWorkspace instead of stageName on SLURM node.
Results are written under $jobWorkspace; current paths won’t exist remotely.
-    basePerfPath="$llmSrcNode/tests/integration/defs/perf/$basePerfFilename"
+    basePerfPath="$llmSrcNode/tests/integration/defs/perf/$basePerfFilename"
@@
-    python3 $llmSrcNode/tests/integration/defs/perf/sanity_perf_check.py \
-        $stageName/perf_script_test_results.csv \
-        $basePerfPath
+    python3 "$llmSrcNode/tests/integration/defs/perf/sanity_perf_check.py" \
+        "$jobWorkspace/perf_script_test_results.csv" \
+        "$basePerfPath"
@@
-    python3 $llmSrcNode/tests/integration/defs/perf/create_perf_comparison_report.py \
-        --output_path $stageName/report.pdf \
-        --files $stageName/perf_script_test_results.csv \
-        $basePerfPath
+    python3 "$llmSrcNode/tests/integration/defs/perf/create_perf_comparison_report.py" \
+        --output_path "$jobWorkspace/report.pdf" \
+        --files "$jobWorkspace/perf_script_test_results.csv" \
+        "$basePerfPath"

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
if [ "$perfMode" = "true" ]; then
    if [[ "$stageName" == *PyTorch* ]]; then
        basePerfFilename="base_perf_pytorch.csv"
    else
        basePerfFilename="base_perf.csv"
    fi
    basePerfPath="$llmSrcNode/tests/integration/defs/perf/$basePerfFilename"
    echo "Check perf result"
    python3 "$llmSrcNode/tests/integration/defs/perf/sanity_perf_check.py" \
        "$jobWorkspace/perf_script_test_results.csv" \
        "$basePerfPath"
    echo "Check perf report"
    python3 "$llmSrcNode/tests/integration/defs/perf/create_perf_comparison_report.py" \
        --output_path "$jobWorkspace/report.pdf" \
        --files "$jobWorkspace/perf_script_test_results.csv" \
        "$basePerfPath"
fi
🧰 Tools
🪛 Shellcheck (0.10.0)
[warning] 85-85: perfMode is referenced but not assigned.
(SC2154)
🤖 Prompt for AI Agents
In jenkins/scripts/slurm_run.sh around lines 85 to 101, the perf script
currently uses $stageName for result and report paths which do not exist on the
SLURM node; change those references to $jobWorkspace (e.g. use
"$jobWorkspace/perf_script_test_results.csv" and "$jobWorkspace/report.pdf") and
update both python calls to pass the basePerfPath and jobWorkspace paths,
quoting variables to handle spaces, so the sanity_perf_check and
create_perf_comparison_report operate on files under the remote job workspace.
PR_Github #16820 [ run ] triggered by Bot

PR_Github #16818 [ run ] completed with state

PR_Github #16820 [ run ] completed with state

/bot run --stage-list "DGX_B200-4_GPUs-PyTorch-Post-Merge-1" --disable-fail-fast

PR_Github #16880 [ run ] triggered by Bot
Actionable comments posted: 0
♻️ Duplicate comments (1)
jenkins/L0_Test.groovy (1)
324-336: Critical: exporting pytestCommand with embedded quotes corrupts the script; persist to file and read it at runtime

Joining the command and embedding it into export pytestCommand="..." breaks when the content itself contains quotes (e.g., __LUNOWUD="..."). This matches the CI failures on DGX_B200 stages. Write the command to a file and read it back to avoid quoting/escaping hell. Also enable strict bash options.

- def pytestCommand = getPytestBaseCommandLine(
+ def pytestCommand = getPytestBaseCommandLine(
      llmSrcNode, stageName, testListPathNode, splits, splitId, perfMode,
      jobWorkspace, "__PLACEHOLDER_TRTLLM_WHL_PATH__",
      "__PLACEHOLDER_coverageConfigFile__", pytestUtil
  ).join(" ")
@@
- def scriptContent = """#!/bin/bash
+ def scriptContent = """#!/bin/bash
+     set -euo pipefail
      export jobWorkspace=$jobWorkspace
      export tarName=$tarName
      export llmTarfile=$llmTarfile
      export llmSrcNode=$llmSrcNode
      export stageName=$stageName
      export perfMode=$perfMode
      export resourcePathNode=$resourcePathNode
      export nodeCount=$nodeCount
-     export pytestCommand="$pytestCommand"
+     # Persist full pytest command to avoid export-time quoting issues
+     cat > "$jobWorkspace/.pytest_cmd" <<'PYTEST_CMD'
+${pytestCommand}
+PYTEST_CMD
+     export PYTEST_CMD_FILE="$jobWorkspace/.pytest_cmd"
      export NVIDIA_IMEX_CHANNELS=0
      chmod +x $scriptRunNode
  """

Follow-up: slurm_run.sh should prefer reading and executing the command from $PYTEST_CMD_FILE if set (or replace eval with bash -lc "$(<"$PYTEST_CMD_FILE")"). I can send a PR to slurm_run.sh if needed.

#!/bin/bash
# Show vulnerable exports and ensure the new file-based approach is present
rg -nP --type=groovy -C2 'export\s+pytestCommand="?[^"]*"\$' jenkins/L0_Test.groovy
rg -nP --type=groovy -C2 '\.pytest_cmd' jenkins/L0_Test.groovy
# Inspect slurm_run.sh for 'eval' vs 'bash -lc' and optional file read
rg -n 'pytestCommand' jenkins/scripts/slurm_run.sh
rg -n 'eval' jenkins/scripts/slurm_run.sh

Also applies to: 337-371
🧹 Nitpick comments (6)
jenkins/L0_Test.groovy (6)
1-1: Avoid pinning to a personal/dev shared-lib branch in mainline CI

Use a stable tag/branch or make it configurable; dev branches can break CI unpredictably.
135-157: Cleanup should be best-effort and non-fatal

If the workspace was never created or already removed, rm -rf errors could fail the stage. Add || true.

- "rm -rf /home/svc_tensorrt/bloom/scripts/${jobUID}"
+ "rm -rf /home/svc_tensorrt/bloom/scripts/${jobUID} || true"
161-170: Slurm args: single-node should use --gpus-per-node; uneven GPU splits need verification

- For nodeCount == 1, prefer --gpus-per-node to align with containerized srun setups.
- With uneven splits (e.g., 5 GPUs over 2 nodes), ceiled gpusPerNode=3 with --ntasks=5 and --ntasks-per-node=3 might overconstrain placement. Validate on your cluster.

- return nodeCount == 1 ? [
-     "--nodes=${nodeCount}",
-     "--gpus=${gpuCount}"
- ] : [
+ return nodeCount == 1 ? [
+     "--nodes=${nodeCount}",
+     "--gpus-per-node=${gpuCount}"
+ ] : [
      "--nodes=${nodeCount}",
      "--ntasks=${gpuCount}",
      "--ntasks-per-node=${gpusPerNode}",
      "--gpus-per-node=${gpusPerNode}",
  ]
172-230: getPytestBaseCommandLine: avoid hidden global state and fragile quoting

- This method reads testFilter[(DETAILED_LOG)] implicitly. Prefer an explicit parameter for determinism and testability.
- Returning a token array is good; the later export/quoting is where it breaks (see next comment).

If you want to decouple from globals with minimal churn:

-def getPytestBaseCommandLine(
+def getPytestBaseCommandLine(
     String llmSrc,
     String stageName,
     String testDBList,
     int splits,
     int split_id,
     Boolean perfMode,
     String outputPath,
     String trtllmWheelPath,
     String coverageConfigFile,
-    String pytestUtil = ""
+    String pytestUtil = "",
+    Boolean detailedLog = testFilter[(DETAILED_LOG)]
 ) {
@@
-    testFilter[(DETAILED_LOG)] ? "-s" : "",
+    detailedLog ? "-s" : "",
350-368: Harden launch script and remove unused local var

- Add strict bash flags (already added above).
- The local outputPath variable is unused; remove it.

-        } else {
-            String outputPath = "${jobWorkspace}/job-output.log"
+        } else {
             taskArgs = [
                 *taskArgs,
             ]
232-233: Unused parameters in runLLMTestlistOnSlurm
skipInstallWheel and cpver are unused in this method. Either use them or drop them to avoid confusion.

-def runLLMTestlistOnSlurm(pipeline, platform, testList, config=VANILLA_CONFIG, perfMode=false, stageName="Undefined", splitId=1, splits=1, gpuCount=1, nodeCount=1, skipInstallWheel=false, cpver="cp312")
+def runLLMTestlistOnSlurm(pipeline, platform, testList, config=VANILLA_CONFIG, perfMode=false, stageName="Undefined", splitId=1, splits=1, gpuCount=1, nodeCount=1)
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
 - Jira integration is disabled by default for public repositories
 - Linear integration is disabled by default for public repositories
 
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (1)
jenkins/L0_Test.groovy (11 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
 
🔇 Additional comments (10)
jenkins/L0_Test.groovy (10)
266-288: LGTM: remote script staging

Copying slurm_run.sh and the rendered test list to the FE node is clear and retriable.

306-318: LGTM: containerized srun assembly

Using --container-image, --container-workdir, and bind mounts looks correct for trt-llm containerized srun.

319-335: PyTest launcher selection for multi-node is reasonable

Using trtllm-llmapi-launch only when nodeCount > 1 makes sense.

378-387: LGTM: remote execution

Executing the pre-staged launcher over SSH is straightforward and retriable via Utils.exec.

391-392: LGTM: result upload then cleanup in finally

Ordering is correct; ensures artifacts are collected even on failures.

1389-1417: LGTM: on-box pytest assembly with coverage

Local path uses array form (no export), avoiding quoting issues. Coverage config generation looks fine.

1438-1441: LGTM: rerun flow integration

Running pytest, then conditional rerun via parsed fail signatures is wired correctly.

Also applies to: 1445-1445

1818-1822: Verify new DGX_B200 Slurm mapping and partition label

Ensure "b200-trtllm" exists in SlurmConfig and the hardware matches 4-GPU expectations. These stages are the ones failing via /bot; the fix in quoting (above) should unblock, but please double-check partition naming.

1847-1850: Verify GB200 single-node Slurm mapping

Confirm "gb200-trtllm" partition is available and correctly wired to the intended rack/pool.

1854-1862: Multi-node SBSA configs: sanity-check MPI settings and resource math

- Using --mpi=pmi2 above: confirm PMI2 is available on these clusters.
- 8 GPUs across 2 nodes implies 4 per-node; ensure getNodeArgs and cluster constraints match.

Also applies to: 1892-1893
PR_Github #16880 [ run ] completed with state

/bot run --stage-list "DGX_B200-4_GPUs-PyTorch-Post-Merge-1" --disable-fail-fast

PR_Github #16890 [ run ] triggered by Bot
Actionable comments posted: 4
♻️ Duplicate comments (1)
jenkins/L0_Test.groovy (1)
324-336: Critical: exporting pytestCommand with embedded quotes will break the script; persist to a file and consume at runtime.

The command contains quotes (e.g., __LUNOWUD="..."), which corrupts export pytestCommand="...". Write the full command to a file and have slurm_run.sh read/execute it.

- def pytestCommand = getPytestBaseCommandLine(
+ def pytestCommand = getPytestBaseCommandLine(
@@
- def scriptContent = """#!/bin/bash
+ def scriptContent = """#!/bin/bash
+     set -euo pipefail
      export jobWorkspace=$jobWorkspace
@@
-     export nodeCount=$nodeCount
-     export pytestCommand="$pytestCommand"
-     export NVIDIA_IMEX_CHANNELS=0
-     chmod +x $scriptRunNode
+     export nodeCount=$nodeCount
+     export NVIDIA_IMEX_CHANNELS=0
+     # Persist pytest command verbatim to avoid quote corruption
+     cat > "$jobWorkspace/.pytest_cmd" <<'PYTEST_CMD'
+${pytestCommand}
+PYTEST_CMD
+     export PYTEST_CMD_FILE="$jobWorkspace/.pytest_cmd"
+     chmod +x "$scriptRunNode"
@@
-     scriptContent = scriptContent.replaceAll('\t','').stripIndent()
+     scriptContent = scriptContent.stripIndent()

Companion change required in jenkins/scripts/slurm_run.sh (outside this file):

- Replace any eval usage with: bash -lc "$(<"${PYTEST_CMD_FILE}")"
- Or read the file into a variable and execute with bash -lc "$cmd".

#!/bin/bash
# Verify slurm_run.sh does not use eval and can read PYTEST_CMD_FILE
rg -nC2 -e 'eval\s' -e 'PYTEST_CMD_FILE' jenkins/scripts/slurm_run.sh || true

Also applies to: 337-349, 350-368, 369-371
🧹 Nitpick comments (4)
jenkins/L0_Test.groovy (4)
172-230: Make getPytestBaseCommandLine robust: drop empty args; avoid hidden globals.
- Filter out empty strings so argv is clean (pytestUtil may be empty).
 - Optional (nice): pass a detailedLog flag instead of reading global testFilter.
 def getPytestBaseCommandLine(
@@
-    testFilter[(DETAILED_LOG)] ? "-s" : "",
+    testFilter[(DETAILED_LOG)] ? "-s" : "",
@@
-    return testCmdLine as String[]
+    // remove blanks/nulls
+    testCmdLine = testCmdLine.findAll { it?.toString().trim() }
+    return testCmdLine as String[]
319-323: Avoid injecting empty argv element for pytestUtil.

Minor cleanup: don't pass a blank arg.

-    String pytestUtil = ""
-    if (nodeCount > 1) {
-        pytestUtil = "$llmSrcNode/tensorrt_llm/llmapi/trtllm-llmapi-launch"
-    }
+    String pytestUtil = (nodeCount > 1) ? "$llmSrcNode/tensorrt_llm/llmapi/trtllm-llmapi-launch" : null

(Note: with the previous argv filtering in getPytestBaseCommandLine, nulls/blanks are dropped.)
360-367: Remove unused variable.

outputPath is unused.
- String outputPath = "${jobWorkspace}/job-output.log"
1405-1417: Local run: argv construction is good; optional: align with file-based command execution for parity.

Not required, but you can mirror the file-based approach here for consistency and easier debugging.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
 - Jira integration is disabled by default for public repositories
 - Linear integration is disabled by default for public repositories
 
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (1)
jenkins/L0_Test.groovy (11 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
 
🔇 Additional comments (1)
jenkins/L0_Test.groovy (1)
1818-1822: Ensure SlurmConfig & mapping script support new b200-trtllm/gb200-trtllm keys. Add these labels to SlurmConfig.partitionConfig and clusterConfig, and verify the test-to-stage mapping script (path tests/integration/defs/scripts/test_to_stage_mapping.py appears missing) correctly resolves the keys at L0_Test.groovy lines 1818–1822, 1847–1850, 1855–1862 and 1892–1893.
@@ -1,4 +1,4 @@
-@Library(['bloom-jenkins-shared-lib@main', 'trtllm-jenkins-shared-lib@main']) _
+@Library(['bloom-jenkins-shared-lib@dev-yuanjingx-slurm_refactor', 'trtllm-jenkins-shared-lib@main']) _
💡 Verification agent
🧩 Analysis chain
Do not merge with a dev shared library reference. Revert to main.
Pinning to a feature branch will break main after merge.
-@Library(['bloom-jenkins-shared-lib@dev-yuanjingx-slurm_refactor', 'trtllm-jenkins-shared-lib@main']) _
+@Library(['bloom-jenkins-shared-lib@main', 'trtllm-jenkins-shared-lib@main']) _🏁 Script executed:
#!/bin/bash
# Verify no other dev shared-lib references exist
grep -R "dev-yuanjingx-slurm_refactor" -n .
# Confirm @Library usage in the Jenkinsfile under review
grep -R "@Library" -n jenkins/L0_Test.groovy || true
# Preview the first few lines of jenkins/L0_Test.groovy
head -n 5 jenkins/L0_Test.groovy

Length of output: 564
Revert bloom-jenkins-shared-lib to @main
Pinning to dev-yuanjingx-slurm_refactor will break the main pipeline—update line 1 accordingly.
-@Library(['bloom-jenkins-shared-lib@dev-yuanjingx-slurm_refactor', 'trtllm-jenkins-shared-lib@main']) _
+@Library(['bloom-jenkins-shared-lib@main',                    'trtllm-jenkins-shared-lib@main']) _📝 Committable suggestion
@Library(['bloom-jenkins-shared-lib@main', 'trtllm-jenkins-shared-lib@main']) _
🤖 Prompt for AI Agents
In jenkins/L0_Test.groovy around line 1, the Jenkins shared library reference is
pinned to dev-yuanjingx-slurm_refactor which will break the main pipeline;
change the library ref to use the stable main branch by replacing
'@dev-yuanjingx-slurm_refactor' with '@main' so the line becomes
@Library(['bloom-jenkins-shared-lib@main', 'trtllm-jenkins-shared-lib@main']) _
and save the file.
-    return [
+    return nodeCount == 1 ? [
+        "--nodes=${nodeCount}",
+        "--gpus=${gpuCount}"
+    ] : [
         "--nodes=${nodeCount}",
         "--ntasks=${gpuCount}",
         "--ntasks-per-node=${gpusPerNode}",
         "--gpus-per-node=${gpusPerNode}",
-    ].join(" ")
+    ]
     }
          
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🛠️ Refactor suggestion
❓ Verification inconclusive
Harden getNodeArgs: handle invalid inputs; avoid deprecated BigDecimal rounding.
Avoid ROUND_CEILING and return null for invalid node/gpu counts (caller already checks for null).
-def getNodeArgs(int nodeCount, int gpuCount) {
-    int gpusPerNode = ((gpuCount / nodeCount) as BigDecimal).setScale(0, BigDecimal.ROUND_CEILING).intValue()
+def getNodeArgs(int nodeCount, int gpuCount) {
+    if (nodeCount < 1 || gpuCount < 1) {
+        return null
+    }
+    int gpusPerNode = Math.ceil(gpuCount / (double) nodeCount) as int
     return nodeCount == 1 ? [
         "--nodes=${nodeCount}",
         "--gpus=${gpuCount}"
     ] : [
         "--nodes=${nodeCount}",
         "--ntasks=${gpuCount}",
         "--ntasks-per-node=${gpusPerNode}",
         "--gpus-per-node=${gpusPerNode}",
     ]
 }

Harden getNodeArgs: validate inputs & avoid deprecated BigDecimal rounding.

- Guard against invalid counts (nodeCount < 1 || gpuCount < 1) by returning null.
- Replace ((gpuCount/nodeCount) as BigDecimal).setScale(0, BigDecimal.ROUND_CEILING).intValue() with Math.ceil(gpuCount/(double)nodeCount) as int.
Verify that every caller properly handles a null return from getNodeArgs.
🤖 Prompt for AI Agents
In jenkins/L0_Test.groovy around lines 160 to 170, harden getNodeArgs by first
validating inputs and returning null when nodeCount < 1 || gpuCount < 1, replace
the deprecated BigDecimal rounding expression with
Math.ceil(gpuCount/(double)nodeCount) cast to int for gpusPerNode, and update
callers to handle a possible null return (e.g., check for null and abort or
propagate error) so invalid inputs don't produce incorrect args.
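The ceiling split this comment asks for needs no BigDecimal (and no floating point) at all; in integer arithmetic it is the classic (a + b - 1) / b idiom. A minimal sketch in shell, with illustrative names mirroring the gpusPerNode computation:

```shell
#!/bin/sh
# gpus_per_node NODE_COUNT GPU_COUNT
# Ceiling division without floating point: (a + b - 1) / b.
gpus_per_node() {
    nodeCount=$1
    gpuCount=$2
    # Guard against invalid counts, as the review suggests.
    if [ "$nodeCount" -lt 1 ] || [ "$gpuCount" -lt 1 ]; then
        echo "invalid node/gpu count" >&2
        return 1
    fi
    echo $(( (gpuCount + nodeCount - 1) / nodeCount ))
}

gpus_per_node 2 5   # 5 GPUs over 2 nodes -> ceiled to 3 per node
gpus_per_node 1 4   # single node -> 4
```

The same idiom carries over to Groovy as `(gpuCount + nodeCount - 1).intdiv(nodeCount)`, avoiding both the deprecated rounding constant and float conversion.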
@@ -357,80 +277,118 @@
Don’t pass the password on the command line. Use sshpass -e.
Prevents leaking secrets via process list/logging.
-Utils.exec(pipeline, script: "sshpass -p '${remote.passwd}' ssh ${COMMON_SSH_OPTIONS} ${remote.user}@${remote.host} 'mkdir -p ${jobWorkspace}'", numRetries: 3,)
+Utils.exec(pipeline, script: "SSHPASS='${remote.passwd}' sshpass -e ssh ${COMMON_SSH_OPTIONS} ${remote.user}@${remote.host} 'mkdir -p ${jobWorkspace}'", numRetries: 3,)

📝 Committable suggestion
stage("[${stageName}] Initializing Test") {
    // Create Job Workspace folder in Frontend Node
    Utils.exec(pipeline, script: "SSHPASS='${remote.passwd}' sshpass -e ssh ${COMMON_SSH_OPTIONS} ${remote.user}@${remote.host} 'mkdir -p ${jobWorkspace}'", numRetries: 3,)
🤖 Prompt for AI Agents
In jenkins/L0_Test.groovy around lines 270 to 273, do not interpolate the
plaintext password into the sshpass command; instead switch to sshpass -e and
supply the password via the SSHPASS environment variable (use Jenkins
credentials binding or withEnv) so the password is not visible on the command
line or process list. Update the Utils.exec invocation to remove
'${remote.passwd}' from the command string and ensure remote.passwd is injected
into the process environment (or retrieved from Jenkins credentials) before
running sshpass -e ssh ..., keeping the same COMMON_SSH_OPTIONS, remote
user/host and mkdir -p jobWorkspace and retaining numRetries.
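The difference between sshpass -p and sshpass -e can be illustrated without a live SSH target: only the argv string shows up in ps output and build logs, so keeping the secret out of it is the whole point. A sketch with made-up values (host, password, and paths are illustrative):

```shell
#!/bin/sh
secret='s3cr3t-password'
host='svc_tensorrt@fe-node.example'   # illustrative host

# Vulnerable form: the secret is part of the command line.
bad_cmd="sshpass -p '$secret' ssh $host 'mkdir -p /tmp/ws'"

# Safer form: sshpass -e reads the SSHPASS environment variable,
# so the logged/visible command line is secret-free.
good_cmd="sshpass -e ssh $host 'mkdir -p /tmp/ws'"
# At run time the secret travels out-of-band:
#   SSHPASS="$secret" $good_cmd

case "$bad_cmd"  in *"$secret"*) echo "bad_cmd leaks the secret" ;; esac
case "$good_cmd" in *"$secret"*) ;; *) echo "good_cmd is secret-free" ;; esac
```

In Jenkins, the cleanest way to populate SSHPASS is a credentials binding or withEnv block rather than string interpolation into the script text.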
}
}
💡 Verification agent
🧩 Analysis chain
Re-enable cleanup, gated by a debug flag.
Leaving remote workspaces indefinitely will leak disk.
-        //cleanUpNodeResources(pipeline, cluster, jobUID)
+        if (!(env.KEEP_SLURM_WS_FOR_DEBUG?.toBoolean())) {
+            cleanUpNodeResources(pipeline, cluster, jobUID)
+        } else {
+            echo "KEEP_SLURM_WS_FOR_DEBUG=true; preserving remote workspace for ${jobUID}"
+        }🏁 Script executed:
#!/bin/bash
# Locate definition of the cleanup helper
rg -n 'def cleanUpNodeResources' -g '*.groovy'
# Find all invocations of the cleanup call
rg -n 'cleanUpNodeResources' -g '*.groovy'
# Check for existing debug flag usage or declaration
rg -n 'KEEP_SLURM_WS_FOR_DEBUG' -g '*.groovy'

Length of output: 430
Re-enable cleanup behind a debug flag in jenkins/L0_Test.groovy:391
Replace the commented-out call with:
if (!env.KEEP_SLURM_WS_FOR_DEBUG?.toBoolean()) {
    cleanUpNodeResources(pipeline, cluster, jobUID)
} else {
    echo "KEEP_SLURM_WS_FOR_DEBUG=true; preserving remote workspace for ${jobUID}"
}Also add and document the KEEP_SLURM_WS_FOR_DEBUG environment variable or pipeline parameter.
🤖 Prompt for AI Agents
In jenkins/L0_Test.groovy around lines 391-392, re-enable node cleanup but guard
it with a debug flag: replace the commented-out cleanUpNodeResources call with a
conditional that checks env.KEEP_SLURM_WS_FOR_DEBUG (toBoolean) and only calls
cleanUpNodeResources(pipeline, cluster, jobUID) when the flag is false;
otherwise emit an echo that the workspace is preserved for the jobUID. Also add
a documented pipeline parameter or environment variable definition for
KEEP_SLURM_WS_FOR_DEBUG (default false) in the pipeline configuration or README
so users know how to enable workspace preservation for debugging.
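The proposed gate is easy to prototype in shell; KEEP_SLURM_WS_FOR_DEBUG and the echo wording mirror the review's suggestion, while the actual cleanup call is stubbed out here:

```shell
#!/bin/sh
# Returns 0 (clean up) unless KEEP_SLURM_WS_FOR_DEBUG is truthy.
should_clean_up() {
    case "${KEEP_SLURM_WS_FOR_DEBUG:-false}" in
        [Tt]rue|TRUE|1) return 1 ;;   # preserve workspace for debugging
        *)              return 0 ;;   # default: clean up
    esac
}

jobUID="job-1234"   # illustrative
if should_clean_up; then
    echo "cleaning up remote workspace for $jobUID"
else
    echo "KEEP_SLURM_WS_FOR_DEBUG=true; preserving remote workspace for $jobUID"
fi
```

Defaulting to cleanup keeps disk usage bounded; the flag only flips behavior when someone explicitly opts in for a debugging session.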
PR_Github #16890 [ run ] completed with state

/bot run --stage-list "DGX_B200-4_GPUs-PyTorch-Post-Merge-1" --disable-fail-fast

Force-pushed 19ffc0d to adcb573 (Compare)

/bot run --stage-list "DGX_B200-4_GPUs-PyTorch-Post-Merge-1" --disable-fail-fast

PR_Github #16968 [ run ] triggered by Bot

PR_Github #16968 [ run ] completed with state
…convert to Git LFS

consolidate test code for slurm jobs and regular jobs

Signed-off-by: Yuanjing Xue <[email protected]>
Signed-off-by: Yuanjing Xue <[email protected]>
Force-pushed adcb573 to 061fcba (Compare)

/bot run --stage-list "DGX_B200-4_GPUs-PyTorch-Post-Merge-1" --disable-fail-fast

PR_Github #16999 [ run ] triggered by Bot

PR_Github #16999 [ run ] completed with state
Summary by CodeRabbit
New Features
Refactor
Chores
Description
Test Coverage
GitHub Bot Help
/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provide a user-friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message. See details below for each supported subcommand.

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

- --reuse-test (optional)pipeline-id (OPTIONAL): Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline, or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.
- --disable-reuse-test (OPTIONAL): Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensures that all builds and tests run regardless of previous successes.
- --disable-fail-fast (OPTIONAL): Disable fail fast on build/tests/infra failures.
- --skip-test (OPTIONAL): Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.
- --stage-list "A10-PyTorch-1, xxx" (OPTIONAL): Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.
- --gpu-type "A30, H100_PCIe" (OPTIONAL): Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.
- --test-backend "pytorch, cpp" (OPTIONAL): Skip test stages which don't match the specified backends. Only supports [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.
- --only-multi-gpu-test (OPTIONAL): Only run the multi-GPU tests. Note: Does NOT update GitHub check status.
- --disable-multi-gpu-test (OPTIONAL): Disable the multi-GPU tests. Note: Does NOT update GitHub check status.
- --add-multi-gpu-test (OPTIONAL): Force run the multi-GPU tests in addition to running the L0 pre-merge pipeline.
- --post-merge (OPTIONAL): Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.
- --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL): Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".
- --detailed-log (OPTIONAL): Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.
- --debug (OPTIONAL): Experimental feature. Enable access to the CI container for debugging purposes. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md and the scripts/test_to_stage_mapping.py helper.

kill

Kill all running builds associated with the pull request.

skip

skip --comment COMMENT

Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline

Reuse a previous pipeline to validate the current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.