Skip to content

Conversation

@yuanjingx87
Copy link
Collaborator

@yuanjingx87 yuanjingx87 commented Aug 27, 2025

Summary by CodeRabbit

  • New Features

    • Unified SLURM test execution using a scripted remote launch for single- and multi-node runs.
    • Automatic performance sanity checks and generation of comparison reports in perf mode.
    • More robust pytest command assembly supporting test-splitting, perf mode, coverage, and reruns.
  • Refactor

    • Consolidated resource cleanup and unified run staging/labels.
    • Dynamic coverage handling including wheel-derived paths and runtime library path adjustments.
    • Updated test runtime mappings to TRT-LLM targets.
  • Chores

    • Adjusted exported environment variables (added UCX_TLS; removed several legacy exports).

Description

Test Coverage

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provide a user friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will be always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.

kill

kill

Kill all running builds associated with pull request.

skip

skip --comment COMMENT

Skip testing for latest commit on pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

@yuanjingx87 yuanjingx87 requested review from a team as code owners August 27, 2025 19:07
@yuanjingx87 yuanjingx87 marked this pull request as draft August 27, 2025 19:07
@coderabbitai
Copy link
Contributor

coderabbitai bot commented Aug 27, 2025

📝 Walkthrough

Walkthrough

Unifies SLURM test execution into a node-count–aware API, adds helpers to assemble pytest invocations and node args, switches multi-node launch to a generated script executed remotely, remaps runtime labels to TRT-LLM variants, and refactors slurm_run.sh to template/evaluate a single pytestCommand with coverage and perf checks.

Changes

Cohort / File(s) Summary of Changes
Jenkins SLURM orchestration
jenkins/L0_Test.groovy
Renamed and unified cleanup and run entrypoints (cleanUpNodeResources, runLLMTestlistOnSlurm), added getPytestBaseCommandLine and getNodeArgs, switched to generating/copying/executing a slurm_launch.sh for multi-node runs, updated rerun logic, stage labels, and runtime mappings to TRT-LLM variants.
Slurm launch script
jenkins/scripts/slurm_run.sh
Added set_value_in_command helper; replaced array-of-commands with a single templated pytestCommand (injected via placeholders and executed with eval); dynamically generate .coveragerc; compute/inject TRT-LLM wheel path; augment LD_LIBRARY_PATH; log hostname; add perf-mode sanity checks and perf report generation; adjusted exported env vars (add UCX_TLS=^gdr_copy, remove several previous exports).

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant Jenkins as Jenkins (L0_Test.groovy)
  participant Slurm as SlurmCluster
  participant Node as Remote Node (SSH)
  participant Launch as slurm_launch.sh
  participant PyTest as pytest

  Jenkins->>Jenkins: build pytestCommand via getPytestBaseCommandLine
  Jenkins->>Slurm: request allocation (nodeCount, gpu args via getNodeArgs)
  Jenkins->>Node: copy generated slurm_launch.sh and support files
  Jenkins->>Node: SSH -> execute slurm_launch.sh
  Node->>Launch: start, set env, inject placeholders via set_value_in_command
  Launch->>PyTest: eval pytestCommand
  PyTest-->>Launch: exit/status, produce coverage/artifacts
  Launch->>Launch: run optional perf checks, generate report
  Launch-->>Jenkins: exit status and artifacts
  Jenkins->>Slurm: cleanUpNodeResources (release)
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

Suggested reviewers

  • tburt-nv
  • chzblych
✨ Finishing Touches
🧪 Generate unit tests
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share
🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbit in a new review comment at the desired location with your query.
  • PR comments: Tag @coderabbit in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbit gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbit read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.

Support

Need help? Create a ticket on our support page for assistance with any issues or questions.

CodeRabbit Commands (Invoked using PR/Issue comments)

Type @coderabbit help to get the list of available commands.

Other keywords and placeholders

  • Add @coderabbit ignore or @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbit summary or @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbit or @coderabbitai title anywhere in the PR title to generate the title automatically.

Status, Documentation and Community

  • Visit our Status Page to check the current availability of CodeRabbit.
  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

@yuanjingx87
Copy link
Collaborator Author

/bot run --stage-list "DGX_B200-4_GPUs-PyTorch-Post-Merge-1" --disable-fail-fast

@tensorrt-cicd
Copy link
Collaborator

PR_Github #16742 [ run ] triggered by Bot

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 5

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
jenkins/scripts/slurm_run.sh (1)

69-76: Trim values; don’t mangle paths.

Avoid replacing spaces with underscores; just read and prepend cleanly.

-containerPipLLMLibPath=$(pip3 show tensorrt_llm | grep "Location" | awk -F ":" '{ gsub(/ /, "", $2); print $2"/tensorrt_llm/libs"}')
-containerPipLLMLibPath=$(echo "$containerPipLLMLibPath" | sed 's/[[:space:]]+/_/g')
-containerLDLibPath=$LD_LIBRARY_PATH
-containerLDLibPath=$(echo "$containerLDLibPath" | sed 's/[[:space:]]+/_/g')
+containerPipLLMLibPath=$(pip3 show tensorrt_llm | awk -F': ' '/^Location/{gsub(/^[[:space:]]+|[[:space:]]+$/,"",$2); print $2"/tensorrt_llm/libs"}')
+containerLDLibPath="${LD_LIBRARY_PATH}"
 if [[ "$containerLDLibPath" != *"$containerPipLLMLibPath"* ]]; then
   containerLDLibPath="$containerPipLLMLibPath:$containerLDLibPath"
   containerLDLibPath="${containerLDLibPath%:}"
 fi
 export LD_LIBRARY_PATH=$containerLDLibPath
🧹 Nitpick comments (7)
jenkins/scripts/slurm_run.sh (2)

34-34: Install reliability: run apt-get update before installing.

Precede install with update to reduce transient failures.

-    apt-get install -y libffi-dev
+    apt-get update -y && apt-get install -y libffi-dev

81-81: Remove or fix unused variable in log.

$testCase is never set.

-echo "Running: $testCase"
+echo "Running stage: $stageName"
jenkins/L0_Test.groovy (5)

159-170: Validate Slurm task/GPU mapping; consider adding --gpus-per-task=1.

With ceil() on gpusPerNode and --ntasks=${gpuCount}, tasks/GPU mapping can drift on uneven splits. Make binding explicit.

     return nodeCount == 1 ? [
         "--nodes=${nodeCount}",
         "--gpus=${gpuCount}"
     ] : [
         "--nodes=${nodeCount}",
         "--ntasks=${gpuCount}",
         "--ntasks-per-node=${gpusPerNode}",
         "--gpus-per-node=${gpusPerNode}",
+        "--gpus-per-task=1",
     ]

If your Slurm defaults already bind GPUs per task, please confirm.


172-230: Command builder looks solid; minor nit on empty args.

Joining includes empty strings when not detailed logging; harmless but can be filtered to keep logs cleaner.

-        testFilter[(DETAILED_LOG)] ? "-s" : "",
+        *(testFilter[(DETAILED_LOG)] ? ["-s"] : []),

266-287: Ensure slurm_run.sh exists from the unpacked tree.

You extract the tar locally to copy slurm_run.sh. If packaging ever changes, this path breaks. Add a quick existence check to fail early.

-                def scriptRunLocalPath = "${llmSrcLocal}/jenkins/scripts/slurm_run.sh"
+                def scriptRunLocalPath = "${llmSrcLocal}/jenkins/scripts/slurm_run.sh"
+                if (!fileExists(scriptRunLocalPath)) {
+                    error "slurm_run.sh not found at ${scriptRunLocalPath}"
+                }

312-318: Quoting safety for exporting pytestCommand into the launch script.

Long commands with quotes can break the here-doc export. Consider base64-encoding and decoding in slurm_run.sh, or at least escape embedded quotes.

Example (conceptual):

def cmdB64 = pytestCommand.bytes.encodeBase64().toString()
scriptContent = """#!/bin/bash
export pytestCommandB64="${cmdB64}"
...
"""
// In slurm_run.sh: pytestCommand="$(echo "$pytestCommandB64" | base64 -d)"

Also applies to: 337-349


379-387: SSH execution path: add basic retry to handle transient SSH/SRUN flakiness.

Occasional transient remote failures are common; a short retry loop improves robustness.

-                        """bash ${scriptLaunchPathNode}"""
+                        """bash ${scriptLaunchPathNode}""" ,

And wrap Utils.exec with numRetries similar to earlier copy steps.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 8a619be and 6b19ad1.

📒 Files selected for processing (2)
  • jenkins/L0_Test.groovy (10 hunks)
  • jenkins/scripts/slurm_run.sh (3 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.

Applied to files:

  • jenkins/scripts/slurm_run.sh
🪛 Shellcheck (0.10.0)
jenkins/scripts/slurm_run.sh

[warning] 40-40: hostname is referenced but not assigned.

(SC2154)


[warning] 40-40: stageName is referenced but not assigned.

(SC2154)


[warning] 49-49: Use 'cd ... || exit' or 'cd ... || return' in case cd fails.

(SC2164)


[warning] 85-85: perfMode is referenced but not assigned.

(SC2154)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (3)
jenkins/L0_Test.groovy (3)

351-359: MPI flag only for multi-node: LGTM.

Using --mpi=pmi2 conditionally makes sense with the containerized srun flow.

Also applies to: 360-368


1390-1417: Unified pytest command path: LGTM.

The builder plus rerun logic integration looks correct and readable.

Also applies to: 1438-1446


1818-1822: Stage mappings updated to TRT-LLM variants: LGTM.

New x86/SBSA Slurm configs and multi-node stages align with the new runner.

Also applies to: 1847-1851, 1854-1862

Comment on lines 5 to 19
set_value_in_command() {
# Parameters
local key="$1"
local value="$2"
local command="$3"

# Transform the key
local placeholder="__PLACEHOLDER_${key}__"

# Check if placeholder exists
if [[ "$command" != *"$placeholder"* ]]; then
echo "Error: placeholder '$placeholder' not found in the command" >&2
return 1
fi

# Replace all occurrences
local result="${command//${placeholder}/${value}}"

# Return the result
echo "$result"
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

Fail fast on missing placeholders and surface errors.

set_value_in_command returns 1 if not found; callers don’t check, risking an empty pytestCommand. Either exit on error or handle status.

 set_value_in_command() {
@@
-    if [[ "$command" != *"$placeholder"* ]]; then
-        echo "Error: placeholder '$placeholder' not found in the command" >&2
-        return 1
-    fi
+    if [[ "$command" != *"$placeholder"* ]]; then
+        echo "Error: placeholder '$placeholder' not found in the command" >&2
+        exit 1
+    fi
 }

Optionally also enable strict mode near the top:

 #!/bin/bash
+set -euo pipefail

Also applies to: 55-55, 67-67

🤖 Prompt for AI Agents
In jenkins/scripts/slurm_run.sh around lines 5 to 25 (also apply same treatment
at lines 55 and 67), the helper set_value_in_command currently returns a
non-zero status when a placeholder is missing but callers do not check the
return code, which can lead to unset/empty command variables; change the
function to fail-fast by printing the error to stderr and then calling exit 1
(or alternatively propagate the error by returning non-zero and ensure callers
immediately check and exit on failure), and enable strict mode near the top of
the script (set -euo pipefail and optionally IFS=$'\n\t') so any unexpected
failure halts execution; apply the same error-handling change to the other
placeholder-replacement usages at lines 55 and 67 to ensure missing placeholders
abort the run.

git config --global --add safe.directory "*"
gpuUuids=$(nvidia-smi -q | grep "GPU UUID" | awk '{print $4}' | tr '\n' ',' || true)
echo "HOST_NODE_NAME = $HOST_NODE_NAME ; GPU_UUIDS = =$gpuUuids ; STAGE_NAME = $stageName"
echo "HOST_NODE_NAME = $hostname ; GPU_UUIDS = =$gpuUuids ; STAGE_NAME = $stageName"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Fix undefined hostname variable and extra '=' in log line.

Use command substitution; drop the duplicate '='.

-    echo "HOST_NODE_NAME = $hostname ; GPU_UUIDS = =$gpuUuids ; STAGE_NAME = $stageName"
+    echo "HOST_NODE_NAME = $(hostname) ; GPU_UUIDS = $gpuUuids ; STAGE_NAME = $stageName"
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
echo "HOST_NODE_NAME = $hostname ; GPU_UUIDS = =$gpuUuids ; STAGE_NAME = $stageName"
echo "HOST_NODE_NAME = $(hostname) ; GPU_UUIDS = $gpuUuids ; STAGE_NAME = $stageName"
🧰 Tools
🪛 Shellcheck (0.10.0)

[warning] 40-40: hostname is referenced but not assigned.

(SC2154)


[warning] 40-40: stageName is referenced but not assigned.

(SC2154)

🤖 Prompt for AI Agents
In jenkins/scripts/slurm_run.sh around line 40, the echo uses an undefined
$hostname variable and includes an extra '=' before $gpuUuids; replace $hostname
with command substitution (e.g. $(hostname) or $(hostname -s) as appropriate),
remove the duplicated '=' so GPU_UUIDS = $gpuUuids, and optionally quote
variables in the echo to avoid word-splitting (e.g. "$stageName" and
"$gpuUuids").

export LLM_MODELS_ROOT=$MODEL_CACHE_DIR

export UCX_TLS=^gdr_copy
cd $llmSrcNode/tests/integration/defs
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

Ensure directory change failure stops the run.

Guard against a bad path to avoid running tests from the wrong dir.

-cd $llmSrcNode/tests/integration/defs
+cd "$llmSrcNode/tests/integration/defs" || { echo "cd failed: $llmSrcNode/tests/integration/defs" >&2; exit 1; }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
cd $llmSrcNode/tests/integration/defs
cd "$llmSrcNode/tests/integration/defs" || { echo "cd failed: $llmSrcNode/tests/integration/defs" >&2; exit 1; }
🧰 Tools
🪛 Shellcheck (0.10.0)

[warning] 49-49: Use 'cd ... || exit' or 'cd ... || return' in case cd fails.

(SC2164)

🤖 Prompt for AI Agents
In jenkins/scripts/slurm_run.sh around line 49, the script does a plain cd to
$llmSrcNode/tests/integration/defs which will silently continue if the path is
invalid; change it to guard against failure by checking cd's exit status and
aborting the run on error (for example, replace the bare cd with an explicit
check such as attempting cd and on failure printing a clear error to stderr and
exiting with non‑zero status).

Comment on lines 52 to +50
trtllmWhlPath=$(pip3 show tensorrt_llm | grep Location | cut -d ' ' -f 2)
trtllmWhlPath=$(echo "$trtllmWhlPath" | sed 's/[[:space:]]+/_/g')
echo "TRTLLM WHEEL PATH: $trtllmWhlPath"
sed -i "s|---wheel_path---|$trtllmWhlPath|g" "$coverageConfigFile"
testCmdLines+=(
"--cov=$llmSrcNode/examples/"
"--cov=$llmSrcNode/tensorrt_llm/"
"--cov=$trtllmWhlPath/tensorrt_llm/"
"--cov-report="
"--cov-config=$coverageConfigFile"
)
pytestCommand=$(set_value_in_command "TRTLLM_WHL_PATH" "$trtllmWhlPath" "$pytestCommand")

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

Robustly resolve tensorrt_llm location; current sed regex is ineffective.

sed 's/[[:space:]]+/_/g' treats + literally; whitespace isn’t trimmed. Use awk to extract/trim and fail fast if empty.

-trtllmWhlPath=$(pip3 show tensorrt_llm | grep Location | cut -d ' ' -f 2)
-trtllmWhlPath=$(echo "$trtllmWhlPath" | sed 's/[[:space:]]+/_/g')
+trtllmWhlPath=$(pip3 show tensorrt_llm | awk -F': ' '/^Location/{gsub(/^[[:space:]]+|[[:space:]]+$/,"",$2); print $2}')
+[ -n "$trtllmWhlPath" ] || { echo "Failed to resolve tensorrt_llm wheel path" >&2; exit 1; }
 pytestCommand=$(set_value_in_command "TRTLLM_WHL_PATH" "$trtllmWhlPath" "$pytestCommand")
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
trtllmWhlPath=$(pip3 show tensorrt_llm | grep Location | cut -d ' ' -f 2)
trtllmWhlPath=$(echo "$trtllmWhlPath" | sed 's/[[:space:]]+/_/g')
echo "TRTLLM WHEEL PATH: $trtllmWhlPath"
sed -i "s|---wheel_path---|$trtllmWhlPath|g" "$coverageConfigFile"
testCmdLines+=(
"--cov=$llmSrcNode/examples/"
"--cov=$llmSrcNode/tensorrt_llm/"
"--cov=$trtllmWhlPath/tensorrt_llm/"
"--cov-report="
"--cov-config=$coverageConfigFile"
)
pytestCommand=$(set_value_in_command "TRTLLM_WHL_PATH" "$trtllmWhlPath" "$pytestCommand")
trtllmWhlPath=$(pip3 show tensorrt_llm \
| awk -F': ' '/^Location/{gsub(/^[[:space:]]+|[[:space:]]+$/,"",$2); print $2}')
[ -n "$trtllmWhlPath" ] || { echo "Failed to resolve tensorrt_llm wheel path" >&2; exit 1; }
echo "TRTLLM WHEEL PATH: $trtllmWhlPath"
pytestCommand=$(set_value_in_command "TRTLLM_WHL_PATH" "$trtllmWhlPath" "$pytestCommand")
🤖 Prompt for AI Agents
In jenkins/scripts/slurm_run.sh around lines 52-56, the sed invocation is wrong
(the '+' is literal) and doesn't reliably trim or extract the Location; replace
it with an awk-based extraction and trimming pipeline that finds the "Location:"
field from pip3 show, prints the value, and optionally replaces internal
whitespace with underscores if needed, then immediately fail fast (exit non-zero
with an error log) if the resulting variable is empty before continuing to set
pytestCommand. Ensure the script logs the resolved path and exits with a clear
error message when tensorrt_llm Location cannot be determined.

echo "Full Command: $fullCmd"
eval $fullCmd
echo "Full Command: $pytestCommand"
eval $pytestCommand
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

Avoid eval; run the command safely while preserving env assignments.

eval risks unintended expansion. bash -lc is safer here.

-eval $pytestCommand
+bash -lc "$pytestCommand"
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
eval $pytestCommand
bash -lc "$pytestCommand"
🤖 Prompt for AI Agents
In jenkins/scripts/slurm_run.sh around line 83, avoid using eval to run the
pytestCommand because eval can cause unintended expansion; instead invoke a new
bash shell with -l -c and pass the entire pytestCommand string as a single
argument so environment variable assignments in the string are preserved and not
subject to eval expansion. Replace the eval usage with a call that runs the
command under bash -lc, ensuring the pytestCommand is passed as one quoted
argument and not re-evaluated by the current shell. Ensure any needed
environment assignments remain inside pytestCommand or are exported beforehand
so they are visible to the new shell.

@tensorrt-cicd
Copy link
Collaborator

PR_Github #16742 [ run ] completed with state FAILURE
/LLM/main/L0_MergeRequest_PR pipeline #12565 (Partly Tested) completed with status: 'FAILURE'

@yuanjingx87
Copy link
Collaborator Author

/bot run --stage-list "DGX_B200-4_GPUs-PyTorch-Post-Merge-1" --disable-fail-fast

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (1)
tests/integration/test_lists/test-db/l0_dgx_b200.yml (1)

97-101: Normalize GPTOSS backend tags to lowercase

Uppercase backend tags (CUTLASS, TRITON, TRTLLM) are still present in l0_dgx_b200.yml and will cause parameter mismatches in the updated runner. Please apply the following changes:

Files/locations to update:

  • tests/integration/test_lists/test-db/l0_dgx_b200.yml, lines 97–101
-  - accuracy/test_llm_api_pytorch.py::TestGPTOSS::test_w4_4gpus[tp4-CUTLASS]
-  - accuracy/test_llm_api_pytorch.py::TestGPTOSS::test_w4_4gpus[tp4-TRITON]
-  - accuracy/test_llm_api_pytorch.py::TestGPTOSS::test_w4_4gpus[ep4-TRTLLM]
-  - accuracy/test_llm_api_pytorch.py::TestGPTOSS::test_w4_4gpus[dp4-CUTLASS]
-  - accuracy/test_llm_api_pytorch.py::TestGPTOSS::test_w4_4gpus[dp4-TRITON]
+  - accuracy/test_llm_api_pytorch.py::TestGPTOSS::test_w4_4gpus[tp4-cutlass]
+  - accuracy/test_llm_api_pytorch.py::TestGPTOSS::test_w4_4gpus[tp4-triton]
+  - accuracy/test_llm_api_pytorch.py::TestGPTOSS::test_w4_4gpus[ep4-trtllm]
+  - accuracy/test_llm_api_pytorch.py::TestGPTOSS::test_w4_4gpus[dp4-cutlass]
+  - accuracy/test_llm_api_pytorch.py::TestGPTOSS::test_w4_4gpus[dp4-triton]
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 6b19ad1 and e69b735.

📒 Files selected for processing (1)
  • tests/integration/test_lists/test-db/l0_dgx_b200.yml (1 hunks)
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-08-26T09:49:04.956Z
Learnt from: pengbowang-nv
PR: NVIDIA/TensorRT-LLM#7192
File: tests/integration/test_lists/test-db/l0_dgx_b200.yml:56-72
Timestamp: 2025-08-26T09:49:04.956Z
Learning: In TensorRT-LLM test configuration files, the test scheduling system handles wildcard matching with special rules that prevent duplicate test execution even when the same tests appear in multiple yaml files with overlapping GPU wildcards (e.g., "*b200*" and "*gb200*").

Applied to files:

  • tests/integration/test_lists/test-db/l0_dgx_b200.yml
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.

Applied to files:

  • tests/integration/test_lists/test-db/l0_dgx_b200.yml
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (1)
tests/integration/test_lists/test-db/l0_dgx_b200.yml (1)

44-47: Good normalization of backend tags to lowercase.

The tp/ep/dp backends for TestGPTOSS were lowercased (trtllm, cutlass, triton), which aligns with recent runner naming unification and reduces case-sensitivity pitfalls.

@tensorrt-cicd
Copy link
Collaborator

PR_Github #16749 [ run ] triggered by Bot

@tensorrt-cicd
Copy link
Collaborator

PR_Github #16749 [ run ] completed with state FAILURE
/LLM/main/L0_MergeRequest_PR pipeline #12571 (Partly Tested) completed with status: 'FAILURE'

@yuanjingx87 yuanjingx87 force-pushed the user/yuanjingx/slurm_test_tmp branch from e69b735 to eeb92e1 Compare August 28, 2025 07:33
@yuanjingx87
Copy link
Collaborator Author

/bot run --stage-list "DGX_B200-4_GPUs-PyTorch-Post-Merge-1" --disable-fail-fast

@tensorrt-cicd
Copy link
Collaborator

PR_Github #16818 [ run ] triggered by Bot

@yuanjingx87
Copy link
Collaborator Author

/bot run --stage-list "DGX_B200-4_GPUs-PyTorch-Post-Merge-1" --disable-fail-fast

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
jenkins/scripts/slurm_run.sh (1)

69-79: Fix LD_LIBRARY_PATH augmentation; avoid broken sed and ensure idempotence.

-containerPipLLMLibPath=$(pip3 show tensorrt_llm | grep "Location" | awk -F ":" '{ gsub(/ /, "", $2); print $2"/tensorrt_llm/libs"}')
-containerPipLLMLibPath=$(echo "$containerPipLLMLibPath" | sed 's/[[:space:]]+/_/g')
-containerLDLibPath=$LD_LIBRARY_PATH
-containerLDLibPath=$(echo "$containerLDLibPath" | sed 's/[[:space:]]+/_/g')
-if [[ "$containerLDLibPath" != *"$containerPipLLMLibPath"* ]]; then
-  containerLDLibPath="$containerPipLLMLibPath:$containerLDLibPath"
-  containerLDLibPath="${containerLDLibPath%:}"
-fi
-export LD_LIBRARY_PATH=$containerLDLibPath
+containerPipLLMLibPath=$(pip3 show tensorrt_llm | awk -F': ' '/^Location/{gsub(/^[[:space:]]+|[[:space:]]+$/,"",$2); print $2"/tensorrt_llm/libs"}')
+containerLDLibPath="${LD_LIBRARY_PATH:-}"
+case ":$containerLDLibPath:" in
+  *":$containerPipLLMLibPath:"*) ;;
+  *) containerLDLibPath="$containerPipLLMLibPath${containerLDLibPath:+:$containerLDLibPath}";;
+esac
+export LD_LIBRARY_PATH="$containerLDLibPath"
♻️ Duplicate comments (5)
jenkins/scripts/slurm_run.sh (5)

1-4: Fail fast on placeholder errors and enable strict mode.

Uncaught errors from set_value_in_command can silently corrupt pytestCommand. Also adopt strict mode.

 #!/bin/bash
+set -euo pipefail
+IFS=$'\n\t'
 
 set_value_in_command() {
@@
-    if [[ "$command" != *"$placeholder"* ]]; then
-        echo "Error: placeholder '$placeholder' not found in the command" >&2
-        return 1
-    fi
+    if [[ "$command" != *"$placeholder"* ]]; then
+        echo "Error: placeholder '$placeholder' not found in the command" >&2
+        exit 1
+    fi
 }

Also applies to: 5-25


40-40: Fix undefined $hostname and extra '='; quote values.

Use command substitution and remove the duplicate equals.

-    echo "HOST_NODE_NAME = $hostname ; GPU_UUIDS = =$gpuUuids ; STAGE_NAME = $stageName"
+    echo "HOST_NODE_NAME = $(hostname) ; GPU_UUIDS = ${gpuUuids} ; STAGE_NAME = ${stageName:-unknown}"

49-49: Guard cd to fail early if path is wrong.

-cd $llmSrcNode/tests/integration/defs
+cd "$llmSrcNode/tests/integration/defs" || { echo "cd failed: $llmSrcNode/tests/integration/defs" >&2; exit 1; }

52-56: Robustly resolve TRT-LLM wheel location; current sed is ineffective.

-trtllmWhlPath=$(pip3 show tensorrt_llm | grep Location | cut -d ' ' -f 2)
-trtllmWhlPath=$(echo "$trtllmWhlPath" | sed 's/[[:space:]]+/_/g')
-echo "TRTLLM WHEEL PATH: $trtllmWhlPath"
+trtllmWhlPath=$(pip3 show tensorrt_llm | awk -F': ' '/^Location/{gsub(/^[[:space:]]+|[[:space:]]+$/,"",$2); print $2}')
+[ -n "$trtllmWhlPath" ] || { echo "Failed to resolve tensorrt_llm wheel path" >&2; exit 1; }
+echo "TRTLLM WHEEL PATH: ${trtllmWhlPath}"
 pytestCommand=$(set_value_in_command "TRTLLM_WHL_PATH" "$trtllmWhlPath" "$pytestCommand")

82-83: Avoid eval; execute safely preserving env assignments.

-echo "Full Command: $pytestCommand"
-eval $pytestCommand
+echo "Full Command: $pytestCommand"
+bash -lc "$pytestCommand"
🧹 Nitpick comments (5)
jenkins/scripts/slurm_run.sh (2)

57-68: Quoting and path safety for generated .coveragerc.

Use quotes to avoid word-splitting; keep consistent placeholder casing.

-cat << EOF > $jobWorkspace/.coveragerc
+cat << EOF > "$jobWorkspace/.coveragerc"
@@
-data_file = $jobWorkspace/.coverage.$stageName
+data_file = $jobWorkspace/.coverage.$stageName
@@
-    $llmSrcNode/tensorrt_llm/
-    $trtllmWhlPath/tensorrt_llm/
+    $llmSrcNode/tensorrt_llm/
+    $trtllmWhlPath/tensorrt_llm/
 EOF
-pytestCommand=$(set_value_in_command "coverageConfigFile" "$jobWorkspace/.coveragerc" "$pytestCommand")
+pytestCommand=$(set_value_in_command "coverageConfigFile" "$jobWorkspace/.coveragerc" "$pytestCommand")

81-81: Drop or set unused $testCase.

Variable is undefined; the log line adds noise.

-echo "Running: $testCase"
+# echo "Running test case: $testCase" # Define testCase if needed or remove this line.
jenkins/L0_Test.groovy (3)

172-199: Quoting in extraInternalEnv can break downstream export; keep env assigns simple.

The token __LUNOWUD="...": when embedded into a double-quoted export later, it breaks quoting. Safer to avoid inner quotes in env assigns or ensure the command is injected as a file (see below).

-    extraInternalEnv = "__LUNOWUD=\"-thread_pool_size=${TESTER_CORES}\""
+    extraInternalEnv = "__LUNOWUD=-thread_pool_size=${TESTER_CORES}"

If you must keep the quotes, apply the “write command to file” fix in the launch script comment below.

Also applies to: 200-229


379-387: Run launcher with explicit shell and error propagation.

Use bash -lc to preserve env and stop on failure of srun.

-                        """bash ${scriptLaunchPathNode}"""
+                        """bash -lc '${scriptLaunchPathNode}'"""

160-170: Minor: prefer RoundingMode for BigDecimal.

ROUND_CEILING is deprecated; not blocking but easy to modernize.

-    int gpusPerNode = ((gpuCount / nodeCount) as BigDecimal).setScale(0, BigDecimal.ROUND_CEILING).intValue()
+    int gpusPerNode = ((gpuCount / nodeCount) as BigDecimal).setScale(0, java.math.RoundingMode.CEILING).intValue()
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between e69b735 and eeb92e1.

📒 Files selected for processing (2)
  • jenkins/L0_Test.groovy (10 hunks)
  • jenkins/scripts/slurm_run.sh (3 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.

Applied to files:

  • jenkins/scripts/slurm_run.sh
🪛 Shellcheck (0.10.0)
jenkins/scripts/slurm_run.sh

[warning] 40-40: hostname is referenced but not assigned.

(SC2154)


[warning] 40-40: stageName is referenced but not assigned.

(SC2154)


[warning] 49-49: Use 'cd ... || exit' or 'cd ... || return' in case cd fails.

(SC2164)


[warning] 85-85: perfMode is referenced but not assigned.

(SC2154)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (3)
jenkins/L0_Test.groovy (3)

161-170: Node/gpu arg calculation looks good.

Ceiling split per node and per-task settings make sense for uneven GPU distributions.


1406-1417: Good: local (non-SLURM) pytest command assembly and rerun flow are coherent.

The coverage config, LD_LIBRARY_PATH setup, and rerun integration are consistent.

Also applies to: 1439-1446


1818-1822: Verify new Slurm stage keys in bot stage-list and CI mapping
Ensure the DGX_B200-4_GPUs-PyTorch-{1,2,Post-Merge-1,Post-Merge-2} entries appear in /bot stage-list and in your CI’s test_to_stage_mapping; missing entries will trigger “stage not found.”

Comment on lines +324 to 375
def pytestCommand = getPytestBaseCommandLine(
llmSrcNode,
stageName,
testListPathNode,
splits,
splitId,
perfMode,
jobWorkspace,
"__PLACEHOLDER_TRTLLM_WHL_PATH__",
"__PLACEHOLDER_coverageConfigFile__",
pytestUtil
).join(" ")

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

⚠️ Potential issue

Critical: exporting pytestCommand with embedded quotes corrupts the script; write command to a file and read it at runtime.

The generated command contains quotes (e.g., __LUNOWUD="..."), which terminates export pytestCommand="...". This is a likely cause of the CI failures (e.g., DGX_B200-4_GPUs-PyTorch-Post-Merge-1). Persist the command via here-doc and read it in slurm_run.sh.

-                def pytestCommand = getPytestBaseCommandLine(
+                def pytestCommand = getPytestBaseCommandLine(
                     llmSrcNode,
@@
                 ).join(" ")
@@
-                def scriptContent = """#!/bin/bash
+                def scriptContent = """#!/bin/bash
                     export jobWorkspace=$jobWorkspace
@@
-                    export nodeCount=$nodeCount
-                    export pytestCommand="$pytestCommand"
+                    export nodeCount=$nodeCount
+                    cat > "$jobWorkspace/.pytest_cmd" <<'PYTEST_CMD'
+${pytestCommand}
+PYTEST_CMD
+                    export pytestCommand="$(cat "$jobWorkspace/.pytest_cmd")"
                     export NVIDIA_IMEX_CHANNELS=0
                     chmod +x $scriptRunNode
                 """

No change needed in slurm_run.sh beyond replacing eval with bash -lc (already suggested). Optional: slurm_run.sh can prefer reading from $jobWorkspace/.pytest_cmd directly.

Also applies to: 337-350, 351-359, 360-368, 369-371

🤖 Prompt for AI Agents
In jenkins/L0_Test.groovy around lines 324-336 (and similarly for ranges
337-350, 351-359, 360-368, 369-371), the script currently exports pytestCommand
as a quoted string which gets corrupted by embedded quotes; instead write the
full generated command to a file in the job workspace (e.g., create jobWorkspace
+ "/.pytest_cmd" using a here-doc or direct write) and export only the path or a
simple variable, then ensure slurm_run.sh reads and executes that file at
runtime (replace any eval usage with bash -lc "$(<$jobWorkspace/.pytest_cmd)" or
have slurm_run.sh source/exec the file); update the code that builds
pytestCommand to write the command to the file and not attempt to embed complex
quoted content into a single exported environment variable.

Comment on lines +85 to +95
if [ "$perfMode" = "true" ]; then
if [[ "$stageName" == *PyTorch* ]]; then
basePerfFilename="base_perf_pytorch.csv"
else
basePerfFilename="base_perf.csv"
fi
basePerfPath="$llmSrcNode/tests/integration/defs/perf/$basePerfFilename"
echo "Check perf result"
python3 $llmSrcNode/tests/integration/defs/perf/sanity_perf_check.py \
$stageName/perf_script_test_results.csv \
$basePerfPath
echo "Check perf report"
python3 $llmSrcNode/tests/integration/defs/perf/create_perf_comparison_report.py \
--output_path $stageName/report.pdf \
--files $stageName/perf_script_test_results.csv \
$basePerfPath
fi
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Fix perf paths: use jobWorkspace instead of stageName on SLURM node.

Results are written under $jobWorkspace; current paths won’t exist remotely.

-    basePerfPath="$llmSrcNode/tests/integration/defs/perf/$basePerfFilename"
+    basePerfPath="$llmSrcNode/tests/integration/defs/perf/$basePerfFilename"
@@
-    python3 $llmSrcNode/tests/integration/defs/perf/sanity_perf_check.py \
-        $stageName/perf_script_test_results.csv \
-        $basePerfPath
+    python3 "$llmSrcNode/tests/integration/defs/perf/sanity_perf_check.py" \
+        "$jobWorkspace/perf_script_test_results.csv" \
+        "$basePerfPath"
@@
-    python3 $llmSrcNode/tests/integration/defs/perf/create_perf_comparison_report.py \
-        --output_path $stageName/report.pdf \
-        --files $stageName/perf_script_test_results.csv \
-        $basePerfPath
+    python3 "$llmSrcNode/tests/integration/defs/perf/create_perf_comparison_report.py" \
+        --output_path "$jobWorkspace/report.pdf" \
+        --files "$jobWorkspace/perf_script_test_results.csv" \
+        "$basePerfPath"
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
if [ "$perfMode" = "true" ]; then
if [[ "$stageName" == *PyTorch* ]]; then
basePerfFilename="base_perf_pytorch.csv"
else
basePerfFilename="base_perf.csv"
fi
basePerfPath="$llmSrcNode/tests/integration/defs/perf/$basePerfFilename"
echo "Check perf result"
python3 $llmSrcNode/tests/integration/defs/perf/sanity_perf_check.py \
$stageName/perf_script_test_results.csv \
$basePerfPath
echo "Check perf report"
python3 $llmSrcNode/tests/integration/defs/perf/create_perf_comparison_report.py \
--output_path $stageName/report.pdf \
--files $stageName/perf_script_test_results.csv \
$basePerfPath
fi
if [ "$perfMode" = "true" ]; then
if [[ "$stageName" == *PyTorch* ]]; then
basePerfFilename="base_perf_pytorch.csv"
else
basePerfFilename="base_perf.csv"
fi
basePerfPath="$llmSrcNode/tests/integration/defs/perf/$basePerfFilename"
echo "Check perf result"
python3 "$llmSrcNode/tests/integration/defs/perf/sanity_perf_check.py" \
"$jobWorkspace/perf_script_test_results.csv" \
"$basePerfPath"
echo "Check perf report"
python3 "$llmSrcNode/tests/integration/defs/perf/create_perf_comparison_report.py" \
--output_path "$jobWorkspace/report.pdf" \
--files "$jobWorkspace/perf_script_test_results.csv" \
"$basePerfPath"
fi
🧰 Tools
🪛 Shellcheck (0.10.0)

[warning] 85-85: perfMode is referenced but not assigned.

(SC2154)

🤖 Prompt for AI Agents
In jenkins/scripts/slurm_run.sh around lines 85 to 101, the perf script
currently uses $stageName for result and report paths which do not exist on the
SLURM node; change those references to $jobWorkspace (e.g. use
"$jobWorkspace/perf_script_test_results.csv" and "$jobWorkspace/report.pdf") and
update both python calls to pass the basePerfPath and jobWorkspace paths,
quoting variables to handle spaces, so the sanity_perf_check and
create_perf_comparison_report operate on files under the remote job workspace.

@tensorrt-cicd
Copy link
Collaborator

PR_Github #16820 [ run ] triggered by Bot

@tensorrt-cicd
Copy link
Collaborator

PR_Github #16818 [ run ] completed with state ABORTED

@tensorrt-cicd
Copy link
Collaborator

PR_Github #16820 [ run ] completed with state FAILURE
/LLM/main/L0_MergeRequest_PR pipeline #12627 (Partly Tested) completed with status: 'FAILURE'

@yuanjingx87
Copy link
Collaborator Author

/bot run --stage-list "DGX_B200-4_GPUs-PyTorch-Post-Merge-1" --disable-fail-fast

@tensorrt-cicd
Copy link
Collaborator

PR_Github #16880 [ run ] triggered by Bot

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

♻️ Duplicate comments (1)
jenkins/L0_Test.groovy (1)

324-336: Critical: exporting pytestCommand with embedded quotes corrupts the script; persist to file and read it at runtime

Joining the command and embedding it into export pytestCommand="..." breaks when the content itself contains quotes (e.g., __LUNOWUD="..."). This matches the CI failures on DGX_B200 stages. Write the command to a file and read it back to avoid quoting/escaping hell. Also enable strict bash options.

-                def pytestCommand = getPytestBaseCommandLine(
+                def pytestCommand = getPytestBaseCommandLine(
                     llmSrcNode,
                     stageName,
                     testListPathNode,
                     splits,
                     splitId,
                     perfMode,
                     jobWorkspace,
                     "__PLACEHOLDER_TRTLLM_WHL_PATH__",
                     "__PLACEHOLDER_coverageConfigFile__",
                     pytestUtil
                 ).join(" ")
@@
-                def scriptContent = """#!/bin/bash
+                def scriptContent = """#!/bin/bash
+                    set -euo pipefail
                     export jobWorkspace=$jobWorkspace
                     export tarName=$tarName
                     export llmTarfile=$llmTarfile
                     export llmSrcNode=$llmSrcNode
                     export stageName=$stageName
                     export perfMode=$perfMode
                     export resourcePathNode=$resourcePathNode
                     export nodeCount=$nodeCount
-                    export pytestCommand="$pytestCommand"
+                    # Persist full pytest command to avoid export-time quoting issues
+                    cat > "$jobWorkspace/.pytest_cmd" <<'PYTEST_CMD'
+${pytestCommand}
+PYTEST_CMD
+                    export PYTEST_CMD_FILE="$jobWorkspace/.pytest_cmd"
                     export NVIDIA_IMEX_CHANNELS=0
                     chmod +x $scriptRunNode
                 """

Follow-up: slurm_run.sh should prefer reading and executing the command from $PYTEST_CMD_FILE if set (or replace eval with bash -lc "$(<"$PYTEST_CMD_FILE")"). I can send a PR to slurm_run.sh if needed.

#!/bin/bash
# Show vulnerable exports and ensure the new file-based approach is present
rg -nP --type=groovy -C2 'export\s+pytestCommand="?[^"]*"\$' jenkins/L0_Test.groovy
rg -nP --type=groovy -C2 '\.pytest_cmd' jenkins/L0_Test.groovy
# Inspect slurm_run.sh for 'eval' vs 'bash -lc' and optional file read
rg -n 'pytestCommand' jenkins/scripts/slurm_run.sh
rg -n 'eval' jenkins/scripts/slurm_run.sh

Also applies to: 337-371

🧹 Nitpick comments (6)
jenkins/L0_Test.groovy (6)

1-1: Avoid pinning to a personal/dev shared-lib branch in mainline CI

Use a stable tag/branch or make it configurable; dev branches can break CI unpredictably.


135-157: Cleanup should be best-effort and non-fatal

If the workspace was never created or already removed, rm -rf errors could fail the stage. Add || true.

-                    "rm -rf /home/svc_tensorrt/bloom/scripts/${jobUID}"
+                    "rm -rf /home/svc_tensorrt/bloom/scripts/${jobUID} || true"

161-170: Slurm args: single-node should use --gpus-per-node; uneven GPU splits need verification

  • For nodeCount == 1, prefer --gpus-per-node to align with containerized srun setups.
  • With uneven splits (e.g., 5 GPUs over 2 nodes), ceiled gpusPerNode=3 with --ntasks=5 and --ntasks-per-node=3 might overconstrain placement. Validate on your cluster.
-    return nodeCount == 1 ? [
-        "--nodes=${nodeCount}",
-        "--gpus=${gpuCount}"
-    ] : [
+    return nodeCount == 1 ? [
+        "--nodes=${nodeCount}",
+        "--gpus-per-node=${gpuCount}"
+    ] : [
         "--nodes=${nodeCount}",
         "--ntasks=${gpuCount}",
         "--ntasks-per-node=${gpusPerNode}",
         "--gpus-per-node=${gpusPerNode}",
     ]

172-230: getPytestBaseCommandLine: avoid hidden global state and fragile quoting

  • This method reads testFilter[(DETAILED_LOG)] implicitly. Prefer an explicit parameter for determinism and testability.
  • Returning a token array is good; the later export/quoting is where it breaks (see next comment).

If you want to decouple from globals with minimal churn:

-def getPytestBaseCommandLine(
+def getPytestBaseCommandLine(
     String llmSrc,
     String stageName,
     String testDBList,
     int splits,
     int split_id,
     Boolean perfMode,
     String outputPath,
     String trtllmWheelPath,
     String coverageConfigFile,
-    String pytestUtil = ""
+    String pytestUtil = "",
+    Boolean detailedLog = testFilter[(DETAILED_LOG)]
 ) {
@@
-        testFilter[(DETAILED_LOG)] ? "-s" : "",
+        detailedLog ? "-s" : "",

350-368: Harden launch script and remove unused local var

  • Add strict bash flags (already added above).
  • The local outputPath variable is unused; remove it.
-                } else {
-                    String outputPath = "${jobWorkspace}/job-output.log"
+                } else {
                     taskArgs = [
                         *taskArgs,
                     ]

232-233: Unused parameters in runLLMTestlistOnSlurm

skipInstallWheel and cpver are unused in this method. Either use them or drop to avoid confusion.

-def runLLMTestlistOnSlurm(pipeline, platform, testList, config=VANILLA_CONFIG, perfMode=false, stageName="Undefined", splitId=1, splits=1, gpuCount=1, nodeCount=1, skipInstallWheel=false, cpver="cp312")
+def runLLMTestlistOnSlurm(pipeline, platform, testList, config=VANILLA_CONFIG, perfMode=false, stageName="Undefined", splitId=1, splits=1, gpuCount=1, nodeCount=1)
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between eeb92e1 and 2e08294.

📒 Files selected for processing (1)
  • jenkins/L0_Test.groovy (11 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (10)
jenkins/L0_Test.groovy (10)

266-288: LGTM: remote script staging

Copying slurm_run.sh and the rendered test list to the FE node is clear and retriable.


306-318: LGTM: containerized srun assembly

Using --container-image, --container-workdir, and bind mounts looks correct for trt-llm containerized srun.


319-335: PyTest launcher selection for multi-node is reasonable

Using trtllm-llmapi-launch only when nodeCount > 1 makes sense.


378-387: LGTM: remote execution

Executing the pre-staged launcher over SSH is straightforward and retriable via Utils.exec.


391-392: LGTM: result upload then cleanup in finally

Ordering is correct; ensures artifacts are collected even on failures.


1389-1417: LGTM: on-box pytest assembly with coverage

Local path uses array form (no export), avoiding quoting issues. Coverage config generation looks fine.


1438-1441: LGTM: rerun flow integration

Running pytest, then conditional rerun via parsed fail signatures is wired correctly.

Also applies to: 1445-1445


1818-1822: Verify new DGX_B200 Slurm mapping and partition label

Ensure "b200-trtllm" exists in SlurmConfig and the hardware matches 4-GPU expectations. These stages are the ones failing via /bot; fix in quoting (above) should unblock, but please double-check partition naming.


1847-1850: Verify GB200 single-node Slurm mapping

Confirm "gb200-trtllm" partition is available and correctly wired to the intended rack/pool.


1854-1862: Multi-node SBSA configs: sanity-check MPI settings and resource math

  • Using --mpi=pmi2 above: confirm PMI2 is available on these clusters.
  • 8 GPUs across 2 nodes implies 4 per-node; ensure getNodeArgs and cluster constraints match.

Also applies to: 1892-1893

@tensorrt-cicd
Copy link
Collaborator

PR_Github #16880 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #12677 (Partly Tested) completed with status: 'SUCCESS'

@yuanjingx87
Copy link
Collaborator Author

/bot run --stage-list "DGX_B200-4_GPUs-PyTorch-Post-Merge-1" --disable-fail-fast

@tensorrt-cicd
Copy link
Collaborator

PR_Github #16890 [ run ] triggered by Bot

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

♻️ Duplicate comments (1)
jenkins/L0_Test.groovy (1)

324-336: Critical: exporting pytestCommand with embedded quotes will break the script; persist to a file and consume at runtime.

The command contains quotes (e.g., __LUNOWUD="..."), which corrupts export pytestCommand="...". Write the full command to a file and have slurm_run.sh read/execute it.

-                def pytestCommand = getPytestBaseCommandLine(
+                def pytestCommand = getPytestBaseCommandLine(
@@
-                def scriptContent = """#!/bin/bash
+                def scriptContent = """#!/bin/bash
+                    set -euo pipefail
                     export jobWorkspace=$jobWorkspace
@@
-                    export nodeCount=$nodeCount
-                    export pytestCommand="$pytestCommand"
-                    export NVIDIA_IMEX_CHANNELS=0
-                    chmod +x $scriptRunNode
+                    export nodeCount=$nodeCount
+                    export NVIDIA_IMEX_CHANNELS=0
+                    # Persist pytest command verbatim to avoid quote corruption
+                    cat > "$jobWorkspace/.pytest_cmd" <<'PYTEST_CMD'
+${pytestCommand}
+PYTEST_CMD
+                    export PYTEST_CMD_FILE="$jobWorkspace/.pytest_cmd"
+                    chmod +x "$scriptRunNode"
@@
-                scriptContent = scriptContent.replaceAll('\t','').stripIndent()
+                scriptContent = scriptContent.stripIndent()

Companion change required in jenkins/scripts/slurm_run.sh (outside this file):

  • Replace any eval usage with: bash -lc "$(<"${PYTEST_CMD_FILE}")"
  • Or read the file into a variable and execute with bash -lc "$cmd".
#!/bin/bash
# Verify slurm_run.sh does not use eval and can read PYTEST_CMD_FILE
rg -nC2 -e 'eval\s' -e 'PYTEST_CMD_FILE' jenkins/scripts/slurm_run.sh || true

Also applies to: 337-349, 350-368, 369-371

🧹 Nitpick comments (4)
jenkins/L0_Test.groovy (4)

172-230: Make getPytestBaseCommandLine robust: drop empty args; avoid hidden globals.

  • Filter out empty strings so argv is clean (pytestUtil may be empty).
  • Optional (nice): pass a detailedLog flag instead of reading global testFilter.
 def getPytestBaseCommandLine(
@@
-        testFilter[(DETAILED_LOG)] ? "-s" : "",
+        testFilter[(DETAILED_LOG)] ? "-s" : "",
@@
-    return testCmdLine as String[]
+    // remove blanks/nulls
+    testCmdLine = testCmdLine.findAll { it?.toString().trim() }
+    return testCmdLine as String[]

319-323: Avoid injecting empty argv element for pytestUtil.

Minor cleanup: don’t pass a blank arg.

-                String pytestUtil = ""
-                if (nodeCount > 1) {
-                    pytestUtil = "$llmSrcNode/tensorrt_llm/llmapi/trtllm-llmapi-launch"
-                }
+                String pytestUtil = (nodeCount > 1) ? "$llmSrcNode/tensorrt_llm/llmapi/trtllm-llmapi-launch" : null

(Note: with the previous argv filtering in getPytestBaseCommandLine, nulls/blanks are dropped.)


360-367: Remove unused variable.

outputPath is unused.

-                    String outputPath = "${jobWorkspace}/job-output.log"

1405-1417: Local run: argv construction is good; optional: align with file-based command execution for parity.

Not required, but you can mirror the file-based approach here for consistency and easier debugging.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 2e08294 and 19ffc0d.

📒 Files selected for processing (1)
  • jenkins/L0_Test.groovy (11 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (1)
jenkins/L0_Test.groovy (1)

1818-1822: Ensure SlurmConfig & mapping script support new b200-trtllm/gb200-trtllm keys. Add these labels to SlurmConfig.partitionConfig and clusterConfig, and verify the test-to-stage mapping script (path tests/integration/defs/scripts/test_to_stage_mapping.py appears missing) correctly resolves the keys at L0_Test.groovy lines 1818–1822, 1847–1850, 1855–1862 and 1892–1893.

@@ -1,4 +1,4 @@
@Library(['bloom-jenkins-shared-lib@main', 'trtllm-jenkins-shared-lib@main']) _
@Library(['bloom-jenkins-shared-lib@dev-yuanjingx-slurm_refactor', 'trtllm-jenkins-shared-lib@main']) _
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Verification agent

🧩 Analysis chain

Do not merge with a dev shared library reference. Revert to main.

Pinning to a feature branch will break main after merge.

-@Library(['bloom-jenkins-shared-lib@dev-yuanjingx-slurm_refactor', 'trtllm-jenkins-shared-lib@main']) _
+@Library(['bloom-jenkins-shared-lib@main', 'trtllm-jenkins-shared-lib@main']) _

🏁 Script executed:

#!/bin/bash
# Verify no other dev shared-lib references exist
grep -R "dev-yuanjingx-slurm_refactor" -n .
# Confirm @Library usage in the Jenkinsfile under review
grep -R "@Library" -n jenkins/L0_Test.groovy || true
# Preview the first few lines of jenkins/L0_Test.groovy
head -n 5 jenkins/L0_Test.groovy

Length of output: 564


Revert bloom-jenkins-shared-lib to @main
Pinning to dev-yuanjingx-slurm_refactor will break the main pipeline—update line 1 accordingly.

-@Library(['bloom-jenkins-shared-lib@dev-yuanjingx-slurm_refactor', 'trtllm-jenkins-shared-lib@main']) _
+@Library(['bloom-jenkins-shared-lib@main',                    'trtllm-jenkins-shared-lib@main']) _
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
@Library(['bloom-jenkins-shared-lib@dev-yuanjingx-slurm_refactor', 'trtllm-jenkins-shared-lib@main']) _
@Library(['bloom-jenkins-shared-lib@main', 'trtllm-jenkins-shared-lib@main']) _
🤖 Prompt for AI Agents
In jenkins/L0_Test.groovy around line 1, the Jenkins shared library reference is
pinned to dev-yuanjingx-slurm_refactor which will break the main pipeline;
change the library ref to use the stable main branch by replacing
'@dev-yuanjingx-slurm_refactor' with '@main' so the line becomes
@Library(['bloom-jenkins-shared-lib@main', 'trtllm-jenkins-shared-lib@main']) _
and save the file.

Comment on lines 160 to +201
return [
return nodeCount == 1 ? [
"--nodes=${nodeCount}",
"--gpus=${gpuCount}"
] : [
"--nodes=${nodeCount}",
"--ntasks=${gpuCount}",
"--ntasks-per-node=${gpusPerNode}",
"--gpus-per-node=${gpusPerNode}",
].join(" ")
]
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

❓ Verification inconclusive

Harden getNodeArgs: handle invalid inputs; avoid deprecated BigDecimal rounding.

Avoid ROUND_CEILING and return null for invalid node/gpu counts (caller already checks for null).

-def getNodeArgs(int nodeCount, int gpuCount) {
-    int gpusPerNode = ((gpuCount / nodeCount) as BigDecimal).setScale(0, BigDecimal.ROUND_CEILING).intValue()
+def getNodeArgs(int nodeCount, int gpuCount) {
+    if (nodeCount < 1 || gpuCount < 1) {
+        return null
+    }
+    int gpusPerNode = Math.ceil(gpuCount / (double) nodeCount) as int
     return nodeCount == 1 ? [
         "--nodes=${nodeCount}",
         "--gpus=${gpuCount}"
     ] : [
         "--nodes=${nodeCount}",
         "--ntasks=${gpuCount}",
         "--ntasks-per-node=${gpusPerNode}",
         "--gpus-per-node=${gpusPerNode}",
     ]
 }

Harden getNodeArgs: validate inputs & avoid deprecated BigDecimal rounding.

  • Guard against invalid counts (nodeCount < 1 || gpuCount < 1) by returning null.
  • Replace
    ((gpuCount/nodeCount) as BigDecimal).setScale(0, BigDecimal.ROUND_CEILING).intValue()
    with
    Math.ceil(gpuCount/(double)nodeCount) as int.

Verify that every caller properly handles a null return from getNodeArgs.

🤖 Prompt for AI Agents
In jenkins/L0_Test.groovy around lines 160 to 170, harden getNodeArgs by first
validating inputs and returning null when nodeCount < 1 || gpuCount < 1, replace
the deprecated BigDecimal rounding expression with
Math.ceil(gpuCount/(double)nodeCount) cast to int for gpusPerNode, and update
callers to handle a possible null return (e.g., check for null and abort or
propagate error) so invalid inputs don't produce incorrect args.

Comment on lines +270 to 312
@@ -357,80 +277,118 @@
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Don’t pass the password on the command line. Use sshpass -e.

Prevents leaking secrets via process list/logging.

-Utils.exec(pipeline, script: "sshpass -p '${remote.passwd}' ssh ${COMMON_SSH_OPTIONS} ${remote.user}@${remote.host} 'mkdir -p ${jobWorkspace}'", numRetries: 3,)
+Utils.exec(pipeline, script: "SSHPASS='${remote.passwd}' sshpass -e ssh ${COMMON_SSH_OPTIONS} ${remote.user}@${remote.host} 'mkdir -p ${jobWorkspace}'", numRetries: 3,)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
stage("[${stageName}] Initializing Test") {
// Create Job Workspace folder in Frontend Node
Utils.exec(pipeline, script: "sshpass -p '${remote.passwd}' ssh ${COMMON_SSH_OPTIONS} ${remote.user}@${remote.host} 'mkdir -p ${jobWorkspace}'", numRetries: 3,)
stage("[${stageName}] Initializing Test") {
// Create Job Workspace folder in Frontend Node
Utils.exec(pipeline, script: "SSHPASS='${remote.passwd}' sshpass -e ssh ${COMMON_SSH_OPTIONS} ${remote.user}@${remote.host} 'mkdir -p ${jobWorkspace}'", numRetries: 3,)
🤖 Prompt for AI Agents
In jenkins/L0_Test.groovy around lines 270 to 273, do not interpolate the
plaintext password into the sshpass command; instead switch to sshpass -e and
supply the password via the SSHPASS environment variable (use Jenkins
credentials binding or withEnv) so the password is not visible on the command
line or process list. Update the Utils.exec invocation to remove
'${remote.passwd}' from the command string and ensure remote.passwd is injected
into the process environment (or retrieved from Jenkins credentials) before
running sshpass -e ssh ..., keeping the same COMMON_SSH_OPTIONS, remote
user/host and mkdir -p jobWorkspace and retaining numRetries.

Comment on lines +391 to 433
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Verification agent

🧩 Analysis chain

Re-enable cleanup, gated by a debug flag.

Leaving remote workspaces indefinitely will leak disk.

-        //cleanUpNodeResources(pipeline, cluster, jobUID)
+        if (!(env.KEEP_SLURM_WS_FOR_DEBUG?.toBoolean())) {
+            cleanUpNodeResources(pipeline, cluster, jobUID)
+        } else {
+            echo "KEEP_SLURM_WS_FOR_DEBUG=true; preserving remote workspace for ${jobUID}"
+        }

🏁 Script executed:

#!/bin/bash
# Locate definition of the cleanup helper
rg -n 'def cleanUpNodeResources' -g '*.groovy'
# Find all invocations of the cleanup call
rg -n 'cleanUpNodeResources' -g '*.groovy'
# Check for existing debug flag usage or declaration
rg -n 'KEEP_SLURM_WS_FOR_DEBUG' -g '*.groovy'

Length of output: 430


Re-enable cleanup behind a debug flag in jenkins/L0_Test.groovy:391
Replace the commented-out call with:

if (!env.KEEP_SLURM_WS_FOR_DEBUG?.toBoolean()) {
    cleanUpNodeResources(pipeline, cluster, jobUID)
} else {
    echo "KEEP_SLURM_WS_FOR_DEBUG=true; preserving remote workspace for ${jobUID}"
}

Also add and document the KEEP_SLURM_WS_FOR_DEBUG environment variable or pipeline parameter.

🤖 Prompt for AI Agents
In jenkins/L0_Test.groovy around lines 391-392, re-enable node cleanup but guard
it with a debug flag: replace the commented-out cleanUpNodeResources call with a
conditional that checks env.KEEP_SLURM_WS_FOR_DEBUG (toBoolean) and only calls
cleanUpNodeResources(pipeline, cluster, jobUID) when the flag is false;
otherwise emit an echo that the workspace is preserved for the jobUID. Also add
a documented pipeline parameter or environment variable definition for
KEEP_SLURM_WS_FOR_DEBUG (default false) in the pipeline configuration or README
so users know how to enable workspace preservation for debugging.

@tensorrt-cicd
Copy link
Collaborator

PR_Github #16890 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #12687 (Partly Tested) completed with status: 'SUCCESS'

@yuanjingx87
Copy link
Collaborator Author

/bot run --stage-list "DGX_B200-4_GPUs-PyTorch-Post-Merge-1" --disable-fail-fast

@yuanjingx87 yuanjingx87 force-pushed the user/yuanjingx/slurm_test_tmp branch from 19ffc0d to adcb573 Compare August 29, 2025 09:17
@yuanjingx87
Copy link
Collaborator Author

/bot run --stage-list "DGX_B200-4_GPUs-PyTorch-Post-Merge-1" --disable-fail-fast

@tensorrt-cicd
Copy link
Collaborator

PR_Github #16968 [ run ] triggered by Bot

@tensorrt-cicd
Copy link
Collaborator

PR_Github #16968 [ run ] completed with state FAILURE
/LLM/main/L0_MergeRequest_PR pipeline #12737 (Partly Tested) completed with status: 'FAILURE'

@yuanjingx87 yuanjingx87 force-pushed the user/yuanjingx/slurm_test_tmp branch from adcb573 to 061fcba Compare August 29, 2025 14:56
@yuanjingx87
Copy link
Collaborator Author

/bot run --stage-list "DGX_B200-4_GPUs-PyTorch-Post-Merge-1" --disable-fail-fast

@tensorrt-cicd
Copy link
Collaborator

PR_Github #16999 [ run ] triggered by Bot

@tensorrt-cicd
Copy link
Collaborator

PR_Github #16999 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #12764 (Partly Tested) completed with status: 'SUCCESS'

@yuanjingx87 yuanjingx87 closed this Sep 8, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants