User/yuanjingx/slurm test tmp #7309
Conversation
          
📝 Walkthrough

Unifies SLURM test execution into a node-count–aware API, adds helpers to assemble pytest invocations and node args, switches multi-node launch to a generated script executed remotely, remaps runtime labels to TRT-LLM variants, and refactors the Jenkins and SLURM run scripts accordingly.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant Jenkins as Jenkins (L0_Test.groovy)
  participant Slurm as SlurmCluster
  participant Node as Remote Node (SSH)
  participant Launch as slurm_launch.sh
  participant PyTest as pytest
  Jenkins->>Jenkins: build pytestCommand via getPytestBaseCommandLine
  Jenkins->>Slurm: request allocation (nodeCount, gpu args via getNodeArgs)
  Jenkins->>Node: copy generated slurm_launch.sh and support files
  Jenkins->>Node: SSH -> execute slurm_launch.sh
  Node->>Launch: start, set env, inject placeholders via set_value_in_command
  Launch->>PyTest: eval pytestCommand
  PyTest-->>Launch: exit/status, produce coverage/artifacts
  Launch->>Launch: run optional perf checks, generate report
  Launch-->>Jenkins: exit status and artifacts
  Jenkins->>Slurm: cleanUpNodeResources (release)
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
/bot run --stage-list "DGX_B200-4_GPUs-PyTorch-Post-Merge-1" --disable-fail-fast

PR_Github #16742 [ run ] triggered by Bot
Actionable comments posted: 5
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️  Outside diff range comments (1)
jenkins/scripts/slurm_run.sh (1)
69-76: Trim values; don't mangle paths. Avoid replacing spaces with underscores; just read and prepend cleanly.

-containerPipLLMLibPath=$(pip3 show tensorrt_llm | grep "Location" | awk -F ":" '{ gsub(/ /, "", $2); print $2"/tensorrt_llm/libs"}')
-containerPipLLMLibPath=$(echo "$containerPipLLMLibPath" | sed 's/[[:space:]]+/_/g')
-containerLDLibPath=$LD_LIBRARY_PATH
-containerLDLibPath=$(echo "$containerLDLibPath" | sed 's/[[:space:]]+/_/g')
+containerPipLLMLibPath=$(pip3 show tensorrt_llm | awk -F': ' '/^Location/{gsub(/^[[:space:]]+|[[:space:]]+$/,"",$2); print $2"/tensorrt_llm/libs"}')
+containerLDLibPath="${LD_LIBRARY_PATH}"
 if [[ "$containerLDLibPath" != *"$containerPipLLMLibPath"* ]]; then
     containerLDLibPath="$containerPipLLMLibPath:$containerLDLibPath"
     containerLDLibPath="${containerLDLibPath%:}"
 fi
 export LD_LIBRARY_PATH=$containerLDLibPath
🧹 Nitpick comments (7)
jenkins/scripts/slurm_run.sh (2)
34-34: Install reliability: run apt-get update before installing. Precede install with update to reduce transient failures.

- apt-get install -y libffi-dev
+ apt-get update -y && apt-get install -y libffi-dev
81-81: Remove or fix unused variable in log. $testCase is never set.

-echo "Running: $testCase"
+echo "Running stage: $stageName"

jenkins/L0_Test.groovy (5)
159-170: Validate Slurm task/GPU mapping; consider adding --gpus-per-task=1. With ceil() on gpusPerNode and --ntasks=${gpuCount}, tasks/GPU mapping can drift on uneven splits. Make binding explicit.

 return nodeCount == 1 ? [
     "--nodes=${nodeCount}",
     "--gpus=${gpuCount}"
 ] : [
     "--nodes=${nodeCount}",
     "--ntasks=${gpuCount}",
     "--ntasks-per-node=${gpusPerNode}",
     "--gpus-per-node=${gpusPerNode}",
+    "--gpus-per-task=1",
 ]

If your Slurm defaults already bind GPUs per task, please confirm.
172-230: Command builder looks solid; minor nit on empty args. Joining includes empty strings when not detailed logging; harmless but can be filtered to keep logs cleaner.

- testFilter[(DETAILED_LOG)] ? "-s" : "",
+ *(testFilter[(DETAILED_LOG)] ? ["-s"] : []),
266-287: Ensure slurm_run.sh exists from the unpacked tree. You extract the tar locally to copy slurm_run.sh. If packaging ever changes, this path breaks. Add a quick existence check to fail early.

 def scriptRunLocalPath = "${llmSrcLocal}/jenkins/scripts/slurm_run.sh"
+if (!fileExists(scriptRunLocalPath)) {
+    error "slurm_run.sh not found at ${scriptRunLocalPath}"
+}
312-318: Quoting safety for exporting pytestCommand into the launch script.Long commands with quotes can break the here-doc export. Consider base64-encoding and decoding in slurm_run.sh, or at least escape embedded quotes.
Example (conceptual):
def cmdB64 = pytestCommand.bytes.encodeBase64().toString()
scriptContent = """#!/bin/bash
export pytestCommandB64="${cmdB64}"
...
"""
// In slurm_run.sh:
pytestCommand="$(echo "$pytestCommandB64" | base64 -d)"

Also applies to: 337-349
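The base64 idea above can be exercised end-to-end in plain bash; this is a minimal sketch where the command string and variable names are illustrative, not the real generated pytest invocation:

```shell
#!/bin/bash
# Round-trip a quote-heavy command string through an env var via base64.
# Embedded double quotes survive untouched because only base64 text is exported.
pytestCommand='__LUNOWUD="-thread_pool_size=8" pytest -v "some test"'

# Producer side (the Groovy side would use pytestCommand.bytes.encodeBase64()):
pytestCommandB64=$(printf '%s' "$pytestCommand" | base64 | tr -d '\n')
export pytestCommandB64

# Consumer side (what slurm_run.sh would do):
decoded=$(printf '%s' "$pytestCommandB64" | base64 -d)
echo "$decoded"
```

The decoded string is byte-for-byte identical to the original, so no quoting in the command can break the generated launch script.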
379-387: SSH execution path: add basic retry to handle transient SSH/SRUN flakiness.Occasional transient remote failures are common; a short retry loop improves robustness.
- """bash ${scriptLaunchPathNode}""" + """bash ${scriptLaunchPathNode}""" ,And wrap
Utils.execwith numRetries similar to earlier copy steps.
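A generic retry wrapper along these lines might look as follows; `retry` and `flaky` are hypothetical names used only for illustration, not existing helpers in the repo:

```shell
#!/bin/bash
# Hypothetical retry wrapper for flaky remote steps (SSH/srun):
# runs a command up to N times with a short pause between attempts.
retry() {
    local attempts="$1"; shift
    local i
    for ((i = 1; i <= attempts; i++)); do
        "$@" && return 0
        echo "attempt $i/$attempts failed: $*" >&2
        sleep 1
    done
    return 1
}

# Demo stand-in for a transiently failing command: fails twice, then succeeds.
tries=0
flaky() {
    tries=$((tries + 1))
    [ "$tries" -ge 3 ]
}
retry 5 flaky && echo "succeeded after $tries attempts"
```

In the Jenkins context the equivalent would be passing a retry count into Utils.exec, as the review suggests.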
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
 - Jira integration is disabled by default for public repositories
 - Linear integration is disabled by default for public repositories
 
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (2)
- jenkins/L0_Test.groovy (10 hunks)
- jenkins/scripts/slurm_run.sh (3 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
Applied to files:
jenkins/scripts/slurm_run.sh
🪛 Shellcheck (0.10.0)
jenkins/scripts/slurm_run.sh
[warning] 40-40: hostname is referenced but not assigned.
(SC2154)
[warning] 40-40: stageName is referenced but not assigned.
(SC2154)
[warning] 49-49: Use 'cd ... || exit' or 'cd ... || return' in case cd fails.
(SC2164)
[warning] 85-85: perfMode is referenced but not assigned.
(SC2154)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
 
🔇 Additional comments (3)
jenkins/L0_Test.groovy (3)
351-359: MPI flag only for multi-node: LGTM. Using --mpi=pmi2 conditionally makes sense with the containerized srun flow.
Also applies to: 360-368
1390-1417: Unified pytest command path: LGTM. The builder plus rerun logic integration looks correct and readable.
Also applies to: 1438-1446
1818-1822: Stage mappings updated to TRT-LLM variants: LGTM. New x86/SBSA Slurm configs and multi-node stages align with the new runner.
Also applies to: 1847-1851, 1854-1862
set_value_in_command() {
    # Parameters
    local key="$1"
    local value="$2"
    local command="$3"

    # Transform the key
    local placeholder="__PLACEHOLDER_${key}__"

    # Check if placeholder exists
    if [[ "$command" != *"$placeholder"* ]]; then
        echo "Error: placeholder '$placeholder' not found in the command" >&2
        return 1
    fi

    # Replace all occurrences
    local result="${command//${placeholder}/${value}}"

    # Return the result
    echo "$result"
}
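For reference, the helper above can be exercised standalone; this sketch copies the reviewed function verbatim and shows both the replacement path and the unchecked-failure path the review warns about:

```shell
#!/bin/bash
# Standalone copy of the reviewed helper: substitutes __PLACEHOLDER_<KEY>__
# occurrences in a command string, returning 1 if the placeholder is absent.
set_value_in_command() {
    local key="$1" value="$2" command="$3"
    local placeholder="__PLACEHOLDER_${key}__"
    if [[ "$command" != *"$placeholder"* ]]; then
        echo "Error: placeholder '$placeholder' not found in the command" >&2
        return 1
    fi
    echo "${command//${placeholder}/${value}}"
}

cmd='pytest --cov-config=__PLACEHOLDER_coverageConfigFile__'
resolved=$(set_value_in_command "coverageConfigFile" "/ws/.coveragerc" "$cmd")
echo "$resolved"   # pytest --cov-config=/ws/.coveragerc

# The failure path returns 1; callers must check it, or the result is empty.
if ! set_value_in_command "MISSING" "x" "$cmd" >/dev/null 2>&1; then
    echo "caller saw failure"
fi
```

This illustrates why the review asks for fail-fast handling: `$(...)` silently captures an empty string when the function returns 1.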
🛠️ Refactor suggestion
Fail fast on missing placeholders and surface errors.
set_value_in_command returns 1 if not found; callers don’t check, risking an empty pytestCommand. Either exit on error or handle status.
 set_value_in_command() {
@@
-    if [[ "$command" != *"$placeholder"* ]]; then
-        echo "Error: placeholder '$placeholder' not found in the command" >&2
-        return 1
-    fi
+    if [[ "$command" != *"$placeholder"* ]]; then
+        echo "Error: placeholder '$placeholder' not found in the command" >&2
+        exit 1
+    fi
 }

Optionally also enable strict mode near the top:
 #!/bin/bash
+set -euo pipefail

Also applies to: 55-55, 67-67
🤖 Prompt for AI Agents
In jenkins/scripts/slurm_run.sh around lines 5 to 25 (also apply same treatment
at lines 55 and 67), the helper set_value_in_command currently returns a
non-zero status when a placeholder is missing but callers do not check the
return code, which can lead to unset/empty command variables; change the
function to fail-fast by printing the error to stderr and then calling exit 1
(or alternatively propagate the error by returning non-zero and ensure callers
immediately check and exit on failure), and enable strict mode near the top of
the script (set -euo pipefail and optionally IFS=$'\n\t') so any unexpected
failure halts execution; apply the same error-handling change to the other
placeholder-replacement usages at lines 55 and 67 to ensure missing placeholders
abort the run.
 git config --global --add safe.directory "*"
 gpuUuids=$(nvidia-smi -q | grep "GPU UUID" | awk '{print $4}' | tr '\n' ',' || true)
-echo "HOST_NODE_NAME = $HOST_NODE_NAME ; GPU_UUIDS = =$gpuUuids ; STAGE_NAME = $stageName"
+echo "HOST_NODE_NAME = $hostname ; GPU_UUIDS = =$gpuUuids ; STAGE_NAME = $stageName"
Fix undefined hostname variable and extra '=' in log line.
Use command substitution; drop the duplicate '='.
-    echo "HOST_NODE_NAME = $hostname ; GPU_UUIDS = =$gpuUuids ; STAGE_NAME = $stageName"
+    echo "HOST_NODE_NAME = $(hostname) ; GPU_UUIDS = $gpuUuids ; STAGE_NAME = $stageName"

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| echo "HOST_NODE_NAME = $hostname ; GPU_UUIDS = =$gpuUuids ; STAGE_NAME = $stageName" | |
| echo "HOST_NODE_NAME = $(hostname) ; GPU_UUIDS = $gpuUuids ; STAGE_NAME = $stageName" | 
🧰 Tools
🪛 Shellcheck (0.10.0)
[warning] 40-40: hostname is referenced but not assigned.
(SC2154)
[warning] 40-40: stageName is referenced but not assigned.
(SC2154)
🤖 Prompt for AI Agents
In jenkins/scripts/slurm_run.sh around line 40, the echo uses an undefined
$hostname variable and includes an extra '=' before $gpuUuids; replace $hostname
with command substitution (e.g. $(hostname) or $(hostname -s) as appropriate),
remove the duplicated '=' so GPU_UUIDS = $gpuUuids, and optionally quote
variables in the echo to avoid word-splitting (e.g. "$stageName" and
"$gpuUuids").
export LLM_MODELS_ROOT=$MODEL_CACHE_DIR

export UCX_TLS=^gdr_copy
cd $llmSrcNode/tests/integration/defs
🛠️ Refactor suggestion
Ensure directory change failure stops the run.
Guard against a bad path to avoid running tests from the wrong dir.
-cd $llmSrcNode/tests/integration/defs
+cd "$llmSrcNode/tests/integration/defs" || { echo "cd failed: $llmSrcNode/tests/integration/defs" >&2; exit 1; }📝 Committable suggestion
🧰 Tools
🪛 Shellcheck (0.10.0)
[warning] 49-49: Use 'cd ... || exit' or 'cd ... || return' in case cd fails.
(SC2164)
🤖 Prompt for AI Agents
In jenkins/scripts/slurm_run.sh around line 49, the script does a plain cd to
$llmSrcNode/tests/integration/defs which will silently continue if the path is
invalid; change it to guard against failure by checking cd's exit status and
aborting the run on error (for example, replace the bare cd with an explicit
check such as attempting cd and on failure printing a clear error to stderr and
exiting with non‑zero status).
trtllmWhlPath=$(pip3 show tensorrt_llm | grep Location | cut -d ' ' -f 2)
trtllmWhlPath=$(echo "$trtllmWhlPath" | sed 's/[[:space:]]+/_/g')
echo "TRTLLM WHEEL PATH: $trtllmWhlPath"
sed -i "s|---wheel_path---|$trtllmWhlPath|g" "$coverageConfigFile"
testCmdLines+=(
    "--cov=$llmSrcNode/examples/"
    "--cov=$llmSrcNode/tensorrt_llm/"
    "--cov=$trtllmWhlPath/tensorrt_llm/"
    "--cov-report="
    "--cov-config=$coverageConfigFile"
)
pytestCommand=$(set_value_in_command "TRTLLM_WHL_PATH" "$trtllmWhlPath" "$pytestCommand")
🛠️ Refactor suggestion
Robustly resolve tensorrt_llm location; current sed regex is ineffective.
sed 's/[[:space:]]+/_/g' treats + literally; whitespace isn’t trimmed. Use awk to extract/trim and fail fast if empty.
-trtllmWhlPath=$(pip3 show tensorrt_llm | grep Location | cut -d ' ' -f 2)
-trtllmWhlPath=$(echo "$trtllmWhlPath" | sed 's/[[:space:]]+/_/g')
+trtllmWhlPath=$(pip3 show tensorrt_llm | awk -F': ' '/^Location/{gsub(/^[[:space:]]+|[[:space:]]+$/,"",$2); print $2}')
+[ -n "$trtllmWhlPath" ] || { echo "Failed to resolve tensorrt_llm wheel path" >&2; exit 1; }
 pytestCommand=$(set_value_in_command "TRTLLM_WHL_PATH" "$trtllmWhlPath" "$pytestCommand")
🤖 Prompt for AI Agents
In jenkins/scripts/slurm_run.sh around lines 52-56, the sed invocation is wrong
(the '+' is literal) and doesn't reliably trim or extract the Location; replace
it with an awk-based extraction and trimming pipeline that finds the "Location:"
field from pip3 show, prints the value, and optionally replaces internal
whitespace with underscores if needed, then immediately fail fast (exit non-zero
with an error log) if the resulting variable is empty before continuing to set
pytestCommand. Ensure the script logs the resolved path and exits with a clear
error message when tensorrt_llm Location cannot be determined.
| echo "Full Command: $fullCmd" | ||
| eval $fullCmd | ||
| echo "Full Command: $pytestCommand" | ||
| eval $pytestCommand | 
🛠️ Refactor suggestion
Avoid eval; run the command safely while preserving env assignments.
eval risks unintended expansion. bash -lc is safer here.
-eval $pytestCommand
+bash -lc "$pytestCommand"
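The behavior difference is easy to demonstrate with a throwaway command string; a minimal sketch using plain `bash -c` (the suggestion uses `-lc`, which additionally reads login-shell startup files):

```shell
#!/bin/bash
# A command string that starts with an env assignment, as pytestCommand does.
# Passing it to a child shell keeps the assignment scoped to that command
# instead of re-evaluating the string inside the current shell.
cmd='GREETING="hello world" sh -c '\''echo "$GREETING"'\'''

out=$(bash -c "$cmd")
echo "$out"                  # hello world

# The assignment did not leak into the invoking shell:
echo "${GREETING:-unset}"    # unset
```

With `eval`, by contrast, the string is expanded in the current shell's context, so metacharacters and assignments can have surprising side effects.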
🤖 Prompt for AI Agents
In jenkins/scripts/slurm_run.sh around line 83, avoid using eval to run the
pytestCommand because eval can cause unintended expansion; instead invoke a new
bash shell with -l -c and pass the entire pytestCommand string as a single
argument so environment variable assignments in the string are preserved and not
subject to eval expansion. Replace the eval usage with a call that runs the
command under bash -lc, ensuring the pytestCommand is passed as one quoted
argument and not re-evaluated by the current shell. Ensure any needed
environment assignments remain inside pytestCommand or are exported beforehand
so they are visible to the new shell.
PR_Github #16742 [ run ] completed with state

/bot run --stage-list "DGX_B200-4_GPUs-PyTorch-Post-Merge-1" --disable-fail-fast
Actionable comments posted: 0
🧹 Nitpick comments (1)
tests/integration/test_lists/test-db/l0_dgx_b200.yml (1)
97-101: Normalize GPTOSS backend tags to lowercase

Uppercase backend tags (CUTLASS, TRITON, TRTLLM) are still present in l0_dgx_b200.yml and will cause parameter mismatches in the updated runner. Please apply the following changes:

Files/locations to update:
- tests/integration/test_lists/test-db/l0_dgx_b200.yml, lines 97-101

- - accuracy/test_llm_api_pytorch.py::TestGPTOSS::test_w4_4gpus[tp4-CUTLASS]
- - accuracy/test_llm_api_pytorch.py::TestGPTOSS::test_w4_4gpus[tp4-TRITON]
- - accuracy/test_llm_api_pytorch.py::TestGPTOSS::test_w4_4gpus[ep4-TRTLLM]
- - accuracy/test_llm_api_pytorch.py::TestGPTOSS::test_w4_4gpus[dp4-CUTLASS]
- - accuracy/test_llm_api_pytorch.py::TestGPTOSS::test_w4_4gpus[dp4-TRITON]
+ - accuracy/test_llm_api_pytorch.py::TestGPTOSS::test_w4_4gpus[tp4-cutlass]
+ - accuracy/test_llm_api_pytorch.py::TestGPTOSS::test_w4_4gpus[tp4-triton]
+ - accuracy/test_llm_api_pytorch.py::TestGPTOSS::test_w4_4gpus[ep4-trtllm]
+ - accuracy/test_llm_api_pytorch.py::TestGPTOSS::test_w4_4gpus[dp4-cutlass]
+ - accuracy/test_llm_api_pytorch.py::TestGPTOSS::test_w4_4gpus[dp4-triton]
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
 - Jira integration is disabled by default for public repositories
 - Linear integration is disabled by default for public repositories
 
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (1)
- tests/integration/test_lists/test-db/l0_dgx_b200.yml (1 hunks)
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-08-26T09:49:04.956Z
Learnt from: pengbowang-nv
PR: NVIDIA/TensorRT-LLM#7192
File: tests/integration/test_lists/test-db/l0_dgx_b200.yml:56-72
Timestamp: 2025-08-26T09:49:04.956Z
Learning: In TensorRT-LLM test configuration files, the test scheduling system handles wildcard matching with special rules that prevent duplicate test execution even when the same tests appear in multiple yaml files with overlapping GPU wildcards (e.g., "*b200*" and "*gb200*").
Applied to files:
tests/integration/test_lists/test-db/l0_dgx_b200.yml
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
Applied to files:
tests/integration/test_lists/test-db/l0_dgx_b200.yml
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
 
🔇 Additional comments (1)
tests/integration/test_lists/test-db/l0_dgx_b200.yml (1)
44-47: Good normalization of backend tags to lowercase. The tp/ep/dp backends for TestGPTOSS were lowercased (trtllm, cutlass, triton), which aligns with recent runner naming unification and reduces case-sensitivity pitfalls.
PR_Github #16749 [ run ] triggered by Bot

PR_Github #16749 [ run ] completed with state

e69b735 to eeb92e1 (Compare)

/bot run --stage-list "DGX_B200-4_GPUs-PyTorch-Post-Merge-1" --disable-fail-fast

PR_Github #16818 [ run ] triggered by Bot

/bot run --stage-list "DGX_B200-4_GPUs-PyTorch-Post-Merge-1" --disable-fail-fast
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️  Outside diff range comments (1)
jenkins/scripts/slurm_run.sh (1)
69-79: Fix LD_LIBRARY_PATH augmentation; avoid broken sed and ensure idempotence.

-containerPipLLMLibPath=$(pip3 show tensorrt_llm | grep "Location" | awk -F ":" '{ gsub(/ /, "", $2); print $2"/tensorrt_llm/libs"}')
-containerPipLLMLibPath=$(echo "$containerPipLLMLibPath" | sed 's/[[:space:]]+/_/g')
-containerLDLibPath=$LD_LIBRARY_PATH
-containerLDLibPath=$(echo "$containerLDLibPath" | sed 's/[[:space:]]+/_/g')
-if [[ "$containerLDLibPath" != *"$containerPipLLMLibPath"* ]]; then
-    containerLDLibPath="$containerPipLLMLibPath:$containerLDLibPath"
-    containerLDLibPath="${containerLDLibPath%:}"
-fi
-export LD_LIBRARY_PATH=$containerLDLibPath
+containerPipLLMLibPath=$(pip3 show tensorrt_llm | awk -F': ' '/^Location/{gsub(/^[[:space:]]+|[[:space:]]+$/,"",$2); print $2"/tensorrt_llm/libs"}')
+containerLDLibPath="${LD_LIBRARY_PATH:-}"
+case ":$containerLDLibPath:" in
+  *":$containerPipLLMLibPath:"*) ;;
+  *) containerLDLibPath="$containerPipLLMLibPath${containerLDLibPath:+:$containerLDLibPath}";;
+esac
+export LD_LIBRARY_PATH="$containerLDLibPath"
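The case-pattern in that diff can be tested in isolation; `prepend_path` here is a hypothetical helper wrapping the same logic on a throwaway variable:

```shell
#!/bin/bash
# Idempotent prepend: add a directory to a colon-separated path variable
# only if it is not already present.
prepend_path() {
    local dir="$1" path="$2"
    case ":$path:" in
        *":$dir:"*) ;;                       # already present: no-op
        *) path="$dir${path:+:$path}" ;;     # prepend, handling empty path
    esac
    echo "$path"
}

p="/usr/lib"
p=$(prepend_path "/opt/libs" "$p")
p=$(prepend_path "/opt/libs" "$p")   # idempotent: second call changes nothing
echo "$p"                            # /opt/libs:/usr/lib
```

Wrapping the candidate path in leading and trailing colons avoids false matches on substrings of longer directory names, which the simple `*"$dir"*` glob in the original code could hit.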
♻️ Duplicate comments (5)
jenkins/scripts/slurm_run.sh (5)
1-4: Fail fast on placeholder errors and enable strict mode. Uncaught errors from set_value_in_command can silently corrupt pytestCommand. Also adopt strict mode.

 #!/bin/bash
+set -euo pipefail
+IFS=$'\n\t'

 set_value_in_command() {
@@
-    if [[ "$command" != *"$placeholder"* ]]; then
-        echo "Error: placeholder '$placeholder' not found in the command" >&2
-        return 1
-    fi
+    if [[ "$command" != *"$placeholder"* ]]; then
+        echo "Error: placeholder '$placeholder' not found in the command" >&2
+        exit 1
+    fi
 }

Also applies to: 5-25
40-40: Fix undefined $hostname and extra '='; quote values. Use command substitution and remove the duplicate equals.

-    echo "HOST_NODE_NAME = $hostname ; GPU_UUIDS = =$gpuUuids ; STAGE_NAME = $stageName"
+    echo "HOST_NODE_NAME = $(hostname) ; GPU_UUIDS = ${gpuUuids} ; STAGE_NAME = ${stageName:-unknown}"
49-49: Guard cd to fail early if path is wrong.

-cd $llmSrcNode/tests/integration/defs
+cd "$llmSrcNode/tests/integration/defs" || { echo "cd failed: $llmSrcNode/tests/integration/defs" >&2; exit 1; }
52-56: Robustly resolve TRT-LLM wheel location; current sed is ineffective.

-trtllmWhlPath=$(pip3 show tensorrt_llm | grep Location | cut -d ' ' -f 2)
-trtllmWhlPath=$(echo "$trtllmWhlPath" | sed 's/[[:space:]]+/_/g')
-echo "TRTLLM WHEEL PATH: $trtllmWhlPath"
+trtllmWhlPath=$(pip3 show tensorrt_llm | awk -F': ' '/^Location/{gsub(/^[[:space:]]+|[[:space:]]+$/,"",$2); print $2}')
+[ -n "$trtllmWhlPath" ] || { echo "Failed to resolve tensorrt_llm wheel path" >&2; exit 1; }
+echo "TRTLLM WHEEL PATH: ${trtllmWhlPath}"
 pytestCommand=$(set_value_in_command "TRTLLM_WHL_PATH" "$trtllmWhlPath" "$pytestCommand")
82-83: Avoid eval; execute safely preserving env assignments.

-echo "Full Command: $pytestCommand"
-eval $pytestCommand
+echo "Full Command: $pytestCommand"
+bash -lc "$pytestCommand"
🧹 Nitpick comments (5)
jenkins/scripts/slurm_run.sh (2)
57-68: Quoting and path safety for generated .coveragerc. Use quotes to avoid word-splitting; keep consistent placeholder casing.

-cat << EOF > $jobWorkspace/.coveragerc
+cat << EOF > "$jobWorkspace/.coveragerc"
@@
 data_file = $jobWorkspace/.coverage.$stageName
@@
     $llmSrcNode/tensorrt_llm/
     $trtllmWhlPath/tensorrt_llm/
 EOF
 pytestCommand=$(set_value_in_command "coverageConfigFile" "$jobWorkspace/.coveragerc" "$pytestCommand")
81-81: Drop or set unused $testCase. Variable is undefined; the log line adds noise.

-echo "Running: $testCase"
+# echo "Running test case: $testCase" # Define testCase if needed or remove this line.

jenkins/L0_Test.groovy (3)
172-199: Quoting in extraInternalEnv can break downstream export; keep env assigns simple. The token __LUNOWUD="...", when embedded into a double-quoted export later, breaks quoting. Safer to avoid inner quotes in env assigns or ensure the command is injected as a file (see below).

- extraInternalEnv = "__LUNOWUD=\"-thread_pool_size=${TESTER_CORES}\""
+ extraInternalEnv = "__LUNOWUD=-thread_pool_size=${TESTER_CORES}"

If you must keep the quotes, apply the "write command to file" fix in the launch script comment below.

Also applies to: 200-229
379-387: Run launcher with explicit shell and error propagation. Use bash -lc to preserve env and stop on failure of srun.

- """bash ${scriptLaunchPathNode}"""
+ """bash -lc '${scriptLaunchPathNode}'"""
160-170: Minor: prefer RoundingMode for BigDecimal. ROUND_CEILING is deprecated; not blocking but easy to modernize.

- int gpusPerNode = ((gpuCount / nodeCount) as BigDecimal).setScale(0, BigDecimal.ROUND_CEILING).intValue()
+ int gpusPerNode = ((gpuCount / nodeCount) as BigDecimal).setScale(0, java.math.RoundingMode.CEILING).intValue()
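As a side note, the same ceiling split can be expressed with plain integer arithmetic, which is handy if the calculation ever needs mirroring in shell; the counts below are arbitrary examples:

```shell
#!/bin/bash
# Ceiling division without floating point:
# gpusPerNode = ceil(gpuCount / nodeCount) == (gpuCount + nodeCount - 1) / nodeCount
gpuCount=7
nodeCount=2
gpusPerNode=$(( (gpuCount + nodeCount - 1) / nodeCount ))
echo "$gpusPerNode"   # 4
```

The `(a + b - 1) / b` idiom rounds up exactly when the division has a remainder and leaves exact divisions unchanged.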
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
 - Jira integration is disabled by default for public repositories
 - Linear integration is disabled by default for public repositories
 
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (2)
- jenkins/L0_Test.groovy (10 hunks)
- jenkins/scripts/slurm_run.sh (3 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
Applied to files:
jenkins/scripts/slurm_run.sh
🪛 Shellcheck (0.10.0)
jenkins/scripts/slurm_run.sh
[warning] 40-40: hostname is referenced but not assigned.
(SC2154)
[warning] 40-40: stageName is referenced but not assigned.
(SC2154)
[warning] 49-49: Use 'cd ... || exit' or 'cd ... || return' in case cd fails.
(SC2164)
[warning] 85-85: perfMode is referenced but not assigned.
(SC2154)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
 
🔇 Additional comments (3)
jenkins/L0_Test.groovy (3)
161-170: Node/gpu arg calculation looks good. Ceiling split per node and per-task settings make sense for uneven GPU distributions.
1406-1417: Good: local (non-SLURM) pytest command assembly and rerun flow are coherent. The coverage config, LD_LIBRARY_PATH setup, and rerun integration are consistent.
Also applies to: 1439-1446
1818-1822: Verify new Slurm stage keys in bot stage-list and CI mapping. Ensure the DGX_B200-4_GPUs-PyTorch-{1,2,Post-Merge-1,Post-Merge-2} entries appear in /bot stage-list and in your CI's test_to_stage_mapping; missing entries will trigger "stage not found."
def pytestCommand = getPytestBaseCommandLine(
    llmSrcNode,
    stageName,
    testListPathNode,
    splits,
    splitId,
    perfMode,
    jobWorkspace,
    "__PLACEHOLDER_TRTLLM_WHL_PATH__",
    "__PLACEHOLDER_coverageConfigFile__",
    pytestUtil
).join(" ")
🛠️ Refactor suggestion
Critical: exporting pytestCommand with embedded quotes corrupts the script; write command to a file and read it at runtime.
The generated command contains quotes (e.g., __LUNOWUD="..."), which terminates export pytestCommand="...". This is a likely cause of the CI failures (e.g., DGX_B200-4_GPUs-PyTorch-Post-Merge-1). Persist the command via here-doc and read it in slurm_run.sh.
-                def pytestCommand = getPytestBaseCommandLine(
+                def pytestCommand = getPytestBaseCommandLine(
                     llmSrcNode,
@@
                 ).join(" ")
@@
-                def scriptContent = """#!/bin/bash
+                def scriptContent = """#!/bin/bash
                     export jobWorkspace=$jobWorkspace
@@
-                    export nodeCount=$nodeCount
-                    export pytestCommand="$pytestCommand"
+                    export nodeCount=$nodeCount
+                    cat > "$jobWorkspace/.pytest_cmd" <<'PYTEST_CMD'
+${pytestCommand}
+PYTEST_CMD
+                    export pytestCommand="$(cat "$jobWorkspace/.pytest_cmd")"
                     export NVIDIA_IMEX_CHANNELS=0
                     chmod +x $scriptRunNode
                """

No change needed in slurm_run.sh beyond replacing eval with bash -lc (already suggested). Optional: slurm_run.sh can prefer reading from $jobWorkspace/.pytest_cmd directly.
Also applies to: 337-350, 351-359, 360-368, 369-371
🤖 Prompt for AI Agents
In jenkins/L0_Test.groovy around lines 324-336 (and similarly for ranges
337-350, 351-359, 360-368, 369-371), the script currently exports pytestCommand
as a quoted string which gets corrupted by embedded quotes; instead write the
full generated command to a file in the job workspace (e.g., create jobWorkspace
+ "/.pytest_cmd" using a here-doc or direct write) and export only the path or a
simple variable, then ensure slurm_run.sh reads and executes that file at
runtime (replace any eval usage with bash -lc "$(<$jobWorkspace/.pytest_cmd)" or
have slurm_run.sh source/exec the file); update the code that builds
pytestCommand to write the command to the file and not attempt to embed complex
quoted content into a single exported environment variable.
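The quoting failure and the file-based fix are easy to reproduce in plain shell. A minimal sketch follows; the temp path and the command text are illustrative stand-ins, not the real generated pytest line:

```shell
#!/bin/sh
# A command containing double quotes, like the generated pytest line.
pytestCommand='pytest -v __LUNOWUD="-x --timeout=3600" tests/'

workdir=$(mktemp -d)

# Persist the command verbatim with a quoted here-doc delimiter:
# no expansion or quote-stripping happens between the markers.
cat > "$workdir/.pytest_cmd" <<'PYTEST_CMD'
pytest -v __LUNOWUD="-x --timeout=3600" tests/
PYTEST_CMD

# Read it back at runtime; the embedded quotes survive intact.
roundtrip=$(cat "$workdir/.pytest_cmd")

if [ "$roundtrip" = "$pytestCommand" ]; then
    echo "quotes preserved"
fi
```

By contrast, splicing the same string into `export pytestCommand="..."` inside a generated script ends the double-quoted region at the first embedded quote, which is the corruption described above.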
if [ "$perfMode" = "true" ]; then
    if [[ "$stageName" == *PyTorch* ]]; then
        basePerfFilename="base_perf_pytorch.csv"
    else
        basePerfFilename="base_perf.csv"
    fi
    basePerfPath="$llmSrcNode/tests/integration/defs/perf/$basePerfFilename"
    echo "Check perf result"
    python3 $llmSrcNode/tests/integration/defs/perf/sanity_perf_check.py \
        $stageName/perf_script_test_results.csv \
        $basePerfPath
    echo "Check perf report"
    python3 $llmSrcNode/tests/integration/defs/perf/create_perf_comparison_report.py \
        --output_path $stageName/report.pdf \
        --files $stageName/perf_script_test_results.csv \
        $basePerfPath
fi
Fix perf paths: use jobWorkspace instead of stageName on SLURM node.
Results are written under $jobWorkspace; current paths won’t exist remotely.
-    basePerfPath="$llmSrcNode/tests/integration/defs/perf/$basePerfFilename"
+    basePerfPath="$llmSrcNode/tests/integration/defs/perf/$basePerfFilename"
@@
-    python3 $llmSrcNode/tests/integration/defs/perf/sanity_perf_check.py \
-        $stageName/perf_script_test_results.csv \
-        $basePerfPath
+    python3 "$llmSrcNode/tests/integration/defs/perf/sanity_perf_check.py" \
+        "$jobWorkspace/perf_script_test_results.csv" \
+        "$basePerfPath"
@@
-    python3 $llmSrcNode/tests/integration/defs/perf/create_perf_comparison_report.py \
-        --output_path $stageName/report.pdf \
-        --files $stageName/perf_script_test_results.csv \
-        $basePerfPath
+    python3 "$llmSrcNode/tests/integration/defs/perf/create_perf_comparison_report.py" \
+        --output_path "$jobWorkspace/report.pdf" \
+        --files "$jobWorkspace/perf_script_test_results.csv" \
+        "$basePerfPath"

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
if [ "$perfMode" = "true" ]; then
    if [[ "$stageName" == *PyTorch* ]]; then
        basePerfFilename="base_perf_pytorch.csv"
    else
        basePerfFilename="base_perf.csv"
    fi
    basePerfPath="$llmSrcNode/tests/integration/defs/perf/$basePerfFilename"
    echo "Check perf result"
    python3 "$llmSrcNode/tests/integration/defs/perf/sanity_perf_check.py" \
        "$jobWorkspace/perf_script_test_results.csv" \
        "$basePerfPath"
    echo "Check perf report"
    python3 "$llmSrcNode/tests/integration/defs/perf/create_perf_comparison_report.py" \
        --output_path "$jobWorkspace/report.pdf" \
        --files "$jobWorkspace/perf_script_test_results.csv" \
        "$basePerfPath"
fi
🧰 Tools
🪛 Shellcheck (0.10.0)
[warning] 85-85: perfMode is referenced but not assigned.
(SC2154)
🤖 Prompt for AI Agents
In jenkins/scripts/slurm_run.sh around lines 85 to 101, the perf script
currently uses $stageName for result and report paths which do not exist on the
SLURM node; change those references to $jobWorkspace (e.g. use
"$jobWorkspace/perf_script_test_results.csv" and "$jobWorkspace/report.pdf") and
update both python calls to pass the basePerfPath and jobWorkspace paths,
quoting variables to handle spaces, so the sanity_perf_check and
create_perf_comparison_report operate on files under the remote job workspace.
PR_Github #16820 [ run ] triggered by Bot

PR_Github #16818 [ run ] completed with state

PR_Github #16820 [ run ] completed with state

/bot run --stage-list "DGX_B200-4_GPUs-PyTorch-Post-Merge-1" --disable-fail-fast

PR_Github #16880 [ run ] triggered by Bot
Actionable comments posted: 0
♻️ Duplicate comments (1)
jenkins/L0_Test.groovy (1)
324-336: Critical: exporting pytestCommand with embedded quotes corrupts the script; persist to file and read it at runtime

Joining the command and embedding it into export pytestCommand="..." breaks when the content itself contains quotes (e.g., __LUNOWUD="..."). This matches the CI failures on DGX_B200 stages. Write the command to a file and read it back to avoid quoting/escaping hell. Also enable strict bash options.

- def pytestCommand = getPytestBaseCommandLine(
+ def pytestCommand = getPytestBaseCommandLine(
      llmSrcNode, stageName, testListPathNode, splits, splitId, perfMode,
      jobWorkspace, "__PLACEHOLDER_TRTLLM_WHL_PATH__",
      "__PLACEHOLDER_coverageConfigFile__", pytestUtil
  ).join(" ")
@@
- def scriptContent = """#!/bin/bash
+ def scriptContent = """#!/bin/bash
+     set -euo pipefail
      export jobWorkspace=$jobWorkspace
      export tarName=$tarName
      export llmTarfile=$llmTarfile
      export llmSrcNode=$llmSrcNode
      export stageName=$stageName
      export perfMode=$perfMode
      export resourcePathNode=$resourcePathNode
      export nodeCount=$nodeCount
-     export pytestCommand="$pytestCommand"
+     # Persist full pytest command to avoid export-time quoting issues
+     cat > "$jobWorkspace/.pytest_cmd" <<'PYTEST_CMD'
+${pytestCommand}
+PYTEST_CMD
+     export PYTEST_CMD_FILE="$jobWorkspace/.pytest_cmd"
      export NVIDIA_IMEX_CHANNELS=0
      chmod +x $scriptRunNode
  """

Follow-up: slurm_run.sh should prefer reading and executing the command from $PYTEST_CMD_FILE if set (or replace eval with bash -lc "$(<"$PYTEST_CMD_FILE")"). I can send a PR to slurm_run.sh if needed.

#!/bin/bash
# Show vulnerable exports and ensure the new file-based approach is present
rg -nP --type=groovy -C2 'export\s+pytestCommand="?[^"]*"\$' jenkins/L0_Test.groovy
rg -nP --type=groovy -C2 '\.pytest_cmd' jenkins/L0_Test.groovy
# Inspect slurm_run.sh for 'eval' vs 'bash -lc' and optional file read
rg -n 'pytestCommand' jenkins/scripts/slurm_run.sh
rg -n 'eval' jenkins/scripts/slurm_run.sh

Also applies to: 337-371
🧹 Nitpick comments (6)
jenkins/L0_Test.groovy (6)
1-1: Avoid pinning to a personal/dev shared-lib branch in mainline CI

Use a stable tag/branch or make it configurable; dev branches can break CI unpredictably.
135-157: Cleanup should be best-effort and non-fatal

If the workspace was never created or already removed, rm -rf errors could fail the stage. Add || true.

- "rm -rf /home/svc_tensorrt/bloom/scripts/${jobUID}"
+ "rm -rf /home/svc_tensorrt/bloom/scripts/${jobUID} || true"
161-170: Slurm args: single-node should use --gpus-per-node; uneven GPU splits need verification

- For nodeCount == 1, prefer --gpus-per-node to align with containerized srun setups.
- With uneven splits (e.g., 5 GPUs over 2 nodes), ceiled gpusPerNode=3 with --ntasks=5 and --ntasks-per-node=3 might overconstrain placement. Validate on your cluster.

- return nodeCount == 1 ? [
-     "--nodes=${nodeCount}",
-     "--gpus=${gpuCount}"
- ] : [
+ return nodeCount == 1 ? [
+     "--nodes=${nodeCount}",
+     "--gpus-per-node=${gpuCount}"
+ ] : [
      "--nodes=${nodeCount}",
      "--ntasks=${gpuCount}",
      "--ntasks-per-node=${gpusPerNode}",
      "--gpus-per-node=${gpusPerNode}",
  ]
172-230: getPytestBaseCommandLine: avoid hidden global state and fragile quoting

- This method reads testFilter[(DETAILED_LOG)] implicitly. Prefer an explicit parameter for determinism and testability.
- Returning a token array is good; the later export/quoting is where it breaks (see next comment).

If you want to decouple from globals with minimal churn:

-def getPytestBaseCommandLine(
+def getPytestBaseCommandLine(
     String llmSrc,
     String stageName,
     String testDBList,
     int splits,
     int split_id,
     Boolean perfMode,
     String outputPath,
     String trtllmWheelPath,
     String coverageConfigFile,
-    String pytestUtil = ""
+    String pytestUtil = "",
+    Boolean detailedLog = testFilter[(DETAILED_LOG)]
 ) {
@@
-    testFilter[(DETAILED_LOG)] ? "-s" : "",
+    detailedLog ? "-s" : "",
350-368: Harden launch script and remove unused local var

- Add strict bash flags (already added above).
- The local outputPath variable is unused; remove it.

-        } else {
-            String outputPath = "${jobWorkspace}/job-output.log"
+        } else {
             taskArgs = [
                 *taskArgs,
             ]
232-233: Unused parameters in runLLMTestlistOnSlurm
skipInstallWheel and cpver are unused in this method. Either use them or drop them to avoid confusion.

-def runLLMTestlistOnSlurm(pipeline, platform, testList, config=VANILLA_CONFIG, perfMode=false, stageName="Undefined", splitId=1, splits=1, gpuCount=1, nodeCount=1, skipInstallWheel=false, cpver="cp312")
+def runLLMTestlistOnSlurm(pipeline, platform, testList, config=VANILLA_CONFIG, perfMode=false, stageName="Undefined", splitId=1, splits=1, gpuCount=1, nodeCount=1)
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
 - Jira integration is disabled by default for public repositories
 - Linear integration is disabled by default for public repositories
 
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (1)
jenkins/L0_Test.groovy (11 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
 
🔇 Additional comments (10)
jenkins/L0_Test.groovy (10)
266-288: LGTM: remote script staging

Copying slurm_run.sh and the rendered test list to the FE node is clear and retriable.

306-318: LGTM: containerized srun assembly

Using --container-image, --container-workdir, and bind mounts looks correct for trt-llm containerized srun.

319-335: PyTest launcher selection for multi-node is reasonable

Using trtllm-llmapi-launch only when nodeCount > 1 makes sense.

378-387: LGTM: remote execution

Executing the pre-staged launcher over SSH is straightforward and retriable via Utils.exec.

391-392: LGTM: result upload then cleanup in finally

Ordering is correct; ensures artifacts are collected even on failures.

1389-1417: LGTM: on-box pytest assembly with coverage

Local path uses array form (no export), avoiding quoting issues. Coverage config generation looks fine.

1438-1441: LGTM: rerun flow integration

Running pytest, then conditional rerun via parsed fail signatures is wired correctly.

Also applies to: 1445-1445

1818-1822: Verify new DGX_B200 Slurm mapping and partition label

Ensure "b200-trtllm" exists in SlurmConfig and the hardware matches 4-GPU expectations. These stages are the ones failing via /bot; the fix in quoting (above) should unblock, but please double-check partition naming.

1847-1850: Verify GB200 single-node Slurm mapping

Confirm "gb200-trtllm" partition is available and correctly wired to the intended rack/pool.

1854-1862: Multi-node SBSA configs: sanity-check MPI settings and resource math

- Using --mpi=pmi2 above: confirm PMI2 is available on these clusters.
- 8 GPUs across 2 nodes implies 4 per-node; ensure getNodeArgs and cluster constraints match.

Also applies to: 1892-1893
PR_Github #16880 [ run ] completed with state

/bot run --stage-list "DGX_B200-4_GPUs-PyTorch-Post-Merge-1" --disable-fail-fast

PR_Github #16890 [ run ] triggered by Bot
Actionable comments posted: 4
♻️ Duplicate comments (1)
jenkins/L0_Test.groovy (1)
324-336: Critical: exporting pytestCommand with embedded quotes will break the script; persist to a file and consume at runtime.

The command contains quotes (e.g., __LUNOWUD="..."), which corrupts export pytestCommand="...". Write the full command to a file and have slurm_run.sh read/execute it.

- def pytestCommand = getPytestBaseCommandLine(
+ def pytestCommand = getPytestBaseCommandLine(
@@
- def scriptContent = """#!/bin/bash
+ def scriptContent = """#!/bin/bash
+     set -euo pipefail
      export jobWorkspace=$jobWorkspace
@@
-     export nodeCount=$nodeCount
-     export pytestCommand="$pytestCommand"
-     export NVIDIA_IMEX_CHANNELS=0
-     chmod +x $scriptRunNode
+     export nodeCount=$nodeCount
+     export NVIDIA_IMEX_CHANNELS=0
+     # Persist pytest command verbatim to avoid quote corruption
+     cat > "$jobWorkspace/.pytest_cmd" <<'PYTEST_CMD'
+${pytestCommand}
+PYTEST_CMD
+     export PYTEST_CMD_FILE="$jobWorkspace/.pytest_cmd"
+     chmod +x "$scriptRunNode"
@@
-     scriptContent = scriptContent.replaceAll('\t','').stripIndent()
+     scriptContent = scriptContent.stripIndent()

Companion change required in jenkins/scripts/slurm_run.sh (outside this file):

- Replace any eval usage with: bash -lc "$(<"${PYTEST_CMD_FILE}")"
- Or read the file into a variable and execute with bash -lc "$cmd".

#!/bin/bash
# Verify slurm_run.sh does not use eval and can read PYTEST_CMD_FILE
rg -nC2 -e 'eval\s' -e 'PYTEST_CMD_FILE' jenkins/scripts/slurm_run.sh || true

Also applies to: 337-349, 350-368, 369-371
🧹 Nitpick comments (4)
jenkins/L0_Test.groovy (4)
172-230: Make getPytestBaseCommandLine robust: drop empty args; avoid hidden globals.
- Filter out empty strings so argv is clean (pytestUtil may be empty).
 - Optional (nice): pass a detailedLog flag instead of reading global testFilter.
 def getPytestBaseCommandLine(
@@
-    testFilter[(DETAILED_LOG)] ? "-s" : "",
+    testFilter[(DETAILED_LOG)] ? "-s" : "",
@@
-    return testCmdLine as String[]
+    // remove blanks/nulls
+    testCmdLine = testCmdLine.findAll { it?.toString().trim() }
+    return testCmdLine as String[]
319-323: Avoid injecting empty argv element for pytestUtil.

Minor cleanup: don't pass a blank arg.

-    String pytestUtil = ""
-    if (nodeCount > 1) {
-        pytestUtil = "$llmSrcNode/tensorrt_llm/llmapi/trtllm-llmapi-launch"
-    }
+    String pytestUtil = (nodeCount > 1) ? "$llmSrcNode/tensorrt_llm/llmapi/trtllm-llmapi-launch" : null

(Note: with the previous argv filtering in getPytestBaseCommandLine, nulls/blanks are dropped.)
360-367: Remove unused variable.

outputPath is unused.
- String outputPath = "${jobWorkspace}/job-output.log"
1405-1417: Local run: argv construction is good; optional: align with file-based command execution for parity.

Not required, but you can mirror the file-based approach here for consistency and easier debugging.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
 - Jira integration is disabled by default for public repositories
 - Linear integration is disabled by default for public repositories
 
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (1)
jenkins/L0_Test.groovy (11 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
 
🔇 Additional comments (1)
jenkins/L0_Test.groovy (1)
1818-1822: Ensure SlurmConfig & mapping script support new b200-trtllm/gb200-trtllm keys. Add these labels to SlurmConfig.partitionConfig and clusterConfig, and verify the test-to-stage mapping script (path tests/integration/defs/scripts/test_to_stage_mapping.py appears missing) correctly resolves the keys at L0_Test.groovy lines 1818–1822, 1847–1850, 1855–1862 and 1892–1893.
@@ -1,4 +1,4 @@
-@Library(['bloom-jenkins-shared-lib@main', 'trtllm-jenkins-shared-lib@main']) _
+@Library(['bloom-jenkins-shared-lib@dev-yuanjingx-slurm_refactor', 'trtllm-jenkins-shared-lib@main']) _
💡 Verification agent
🧩 Analysis chain
Do not merge with a dev shared library reference. Revert to main.
Pinning to a feature branch will break main after merge.
-@Library(['bloom-jenkins-shared-lib@dev-yuanjingx-slurm_refactor', 'trtllm-jenkins-shared-lib@main']) _
+@Library(['bloom-jenkins-shared-lib@main', 'trtllm-jenkins-shared-lib@main']) _🏁 Script executed:
#!/bin/bash
# Verify no other dev shared-lib references exist
grep -R "dev-yuanjingx-slurm_refactor" -n .
# Confirm @Library usage in the Jenkinsfile under review
grep -R "@Library" -n jenkins/L0_Test.groovy || true
# Preview the first few lines of jenkins/L0_Test.groovy
head -n 5 jenkins/L0_Test.groovy

Length of output: 564
Revert bloom-jenkins-shared-lib to @main
Pinning to dev-yuanjingx-slurm_refactor will break the main pipeline—update line 1 accordingly.
-@Library(['bloom-jenkins-shared-lib@dev-yuanjingx-slurm_refactor', 'trtllm-jenkins-shared-lib@main']) _
+@Library(['bloom-jenkins-shared-lib@main',                    'trtllm-jenkins-shared-lib@main']) _📝 Committable suggestion
@Library(['bloom-jenkins-shared-lib@main', 'trtllm-jenkins-shared-lib@main']) _
🤖 Prompt for AI Agents
In jenkins/L0_Test.groovy around line 1, the Jenkins shared library reference is
pinned to dev-yuanjingx-slurm_refactor which will break the main pipeline;
change the library ref to use the stable main branch by replacing
'@dev-yuanjingx-slurm_refactor' with '@main' so the line becomes
@Library(['bloom-jenkins-shared-lib@main', 'trtllm-jenkins-shared-lib@main']) _
and save the file.
-    return [
+    return nodeCount == 1 ? [
+        "--nodes=${nodeCount}",
+        "--gpus=${gpuCount}"
+    ] : [
         "--nodes=${nodeCount}",
         "--ntasks=${gpuCount}",
         "--ntasks-per-node=${gpusPerNode}",
         "--gpus-per-node=${gpusPerNode}",
-    ].join(" ")
+    ]
     }
          
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🛠️ Refactor suggestion
❓ Verification inconclusive
Harden getNodeArgs: handle invalid inputs; avoid deprecated BigDecimal rounding.
Avoid ROUND_CEILING and return null for invalid node/gpu counts (caller already checks for null).
-def getNodeArgs(int nodeCount, int gpuCount) {
-    int gpusPerNode = ((gpuCount / nodeCount) as BigDecimal).setScale(0, BigDecimal.ROUND_CEILING).intValue()
+def getNodeArgs(int nodeCount, int gpuCount) {
+    if (nodeCount < 1 || gpuCount < 1) {
+        return null
+    }
+    int gpusPerNode = Math.ceil(gpuCount / (double) nodeCount) as int
     return nodeCount == 1 ? [
         "--nodes=${nodeCount}",
         "--gpus=${gpuCount}"
     ] : [
         "--nodes=${nodeCount}",
         "--ntasks=${gpuCount}",
         "--ntasks-per-node=${gpusPerNode}",
         "--gpus-per-node=${gpusPerNode}",
     ]
 }

Harden getNodeArgs: validate inputs & avoid deprecated BigDecimal rounding.

- Guard against invalid counts (nodeCount < 1 || gpuCount < 1) by returning null.
- Replace ((gpuCount/nodeCount) as BigDecimal).setScale(0, BigDecimal.ROUND_CEILING).intValue() with Math.ceil(gpuCount/(double)nodeCount) as int.
Verify that every caller properly handles a null return from getNodeArgs.
🤖 Prompt for AI Agents
In jenkins/L0_Test.groovy around lines 160 to 170, harden getNodeArgs by first
validating inputs and returning null when nodeCount < 1 || gpuCount < 1, replace
the deprecated BigDecimal rounding expression with
Math.ceil(gpuCount/(double)nodeCount) cast to int for gpusPerNode, and update
callers to handle a possible null return (e.g., check for null and abort or
propagate error) so invalid inputs don't produce incorrect args.
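The ceiling split this comment asks for needs no BigDecimal (and no floating point) at all; in integer arithmetic it is the classic (a + b - 1) / b idiom. A minimal sketch in shell, with illustrative names mirroring the gpusPerNode computation:

```shell
#!/bin/sh
# gpus_per_node NODE_COUNT GPU_COUNT
# Ceiling division without floating point: (a + b - 1) / b.
gpus_per_node() {
    nodeCount=$1
    gpuCount=$2
    # Guard against invalid counts, as the review suggests.
    if [ "$nodeCount" -lt 1 ] || [ "$gpuCount" -lt 1 ]; then
        echo "invalid node/gpu count" >&2
        return 1
    fi
    echo $(( (gpuCount + nodeCount - 1) / nodeCount ))
}

gpus_per_node 2 5   # 5 GPUs over 2 nodes -> ceiled to 3 per node
gpus_per_node 1 4   # single node -> 4
```

The same idiom carries over to Groovy as `(gpuCount + nodeCount - 1).intdiv(nodeCount)`, avoiding both the deprecated rounding constant and float conversion.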
@@ -357,80 +277,118 @@
Don’t pass the password on the command line. Use sshpass -e.
Prevents leaking secrets via process list/logging.
-Utils.exec(pipeline, script: "sshpass -p '${remote.passwd}' ssh ${COMMON_SSH_OPTIONS} ${remote.user}@${remote.host} 'mkdir -p ${jobWorkspace}'", numRetries: 3,)
+Utils.exec(pipeline, script: "SSHPASS='${remote.passwd}' sshpass -e ssh ${COMMON_SSH_OPTIONS} ${remote.user}@${remote.host} 'mkdir -p ${jobWorkspace}'", numRetries: 3,)

📝 Committable suggestion
stage("[${stageName}] Initializing Test") {
    // Create Job Workspace folder in Frontend Node
    Utils.exec(pipeline, script: "SSHPASS='${remote.passwd}' sshpass -e ssh ${COMMON_SSH_OPTIONS} ${remote.user}@${remote.host} 'mkdir -p ${jobWorkspace}'", numRetries: 3,)
🤖 Prompt for AI Agents
In jenkins/L0_Test.groovy around lines 270 to 273, do not interpolate the
plaintext password into the sshpass command; instead switch to sshpass -e and
supply the password via the SSHPASS environment variable (use Jenkins
credentials binding or withEnv) so the password is not visible on the command
line or process list. Update the Utils.exec invocation to remove
'${remote.passwd}' from the command string and ensure remote.passwd is injected
into the process environment (or retrieved from Jenkins credentials) before
running sshpass -e ssh ..., keeping the same COMMON_SSH_OPTIONS, remote
user/host and mkdir -p jobWorkspace and retaining numRetries.
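The difference between sshpass -p and sshpass -e can be illustrated without a live SSH target: only the argv string shows up in ps output and build logs, so keeping the secret out of it is the whole point. A sketch with made-up values (host, password, and paths are illustrative):

```shell
#!/bin/sh
secret='s3cr3t-password'
host='svc_tensorrt@fe-node.example'   # illustrative host

# Vulnerable form: the secret is part of the command line.
bad_cmd="sshpass -p '$secret' ssh $host 'mkdir -p /tmp/ws'"

# Safer form: sshpass -e reads the SSHPASS environment variable,
# so the logged/visible command line is secret-free.
good_cmd="sshpass -e ssh $host 'mkdir -p /tmp/ws'"
# At run time the secret travels out-of-band:
#   SSHPASS="$secret" $good_cmd

case "$bad_cmd"  in *"$secret"*) echo "bad_cmd leaks the secret" ;; esac
case "$good_cmd" in *"$secret"*) ;; *) echo "good_cmd is secret-free" ;; esac
```

In Jenkins, the cleanest way to populate SSHPASS is a credentials binding or withEnv block rather than string interpolation into the script text.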
}
}
💡 Verification agent
🧩 Analysis chain
Re-enable cleanup, gated by a debug flag.
Leaving remote workspaces indefinitely will leak disk.
-        //cleanUpNodeResources(pipeline, cluster, jobUID)
+        if (!(env.KEEP_SLURM_WS_FOR_DEBUG?.toBoolean())) {
+            cleanUpNodeResources(pipeline, cluster, jobUID)
+        } else {
+            echo "KEEP_SLURM_WS_FOR_DEBUG=true; preserving remote workspace for ${jobUID}"
+        }🏁 Script executed:
#!/bin/bash
# Locate definition of the cleanup helper
rg -n 'def cleanUpNodeResources' -g '*.groovy'
# Find all invocations of the cleanup call
rg -n 'cleanUpNodeResources' -g '*.groovy'
# Check for existing debug flag usage or declaration
rg -n 'KEEP_SLURM_WS_FOR_DEBUG' -g '*.groovy'

Length of output: 430
Re-enable cleanup behind a debug flag in jenkins/L0_Test.groovy:391
Replace the commented-out call with:
if (!env.KEEP_SLURM_WS_FOR_DEBUG?.toBoolean()) {
    cleanUpNodeResources(pipeline, cluster, jobUID)
} else {
    echo "KEEP_SLURM_WS_FOR_DEBUG=true; preserving remote workspace for ${jobUID}"
}Also add and document the KEEP_SLURM_WS_FOR_DEBUG environment variable or pipeline parameter.
🤖 Prompt for AI Agents
In jenkins/L0_Test.groovy around lines 391-392, re-enable node cleanup but guard
it with a debug flag: replace the commented-out cleanUpNodeResources call with a
conditional that checks env.KEEP_SLURM_WS_FOR_DEBUG (toBoolean) and only calls
cleanUpNodeResources(pipeline, cluster, jobUID) when the flag is false;
otherwise emit an echo that the workspace is preserved for the jobUID. Also add
a documented pipeline parameter or environment variable definition for
KEEP_SLURM_WS_FOR_DEBUG (default false) in the pipeline configuration or README
so users know how to enable workspace preservation for debugging.
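The proposed gate is easy to prototype in shell; KEEP_SLURM_WS_FOR_DEBUG and the echo wording mirror the review's suggestion, while the actual cleanup call is stubbed out here:

```shell
#!/bin/sh
# Returns 0 (clean up) unless KEEP_SLURM_WS_FOR_DEBUG is truthy.
should_clean_up() {
    case "${KEEP_SLURM_WS_FOR_DEBUG:-false}" in
        [Tt]rue|TRUE|1) return 1 ;;   # preserve workspace for debugging
        *)              return 0 ;;   # default: clean up
    esac
}

jobUID="job-1234"   # illustrative
if should_clean_up; then
    echo "cleaning up remote workspace for $jobUID"
else
    echo "KEEP_SLURM_WS_FOR_DEBUG=true; preserving remote workspace for $jobUID"
fi
```

Defaulting to cleanup keeps disk usage bounded; the flag only flips behavior when someone explicitly opts in for a debugging session.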
PR_Github #16890 [ run ] completed with state

/bot run --stage-list "DGX_B200-4_GPUs-PyTorch-Post-Merge-1" --disable-fail-fast

Force-pushed 19ffc0d to adcb573 (Compare)

/bot run --stage-list "DGX_B200-4_GPUs-PyTorch-Post-Merge-1" --disable-fail-fast

PR_Github #16968 [ run ] triggered by Bot

PR_Github #16968 [ run ] completed with state
…convert to Git LFS

consolidate test code for slurm jobs and regular jobs

Signed-off-by: Yuanjing Xue <[email protected]>
Signed-off-by: Yuanjing Xue <[email protected]>
Force-pushed adcb573 to 061fcba (Compare)

/bot run --stage-list "DGX_B200-4_GPUs-PyTorch-Post-Merge-1" --disable-fail-fast

PR_Github #16999 [ run ] triggered by Bot

PR_Github #16999 [ run ] completed with state
Summary by CodeRabbit
New Features
Refactor
Chores
Description
Test Coverage
GitHub Bot Help
/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provide a user-friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message. See details below for each supported subcommand.

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

- --reuse-test (optional)pipeline-id (OPTIONAL): Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline, or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.
- --disable-reuse-test (OPTIONAL): Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensures that all builds and tests run regardless of previous successes.
- --disable-fail-fast (OPTIONAL): Disable fail fast on build/tests/infra failures.
- --skip-test (OPTIONAL): Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.
- --stage-list "A10-PyTorch-1, xxx" (OPTIONAL): Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.
- --gpu-type "A30, H100_PCIe" (OPTIONAL): Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.
- --test-backend "pytorch, cpp" (OPTIONAL): Skip test stages which don't match the specified backends. Only supports [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.
- --only-multi-gpu-test (OPTIONAL): Only run the multi-GPU tests. Note: Does NOT update GitHub check status.
- --disable-multi-gpu-test (OPTIONAL): Disable the multi-GPU tests. Note: Does NOT update GitHub check status.
- --add-multi-gpu-test (OPTIONAL): Force run the multi-GPU tests in addition to running the L0 pre-merge pipeline.
- --post-merge (OPTIONAL): Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.
- --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL): Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".
- --detailed-log (OPTIONAL): Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.
- --debug (OPTIONAL): Experimental feature. Enable access to the CI container for debugging purposes. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md and the scripts/test_to_stage_mapping.py helper.

kill

Kill all running builds associated with the pull request.

skip

skip --comment COMMENT

Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline

Reuse a previous pipeline to validate the current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.