
Conversation

yunruis
Contributor

@yunruis yunruis commented Aug 27, 2025

Summary by CodeRabbit

  • New Features

    • Optional batch-wait controls for scheduling: iteration-based timeout and token-ratio threshold (disabled by default).
    • Configurable max token limit (default 8192) used for batching thresholds; the scheduler may defer context processing until the thresholds are met in order to improve GPU utilization.
  • Bug Fixes / Validation

    • Input validation added: iterations >= 0 and token ratio constrained to 0–1.
  • Tests

    • Added integration test coverage for batch-waiting; duplicate test entries were also introduced.

Description

To enable the feature, add the parameters to config.yaml. For example:

batch_wait_timeout_iters: 20
batch_wait_max_tokens_ratio: 0.75
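
The same knobs are also exposed as TorchLlmArgs / LLM constructor arguments (see the API-stability reference in the walkthrough below). A minimal, illustrative sketch; the model path is a placeholder:

from tensorrt_llm import LLM

# Illustrative only: the model path is a placeholder; the two batch-wait
# arguments are the new TorchLlmArgs fields added by this PR.
llm = LLM(
    model="/path/to/model",
    batch_wait_timeout_iters=20,       # wait up to 20 scheduler iterations
    batch_wait_max_tokens_ratio=0.75,  # or until 75% of max_num_tokens is scheduled
)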

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provide a user friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.

kill

kill

Kill all running builds associated with pull request.

skip

skip --comment COMMENT

Skip testing for latest commit on pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

@yunruis yunruis requested review from a team as code owners August 27, 2025 08:21
@yunruis yunruis requested review from HuiGao-NV and syuoni August 27, 2025 08:21
Contributor

coderabbitai bot commented Aug 27, 2025

📝 Walkthrough

Walkthrough

Adds three batch-waiting configuration fields to PyTorchConfig and TorchLlmArgs, wires defaults in ADEngine shim, implements internal batch-waiting state and a _waiting_requests helper in PyExecutor to defer or release context scheduling based on iteration and token-ratio thresholds, and adds parameterized integration tests and API-stability references.

Changes

Cohort / File(s) Summary of changes
PyTorch backend config
tensorrt_llm/_torch/pyexecutor/config.py
Added fields to PyTorchConfig: max_num_tokens: int = 8192, batch_wait_timeout_iters: int = 0, batch_wait_max_tokens_ratio: float = 0.
Executor scheduling logic
tensorrt_llm/_torch/pyexecutor/py_executor.py
Added batch-waiting state (max_num_tokens, batch_wait_timeout_iters, batch_wait_max_tokens_ratio, enable_batch_waiting, batch_wait_iters_count), implemented internal helper _waiting_requests(context_requests, generation_requests) to defer or release context scheduling based on iteration and token-ratio thresholds, integrated into _schedule when attention-DP is disabled and generation requests exist (guard added to avoid waiting when no generation requests).
LLM args and validation
tensorrt_llm/llmapi/llm_args.py
Added TorchLlmArgs fields batch_wait_timeout_iters and batch_wait_max_tokens_ratio with validators (batch_wait_timeout_iters >= 0, 0 <= batch_wait_max_tokens_ratio <= 1) and propagated them into PyTorchConfig via get_pytorch_backend_config.
Auto-deploy shim
tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py
Initialized defaults on pytorch_backend_config: batch_wait_timeout_iters = 0, batch_wait_max_tokens_ratio = 0.0, max_num_tokens = 8192.
Integration tests
tests/integration/defs/accuracy/test_llm_api_pytorch.py
Added parameterized test_nvfp4_batch_waiting in DeepSeekV3Lite tests exercising batch_wait_timeout_iters and batch_wait_max_tokens_ratio under NVFP4 MOE configs; the test appears duplicated in the file.
Test roster
tests/integration/test_lists/qa/llm_function_full.txt
Inserted new test roster entry for DeepSeekV3Lite including batch_wait_timeout_iters=10 and batch_wait_max_tokens_ratio=0.75.
API stability reference
tests/unittest/api_stability/references/llm.yaml
Recorded new __init__ parameters: batch_wait_timeout_iters (int, default 0) and batch_wait_max_tokens_ratio (float, default 0).
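
For reference, the validation summarized above can be sketched roughly as follows (an illustrative pydantic-style sketch, not the exact implementation in tensorrt_llm/llmapi/llm_args.py):

from pydantic import BaseModel, Field, model_validator

class BatchWaitArgsSketch(BaseModel):
    # Hypothetical stand-in for the two TorchLlmArgs fields added by this PR.
    batch_wait_timeout_iters: int = Field(default=0)
    batch_wait_max_tokens_ratio: float = Field(default=0.0)

    @model_validator(mode="after")
    def validate_batch_wait_timeout_iters(self):
        if self.batch_wait_timeout_iters < 0:
            raise ValueError("batch_wait_timeout_iters must be >= 0")
        return self

    @model_validator(mode="after")
    def validate_batch_wait_max_tokens_ratio(self):
        if not 0.0 <= self.batch_wait_max_tokens_ratio <= 1.0:
            raise ValueError("batch_wait_max_tokens_ratio must be in [0, 1]")
        return self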

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant Client
  participant TorchLlmArgs
  participant PyTorchConfig
  participant PyExecutor
  participant Scheduler

  Client->>TorchLlmArgs: construct with batch_wait_* params
  TorchLlmArgs->>TorchLlmArgs: validate params
  TorchLlmArgs->>PyTorchConfig: get_pytorch_backend_config(..., batch_wait_*, max_num_tokens)
  PyTorchConfig-->>PyExecutor: config on init
  PyExecutor->>PyExecutor: set enable_batch_waiting, batch_wait_iters_count=0

  loop scheduling step
    PyExecutor->>Scheduler: collect generation_requests + context_requests
    alt attention_dp disabled AND enable_batch_waiting AND generation_requests exist
      PyExecutor->>PyExecutor: _waiting_requests(context_requests, generation_requests)
      alt stop conditions met (iters >= timeout_iters OR total_tokens >= ratio * max_num_tokens)
        Note right of PyExecutor: release deferred context requests\nreset batch_wait_iters_count
        PyExecutor-->>Scheduler: schedule context + generation
      else still waiting
        Note right of PyExecutor: increment batch_wait_iters_count\ndefers context scheduling
        PyExecutor-->>Scheduler: schedule generation only
      end
    else
      PyExecutor-->>Scheduler: schedule context + generation (no waiting)
    end
  end
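Expressed as code, the waiting decision in the diagram reduces to roughly the following (a simplified sketch of the _waiting_requests helper using the names from this PR; the exact token accounting is discussed in the review comments below):

def _waiting_requests(self, context_requests, generation_requests):
    """Return the context requests to schedule this iteration, or an empty
    list to keep deferring them while generation requests run alone."""
    if not self.enable_batch_waiting:
        return context_requests

    # Rough token estimate for the candidate batch (context + generation).
    num_scheduled_tokens = (
        sum(len(req.get_tokens(0)) for req in context_requests) +
        sum(len(req.get_tokens(0)) for req in generation_requests))

    stop_waiting = (
        self.batch_wait_iters_count >= self.batch_wait_timeout_iters
        or num_scheduled_tokens >=
        self.batch_wait_max_tokens_ratio * self.max_num_tokens)
    if stop_waiting:
        self.batch_wait_iters_count = 0   # release deferred context requests
        return context_requests
    self.batch_wait_iters_count += 1      # keep waiting; schedule generation only
    return []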

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

Suggested labels

Community want to contribute

Suggested reviewers

  • QiJune
  • pcastonguay
  • chzblych
  • syuoni
  • netanel-haber
  • lucaslie

@litaotju litaotju requested a review from Kefeng-Duan August 27, 2025 08:29
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tensorrt_llm/_torch/pyexecutor/py_executor.py (1)

183-195: Guard against None/invalid max_num_tokens when TP waiting is enabled.

If max_num_tokens is None or <= 0, the wait threshold math will raise or misbehave.

Apply this diff to fail fast:

         self.max_draft_len = max_draft_len
-        self.max_num_tokens = model_engine.pytorch_backend_config.max_num_tokens
+        self.max_num_tokens = model_engine.pytorch_backend_config.max_num_tokens
         self.print_log = model_engine.pytorch_backend_config.print_iter_log
@@
-        self.attention_tp_enable_waiting = model_engine.pytorch_backend_config.attention_tp_enable_waiting
+        self.attention_tp_enable_waiting = model_engine.pytorch_backend_config.attention_tp_enable_waiting
         self.attention_tp_timeout_iters = model_engine.pytorch_backend_config.attention_tp_timeout_iters
         self.attention_tp_max_token_ratio = model_engine.pytorch_backend_config.attention_tp_max_token_ratio
+        if self.attention_tp_enable_waiting and (self.max_num_tokens is None or self.max_num_tokens <= 0):
+            raise ValueError(
+                "attention_tp_enable_waiting requires a positive max_num_tokens. "
+                "Set TorchLlmArgs.max_num_tokens or ensure build_config.max_num_tokens is propagated.")
🧹 Nitpick comments (7)
tensorrt_llm/_torch/pyexecutor/config.py (2)

1-1: Missing NVIDIA copyright header (2025).

Please prepend the standard NVIDIA copyright header to comply with repo guidelines.


53-57: Default max_num_tokens looks off; ensure it's intentional (4086 vs 4096).

If this is a typo, fix it; otherwise document why 4086 is chosen. Prefer deriving from BuildConfig to avoid drift.

Example fix:

-    max_num_tokens: int = 4086
+    # Derive at runtime via TorchLlmArgs/build_config; keep a sane default.
+    max_num_tokens: int = 4096
tensorrt_llm/_torch/pyexecutor/py_executor.py (1)

1-1: Missing NVIDIA copyright header (2025).

Please add the standard header.

tensorrt_llm/llmapi/llm_args.py (3)

1-1: Missing NVIDIA copyright header (2025).

Please add the standard header.


216-229: LGTM: AttentionTPConfig added with sane defaults and from_dict.

Consider clarifying in the docstring that TP waiting applies only when attention_dp is disabled.


2206-2210: Line too long; tighten field description (E501).

Shorten to avoid lint failures.

Apply:

-    attention_tp_config: Optional[AttentionTPConfig] = Field(
-        default=None,
-        description="Optimized load for the TP Attention.",
-        status="beta")
+    attention_tp_config: Optional[AttentionTPConfig] = Field(
+        default=None,
+        description="Token-parallel attention waiting controls.",
+        status="beta")
tensorrt_llm/llmapi/__init__.py (1)

1-1: Missing NVIDIA copyright header (2025).

Please add the standard header.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro


📥 Commits

Reviewing files that changed from the base of the PR and between f167b1f and f26d6e9.

📒 Files selected for processing (4)
  • tensorrt_llm/_torch/pyexecutor/config.py (1 hunks)
  • tensorrt_llm/_torch/pyexecutor/py_executor.py (5 hunks)
  • tensorrt_llm/llmapi/__init__.py (2 hunks)
  • tensorrt_llm/llmapi/llm_args.py (5 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Code must target Python 3.8+
Indent Python code with 4 spaces; do not use tabs
Preserve module namespaces when importing; import modules/packages and access members via the module (e.g., from package.subpackage import foo; foo.SomeClass())
Python file names should be snake_case
Python class names should be PascalCase
Python functions/methods and local variables should be snake_case; variables beginning with a number should be prefixed with k_ (e.g., k_99th_percentile)
Global variables should be UPPER_SNAKE_CASE prefixed with G_ (e.g., G_MY_GLOBAL); constants should be UPPER_SNAKE_CASE
Avoid shadowing variables from outer scopes; initialize all externally visible members in init
Prefer docstrings for interfaces used outside a file; comments should be reserved for in-function or file-local interfaces
Use Google-style docstrings for classes and functions; attributes and variables may be documented inline with trailing string literals
Avoid reflection when simpler, explicit code suffices (e.g., avoid dict(**locals()) patterns)
In try/except, catch the narrowest exceptions possible
For duck-typing patterns, keep the try body minimal and move logic to else to avoid masking unrelated failures

Files:

  • tensorrt_llm/llmapi/llm_args.py
  • tensorrt_llm/_torch/pyexecutor/py_executor.py
  • tensorrt_llm/_torch/pyexecutor/config.py
  • tensorrt_llm/llmapi/__init__.py
**/*.{c,cc,cpp,cxx,h,hh,hpp,hxx,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Prepend the NVIDIA copyright header (current year) to all source files (.cpp, .h, .cu, .py, etc.)

Files:

  • tensorrt_llm/llmapi/llm_args.py
  • tensorrt_llm/_torch/pyexecutor/py_executor.py
  • tensorrt_llm/_torch/pyexecutor/config.py
  • tensorrt_llm/llmapi/__init__.py
🧠 Learnings (1)
📚 Learning: 2025-08-26T09:37:10.431Z
Learnt from: jiaganc
PR: NVIDIA/TensorRT-LLM#7031
File: tensorrt_llm/bench/dataclasses/configuration.py:90-104
Timestamp: 2025-08-26T09:37:10.431Z
Learning: In TensorRT-LLM, the `get_pytorch_perf_config()` method returns `self.pytorch_config` which can contain default `cuda_graph_config` values, so `llm_args` may already have this config before the extra options processing.

Applied to files:

  • tensorrt_llm/llmapi/llm_args.py
🧬 Code graph analysis (3)
tensorrt_llm/llmapi/llm_args.py (2)
tensorrt_llm/builder.py (2)
  • default (50-58)
  • from_dict (572-677)
tensorrt_llm/_torch/pyexecutor/config.py (1)
  • PyTorchConfig (17-120)
tensorrt_llm/_torch/pyexecutor/py_executor.py (1)
tensorrt_llm/_torch/pyexecutor/llm_request.py (1)
  • LlmRequest (282-424)
tensorrt_llm/llmapi/__init__.py (1)
tensorrt_llm/llmapi/llm_args.py (1)
  • AttentionTPConfig (211-222)
🪛 Ruff (0.12.2)
tensorrt_llm/llmapi/llm_args.py

2210-2210: Line too long (165 > 120)

(E501)

tensorrt_llm/_torch/pyexecutor/py_executor.py

1277-1277: Line too long (172 > 120)

(E501)


1284-1284: Line too long (282 > 120)

(E501)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (5)
tensorrt_llm/_torch/pyexecutor/py_executor.py (2)

239-240: Good: initialized TP waiting iteration counter.


1298-1302: Gating TP waiting only when attention DP is disabled is correct.

tensorrt_llm/llmapi/llm_args.py (1)

2649-2650: LGTM: wiring attention_tp_config through update_llm_args_with_extra_dict.

tensorrt_llm/llmapi/__init__.py (2)

7-8: LGTM: exported AttentionTPConfig from llmapi.


61-62: LGTM: added AttentionTPConfig to all.

@yunruis yunruis changed the title [None][Opt] opt: add tp attn waiting [None][Opt] add batch waiting on scheduling Aug 27, 2025
@yunruis yunruis changed the title [None][Opt] add batch waiting on scheduling [None][Opt] Add batch waiting on scheduling Aug 27, 2025
@yunruis yunruis changed the title [None][Opt] Add batch waiting on scheduling [None][Opt] Add batch waiting when scheduling Aug 27, 2025
@yunruis yunruis force-pushed the syr/attn_tp_config branch from f26d6e9 to 4e65519 Compare August 27, 2025 13:13
@yunruis yunruis requested a review from a team as a code owner August 27, 2025 13:13
@yunruis yunruis requested a review from netanel-haber August 27, 2025 13:13
@yunruis yunruis changed the title [None][Opt] Add batch waiting when scheduling [None][opt] Add batch waiting when scheduling Aug 27, 2025
@yunruis yunruis force-pushed the syr/attn_tp_config branch from 4e65519 to f58f5a6 Compare August 27, 2025 13:20
@yunruis
Contributor Author

yunruis commented Aug 27, 2025

/bot run

@tensorrt-cicd
Collaborator

PR_Github #16692 [ run ] triggered by Bot

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

♻️ Duplicate comments (2)
tensorrt_llm/_torch/pyexecutor/py_executor.py (1)

1266-1287: Fix token accounting and NVTX label in _waiting_requests (and wrap long line).

  • NVTX label is incorrect ("_schedule" on _waiting_requests), hurting trace readability.
  • Token counting uses len(req.get_tokens(0)), which overcounts historical tokens and can prematurely release or stall waiting. Count per-iteration cost instead:
    • Context: sum context_chunk_size for scheduled context requests this iter.
    • Generation: approximate 1 token/request (optionally times beam width).
  • Also address E501 long line.

Apply:

-@nvtx_range("_schedule")
+@nvtx_range("_waiting_requests")
 def _waiting_requests(self, context_requests: list[LlmRequest],
                       generation_requests: list[LlmRequest]):
     if not self.enable_batch_waiting:
         return context_requests
 
     waited_context_requests = []
-    stop_waiting = False
-    num_scheduled_ctx_tokens = sum(
-        len(ctx_req.get_tokens(0)) for ctx_req in context_requests)
-    num_scheduled_gen_tokens = sum(
-        len(gen_req.get_tokens(0)) for gen_req in generation_requests)
-    num_scheduled_tokens = num_scheduled_ctx_tokens + num_scheduled_gen_tokens
-
-    stop_waiting = self.batch_wait_iters_count >= self.batch_wait_timeout_iters or num_scheduled_tokens >= self.batch_wait_max_tokens_ratio * self.max_num_tokens
+    # Per-iteration token costs:
+    num_scheduled_ctx_tokens = sum(
+        getattr(req, "context_chunk_size", 1) for req in context_requests
+    )
+    # Approximate 1 token per generation request this iteration.
+    # If desired, incorporate beam width:
+    # num_scheduled_gen_tokens = sum(getattr(req.sampling_config, "beam_width", 1) for req in generation_requests)
+    num_scheduled_gen_tokens = len(generation_requests)
+    num_scheduled_tokens = (
+        num_scheduled_ctx_tokens + num_scheduled_gen_tokens
+    )
+
+    stop_waiting = (
+        self.batch_wait_iters_count >= self.batch_wait_timeout_iters
+        or num_scheduled_tokens >= int(
+            self.batch_wait_max_tokens_ratio * self.max_num_tokens
+        )
+    )
     if stop_waiting:
         waited_context_requests = context_requests
         self.batch_wait_iters_count = 0
     else:
         self.batch_wait_iters_count += 1
     return waited_context_requests
tensorrt_llm/llmapi/llm_args.py (1)

2612-2615: Propagate a non-None max_num_tokens to the PyTorch backend

Ensure backend receives a concrete value (fallback to build_config) so ratio-based waiting can compute thresholds.

             attention_dp_batching_wait_iters=self.attention_dp_config.
             batching_wait_iters if self.attention_dp_config is not None else
             AttentionDpConfig.model_fields['batching_wait_iters'].default,
+            max_num_tokens=self.max_num_tokens
+            if self.max_num_tokens is not None
+            else self.build_config.max_num_tokens,
             batch_wait_timeout_ms=self.batch_wait_timeout_ms,
             batch_wait_timeout_iters=self.batch_wait_timeout_iters,
             batch_wait_max_tokens_ratio=self.batch_wait_max_tokens_ratio,
🧹 Nitpick comments (5)
tensorrt_llm/_torch/pyexecutor/py_executor.py (1)

1298-1305: Guard condition reads correctly; consider explicit short-circuit for readability.

No functional change needed. Optional clarity:

-        if not self.enable_attention_dp and self.enable_batch_waiting and len(
-                scheduler_output.context_requests) > 0 and len(
-                    scheduler_output.generation_requests) > 0:
+        if (not self.enable_attention_dp and self.enable_batch_waiting
+                and scheduler_output.context_requests
+                and scheduler_output.generation_requests):
tensorrt_llm/llmapi/llm_args.py (4)

2243-2254: Use int for iteration-based timeout; tighten description

Iterations should be integral and the description can be clearer.

Apply:

-    batch_wait_timeout_iters: float = Field(
+    batch_wait_timeout_iters: int = Field(
         default=0,
         description=
-        "Maximum number of iterations the scheduler will wait to accumulate new coming requests for improved GPU utilization efficiency. If greater than 0, the scheduler will delay batch processing to gather more requests up to the specified iteration limit. If 0, disables timeout-iters-based batching delays.",
+        "Maximum number of scheduler iterations to wait to accumulate incoming requests for better GPU utilization. If > 0, the scheduler may delay batch processing up to this budget; if 0, disables iteration-based batching delays.",
         status="prototype")
 
     batch_wait_max_tokens_ratio: float = Field(
         default=0,
         description=
-        "Token accumulation threshold ratio for batch scheduling optimization. If greater than 0, the scheduler will accumulate requests locally until the total token count reaches batch_wait_max_tokens_ratio * max_num_tokens. This mechanism enhances GPU utilization efficiency by ensuring adequate batch sizes.If 0 disables token-based batching delays.",
+        "Token accumulation threshold ratio for batch scheduling. If > 0, accumulate requests until total tokens reach batch_wait_max_tokens_ratio * max_num_tokens. This improves GPU utilization by ensuring adequate batch sizes. If 0, disables token-based batching delays.",
         status="prototype")

2519-2521: Fix validation message (0 is allowed)

Error text should say “greater than or equal to 0.”

-        if self.batch_wait_timeout_ms < 0:
-            raise ValueError("batch_wait_timeout_ms must be greater than 0")
+        if self.batch_wait_timeout_ms < 0:
+            raise ValueError("batch_wait_timeout_ms must be greater than or equal to 0")

2523-2528: Fix validation message (0 is allowed) for iters

Keep logic, update message; aligns with docs.

-        if self.batch_wait_timeout_iters < 0:
-            raise ValueError("batch_wait_timeout_iters must be greater than 0")
+        if self.batch_wait_timeout_iters < 0:
+            raise ValueError("batch_wait_timeout_iters must be greater than or equal to 0")

2243-2254: Clarify independent batching thresholds and their execution order

The two parameters govern separate stages of the PyTorch executor’s batching logic and do not override one another. Specifically:

  • batch_wait_timeout_ms applies in ExecutorRequestQueue._get_from_request_queue (tensorrt_llm/_torch/pyexecutor/executor_request_queue.py). When > 0, the request queue will wait up to that many milliseconds (or until max_batch_size is reached) before flushing items to the executor.
  • batch_wait_timeout_iters (and batch_wait_max_tokens_ratio) applies later in PyTorchExecutor._waiting_requests (tensorrt_llm/_torch/pyexecutor/py_executor.py). When enable_batch_waiting is true, the scheduler will hold generation requests for up to the configured number of iterations (or until token-count threshold) before releasing them.

Because these delays occur at different points in the pipeline, both will be applied sequentially if set > 0 (queue-level ms wait first, then scheduler-level iters wait). Please update the field descriptions in llm_args.py to:

  • Call out that they control distinct phases
  • Describe the order in which they are evaluated
  • Note that setting both introduces two successive batching delays rather than one “wins” over the other

This will ensure users understand how the executor actually behaves.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro


📥 Commits

Reviewing files that changed from the base of the PR and between f26d6e9 and f58f5a6.

📒 Files selected for processing (5)
  • tensorrt_llm/_torch/pyexecutor/config.py (1 hunks)
  • tensorrt_llm/_torch/pyexecutor/py_executor.py (5 hunks)
  • tensorrt_llm/llmapi/llm_args.py (3 hunks)
  • tests/integration/defs/accuracy/test_llm_api_pytorch.py (1 hunks)
  • tests/unittest/api_stability/references/llm.yaml (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • tensorrt_llm/_torch/pyexecutor/config.py
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Code must target Python 3.8+
Indent Python code with 4 spaces; do not use tabs
Preserve module namespaces when importing; import modules/packages and access members via the module (e.g., from package.subpackage import foo; foo.SomeClass())
Python file names should be snake_case
Python class names should be PascalCase
Python functions/methods and local variables should be snake_case; variables beginning with a number should be prefixed with k_ (e.g., k_99th_percentile)
Global variables should be UPPER_SNAKE_CASE prefixed with G_ (e.g., G_MY_GLOBAL); constants should be UPPER_SNAKE_CASE
Avoid shadowing variables from outer scopes; initialize all externally visible members in init
Prefer docstrings for interfaces used outside a file; comments should be reserved for in-function or file-local interfaces
Use Google-style docstrings for classes and functions; attributes and variables may be documented inline with trailing string literals
Avoid reflection when simpler, explicit code suffices (e.g., avoid dict(**locals()) patterns)
In try/except, catch the narrowest exceptions possible
For duck-typing patterns, keep the try body minimal and move logic to else to avoid masking unrelated failures

Files:

  • tensorrt_llm/_torch/pyexecutor/py_executor.py
  • tensorrt_llm/llmapi/llm_args.py
  • tests/integration/defs/accuracy/test_llm_api_pytorch.py
**/*.{c,cc,cpp,cxx,h,hh,hpp,hxx,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Prepend the NVIDIA copyright header (current year) to all source files (.cpp, .h, .cu, .py, etc.)

Files:

  • tensorrt_llm/_torch/pyexecutor/py_executor.py
  • tensorrt_llm/llmapi/llm_args.py
  • tests/integration/defs/accuracy/test_llm_api_pytorch.py
🧠 Learnings (1)
📚 Learning: 2025-08-14T21:04:50.248Z
Learnt from: thorjohnsen
PR: NVIDIA/TensorRT-LLM#6910
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-14T21:04:50.248Z
Learning: In KV cache onboarding logic during prefill in cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, when calculating which blocks fall within the attention window, use getTokensPerBlock() to advance token indices rather than block->getUniqueTokens().size(), because the calculation needs to consider the post-prefill state where blocks will be filled to capacity, not their current token count.

Applied to files:

  • tensorrt_llm/_torch/pyexecutor/py_executor.py
  • tensorrt_llm/llmapi/llm_args.py
🧬 Code graph analysis (2)
tensorrt_llm/_torch/pyexecutor/py_executor.py (1)
tensorrt_llm/_utils.py (1)
  • nvtx_range (843-862)
tests/integration/defs/accuracy/test_llm_api_pytorch.py (5)
tests/integration/defs/conftest.py (2)
  • parametrize_with_ids (1786-1811)
  • llm_models_root (77-83)
tensorrt_llm/llmapi/llm_args.py (7)
  • KvCacheConfig (946-1077)
  • TorchCompileConfig (2108-2155)
  • CudaGraphConfig (108-165)
  • MoeConfig (168-196)
  • MTPDecodingConfig (521-556)
  • quant_config (2328-2331)
  • quant_config (2334-2335)
tensorrt_llm/llmapi/llm.py (1)
  • LLM (1011-1027)
tensorrt_llm/quantization/mode.py (1)
  • QuantAlgo (23-46)
tests/integration/defs/accuracy/accuracy_core.py (3)
  • GSM8K (293-308)
  • evaluate (147-206)
  • evaluate (707-717)
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/pyexecutor/py_executor.py

1280-1280: Line too long (165 > 120)

(E501)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (3)
tests/unittest/api_stability/references/llm.yaml (1)

134-141: LGTM: new batch-waiting knobs exposed in API reference.

Defaults and statuses look consistent with prototype rollout.

tests/integration/defs/accuracy/test_llm_api_pytorch.py (1)

1512-1519: Use the parametrized fp8kv instead of hardcoding.

The conditional below already respects fp8kv. Removing the hardcoded assignment (per the previous diff) is sufficient; no further change needed.

tensorrt_llm/_torch/pyexecutor/py_executor.py (1)

183-195: Ensure max_num_tokens is always positive when ratio-based waiting is enabled.

If max_num_tokens were 0/None, the ratio threshold becomes 0, releasing immediately. If that’s unintended, clamp or disable ratio-based waiting when max_num_tokens <= 0.

Would you like a follow-up patch to guard enable_batch_waiting with self.max_num_tokens > 0?

@yunruis yunruis force-pushed the syr/attn_tp_config branch from f58f5a6 to f6778f2 Compare August 27, 2025 13:52
@yunruis
Contributor Author

yunruis commented Aug 27, 2025

/bot run

@tensorrt-cicd
Collaborator

PR_Github #16692 [ run ] completed with state FAILURE
/LLM/main/L0_MergeRequest_PR pipeline #12530 completed with status: 'FAILURE'

@tensorrt-cicd
Collaborator

PR_Github #16696 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #16696 [ run ] completed with state FAILURE
/LLM/main/L0_MergeRequest_PR pipeline #12531 completed with status: 'FAILURE'

@yunruis yunruis force-pushed the syr/attn_tp_config branch from f6778f2 to a4ea57f Compare August 28, 2025 02:09
@yunruis
Contributor Author

yunruis commented Aug 28, 2025

/bot run

@yunruis yunruis force-pushed the syr/attn_tp_config branch from a4ea57f to 42002a7 Compare August 28, 2025 02:13
@tensorrt-cicd
Collaborator

PR_Github #16769 [ run ] triggered by Bot

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tensorrt_llm/_torch/pyexecutor/py_executor.py (1)

183-195: Harden config defaults to avoid TypeError and make enable_batch_waiting robust

If any of these fields are None in pytorch_backend_config, comparisons like > 0 raise. Cast with safe fallbacks.

-        self.max_num_tokens = model_engine.pytorch_backend_config.max_num_tokens
+        self.max_num_tokens = int(getattr(
+            model_engine.pytorch_backend_config, "max_num_tokens", 0) or 0)
@@
-        self.batch_wait_timeout_iters = model_engine.pytorch_backend_config.batch_wait_timeout_iters
-        self.batch_wait_max_tokens_ratio = model_engine.pytorch_backend_config.batch_wait_max_tokens_ratio
-        self.enable_batch_waiting = self.batch_wait_timeout_iters > 0 or self.batch_wait_max_tokens_ratio > 0
+        self.batch_wait_timeout_iters = int(getattr(
+            model_engine.pytorch_backend_config, "batch_wait_timeout_iters", 0) or 0)
+        self.batch_wait_max_tokens_ratio = float(getattr(
+            model_engine.pytorch_backend_config, "batch_wait_max_tokens_ratio", 0.0) or 0.0)
+        self.enable_batch_waiting = (
+            self.batch_wait_timeout_iters > 0 or self.batch_wait_max_tokens_ratio > 0.0
+        )
♻️ Duplicate comments (3)
tensorrt_llm/_torch/pyexecutor/py_executor.py (1)

1266-1286: Fix token accounting; count per-iteration costs (ctx chunk size, ~1 token/gen); also guard when max_num_tokens <= 0

Using len(req.get_tokens(0)) counts history and breaks the threshold; it can cause premature stop or no-op waiting. Also break long lines (E501).

-    def _waiting_requests(self, context_requests: list[LlmRequest],
-                          generation_requests: list[LlmRequest]):
+    def _waiting_requests(self, context_requests: list[LlmRequest],
+                          generation_requests: list[LlmRequest]):
         if not self.enable_batch_waiting:
             return context_requests
 
         waited_context_requests = []
-        stop_waiting = False
-        num_scheduled_ctx_tokens = sum(
-            len(ctx_req.get_tokens(0)) for ctx_req in context_requests)
-        num_scheduled_gen_tokens = sum(
-            len(gen_req.get_tokens(0)) for gen_req in generation_requests)
-        num_scheduled_tokens = num_scheduled_ctx_tokens + num_scheduled_gen_tokens
-
-        stop_waiting = self.batch_wait_iters_count >= self.batch_wait_timeout_iters or num_scheduled_tokens >= self.batch_wait_max_tokens_ratio * self.max_num_tokens
+        # Per-iteration token costs:
+        num_scheduled_ctx_tokens = sum(
+            req.context_chunk_size for req in context_requests
+        )
+        # Approximate 1 token per generation request this iteration
+        num_scheduled_gen_tokens = len(generation_requests)
+        num_scheduled_tokens = (
+            num_scheduled_ctx_tokens + num_scheduled_gen_tokens
+        )
+        # Ratio threshold only meaningful if both ratio > 0 and max_num_tokens > 0
+        ratio_threshold = (
+            int(self.batch_wait_max_tokens_ratio * self.max_num_tokens)
+            if (self.batch_wait_max_tokens_ratio > 0.0 and
+                self.max_num_tokens > 0)
+            else float("inf")
+        )
+        stop_waiting = (
+            self.batch_wait_iters_count >= self.batch_wait_timeout_iters
+            or num_scheduled_tokens >= ratio_threshold
+        )
         if stop_waiting:
             waited_context_requests = context_requests
             self.batch_wait_iters_count = 0
         else:
             self.batch_wait_iters_count += 1
         return waited_context_requests
tensorrt_llm/llmapi/llm_args.py (2)

2529-2536: Require max_num_tokens when ratio-based waiting is enabled

Ratio-based waiting needs a concrete threshold; enforce effective max_num_tokens (> 0). This prevents runtime issues in the executor.

Apply:

     @model_validator(mode='after')
     def validate_batch_wait_max_tokens_ratio(self) -> 'TorchLlmArgs':
         if self.batch_wait_max_tokens_ratio < 0 or self.batch_wait_max_tokens_ratio > 1:
             raise ValueError(
                 "batch_wait_max_tokens_ratio must be greater than or equal to 0 and less than or equal to 1"
             )
+        if self.batch_wait_max_tokens_ratio > 0:
+            effective_max_tokens = self.max_num_tokens or (
+                self.build_config.max_num_tokens if self.build_config else None
+            )
+            if effective_max_tokens is None or effective_max_tokens <= 0:
+                raise ValueError(
+                    "batch_wait_max_tokens_ratio requires max_num_tokens to be set (> 0). "
+                    "Set TorchLlmArgs.max_num_tokens or build_config.max_num_tokens."
+                )
         return self

2612-2615: Propagate a non-None max_num_tokens to PyTorchConfig

Backend needs a concrete threshold when ratio-based waiting is used; pass max_num_tokens with a build_config fallback.

Apply:

             attention_dp_batching_wait_iters=self.attention_dp_config.
             batching_wait_iters if self.attention_dp_config is not None else
             AttentionDpConfig.model_fields['batching_wait_iters'].default,
+            max_num_tokens=self.max_num_tokens
+            if self.max_num_tokens is not None
+            else self.build_config.max_num_tokens,
             batch_wait_timeout_ms=self.batch_wait_timeout_ms,
             batch_wait_timeout_iters=self.batch_wait_timeout_iters,
             batch_wait_max_tokens_ratio=self.batch_wait_max_tokens_ratio,

Run to verify PyTorchConfig signature and types:

#!/bin/bash
# Verify PyTorchConfig fields exist and types are as expected.
rg -nP 'class\s+PyTorchConfig\b' -A120 tensorrt_llm/_torch/pyexecutor/config.py
rg -nP '\bmax_num_tokens\b|\bbatch_wait_timeout_iters\b|\bbatch_wait_max_tokens_ratio\b' tensorrt_llm/_torch/pyexecutor/config.py
🧹 Nitpick comments (4)
tests/integration/defs/accuracy/test_llm_api_pytorch.py (1)

1486-1527: Ensure ratio-based waiting is actually exercised: set max_num_tokens when batch_wait_max_tokens_ratio > 0

If max_num_tokens defaults to 0/None, the condition tokens >= ratio * max_num_tokens collapses to tokens >= 0 and stops waiting immediately; the test won’t meaningfully cover ratio behavior. Guard by setting a sane max_num_tokens only when ratio > 0, and (optionally) assert the backend picked up the values.

@@
-        with LLM(f"{llm_models_root()}/DeepSeek-V3-Lite/nvfp4_moe_only_mtp",
+        llm_kwargs = {}
+        if batch_wait_max_tokens_ratio > 0:
+            # Keep moderate to avoid OOM while making the ratio threshold meaningful.
+            llm_kwargs["max_num_tokens"] = 16384
+        with LLM(f"{llm_models_root()}/DeepSeek-V3-Lite/nvfp4_moe_only_mtp",
                  kv_cache_config=kv_cache_config,
                  **pytorch_config,
                  enable_attention_dp=False,
-                 speculative_config=mtp_config) as llm:
+                 speculative_config=mtp_config,
+                 **llm_kwargs) as llm:
             assert llm.args.quant_config.quant_algo == QuantAlgo.NVFP4
+            # Optional: verify wiring
+            assert llm.args.pytorch_backend_config.batch_wait_timeout_iters == batch_wait_timeout_iters
+            assert llm.args.pytorch_backend_config.batch_wait_max_tokens_ratio == batch_wait_max_tokens_ratio
tensorrt_llm/_torch/pyexecutor/py_executor.py (1)

1297-1304: Gate waiting only when both ctx and gen exist: OK, but reference local scheduled_context_requests for clarity

Logic is correct; minor readability improvement below.

-        if not self.enable_attention_dp and self.enable_batch_waiting and len(
-                scheduler_output.context_requests) > 0 and len(
-                    scheduler_output.generation_requests) > 0:
-            scheduled_context_requests = self._waiting_requests(
-                scheduler_output.context_requests,
-                scheduler_output.generation_requests)
+        if (not self.enable_attention_dp and self.enable_batch_waiting
+                and scheduler_output.context_requests
+                and scheduler_output.generation_requests):
+            scheduled_context_requests = self._waiting_requests(
+                scheduler_output.context_requests,
+                scheduler_output.generation_requests
+            )
tensorrt_llm/llmapi/llm_args.py (2)

2249-2254: Nit: fix spacing/punctuation in description

Minor grammar/spacing for clarity.

Apply:

-        "Token accumulation threshold ratio for batch scheduling optimization. If greater than 0, the scheduler will accumulate requests locally until the total token count reaches batch_wait_max_tokens_ratio * max_num_tokens. This mechanism enhances GPU utilization efficiency by ensuring adequate batch sizes.If 0 disables token-based batching delays.",
+        "Token accumulation threshold ratio for batch scheduling optimization. If > 0, the scheduler will accumulate requests locally until the total token count reaches batch_wait_max_tokens_ratio * max_num_tokens. This mechanism enhances GPU utilization by ensuring adequate batch sizes. If 0, disables token-based batching delays.",

2523-2528: Validator message should say '>= 0'; optionally warn if both ms- and iter-based waits are enabled

  • Error text should match the logic (0 is allowed).
  • Optional: warn if both knobs are > 0 to avoid compounded delays.

Apply:

     @model_validator(mode='after')
     def validate_batch_wait_timeout_iters(self) -> 'TorchLlmArgs':
-        if self.batch_wait_timeout_iters < 0:
-            raise ValueError("batch_wait_timeout_iters must be greater than 0")
+        if self.batch_wait_timeout_iters < 0:
+            raise ValueError("batch_wait_timeout_iters must be greater than or equal to 0")
+        # Guardrail: avoid unintentionally enabling two waiting mechanisms.
+        if self.batch_wait_timeout_ms > 0 and self.batch_wait_timeout_iters > 0:
+            logger.warning(
+                "Both batch_wait_timeout_ms and batch_wait_timeout_iters are > 0; this may introduce compounded delays. "
+                "Consider enabling only one waiting mechanism.")
         return self
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro


📥 Commits

Reviewing files that changed from the base of the PR and between f6778f2 and 42002a7.

📒 Files selected for processing (5)
  • tensorrt_llm/_torch/pyexecutor/config.py (1 hunks)
  • tensorrt_llm/_torch/pyexecutor/py_executor.py (5 hunks)
  • tensorrt_llm/llmapi/llm_args.py (3 hunks)
  • tests/integration/defs/accuracy/test_llm_api_pytorch.py (1 hunks)
  • tests/unittest/api_stability/references/llm.yaml (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • tests/unittest/api_stability/references/llm.yaml
  • tensorrt_llm/_torch/pyexecutor/config.py
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Code must target Python 3.8+
Indent Python code with 4 spaces; do not use tabs
Preserve module namespaces when importing; import modules/packages and access members via the module (e.g., from package.subpackage import foo; foo.SomeClass())
Python file names should be snake_case
Python class names should be PascalCase
Python functions/methods and local variables should be snake_case; variables beginning with a number should be prefixed with k_ (e.g., k_99th_percentile)
Global variables should be UPPER_SNAKE_CASE prefixed with G_ (e.g., G_MY_GLOBAL); constants should be UPPER_SNAKE_CASE
Avoid shadowing variables from outer scopes; initialize all externally visible members in init
Prefer docstrings for interfaces used outside a file; comments should be reserved for in-function or file-local interfaces
Use Google-style docstrings for classes and functions; attributes and variables may be documented inline with trailing string literals
Avoid reflection when simpler, explicit code suffices (e.g., avoid dict(**locals()) patterns)
In try/except, catch the narrowest exceptions possible
For duck-typing patterns, keep the try body minimal and move logic to else to avoid masking unrelated failures

Files:

  • tensorrt_llm/llmapi/llm_args.py
  • tests/integration/defs/accuracy/test_llm_api_pytorch.py
  • tensorrt_llm/_torch/pyexecutor/py_executor.py
**/*.{c,cc,cpp,cxx,h,hh,hpp,hxx,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Prepend the NVIDIA copyright header (current year) to all source files (.cpp, .h, .cu, .py, etc.)

Files:

  • tensorrt_llm/llmapi/llm_args.py
  • tests/integration/defs/accuracy/test_llm_api_pytorch.py
  • tensorrt_llm/_torch/pyexecutor/py_executor.py
🧠 Learnings (2)
📚 Learning: 2025-08-14T21:04:50.248Z
Learnt from: thorjohnsen
PR: NVIDIA/TensorRT-LLM#6910
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-14T21:04:50.248Z
Learning: In KV cache onboarding logic during prefill in cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, when calculating which blocks fall within the attention window, use getTokensPerBlock() to advance token indices rather than block->getUniqueTokens().size(), because the calculation needs to consider the post-prefill state where blocks will be filled to capacity, not their current token count.

Applied to files:

  • tensorrt_llm/llmapi/llm_args.py
  • tensorrt_llm/_torch/pyexecutor/py_executor.py
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.

Applied to files:

  • tests/integration/defs/accuracy/test_llm_api_pytorch.py
🧬 Code graph analysis (2)
tests/integration/defs/accuracy/test_llm_api_pytorch.py (4)
tests/integration/defs/conftest.py (2)
  • parametrize_with_ids (1786-1811)
  • llm_models_root (77-83)
tensorrt_llm/llmapi/llm_args.py (7)
  • KvCacheConfig (946-1077)
  • TorchCompileConfig (2108-2155)
  • CudaGraphConfig (108-165)
  • MoeConfig (168-196)
  • MTPDecodingConfig (521-556)
  • quant_config (2328-2331)
  • quant_config (2334-2335)
tensorrt_llm/quantization/mode.py (1)
  • QuantAlgo (23-46)
tests/integration/defs/accuracy/accuracy_core.py (3)
  • GSM8K (293-308)
  • evaluate (147-206)
  • evaluate (707-717)
tensorrt_llm/_torch/pyexecutor/py_executor.py (1)
tensorrt_llm/_torch/pyexecutor/llm_request.py (1)
  • LlmRequest (282-424)
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/pyexecutor/py_executor.py

1279-1279: Line too long (165 > 120)

(E501)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (1)
tensorrt_llm/_torch/pyexecutor/py_executor.py (1)

240-240: Good: initialize batch_wait_iters_count state

The counter is required to enforce iteration-based timeouts.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
tensorrt_llm/_torch/pyexecutor/py_executor.py (2)

183-195: Batch-wait config: validate/clamp values and compute enable flag after sanitization.

Avoid immediate-release on zero/negative thresholds and out-of-range ratios. Also guard None and zero max_num_tokens.

Apply:

@@
-        self.max_num_tokens = model_engine.pytorch_backend_config.max_num_tokens
+        self.max_num_tokens = model_engine.pytorch_backend_config.max_num_tokens
@@
-        self.batch_wait_timeout_iters = model_engine.pytorch_backend_config.batch_wait_timeout_iters
-        self.batch_wait_max_tokens_ratio = model_engine.pytorch_backend_config.batch_wait_max_tokens_ratio
-        self.enable_batch_waiting = self.batch_wait_timeout_iters > 0 or self.batch_wait_max_tokens_ratio > 0
+        self.batch_wait_timeout_iters = (
+            model_engine.pytorch_backend_config.batch_wait_timeout_iters
+        )
+        self.batch_wait_max_tokens_ratio = (
+            model_engine.pytorch_backend_config.batch_wait_max_tokens_ratio
+        )
+        # Sanitize configuration
+        if self.batch_wait_timeout_iters is None or self.batch_wait_timeout_iters < 0:
+            self.batch_wait_timeout_iters = 0
+        if self.batch_wait_max_tokens_ratio is None:
+            self.batch_wait_max_tokens_ratio = 0.0
+        if not 0.0 <= float(self.batch_wait_max_tokens_ratio) <= 1.0:
+            self.batch_wait_max_tokens_ratio = min(
+                1.0, max(0.0, float(self.batch_wait_max_tokens_ratio))
+            )
+        if self.max_num_tokens is None or self.max_num_tokens < 0:
+            self.max_num_tokens = 0
+        self.enable_batch_waiting = (
+            self.batch_wait_timeout_iters > 0
+            or self.batch_wait_max_tokens_ratio > 0
+        )

Also applies to: 240-240


1-2: Missing NVIDIA copyright header (2025).

Add the standard project header per guidelines.

Apply:

+# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
+
 import dataclasses
 import datetime
♻️ Duplicate comments (1)
tensorrt_llm/_torch/pyexecutor/py_executor.py (1)

1275-1294: Fix token accounting and zero-threshold logic; remove fragile attribute; wrap long lines.

Current logic overcounts history (len(get_tokens(0))), may access non-existent num_draft_tokens, and always triggers when timeout_iters==0.

Apply:

-    def _waiting_requests(self, context_requests: list[LlmRequest],
-                          generation_requests: list[LlmRequest]):
+    def _waiting_requests(
+        self,
+        context_requests: list[LlmRequest],
+        generation_requests: list[LlmRequest],
+    ):
         if not self.enable_batch_waiting:
             return context_requests
 
         waited_context_requests = []
-        stop_waiting = False
-        num_scheduled_ctx_tokens = sum(
-            len(ctx_req.get_tokens(0)) for ctx_req in context_requests)
-        num_scheduled_gen_tokens = sum(1 + gen_req.num_draft_tokens
-                                       for gen_req in generation_requests)
-        num_scheduled_tokens = num_scheduled_ctx_tokens + num_scheduled_gen_tokens
-
-        stop_waiting = self.batch_wait_iters_count >= self.batch_wait_timeout_iters or num_scheduled_tokens >= self.batch_wait_max_tokens_ratio * self.max_num_tokens
+        # Per-iteration costs only:
+        num_scheduled_ctx_tokens = sum(
+            req.context_chunk_size for req in context_requests
+        )
+        # Approximate 1 token per generation request this iteration.
+        num_scheduled_gen_tokens = len(generation_requests)
+        num_scheduled_tokens = (
+            num_scheduled_ctx_tokens + num_scheduled_gen_tokens
+        )
+        ratio_threshold = (
+            int(self.batch_wait_max_tokens_ratio * self.max_num_tokens)
+            if self.batch_wait_max_tokens_ratio > 0 and self.max_num_tokens > 0
+            else 0
+        )
+        stop_waiting = (
+            (self.batch_wait_timeout_iters > 0
+             and self.batch_wait_iters_count >= self.batch_wait_timeout_iters)
+            or (self.batch_wait_max_tokens_ratio > 0
+                and num_scheduled_tokens >= ratio_threshold)
+        )
         if stop_waiting:
             waited_context_requests = context_requests
             self.batch_wait_iters_count = 0
         else:
             self.batch_wait_iters_count += 1
         return waited_context_requests
🧹 Nitpick comments (2)
tensorrt_llm/_torch/pyexecutor/py_executor.py (2)

1275-1276: Confirm Python 3.8 compatibility for built-in generics in annotations.

list[LlmRequest] requires 3.9+ unless from __future__ import annotations is used. If 3.8 is a hard floor, switch to List[LlmRequest] or add the future import at file top.


1306-1313: Line-wrap and keep condition readable; no functional change.

This meets E501 and clarifies the dead-wait guard.

Apply:

-        # if no generation requests, no need to wait, to avoid dead waiting
-        if not self.enable_attention_dp and self.enable_batch_waiting and len(
-                scheduler_output.context_requests) > 0 and len(
-                    scheduler_output.generation_requests) > 0:
-            scheduled_context_requests = self._waiting_requests(
-                scheduler_output.context_requests,
-                scheduler_output.generation_requests)
+        # If no generation requests, don't wait to avoid dead waiting.
+        if (
+            not self.enable_attention_dp
+            and self.enable_batch_waiting
+            and len(scheduler_output.context_requests) > 0
+            and len(scheduler_output.generation_requests) > 0
+        ):
+            scheduled_context_requests = self._waiting_requests(
+                scheduler_output.context_requests,
+                scheduler_output.generation_requests,
+            )
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro


📥 Commits

Reviewing files that changed from the base of the PR and between 3790a24 and b0f677e.

📒 Files selected for processing (7)
  • tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py (1 hunks)
  • tensorrt_llm/_torch/pyexecutor/config.py (1 hunks)
  • tensorrt_llm/_torch/pyexecutor/py_executor.py (5 hunks)
  • tensorrt_llm/llmapi/llm_args.py (3 hunks)
  • tests/integration/defs/accuracy/test_llm_api_pytorch.py (1 hunks)
  • tests/integration/test_lists/qa/llm_function_full.txt (1 hunks)
  • tests/unittest/api_stability/references/llm.yaml (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (6)
  • tests/unittest/api_stability/references/llm.yaml
  • tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py
  • tensorrt_llm/_torch/pyexecutor/config.py
  • tests/integration/test_lists/qa/llm_function_full.txt
  • tests/integration/defs/accuracy/test_llm_api_pytorch.py
  • tensorrt_llm/llmapi/llm_args.py
🧰 Additional context used
📓 Path-based instructions (2)
**/*.{cpp,cc,cxx,cu,h,hpp,hh,hxx,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.{cpp,cc,cxx,cu,h,hpp,hh,hxx,cuh,py}: Use spaces only; no tabs; indent with 4 spaces
Prepend NVIDIA copyright header (current year) to all source files (.cpp, .h, .cu, .py, etc.)

Files:

  • tensorrt_llm/_torch/pyexecutor/py_executor.py
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Python code must target Python 3.8+
Indent Python with 4 spaces; no tabs
Preserve module namespaces when importing: from package.subpackage import foo; then call foo.SomeClass() instead of importing the class directly
Python naming: files snake_case; classes PascalCase; functions/methods snake_case; locals snake_case (prefix k_ when starting with a number); globals UPPER_SNAKE_CASE with G_ prefix; constants UPPER_SNAKE_CASE
Avoid shadowing outer-scope variables; initialize all externally visible members in init
Prefer docstrings for interfaces used outside a file; limit comments to function-internal or file-local interfaces
Use Google-style docstrings for classes and functions; document attributes/variables inline so Sphinx can render them
Avoid reflection when simpler alternatives exist; prefer explicit parameters and return dicts over locals()/dynamic tricks
In try/except, catch the narrowest exceptions possible; keep try bodies minimal and use else for the main logic when doing duck-typing checks

Files:

  • tensorrt_llm/_torch/pyexecutor/py_executor.py
🧠 Learnings (1)
📚 Learning: 2025-08-14T21:04:50.248Z
Learnt from: thorjohnsen
PR: NVIDIA/TensorRT-LLM#6910
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-14T21:04:50.248Z
Learning: In KV cache onboarding logic during prefill in cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, when calculating which blocks fall within the attention window, use getTokensPerBlock() to advance token indices rather than block->getUniqueTokens().size(), because the calculation needs to consider the post-prefill state where blocks will be filled to capacity, not their current token count.

Applied to files:

  • tensorrt_llm/_torch/pyexecutor/py_executor.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check

@chzblych chzblych removed their request for review September 2, 2025 05:28
@litaotju
Collaborator

litaotju commented Sep 2, 2025

/bot run --disable-fail-fast --add-multi-gpu-test

@tensorrt-cicd
Collaborator

PR_Github #17364 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #17364 [ run ] completed with state SUCCESS
/LLM/release-1.1.0rc2/L0_MergeRequest_PR pipeline #21 completed with status: 'FAILURE'

@litaotju
Collaborator

litaotju commented Sep 3, 2025

The failed stages in #21 all seem to be infra issues, re-running.

@litaotju
Collaborator

litaotju commented Sep 3, 2025

/bot run

@tensorrt-cicd
Collaborator

PR_Github #17442 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #17442 [ run ] completed with state ABORTED

@litaotju
Collaborator

litaotju commented Sep 3, 2025

/bot run

1 similar comment
@yunruis
Contributor Author

yunruis commented Sep 4, 2025

/bot run

@tensorrt-cicd
Collaborator

PR_Github #17604 [ run ] triggered by Bot

@litaotju litaotju added the Release Blocker label Sep 4, 2025
@tensorrt-cicd
Collaborator

PR_Github #17604 [ run ] completed with state SUCCESS
/LLM/release-1.1.0rc2/L0_MergeRequest_PR pipeline #56 completed with status: 'FAILURE'

@yunruis
Contributor Author

yunruis commented Sep 4, 2025

/bot run

@tensorrt-cicd
Collaborator

PR_Github #17657 [ run ] triggered by Bot

@litaotju
Collaborator

litaotju commented Sep 4, 2025

/bot run

@tensorrt-cicd
Collaborator

PR_Github #17690 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #17657 [ run ] completed with state ABORTED

@litaotju litaotju enabled auto-merge (squash) September 4, 2025 15:47
@tensorrt-cicd
Collaborator

PR_Github #17690 [ run ] completed with state SUCCESS
/LLM/release-1.1.0rc2/L0_MergeRequest_PR pipeline #67 completed with status: 'SUCCESS'

@litaotju litaotju disabled auto-merge September 5, 2025 01:35
@litaotju litaotju merged commit 26fc7da into NVIDIA:release/1.1.0rc2 Sep 5, 2025
5 checks passed
yunruis added commits to yunruis/TensorRT-LLM that referenced this pull request Sep 22, 2025

Labels

Release Blocker: PRs that block the final release build or branching out the release branch
