@Linda-Stadter commented Jul 18, 2025

feat: nanobind bindings

This PR includes:

Building with nanobind

The default build command still uses pybind!

Add --binding_type nanobind as a build flag to build_wheel.py to build with nanobind.
Example:

```bash
python scripts/build_wheel.py --binding_type nanobind --trt_root /usr/local/tensorrt --cuda_architectures "native" -D FAST_BUILD=ON --use_ccache
```
This creates a Python module named `bindings`. Its interface is the same as before, so no changes to the Python files are required.

Porting from pybind to nanobind

I have listed the main differences between pybind11 and nanobind for our code base below. There is also a porting guide in the nanobind documentation: https://nanobind.readthedocs.io/en/latest/porting.html

  • Simple renamings

| pybind11 | nanobind |
| --- | --- |
| `PYBIND11_MODULE(..)` | `NB_MODULE(..)` |
| `PYBIND11_OVERRIDE_*(..)` | `NB_OVERRIDE_*(..)` |
| `reinterpret_borrow<T>(x)` | `borrow<T>(x)` |
| `reinterpret_steal<T>(x)` | `steal<T>(x)` |
| `.def_readwrite(..)` | `.def_rw(..)` |
| `.def_readonly(..)` | `.def_ro(..)` |
| `.def_property(..)` | `.def_prop_rw(..)` |
| `.def_property_readonly(..)` | `.def_prop_ro(..)` |
| `register_exception<T>` | `exception<T>` |
| `classh` | `class_` |
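For illustration, a minimal before/after sketch of the renamings, using a hypothetical `Foo` class (the pybind11 spellings are kept as comments):

```cpp
#include <nanobind/nanobind.h>

namespace nb = nanobind;

struct Foo
{
    int value = 0;
    int size() const { return 1; }
};

// pybind11: PYBIND11_MODULE(bindings, m)
NB_MODULE(bindings, m)
{
    nb::class_<Foo>(m, "Foo")
        .def(nb::init<>())
        // pybind11: .def_readwrite("value", &Foo::value)
        .def_rw("value", &Foo::value)
        // pybind11: .def_property_readonly("size", &Foo::size)
        .def_prop_ro("size", &Foo::size);
}
```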
  • Type caster headers

| pybind11 | nanobind |
| --- | --- |
| Uses a single include `<pybind11/stl.h>` for all type casters | Uses specific headers per type caster: `<nanobind/stl/...>`. See documentation |
| Supports a type caster for `std::deque` | No support for `std::deque`. Created a custom caster instead and reached out to the maintainers about supporting `std::deque` |
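In practice this means replacing the umbrella include with the specific caster headers each translation unit actually needs; a representative selection (header names from the nanobind distribution):

```cpp
// pybind11: one umbrella header enables all STL casters
// #include <pybind11/stl.h>

// nanobind: include exactly the casters that are used
#include <nanobind/stl/string.h>
#include <nanobind/stl/vector.h>
#include <nanobind/stl/optional.h>
#include <nanobind/stl/shared_ptr.h>
```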
  • Custom constructors

| pybind11 | nanobind |
| --- | --- |
| Can be specified as a lambda returning an instance of the desired type: `.def(py::init([](int) { return MyType(...); }));` | Changed to in-place construction: `.def("__init__", [](MyType *t) { new (t) MyType(...); });` See documentation |
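A minimal sketch of the in-place pattern, assuming a hypothetical `MyType` with a single `int` constructor argument:

```cpp
#include <new>
#include <nanobind/nanobind.h>

namespace nb = nanobind;

struct MyType
{
    explicit MyType(int v) : value(v) {}
    int value;
};

NB_MODULE(example, m)
{
    nb::class_<MyType>(m, "MyType")
        // pybind11: .def(py::init([](int v) { return MyType(v * 2); }))
        // nanobind: __init__ receives a pointer to uninitialized storage
        // and must construct the object in-place via placement new.
        .def("__init__", [](MyType* t, int v) { new (t) MyType(v * 2); });
}
```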
  • Replacements

| pybind11 | nanobind |
| --- | --- |
| `object.cast<Type>()` | Standalone function: `nb::cast<Type>(object)` |
| `py::arg_v("param_name", default_value, "doc_string_description")` | No docstring description for arguments: `nb::arg("param_name") = default_value` |

Replaced the numpy array class `py::array` with `nb::ndarray<>`. See documentation.
This class does not support nonstandard arithmetic types, so I specialized `nanobind::detail::dtype_traits` to support the datatype `half`; see the sketch below and the documentation.
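The specialization follows the pattern from the nanobind ndarray documentation; a sketch assuming CUDA's `half` type from `<cuda_fp16.h>`:

```cpp
#include <cuda_fp16.h>        // defines the 16-bit float type `half`
#include <nanobind/ndarray.h>

namespace nanobind::detail
{

// Describe `half` to nb::ndarray; nanobind cannot derive the dtype of
// nonstandard arithmetic types on its own.
template <>
struct dtype_traits<half>
{
    static constexpr dlpack::dtype value{
        (uint8_t) dlpack::dtype_code::Float, // type code
        16,                                  // size in bits
        1                                    // lanes (SIMD width)
    };
    static constexpr auto name = const_name("float16");
};

} // namespace nanobind::detail
```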

  • Trampoline classes

| pybind11 | nanobind |
| --- | --- |
| `PYBIND11_OVERRIDE_*(..)` takes the base type and return type as arguments | `NB_OVERRIDE_*(..)` omits these arguments. The class also requires one `NB_TRAMPOLINE(parent, size)` declaration. See documentation |
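A minimal trampoline sketch with a hypothetical `Base` interface exposing one pure virtual method:

```cpp
#include <nanobind/nanobind.h>
#include <nanobind/trampoline.h>

namespace nb = nanobind;

struct Base
{
    virtual ~Base() = default;
    virtual int process(int x) = 0;
};

struct PyBase : Base
{
    // Reserve one slot: the number of overridable methods below.
    NB_TRAMPOLINE(Base, 1);

    int process(int x) override
    {
        // pybind11: PYBIND11_OVERRIDE_PURE(int, Base, process, x);
        NB_OVERRIDE_PURE(process, x);
    }
};

NB_MODULE(example, m)
{
    nb::class_<Base, PyBase>(m, "Base")
        .def(nb::init<>())
        .def("process", &Base::process);
}
```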
  • Pickling

| pybind11 | nanobind |
| --- | --- |
| Use `py::pickle()` to bind `__getstate__` and `__setstate__` functions | Expose the `__getstate__` and `__setstate__` methods directly. Changed the `__setstate__` construction to in-place. See documentation |

There is a special case in the set-state function executorConfigSetState: the class ExecutorConfig allows dynamic attributes (attributes added in Python code), so the set-state function has to restore the C++ data as well as the attributes from Python. Nanobind then has to be informed that the instance object is fully constructed via nb::inst_mark_ready().
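A minimal sketch of the nanobind pickling pattern with a hypothetical `Config` class; note that `__setstate__` receives an uninitialized object and must construct it in-place:

```cpp
#include <new>
#include <string>
#include <nanobind/nanobind.h>
#include <nanobind/stl/string.h>

namespace nb = nanobind;

struct Config
{
    int batch_size = 1;
    std::string name;
};

NB_MODULE(example, m)
{
    nb::class_<Config>(m, "Config")
        .def(nb::init<>())
        .def("__getstate__",
            [](Config const& c) { return nb::make_tuple(c.batch_size, c.name); })
        .def("__setstate__",
            [](Config& c, nb::tuple const& t)
            {
                // `c` refers to uninitialized memory here; construct in-place.
                new (&c) Config{nb::cast<int>(t[0]), nb::cast<std::string>(t[1])};
            });
}
```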

  • None arguments
    Nanobind refuses conversions from None to nullptr by default. To accept None in an argument binding, either call .none() on the nb::arg or specify nb::none() as the default argument. See documentation.
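A small sketch of both options, assuming a hypothetical bound `Logger` class:

```cpp
#include <nanobind/nanobind.h>

namespace nb = nanobind;

struct Logger
{
};

NB_MODULE(example, m)
{
    nb::class_<Logger>(m, "Logger").def(nb::init<>());

    // Option 1: opt in to the None -> nullptr conversion per argument.
    m.def("run", [](Logger* logger) { return logger != nullptr; },
        nb::arg("logger").none());

    // Option 2: make None the default value of the argument.
    m.def("run_default", [](Logger* logger) { return logger != nullptr; },
        nb::arg("logger") = nb::none());
}
```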

  • Custom type casters
    See documentation

| pybind11 | nanobind |
| --- | --- |
| `load()` | `from_python()` with a different signature |
| `cast()` | `from_cpp()` with a different signature |

Created custom casters for at::Tensor and std::deque.
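A skeleton of the new caster interface, shown for a hypothetical `Celsius` wrapper rather than our real at::Tensor/std::deque casters:

```cpp
#include <nanobind/nanobind.h>

struct Celsius
{
    double degrees;
};

namespace nanobind::detail
{

template <>
struct type_caster<Celsius>
{
    NB_TYPE_CASTER(Celsius, const_name("Celsius"))

    // pybind11: bool load(handle src, bool convert)
    bool from_python(handle src, uint8_t flags, cleanup_list* cleanup) noexcept
    {
        double degrees;
        if (!try_cast<double>(src, degrees))
            return false;
        value = Celsius{degrees};
        return true;
    }

    // pybind11: static handle cast(T src, return_value_policy policy, handle parent)
    static handle from_cpp(Celsius src, rv_policy policy, cleanup_list* cleanup) noexcept
    {
        return PyFloat_FromDouble(src.degrees);
    }
};

} // namespace nanobind::detail
```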
  • Exceptions

For the exceptions PeftTaskNotCachedException and LoraCacheFullException, I created custom exceptions with nb::exception<T>. See documentation
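The registration itself is a one-liner; a minimal sketch with a stand-in exception type (the real types come from the batch manager headers):

```cpp
#include <stdexcept>
#include <nanobind/nanobind.h>

namespace nb = nanobind;

struct LoraCacheFullException : std::runtime_error
{
    using std::runtime_error::runtime_error;
};

NB_MODULE(example, m)
{
    // pybind11: py::register_exception<LoraCacheFullException>(m, "LoraCacheFullException");
    nb::exception<LoraCacheFullException>(m, "LoraCacheFullException");
}
```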

Summary by CodeRabbit

  • New Features

    • Added extensive nanobind-based Python bindings covering executor, batch manager, runtime, cache managers, MOE runtime, user buffers, and testing utilities.
    • Introduced Python bindings for model configuration, sampling, decoding, request/response handling, and detailed runtime statistics.
    • Added support for Python subclassing of C++ interfaces and comprehensive serialization support for complex config and request classes.
    • Integrated nanobind build and test configurations into CI pipelines and test lists.
  • Bug Fixes

    • Improved enum parsing for KV cache type using a dedicated from_string method across Python APIs, CLI, and builder code.
    • Fixed tuple indexing in deserialization of extended runtime performance knob config.
    • Adjusted timeout argument handling in cache event manager bindings for consistent units.
  • Tests

    • Added conditional skips for nanobind-unsupported tests in unit test suites.
    • Updated test configurations and bindings tests to accommodate nanobind changes.
  • Chores

    • Refined Python binding APIs for better usability and consistency, including replacing enum constructors with static from_string methods.
    • Enhanced CLI argument parsing with enum-aware type converters.
    • Updated Jenkins build scripts to support nanobind build variants and test jobs.

@Linda-Stadter self-assigned this Jul 18, 2025

coderabbitai bot commented Jul 18, 2025

Walkthrough

This update introduces comprehensive nanobind-based Python bindings for the TensorRT-LLM C++ runtime, executor, batch manager, and related modules, alongside supporting build system and test integration changes. It adds new C++ source and header files for nanobind bindings, modifies build scripts to support nanobind, and adapts Python and test code to use new enum parsing and binding mechanisms.

Changes

| File(s) | Change Summary |
| --- | --- |
| cpp/CMakeLists.txt, jenkins/Build.groovy, jenkins/L0_Test.groovy | Build system updated: nanobind build configs, conditional pybind11 inclusion, new Jenkins build/test stages for nanobind. |
| `cpp/tensorrt_llm/nanobind/**/*.cpp`, `cpp/tensorrt_llm/nanobind/**/*.h` | Added nanobind C++ bindings for runtime, executor, batch manager, user buffers, and related modules. |
| cpp/tensorrt_llm/nanobind/common/bindTypes.h, cpp/tensorrt_llm/nanobind/common/customCasters.h | Utility templates and custom nanobind type casters for STL and TensorRT-LLM types. |
| cpp/tensorrt_llm/pybind/bindings.cpp, cpp/tensorrt_llm/pybind/executor/bindings.cpp, cpp/tensorrt_llm/pybind/executor/executorConfig.cpp | Pybind11 bindings updated to use the new enum parsing method for KVCacheType; minor bug fix in perf knob config deserialization. |
| tensorrt_llm/builder.py, tensorrt_llm/commands/build.py, tensorrt_llm/runtime/model_runner.py | Python code updated to use KVCacheType.from_string for enum parsing and argparse integration. |
| examples/models/core/llama/summarize_long.py, examples/models/core/qwen2audio/run.py, examples/models/core/qwenvl/run.py | Example models updated to use KVCacheType.from_string for enum instantiation. |
| tests/integration/test_lists/test-db/l0_a10.yml | Added nanobind-specific test list for A10 GPU. |
| tests/unittest/bindings/test_bindings_ut.py, tests/unittest/bindings/test_executor_bindings.py | Tests updated: skip tests not supported for nanobind, minor argument and enum usage fixes. |

Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Python
    participant NanobindModule
    participant CppRuntime
    participant CppExecutor
    participant CppBatchManager

    Python->>NanobindModule: import tensorrt_llm.bindings
    Python->>NanobindModule: Create Executor, BatchManager, etc.
    NanobindModule->>CppExecutor: Initialize executor/config via bindings
    NanobindModule->>CppBatchManager: Initialize batch manager via bindings
    Python->>NanobindModule: Enqueue LLM request
    NanobindModule->>CppExecutor: enqueueRequest(...)
    CppExecutor-->>NanobindModule: Return request ID
    Python->>NanobindModule: Await responses
    NanobindModule->>CppExecutor: awaitResponses(...)
    CppExecutor-->>NanobindModule: Response/result objects
    NanobindModule-->>Python: Return Python-wrapped results
```

Possibly related PRs

Suggested reviewers

  • dcampora
  • MartinMarciniszyn

Poem

In the warren where code bunnies dwell,
Nanobind bridges cast a spell—
C++ and Python, paw in paw,
Now dance together, without a flaw.
Build scripts hop, tests skip and play,
Enum parsing finds its way.
🐇✨ Boundless bindings, hip hooray!


@Linda-Stadter

/bot run

@tensorrt-cicd

PR_Github #12313 [ run ] triggered by Bot


@coderabbitai bot left a comment


Actionable comments posted: 17

🧹 Nitpick comments (10)
cpp/tensorrt_llm/nanobind/batch_manager/llmRequest.cpp (1)

54-68: Consider adding error handling for callback invocation.

The callback adapter implementation is correct, but consider wrapping the callback invocation in a try-catch block to handle potential Python exceptions gracefully.

 return [callback](RequestIdType reqId, tr::ITensor::SharedPtr& tensor, tb::LlmRequest::BeamTokens const& tokens,
            tr::BufferManager::CudaStreamPtr stream, std::optional<RequestIdType> clientId)
 {
-    at::Tensor atTensor = tr::Torch::tensor(tensor);
-    callback.value()(reqId, atTensor, tokens, runtime::TorchUtils::stream(*stream).unwrap(), clientId);
+    try {
+        at::Tensor atTensor = tr::Torch::tensor(tensor);
+        callback.value()(reqId, atTensor, tokens, runtime::TorchUtils::stream(*stream).unwrap(), clientId);
+    } catch (const std::exception& e) {
+        // Log error or rethrow with context
+        throw std::runtime_error("Error in logits post-processor callback: " + std::string(e.what()));
+    }
 };
cpp/tensorrt_llm/nanobind/common/bindTypes.h (1)

30-60: Consider returning values from pop operations.

The pop_back and pop_front operations don't return values, which differs from Python's list behavior. Consider enhancing these to match Python semantics.

-        .def("pop_back", [](T& lst) { lst.pop_back(); })
+        .def("pop_back", [](T& lst) {
+            if (lst.empty())
+                throw nb::index_error("pop from empty list");
+            auto value = lst.back();
+            lst.pop_back();
+            return value;
+        })
-        .def("pop_front", [](T& lst) { lst.pop_front(); })
+        .def("pop_front", [](T& lst) {
+            if (lst.empty())
+                throw nb::index_error("pop from empty list");
+            auto value = lst.front();
+            lst.pop_front();
+            return value;
+        })
cpp/tensorrt_llm/nanobind/executor/bindings.cpp (1)

245-256: Consider validating timeout_ms range before conversion

The conversion from double to int64_t could potentially overflow or lose precision for very large timeout values. Consider adding validation to ensure the timeout is within a reasonable range.

 .def(
     "get_latest_events",
     [](tle::KVCacheEventManager& self, std::optional<double> timeout_ms = std::nullopt)
     {
         if (timeout_ms)
         {
+            // Validate timeout range to prevent overflow
+            if (*timeout_ms < 0 || *timeout_ms > static_cast<double>(std::numeric_limits<int64_t>::max()))
+            {
+                throw std::invalid_argument("timeout_ms must be non-negative and within int64_t range");
+            }
             return self.getLatestEvents(std::chrono::milliseconds(static_cast<int64_t>(*timeout_ms)));
         }
         return self.getLatestEvents(std::nullopt);
     },
     nb::arg("timeout_ms") = std::nullopt);
cpp/tensorrt_llm/nanobind/batch_manager/llmRequest.h (1)

54-152: Consider refactoring the 49-parameter constructor

The constructor with 49 parameters is extremely complex and could be difficult to maintain. This could lead to:

  • Easy mistakes when calling the constructor
  • Difficulty in adding new parameters
  • Poor readability

Consider using a builder pattern or grouping related parameters into configuration structs. For example:

  • Group sampling-related parameters into a struct
  • Group multimodal-related parameters into a struct
  • Group LoRA-related parameters into a struct

This would make the API more maintainable and less error-prone.

cpp/tensorrt_llm/nanobind/common/customCasters.h (3)

68-80: Consider adding error handling for sequence operations.

The from_python method doesn't verify if the source handle is actually a sequence before creating the sequence object. Consider adding a check to ensure the input is iterable.

 bool from_python(handle src, uint8_t flags, cleanup_list* cleanup) noexcept
 {
+    if (!PySequence_Check(src.ptr()))
+        return false;
     sequence seq(src, nanobind::detail::borrow_t{});
     value.clear();

147-154: Improve error handling in from_cpp method.

The from_cpp method should check if the module import succeeds before using it.

 static handle from_cpp(T const& path, rv_policy, cleanup_list* cleanup)
 {
     if (auto py_str = unicode_from_fs_native(path.native()))
     {
-        return module_::import_("pathlib").attr("Path")(steal<object>(py_str), cleanup).release();
+        auto pathlib = module_::import_("pathlib");
+        if (!pathlib)
+            return nullptr;
+        return pathlib.attr("Path")(steal<object>(py_str)).release();
     }
     return nullptr;
 }

201-207: Consider adding validation for CUDA stream pointer.

The conversion from Python integer to CUDA stream pointer doesn't validate if the pointer is valid. While this matches common CUDA Python binding patterns, consider adding a debug assertion or validation.

 bool from_python([[maybe_unused]] handle src, uint8_t flags, cleanup_list* cleanup)
 {
     auto stream_ptr = nanobind::cast<uintptr_t>(src);
+    // Optionally validate the stream pointer in debug builds
+    #ifdef DEBUG
+    if (stream_ptr != 0) {
+        cudaStream_t stream = reinterpret_cast<cudaStream_t>(stream_ptr);
+        // Could add cudaStreamQuery to validate if needed
+    }
+    #endif
     value = std::make_shared<tensorrt_llm::runtime::CudaStream>(reinterpret_cast<cudaStream_t>(stream_ptr));
 
     return true;
 }
cpp/tensorrt_llm/nanobind/batch_manager/bindings.cpp (2)

253-331: Consider documenting the tensor dimension requirements.

The constructor correctly handles tensor conversions with special unsqueezing for specific tensors. Consider adding documentation about which tensors require unsqueezing and why.

Add a comment explaining the unsqueeze behavior:

                auto makeOptionalTensor = [](std::optional<at::Tensor> const& atTensor, bool unsqueeze = false)
                {
+                   // Some tensors (embedding_bias, bad_words_list, stop_words_list) require an extra batch dimension
                    std::optional<tb::LlmRequest::TensorPtr> tensorPtr = std::nullopt;

445-523: Verify empty request handling.

The function correctly handles context and generation requests with beam search support. However, consider adding validation for empty request vectors.

# Check if the function handles empty contextRequests or genRequests gracefully
# This could lead to issues if activeSlots remains empty

Add validation at the beginning:

+        if (contextRequests.empty() && genRequests.empty())
+        {
+            throw std::runtime_error("Both context and generation request vectors are empty");
+        }
+
         std::vector<int> activeSlots;
cpp/tensorrt_llm/nanobind/bindings.cpp (1)

80-80: Consider documenting why leak warnings are disabled.

Disabling leak warnings might hide legitimate memory leaks. Consider adding a comment explaining why this is necessary, or enable it in debug builds to catch potential issues during development.

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 44040ed and d013a3a.

📒 Files selected for processing (49)
  • cpp/CMakeLists.txt (2 hunks)
  • cpp/include/tensorrt_llm/batch_manager/runtimeBuffers.h (1 hunks)
  • cpp/tensorrt_llm/batch_manager/runtimeBuffers.cpp (1 hunks)
  • cpp/tensorrt_llm/nanobind/CMakeLists.txt (2 hunks)
  • cpp/tensorrt_llm/nanobind/batch_manager/algorithms.cpp (1 hunks)
  • cpp/tensorrt_llm/nanobind/batch_manager/algorithms.h (1 hunks)
  • cpp/tensorrt_llm/nanobind/batch_manager/bindings.cpp (1 hunks)
  • cpp/tensorrt_llm/nanobind/batch_manager/bindings.h (1 hunks)
  • cpp/tensorrt_llm/nanobind/batch_manager/buffers.cpp (1 hunks)
  • cpp/tensorrt_llm/nanobind/batch_manager/buffers.h (1 hunks)
  • cpp/tensorrt_llm/nanobind/batch_manager/cacheTransceiver.cpp (1 hunks)
  • cpp/tensorrt_llm/nanobind/batch_manager/cacheTransceiver.h (1 hunks)
  • cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.cpp (1 hunks)
  • cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.h (1 hunks)
  • cpp/tensorrt_llm/nanobind/batch_manager/llmRequest.cpp (1 hunks)
  • cpp/tensorrt_llm/nanobind/batch_manager/llmRequest.h (1 hunks)
  • cpp/tensorrt_llm/nanobind/bindings.cpp (2 hunks)
  • cpp/tensorrt_llm/nanobind/common/bindTypes.h (1 hunks)
  • cpp/tensorrt_llm/nanobind/common/customCasters.h (1 hunks)
  • cpp/tensorrt_llm/nanobind/executor/bindings.cpp (1 hunks)
  • cpp/tensorrt_llm/nanobind/executor/bindings.h (1 hunks)
  • cpp/tensorrt_llm/nanobind/executor/executor.cpp (1 hunks)
  • cpp/tensorrt_llm/nanobind/executor/executor.h (1 hunks)
  • cpp/tensorrt_llm/nanobind/executor/executorConfig.cpp (1 hunks)
  • cpp/tensorrt_llm/nanobind/executor/executorConfig.h (1 hunks)
  • cpp/tensorrt_llm/nanobind/executor/request.cpp (1 hunks)
  • cpp/tensorrt_llm/nanobind/executor/request.h (1 hunks)
  • cpp/tensorrt_llm/nanobind/runtime/bindings.cpp (1 hunks)
  • cpp/tensorrt_llm/nanobind/runtime/bindings.h (1 hunks)
  • cpp/tensorrt_llm/nanobind/runtime/moeBindings.cpp (1 hunks)
  • cpp/tensorrt_llm/nanobind/runtime/moeBindings.h (1 hunks)
  • cpp/tensorrt_llm/nanobind/testing/modelSpecBinding.cpp (1 hunks)
  • cpp/tensorrt_llm/nanobind/testing/modelSpecBinding.h (1 hunks)
  • cpp/tensorrt_llm/nanobind/userbuffers/bindings.cpp (1 hunks)
  • cpp/tensorrt_llm/nanobind/userbuffers/bindings.h (1 hunks)
  • cpp/tensorrt_llm/pybind/bindings.cpp (1 hunks)
  • cpp/tensorrt_llm/pybind/executor/bindings.cpp (1 hunks)
  • cpp/tensorrt_llm/pybind/executor/executorConfig.cpp (1 hunks)
  • examples/models/core/llama/summarize_long.py (1 hunks)
  • examples/models/core/qwen2audio/run.py (1 hunks)
  • examples/models/core/qwenvl/run.py (1 hunks)
  • jenkins/Build.groovy (4 hunks)
  • jenkins/L0_Test.groovy (3 hunks)
  • tensorrt_llm/builder.py (1 hunks)
  • tensorrt_llm/commands/build.py (2 hunks)
  • tensorrt_llm/runtime/model_runner.py (1 hunks)
  • tests/integration/test_lists/test-db/l0_a10.yml (1 hunks)
  • tests/unittest/bindings/test_bindings_ut.py (4 hunks)
  • tests/unittest/bindings/test_executor_bindings.py (10 hunks)
🧰 Additional context used
🧠 Learnings (5)
tests/integration/test_lists/test-db/l0_a10.yml (1)
Learnt from: amitz-nv
PR: NVIDIA/TensorRT-LLM#5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.374Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks `is_adapter_in_cpu_cache()` and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
cpp/tensorrt_llm/pybind/executor/executorConfig.cpp (1)
Learnt from: amitz-nv
PR: NVIDIA/TensorRT-LLM#5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.374Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks `is_adapter_in_cpu_cache()` and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
cpp/tensorrt_llm/nanobind/executor/executor.cpp (1)
Learnt from: amitz-nv
PR: NVIDIA/TensorRT-LLM#5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.374Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks `is_adapter_in_cpu_cache()` and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
cpp/tensorrt_llm/nanobind/executor/executorConfig.cpp (1)
Learnt from: amitz-nv
PR: NVIDIA/TensorRT-LLM#5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.374Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks `is_adapter_in_cpu_cache()` and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
cpp/tensorrt_llm/nanobind/batch_manager/cacheTransceiver.cpp (1)
Learnt from: amitz-nv
PR: NVIDIA/TensorRT-LLM#5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.374Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks `is_adapter_in_cpu_cache()` and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
🧬 Code Graph Analysis (13)
cpp/include/tensorrt_llm/batch_manager/runtimeBuffers.h (2)
cpp/include/tensorrt_llm/batch_manager/transformerBuffers.h (1)
  • TransformerBuffers (40-143)
cpp/tensorrt_llm/batch_manager/transformerBuffers.cpp (1)
  • TransformerBuffers (41-142)
cpp/tensorrt_llm/pybind/bindings.cpp (1)
cpp/include/tensorrt_llm/runtime/modelConfig.h (1)
  • KVCacheTypeFromString (76-94)
cpp/tensorrt_llm/nanobind/testing/modelSpecBinding.h (2)
cpp/tensorrt_llm/nanobind/batch_manager/llmRequest.h (1)
  • tensorrt_llm (30-158)
cpp/tensorrt_llm/nanobind/batch_manager/bindings.cpp (2)
  • initBindings (55-523)
  • initBindings (55-55)
cpp/tensorrt_llm/nanobind/executor/bindings.h (8)
cpp/tensorrt_llm/nanobind/batch_manager/algorithms.h (1)
  • tensorrt_llm (24-29)
cpp/tensorrt_llm/nanobind/batch_manager/bindings.h (1)
  • tensorrt_llm (23-28)
cpp/tensorrt_llm/nanobind/executor/executorConfig.h (1)
  • tensorrt_llm (24-30)
cpp/tensorrt_llm/nanobind/executor/executor.h (1)
  • tensorrt_llm (27-127)
cpp/tensorrt_llm/nanobind/executor/request.h (1)
  • tensorrt_llm (23-29)
cpp/tensorrt_llm/nanobind/runtime/bindings.h (1)
  • tensorrt_llm (24-30)
cpp/tensorrt_llm/nanobind/executor/bindings.cpp (2)
  • initBindings (48-261)
  • initBindings (48-48)
cpp/tensorrt_llm/nanobind/executor/executor.cpp (2)
  • initBindings (200-239)
  • initBindings (200-200)
cpp/tensorrt_llm/nanobind/executor/request.h (2)
cpp/tensorrt_llm/nanobind/executor/bindings.h (1)
  • tensorrt_llm (23-29)
cpp/tensorrt_llm/nanobind/executor/request.cpp (2)
  • initRequestBindings (53-933)
  • initRequestBindings (53-53)
examples/models/core/qwenvl/run.py (1)
tensorrt_llm/runtime/generation.py (1)
  • kv_cache_type (1169-1170)
examples/models/core/qwen2audio/run.py (1)
tensorrt_llm/runtime/generation.py (1)
  • kv_cache_type (1169-1170)
cpp/tensorrt_llm/nanobind/userbuffers/bindings.h (5)
cpp/tensorrt_llm/nanobind/batch_manager/buffers.h (1)
  • tensorrt_llm (22-29)
cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.h (2)
  • tensorrt_llm (23-30)
  • tensorrt_llm (32-39)
cpp/tensorrt_llm/nanobind/executor/executor.h (1)
  • tensorrt_llm (27-127)
cpp/tensorrt_llm/nanobind/batch_manager/bindings.cpp (2)
  • initBindings (55-523)
  • initBindings (55-55)
cpp/tensorrt_llm/nanobind/executor/bindings.cpp (2)
  • initBindings (48-261)
  • initBindings (48-48)
cpp/tensorrt_llm/nanobind/batch_manager/llmRequest.cpp (2)
cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.cpp (2)
  • from_torch (54-61)
  • from_torch (54-54)
cpp/include/tensorrt_llm/batch_manager/llmRequest.h (1)
  • mNumReturnSequences (1973-1973)
cpp/tensorrt_llm/nanobind/testing/modelSpecBinding.cpp (2)
cpp/include/tensorrt_llm/runtime/modelConfig.h (3)
  • useGptAttentionPlugin (293-296)
  • setKVCacheType (586-589)
  • useLoraPlugin (540-543)
cpp/include/tensorrt_llm/batch_manager/llmRequest.h (1)
  • setDraftTokens (1092-1095)
cpp/tensorrt_llm/nanobind/runtime/moeBindings.cpp (1)
cpp/tensorrt_llm/nanobind/executor/executor.cpp (2)
  • shutdown (189-198)
  • shutdown (189-189)
cpp/tensorrt_llm/nanobind/executor/executorConfig.cpp (2)
cpp/include/tensorrt_llm/runtime/decodingInput.h (1)
  • sinkTokenLength (47-47)
tensorrt_llm/llmapi/llm_args.py (3)
  • SchedulerConfig (614-632)
  • PeftCacheConfig (636-699)
  • ExtendedRuntimePerfKnobConfig (847-874)
cpp/tensorrt_llm/nanobind/batch_manager/bindings.cpp (3)
cpp/include/tensorrt_llm/batch_manager/llmRequest.h (17)
  • addNewToken (684-691)
  • addNewTokens (696-707)
  • setGeneratedTokens (729-746)
  • pause (786-831)
  • setMaxSentTokenLen (844-847)
  • setDraftLogits (1097-1100)
  • setLoraWeights (909-912)
  • setDecodingIter (1666-1669)
  • setLogProbs (1020-1024)
  • setReturnEncoderOutput (1141-1144)
  • setCumLogProb (1031-1034)
  • setFinishedReason (1661-1664)
  • allocContextLogitsHost (1311-1315)
  • setIsDummyRequest (1830-1833)
  • finishByReason (1760-1780)
  • setFirstScheduledTime (1238-1244)
  • updatePerfMetrics (1801-1818)
tests/unittest/bindings/test_executor_bindings.py (1)
  • logits_post_processor (1826-1832)
tensorrt_llm/functional.py (1)
  • slice (1222-1324)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (71)
examples/models/core/qwenvl/run.py (1)

121-122: LGTM: Standardized KVCacheType instantiation

The change from direct constructor KVCacheType(config["build_config"]["kv_cache_type"]) to KVCacheType.from_string(config["build_config"]["kv_cache_type"]) aligns with the codebase-wide standardization effort. This approach is more explicit and consistent with the string-based parsing pattern used throughout the project.

cpp/CMakeLists.txt (2)

201-204: LGTM: Extended pybind11 support for nanobind transition

The addition of OR BUILD_DEEP_EP condition appropriately extends pybind11 support beyond just the "pybind" binding type. This enables the build system to include pybind11 dependencies when needed for the nanobind migration, ensuring compatibility during the transition period.


220-222: LGTM: Consistent pybind11 include directory handling

The condition extension matches the earlier change at line 201, ensuring consistent inclusion of pybind11 headers when either BINDING_TYPE is "pybind" or BUILD_DEEP_EP is enabled. This maintains build system consistency.

cpp/include/tensorrt_llm/batch_manager/runtimeBuffers.h (1)

171-171: Ownership change from unique_ptr to shared_ptr is applied correctly

Verified that all uses of transformerBuffers now uniformly employ std::shared_ptr with proper nullptr checks (e.g. if (transformerBuffers)), that the instance is created via std::make_shared and will be cleaned up automatically, and that shared_ptr’s atomic reference counting ensures no leaks. No code paths remain using the old pointer type. Note that while shared_ptr’s ref-counting is thread-safe, the TransformerBuffers class itself remains non-synchronized—ensure you don’t concurrently mutate the same instance across threads without external locking.

examples/models/core/llama/summarize_long.py (1)

100-100: LGTM: Consistent KVCacheType instantiation pattern

The change to use KVCacheType.from_string(build_config['kv_cache_type']) instead of the direct constructor follows the same standardization pattern seen across the codebase. This improves consistency and makes the string-based parsing explicit.

cpp/tensorrt_llm/nanobind/batch_manager/algorithms.h (1)

1-30: LGTM: Clean nanobind algorithms header structure

The header file follows good C++ practices with:

  • Proper copyright header and license information
  • Include guards using #pragma once
  • Clean namespace organization matching the module structure
  • Simple, focused interface with a single initBindings function declaration
  • Consistent with the nanobind binding pattern used throughout the codebase
tensorrt_llm/commands/build.py (2)

41-56: Well-designed enum parsing helper function.

The enum_type function provides robust enum parsing with proper error handling and informative error messages. It correctly handles both enum instances and string representations via the from_string method, which aligns with the broader enum standardization efforts across the codebase.


151-151: Consistent enum parsing implementation.

The usage of enum_type(KVCacheType) is consistent with the broader pattern of standardizing enum construction using the from_string method across the codebase.

cpp/tensorrt_llm/nanobind/executor/bindings.h (1)

1-30: Consistent nanobind binding header implementation.

The header follows the established pattern used throughout the project for nanobind binding declarations. The function signature and namespace usage are consistent with other binding headers like batch_manager/bindings.h and runtime/bindings.h.

cpp/tensorrt_llm/nanobind/executor/executorConfig.h (1)

1-31: Consistent executor configuration binding header.

The header follows the same established pattern as other nanobind binding headers in the project. The function name initConfigBindings clearly differentiates it from the general initBindings function, maintaining good separation of concerns.

tests/integration/test_lists/test-db/l0_a10.yml (1)

193-207: Appropriate nanobind test configuration addition.

The new l0_a10_nanobind test configuration is well-structured and consistent with existing test conditions. It appropriately targets the unittest/bindings test suite for nanobind testing on A10 GPUs, which aligns with the PR's objective of enabling nanobind in CI.

examples/models/core/qwen2audio/run.py (1)

125-126: Consistent enum construction standardization.

The change to use KVCacheType.from_string() aligns with the broader enum standardization effort across the codebase. This provides more robust string-to-enum conversion and is consistent with similar changes in other files like tensorrt_llm/builder.py and tensorrt_llm/runtime/model_runner.py.

cpp/tensorrt_llm/nanobind/batch_manager/buffers.h (1)

1-30: LGTM! Well-structured nanobind bindings header.

This header file follows the established patterns for nanobind binding headers in the codebase. The structure is clean with proper licensing, include guards, namespace organization, and a simple class declaration for binding initialization.

tensorrt_llm/runtime/model_runner.py (1)

89-89: LGTM! Consistent enum initialization pattern.

The change from direct constructor usage to the from_string method aligns with the standardization effort across the codebase for KVCacheType enum initialization. This provides more explicit and consistent enum parsing.

cpp/tensorrt_llm/nanobind/batch_manager/bindings.h (1)

1-29: LGTM! Clean and consistent nanobind bindings header.

This header follows the established patterns for nanobind binding headers with proper licensing, include guards, namespace organization, and a simple function declaration for binding initialization. The structure is consistent with other binding headers in the codebase.

cpp/tensorrt_llm/batch_manager/runtimeBuffers.cpp (1)

87-88: LGTM! Ownership change supports nanobind bindings.

The change from std::make_unique to std::make_shared for transformerBuffers is appropriate for enabling shared ownership between the C++ runtime and Python bindings. This aligns with the broader nanobind integration effort and follows good practices for exposing C++ objects to Python.

cpp/tensorrt_llm/nanobind/executor/request.h (1)

1-30: LGTM! Well-organized executor request bindings header.

This header follows the established patterns for nanobind binding headers with proper licensing, include guards, and namespace organization. The initRequestBindings function declaration fits well into the overall executor binding framework and provides a clean interface for initializing request-related Python bindings.

cpp/tensorrt_llm/nanobind/testing/modelSpecBinding.h (1)

1-30: LGTM! Header follows established nanobind binding patterns.

The header file is well-structured and consistent with other nanobind binding headers in the codebase. The use of #pragma once, proper namespace organization, and the standard initBindings function declaration all follow established patterns.

cpp/tensorrt_llm/nanobind/runtime/bindings.h (1)

1-31: LGTM! Well-structured header with early initialization support.

The header file follows the established nanobind binding patterns and includes both initBindings and initBindingsEarly functions. The early initialization pattern is commonly used in binding scenarios where certain types need to be registered before dependent types.

tensorrt_llm/builder.py (1)

596-596: LGTM! Standardizes enum construction using the from_string method.

This change aligns with the broader migration to nanobind and standardizes enum parsing across the codebase. Using KVCacheType.from_string() is more explicit about string parsing and provides better error handling compared to direct construction.

cpp/tensorrt_llm/nanobind/CMakeLists.txt (5)

6-22: LGTM! Comprehensive source file additions for nanobind bindings.

The addition of source files across batch_manager, executor, runtime, testing, and userbuffers components properly supports the extensive nanobind bindings being introduced. The organization by component is logical and maintainable.


33-36: LGTM! Appropriate conditional linking for NVSHMEM.

The conditional linking of NVSHMEM libraries only when enabled is a good practice that avoids unnecessary dependencies.


40-46: LGTM! Addition of CUDA NVML library is appropriate.

The addition of ${CUDA_NVML_LIB} to the target link libraries is appropriate for GPU operations and monitoring capabilities.


56-56: LGTM! Extended rpath flags for CUDA toolkit stubs.

The addition of the CUDA toolkit stubs directory to the rpath is appropriate for proper linking and runtime library resolution.


49-49: Ensure the correct detailed‐error macro for nanobind

I didn’t find any references to either NB_DETAILED_ERROR_MESSAGES or PYBIND11_DETAILED_ERROR_MESSAGES in the repo. Since this target is built with nanobind_add_module, please confirm whether nanobind supports the PYBIND11_DETAILED_ERROR_MESSAGES flag or if it should instead use the nanobind‐specific NB_DETAILED_ERROR_MESSAGES.

• File: cpp/tensorrt_llm/nanobind/CMakeLists.txt (around line 49)

cpp/tensorrt_llm/nanobind/userbuffers/bindings.h (1)

1-31: LGTM! Header follows established nanobind binding patterns.

The header file is well-structured and consistent with other nanobind binding headers in the codebase. The use of the UserBufferBindings class with a static initBindings method follows the established pattern used throughout the nanobind bindings.

cpp/tensorrt_llm/nanobind/runtime/moeBindings.h (1)

1-30: LGTM! Well-structured header file.

The header file follows best practices with proper license headers, include guards, and a clean API design. The function declaration is appropriately placed in the tensorrt_llm::nanobind::runtime namespace and follows the established pattern for nanobind binding initialization.

cpp/tensorrt_llm/pybind/bindings.cpp (1)

173-173: LGTM! Aligns enum construction with nanobind patterns.

The change from constructor-based enum creation to a static from_string method creates consistency between pybind11 and nanobind binding styles, as mentioned in the PR objectives. This uniform API approach will facilitate the migration and provide a consistent interface for users.

cpp/tensorrt_llm/nanobind/batch_manager/cacheTransceiver.h (1)

1-30: LGTM! Consistent header file structure.

The header file follows the established pattern for nanobind binding headers with proper license headers, include guards, and a clean API design. The CacheTransceiverBindings class is appropriately placed in the tensorrt_llm::batch_manager namespace and follows the binding initialization pattern.

cpp/tensorrt_llm/pybind/executor/executorConfig.cpp (1)

339-339: LGTM! Critical bug fix for deserialization.

The change from state[2] to state[3] correctly fixes the tuple element indexing for the fourth parameter in the ExtendedRuntimePerfKnobConfig deserialization. This aligns with the expected tuple size of 4 elements and ensures proper restoration of the SizeType32 parameter during unpickling.

cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.h (1)

1-40: LGTM! Well-organized header file with proper namespace separation.

The header file provides a clean API for both KV cache manager and PEFT cache manager bindings. The namespace separation between tensorrt_llm::batch_manager::kv_cache_manager and tensorrt_llm::batch_manager appropriately reflects the different functional areas while maintaining the established pattern for nanobind binding initialization.

cpp/tensorrt_llm/pybind/executor/bindings.cpp (1)

247-257: LGTM! Interface improvement with clearer timeout semantics.

The lambda wrapper approach provides better clarity by explicitly naming the parameter timeout_ms to indicate the expected unit. The conversion from double to std::chrono::milliseconds is correctly implemented with proper handling of the optional parameter.

tests/unittest/bindings/test_bindings_ut.py (3)

313-314: LGTM! Appropriate test skip for partial nanobind migration.

The conditional skip is correctly implemented and aligns with the PR objectives indicating that some tests are not yet supported in the nanobind migration. The clear reason message helps maintainers understand why the test is skipped.


424-425: LGTM! Consistent test skip approach.

The skip decorator follows the same pattern as other nanobind-related test skips, maintaining consistency across the test suite during the migration period.


505-506: LGTM! Proper conditional test skip.

The skip condition is correctly implemented for the nanobind migration, ensuring test stability while the migration is completed.

jenkins/L0_Test.groovy (4)

67-68: LGTM - Clean addition of nanobind configuration constant.

The new NANOBIND_CONFIG field follows the existing pattern for other build configurations.


77-77: LGTM - Consistent addition to build configurations map.

The nanobind tarball entry follows the established naming convention and structure.


1731-1731: LGTM - Appropriate test configuration for nanobind.

The A10-Nanobind test configuration uses the correct test list l0_a10_nanobind and follows the existing pattern for single GPU tests.


1808-1810: LGTM - Clean logic for nanobind configuration detection.

The conditional logic properly detects "Nanobind" in stage names and assigns the appropriate configuration, following the same pattern as other build configurations.

cpp/tensorrt_llm/nanobind/userbuffers/bindings.cpp (1)

30-46: LGTM - Clean nanobind binding implementation.

The binding structure follows nanobind best practices with proper namespace usage and function definitions. The use of lambda functions for type conversions is appropriate.

cpp/tensorrt_llm/nanobind/testing/modelSpecBinding.cpp (3)

35-44: LGTM - Well-structured enum bindings with clear documentation.

The enum bindings properly use nb::is_arithmetic() and include descriptive strings for each value, which improves the Python API usability.


46-85: LGTM - Comprehensive ModelSpec bindings with proper memory management.

The class bindings are well-structured with:

  • Proper constructor binding
  • Consistent use of nb::rv_policy::reference_internal for methods returning references
  • Clear method naming that follows Python conventions
  • Proper copy constructor implementation

The return value policy ensures that objects returned by methods maintain proper lifetime relationships with the parent object.


69-69: Note: Inconsistent return value policy.

The use_logits method doesn't specify a return value policy while most other similar methods use nb::rv_policy::reference_internal. This might be intentional if it returns by value rather than reference.

Consider verifying that this is the intended behavior based on the actual method signature.

tests/unittest/bindings/test_executor_bindings.py (3)

17-17: LGTM!

The import is necessary for checking the binding type to conditionally skip tests.


488-489: Expected test skips for partial nanobind migration.

These test skips are consistent with the PR's partial migration status. The skip decorator is correctly applied with a clear reason.

Also applies to: 694-695, 1120-1121, 1159-1160, 1507-1508, 1881-1882, 2157-2158


2481-2483: Good improvement: Using enum instead of string.

The change from string "UCX" to enum trtllm.CacheTransceiverBackendType.UCX improves type safety and API clarity.

cpp/tensorrt_llm/nanobind/batch_manager/llmRequest.cpp (2)

43-51: LGTM!

The utility function correctly handles optional tensor conversion, consistent with the pattern used in kvCacheManager.cpp.


70-131: Document why returnPerfMetrics is hardcoded to false.

The implementation correctly converts all fields, but returnPerfMetrics is hardcoded to false on line 125. If performance metrics aren't supported in the nanobind bindings yet, please add a comment explaining this limitation.

-        false,                                                     // returnPerfMetrics
+        false,                                                     // returnPerfMetrics - TODO: Not yet supported in nanobind
cpp/tensorrt_llm/nanobind/batch_manager/buffers.cpp (3)

43-78: Well-structured TransformerBuffers bindings.

The bindings comprehensively expose the TransformerBuffers interface with proper named arguments and read-write properties.


88-91: Good handling of shared_ptr property.

The custom getter/setter for transformerBuffers properly handles the shared_ptr ownership semantics.


102-106: Members Are Publicly Declared – No Action Needed

The fields mCacheIndirDecoderIOBatchedCopySrcOffsetsSliceDevice, mCacheIndirDecoderIOBatchedCopyDstOffsetsSliceDevice, and mCacheIndirDecoderIOBatchedCopyCopySizesDevice are defined above the private: marker in runtimeBuffers.h, so they are already public and can safely be exposed via Python bindings.

Likely an incorrect or invalid review comment.

cpp/tensorrt_llm/nanobind/common/bindTypes.h (1)

63-98: Excellent set binding implementation with pickling support.

The set binding template is comprehensive, including proper pickling support for serialization. The implementation correctly handles all standard set operations and follows nanobind best practices.

jenkins/Build.groovy (1)

50-55: LGTM! Nanobind build configurations properly integrated

The new Nanobind build configurations follow the existing pattern and are properly integrated into the build pipeline. The configurations correctly specify:

  • The --binding_type nanobind flag
  • Appropriate tarball names with "nanobind-" prefix
  • Matching wheel architectures for each platform

Also applies to: 65-69, 85-89, 542-543

cpp/tensorrt_llm/nanobind/batch_manager/algorithms.cpp (1)

55-178: Well-structured algorithm bindings

The nanobind bindings for batch manager algorithms are properly implemented:

  • Consistent pattern across all algorithm classes
  • Appropriate use of TorchView for tensor conversions
  • Proper handling of optional arguments with default values
  • Clear argument naming for Python API
cpp/tensorrt_llm/nanobind/batch_manager/llmRequest.h (1)

89-150: Complex initialization list is correctly implemented

The initialization list properly handles the conversion of optional vectors to shared pointers, which is necessary for the base class. The use of ternary operators and move semantics is appropriate for efficiency.

cpp/tensorrt_llm/nanobind/runtime/moeBindings.cpp (2)

33-47: Good input validation in helper functions

The helper functions pyDoReplication and pyDoPlacement properly validate that the expert count matches the size of the expert load factor vector before calling the runtime functions. This prevents potential segmentation faults or undefined behavior.


49-122: Well-implemented MOE bindings

The nanobind bindings for MOE components are properly structured:

  • Clear class and method documentation strings
  • Appropriate use of def_rw for read-write properties
  • Nice __repr__ implementation for MoeWeight for debugging
  • Consistent argument naming across all methods
cpp/tensorrt_llm/nanobind/executor/request.cpp (1)

73-107: LGTM! Well-structured serialization support.

The __getstate__ and __setstate__ implementation properly handles all 19 fields of SamplingConfig with appropriate type conversions and validation.

cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.cpp (3)

63-237: LGTM! Comprehensive trampoline implementation.

The PyKvCacheManager trampoline class correctly implements all 28 pure virtual methods from BaseKVCacheManager using the appropriate NB_OVERRIDE_PURE macro.


378-385: Good error handling pattern for tensor conversions.

The tensor conversion with validation using TLLM_CHECK_WITH_INFO provides clear error messages. This pattern is consistently applied throughout the file.


456-464: Excellent GIL management for performance.

The ensure_batch method correctly releases the GIL before calling the potentially expensive C++ operation, allowing Python concurrency.

cpp/tensorrt_llm/nanobind/executor/executor.cpp (2)

144-162: Well-implemented managed weights handling.

The constructor properly converts Python dictionary of numpy arrays to C++ map of Tensors with appropriate null checking and type conversions.


189-198: Excellent GIL management and documentation in shutdown.

The shutdown method correctly releases the GIL to prevent deadlock, with clear explanation of why this is necessary. The trace logging aids debugging.

cpp/tensorrt_llm/nanobind/executor/executor.h (2)

61-86: Consistent and correct GIL handling in all await methods.

All three awaitResponses overloads correctly release the GIL with clear documentation explaining the rationale. The [[nodiscard]] attributes ensure callers handle return values.


30-129: Well-designed executor wrapper interface.

The Executor class provides a clean Python-friendly interface with:

  • Multiple construction options
  • Proper context manager support
  • Consistent error handling patterns
  • Thread-safe operations with appropriate GIL management
cpp/tensorrt_llm/nanobind/batch_manager/cacheTransceiver.cpp (1)

36-76: LGTM! Trampoline class correctly implements pure virtual methods.

The PyCacheTransceiver trampoline class is properly structured with NB_TRAMPOLINE declaration matching the 6 pure virtual methods. All overrides use NB_OVERRIDE_PURE correctly.

cpp/tensorrt_llm/nanobind/executor/executorConfig.cpp (2)

52-75: Pickling implementation is correct.

The DynamicBatchConfig pickling properly validates tuple size and correctly reconstructs the object.


444-456: Well-implemented enum string conversion.

The from_string method properly handles case-insensitive conversion and provides clear error messages.

cpp/tensorrt_llm/nanobind/batch_manager/bindings.cpp (1)

59-81: Exception registration implemented correctly.

The custom exception types are properly registered and the translator correctly maps C++ exceptions to Python errors.

cpp/tensorrt_llm/nanobind/runtime/bindings.cpp (2)

61-97: Trampoline class correctly implements the IGptDecoder interface.

All 5 pure virtual methods are properly overridden using NB_OVERRIDE_PURE.


124-145: Constructor overloads properly handle both file and buffer inputs.

The buffer constructor correctly validates the array dimension and both constructors appropriately use the default logger.

@Linda-Stadter

/bot run

@tensorrt-cicd

PR_Github #12315 [ run ] triggered by Bot

@tensorrt-cicd

PR_Github #12313 [ run ] completed with state ABORTED


@coderabbitai bot left a comment


Actionable comments posted: 3

♻️ Duplicate comments (8)
cpp/tensorrt_llm/nanobind/common/customCasters.h (3)

104-120: Potential lifetime issue with OptionalRef construction.

Creating an OptionalRef with a reference to the local conv object is unsafe. The conv object will be destroyed when the function returns, leaving a dangling reference.


312-319: Add error handling for DLPack operations.

The from_python method should handle potential failures in DLPack conversion.


302-303: Use std::const_pointer_cast instead of reinterpret_cast for SharedConstPtr conversion

The current reinterpret_cast is undefined behavior. Replace it with std::const_pointer_cast to safely strip the const qualifier.

cpp/tensorrt_llm/nanobind/batch_manager/cacheTransceiver.cpp (1)

91-97: Add validation for num_kv_heads_per_layer vector size.

The constructor accepts a vector of KV heads per layer but doesn't validate that its size matches the expected number of layers from the cache manager. This could lead to out-of-bounds access or incorrect cache layouts.

cpp/tensorrt_llm/nanobind/executor/executorConfig.cpp (1)

584-584: Parameter name inconsistency with property accessors.

The constructor parameter mm_embedding_offloading doesn't match the property accessor names prompt_table_offloading. This inconsistency could confuse users.

Also applies to: 631-632

cpp/tensorrt_llm/nanobind/runtime/bindings.cpp (2)

321-329: Add validation for PyTorch stream object.

The function assumes py_stream has a cuda_stream attribute without validation, which could cause crashes if an incorrect object is passed.


307-319: Add null pointer validation for buffer addresses.

The lamport functions cast integer addresses to pointers without validation, which could cause segmentation faults if 0 is passed.

cpp/tensorrt_llm/nanobind/bindings.cpp (1)

363-389: Replace assert with proper error handling.

The assert at line 365 will be disabled in release builds, potentially leading to undefined behavior if the tuple size is incorrect. Replace it with a proper runtime check that throws an exception.

     auto SamplingConfigSetState = [](tr::SamplingConfig& self, nb::tuple t) -> tr::SamplingConfig
     {
-        assert(t.size() == 19);
+        if (t.size() != 19) {
+            throw std::runtime_error("Invalid state tuple size for SamplingConfig: expected 19, got " + std::to_string(t.size()));
+        }
 
         tr::SamplingConfig config;
🧹 Nitpick comments (3)
cpp/tensorrt_llm/nanobind/batch_manager/algorithms.cpp (1)

100-130: Consider releasing GIL for potentially expensive operations

The HandleContextLogits and HandleGenerationLogits operations might be computationally expensive. Consider releasing the GIL to allow other Python threads to run during these operations.

Example for HandleContextLogits:

             [](HandleContextLogits const& self, DecoderInputBuffers& inputBuffers, RequestVector const& contextRequests,
                 at::Tensor const& logits, std::vector<tr::SizeType32> const& numContextLogitsVec,
                 tr::ModelConfig const& modelConfig, tr::BufferManager const& manager,
                 OptionalRef<MedusaBuffers> medusaBuffers = std::nullopt)
             {
+                nb::gil_scoped_release release;
                 return self(inputBuffers, contextRequests, tr::TorchView::of(logits), numContextLogitsVec, modelConfig,
                     manager, medusaBuffers);
             },
cpp/tensorrt_llm/nanobind/batch_manager/bindings.cpp (2)

253-376: Consider using a builder pattern for the complex constructor.

The LlmRequest constructor has 49 parameters, which makes it difficult to use and maintain. While the implementation is correct, consider introducing a builder pattern or configuration object to improve the API usability.


435-439: Use size_t instead of int for loop counter.

The loop iterates over a vector but uses int as the counter type. Use size_t to match the vector's size type and avoid potential issues with large vectors.

-            for (int i = 0; i < requests.size(); ++i)
+            for (size_t i = 0; i < requests.size(); ++i)
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d013a3a and 8613975.

📒 Files selected for processing (49)
  • cpp/CMakeLists.txt (2 hunks)
  • cpp/include/tensorrt_llm/batch_manager/runtimeBuffers.h (1 hunks)
  • cpp/tensorrt_llm/batch_manager/runtimeBuffers.cpp (1 hunks)
  • cpp/tensorrt_llm/nanobind/CMakeLists.txt (2 hunks)
  • cpp/tensorrt_llm/nanobind/batch_manager/algorithms.cpp (1 hunks)
  • cpp/tensorrt_llm/nanobind/batch_manager/algorithms.h (1 hunks)
  • cpp/tensorrt_llm/nanobind/batch_manager/bindings.cpp (1 hunks)
  • cpp/tensorrt_llm/nanobind/batch_manager/bindings.h (1 hunks)
  • cpp/tensorrt_llm/nanobind/batch_manager/buffers.cpp (1 hunks)
  • cpp/tensorrt_llm/nanobind/batch_manager/buffers.h (1 hunks)
  • cpp/tensorrt_llm/nanobind/batch_manager/cacheTransceiver.cpp (1 hunks)
  • cpp/tensorrt_llm/nanobind/batch_manager/cacheTransceiver.h (1 hunks)
  • cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.cpp (1 hunks)
  • cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.h (1 hunks)
  • cpp/tensorrt_llm/nanobind/batch_manager/llmRequest.cpp (1 hunks)
  • cpp/tensorrt_llm/nanobind/batch_manager/llmRequest.h (1 hunks)
  • cpp/tensorrt_llm/nanobind/bindings.cpp (2 hunks)
  • cpp/tensorrt_llm/nanobind/common/bindTypes.h (1 hunks)
  • cpp/tensorrt_llm/nanobind/common/customCasters.h (1 hunks)
  • cpp/tensorrt_llm/nanobind/executor/bindings.cpp (1 hunks)
  • cpp/tensorrt_llm/nanobind/executor/bindings.h (1 hunks)
  • cpp/tensorrt_llm/nanobind/executor/executor.cpp (1 hunks)
  • cpp/tensorrt_llm/nanobind/executor/executor.h (1 hunks)
  • cpp/tensorrt_llm/nanobind/executor/executorConfig.cpp (1 hunks)
  • cpp/tensorrt_llm/nanobind/executor/executorConfig.h (1 hunks)
  • cpp/tensorrt_llm/nanobind/executor/request.cpp (1 hunks)
  • cpp/tensorrt_llm/nanobind/executor/request.h (1 hunks)
  • cpp/tensorrt_llm/nanobind/runtime/bindings.cpp (1 hunks)
  • cpp/tensorrt_llm/nanobind/runtime/bindings.h (1 hunks)
  • cpp/tensorrt_llm/nanobind/runtime/moeBindings.cpp (1 hunks)
  • cpp/tensorrt_llm/nanobind/runtime/moeBindings.h (1 hunks)
  • cpp/tensorrt_llm/nanobind/testing/modelSpecBinding.cpp (1 hunks)
  • cpp/tensorrt_llm/nanobind/testing/modelSpecBinding.h (1 hunks)
  • cpp/tensorrt_llm/nanobind/userbuffers/bindings.cpp (1 hunks)
  • cpp/tensorrt_llm/nanobind/userbuffers/bindings.h (1 hunks)
  • cpp/tensorrt_llm/pybind/bindings.cpp (1 hunks)
  • cpp/tensorrt_llm/pybind/executor/bindings.cpp (1 hunks)
  • cpp/tensorrt_llm/pybind/executor/executorConfig.cpp (1 hunks)
  • examples/models/core/llama/summarize_long.py (1 hunks)
  • examples/models/core/qwen2audio/run.py (1 hunks)
  • examples/models/core/qwenvl/run.py (1 hunks)
  • jenkins/Build.groovy (4 hunks)
  • jenkins/L0_Test.groovy (3 hunks)
  • tensorrt_llm/builder.py (1 hunks)
  • tensorrt_llm/commands/build.py (2 hunks)
  • tensorrt_llm/runtime/model_runner.py (1 hunks)
  • tests/integration/test_lists/test-db/l0_a10.yml (1 hunks)
  • tests/unittest/bindings/test_bindings_ut.py (4 hunks)
  • tests/unittest/bindings/test_executor_bindings.py (10 hunks)
✅ Files skipped from review due to trivial changes (3)
  • cpp/CMakeLists.txt
  • tests/integration/test_lists/test-db/l0_a10.yml
  • cpp/tensorrt_llm/nanobind/batch_manager/cacheTransceiver.h
🚧 Files skipped from review as they are similar to previous changes (36)
  • cpp/include/tensorrt_llm/batch_manager/runtimeBuffers.h
  • examples/models/core/llama/summarize_long.py
  • examples/models/core/qwen2audio/run.py
  • cpp/tensorrt_llm/nanobind/batch_manager/bindings.h
  • examples/models/core/qwenvl/run.py
  • cpp/tensorrt_llm/nanobind/testing/modelSpecBinding.h
  • tensorrt_llm/commands/build.py
  • cpp/tensorrt_llm/nanobind/userbuffers/bindings.h
  • cpp/tensorrt_llm/nanobind/executor/executorConfig.h
  • cpp/tensorrt_llm/nanobind/CMakeLists.txt
  • cpp/tensorrt_llm/nanobind/executor/bindings.h
  • cpp/tensorrt_llm/nanobind/runtime/bindings.h
  • cpp/tensorrt_llm/batch_manager/runtimeBuffers.cpp
  • cpp/tensorrt_llm/pybind/bindings.cpp
  • cpp/tensorrt_llm/nanobind/batch_manager/buffers.h
  • cpp/tensorrt_llm/pybind/executor/executorConfig.cpp
  • cpp/tensorrt_llm/nanobind/executor/request.h
  • tensorrt_llm/runtime/model_runner.py
  • tensorrt_llm/builder.py
  • cpp/tensorrt_llm/nanobind/runtime/moeBindings.h
  • cpp/tensorrt_llm/nanobind/batch_manager/algorithms.h
  • tests/unittest/bindings/test_executor_bindings.py
  • cpp/tensorrt_llm/nanobind/testing/modelSpecBinding.cpp
  • cpp/tensorrt_llm/pybind/executor/bindings.cpp
  • cpp/tensorrt_llm/nanobind/userbuffers/bindings.cpp
  • cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.h
  • cpp/tensorrt_llm/nanobind/batch_manager/llmRequest.cpp
  • cpp/tensorrt_llm/nanobind/batch_manager/buffers.cpp
  • tests/unittest/bindings/test_bindings_ut.py
  • jenkins/L0_Test.groovy
  • cpp/tensorrt_llm/nanobind/common/bindTypes.h
  • jenkins/Build.groovy
  • cpp/tensorrt_llm/nanobind/batch_manager/llmRequest.h
  • cpp/tensorrt_llm/nanobind/executor/request.cpp
  • cpp/tensorrt_llm/nanobind/executor/executor.cpp
  • cpp/tensorrt_llm/nanobind/executor/executor.h
🧰 Additional context used
🧠 Learnings (3)
cpp/tensorrt_llm/nanobind/executor/executorConfig.cpp (1)
Learnt from: amitz-nv
PR: NVIDIA/TensorRT-LLM#5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.374Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks `is_adapter_in_cpu_cache()` and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
cpp/tensorrt_llm/nanobind/common/customCasters.h (1)
Learnt from: amitz-nv
PR: NVIDIA/TensorRT-LLM#5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.374Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks `is_adapter_in_cpu_cache()` and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
cpp/tensorrt_llm/nanobind/batch_manager/cacheTransceiver.cpp (1)
Learnt from: amitz-nv
PR: NVIDIA/TensorRT-LLM#5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.374Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks `is_adapter_in_cpu_cache()` and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
🧬 Code Graph Analysis (1)
cpp/tensorrt_llm/nanobind/runtime/moeBindings.cpp (1)
cpp/tensorrt_llm/nanobind/executor/executor.cpp (2)
  • shutdown (189-198)
  • shutdown (189-189)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (20)
cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.cpp (1)

458-464: Good practice: GIL release for potentially blocking operation

Great job releasing the GIL in ensure_batch. This allows other Python threads to run while the potentially expensive batch ensuring operation executes.

cpp/tensorrt_llm/nanobind/executor/executorConfig.cpp (1)

468-536: Well-implemented pickling for ExecutorConfig.

The pickling implementation properly handles both C++ state and Python dynamic attributes, and correctly uses nb::inst_mark_ready() to finalize the object after reconstruction. This is a robust solution for complex objects with dynamic attributes.
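
For reference, the overall shape of this pattern is roughly the following (sketch; the state layout and the inst_ptr-based access are assumptions, not a copy of the actual binding):

// Hedged sketch: __setstate__ that restores C++ state, marks the instance
// ready, then restores dynamic Python attributes.
.def("__setstate__",
    [](nb::object self, nb::tuple const& state)
    {
        // self points at not-yet-initialized storage; construct in place.
        auto* config = nb::inst_ptr<tle::ExecutorConfig>(self);
        new (config) tle::ExecutorConfig(/* C++ fields decoded from state[0], elided */);
        nb::inst_mark_ready(self); // nanobind may now treat the object as constructed
        // Restore dynamic Python attributes captured in state[1].
        for (auto [key, value] : nb::cast<nb::dict>(state[1]))
        {
            nb::setattr(self, key, value);
        }
    })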

cpp/tensorrt_llm/nanobind/bindings.cpp (18)

1-61: Well-organized header section with comprehensive includes.

The header section is well-structured with:

  • Updated copyright notice
  • Comprehensive nanobind includes for STL containers and operators
  • Proper inclusion of torch and TensorRT-LLM components
  • Clean namespace aliases that improve readability
  • Helpful template type aliases

68-75: Clean utility function implementation.

The makeSamplingConfig helper function is properly placed in an anonymous namespace and provides a simple wrapper for the SamplingConfig constructor.


76-111: Comprehensive MPI communication bindings.

The MpiComm bindings are well-implemented with:

  • Proper static method bindings for singleton access
  • Complete coverage of rank, size, and splitting operations
  • Appropriate lambda wrappers for accessing MPI session instances

112-122: Proper CUDA stream binding implementation.

The CudaStream binding correctly handles Python-to-CUDA stream conversion using nanobind's in-place construction pattern and proper type casting.


123-132: Well-organized submodule hierarchy.

The submodule structure is logical with clear separation of concerns:

  • Executor bindings in dedicated submodule
  • Internal components grouped appropriately
  • Early initialization calls for dependency management

133-135: Simple build information exposure.

Clean exposure of build-time configuration to Python through a dedicated submodule.


136-157: Comprehensive PEFT cache manager configuration binding.

The PeftCacheManagerConfig binding is well-implemented with:

  • Complete constructor parameter binding with appropriate defaults
  • Read-write properties for all configuration fields
  • Proper handling of optional parameters with std::nullopt

158-208: Comprehensive enum bindings with string conversion support.

The enum bindings are well-implemented with:

  • Complete coverage of all enum values
  • Proper value exports for direct access
  • KVCacheType includes a useful from_string static method for configuration parsing

209-224: Well-designed LoraModule binding.

The LoraModule binding follows good patterns with:

  • Comprehensive constructor parameter binding
  • Read-only properties appropriate for configuration data
  • Static factory method for creating LoRA modules

225-268: Comprehensive quantization mode binding.

The QuantMode binding is excellent with:

  • Complete static factory methods for all quantization types
  • Comprehensive property bindings for feature detection
  • Proper operator overloads for flag-like behavior
  • Multiple factory methods supporting different use cases

269-324: Comprehensive model configuration binding.

The ModelConfig binding is very well-implemented with:

  • Complete constructor and property bindings
  • Appropriate mix of read-only and read-write properties
  • Proper handling of overloaded methods with different signatures
  • Good coverage of all model configuration aspects

325-354: Comprehensive world configuration binding.

The WorldConfig binding is well-designed with:

  • Complete constructor with appropriate defaults
  • Comprehensive read-only properties for world state
  • Static MPI factory method for distributed setup
  • Proper handling of optional device IDs

391-418: Comprehensive sampling configuration binding.

The SamplingConfig binding is well-implemented with:

  • Multiple constructors for different use cases
  • Complete read-write properties for all sampling parameters
  • Proper pickle support with serialization methods
  • Equality operator for comparisons

419-422: Standard utility bindings.

Clean vector binding and utility function exposure for SamplingConfig operations.


423-449: Comprehensive JSON configuration binding.

The GptJsonConfig binding is well-implemented with:

  • Complete constructor and property bindings
  • Static factory methods for parsing JSON from strings and files
  • Overloaded methods for different engine filename generation scenarios
  • Appropriate read-only properties for configuration data

450-470: Comprehensive state tracking and memory monitoring bindings.

Good implementation with:

  • Complete LlmRequestState enum covering all request states
  • Proper singleton access for MemoryCounters with reference return policy
  • Read-only properties for different memory counter types

471-484: Well-organized binding initialization sequence.

The initialization sequence is properly structured with:

  • Logical order of component initialization
  • Appropriate submodule creation for specialized components
  • Complete coverage of all binding components

485-497: Proper NVLS memory handling bindings.

The NVLS bindings are well-implemented with:

  • Complete IpcNvlsHandle class binding with necessary properties
  • Custom method for accessing IPC pointers for C interface compatibility
  • Module-level functions for NVLS allocation and management operations

@Linda-Stadter
Collaborator Author

/bot run

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (10)
cpp/tensorrt_llm/nanobind/runtime/moeBindings.cpp (1)

110-110: GIL handling issue in shutdown method

As noted in the past review comment, the MoeLoadBalancer::shutdown() method should release the GIL to prevent potential deadlocks if the load balancer has spawned threads that need to acquire the GIL during callbacks.
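
A sketch of the suggested fix (the binding spelling and class namespace are assumptions):

// Hedged sketch: release the GIL before blocking on shutdown so worker
// threads can still run Python callbacks while we wait for them.
.def("shutdown",
    [](tr::MoeLoadBalancer& self)
    {
        nb::gil_scoped_release release;
        self.shutdown();
    })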

cpp/tensorrt_llm/nanobind/executor/bindings.cpp (1)

246-256: Missing timeout validation

As noted in the past review comment, the timeout conversion from double to milliseconds doesn't validate negative values, which could cause unexpected behavior.

cpp/tensorrt_llm/nanobind/executor/request.cpp (3)

913-930: Const-correctness violation in clear methods

As noted in the past review comment, the clear_context_logits and clear_generation_logits methods use const_cast to modify a const reference, violating const-correctness principles.


580-581: Hard-coded parameter value in deserialization

As noted in the past review comment, the hard-coded value 1 for numReturnSequences in the Request constructor during deserialization discards whatever value was serialized; it should be restored from the state tuple instead.


433-450: Critical control flow bug in setstate

As noted in the past review comment, the ContextPhaseParamsSetState function has incorrect control flow, executing both constructor calls instead of using proper conditional logic.

cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.cpp (1)

409-417: Add validation for negative timeout values

Similar to the executor bindings, the timeout conversion should validate negative values to prevent invalid timeout durations.

             [](tbk::BaseKVCacheManager& self, std::optional<double> timeout_ms = std::nullopt)
             {
                 if (timeout_ms)
                 {
+                    if (*timeout_ms < 0)
+                    {
+                        throw std::invalid_argument("timeout_ms must be non-negative");
+                    }
                     return self.getLatestEvents(std::chrono::milliseconds(static_cast<int64_t>(*timeout_ms)));
                 }
                 return self.getLatestEvents(std::nullopt);
             },
cpp/tensorrt_llm/nanobind/executor/executorConfig.cpp (3)

584-584: Parameter name inconsistency with property accessors

The constructor parameter is named mm_embedding_offloading but the property accessors use prompt_table_offloading. This inconsistency could confuse users.

-            nb::arg("gather_generation_logits") = false, nb::arg("mm_embedding_offloading") = false,
+            nb::arg("gather_generation_logits") = false, nb::arg("prompt_table_offloading") = false,

631-632: Fix property name inconsistency

The property accessor uses mm_embedding_offloading but should match the underlying method names that use prompt_table_offloading.

-        .def_prop_rw("mm_embedding_offloading", &tle::ExecutorConfig::getPromptTableOffloading,
-            &tle::ExecutorConfig::setPromptTableOffloading)
+        .def_prop_rw("prompt_table_offloading", &tle::ExecutorConfig::getPromptTableOffloading,
+            &tle::ExecutorConfig::setPromptTableOffloading)

352-353: Fix the index error in deserialization

There's a critical bug where state[2] is used twice. The fourth parameter should use state[3].

-        new (&self) tle::ExtendedRuntimePerfKnobConfig(nb::cast<bool>(state[0]), nb::cast<bool>(state[1]),
-            nb::cast<bool>(state[2]), nb::cast<SizeType32>(state[2]));
+        new (&self) tle::ExtendedRuntimePerfKnobConfig(nb::cast<bool>(state[0]), nb::cast<bool>(state[1]),
+            nb::cast<bool>(state[2]), nb::cast<SizeType32>(state[3]));
cpp/tensorrt_llm/nanobind/bindings.cpp (1)

362-388: Replace assert with proper error handling.

The assert at line 364 will be disabled in release builds, potentially leading to undefined behavior if the tuple size is incorrect. Replace it with a proper runtime check that throws an exception.

Apply this diff to fix the issue:

-        assert(t.size() == 19);
+        if (t.size() != 19) {
+            throw std::runtime_error("Invalid state tuple size for SamplingConfig: expected 19, got " + std::to_string(t.size()));
+        }
🧹 Nitpick comments (3)
cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.cpp (1)

378-406: Add validation for tensor arguments in batch operations

The copy_batch_block_offsets method should validate that requestIds is not empty so that malformed input fails fast instead of silently copying nothing.

             [](tbk::BaseKVCacheManager& self, at::Tensor output,
                 std::vector<tb::LlmRequest::RequestIdType> const& requestIds, SizeType32 const beamWidth,
                 SizeType32 const offset)
             {
                 auto _output = from_torch(output);
                 TLLM_CHECK_WITH_INFO(_output.has_value(), "Invalid output tensor.");
+                TLLM_CHECK_WITH_INFO(!requestIds.empty(), "requestIds cannot be empty.");
                 for (size_t i = 0; i < requestIds.size(); ++i)
                 {
                     self.copyBlockOffsets(*(_output.value()), i * beamWidth + offset, requestIds[i]);
                 }
             })
cpp/tensorrt_llm/nanobind/batch_manager/bindings.cpp (1)

433-439: Use size_t for the loop counter.

The preceding TLLM_CHECK_WITH_INFO already guarantees requests.size() == tokens.size(), so the indexed accesses are in bounds; the remaining issue is the signed int counter, which triggers signed/unsigned comparison warnings and does not match the vector's size type.

             TLLM_CHECK_WITH_INFO(requests.size() == tokens.size(), "Expected the same number of requests and tokens.");

-            for (int i = 0; i < requests.size(); ++i)
+            for (size_t i = 0; i < requests.size(); ++i)
             {
                 requests[i]->addNewToken(tokens[i], beam_idx);
             }
cpp/tensorrt_llm/nanobind/executor/executorConfig.cpp (1)

492-496: Add validation for state tuple size

The hardcoded size check should be made more maintainable by using a named constant.

+        static constexpr size_t EXPECTED_CPP_STATES_SIZE = 28;
         auto cpp_states = nb::cast<nb::tuple>(state[0]);
-        if (cpp_states.size() != 28)
+        if (cpp_states.size() != EXPECTED_CPP_STATES_SIZE)
         {
             throw std::runtime_error("Invalid cpp_states!");
         }
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8613975 and 4ff44cf.

📒 Files selected for processing (45)
  • cpp/CMakeLists.txt (2 hunks)
  • cpp/tensorrt_llm/nanobind/CMakeLists.txt (2 hunks)
  • cpp/tensorrt_llm/nanobind/batch_manager/algorithms.cpp (1 hunks)
  • cpp/tensorrt_llm/nanobind/batch_manager/algorithms.h (1 hunks)
  • cpp/tensorrt_llm/nanobind/batch_manager/bindings.cpp (1 hunks)
  • cpp/tensorrt_llm/nanobind/batch_manager/bindings.h (1 hunks)
  • cpp/tensorrt_llm/nanobind/batch_manager/cacheTransceiver.cpp (1 hunks)
  • cpp/tensorrt_llm/nanobind/batch_manager/cacheTransceiver.h (1 hunks)
  • cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.cpp (1 hunks)
  • cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.h (1 hunks)
  • cpp/tensorrt_llm/nanobind/batch_manager/llmRequest.cpp (1 hunks)
  • cpp/tensorrt_llm/nanobind/batch_manager/llmRequest.h (1 hunks)
  • cpp/tensorrt_llm/nanobind/bindings.cpp (2 hunks)
  • cpp/tensorrt_llm/nanobind/common/bindTypes.h (1 hunks)
  • cpp/tensorrt_llm/nanobind/common/customCasters.h (1 hunks)
  • cpp/tensorrt_llm/nanobind/executor/bindings.cpp (1 hunks)
  • cpp/tensorrt_llm/nanobind/executor/bindings.h (1 hunks)
  • cpp/tensorrt_llm/nanobind/executor/executor.cpp (1 hunks)
  • cpp/tensorrt_llm/nanobind/executor/executor.h (1 hunks)
  • cpp/tensorrt_llm/nanobind/executor/executorConfig.cpp (1 hunks)
  • cpp/tensorrt_llm/nanobind/executor/executorConfig.h (1 hunks)
  • cpp/tensorrt_llm/nanobind/executor/request.cpp (1 hunks)
  • cpp/tensorrt_llm/nanobind/executor/request.h (1 hunks)
  • cpp/tensorrt_llm/nanobind/runtime/bindings.cpp (1 hunks)
  • cpp/tensorrt_llm/nanobind/runtime/bindings.h (1 hunks)
  • cpp/tensorrt_llm/nanobind/runtime/moeBindings.cpp (1 hunks)
  • cpp/tensorrt_llm/nanobind/runtime/moeBindings.h (1 hunks)
  • cpp/tensorrt_llm/nanobind/testing/modelSpecBinding.cpp (1 hunks)
  • cpp/tensorrt_llm/nanobind/testing/modelSpecBinding.h (1 hunks)
  • cpp/tensorrt_llm/nanobind/userbuffers/bindings.cpp (1 hunks)
  • cpp/tensorrt_llm/nanobind/userbuffers/bindings.h (1 hunks)
  • cpp/tensorrt_llm/pybind/bindings.cpp (1 hunks)
  • cpp/tensorrt_llm/pybind/executor/bindings.cpp (1 hunks)
  • cpp/tensorrt_llm/pybind/executor/executorConfig.cpp (1 hunks)
  • examples/models/core/llama/summarize_long.py (1 hunks)
  • examples/models/core/qwen2audio/run.py (1 hunks)
  • examples/models/core/qwenvl/run.py (1 hunks)
  • jenkins/Build.groovy (4 hunks)
  • jenkins/L0_Test.groovy (3 hunks)
  • tensorrt_llm/builder.py (1 hunks)
  • tensorrt_llm/commands/build.py (2 hunks)
  • tensorrt_llm/runtime/model_runner.py (1 hunks)
  • tests/integration/test_lists/test-db/l0_a10.yml (1 hunks)
  • tests/unittest/bindings/test_bindings_ut.py (4 hunks)
  • tests/unittest/bindings/test_executor_bindings.py (10 hunks)
✅ Files skipped from review due to trivial changes (1)
  • tests/integration/test_lists/test-db/l0_a10.yml
🚧 Files skipped from review as they are similar to previous changes (36)
  • examples/models/core/llama/summarize_long.py
  • tensorrt_llm/runtime/model_runner.py
  • examples/models/core/qwenvl/run.py
  • cpp/tensorrt_llm/pybind/executor/executorConfig.cpp
  • examples/models/core/qwen2audio/run.py
  • cpp/tensorrt_llm/nanobind/CMakeLists.txt
  • cpp/tensorrt_llm/nanobind/runtime/bindings.h
  • cpp/CMakeLists.txt
  • cpp/tensorrt_llm/nanobind/executor/executorConfig.h
  • cpp/tensorrt_llm/pybind/bindings.cpp
  • cpp/tensorrt_llm/nanobind/executor/request.h
  • cpp/tensorrt_llm/nanobind/batch_manager/cacheTransceiver.h
  • cpp/tensorrt_llm/nanobind/executor/bindings.h
  • cpp/tensorrt_llm/nanobind/runtime/moeBindings.h
  • cpp/tensorrt_llm/pybind/executor/bindings.cpp
  • cpp/tensorrt_llm/nanobind/batch_manager/bindings.h
  • cpp/tensorrt_llm/nanobind/testing/modelSpecBinding.cpp
  • cpp/tensorrt_llm/nanobind/userbuffers/bindings.h
  • cpp/tensorrt_llm/nanobind/userbuffers/bindings.cpp
  • tests/unittest/bindings/test_executor_bindings.py
  • tensorrt_llm/builder.py
  • tensorrt_llm/commands/build.py
  • cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.h
  • tests/unittest/bindings/test_bindings_ut.py
  • cpp/tensorrt_llm/nanobind/batch_manager/algorithms.h
  • cpp/tensorrt_llm/nanobind/testing/modelSpecBinding.h
  • jenkins/Build.groovy
  • cpp/tensorrt_llm/nanobind/common/bindTypes.h
  • cpp/tensorrt_llm/nanobind/batch_manager/llmRequest.h
  • cpp/tensorrt_llm/nanobind/batch_manager/algorithms.cpp
  • cpp/tensorrt_llm/nanobind/common/customCasters.h
  • cpp/tensorrt_llm/nanobind/executor/executor.h
  • jenkins/L0_Test.groovy
  • cpp/tensorrt_llm/nanobind/batch_manager/cacheTransceiver.cpp
  • cpp/tensorrt_llm/nanobind/executor/executor.cpp
  • cpp/tensorrt_llm/nanobind/runtime/bindings.cpp
🧰 Additional context used
🧠 Learnings (2)
cpp/tensorrt_llm/nanobind/executor/request.cpp (1)
Learnt from: amitz-nv
PR: NVIDIA/TensorRT-LLM#5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.374Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks `is_adapter_in_cpu_cache()` and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
cpp/tensorrt_llm/nanobind/executor/executorConfig.cpp (1)
Learnt from: amitz-nv
PR: NVIDIA/TensorRT-LLM#5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.374Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks `is_adapter_in_cpu_cache()` and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
🧬 Code Graph Analysis (2)
cpp/tensorrt_llm/nanobind/runtime/moeBindings.cpp (1)
cpp/tensorrt_llm/nanobind/executor/executor.cpp (2)
  • shutdown (189-198)
  • shutdown (189-189)
cpp/tensorrt_llm/nanobind/executor/executorConfig.cpp (2)
cpp/include/tensorrt_llm/runtime/decodingInput.h (1)
  • sinkTokenLength (47-47)
tensorrt_llm/llmapi/llm_args.py (3)
  • SchedulerConfig (614-632)
  • PeftCacheConfig (636-699)
  • ExtendedRuntimePerfKnobConfig (847-874)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (30)
cpp/tensorrt_llm/nanobind/runtime/moeBindings.cpp (2)

33-39: Input validation is appropriate

The helper function correctly validates that the expert count matches the expert load factor size before proceeding with the operation.


41-47: Consistent validation pattern

The pyDoPlacement function follows the same validation pattern as pyDoReplication, ensuring consistent error handling.

cpp/tensorrt_llm/nanobind/batch_manager/llmRequest.cpp (3)

43-50: Tensor conversion function is well-implemented

The from_torch helper function correctly handles optional tensor conversion using TorchView, with proper null checking.
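
The described helper presumably looks close to this (sketch; the exact signature is assumed):

// Hedged sketch of the optional tensor conversion helper.
std::optional<tr::ITensor::SharedPtr> from_torch(std::optional<at::Tensor> torchTensor)
{
    if (torchTensor)
    {
        return tr::TorchView::of(torchTensor.value());
    }
    return std::nullopt;
}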


54-68: Callback adapter handles tensor conversion properly

The callback adapter correctly converts internal tensor representations to PyTorch tensors for the user-provided callback, with proper stream handling.


70-131: Complex constructor with many parameters

The toTrtLlm method constructs a request with 49 parameters, which is complex but appears necessary for the comprehensive LLM request configuration. The parameter ordering and types are consistent with the header declaration.

cpp/tensorrt_llm/nanobind/executor/bindings.cpp (2)

56-88: Well-implemented custom serialization for DecodingMode

The custom __getstate__ and __setstate__ methods properly handle the DecodingMode serialization with appropriate state validation.


105-115: Comprehensive statistics class bindings

The KvCacheStats class binding properly exposes all relevant statistics fields for monitoring cache performance.

cpp/tensorrt_llm/nanobind/executor/request.cpp (2)

73-107: Comprehensive SamplingConfig serialization

The SamplingConfig serialization properly handles all 19 parameters with appropriate type casting and validation.


585-662: Extensive Request constructor with proper defaults

The Request constructor binding handles a comprehensive set of parameters with appropriate default values and keyword-only arguments for better usability.

cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.cpp (2)

63-237: Well-structured trampoline class implementation

The PyKvCacheManager trampoline class correctly implements all 28 virtual methods with NB_OVERRIDE_PURE macros, enabling proper Python subclassing of the abstract BaseKVCacheManager class.


54-61: Proper tensor conversion handling

The from_torch helper function correctly handles optional tensor conversion from PyTorch to runtime tensors with proper null checking.

cpp/tensorrt_llm/nanobind/batch_manager/bindings.cpp (2)

59-81: Proper exception registration and handling

The exception registration for PeftTaskNotCachedException and LoraCacheFullException is correctly implemented with proper translator setup.
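
For context, nanobind's registration for these is a one-liner per type (sketch; the scope variable and exported names are assumed):

// Hedged sketch: expose the C++ exceptions as Python exception types.
nb::exception<tb::PeftTaskNotCachedException>(m, "PeftTaskNotCachedException");
nb::exception<tb::LoraCacheFullException>(m, "LoraCacheFullException");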


429-523: Well-structured batch processing functions

The free functions add_new_tokens_to_requests and make_decoding_batch_input are well-implemented with proper error checking and efficient tensor operations.

cpp/tensorrt_llm/nanobind/executor/executorConfig.cpp (1)

468-536: Complex but well-structured serialization logic

The ExecutorConfig serialization/deserialization implementation correctly handles both C++ state and Python attributes using nb::inst_mark_ready(), enabling proper object lifecycle management.

cpp/tensorrt_llm/nanobind/bindings.cpp (16)

1-61: LGTM - Well-organized headers and namespace declarations.

The includes are comprehensive and properly organized, with appropriate namespace aliases for readability.


67-73: LGTM - Clean helper function implementation.

The anonymous namespace usage is appropriate for the file-local helper function.


82-109: LGTM - Comprehensive MpiComm bindings.

The static method bindings are properly implemented with appropriate lambda wrappers.


111-120: LGTM - Appropriate CudaStream binding.

The custom constructor properly handles the conversion from Python stream objects to CUDA stream handles.


122-133: LGTM - Well-organized submodule structure.

The hierarchical submodule organization and early initialization calls are appropriate.


135-155: LGTM - Comprehensive PeftCacheManagerConfig bindings.

The binding properly exposes all configuration options with appropriate defaults and optional parameter handling.


157-206: LGTM - Comprehensive enum bindings.

The enum bindings properly expose all values, and the from_string method for KVCacheType is a useful addition for parsing.


208-222: LGTM - Well-structured LoraModule bindings.

The read-only properties and static factory method are appropriately designed for this configuration class.


224-266: LGTM - Comprehensive QuantMode bindings.

The extensive static factory methods and operator overloads are appropriate for this quantization configuration class.


268-322: LGTM - Extensive ModelConfig bindings.

The comprehensive property exposure and proper handling of method overloads with nb::overload_cast are well-implemented.
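
For illustration, the disambiguation pattern mentioned here looks like this (sketch; the member function name is illustrative, not necessarily the real one):

// Hedged sketch: nb::overload_cast selects one overload of a const member.
.def("num_kv_heads", nb::overload_cast<SizeType32>(&tr::ModelConfig::getNbKvHeads, nb::const_))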


324-352: LGTM - Well-structured WorldConfig bindings.

The read-only properties and static MPI factory method are appropriately designed for this configuration class.


390-420: LGTM - Comprehensive SamplingConfig bindings.

The class bindings, vector binding, and helper function are well-implemented.


422-447: LGTM - Well-structured GptJsonConfig bindings.

The static parsing methods and read-only properties are appropriately designed for this configuration class.


449-468: LGTM - Appropriate enum and singleton bindings.

The LlmRequestState enum and MemoryCounters singleton binding are well-implemented.


470-481: LGTM - Well-organized submodule initialization.

The initialization calls are properly organized and follow a logical pattern.


483-495: LGTM - Appropriate NVLS bindings.

The IpcNvlsHandle bindings and global functions are well-implemented for NVLS memory management.

@tensorrt-cicd
Collaborator

PR_Github #12317 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #12315 [ run ] completed with state ABORTED

@DomBrown DomBrown self-requested a review July 18, 2025 13:58
@tensorrt-cicd
Collaborator

PR_Github #12317 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #9151 completed with status: 'SUCCESS'

This reverts commit b75e53a.

Signed-off-by: Linda-Stadter <[email protected]>
@Linda-Stadter
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #12382 [ run ] triggered by Bot

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

♻️ Duplicate comments (9)
cpp/tensorrt_llm/nanobind/executor/bindings.cpp (1)

246-256: Add validation for negative timeout values

The timeout conversion from double to milliseconds doesn't validate negative values. Negative timeouts could cause unexpected behavior.

             [](tle::KVCacheEventManager& self, std::optional<double> timeout_ms = std::nullopt)
             {
                 if (timeout_ms)
                 {
+                    if (*timeout_ms < 0)
+                    {
+                        throw std::invalid_argument("timeout_ms must be non-negative");
+                    }
                     return self.getLatestEvents(std::chrono::milliseconds(static_cast<int64_t>(*timeout_ms)));
                 }
                 return self.getLatestEvents(std::nullopt);
             },
cpp/tensorrt_llm/nanobind/common/customCasters.h (3)

104-120: Potential lifetime issue with OptionalRef construction

On line 118, creating an OptionalRef with a reference to the local conv object is unsafe. The conv object will be destroyed when the function returns, leaving a dangling reference.

This appears to be a fundamental design issue with how OptionalRef is being used. The caster should ensure the referenced object has appropriate lifetime guarantees. Consider if OptionalRef is the right type to use here, or if the binding needs to manage object lifetime differently.
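
One possible direction is to keep the converted object alive inside the caster itself so the OptionalRef target outlives from_python() (a heavily simplified sketch; the caster internals such as make_caster and its value access are assumptions about nanobind's detail API):

// Hedged sketch: store the sub-caster as a member, not a function-local.
template <typename T>
struct type_caster<OptionalRef<T>>
{
    nanobind::detail::make_caster<T> subCaster; // lives as long as the caster

    OptionalRef<T> value;

    bool from_python(nanobind::handle src, uint8_t flags, nanobind::detail::cleanup_list* cleanup) noexcept
    {
        if (src.is_none())
        {
            value = OptionalRef<T>(); // empty reference
            return true;
        }
        if (!subCaster.from_python(src, flags, cleanup))
        {
            return false;
        }
        // The reference now targets the member above, not a dead local.
        value = OptionalRef<T>(static_cast<T&>(subCaster));
        return true;
    }
};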


302-303: Use std::const_pointer_cast instead of reinterpret_cast for SharedConstPtr conversion

The current reinterpret_cast of ITensor::SharedConstPtr to ITensor::SharedPtr is undefined behavior—even though both are std::shared_ptr instantiations, they are distinct types. Replace it with std::const_pointer_cast to safely strip the const qualifier while preserving the control block.

-        return THPVariable_Wrap(tensorrt_llm::runtime::Torch::tensor(
-            reinterpret_cast<tensorrt_llm::runtime::ITensor::SharedPtr const&>(src)));
+        return THPVariable_Wrap(tensorrt_llm::runtime::Torch::tensor(
+            std::const_pointer_cast<tensorrt_llm::runtime::ITensor>(src)));

312-319: Add error handling for DLPack operations

The from_python method should handle potential failures in DLPack conversion.

 bool from_python(nb::handle src, uint8_t, cleanup_list*) noexcept
 {
-    nb::object capsule = nb::getattr(src, "__dlpack__")();
-    DLManagedTensor* dl_managed = static_cast<DLManagedTensor*>(PyCapsule_GetPointer(capsule.ptr(), "dltensor"));
-    PyCapsule_SetDestructor(capsule.ptr(), nullptr);
-    value = at::fromDLPack(dl_managed).alias();
-    return true;
+    try {
+        nb::object capsule = nb::getattr(src, "__dlpack__")();
+        if (!PyCapsule_CheckExact(capsule.ptr()))
+            return false;
+        DLManagedTensor* dl_managed = static_cast<DLManagedTensor*>(PyCapsule_GetPointer(capsule.ptr(), "dltensor"));
+        if (!dl_managed)
+            return false;
+        PyCapsule_SetDestructor(capsule.ptr(), nullptr);
+        value = at::fromDLPack(dl_managed).alias();
+        return true;
+    } catch (...) {
+        return false;
+    }
 }
cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.cpp (1)

409-417: Add validation for negative timeout values

Similar to the executor bindings, the timeout conversion should validate negative values.

             [](tbk::BaseKVCacheManager& self, std::optional<double> timeout_ms = std::nullopt)
             {
                 if (timeout_ms)
                 {
+                    if (*timeout_ms < 0)
+                    {
+                        throw std::invalid_argument("timeout_ms must be non-negative");
+                    }
                     return self.getLatestEvents(std::chrono::milliseconds(static_cast<int64_t>(*timeout_ms)));
                 }
                 return self.getLatestEvents(std::nullopt);
             },
cpp/tensorrt_llm/nanobind/batch_manager/bindings.cpp (1)

254-376: Constructor complexity needs to be addressed.

The 49-parameter constructor remains overly complex and error-prone. Consider implementing a builder pattern as previously suggested to improve maintainability and usability.

cpp/tensorrt_llm/nanobind/executor/executorConfig.cpp (2)

352-354: Fix the deserialization index error.

The fourth parameter incorrectly uses state[2] instead of state[3], causing incorrect deserialization.


584-584: Parameter naming inconsistency.

The constructor parameter mm_embedding_offloading is inconsistent with the property name prompt_table_offloading, which could confuse users.

Also applies to: 631-632

cpp/tensorrt_llm/nanobind/bindings.cpp (1)

362-389: Replace assert with proper runtime error handling.

The assert will be disabled in release builds, potentially causing undefined behavior if the tuple size is incorrect.

🧹 Nitpick comments (4)
cpp/tensorrt_llm/nanobind/batch_manager/llmRequest.cpp (2)

73-74: Simplify draft tokens handling

The creation of optDraftTokens seems redundant since mDraftTokens is always initialized (either with provided tokens or an empty vector according to the constructor). Consider simplifying this.

-    auto const draftTokens = std::make_shared<std::vector<TokenIdType>>(*mDraftTokens.get());
-    auto const optDraftTokens = std::optional<std::shared_ptr<std::vector<TokenIdType>>>(draftTokens);
+    auto const optDraftTokens = std::optional<std::shared_ptr<std::vector<TokenIdType>>>(mDraftTokens);

75-78: Consider avoiding unnecessary vector copy for encoder tokens

The current implementation creates a copy of the encoder tokens vector. Since the tokens are already managed by a shared_ptr in the member variable, consider reusing it directly.

-    auto const encoderInputTokens = mEncoderTokens.has_value()
-        ? std::make_shared<std::vector<TokenIdType>>(*mEncoderTokens.value().get())
-        : nullptr;
-    auto const optEncoderInputTokens = std::optional<std::shared_ptr<std::vector<TokenIdType>>>(encoderInputTokens);
+    auto const optEncoderInputTokens = mEncoderTokens;
cpp/tensorrt_llm/nanobind/batch_manager/llmRequest.h (2)

54-152: Consider design patterns to manage constructor complexity.

The constructor with 49 parameters is extremely complex and potentially error-prone. While this may be constrained by the underlying GenericLlmRequest API, consider these approaches:

  1. Builder pattern: Create a builder class to construct requests incrementally
  2. Parameter object pattern: Group related parameters into configuration structs
  3. Factory methods: Provide static factory methods for common use cases

The conditional shared_ptr creation logic is repetitive and could benefit from helper functions.

Example helper function to reduce repetition:

template<typename T>
static auto makeOptionalShared(std::optional<T>&& opt) {
    return opt.has_value() 
        ? std::make_optional(std::make_shared<T>(std::move(*opt)))
        : std::optional<std::shared_ptr<T>>(std::nullopt);
}

39-158: Overall design is appropriate for a binding layer with room for improvement.

The LlmRequest class effectively adapts the underlying GenericLlmRequest for nanobind Python bindings. Key strengths:

  • Proper integration with PyTorch tensors (at::Tensor)
  • Clear documentation of stream handling limitations
  • Appropriate adapter pattern implementation

Areas for potential improvement:

  • Constructor complexity could benefit from builder or factory patterns
  • Consider adding validation for parameter combinations
  • Enhanced documentation for the extensive parameter set would help users

The design appropriately handles the complexity inherited from the underlying C++ API while providing a clean Python interface.

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 4ff44cf and 53e389a.

📒 Files selected for processing (45)
  • cpp/CMakeLists.txt (2 hunks)
  • cpp/tensorrt_llm/nanobind/CMakeLists.txt (2 hunks)
  • cpp/tensorrt_llm/nanobind/batch_manager/algorithms.cpp (1 hunks)
  • cpp/tensorrt_llm/nanobind/batch_manager/algorithms.h (1 hunks)
  • cpp/tensorrt_llm/nanobind/batch_manager/bindings.cpp (1 hunks)
  • cpp/tensorrt_llm/nanobind/batch_manager/bindings.h (1 hunks)
  • cpp/tensorrt_llm/nanobind/batch_manager/cacheTransceiver.cpp (1 hunks)
  • cpp/tensorrt_llm/nanobind/batch_manager/cacheTransceiver.h (1 hunks)
  • cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.cpp (1 hunks)
  • cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.h (1 hunks)
  • cpp/tensorrt_llm/nanobind/batch_manager/llmRequest.cpp (1 hunks)
  • cpp/tensorrt_llm/nanobind/batch_manager/llmRequest.h (1 hunks)
  • cpp/tensorrt_llm/nanobind/bindings.cpp (2 hunks)
  • cpp/tensorrt_llm/nanobind/common/bindTypes.h (1 hunks)
  • cpp/tensorrt_llm/nanobind/common/customCasters.h (1 hunks)
  • cpp/tensorrt_llm/nanobind/executor/bindings.cpp (1 hunks)
  • cpp/tensorrt_llm/nanobind/executor/bindings.h (1 hunks)
  • cpp/tensorrt_llm/nanobind/executor/executor.cpp (1 hunks)
  • cpp/tensorrt_llm/nanobind/executor/executor.h (1 hunks)
  • cpp/tensorrt_llm/nanobind/executor/executorConfig.cpp (1 hunks)
  • cpp/tensorrt_llm/nanobind/executor/executorConfig.h (1 hunks)
  • cpp/tensorrt_llm/nanobind/executor/request.cpp (1 hunks)
  • cpp/tensorrt_llm/nanobind/executor/request.h (1 hunks)
  • cpp/tensorrt_llm/nanobind/runtime/bindings.cpp (1 hunks)
  • cpp/tensorrt_llm/nanobind/runtime/bindings.h (1 hunks)
  • cpp/tensorrt_llm/nanobind/runtime/moeBindings.cpp (1 hunks)
  • cpp/tensorrt_llm/nanobind/runtime/moeBindings.h (1 hunks)
  • cpp/tensorrt_llm/nanobind/testing/modelSpecBinding.cpp (1 hunks)
  • cpp/tensorrt_llm/nanobind/testing/modelSpecBinding.h (1 hunks)
  • cpp/tensorrt_llm/nanobind/userbuffers/bindings.cpp (1 hunks)
  • cpp/tensorrt_llm/nanobind/userbuffers/bindings.h (1 hunks)
  • cpp/tensorrt_llm/pybind/bindings.cpp (1 hunks)
  • cpp/tensorrt_llm/pybind/executor/bindings.cpp (1 hunks)
  • cpp/tensorrt_llm/pybind/executor/executorConfig.cpp (1 hunks)
  • examples/models/core/llama/summarize_long.py (1 hunks)
  • examples/models/core/qwen2audio/run.py (1 hunks)
  • examples/models/core/qwenvl/run.py (1 hunks)
  • jenkins/Build.groovy (4 hunks)
  • jenkins/L0_Test.groovy (3 hunks)
  • tensorrt_llm/builder.py (1 hunks)
  • tensorrt_llm/commands/build.py (2 hunks)
  • tensorrt_llm/runtime/model_runner.py (1 hunks)
  • tests/integration/test_lists/test-db/l0_a10.yml (1 hunks)
  • tests/unittest/bindings/test_bindings_ut.py (4 hunks)
  • tests/unittest/bindings/test_executor_bindings.py (10 hunks)
✅ Files skipped from review due to trivial changes (2)
  • cpp/tensorrt_llm/nanobind/userbuffers/bindings.cpp
  • cpp/tensorrt_llm/nanobind/runtime/bindings.cpp
🚧 Files skipped from review as they are similar to previous changes (35)
  • examples/models/core/qwenvl/run.py
  • tensorrt_llm/commands/build.py
  • tensorrt_llm/runtime/model_runner.py
  • cpp/tensorrt_llm/nanobind/batch_manager/algorithms.h
  • examples/models/core/qwen2audio/run.py
  • cpp/tensorrt_llm/nanobind/batch_manager/bindings.h
  • cpp/tensorrt_llm/nanobind/CMakeLists.txt
  • cpp/tensorrt_llm/nanobind/executor/bindings.h
  • cpp/tensorrt_llm/nanobind/userbuffers/bindings.h
  • cpp/tensorrt_llm/pybind/executor/executorConfig.cpp
  • tests/integration/test_lists/test-db/l0_a10.yml
  • cpp/tensorrt_llm/nanobind/runtime/moeBindings.h
  • cpp/CMakeLists.txt
  • cpp/tensorrt_llm/nanobind/executor/executorConfig.h
  • cpp/tensorrt_llm/pybind/executor/bindings.cpp
  • cpp/tensorrt_llm/nanobind/batch_manager/cacheTransceiver.h
  • cpp/tensorrt_llm/nanobind/runtime/bindings.h
  • cpp/tensorrt_llm/nanobind/testing/modelSpecBinding.h
  • cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.h
  • cpp/tensorrt_llm/nanobind/executor/request.h
  • tensorrt_llm/builder.py
  • cpp/tensorrt_llm/pybind/bindings.cpp
  • cpp/tensorrt_llm/nanobind/testing/modelSpecBinding.cpp
  • tests/unittest/bindings/test_executor_bindings.py
  • examples/models/core/llama/summarize_long.py
  • tests/unittest/bindings/test_bindings_ut.py
  • jenkins/Build.groovy
  • cpp/tensorrt_llm/nanobind/common/bindTypes.h
  • cpp/tensorrt_llm/nanobind/batch_manager/algorithms.cpp
  • jenkins/L0_Test.groovy
  • cpp/tensorrt_llm/nanobind/runtime/moeBindings.cpp
  • cpp/tensorrt_llm/nanobind/executor/request.cpp
  • cpp/tensorrt_llm/nanobind/executor/executor.cpp
  • cpp/tensorrt_llm/nanobind/batch_manager/cacheTransceiver.cpp
  • cpp/tensorrt_llm/nanobind/executor/executor.h
🧰 Additional context used
🧠 Learnings (2)
cpp/tensorrt_llm/nanobind/common/customCasters.h (1)
Learnt from: amitz-nv
PR: NVIDIA/TensorRT-LLM#5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.374Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks `is_adapter_in_cpu_cache()` and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
cpp/tensorrt_llm/nanobind/executor/executorConfig.cpp (1)
Learnt from: amitz-nv
PR: NVIDIA/TensorRT-LLM#5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.374Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks `is_adapter_in_cpu_cache()` and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
🧬 Code Graph Analysis (2)
cpp/tensorrt_llm/nanobind/executor/bindings.cpp (3)
cpp/tensorrt_llm/nanobind/executor/request.cpp (2)
  • initRequestBindings (53-933)
  • initRequestBindings (53-53)
cpp/tensorrt_llm/nanobind/executor/executorConfig.cpp (2)
  • initConfigBindings (46-637)
  • initConfigBindings (46-46)
cpp/tensorrt_llm/nanobind/executor/executor.cpp (2)
  • initBindings (200-239)
  • initBindings (200-200)
cpp/tensorrt_llm/nanobind/executor/executorConfig.cpp (2)
cpp/include/tensorrt_llm/runtime/decodingInput.h (1)
  • sinkTokenLength (47-47)
tensorrt_llm/llmapi/llm_args.py (3)
  • SchedulerConfig (614-632)
  • PeftCacheConfig (636-699)
  • ExtendedRuntimePerfKnobConfig (847-874)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (4)
cpp/tensorrt_llm/nanobind/batch_manager/bindings.cpp (1)

59-80: Exception handling is well-implemented.

The exception registration and translation mechanism properly converts C++ exceptions to Python exceptions, ensuring clean error propagation across the language boundary.

cpp/tensorrt_llm/nanobind/batch_manager/llmRequest.h (3)

18-33: Header structure and includes look good.

The includes are well-organized and appropriate for the nanobind binding functionality. The namespace aliases (nb for nanobind and tb for tensorrt_llm::batch_manager) improve readability.


35-53: Well-designed class with helpful documentation.

The inheritance from GenericLlmRequest<at::Tensor, c10::Stream> is appropriate, and the comment about torch's nanobind binding limitations with c10::cuda::CUDAStream provides valuable context. The type aliases improve code readability.


154-158: Method declarations are well-designed.

The callbackAdapter and toTrtLlm methods serve clear purposes in the binding layer:

  • callbackAdapter handles Python-to-C++ callback conversion
  • toTrtLlm provides type conversion to the underlying batch manager
  • Appropriate const-correctness and return types

@tensorrt-cicd
Collaborator

PR_Github #12382 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #9201 completed with status: 'FAILURE'

@Linda-Stadter
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #12389 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #12389 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #9208 completed with status: 'SUCCESS'

@DomBrown DomBrown merged commit 3efad2e into NVIDIA:main Jul 21, 2025
3 checks passed
timlee0212 pushed a commit to timlee0212/TensorRT-LLM that referenced this pull request Jul 21, 2025
NVShreyas pushed a commit to NVShreyas/TensorRT-LLM that referenced this pull request Jul 28, 2025
Signed-off-by: Linda-Stadter <[email protected]>
Signed-off-by: Shreyas Misra <[email protected]>
Ransiki pushed a commit to Ransiki/TensorRT-LLM that referenced this pull request Jul 29, 2025
Signed-off-by: Linda-Stadter <[email protected]>
Signed-off-by: Ransiki Zhang <[email protected]>