feat: nanobind bindings #6185
Conversation
Walkthrough

This update introduces comprehensive nanobind-based Python bindings for the TensorRT-LLM C++ runtime, executor, batch manager, and related modules, alongside supporting build system and test integration changes. It adds new C++ source and header files for nanobind bindings, modifies build scripts to support nanobind, and adapts Python and test code to use new enum parsing and binding mechanisms.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Python
    participant NanobindModule
    participant CppRuntime
    participant CppExecutor
    participant CppBatchManager
    Python->>NanobindModule: import tensorrt_llm.bindings
    Python->>NanobindModule: Create Executor, BatchManager, etc.
    NanobindModule->>CppExecutor: Initialize executor/config via bindings
    NanobindModule->>CppBatchManager: Initialize batch manager via bindings
    Python->>NanobindModule: Enqueue LLM request
    NanobindModule->>CppExecutor: enqueueRequest(...)
    CppExecutor-->>NanobindModule: Return request ID
    Python->>NanobindModule: Await responses
    NanobindModule->>CppExecutor: awaitResponses(...)
    CppExecutor-->>NanobindModule: Response/result objects
    NanobindModule-->>Python: Return Python-wrapped results
```
/bot run
PR_Github #12313 [ run ] triggered by Bot
Actionable comments posted: 17
🧹 Nitpick comments (10)
cpp/tensorrt_llm/nanobind/batch_manager/llmRequest.cpp (1)
54-68: Consider adding error handling for callback invocation.

The callback adapter implementation is correct, but consider wrapping the callback invocation in a try-catch block to handle potential Python exceptions gracefully.

```diff
 return [callback](RequestIdType reqId, tr::ITensor::SharedPtr& tensor, tb::LlmRequest::BeamTokens const& tokens,
     tr::BufferManager::CudaStreamPtr stream, std::optional<RequestIdType> clientId)
 {
-    at::Tensor atTensor = tr::Torch::tensor(tensor);
-    callback.value()(reqId, atTensor, tokens, runtime::TorchUtils::stream(*stream).unwrap(), clientId);
+    try
+    {
+        at::Tensor atTensor = tr::Torch::tensor(tensor);
+        callback.value()(reqId, atTensor, tokens, runtime::TorchUtils::stream(*stream).unwrap(), clientId);
+    }
+    catch (const std::exception& e)
+    {
+        // Log error or rethrow with context
+        throw std::runtime_error("Error in logits post-processor callback: " + std::string(e.what()));
+    }
 };
```

cpp/tensorrt_llm/nanobind/common/bindTypes.h (1)
30-60: Consider returning values from pop operations.

The `pop_back` and `pop_front` operations don't return values, which differs from Python's list behavior. Consider enhancing these to match Python semantics.

```diff
-    .def("pop_back", [](T& lst) { lst.pop_back(); })
+    .def("pop_back",
+        [](T& lst)
+        {
+            if (lst.empty())
+                throw nb::index_error("pop from empty list");
+            auto value = lst.back();
+            lst.pop_back();
+            return value;
+        })
-    .def("pop_front", [](T& lst) { lst.pop_front(); })
+    .def("pop_front",
+        [](T& lst)
+        {
+            if (lst.empty())
+                throw nb::index_error("pop from empty list");
+            auto value = lst.front();
+            lst.pop_front();
+            return value;
+        })
```

cpp/tensorrt_llm/nanobind/executor/bindings.cpp (1)
245-256: Consider validating timeout_ms range before conversion

The conversion from double to int64_t could potentially overflow or lose precision for very large timeout values. Consider adding validation to ensure the timeout is within a reasonable range.

```diff
 .def(
     "get_latest_events",
     [](tle::KVCacheEventManager& self, std::optional<double> timeout_ms = std::nullopt)
     {
         if (timeout_ms)
         {
+            // Validate timeout range to prevent overflow
+            if (*timeout_ms < 0 || *timeout_ms > static_cast<double>(std::numeric_limits<int64_t>::max()))
+            {
+                throw std::invalid_argument("timeout_ms must be non-negative and within int64_t range");
+            }
             return self.getLatestEvents(std::chrono::milliseconds(static_cast<int64_t>(*timeout_ms)));
         }
         return self.getLatestEvents(std::nullopt);
     },
     nb::arg("timeout_ms") = std::nullopt);
```

cpp/tensorrt_llm/nanobind/batch_manager/llmRequest.h (1)
54-152: Consider refactoring the 49-parameter constructor

The constructor with 49 parameters is extremely complex and could be difficult to maintain. This could lead to:
- Easy mistakes when calling the constructor
- Difficulty in adding new parameters
- Poor readability
Consider using a builder pattern or grouping related parameters into configuration structs. For example:
- Group sampling-related parameters into a struct
- Group multimodal-related parameters into a struct
- Group LoRA-related parameters into a struct
This would make the API more maintainable and less error-prone.
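In Python terms (where this binding's constructor is ultimately called), the grouping idea can be sketched as follows. All names here (`RequestSpec`, `SamplingParams`, `LoraParams`, and their fields) are hypothetical illustrations of the pattern, not the actual TensorRT-LLM API:

```python
# Sketch: group a wide, flat parameter list into small config objects so
# callers name only the groups they care about. Names are hypothetical.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class SamplingParams:
    beam_width: int = 1
    temperature: float = 1.0
    top_k: Optional[int] = None

@dataclass
class LoraParams:
    task_id: Optional[int] = None
    weights: Optional[object] = None

@dataclass
class RequestSpec:
    request_id: int
    input_tokens: List[int]
    max_new_tokens: int
    sampling: SamplingParams = field(default_factory=SamplingParams)
    lora: LoraParams = field(default_factory=LoraParams)

# Instead of 49 positional arguments, the call site names its groups:
spec = RequestSpec(request_id=1, input_tokens=[10, 11], max_new_tokens=8,
                   sampling=SamplingParams(beam_width=4))
```

The same structure translates to C++ as aggregate config structs passed to the constructor, which keeps call sites readable and lets new parameters be added without breaking existing callers.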
cpp/tensorrt_llm/nanobind/common/customCasters.h (3)
68-80: Consider adding error handling for sequence operations.

The `from_python` method doesn't verify that the source handle is actually a sequence before creating the `sequence` object. Consider adding a check to ensure the input is iterable.

```diff
 bool from_python(handle src, uint8_t flags, cleanup_list* cleanup) noexcept
 {
+    if (!PySequence_Check(src.ptr()))
+        return false;
     sequence seq(src, nanobind::detail::borrow_t{});
     value.clear();
```
147-154: Improve error handling in from_cpp method.

The `from_cpp` method should check that the module import succeeds before using it.

```diff
 static handle from_cpp(T const& path, rv_policy, cleanup_list* cleanup)
 {
     if (auto py_str = unicode_from_fs_native(path.native()))
     {
-        return module_::import_("pathlib").attr("Path")(steal<object>(py_str), cleanup).release();
+        auto pathlib = module_::import_("pathlib");
+        if (!pathlib)
+            return nullptr;
+        return pathlib.attr("Path")(steal<object>(py_str)).release();
     }
     return nullptr;
 }
```
201-207: Consider adding validation for CUDA stream pointer.

The conversion from Python integer to CUDA stream pointer doesn't validate that the pointer is valid. While this matches common CUDA Python binding patterns, consider adding a debug assertion or validation.

```diff
 bool from_python([[maybe_unused]] handle src, uint8_t flags, cleanup_list* cleanup)
 {
     auto stream_ptr = nanobind::cast<uintptr_t>(src);
+    // Optionally validate the stream pointer in debug builds
+    #ifdef DEBUG
+    if (stream_ptr != 0)
+    {
+        cudaStream_t stream = reinterpret_cast<cudaStream_t>(stream_ptr);
+        // Could add cudaStreamQuery to validate if needed
+    }
+    #endif
     value = std::make_shared<tensorrt_llm::runtime::CudaStream>(reinterpret_cast<cudaStream_t>(stream_ptr));
     return true;
 }
```

cpp/tensorrt_llm/nanobind/batch_manager/bindings.cpp (2)
253-331: Consider documenting the tensor dimension requirements.

The constructor correctly handles tensor conversions with special unsqueezing for specific tensors. Consider adding documentation about which tensors require unsqueezing and why.

Add a comment explaining the unsqueeze behavior:

```diff
 auto makeOptionalTensor = [](std::optional<at::Tensor> const& atTensor, bool unsqueeze = false)
 {
+    // Some tensors (embedding_bias, bad_words_list, stop_words_list) require an extra batch dimension
     std::optional<tb::LlmRequest::TensorPtr> tensorPtr = std::nullopt;
```
445-523: Verify empty request handling.

The function correctly handles context and generation requests with beam search support. However, consider adding validation for empty request vectors; if both are empty, `activeSlots` remains empty and downstream code may misbehave.

Add validation at the beginning:

```diff
+ if (contextRequests.empty() && genRequests.empty())
+ {
+     throw std::runtime_error("Both context and generation request vectors are empty");
+ }
+
  std::vector<int> activeSlots;
```

cpp/tensorrt_llm/nanobind/bindings.cpp (1)
80-80: Consider documenting why leak warnings are disabled.

Disabling leak warnings might hide legitimate memory leaks. Consider adding a comment explaining why this is necessary, or enable it in debug builds to catch potential issues during development.
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (49)
- cpp/CMakeLists.txt (2 hunks)
- cpp/include/tensorrt_llm/batch_manager/runtimeBuffers.h (1 hunks)
- cpp/tensorrt_llm/batch_manager/runtimeBuffers.cpp (1 hunks)
- cpp/tensorrt_llm/nanobind/CMakeLists.txt (2 hunks)
- cpp/tensorrt_llm/nanobind/batch_manager/algorithms.cpp (1 hunks)
- cpp/tensorrt_llm/nanobind/batch_manager/algorithms.h (1 hunks)
- cpp/tensorrt_llm/nanobind/batch_manager/bindings.cpp (1 hunks)
- cpp/tensorrt_llm/nanobind/batch_manager/bindings.h (1 hunks)
- cpp/tensorrt_llm/nanobind/batch_manager/buffers.cpp (1 hunks)
- cpp/tensorrt_llm/nanobind/batch_manager/buffers.h (1 hunks)
- cpp/tensorrt_llm/nanobind/batch_manager/cacheTransceiver.cpp (1 hunks)
- cpp/tensorrt_llm/nanobind/batch_manager/cacheTransceiver.h (1 hunks)
- cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.cpp (1 hunks)
- cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.h (1 hunks)
- cpp/tensorrt_llm/nanobind/batch_manager/llmRequest.cpp (1 hunks)
- cpp/tensorrt_llm/nanobind/batch_manager/llmRequest.h (1 hunks)
- cpp/tensorrt_llm/nanobind/bindings.cpp (2 hunks)
- cpp/tensorrt_llm/nanobind/common/bindTypes.h (1 hunks)
- cpp/tensorrt_llm/nanobind/common/customCasters.h (1 hunks)
- cpp/tensorrt_llm/nanobind/executor/bindings.cpp (1 hunks)
- cpp/tensorrt_llm/nanobind/executor/bindings.h (1 hunks)
- cpp/tensorrt_llm/nanobind/executor/executor.cpp (1 hunks)
- cpp/tensorrt_llm/nanobind/executor/executor.h (1 hunks)
- cpp/tensorrt_llm/nanobind/executor/executorConfig.cpp (1 hunks)
- cpp/tensorrt_llm/nanobind/executor/executorConfig.h (1 hunks)
- cpp/tensorrt_llm/nanobind/executor/request.cpp (1 hunks)
- cpp/tensorrt_llm/nanobind/executor/request.h (1 hunks)
- cpp/tensorrt_llm/nanobind/runtime/bindings.cpp (1 hunks)
- cpp/tensorrt_llm/nanobind/runtime/bindings.h (1 hunks)
- cpp/tensorrt_llm/nanobind/runtime/moeBindings.cpp (1 hunks)
- cpp/tensorrt_llm/nanobind/runtime/moeBindings.h (1 hunks)
- cpp/tensorrt_llm/nanobind/testing/modelSpecBinding.cpp (1 hunks)
- cpp/tensorrt_llm/nanobind/testing/modelSpecBinding.h (1 hunks)
- cpp/tensorrt_llm/nanobind/userbuffers/bindings.cpp (1 hunks)
- cpp/tensorrt_llm/nanobind/userbuffers/bindings.h (1 hunks)
- cpp/tensorrt_llm/pybind/bindings.cpp (1 hunks)
- cpp/tensorrt_llm/pybind/executor/bindings.cpp (1 hunks)
- cpp/tensorrt_llm/pybind/executor/executorConfig.cpp (1 hunks)
- examples/models/core/llama/summarize_long.py (1 hunks)
- examples/models/core/qwen2audio/run.py (1 hunks)
- examples/models/core/qwenvl/run.py (1 hunks)
- jenkins/Build.groovy (4 hunks)
- jenkins/L0_Test.groovy (3 hunks)
- tensorrt_llm/builder.py (1 hunks)
- tensorrt_llm/commands/build.py (2 hunks)
- tensorrt_llm/runtime/model_runner.py (1 hunks)
- tests/integration/test_lists/test-db/l0_a10.yml (1 hunks)
- tests/unittest/bindings/test_bindings_ut.py (4 hunks)
- tests/unittest/bindings/test_executor_bindings.py (10 hunks)
🧰 Additional context used
🧠 Learnings (5)
tests/integration/test_lists/test-db/l0_a10.yml (1)
Learnt from: amitz-nv
PR: NVIDIA/TensorRT-LLM#5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.374Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks `is_adapter_in_cpu_cache()` and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
cpp/tensorrt_llm/pybind/executor/executorConfig.cpp (1)
Learnt from: amitz-nv
PR: NVIDIA/TensorRT-LLM#5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.374Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks `is_adapter_in_cpu_cache()` and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
cpp/tensorrt_llm/nanobind/executor/executor.cpp (1)
Learnt from: amitz-nv
PR: NVIDIA/TensorRT-LLM#5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.374Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks `is_adapter_in_cpu_cache()` and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
cpp/tensorrt_llm/nanobind/executor/executorConfig.cpp (1)
Learnt from: amitz-nv
PR: NVIDIA/TensorRT-LLM#5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.374Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks `is_adapter_in_cpu_cache()` and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
cpp/tensorrt_llm/nanobind/batch_manager/cacheTransceiver.cpp (1)
Learnt from: amitz-nv
PR: NVIDIA/TensorRT-LLM#5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.374Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks `is_adapter_in_cpu_cache()` and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
🧬 Code Graph Analysis (13)
cpp/include/tensorrt_llm/batch_manager/runtimeBuffers.h (2)
- cpp/include/tensorrt_llm/batch_manager/transformerBuffers.h (1): TransformerBuffers (40-143)
- cpp/tensorrt_llm/batch_manager/transformerBuffers.cpp (1): TransformerBuffers (41-142)

cpp/tensorrt_llm/pybind/bindings.cpp (1)
- cpp/include/tensorrt_llm/runtime/modelConfig.h (1): KVCacheTypeFromString (76-94)

cpp/tensorrt_llm/nanobind/testing/modelSpecBinding.h (2)
- cpp/tensorrt_llm/nanobind/batch_manager/llmRequest.h (1): tensorrt_llm (30-158)
- cpp/tensorrt_llm/nanobind/batch_manager/bindings.cpp (2): initBindings (55-523), initBindings (55-55)

cpp/tensorrt_llm/nanobind/executor/bindings.h (8)
- cpp/tensorrt_llm/nanobind/batch_manager/algorithms.h (1): tensorrt_llm (24-29)
- cpp/tensorrt_llm/nanobind/batch_manager/bindings.h (1): tensorrt_llm (23-28)
- cpp/tensorrt_llm/nanobind/executor/executorConfig.h (1): tensorrt_llm (24-30)
- cpp/tensorrt_llm/nanobind/executor/executor.h (1): tensorrt_llm (27-127)
- cpp/tensorrt_llm/nanobind/executor/request.h (1): tensorrt_llm (23-29)
- cpp/tensorrt_llm/nanobind/runtime/bindings.h (1): tensorrt_llm (24-30)
- cpp/tensorrt_llm/nanobind/executor/bindings.cpp (2): initBindings (48-261), initBindings (48-48)
- cpp/tensorrt_llm/nanobind/executor/executor.cpp (2): initBindings (200-239), initBindings (200-200)

cpp/tensorrt_llm/nanobind/executor/request.h (2)
- cpp/tensorrt_llm/nanobind/executor/bindings.h (1): tensorrt_llm (23-29)
- cpp/tensorrt_llm/nanobind/executor/request.cpp (2): initRequestBindings (53-933), initRequestBindings (53-53)

examples/models/core/qwenvl/run.py (1)
- tensorrt_llm/runtime/generation.py (1): kv_cache_type (1169-1170)

examples/models/core/qwen2audio/run.py (1)
- tensorrt_llm/runtime/generation.py (1): kv_cache_type (1169-1170)

cpp/tensorrt_llm/nanobind/userbuffers/bindings.h (5)
- cpp/tensorrt_llm/nanobind/batch_manager/buffers.h (1): tensorrt_llm (22-29)
- cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.h (2): tensorrt_llm (23-30), tensorrt_llm (32-39)
- cpp/tensorrt_llm/nanobind/executor/executor.h (1): tensorrt_llm (27-127)
- cpp/tensorrt_llm/nanobind/batch_manager/bindings.cpp (2): initBindings (55-523), initBindings (55-55)
- cpp/tensorrt_llm/nanobind/executor/bindings.cpp (2): initBindings (48-261), initBindings (48-48)

cpp/tensorrt_llm/nanobind/batch_manager/llmRequest.cpp (2)
- cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.cpp (2): from_torch (54-61), from_torch (54-54)
- cpp/include/tensorrt_llm/batch_manager/llmRequest.h (1): mNumReturnSequences (1973-1973)

cpp/tensorrt_llm/nanobind/testing/modelSpecBinding.cpp (2)
- cpp/include/tensorrt_llm/runtime/modelConfig.h (3): useGptAttentionPlugin (293-296), setKVCacheType (586-589), useLoraPlugin (540-543)
- cpp/include/tensorrt_llm/batch_manager/llmRequest.h (1): setDraftTokens (1092-1095)

cpp/tensorrt_llm/nanobind/runtime/moeBindings.cpp (1)
- cpp/tensorrt_llm/nanobind/executor/executor.cpp (2): shutdown (189-198), shutdown (189-189)

cpp/tensorrt_llm/nanobind/executor/executorConfig.cpp (2)
- cpp/include/tensorrt_llm/runtime/decodingInput.h (1): sinkTokenLength (47-47)
- tensorrt_llm/llmapi/llm_args.py (3): SchedulerConfig (614-632), PeftCacheConfig (636-699), ExtendedRuntimePerfKnobConfig (847-874)

cpp/tensorrt_llm/nanobind/batch_manager/bindings.cpp (3)
- cpp/include/tensorrt_llm/batch_manager/llmRequest.h (17): addNewToken (684-691), addNewTokens (696-707), setGeneratedTokens (729-746), pause (786-831), setMaxSentTokenLen (844-847), setDraftLogits (1097-1100), setLoraWeights (909-912), setDecodingIter (1666-1669), setLogProbs (1020-1024), setReturnEncoderOutput (1141-1144), setCumLogProb (1031-1034), setFinishedReason (1661-1664), allocContextLogitsHost (1311-1315), setIsDummyRequest (1830-1833), finishByReason (1760-1780), setFirstScheduledTime (1238-1244), updatePerfMetrics (1801-1818)
- tests/unittest/bindings/test_executor_bindings.py (1): logits_post_processor (1826-1832)
- tensorrt_llm/functional.py (1): slice (1222-1324)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (71)
examples/models/core/qwenvl/run.py (1)
121-122: LGTM: Standardized KVCacheType instantiation

The change from the direct constructor `KVCacheType(config["build_config"]["kv_cache_type"])` to `KVCacheType.from_string(config["build_config"]["kv_cache_type"])` aligns with the codebase-wide standardization effort. This approach is more explicit and consistent with the string-based parsing pattern used throughout the project.

cpp/CMakeLists.txt (2)
201-204: LGTM: Extended pybind11 support for nanobind transition

The addition of the `OR BUILD_DEEP_EP` condition appropriately extends pybind11 support beyond just the "pybind" binding type. This enables the build system to include pybind11 dependencies when needed for the nanobind migration, ensuring compatibility during the transition period.
220-222: LGTM: Consistent pybind11 include directory handling

The condition extension matches the earlier change at line 201, ensuring consistent inclusion of pybind11 headers when either `BINDING_TYPE` is "pybind" or `BUILD_DEEP_EP` is enabled. This maintains build system consistency.

cpp/include/tensorrt_llm/batch_manager/runtimeBuffers.h (1)
171-171: Ownership change from unique_ptr to shared_ptr is applied correctly

Verified that all uses of `transformerBuffers` now uniformly employ `std::shared_ptr` with proper nullptr checks (e.g. `if (transformerBuffers)`), that the instance is created via `std::make_shared` and will be cleaned up automatically, and that `shared_ptr`'s atomic reference counting ensures no leaks. No code paths remain using the old pointer type. Note that while `shared_ptr`'s ref-counting is thread-safe, the `TransformerBuffers` class itself remains non-synchronized; ensure you don't concurrently mutate the same instance across threads without external locking.

examples/models/core/llama/summarize_long.py (1)
100-100: LGTM: Consistent KVCacheType instantiation pattern

The change to use `KVCacheType.from_string(build_config['kv_cache_type'])` instead of the direct constructor follows the same standardization pattern seen across the codebase. This improves consistency and makes the string-based parsing explicit.

cpp/tensorrt_llm/nanobind/batch_manager/algorithms.h (1)
1-30: LGTM: Clean nanobind algorithms header structure

The header file follows good C++ practices with:
- Proper copyright header and license information
- Include guards using `#pragma once`
- Clean namespace organization matching the module structure
- Simple, focused interface with a single `initBindings` function declaration
- Consistent with the nanobind binding pattern used throughout the codebase
tensorrt_llm/commands/build.py (2)
41-56: Well-designed enum parsing helper function.

The `enum_type` function provides robust enum parsing with proper error handling and informative error messages. It correctly handles both enum instances and string representations via the `from_string` method, which aligns with the broader enum standardization efforts across the codebase.
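A minimal sketch of the shape such a helper can take (the real `enum_type` in `tensorrt_llm/commands/build.py` may differ in detail; `KVCacheType` here is a stand-in enum with a `from_string` classmethod, not the actual binding):

```python
# Sketch: an argparse-friendly parser factory for enums that support
# from_string(), accepting either an enum instance or its string name.
import argparse
from enum import Enum

class KVCacheType(Enum):
    CONTINUOUS = 0
    PAGED = 1

    @classmethod
    def from_string(cls, s):
        try:
            return cls[s.upper()]
        except KeyError:
            raise ValueError(f"Unknown KVCacheType: {s}")

def enum_type(enum_cls):
    def parse(value):
        if isinstance(value, enum_cls):
            return value  # already an enum instance, pass through
        try:
            return enum_cls.from_string(value)
        except ValueError as e:
            # argparse turns this into a clean CLI error message
            raise argparse.ArgumentTypeError(str(e))
    return parse

parser = argparse.ArgumentParser()
parser.add_argument("--kv_cache_type", type=enum_type(KVCacheType),
                    default=KVCacheType.PAGED)
args = parser.parse_args(["--kv_cache_type", "paged"])
```

Raising `argparse.ArgumentTypeError` rather than a bare `ValueError` is what makes the CLI report an informative message instead of a traceback.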
151-151: Consistent enum parsing implementation.

The usage of `enum_type(KVCacheType)` is consistent with the broader pattern of standardizing enum construction using the `from_string` method across the codebase.

cpp/tensorrt_llm/nanobind/executor/bindings.h (1)
1-30: Consistent nanobind binding header implementation.

The header follows the established pattern used throughout the project for nanobind binding declarations. The function signature and namespace usage are consistent with other binding headers like `batch_manager/bindings.h` and `runtime/bindings.h`.

cpp/tensorrt_llm/nanobind/executor/executorConfig.h (1)
1-31: Consistent executor configuration binding header.

The header follows the same established pattern as other nanobind binding headers in the project. The function name `initConfigBindings` clearly differentiates it from the general `initBindings` function, maintaining good separation of concerns.

tests/integration/test_lists/test-db/l0_a10.yml (1)
193-207: Appropriate nanobind test configuration addition.

The new `l0_a10_nanobind` test configuration is well-structured and consistent with existing test conditions. It appropriately targets the `unittest/bindings` test suite for nanobind testing on A10 GPUs, which aligns with the PR's objective of enabling nanobind in CI.

examples/models/core/qwen2audio/run.py (1)
125-126: Consistent enum construction standardization.

The change to use `KVCacheType.from_string()` aligns with the broader enum standardization effort across the codebase. This provides more robust string-to-enum conversion and is consistent with similar changes in other files like `tensorrt_llm/builder.py` and `tensorrt_llm/runtime/model_runner.py`.

cpp/tensorrt_llm/nanobind/batch_manager/buffers.h (1)
1-30: LGTM! Well-structured nanobind bindings header.

This header file follows the established patterns for nanobind binding headers in the codebase. The structure is clean with proper licensing, include guards, namespace organization, and a simple class declaration for binding initialization.
tensorrt_llm/runtime/model_runner.py (1)
89-89: LGTM! Consistent enum initialization pattern.

The change from direct constructor usage to the `from_string` method aligns with the standardization effort across the codebase for `KVCacheType` enum initialization. This provides more explicit and consistent enum parsing.

cpp/tensorrt_llm/nanobind/batch_manager/bindings.h (1)
1-29: LGTM! Clean and consistent nanobind bindings header.

This header follows the established patterns for nanobind binding headers with proper licensing, include guards, namespace organization, and a simple function declaration for binding initialization. The structure is consistent with other binding headers in the codebase.
cpp/tensorrt_llm/batch_manager/runtimeBuffers.cpp (1)
87-88: LGTM! Ownership change supports nanobind bindings.

The change from `std::make_unique` to `std::make_shared` for `transformerBuffers` is appropriate for enabling shared ownership between the C++ runtime and Python bindings. This aligns with the broader nanobind integration effort and follows good practices for exposing C++ objects to Python.

cpp/tensorrt_llm/nanobind/executor/request.h (1)
1-30: LGTM! Well-organized executor request bindings header.

This header follows the established patterns for nanobind binding headers with proper licensing, include guards, and namespace organization. The `initRequestBindings` function declaration fits well into the overall executor binding framework and provides a clean interface for initializing request-related Python bindings.

cpp/tensorrt_llm/nanobind/testing/modelSpecBinding.h (1)
1-30: LGTM! Header follows established nanobind binding patterns.

The header file is well-structured and consistent with other nanobind binding headers in the codebase. The use of `#pragma once`, proper namespace organization, and the standard `initBindings` function declaration all follow established patterns.

cpp/tensorrt_llm/nanobind/runtime/bindings.h (1)
1-31: LGTM! Well-structured header with early initialization support.

The header file follows the established nanobind binding patterns and includes both `initBindings` and `initBindingsEarly` functions. The early initialization pattern is commonly used in binding scenarios where certain types need to be registered before dependent types.

tensorrt_llm/builder.py (1)
596-596: LGTM! Standardizes enum construction using the `from_string` method.

This change aligns with the broader migration to nanobind and standardizes enum parsing across the codebase. Using `KVCacheType.from_string()` is more explicit about string parsing and provides better error handling compared to direct construction.

cpp/tensorrt_llm/nanobind/CMakeLists.txt (5)
6-22: LGTM! Comprehensive source file additions for nanobind bindings.

The addition of source files across batch_manager, executor, runtime, testing, and userbuffers components properly supports the extensive nanobind bindings being introduced. The organization by component is logical and maintainable.
33-36: LGTM! Appropriate conditional linking for NVSHMEM.

The conditional linking of NVSHMEM libraries only when enabled is a good practice that avoids unnecessary dependencies.
40-46: LGTM! Addition of CUDA NVML library is appropriate.

The addition of `${CUDA_NVML_LIB}` to the target link libraries is appropriate for GPU operations and monitoring capabilities.
56-56: LGTM! Extended rpath flags for CUDA toolkit stubs.

The addition of the CUDA toolkit stubs directory to the rpath is appropriate for proper linking and runtime library resolution.
49-49: Ensure the correct detailed-error macro for nanobind

I didn't find any references to either `NB_DETAILED_ERROR_MESSAGES` or `PYBIND11_DETAILED_ERROR_MESSAGES` in the repo. Since this target is built with `nanobind_add_module`, please confirm whether nanobind supports the `PYBIND11_DETAILED_ERROR_MESSAGES` flag or if it should instead use the nanobind-specific `NB_DETAILED_ERROR_MESSAGES`.

• File: cpp/tensorrt_llm/nanobind/CMakeLists.txt (around line 49)
cpp/tensorrt_llm/nanobind/userbuffers/bindings.h (1)
1-31: LGTM! Header follows established nanobind binding patterns.

The header file is well-structured and consistent with other nanobind binding headers in the codebase. The use of the `UserBufferBindings` class with a static `initBindings` method follows the established pattern used throughout the nanobind bindings.

cpp/tensorrt_llm/nanobind/runtime/moeBindings.h (1)
1-30: LGTM! Well-structured header file.

The header file follows best practices with proper license headers, include guards, and a clean API design. The function declaration is appropriately placed in the `tensorrt_llm::nanobind::runtime` namespace and follows the established pattern for nanobind binding initialization.

cpp/tensorrt_llm/pybind/bindings.cpp (1)
173-173: LGTM! Aligns enum construction with nanobind patterns.

The change from constructor-based enum creation to a static `from_string` method creates consistency between pybind11 and nanobind binding styles, as mentioned in the PR objectives. This uniform API approach will facilitate the migration and provide a consistent interface for users.

cpp/tensorrt_llm/nanobind/batch_manager/cacheTransceiver.h (1)
1-30: LGTM! Consistent header file structure.

The header file follows the established pattern for nanobind binding headers with proper license headers, include guards, and a clean API design. The `CacheTransceiverBindings` class is appropriately placed in the `tensorrt_llm::batch_manager` namespace and follows the binding initialization pattern.

cpp/tensorrt_llm/pybind/executor/executorConfig.cpp (1)
339-339: LGTM! Critical bug fix for deserialization.

The change from `state[2]` to `state[3]` correctly fixes the tuple element indexing for the fourth parameter in the `ExtendedRuntimePerfKnobConfig` deserialization. This aligns with the expected tuple size of 4 elements and ensures proper restoration of the `SizeType32` parameter during unpickling.

cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.h (1)
1-40: LGTM! Well-organized header file with proper namespace separation.

The header file provides a clean API for both KV cache manager and PEFT cache manager bindings. The namespace separation between `tensorrt_llm::batch_manager::kv_cache_manager` and `tensorrt_llm::batch_manager` appropriately reflects the different functional areas while maintaining the established pattern for nanobind binding initialization.

cpp/tensorrt_llm/pybind/executor/bindings.cpp (1)
247-257: LGTM! Interface improvement with clearer timeout semantics.

The lambda wrapper approach provides better clarity by explicitly naming the parameter `timeout_ms` to indicate the expected unit. The conversion from `double` to `std::chrono::milliseconds` is correctly implemented with proper handling of the optional parameter.

tests/unittest/bindings/test_bindings_ut.py (3)
313-314: LGTM! Appropriate test skip for partial nanobind migration.

The conditional skip is correctly implemented and aligns with the PR objectives indicating that some tests are not yet supported in the nanobind migration. The clear reason message helps maintainers understand why the test is skipped.
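The conditional-skip pattern looks roughly like the following, shown here with the stdlib `unittest` equivalent of the pytest skip decorator used in the suite; the `NANOBIND` flag is a stand-in for however the tests actually detect the binding type:

```python
# Sketch: skip a binding-dependent test with a clear reason, so the
# suite stays green while the nanobind migration is incomplete.
import unittest

# Stand-in flag; the real suite derives this from the built bindings.
NANOBIND = True

class TestBindings(unittest.TestCase):
    @unittest.skipIf(NANOBIND, "Test not supported for nanobind yet")
    def test_pickle_roundtrip(self):
        self.fail("would not run when bindings are nanobind")

# Running the case records a skip instead of a failure.
result = unittest.TestResult()
unittest.defaultTestLoader.loadTestsFromTestCase(TestBindings).run(result)
```

The reason string surfaces in the test report, which is what makes skips auditable once the migration is finished.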
424-425: LGTM! Consistent test skip approach.

The skip decorator follows the same pattern as other nanobind-related test skips, maintaining consistency across the test suite during the migration period.
505-506: LGTM! Proper conditional test skip.

The skip condition is correctly implemented for the nanobind migration, ensuring test stability while the migration is completed.
jenkins/L0_Test.groovy (4)
67-68: LGTM - Clean addition of nanobind configuration constant.

The new `NANOBIND_CONFIG` field follows the existing pattern for other build configurations.
77-77: LGTM - Consistent addition to build configurations map.

The nanobind tarball entry follows the established naming convention and structure.
1731-1731: LGTM - Appropriate test configuration for nanobind.

The A10-Nanobind test configuration uses the correct test list `l0_a10_nanobind` and follows the existing pattern for single GPU tests.
1808-1810: LGTM - Clean logic for nanobind configuration detection.

The conditional logic properly detects "Nanobind" in stage names and assigns the appropriate configuration, following the same pattern as other build configurations.
cpp/tensorrt_llm/nanobind/userbuffers/bindings.cpp (1)
30-46: LGTM - Clean nanobind binding implementation.

The binding structure follows nanobind best practices with proper namespace usage and function definitions. The use of lambda functions for type conversions is appropriate.
cpp/tensorrt_llm/nanobind/testing/modelSpecBinding.cpp (3)
35-44: LGTM - Well-structured enum bindings with clear documentation. The enum bindings properly use `nb::is_arithmetic()` and include descriptive strings for each value, which improves the Python API usability.
46-85: LGTM - Comprehensive ModelSpec bindings with proper memory management. The class bindings are well-structured with:
- Proper constructor binding
- Consistent use of `nb::rv_policy::reference_internal` for methods returning references
- Clear method naming that follows Python conventions
- Proper copy constructor implementation
The return value policy ensures that objects returned by methods maintain proper lifetime relationships with the parent object.
69-69: Note: Inconsistent return value policy. The `use_logits` method doesn't specify a return value policy while most other similar methods use `nb::rv_policy::reference_internal`. This might be intentional if it returns by value rather than reference. Consider verifying that this is the intended behavior based on the actual method signature.
tests/unittest/bindings/test_executor_bindings.py (3)
17-17: LGTM! The import is necessary for checking the binding type to conditionally skip tests.
488-489: Expected test skips for partial nanobind migration. These test skips are consistent with the PR's partial migration status. The skip decorator is correctly applied with a clear reason.
Also applies to: 694-695, 1120-1121, 1159-1160, 1507-1508, 1881-1882, 2157-2158
2481-2483: Good improvement: Using enum instead of string. The change from string `"UCX"` to enum `trtllm.CacheTransceiverBackendType.UCX` improves type safety and API clarity.
cpp/tensorrt_llm/nanobind/batch_manager/llmRequest.cpp (2)
43-51: LGTM! The utility function correctly handles optional tensor conversion, consistent with the pattern used in `kvCacheManager.cpp`.
70-131: Document why `returnPerfMetrics` is hardcoded to false. The implementation correctly converts all fields, but `returnPerfMetrics` is hardcoded to `false` on line 125. If performance metrics aren't supported in the nanobind bindings yet, please add a comment explaining this limitation.
- false, // returnPerfMetrics
+ false, // returnPerfMetrics - TODO: Not yet supported in nanobind
cpp/tensorrt_llm/nanobind/batch_manager/buffers.cpp (3)
43-78: Well-structured TransformerBuffers bindings. The bindings comprehensively expose the TransformerBuffers interface with proper named arguments and read-write properties.
88-91: Good handling of shared_ptr property. The custom getter/setter for `transformerBuffers` properly handles the shared_ptr ownership semantics.
102-106: Members Are Publicly Declared - No Action Needed. The fields `mCacheIndirDecoderIOBatchedCopySrcOffsetsSliceDevice`, `mCacheIndirDecoderIOBatchedCopyDstOffsetsSliceDevice`, and `mCacheIndirDecoderIOBatchedCopyCopySizesDevice` are defined above the `private:` marker in `runtimeBuffers.h`, so they are already public and can safely be exposed via Python bindings.
Likely an incorrect or invalid review comment.
cpp/tensorrt_llm/nanobind/common/bindTypes.h (1)
63-98: Excellent set binding implementation with pickling support. The set binding template is comprehensive, including proper pickling support for serialization. The implementation correctly handles all standard set operations and follows nanobind best practices.
jenkins/Build.groovy (1)
50-55: LGTM! Nanobind build configurations properly integrated. The new Nanobind build configurations follow the existing pattern and are properly integrated into the build pipeline. The configurations correctly specify:
- The `--binding_type nanobind` flag
- Appropriate tarball names with "nanobind-" prefix
- Matching wheel architectures for each platform
Also applies to: 65-69, 85-89, 542-543
cpp/tensorrt_llm/nanobind/batch_manager/algorithms.cpp (1)
55-178: Well-structured algorithm bindings. The nanobind bindings for batch manager algorithms are properly implemented:
- Consistent pattern across all algorithm classes
- Appropriate use of TorchView for tensor conversions
- Proper handling of optional arguments with default values
- Clear argument naming for Python API
cpp/tensorrt_llm/nanobind/batch_manager/llmRequest.h (1)
89-150: Complex initialization list is correctly implemented. The initialization list properly handles the conversion of optional vectors to shared pointers, which is necessary for the base class. The use of ternary operators and move semantics is appropriate for efficiency.
cpp/tensorrt_llm/nanobind/runtime/moeBindings.cpp (2)
33-47: Good input validation in helper functions. The helper functions `pyDoReplication` and `pyDoPlacement` properly validate that the expert count matches the size of the expert load factor vector before calling the runtime functions. This prevents potential segmentation faults or undefined behavior.
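The guard being praised can be sketched in isolation; the function below is illustrative (not the real `pyDoReplication` signature), assuming only that the downstream routine indexes the vector by expert id:

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Illustrative guard: reject a mismatch between the declared expert count and
// the load-factor vector before any code indexes the vector by expert id.
void doReplicationChecked(std::size_t expertCount, std::vector<float> const& expertLoadFactor)
{
    if (expertLoadFactor.size() != expertCount)
    {
        throw std::invalid_argument("expert_load_factor size must equal expert_count");
    }
    // ... safe to call the runtime replication/placement routine here ...
}
```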
49-122: Well-implemented MOE bindings. The nanobind bindings for MOE components are properly structured:
- Clear class and method documentation strings
- Appropriate use of `def_rw` for read-write properties
- Nice `__repr__` implementation for MoeWeight for debugging
- Consistent argument naming across all methods
cpp/tensorrt_llm/nanobind/executor/request.cpp (1)
73-107: LGTM! Well-structured serialization support. The `__getstate__` and `__setstate__` implementation properly handles all 19 fields of SamplingConfig with appropriate type conversions and validation.
cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.cpp (3)
63-237: LGTM! Comprehensive trampoline implementation. The `PyKvCacheManager` trampoline class correctly implements all 28 pure virtual methods from `BaseKVCacheManager` using the appropriate `NB_OVERRIDE_PURE` macro.
378-385: Good error handling pattern for tensor conversions. The tensor conversion with validation using `TLLM_CHECK_WITH_INFO` provides clear error messages. This pattern is consistently applied throughout the file.
456-464: Excellent GIL management for performance. The `ensure_batch` method correctly releases the GIL before calling the potentially expensive C++ operation, allowing Python concurrency.
cpp/tensorrt_llm/nanobind/executor/executor.cpp (2)
144-162: Well-implemented managed weights handling. The constructor properly converts a Python dictionary of numpy arrays to a C++ map of Tensors with appropriate null checking and type conversions.
189-198: Excellent GIL management and documentation in shutdown. The shutdown method correctly releases the GIL to prevent deadlock, with clear explanation of why this is necessary. The trace logging aids debugging.
cpp/tensorrt_llm/nanobind/executor/executor.h (2)
61-86: Consistent and correct GIL handling in all await methods. All three `awaitResponses` overloads correctly release the GIL with clear documentation explaining the rationale. The `[[nodiscard]]` attributes ensure callers handle return values.
30-129: Well-designed executor wrapper interface. The `Executor` class provides a clean Python-friendly interface with:
- Multiple construction options
- Proper context manager support
- Consistent error handling patterns
- Thread-safe operations with appropriate GIL management
cpp/tensorrt_llm/nanobind/batch_manager/cacheTransceiver.cpp (1)
36-76: LGTM! Trampoline class correctly implements pure virtual methods. The `PyCacheTransceiver` trampoline class is properly structured with an `NB_TRAMPOLINE` declaration matching the 6 pure virtual methods. All overrides use `NB_OVERRIDE_PURE` correctly.
cpp/tensorrt_llm/nanobind/executor/executorConfig.cpp (2)
52-75: Pickling implementation is correct. The `DynamicBatchConfig` pickling properly validates tuple size and correctly reconstructs the object.
444-456: Well-implemented enum string conversion. The `from_string` method properly handles case-insensitive conversion and provides clear error messages.
cpp/tensorrt_llm/nanobind/batch_manager/bindings.cpp (1)
59-81: Exception registration implemented correctly. The custom exception types are properly registered and the translator correctly maps C++ exceptions to Python errors.
cpp/tensorrt_llm/nanobind/runtime/bindings.cpp (2)
61-97: Trampoline class correctly implements the IGptDecoder interface. All 5 pure virtual methods are properly overridden using `NB_OVERRIDE_PURE`.
124-145: Constructor overloads properly handle both file and buffer inputs. The buffer constructor correctly validates the array dimension and both constructors appropriately use the default logger.
d013a3a to 8613975 Compare
/bot run |
PR_Github #12315 [ run ] triggered by Bot |
PR_Github #12313 [ run ] completed with state |
Actionable comments posted: 3
♻️ Duplicate comments (8)
cpp/tensorrt_llm/nanobind/common/customCasters.h (3)
104-120: Potential lifetime issue with OptionalRef construction. Creating an `OptionalRef` with a reference to the local `conv` object is unsafe. The `conv` object will be destroyed when the function returns, leaving a dangling reference.
312-319: Add error handling for DLPack operations. The `from_python` method should handle potential failures in DLPack conversion.
302-303: Use std::const_pointer_cast instead of reinterpret_cast for SharedConstPtr conversion. The current `reinterpret_cast` is undefined behavior. Replace it with `std::const_pointer_cast` to safely strip the const qualifier.
cpp/tensorrt_llm/nanobind/batch_manager/cacheTransceiver.cpp (1)
91-97: Add validation for `num_kv_heads_per_layer` vector size. The constructor accepts a vector of KV heads per layer but doesn't validate that its size matches the expected number of layers from the cache manager. This could lead to out-of-bounds access or incorrect cache layouts.
cpp/tensorrt_llm/nanobind/executor/executorConfig.cpp (1)
584-584: Parameter name inconsistency with property accessors. The constructor parameter `mm_embedding_offloading` doesn't match the property accessor name `prompt_table_offloading`. This inconsistency could confuse users.
Also applies to: 631-632
cpp/tensorrt_llm/nanobind/runtime/bindings.cpp (2)
321-329: Add validation for PyTorch stream object. The function assumes `py_stream` has a `cuda_stream` attribute without validation, which could cause crashes if an incorrect object is passed.
307-319: Add null pointer validation for buffer addresses. The lamport functions cast integer addresses to pointers without validation, which could cause segmentation faults if 0 is passed.
cpp/tensorrt_llm/nanobind/bindings.cpp (1)
363-389: Replace assert with proper error handling. The `assert` at line 365 will be disabled in release builds, potentially leading to undefined behavior if the tuple size is incorrect. Replace it with a proper runtime check that throws an exception.
  auto SamplingConfigSetState = [](tr::SamplingConfig& self, nb::tuple t) -> tr::SamplingConfig {
-     assert(t.size() == 19);
+     if (t.size() != 19) {
+         throw std::runtime_error("Invalid state tuple size for SamplingConfig: expected 19, got " + std::to_string(t.size()));
+     }
      tr::SamplingConfig config;
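The core of the reviewer's argument is that `assert` compiles to nothing when `NDEBUG` is defined, while a thrown exception survives release builds. A minimal standalone version of the suggested check (a hypothetical helper, separate from the binding code):

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>

// Runtime validation that, unlike assert(), is not stripped in release
// builds: an unexpected state-tuple size raises a descriptive exception.
void checkStateSize(std::size_t actual, std::size_t expected)
{
    if (actual != expected)
    {
        throw std::runtime_error("Invalid state tuple size for SamplingConfig: expected "
            + std::to_string(expected) + ", got " + std::to_string(actual));
    }
}
```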
🧹 Nitpick comments (3)
cpp/tensorrt_llm/nanobind/batch_manager/algorithms.cpp (1)
100-130: Consider releasing GIL for potentially expensive operations. The `HandleContextLogits` and `HandleGenerationLogits` operations might be computationally expensive. Consider releasing the GIL to allow other Python threads to run during these operations. Example for HandleContextLogits:
  [](HandleContextLogits const& self, DecoderInputBuffers& inputBuffers,
      RequestVector const& contextRequests, at::Tensor const& logits,
      std::vector<tr::SizeType32> const& numContextLogitsVec, tr::ModelConfig const& modelConfig,
      tr::BufferManager const& manager, OptionalRef<MedusaBuffers> medusaBuffers = std::nullopt)
  {
+     nb::gil_scoped_release release;
      return self(inputBuffers, contextRequests, tr::TorchView::of(logits), numContextLogitsVec,
          modelConfig, manager, medusaBuffers);
  },
cpp/tensorrt_llm/nanobind/batch_manager/bindings.cpp (2)
253-376: Consider using a builder pattern for the complex constructor. The `LlmRequest` constructor has 49 parameters, which makes it difficult to use and maintain. While the implementation is correct, consider introducing a builder pattern or configuration object to improve the API usability.
435-439: Use `size_t` instead of `int` for the loop counter. The loop iterates over a vector but uses `int` as the counter type. Use `size_t` to match the vector's size type and avoid potential issues with large vectors.
- for (int i = 0; i < requests.size(); ++i)
+ for (size_t i = 0; i < requests.size(); ++i)
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (49)
cpp/CMakeLists.txt (2 hunks)
cpp/include/tensorrt_llm/batch_manager/runtimeBuffers.h (1 hunks)
cpp/tensorrt_llm/batch_manager/runtimeBuffers.cpp (1 hunks)
cpp/tensorrt_llm/nanobind/CMakeLists.txt (2 hunks)
cpp/tensorrt_llm/nanobind/batch_manager/algorithms.cpp (1 hunks)
cpp/tensorrt_llm/nanobind/batch_manager/algorithms.h (1 hunks)
cpp/tensorrt_llm/nanobind/batch_manager/bindings.cpp (1 hunks)
cpp/tensorrt_llm/nanobind/batch_manager/bindings.h (1 hunks)
cpp/tensorrt_llm/nanobind/batch_manager/buffers.cpp (1 hunks)
cpp/tensorrt_llm/nanobind/batch_manager/buffers.h (1 hunks)
cpp/tensorrt_llm/nanobind/batch_manager/cacheTransceiver.cpp (1 hunks)
cpp/tensorrt_llm/nanobind/batch_manager/cacheTransceiver.h (1 hunks)
cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.cpp (1 hunks)
cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.h (1 hunks)
cpp/tensorrt_llm/nanobind/batch_manager/llmRequest.cpp (1 hunks)
cpp/tensorrt_llm/nanobind/batch_manager/llmRequest.h (1 hunks)
cpp/tensorrt_llm/nanobind/bindings.cpp (2 hunks)
cpp/tensorrt_llm/nanobind/common/bindTypes.h (1 hunks)
cpp/tensorrt_llm/nanobind/common/customCasters.h (1 hunks)
cpp/tensorrt_llm/nanobind/executor/bindings.cpp (1 hunks)
cpp/tensorrt_llm/nanobind/executor/bindings.h (1 hunks)
cpp/tensorrt_llm/nanobind/executor/executor.cpp (1 hunks)
cpp/tensorrt_llm/nanobind/executor/executor.h (1 hunks)
cpp/tensorrt_llm/nanobind/executor/executorConfig.cpp (1 hunks)
cpp/tensorrt_llm/nanobind/executor/executorConfig.h (1 hunks)
cpp/tensorrt_llm/nanobind/executor/request.cpp (1 hunks)
cpp/tensorrt_llm/nanobind/executor/request.h (1 hunks)
cpp/tensorrt_llm/nanobind/runtime/bindings.cpp (1 hunks)
cpp/tensorrt_llm/nanobind/runtime/bindings.h (1 hunks)
cpp/tensorrt_llm/nanobind/runtime/moeBindings.cpp (1 hunks)
cpp/tensorrt_llm/nanobind/runtime/moeBindings.h (1 hunks)
cpp/tensorrt_llm/nanobind/testing/modelSpecBinding.cpp (1 hunks)
cpp/tensorrt_llm/nanobind/testing/modelSpecBinding.h (1 hunks)
cpp/tensorrt_llm/nanobind/userbuffers/bindings.cpp (1 hunks)
cpp/tensorrt_llm/nanobind/userbuffers/bindings.h (1 hunks)
cpp/tensorrt_llm/pybind/bindings.cpp (1 hunks)
cpp/tensorrt_llm/pybind/executor/bindings.cpp (1 hunks)
cpp/tensorrt_llm/pybind/executor/executorConfig.cpp (1 hunks)
examples/models/core/llama/summarize_long.py (1 hunks)
examples/models/core/qwen2audio/run.py (1 hunks)
examples/models/core/qwenvl/run.py (1 hunks)
jenkins/Build.groovy (4 hunks)
jenkins/L0_Test.groovy (3 hunks)
tensorrt_llm/builder.py (1 hunks)
tensorrt_llm/commands/build.py (2 hunks)
tensorrt_llm/runtime/model_runner.py (1 hunks)
tests/integration/test_lists/test-db/l0_a10.yml (1 hunks)
tests/unittest/bindings/test_bindings_ut.py (4 hunks)
tests/unittest/bindings/test_executor_bindings.py (10 hunks)
✅ Files skipped from review due to trivial changes (3)
- cpp/CMakeLists.txt
- tests/integration/test_lists/test-db/l0_a10.yml
- cpp/tensorrt_llm/nanobind/batch_manager/cacheTransceiver.h
🚧 Files skipped from review as they are similar to previous changes (36)
- cpp/include/tensorrt_llm/batch_manager/runtimeBuffers.h
- examples/models/core/llama/summarize_long.py
- examples/models/core/qwen2audio/run.py
- cpp/tensorrt_llm/nanobind/batch_manager/bindings.h
- examples/models/core/qwenvl/run.py
- cpp/tensorrt_llm/nanobind/testing/modelSpecBinding.h
- tensorrt_llm/commands/build.py
- cpp/tensorrt_llm/nanobind/userbuffers/bindings.h
- cpp/tensorrt_llm/nanobind/executor/executorConfig.h
- cpp/tensorrt_llm/nanobind/CMakeLists.txt
- cpp/tensorrt_llm/nanobind/executor/bindings.h
- cpp/tensorrt_llm/nanobind/runtime/bindings.h
- cpp/tensorrt_llm/batch_manager/runtimeBuffers.cpp
- cpp/tensorrt_llm/pybind/bindings.cpp
- cpp/tensorrt_llm/nanobind/batch_manager/buffers.h
- cpp/tensorrt_llm/pybind/executor/executorConfig.cpp
- cpp/tensorrt_llm/nanobind/executor/request.h
- tensorrt_llm/runtime/model_runner.py
- tensorrt_llm/builder.py
- cpp/tensorrt_llm/nanobind/runtime/moeBindings.h
- cpp/tensorrt_llm/nanobind/batch_manager/algorithms.h
- tests/unittest/bindings/test_executor_bindings.py
- cpp/tensorrt_llm/nanobind/testing/modelSpecBinding.cpp
- cpp/tensorrt_llm/pybind/executor/bindings.cpp
- cpp/tensorrt_llm/nanobind/userbuffers/bindings.cpp
- cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.h
- cpp/tensorrt_llm/nanobind/batch_manager/llmRequest.cpp
- cpp/tensorrt_llm/nanobind/batch_manager/buffers.cpp
- tests/unittest/bindings/test_bindings_ut.py
- jenkins/L0_Test.groovy
- cpp/tensorrt_llm/nanobind/common/bindTypes.h
- jenkins/Build.groovy
- cpp/tensorrt_llm/nanobind/batch_manager/llmRequest.h
- cpp/tensorrt_llm/nanobind/executor/request.cpp
- cpp/tensorrt_llm/nanobind/executor/executor.cpp
- cpp/tensorrt_llm/nanobind/executor/executor.h
🧰 Additional context used
🧠 Learnings (3)
cpp/tensorrt_llm/nanobind/executor/executorConfig.cpp (1)
Learnt from: amitz-nv
PR: NVIDIA/TensorRT-LLM#5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.374Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks `is_adapter_in_cpu_cache()` and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
cpp/tensorrt_llm/nanobind/common/customCasters.h (1)
Learnt from: amitz-nv
PR: NVIDIA/TensorRT-LLM#5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.374Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks `is_adapter_in_cpu_cache()` and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
cpp/tensorrt_llm/nanobind/batch_manager/cacheTransceiver.cpp (1)
Learnt from: amitz-nv
PR: NVIDIA/TensorRT-LLM#5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.374Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks `is_adapter_in_cpu_cache()` and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
🧬 Code Graph Analysis (1)
cpp/tensorrt_llm/nanobind/runtime/moeBindings.cpp (1)
cpp/tensorrt_llm/nanobind/executor/executor.cpp (2)
shutdown (189-198)
shutdown (189-189)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (20)
cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.cpp (1)
458-464: Good practice: GIL release for potentially blocking operation. Great job releasing the GIL in `ensure_batch`. This allows other Python threads to run while the potentially expensive batch ensuring operation executes.
cpp/tensorrt_llm/nanobind/executor/executorConfig.cpp (1)
468-536: Well-implemented pickling for ExecutorConfig. The pickling implementation properly handles both C++ state and Python dynamic attributes, and correctly uses `nb::inst_mark_ready()` to finalize the object after reconstruction. This is a robust solution for complex objects with dynamic attributes.
cpp/tensorrt_llm/nanobind/bindings.cpp (18)
1-61: Well-organized header section with comprehensive includes. The header section is well-structured with:
- Updated copyright notice
- Comprehensive nanobind includes for STL containers and operators
- Proper inclusion of torch and TensorRT-LLM components
- Clean namespace aliases that improve readability
- Helpful template type aliases
68-75: Clean utility function implementation. The `makeSamplingConfig` helper function is properly placed in an anonymous namespace and provides a simple wrapper for the SamplingConfig constructor.
76-111: Comprehensive MPI communication bindings. The MpiComm bindings are well-implemented with:
- Proper static method bindings for singleton access
- Complete coverage of rank, size, and splitting operations
- Appropriate lambda wrappers for accessing MPI session instances
112-122: Proper CUDA stream binding implementation. The CudaStream binding correctly handles Python-to-CUDA stream conversion using nanobind's in-place construction pattern and proper type casting.
123-132: Well-organized submodule hierarchy. The submodule structure is logical with clear separation of concerns:
- Executor bindings in dedicated submodule
- Internal components grouped appropriately
- Early initialization calls for dependency management
133-135: Simple build information exposure. Clean exposure of build-time configuration to Python through a dedicated submodule.
136-157: Comprehensive PEFT cache manager configuration binding. The PeftCacheManagerConfig binding is well-implemented with:
- Complete constructor parameter binding with appropriate defaults
- Read-write properties for all configuration fields
- Proper handling of optional parameters with std::nullopt
158-208: Comprehensive enum bindings with string conversion support. The enum bindings are well-implemented with:
- Complete coverage of all enum values
- Proper value exports for direct access
- KVCacheType includes a useful `from_string` static method for configuration parsing
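A hedged standalone sketch of such a case-insensitive `from_string` helper; the enum and its values here are illustrative and may not match the real `KVCacheType` definition:

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <stdexcept>
#include <string>

enum class KVCacheType { kCONTINUOUS, kPAGED, kDISABLED }; // illustrative values

// Case-insensitive parse: normalize to upper case, then match known names,
// throwing a clear error for anything unrecognized.
KVCacheType kvCacheTypeFromString(std::string name)
{
    std::transform(name.begin(), name.end(), name.begin(),
        [](unsigned char c) { return static_cast<char>(std::toupper(c)); });
    if (name == "CONTINUOUS") { return KVCacheType::kCONTINUOUS; }
    if (name == "PAGED") { return KVCacheType::kPAGED; }
    if (name == "DISABLED") { return KVCacheType::kDISABLED; }
    throw std::invalid_argument("Invalid KV cache type: " + name);
}
```

Normalizing the input once keeps the comparisons simple and the error message echoes the (normalized) offending value, matching the "clear error messages" behavior the review describes.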
209-224: Well-designed LoraModule binding. The LoraModule binding follows good patterns with:
- Comprehensive constructor parameter binding
- Read-only properties appropriate for configuration data
- Static factory method for creating LoRA modules
225-268: Comprehensive quantization mode binding. The QuantMode binding is excellent with:
- Complete static factory methods for all quantization types
- Comprehensive property bindings for feature detection
- Proper operator overloads for flag-like behavior
- Multiple factory methods supporting different use cases
269-324: Comprehensive model configuration binding. The ModelConfig binding is very well-implemented with:
- Complete constructor and property bindings
- Appropriate mix of read-only and read-write properties
- Proper handling of overloaded methods with different signatures
- Good coverage of all model configuration aspects
325-354: Comprehensive world configuration binding. The WorldConfig binding is well-designed with:
- Complete constructor with appropriate defaults
- Comprehensive read-only properties for world state
- Static MPI factory method for distributed setup
- Proper handling of optional device IDs
391-418: Comprehensive sampling configuration binding. The SamplingConfig binding is well-implemented with:
- Multiple constructors for different use cases
- Complete read-write properties for all sampling parameters
- Proper pickle support with serialization methods
- Equality operator for comparisons
419-422: Standard utility bindings. Clean vector binding and utility function exposure for SamplingConfig operations.
423-449: Comprehensive JSON configuration binding. The GptJsonConfig binding is well-implemented with:
- Complete constructor and property bindings
- Static factory methods for parsing JSON from strings and files
- Overloaded methods for different engine filename generation scenarios
- Appropriate read-only properties for configuration data
450-470: Comprehensive state tracking and memory monitoring bindings. Good implementation with:
- Complete LlmRequestState enum covering all request states
- Proper singleton access for MemoryCounters with reference return policy
- Read-only properties for different memory counter types
471-484: Well-organized binding initialization sequence. The initialization sequence is properly structured with:
- Logical order of component initialization
- Appropriate submodule creation for specialized components
- Complete coverage of all binding components
485-497: Proper NVLS memory handling bindings. The NVLS bindings are well-implemented with:
- Complete IpcNvlsHandle class binding with necessary properties
- Custom method for accessing IPC pointers for C interface compatibility
- Module-level functions for NVLS allocation and management operations
20eec02 to 4ff44cf Compare
/bot run |
Actionable comments posted: 1
♻️ Duplicate comments (10)
cpp/tensorrt_llm/nanobind/runtime/moeBindings.cpp (1)
110-110: GIL handling issue in shutdown method. As noted in the past review comment, the `MoeLoadBalancer::shutdown()` method should release the GIL to prevent potential deadlocks if the load balancer has spawned threads that need to acquire the GIL during callbacks.
cpp/tensorrt_llm/nanobind/executor/bindings.cpp (1)
246-256
: Missing timeout validationAs noted in the past review comment, the timeout conversion from double to milliseconds doesn't validate negative values, which could cause unexpected behavior.
cpp/tensorrt_llm/nanobind/executor/request.cpp (3)
913-930: Const-correctness violation in clear methods. As noted in the past review comment, the `clear_context_logits` and `clear_generation_logits` methods use `const_cast` to modify a const reference, violating const-correctness principles.
580-581: Hard-coded parameter value in deserialization. As noted in the past review comment, the hard-coded value `1` for `numReturnSequences` in the Request constructor during deserialization should be properly handled.
433-450: Critical control flow bug in setstate. As noted in the past review comment, the `ContextPhaseParamsSetState` function has incorrect control flow, executing both constructor calls instead of using proper conditional logic.
cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.cpp (1)
409-417: Add validation for negative timeout values. Similar to the executor bindings, the timeout conversion should validate negative values to prevent invalid timeout durations.
  [](tbk::BaseKVCacheManager& self, std::optional<double> timeout_ms = std::nullopt)
  {
      if (timeout_ms)
      {
+         if (*timeout_ms < 0)
+         {
+             throw std::invalid_argument("timeout_ms must be non-negative");
+         }
          return self.getLatestEvents(std::chrono::milliseconds(static_cast<int64_t>(*timeout_ms)));
      }
      return self.getLatestEvents(std::nullopt);
  },
cpp/tensorrt_llm/nanobind/executor/executorConfig.cpp (3)
584-584: Parameter name inconsistency with property accessors. The constructor parameter is named `mm_embedding_offloading` but the property accessors use `prompt_table_offloading`. This inconsistency could confuse users.
- nb::arg("gather_generation_logits") = false, nb::arg("mm_embedding_offloading") = false,
+ nb::arg("gather_generation_logits") = false, nb::arg("prompt_table_offloading") = false,
631-632: Fix property name inconsistency. The property accessor uses `mm_embedding_offloading` but should match the underlying method names that use `prompt_table_offloading`
.- .def_prop_rw("mm_embedding_offloading", &tle::ExecutorConfig::getPromptTableOffloading, - &tle::ExecutorConfig::setPromptTableOffloading) + .def_prop_rw("prompt_table_offloading", &tle::ExecutorConfig::getPromptTableOffloading, + &tle::ExecutorConfig::setPromptTableOffloading)
352-353: Fix the index error in deserialization. There's a critical bug where `state[2]` is used twice. The fourth parameter should use `state[3]`
.- new (&self) tle::ExtendedRuntimePerfKnobConfig(nb::cast<bool>(state[0]), nb::cast<bool>(state[1]), - nb::cast<bool>(state[2]), nb::cast<SizeType32>(state[2])); + new (&self) tle::ExtendedRuntimePerfKnobConfig(nb::cast<bool>(state[0]), nb::cast<bool>(state[1]), + nb::cast<bool>(state[2]), nb::cast<SizeType32>(state[3]));cpp/tensorrt_llm/nanobind/bindings.cpp (1)
362-388: Replace assert with proper error handling. The `assert` at line 364 will be disabled in release builds, potentially leading to undefined behavior if the tuple size is incorrect. Replace it with a proper runtime check that throws an exception. Apply this diff to fix the issue:
- assert(t.size() == 19);
+ if (t.size() != 19) {
+     throw std::runtime_error("Invalid state tuple size for SamplingConfig: expected 19, got " + std::to_string(t.size()));
+ }
🧹 Nitpick comments (3)
cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.cpp (1)
378-406: Add validation for tensor arguments in batch operations. The copy_batch_block_offsets method should validate that requestIds is not empty to prevent unnecessary iterations and potential issues.
  [](tbk::BaseKVCacheManager& self, at::Tensor output,
      std::vector<tb::LlmRequest::RequestIdType> const& requestIds, SizeType32 const beamWidth,
      SizeType32 const offset)
  {
      auto _output = from_torch(output);
      TLLM_CHECK_WITH_INFO(_output.has_value(), "Invalid output tensor.");
+     TLLM_CHECK_WITH_INFO(!requestIds.empty(), "requestIds cannot be empty.");
      for (size_t i = 0; i < requestIds.size(); ++i)
      {
          self.copyBlockOffsets(*(_output.value()), i * beamWidth + offset, requestIds[i]);
      }
  })
cpp/tensorrt_llm/nanobind/batch_manager/bindings.cpp (1)
433-439: Add boundary checking for vector access. The loop should validate that the index is within bounds before accessing the requests vector.
  TLLM_CHECK_WITH_INFO(requests.size() == tokens.size(), "Expected the same number of requests and tokens.");
- for (int i = 0; i < requests.size(); ++i)
+ for (size_t i = 0; i < requests.size(); ++i)
  {
+     TLLM_CHECK_WITH_INFO(i < requests.size(), "Index out of bounds.");
      requests[i]->addNewToken(tokens[i], beam_idx);
  }
492-496
: Add validation for state tuple sizeThe hardcoded size check should be made more maintainable by using a named constant.
+ static constexpr size_t EXPECTED_CPP_STATES_SIZE = 28; auto cpp_states = nb::cast<nb::tuple>(state[0]); - if (cpp_states.size() != 28) + if (cpp_states.size() != EXPECTED_CPP_STATES_SIZE) { throw std::runtime_error("Invalid cpp_states!"); }
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (45)
cpp/CMakeLists.txt (2 hunks)
cpp/tensorrt_llm/nanobind/CMakeLists.txt (2 hunks)
cpp/tensorrt_llm/nanobind/batch_manager/algorithms.cpp (1 hunks)
cpp/tensorrt_llm/nanobind/batch_manager/algorithms.h (1 hunks)
cpp/tensorrt_llm/nanobind/batch_manager/bindings.cpp (1 hunks)
cpp/tensorrt_llm/nanobind/batch_manager/bindings.h (1 hunks)
cpp/tensorrt_llm/nanobind/batch_manager/cacheTransceiver.cpp (1 hunks)
cpp/tensorrt_llm/nanobind/batch_manager/cacheTransceiver.h (1 hunks)
cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.cpp (1 hunks)
cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.h (1 hunks)
cpp/tensorrt_llm/nanobind/batch_manager/llmRequest.cpp (1 hunks)
cpp/tensorrt_llm/nanobind/batch_manager/llmRequest.h (1 hunks)
cpp/tensorrt_llm/nanobind/bindings.cpp (2 hunks)
cpp/tensorrt_llm/nanobind/common/bindTypes.h (1 hunks)
cpp/tensorrt_llm/nanobind/common/customCasters.h (1 hunks)
cpp/tensorrt_llm/nanobind/executor/bindings.cpp (1 hunks)
cpp/tensorrt_llm/nanobind/executor/bindings.h (1 hunks)
cpp/tensorrt_llm/nanobind/executor/executor.cpp (1 hunks)
cpp/tensorrt_llm/nanobind/executor/executor.h (1 hunks)
cpp/tensorrt_llm/nanobind/executor/executorConfig.cpp (1 hunks)
cpp/tensorrt_llm/nanobind/executor/executorConfig.h (1 hunks)
cpp/tensorrt_llm/nanobind/executor/request.cpp (1 hunks)
cpp/tensorrt_llm/nanobind/executor/request.h (1 hunks)
cpp/tensorrt_llm/nanobind/runtime/bindings.cpp (1 hunks)
cpp/tensorrt_llm/nanobind/runtime/bindings.h (1 hunks)
cpp/tensorrt_llm/nanobind/runtime/moeBindings.cpp (1 hunks)
cpp/tensorrt_llm/nanobind/runtime/moeBindings.h (1 hunks)
cpp/tensorrt_llm/nanobind/testing/modelSpecBinding.cpp (1 hunks)
cpp/tensorrt_llm/nanobind/testing/modelSpecBinding.h (1 hunks)
cpp/tensorrt_llm/nanobind/userbuffers/bindings.cpp (1 hunks)
cpp/tensorrt_llm/nanobind/userbuffers/bindings.h (1 hunks)
cpp/tensorrt_llm/pybind/bindings.cpp (1 hunks)
cpp/tensorrt_llm/pybind/executor/bindings.cpp (1 hunks)
cpp/tensorrt_llm/pybind/executor/executorConfig.cpp (1 hunks)
examples/models/core/llama/summarize_long.py (1 hunks)
examples/models/core/qwen2audio/run.py (1 hunks)
examples/models/core/qwenvl/run.py (1 hunks)
jenkins/Build.groovy (4 hunks)
jenkins/L0_Test.groovy (3 hunks)
tensorrt_llm/builder.py (1 hunks)
tensorrt_llm/commands/build.py (2 hunks)
tensorrt_llm/runtime/model_runner.py (1 hunks)
tests/integration/test_lists/test-db/l0_a10.yml (1 hunks)
tests/unittest/bindings/test_bindings_ut.py (4 hunks)
tests/unittest/bindings/test_executor_bindings.py (10 hunks)
✅ Files skipped from review due to trivial changes (1)
- tests/integration/test_lists/test-db/l0_a10.yml
🚧 Files skipped from review as they are similar to previous changes (36)
- examples/models/core/llama/summarize_long.py
- tensorrt_llm/runtime/model_runner.py
- examples/models/core/qwenvl/run.py
- cpp/tensorrt_llm/pybind/executor/executorConfig.cpp
- examples/models/core/qwen2audio/run.py
- cpp/tensorrt_llm/nanobind/CMakeLists.txt
- cpp/tensorrt_llm/nanobind/runtime/bindings.h
- cpp/CMakeLists.txt
- cpp/tensorrt_llm/nanobind/executor/executorConfig.h
- cpp/tensorrt_llm/pybind/bindings.cpp
- cpp/tensorrt_llm/nanobind/executor/request.h
- cpp/tensorrt_llm/nanobind/batch_manager/cacheTransceiver.h
- cpp/tensorrt_llm/nanobind/executor/bindings.h
- cpp/tensorrt_llm/nanobind/runtime/moeBindings.h
- cpp/tensorrt_llm/pybind/executor/bindings.cpp
- cpp/tensorrt_llm/nanobind/batch_manager/bindings.h
- cpp/tensorrt_llm/nanobind/testing/modelSpecBinding.cpp
- cpp/tensorrt_llm/nanobind/userbuffers/bindings.h
- cpp/tensorrt_llm/nanobind/userbuffers/bindings.cpp
- tests/unittest/bindings/test_executor_bindings.py
- tensorrt_llm/builder.py
- tensorrt_llm/commands/build.py
- cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.h
- tests/unittest/bindings/test_bindings_ut.py
- cpp/tensorrt_llm/nanobind/batch_manager/algorithms.h
- cpp/tensorrt_llm/nanobind/testing/modelSpecBinding.h
- jenkins/Build.groovy
- cpp/tensorrt_llm/nanobind/common/bindTypes.h
- cpp/tensorrt_llm/nanobind/batch_manager/llmRequest.h
- cpp/tensorrt_llm/nanobind/batch_manager/algorithms.cpp
- cpp/tensorrt_llm/nanobind/common/customCasters.h
- cpp/tensorrt_llm/nanobind/executor/executor.h
- jenkins/L0_Test.groovy
- cpp/tensorrt_llm/nanobind/batch_manager/cacheTransceiver.cpp
- cpp/tensorrt_llm/nanobind/executor/executor.cpp
- cpp/tensorrt_llm/nanobind/runtime/bindings.cpp
🧰 Additional context used
🧠 Learnings (2)
cpp/tensorrt_llm/nanobind/executor/request.cpp (1)
Learnt from: amitz-nv
PR: NVIDIA/TensorRT-LLM#5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.374Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks `is_adapter_in_cpu_cache()` and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
cpp/tensorrt_llm/nanobind/executor/executorConfig.cpp (1)
Learnt from: amitz-nv
PR: NVIDIA/TensorRT-LLM#5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.374Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks `is_adapter_in_cpu_cache()` and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
🧬 Code Graph Analysis (2)
cpp/tensorrt_llm/nanobind/runtime/moeBindings.cpp (1)
cpp/tensorrt_llm/nanobind/executor/executor.cpp (2)
shutdown
(189-198)shutdown
(189-189)
cpp/tensorrt_llm/nanobind/executor/executorConfig.cpp (2)
cpp/include/tensorrt_llm/runtime/decodingInput.h (1)
sinkTokenLength
(47-47)tensorrt_llm/llmapi/llm_args.py (3)
SchedulerConfig
(614-632)PeftCacheConfig
(636-699)ExtendedRuntimePerfKnobConfig
(847-874)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (30)
cpp/tensorrt_llm/nanobind/runtime/moeBindings.cpp (2)
33-39
: Input validation is appropriate. The helper function correctly validates that the expert count matches the expert load factor size before proceeding with the operation.
41-47
: Consistent validation pattern. The `pyDoPlacement` function follows the same validation pattern as `pyDoReplication`, ensuring consistent error handling.
cpp/tensorrt_llm/nanobind/batch_manager/llmRequest.cpp (3)
43-50
: Tensor conversion function is well-implementedThe
from_torch
helper function correctly handles optional tensor conversion using TorchView, with proper null checking.
54-68
: Callback adapter handles tensor conversion properlyThe callback adapter correctly converts internal tensor representations to PyTorch tensors for the user-provided callback, with proper stream handling.
70-131
: Complex constructor with many parameters. The `toTrtLlm` method constructs a request with 49 parameters, which is complex but appears necessary for the comprehensive LLM request configuration. The parameter ordering and types are consistent with the header declaration.
cpp/tensorrt_llm/nanobind/executor/bindings.cpp (2)
56-88
: Well-implemented custom serialization for DecodingModeThe custom
__getstate__
and__setstate__
methods properly handle the DecodingMode serialization with appropriate state validation.
105-115
: Comprehensive statistics class bindingsThe KvCacheStats class binding properly exposes all relevant statistics fields for monitoring cache performance.
cpp/tensorrt_llm/nanobind/executor/request.cpp (2)
73-107
: Comprehensive SamplingConfig serializationThe SamplingConfig serialization properly handles all 19 parameters with appropriate type casting and validation.
585-662
: Extensive Request constructor with proper defaultsThe Request constructor binding handles a comprehensive set of parameters with appropriate default values and keyword-only arguments for better usability.
cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.cpp (2)
63-237
: Well-structured trampoline class implementationThe PyKvCacheManager trampoline class correctly implements all 28 virtual methods with NB_OVERRIDE_PURE macros, enabling proper Python subclassing of the abstract BaseKVCacheManager class.
54-61
: Proper tensor conversion handlingThe from_torch helper function correctly handles optional tensor conversion from PyTorch to runtime tensors with proper null checking.
cpp/tensorrt_llm/nanobind/batch_manager/bindings.cpp (2)
59-81
: Proper exception registration and handlingThe exception registration for PeftTaskNotCachedException and LoraCacheFullException is correctly implemented with proper translator setup.
429-523
: Well-structured batch processing functionsThe free functions add_new_tokens_to_requests and make_decoding_batch_input are well-implemented with proper error checking and efficient tensor operations.
cpp/tensorrt_llm/nanobind/executor/executorConfig.cpp (1)
468-536
: Complex but well-structured serialization logicThe ExecutorConfig serialization/deserialization implementation correctly handles both C++ state and Python attributes using nb::inst_mark_ready(), enabling proper object lifecycle management.
cpp/tensorrt_llm/nanobind/bindings.cpp (16)
1-61
: LGTM - Well-organized headers and namespace declarations.The includes are comprehensive and properly organized, with appropriate namespace aliases for readability.
67-73
: LGTM - Clean helper function implementation.The anonymous namespace usage is appropriate for the file-local helper function.
82-109
: LGTM - Comprehensive MpiComm bindings.The static method bindings are properly implemented with appropriate lambda wrappers.
111-120
: LGTM - Appropriate CudaStream binding.The custom constructor properly handles the conversion from Python stream objects to CUDA stream handles.
122-133
: LGTM - Well-organized submodule structure.The hierarchical submodule organization and early initialization calls are appropriate.
135-155
: LGTM - Comprehensive PeftCacheManagerConfig bindings.The binding properly exposes all configuration options with appropriate defaults and optional parameter handling.
157-206
: LGTM - Comprehensive enum bindings. The enum bindings properly expose all values, and the `from_string` method for KVCacheType is a useful addition for parsing.
208-222
: LGTM - Well-structured LoraModule bindings.The read-only properties and static factory method are appropriately designed for this configuration class.
224-266
: LGTM - Comprehensive QuantMode bindings.The extensive static factory methods and operator overloads are appropriate for this quantization configuration class.
268-322
: LGTM - Extensive ModelConfig bindings.The comprehensive property exposure and proper handling of method overloads with nb::overload_cast are well-implemented.
324-352
: LGTM - Well-structured WorldConfig bindings.The read-only properties and static MPI factory method are appropriately designed for this configuration class.
390-420
: LGTM - Comprehensive SamplingConfig bindings.The class bindings, vector binding, and helper function are well-implemented.
422-447
: LGTM - Well-structured GptJsonConfig bindings.The static parsing methods and read-only properties are appropriately designed for this configuration class.
449-468
: LGTM - Appropriate enum and singleton bindings.The LlmRequestState enum and MemoryCounters singleton binding are well-implemented.
470-481
: LGTM - Well-organized submodule initialization.The initialization calls are properly organized and follow a logical pattern.
483-495
: LGTM - Appropriate NVLS bindings.The IpcNvlsHandle bindings and global functions are well-implemented for NVLS memory management.
PR_Github #12317 [ run ] triggered by Bot |
PR_Github #12315 [ run ] completed with state |
PR_Github #12317 [ run ] completed with state |
This reverts commit b75e53a. Signed-off-by: Linda-Stadter <[email protected]>
Signed-off-by: Linda-Stadter <[email protected]>
Signed-off-by: Linda-Stadter <[email protected]>
Signed-off-by: Linda-Stadter <[email protected]>
4ff44cf
to
53e389a
Compare
/bot run |
PR_Github #12382 [ run ] triggered by Bot |
Actionable comments posted: 2
♻️ Duplicate comments (9)
cpp/tensorrt_llm/nanobind/executor/bindings.cpp (1)
246-256: Add validation for negative timeout values

The timeout conversion from double to milliseconds doesn't validate negative values. Negative timeouts could cause unexpected behavior.

```diff
 [](tle::KVCacheEventManager& self, std::optional<double> timeout_ms = std::nullopt)
 {
     if (timeout_ms)
     {
+        if (*timeout_ms < 0)
+        {
+            throw std::invalid_argument("timeout_ms must be non-negative");
+        }
         return self.getLatestEvents(std::chrono::milliseconds(static_cast<int64_t>(*timeout_ms)));
     }
     return self.getLatestEvents(std::nullopt);
 },
```

cpp/tensorrt_llm/nanobind/common/customCasters.h (3)
104-120: Potential lifetime issue with OptionalRef construction

On line 118, creating an `OptionalRef` with a reference to the local `conv` object is unsafe. The `conv` object will be destroyed when the function returns, leaving a dangling reference.

This appears to be a fundamental design issue with how `OptionalRef` is being used. The caster should ensure the referenced object has appropriate lifetime guarantees. Consider whether `OptionalRef` is the right type to use here, or whether the binding needs to manage object lifetime differently.
302-303: Use std::const_pointer_cast instead of reinterpret_cast for SharedConstPtr conversion

The current `reinterpret_cast` of `ITensor::SharedConstPtr` to `ITensor::SharedPtr` is undefined behavior: even though both are `std::shared_ptr` instantiations, they are distinct types. Replace it with `std::const_pointer_cast` to safely strip the const qualifier while preserving the control block.

```diff
-    return THPVariable_Wrap(tensorrt_llm::runtime::Torch::tensor(
-        reinterpret_cast<tensorrt_llm::runtime::ITensor::SharedPtr const&>(src)));
+    return THPVariable_Wrap(tensorrt_llm::runtime::Torch::tensor(
+        std::const_pointer_cast<tensorrt_llm::runtime::ITensor>(src)));
```
312-319: Add error handling for DLPack operations

The `from_python` method should handle potential failures in DLPack conversion.

```diff
 bool from_python(nb::handle src, uint8_t, cleanup_list*) noexcept
 {
-    nb::object capsule = nb::getattr(src, "__dlpack__")();
-    DLManagedTensor* dl_managed = static_cast<DLManagedTensor*>(PyCapsule_GetPointer(capsule.ptr(), "dltensor"));
-    PyCapsule_SetDestructor(capsule.ptr(), nullptr);
-    value = at::fromDLPack(dl_managed).alias();
-    return true;
+    try
+    {
+        nb::object capsule = nb::getattr(src, "__dlpack__")();
+        if (!PyCapsule_CheckExact(capsule.ptr()))
+            return false;
+        DLManagedTensor* dl_managed = static_cast<DLManagedTensor*>(PyCapsule_GetPointer(capsule.ptr(), "dltensor"));
+        if (!dl_managed)
+            return false;
+        PyCapsule_SetDestructor(capsule.ptr(), nullptr);
+        value = at::fromDLPack(dl_managed).alias();
+        return true;
+    }
+    catch (...)
+    {
+        return false;
+    }
 }
```

cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.cpp (1)
409-417: Add validation for negative timeout values

Similar to the executor bindings, the timeout conversion should validate negative values.

```diff
 [](tbk::BaseKVCacheManager& self, std::optional<double> timeout_ms = std::nullopt)
 {
     if (timeout_ms)
     {
+        if (*timeout_ms < 0)
+        {
+            throw std::invalid_argument("timeout_ms must be non-negative");
+        }
         return self.getLatestEvents(std::chrono::milliseconds(static_cast<int64_t>(*timeout_ms)));
     }
     return self.getLatestEvents(std::nullopt);
 },
```

cpp/tensorrt_llm/nanobind/batch_manager/bindings.cpp (1)
254-376
: Constructor complexity needs to be addressed.The 49-parameter constructor remains overly complex and error-prone. Consider implementing a builder pattern as previously suggested to improve maintainability and usability.
cpp/tensorrt_llm/nanobind/executor/executorConfig.cpp (2)
352-354
: Fix the deserialization index error. The fourth parameter incorrectly uses `state[2]` instead of `state[3]`, causing incorrect deserialization.
584-584
: Parameter naming inconsistency. The constructor parameter `mm_embedding_offloading` is inconsistent with the property name `prompt_table_offloading`, which could confuse users.
Also applies to: 631-632
cpp/tensorrt_llm/nanobind/bindings.cpp (1)
362-389
: Replace assert with proper runtime error handling.The assert will be disabled in release builds, potentially causing undefined behavior if the tuple size is incorrect.
🧹 Nitpick comments (4)
cpp/tensorrt_llm/nanobind/batch_manager/llmRequest.cpp (2)
73-74: Simplify draft tokens handling

The creation of `optDraftTokens` seems redundant since `mDraftTokens` is always initialized (either with provided tokens or an empty vector, according to the constructor). Consider simplifying this.

```diff
-    auto const draftTokens = std::make_shared<std::vector<TokenIdType>>(*mDraftTokens.get());
-    auto const optDraftTokens = std::optional<std::shared_ptr<std::vector<TokenIdType>>>(draftTokens);
+    auto const optDraftTokens = std::optional<std::shared_ptr<std::vector<TokenIdType>>>(mDraftTokens);
```
75-78: Consider avoiding unnecessary vector copy for encoder tokens

The current implementation creates a copy of the encoder tokens vector. Since the tokens are already managed by a shared_ptr in the member variable, consider reusing it directly.

```diff
-    auto const encoderInputTokens = mEncoderTokens.has_value()
-        ? std::make_shared<std::vector<TokenIdType>>(*mEncoderTokens.value().get())
-        : nullptr;
-    auto const optEncoderInputTokens = std::optional<std::shared_ptr<std::vector<TokenIdType>>>(encoderInputTokens);
+    auto const optEncoderInputTokens = mEncoderTokens;
```

cpp/tensorrt_llm/nanobind/batch_manager/llmRequest.h (2)
54-152
: Consider design patterns to manage constructor complexity. The constructor with 49 parameters is extremely complex and potentially error-prone. While this may be constrained by the underlying `GenericLlmRequest` API, consider these approaches:
- Builder pattern: Create a builder class to construct requests incrementally
- Parameter object pattern: Group related parameters into configuration structs
- Factory methods: Provide static factory methods for common use cases
The conditional `shared_ptr` creation logic is repetitive and could benefit from helper functions. Example helper function to reduce repetition:

```cpp
template <typename T>
static auto makeOptionalShared(std::optional<T>&& opt)
{
    return opt.has_value() ? std::make_optional(std::make_shared<T>(std::move(*opt)))
                           : std::optional<std::shared_ptr<T>>(std::nullopt);
}
```
39-158
: Overall design is appropriate for a binding layer with room for improvement.The
LlmRequest
class effectively adapts the underlyingGenericLlmRequest
for nanobind Python bindings. Key strengths:
- Proper integration with PyTorch tensors (
at::Tensor
)- Clear documentation of stream handling limitations
- Appropriate adapter pattern implementation
Areas for potential improvement:
- Constructor complexity could benefit from builder or factory patterns
- Consider adding validation for parameter combinations
- Enhanced documentation for the extensive parameter set would help users
The design appropriately handles the complexity inherited from the underlying C++ API while providing a clean Python interface.
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (45)
- cpp/CMakeLists.txt (2 hunks)
- cpp/tensorrt_llm/nanobind/CMakeLists.txt (2 hunks)
- cpp/tensorrt_llm/nanobind/batch_manager/algorithms.cpp (1 hunks)
- cpp/tensorrt_llm/nanobind/batch_manager/algorithms.h (1 hunks)
- cpp/tensorrt_llm/nanobind/batch_manager/bindings.cpp (1 hunks)
- cpp/tensorrt_llm/nanobind/batch_manager/bindings.h (1 hunks)
- cpp/tensorrt_llm/nanobind/batch_manager/cacheTransceiver.cpp (1 hunks)
- cpp/tensorrt_llm/nanobind/batch_manager/cacheTransceiver.h (1 hunks)
- cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.cpp (1 hunks)
- cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.h (1 hunks)
- cpp/tensorrt_llm/nanobind/batch_manager/llmRequest.cpp (1 hunks)
- cpp/tensorrt_llm/nanobind/batch_manager/llmRequest.h (1 hunks)
- cpp/tensorrt_llm/nanobind/bindings.cpp (2 hunks)
- cpp/tensorrt_llm/nanobind/common/bindTypes.h (1 hunks)
- cpp/tensorrt_llm/nanobind/common/customCasters.h (1 hunks)
- cpp/tensorrt_llm/nanobind/executor/bindings.cpp (1 hunks)
- cpp/tensorrt_llm/nanobind/executor/bindings.h (1 hunks)
- cpp/tensorrt_llm/nanobind/executor/executor.cpp (1 hunks)
- cpp/tensorrt_llm/nanobind/executor/executor.h (1 hunks)
- cpp/tensorrt_llm/nanobind/executor/executorConfig.cpp (1 hunks)
- cpp/tensorrt_llm/nanobind/executor/executorConfig.h (1 hunks)
- cpp/tensorrt_llm/nanobind/executor/request.cpp (1 hunks)
- cpp/tensorrt_llm/nanobind/executor/request.h (1 hunks)
- cpp/tensorrt_llm/nanobind/runtime/bindings.cpp (1 hunks)
- cpp/tensorrt_llm/nanobind/runtime/bindings.h (1 hunks)
- cpp/tensorrt_llm/nanobind/runtime/moeBindings.cpp (1 hunks)
- cpp/tensorrt_llm/nanobind/runtime/moeBindings.h (1 hunks)
- cpp/tensorrt_llm/nanobind/testing/modelSpecBinding.cpp (1 hunks)
- cpp/tensorrt_llm/nanobind/testing/modelSpecBinding.h (1 hunks)
- cpp/tensorrt_llm/nanobind/userbuffers/bindings.cpp (1 hunks)
- cpp/tensorrt_llm/nanobind/userbuffers/bindings.h (1 hunks)
- cpp/tensorrt_llm/pybind/bindings.cpp (1 hunks)
- cpp/tensorrt_llm/pybind/executor/bindings.cpp (1 hunks)
- cpp/tensorrt_llm/pybind/executor/executorConfig.cpp (1 hunks)
- examples/models/core/llama/summarize_long.py (1 hunks)
- examples/models/core/qwen2audio/run.py (1 hunks)
- examples/models/core/qwenvl/run.py (1 hunks)
- jenkins/Build.groovy (4 hunks)
- jenkins/L0_Test.groovy (3 hunks)
- tensorrt_llm/builder.py (1 hunks)
- tensorrt_llm/commands/build.py (2 hunks)
- tensorrt_llm/runtime/model_runner.py (1 hunks)
- tests/integration/test_lists/test-db/l0_a10.yml (1 hunks)
- tests/unittest/bindings/test_bindings_ut.py (4 hunks)
- tests/unittest/bindings/test_executor_bindings.py (10 hunks)
✅ Files skipped from review due to trivial changes (2)
- cpp/tensorrt_llm/nanobind/userbuffers/bindings.cpp
- cpp/tensorrt_llm/nanobind/runtime/bindings.cpp
🚧 Files skipped from review as they are similar to previous changes (35)
- examples/models/core/qwenvl/run.py
- tensorrt_llm/commands/build.py
- tensorrt_llm/runtime/model_runner.py
- cpp/tensorrt_llm/nanobind/batch_manager/algorithms.h
- examples/models/core/qwen2audio/run.py
- cpp/tensorrt_llm/nanobind/batch_manager/bindings.h
- cpp/tensorrt_llm/nanobind/CMakeLists.txt
- cpp/tensorrt_llm/nanobind/executor/bindings.h
- cpp/tensorrt_llm/nanobind/userbuffers/bindings.h
- cpp/tensorrt_llm/pybind/executor/executorConfig.cpp
- tests/integration/test_lists/test-db/l0_a10.yml
- cpp/tensorrt_llm/nanobind/runtime/moeBindings.h
- cpp/CMakeLists.txt
- cpp/tensorrt_llm/nanobind/executor/executorConfig.h
- cpp/tensorrt_llm/pybind/executor/bindings.cpp
- cpp/tensorrt_llm/nanobind/batch_manager/cacheTransceiver.h
- cpp/tensorrt_llm/nanobind/runtime/bindings.h
- cpp/tensorrt_llm/nanobind/testing/modelSpecBinding.h
- cpp/tensorrt_llm/nanobind/batch_manager/kvCacheManager.h
- cpp/tensorrt_llm/nanobind/executor/request.h
- tensorrt_llm/builder.py
- cpp/tensorrt_llm/pybind/bindings.cpp
- cpp/tensorrt_llm/nanobind/testing/modelSpecBinding.cpp
- tests/unittest/bindings/test_executor_bindings.py
- examples/models/core/llama/summarize_long.py
- tests/unittest/bindings/test_bindings_ut.py
- jenkins/Build.groovy
- cpp/tensorrt_llm/nanobind/common/bindTypes.h
- cpp/tensorrt_llm/nanobind/batch_manager/algorithms.cpp
- jenkins/L0_Test.groovy
- cpp/tensorrt_llm/nanobind/runtime/moeBindings.cpp
- cpp/tensorrt_llm/nanobind/executor/request.cpp
- cpp/tensorrt_llm/nanobind/executor/executor.cpp
- cpp/tensorrt_llm/nanobind/batch_manager/cacheTransceiver.cpp
- cpp/tensorrt_llm/nanobind/executor/executor.h
🧰 Additional context used
🧠 Learnings (2)
cpp/tensorrt_llm/nanobind/common/customCasters.h (1)
Learnt from: amitz-nv
PR: NVIDIA/TensorRT-LLM#5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.374Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks `is_adapter_in_cpu_cache()` and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
cpp/tensorrt_llm/nanobind/executor/executorConfig.cpp (1)
Learnt from: amitz-nv
PR: NVIDIA/TensorRT-LLM#5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.374Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks `is_adapter_in_cpu_cache()` and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
🧬 Code Graph Analysis (2)
cpp/tensorrt_llm/nanobind/executor/bindings.cpp (3)
cpp/tensorrt_llm/nanobind/executor/request.cpp (2)
initRequestBindings
(53-933)initRequestBindings
(53-53)cpp/tensorrt_llm/nanobind/executor/executorConfig.cpp (2)
initConfigBindings
(46-637)initConfigBindings
(46-46)cpp/tensorrt_llm/nanobind/executor/executor.cpp (2)
initBindings
(200-239)initBindings
(200-200)
cpp/tensorrt_llm/nanobind/executor/executorConfig.cpp (2)
cpp/include/tensorrt_llm/runtime/decodingInput.h (1)
sinkTokenLength
(47-47)tensorrt_llm/llmapi/llm_args.py (3)
SchedulerConfig
(614-632)PeftCacheConfig
(636-699)ExtendedRuntimePerfKnobConfig
(847-874)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (4)
cpp/tensorrt_llm/nanobind/batch_manager/bindings.cpp (1)
59-80
: Exception handling is well-implemented.The exception registration and translation mechanism properly converts C++ exceptions to Python exceptions, ensuring clean error propagation across the language boundary.
cpp/tensorrt_llm/nanobind/batch_manager/llmRequest.h (3)
18-33
: Header structure and includes look good.The includes are well-organized and appropriate for the nanobind binding functionality. The namespace aliases (
nb
for nanobind andtb
for tensorrt_llm::batch_manager) improve readability.
35-53
: Well-designed class with helpful documentation.The inheritance from
GenericLlmRequest<at::Tensor, c10::Stream>
is appropriate, and the comment about torch's nanobind binding limitations withc10::cuda::CUDAStream
provides valuable context. The type aliases improve code readability.
154-158
: Method declarations are well-designed.The
callbackAdapter
andtoTrtLlm
methods serve clear purposes in the binding layer:
callbackAdapter
handles Python-to-C++ callback conversiontoTrtLlm
provides type conversion to the underlying batch manager- Appropriate const-correctness and return types
PR_Github #12382 [ run ] completed with state |
/bot run |
PR_Github #12389 [ run ] triggered by Bot |
PR_Github #12389 [ run ] completed with state |
Signed-off-by: Linda-Stadter <[email protected]>
Signed-off-by: Linda-Stadter <[email protected]> Signed-off-by: Shreyas Misra <[email protected]>
Signed-off-by: Linda-Stadter <[email protected]> Signed-off-by: Ransiki Zhang <[email protected]>
feat: nanobind bindings
This PR includes:
Building with nanobind
The default build command still uses pybind!
Add `--binding_type nanobind` as a build flag to `build_wheel.py` to build with nanobind. Example:

```shell
python scripts/build_wheel.py --binding_type nanobind --trt_root /usr/local/tensorrt --cuda_architectures "native" -D FAST_BUILD=ON --use_ccache
```

This creates a Python module `bindings`. It is the same as before, therefore no changes to the Python files are required.

Porting from pybind to nanobind
I have listed the main differences between pybind11 and nanobind for our code base here. There is also a guide in the nanobind documentation: https://nanobind.readthedocs.io/en/latest/porting.html
- `PYBIND11_MODULE(..)` → `NB_MODULE(..)`
- `PYBIND11_OVERRIDE_*(..)` → `NB_OVERRIDE_*(..)`
- `reinterpret_borrow<T>(x)` → `borrow<T>(x)`
- `reinterpret_steal<T>(x)` → `steal<T>(x)`
- `.def_readwrite(..)` → `.def_rw(..)`
- `.def_readonly(..)` → `.def_ro(..)`
- `.def_property(..)` → `.def_prop_rw(..)`
- `.def_property_readonly(..)` → `.def_prop_ro(..)`
- `register_exception<T>` → `exception<T>`
- `classh` → `class_`
- `std::deque`: nanobind has no built-in caster for `std::deque`. Created a custom caster instead and reached out to the owners for support of `std::deque`.
- `.def(py::init([](int) { return MyType(...); }));` → `.def("__init__", [](MyType *t) { new (t) MyType(...); });` See documentation
- `object.cast<Type>()` → `nb::cast<Type>(object)`
- `py::arg_v("param_name", default_value, "doc_string_description")` → `nb::arg("param_name") = default_value`
Replaced the numpy array class `py::array` with `nb::ndarray<>`. See documentation. This class does not support nonstandard arithmetic types; I overloaded `nanobind::detail::dtype_traits` to support the datatype `half`. See documentation.

Replaced `PYBIND11_OVERRIDE_*(..)` with base type and return value arguments by `NB_OVERRIDE_*()` without these arguments. The class also requires one `NB_TRAMPOLINE(parent, size)` declaration. See documentation.

Replaced `py::pickle()`, used to bind `__getstate__` and `__setstate__` functions, with direct `__getstate__` and `__setstate__` methods. Changed the `__setstate__` construction to be in-place. See documentation.

There is a special case in the set-state function `executorConfigSetState`: the class `ExecutorConfig` allows dynamic attributes (attributes added in Python code). The set-state function has to restore the C++ data as well as the attributes from Python. Nanobind then has to be informed that the instance object is fully constructed via `nb::inst_mark_ready()`.

None arguments
Nanobind refuses conversions from `None` to `nullptr` by default. In order to accept `None` in an argument binding, either call `.none()` on `nb::arg` or specify an `nb::none()` default argument. See documentation

Custom type casters
See documentation

- `load()` → `from_python()` with a different signature
- `cast()` → `from_cpp()` with a different signature
- Adapted the custom casters for `at::Tensor` and `std::deque`
For the exceptions `PeftTaskNotCachedException` and `LoraCacheFullException`, I created custom exceptions with `nb::exception<T>`. See documentation