[TRTLLM-7731][feat] KV cache transmission in disagg with CP on gen side #7624
Conversation
📝 Walkthrough
Extends MLA KV-cache formatting/split/concat to support the CP (context parallel) domain alongside PP/TP, updates buffer sizing and indexing, adds environment-gated debug logging, relaxes a CP constraint, enhances tensor printing for 4D shapes, and augments multi-GPU tests for CP-aware scenarios with additional rank decomposition and validations.
Sequence Diagram(s)
sequenceDiagram
autonumber
participant Host as Host (Transceiver)
participant TR as TargetRanksInfo
participant Kern as splitKVCacheForMLAKernel
participant Buff as Output Buffers (PP×CP)
note over Host,TR: CP-aware MLA split path
Host->>TR: Query DomainPPSize, DomainTPSize, DomainCPSize<br/>and peer layer/block metadata
TR-->>Host: Sizes and indexing info
Host->>Host: Compute output cache count = PP×CP<br/>Validate IRanks size = PP×TP×CP
Host->>Buff: Allocate/prepare buffers per peer (PP×CP)
Host->>Kern: Launch splitKVCacheForMLAKernel(..., DomainPPSize, DomainTPSize, domainCPSize, ...)
rect rgba(200, 235, 255, 0.25)
note right of Kern: CP-aware indexing<br/>outputCacheIdx = (blockId % CP) * PP + rankInPP<br/>offset uses blockIdInDomainCP
end
Kern-->>Buff: Write K/V slices to PP×CP outputs
Buff-->>Host: Buffers ready for transmission
sequenceDiagram
autonumber
participant FM as MLACacheFormatter
participant Peers as Peers (PP×CP)
participant Buf as Send/Recv Buffers
note over FM,Peers: CP-extended formatting
FM->>FM: Read topology (PP, CP), peer layer counts
FM->>FM: Compute per-peer block counts (getBlockNumAccountingForCP)
FM->>Buf: Allocate send/recv buffers for PP×CP targets
FM->>Peers: Distribute/collect cache blocks with CP indexing
opt Debug (TLLM_DEBUG_RANK)
FM->>FM: Print inputKvCacheBlocks / outputBuffers snapshots
end
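To make the sizing relationships in the diagrams concrete, here is a minimal sketch of the invariants (a hypothetical check helper; names only approximate the transceiver code):

```cpp
#include <cassert>
#include <cstddef>

// Illustrative sanity check for the CP-aware MLA split path sketched above.
// Domain sizes are assumed to come from TargetRanksInfo; names are not exact.
void checkMlaSplitSizing(std::size_t numTargetRanks, std::size_t numOutputCaches,
    int domainPPSize, int domainTPSize, int domainCPSize)
{
    // The target rank list spans the full PP x TP x CP domain.
    assert(numTargetRanks == static_cast<std::size_t>(domainPPSize) * domainTPSize * domainCPSize);
    // For MLA, one output cache per (PP, CP) pair; TP does not multiply the count.
    assert(numOutputCaches == static_cast<std::size_t>(domainPPSize) * domainCPSize);
}

int main()
{
    // Example: PP=2, TP=1, CP=2 -> 4 target ranks and 4 output caches.
    checkMlaSplitSizing(4, 4, /*domainPPSize=*/2, /*domainTPSize=*/1, /*domainCPSize=*/2);
    return 0;
}
```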
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
Actionable comments posted: 4
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp (1)
1034-1034
: Critical: Verify CP-aware token ID calculation
The token ID calculation changes significantly with CP support. The fill function uses blockId * tokensPerBlock while verify uses (blockId * mCpSize + mCpRank) * tokensPerBlock. This asymmetry needs verification. The token ID calculation differs between the fill and verify functions:
- Fill:
int startTokenId = blockId * tokensPerBlock;
- Verify:
int startTokenId = (blockId * mCpSize + mCpRank) * tokensPerBlock;
This appears intentional for CP-aware distribution, but should be validated:
#!/bin/bash
# Search for other startTokenId calculations to ensure consistency
rg "startTokenId\s*=" --type cpp -B 2 -A 2
# Check if there's documentation about this token distribution strategy
rg "CP.*token.*distribution|token.*CP.*distribution" --type cpp
Also applies to: 1141-1141
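For concreteness, a minimal sketch of the assumed mapping (variable names such as cpSize/cpRank are illustrative, and the verify-side indexing assumes round-robin block distribution across CP ranks):

```cpp
#include <cassert>

// Global (pre-split) start token of a block, as used on the fill side.
int fillStartTokenId(int blockId, int tokensPerBlock)
{
    return blockId * tokensPerBlock;
}

// Start token of the localBlockId-th block held by cpRank on the gen side,
// assuming global block g is assigned to CP rank (g % cpSize) and stored
// locally at index g / cpSize.
int verifyStartTokenId(int localBlockId, int cpSize, int cpRank, int tokensPerBlock)
{
    int const globalBlockId = localBlockId * cpSize + cpRank;
    return globalBlockId * tokensPerBlock;
}

int main()
{
    // With cpSize = 2, rank 0 holds global blocks 0, 2, ...; its local block 1
    // is global block 2, so fill and verify agree on the same start token.
    assert(verifyStartTokenId(/*localBlockId=*/1, /*cpSize=*/2, /*cpRank=*/0, /*tokensPerBlock=*/16)
        == fillStartTokenId(/*blockId=*/2, /*tokensPerBlock=*/16));
    return 0;
}
```

Under this assumption the asymmetry is expected: fill writes in global block order, while verify reads each rank's local blocks and maps them back to global token positions.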
🧹 Nitpick comments (15)
cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.h (2)
48-52
: Good: const-correctness and bounds-checked access
Making getPeerPPDomainLayerNum() const and switching to at() improves safety without API friction.
If practical, add a lightweight precondition to guard mDomainPPSize > 0 to avoid UB on modulo when misconfigured.
 int getPeerPPDomainLayerNum(int targetRankIdx) const
 {
-    int ppDomainRankIdx = targetRankIdx % mDomainPPSize;
+    // Defensive: ensure domain size is valid before modulo
+    // (consider TLLM_CHECK if available here)
+    int ppDomainRankIdx = targetRankIdx % mDomainPPSize;
     return mPeerAttentionLayerNumInDomainPP.at(ppDomainRankIdx);
 }
24-33
: Header hygiene: include the standard headers you use
This header uses std::vector but doesn't include <vector>. Prefer self-contained headers. Also, the coding guidelines ask for include guards; consider adding them (or confirm a project-wide exception).
Proposed additions (outside the changed hunk):
 #pragma once
+// Prefer include guards per guidelines if applicable:
+// #ifndef TRTLLM_CACHE_SPLIT_CONCAT_H
+// #define TRTLLM_CACHE_SPLIT_CONCAT_H
+
+#include <vector>
 ...
-} // namespace tensorrt_llm::executor::kv_cache
+} // namespace tensorrt_llm::executor::kv_cache
+
+// #endif // TRTLLM_CACHE_SPLIT_CONCAT_H
Also applies to: 35-47
cpp/tensorrt_llm/runtime/iTensor.cpp (1)
197-213
: 4D print fallback can misformat when shape.d[3] == 1
For 4D tensors with the last dim == 1, the code falls through to the 2D branch and ignores d[2], printing a contiguous slice of length d[1], which is misleading. Add an explicit 4D path for d[3] == 1.
Apply before the current “nbDims >= 2” branch:
-    else if (shape.nbDims == 4 && shape.d[3] > 1)
+    else if (shape.nbDims == 4 && shape.d[3] > 1)
     {
         ...
     }
+    else if (shape.nbDims == 4 && shape.d[3] == 1)
+    {
+        for (int i = 0; i < shape.d[0]; ++i)
+        {
+            for (int j = 0; j < shape.d[1]; ++j)
+            {
+                out << "i=" << i << " j=" << j << ": ";
+                // Print d[2] scalars at k = 0..d[2]-1, d[3] fixed at 0
+                for (int k = 0; k < shape.d[2]; ++k)
+                {
+                    auto const idx = tc::flat_index(shape.d, i, j, k, 0);
+                    // Print as a single value followed by space
+                    out << static_cast<TOutput>(hostData[idx]) << (k + 1 < shape.d[2] ? " " : "");
+                }
+                out << "\n";
+            }
+        }
+    }
Please confirm this matches your intended visualization for 4D tensors where the trailing dimension is degenerate.
cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.cu (5)
573-577
: MLA kernel signature extended with domainCPSize — good, but assert invariants
Adding domainCPSize is correct for CP-aware layout. Add a device-side assert that domainCPSize > 0 to protect modulo/division.
-__global__ void splitKVCacheForMLAKernel(..., int DomainTPSize, int domainCPSize, int kvFactor, uint64_t* prefixLayerNumDevPtr)
+__global__ void splitKVCacheForMLAKernel(..., int DomainTPSize, int domainCPSize, int kvFactor, uint64_t* prefixLayerNumDevPtr)
 {
+    assert(domainCPSize > 0);
606-614
: CP-aware output index assumes round-robin block distribution
Using (blockId % domainCPSize) to select the CP group and blockId / domainCPSize for the per-CP offset encodes a strict round-robin mapping. This can misaddress outputs if getBlockNumAccountingForCP(strict=false) yields uneven assignment (noted in PR). At minimum, guard with bounds in host sizing and add a TODO here referencing the distributor to avoid silent OOB writes.
- Validate at host: each output cache has capacity >= ceil(inputBlockNumSum / domainCPSize) blocks.
- Alternatively, pass a precomputed blockIdToCpRank/table from host to device when non-strict mode is enabled.
Example TODO comment:
-    // We do blockId % domainCPSize because blocks are distributed among cpRanks in a round-robin fashion.
+    // We do blockId % domainCPSize assuming round-robin CP distribution.
+    // TODO(TRTLLM-7731): If non-strict distribution is active, replace with an index map provided by host.
     int outputCacheIdx = (blockId % domainCPSize) * DomainPPSize + rankInDomainPP;
     ...
-    int const blockIdInDomainCP = blockId / domainCPSize;
+    int const blockIdInDomainCP = blockId / domainCPSize; // relies on round-robin
Also applies to: 618-621
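To spell out the assumed layout, a small host-side sketch (illustrative helper only; variable names mirror the kernel but this is not the actual code) of how a global block maps to an output buffer and per-CP offset under strict round-robin distribution:

```cpp
#include <cassert>

// Which output buffer and local offset a global block lands in, assuming
// strict round-robin distribution of blocks across CP ranks (illustrative).
struct MlaOutputIndex
{
    int outputCacheIdx;    // one of the DomainPPSize x domainCPSize output buffers
    int blockIdInDomainCP; // block offset within that buffer
};

MlaOutputIndex mapBlockToOutput(int blockId, int rankInDomainPP, int domainPPSize, int domainCPSize)
{
    assert(domainPPSize > 0 && domainCPSize > 0);
    int const cpGroup = blockId % domainCPSize; // CP rank that owns this block
    return MlaOutputIndex{cpGroup * domainPPSize + rankInDomainPP, blockId / domainCPSize};
}

int main()
{
    // domainPPSize = 2, domainCPSize = 2: global block 3 for PP rank 1 goes to
    // output buffer (3 % 2) * 2 + 1 = 3 at local offset 3 / 2 = 1.
    auto const idx = mapBlockToOutput(3, /*rankInDomainPP=*/1, /*domainPPSize=*/2, /*domainCPSize=*/2);
    assert(idx.outputCacheIdx == 3 && idx.blockIdInDomainCP == 1);
    return 0;
}
```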
1151-1158
: Deriving selfPPRank with CP accounted — good; include CP in debug log
selfPPRank now divides by TP×CP, which is correct. Consider adding domainCPSize to the debug log for easier triage.
-    TLLM_LOG_DEBUG(
-        "splitKVCache - numLayers: %d, headNum: %d, domainPPSize: %d, domainTPSize: %d, "
-        "headsPerDomainTP: %d",
-        numLayers, headNum, DomainPPSize, DomainTPSize, headNumDomainTP);
+    TLLM_LOG_DEBUG(
+        "splitKVCache - numLayers: %d, headNum: %d, domainPPSize: %d, domainTPSize: %d, domainCPSize: %d, "
+        "headsPerDomainTP: %d",
+        numLayers, headNum, DomainPPSize, DomainTPSize, domainCPSize, headNumDomainTP);
Also applies to: 1163-1167
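For orientation, a sketch of the rank decomposition this implies (the inner TP/CP ordering is an assumption for illustration; only the PP divisor of TP×CP matches the comment above):

```cpp
#include <cassert>

// Assumed flattening: rank = ppRank * (tpSize * cpSize) + cpRank * tpSize + tpRank.
// Only the recovery of ppRank by dividing by tpSize * cpSize is implied by the
// change under review; the TP/CP ordering below is illustrative.
struct RankCoords
{
    int ppRank;
    int cpRank;
    int tpRank;
};

RankCoords decomposeRank(int rank, int tpSize, int cpSize)
{
    RankCoords coords{};
    coords.ppRank = rank / (tpSize * cpSize);
    int const rem = rank % (tpSize * cpSize);
    coords.cpRank = rem / tpSize;
    coords.tpRank = rem % tpSize;
    return coords;
}

int main()
{
    // tpSize = 2, cpSize = 2: rank 5 -> pp 1, cp 0, tp 1 under the assumed layout.
    auto const c = decomposeRank(5, /*tpSize=*/2, /*cpSize=*/2);
    assert(c.ppRank == 1 && c.cpRank == 0 && c.tpRank == 1);
    return 0;
}
```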
90-97
: Pointer table buffers typed as INT64 — minor type-safety nit
You allocate the pointer tables with DataType::kINT64 and assert sizeInBytes vs sizeof(T*). That’s fine on 64-bit, but subtly couples to pointer size. Consider using a dedicated byte buffer (kUINT8) and computing sizes via sizeof(void*) for clarity.
Also applies to: 107-110, 1430-1432
18-36
: Missing include for std::accumulate
This TU uses std::accumulate; add <numeric> explicitly to keep the file self-contained.
 #include <sstream>
+#include <numeric>
 #include <string>
 #include <vector>
Also applies to: 107-110
cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp (4)
43-53
: Enhance debug rank environment variable documentation
The getEnvMpiDebugRank() function could benefit from better documentation about the expected values and their meanings.
+// Returns debug rank from TLLM_DEBUG_RANK environment variable
+// -1: debug all ranks
+// -2: no debug output (default)
+// >=0: debug specific rank
 int getEnvMpiDebugRank()
 {
-    // Look-up env variable TLLM_DEBUG_RANK.
     char const* const env = std::getenv("TLLM_DEBUG_RANK");
     if (env == nullptr)
     {
-        return -2; // -1 means all ranks, -2 means no debug rank.
+        return -2; // Default: no debug output
     }
     return std::stoi(env);
 }
155-159
: Refactor repeated debug logging pattern
The debug logging pattern for MPI rank is repeated multiple times. Consider extracting it into a helper function or macro.
-static const int TARGET_RANK = getEnvMpiDebugRank(); // -1 means all ranks.
-if (TARGET_RANK == -1 || mpi::MpiComm::world().getRank() == TARGET_RANK)
+auto const shouldLogDebug = [](int targetRank = getEnvMpiDebugRank()) {
+    return targetRank == -1 || mpi::MpiComm::world().getRank() == targetRank;
+};
+if (shouldLogDebug())
 {
     std::cerr << "[mpiRank:" << mpi::MpiComm::world().getRank() << "]"
               << "[MLACacheFormatter::format] inputKvCacheBlocks[" << blockNum << "]: \n"
               << *it << std::endl;
 }
552-562
: Remove or gate verbose debug output appropriately
The debug output for all output buffers could be very verbose in production. Consider using a more specific debug flag or removing it.
 static const int TARGET_RANK = getEnvMpiDebugRank(); // -1 means all ranks.
 if (TARGET_RANK == -1 || mpi::MpiComm::world().getRank() == TARGET_RANK)
 {
     bufferManager.getStream().synchronize();
-    int blockNum = 0;
-    for (auto const& block : outputBuffers)
-    {
-        std::cerr << "[mpiRank:" << mpi::MpiComm::world().getRank() << "]"
-                  << "[MLACacheFormatter::format] outputBuffers[" << blockNum << "]: \n"
-                  << *block << std::endl;
-        blockNum++;
-    }
+    TLLM_LOG_DEBUG("Completed concat of %zu output buffers for rank %d",
+        outputBuffers.size(), mpi::MpiComm::world().getRank());
 }
409-409
: Fix comment typo
Minor typo in the comment.
-// @B: Maybe no updates are needed because contextCP is always 1?
+// Note: Maybe no updates are needed because contextCP is always 1?
cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp (3)
540-543
: Improve test skip condition with clear message
The test skip condition could provide more informative feedback about why the test is being skipped.
 if (tensorrt_llm::mpi::MpiComm::world().getSize() < nprocs)
 {
-    GTEST_SKIP() << "mpirun with procs=" << nprocs << " is required to run this test.";
+    GTEST_SKIP() << "Test requires " << nprocs << " MPI processes (contextTp=" << contextTp
+                 << " * contextPp=" << contextPp << " * contextCp=" << contextCp
+                 << " + genTp=" << genTp << " * genPp=" << genPp << " * genCp=" << genCp
+                 << "), but only " << tensorrt_llm::mpi::MpiComm::world().getSize() << " available";
 }
1056-1068
: Consider extracting verbose logging to a separate debug utility
The verbose per-value logging in fillBlockData could impact performance even when disabled. Consider moving to a separate debug utility class.
Consider creating a debug utility class to handle verbose logging:
class KVCacheDebugLogger
{
public:
    static bool shouldLog(int targetRank = getEnvMpiDebugRank())
    {
        return targetRank == -1 || mpi::MpiComm::world().getRank() == targetRank;
    }

    static void logBlockValue(const char* op, int blockId, int layerId, int headId, int tokenId, int hiddenId,
        size_t index, double value, nvinfer1::DataType dataType, bool isKey)
    {
        if (!shouldLog())
            return;
        // logging implementation
    }
};
Also applies to: 1080-1092
1320-1320
: Consider adding more iterations for cache reuse testing
The loop is set to only 1 iteration, which doesn't test cache reuse as mentioned in the comment.
 // the second loop is for cache reuse
-for (int i = 0; i < 1; i++)
+for (int i = 0; i < 2; i++) // Test both initial use and cache reuse
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)
cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp (7 hunks)
cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.h (1 hunks)
cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.cu (10 hunks)
cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.h (1 hunks)
cpp/tensorrt_llm/runtime/iTensor.cpp (1 hunks)
cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp (15 hunks)
🧰 Additional context used
📓 Path-based instructions (7)
**/*.{h,hpp,hh,hxx,cpp,cxx,cc,cu,cuh}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.{h,hpp,hh,hxx,cpp,cxx,cc,cu,cuh}
: Namespace closing braces must include a trailing comment with the namespace name (e.g., '} // namespace foo').
Prefer const or constexpr variables over #define for constants.
Declare variables that are not modified after initialization as const.
Avoid magic literals in code; except for 0, nullptr, true, false. Use named constants for comparisons and logic.
Use Allman brace style for formatting.
Place the semicolon of an empty for/while loop on a new line.
Bodies of switch/while/do-while/for must be compound statements (brace-delimited), and if/else must always be followed by brace-delimited statements.
Type names (e.g., classes) must be CamelCase starting with an uppercase letter (e.g., FooBar).
Local variables, methods, and namespaces use lowerCamelCase (e.g., localFooBar).
Non-magic-number global variables that are non-static and not in an anonymous namespace must be lowerCamelCase prefixed with 'g' (e.g., gDontUseGlobalFoos).
Non-magic-number globals that are static or in an anonymous namespace use lowerCamelCase prefixed with 's' (e.g., sMutableStaticGlobal).
Locally visible static variables use lowerCamelCase with 's' prefix (e.g., static std::once_flag sFlag).
Private/protected member variables use 'm' prefix with CamelCase (e.g., mNbFooValues). Public members may omit, but 'm' is encouraged for clarity.
Constants (enums, global constants, static constants, and function-scope magic/literal constants) use uppercase SNAKE_CASE with 'k' prefix (e.g., kDIGIT_NUM).
Function-scope constants that are not magic numbers or literals are named like non-constant variables (e.g., bool const pass = a && b).
If macros are necessary, name them in UPPER_SNAKE_CASE (e.g., FOO_VERSION) and prefer constants over #define.
Use LLVM clang-format; wrap lines at a maximum of 120 columns; use '// clang-format off/on' sparingly with justification.
Use smart pointers for heap allocations; prefer unique_ptr for sole ownership, shared_ptr for shared...
Files:
cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.h
cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.h
cpp/tensorrt_llm/runtime/iTensor.cpp
cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp
cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp
cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.cu
**/*.{cpp,cxx,cc,cu,h,hpp,hh,hxx,cuh}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
C++ filenames should be lowerCamelCase (first letter lowercase) and must be case-insensitive unique within a compilation target.
Files:
cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.h
cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.h
cpp/tensorrt_llm/runtime/iTensor.cpp
cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp
cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp
cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.cu
**/*.{h,hpp,hh,hxx,cpp,cxx,cc,cu,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Use only spaces, no tabs; indent with 4 spaces.
Files:
cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.h
cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.h
cpp/tensorrt_llm/runtime/iTensor.cpp
cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp
cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp
cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.cu
**/*.{h,hpp,hh,hxx}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Document new class interfaces and function prototypes with Doxygen; use //! for single-line and //!< for members.
Files:
cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.h
cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.h
**/*.{h,hpp,hh,hxx,cpp,cxx,cc}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.{h,hpp,hh,hxx,cpp,cxx,cc}
: Prefer anonymous namespaces over 'static' for internal linkage of functions.
All templates (class/function/member/static) must be instantiated at least once; non-POD classes should have private data members.
Files:
cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.h
cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.h
cpp/tensorrt_llm/runtime/iTensor.cpp
cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp
cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp
**/*.{h,hpp,hh,hxx,cuh}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Use include guards named 'TRTLLM_<FILE_NAME_IN_CAPS_WITH_UNDERSCORES>_H' (no leading or trailing underscore; directory names excluded).
Files:
cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.h
cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.h
**/*.{cpp,cxx,cc,h,hpp,hh,hxx,cu,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Prepend the NVIDIA Apache-2.0 copyright header with current year to the top of all source files (e.g., .cpp, .h, .cu, .py).
Files:
cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.h
cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.h
cpp/tensorrt_llm/runtime/iTensor.cpp
cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp
cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp
cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.cu
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (8)
cpp/tensorrt_llm/executor/cache_transmission/cacheSplitConcat.cu (4)
1032-1038
: LGTM: CP included in IRank sizing and MLA output cache count
The checks now enforce |IRanks| = PP×TP×CP and, for MLA, outputCacheNum = PP×CP, which matches the kernel’s expectations.
1175-1182
: Host wiring passes domainCPSize for all vec widths — consistent
All splitKVCacheForMLAKernel instantiations propagate domainCPSize. Good coverage across 16/8/4/2/1-byte paths.
Also applies to: 1199-1206, 1224-1231, 1258-1265, 1287-1294
1372-1380
: Concat path keeps CP collapsed for MLA — verify symmetrical with split
concatKVCache() keeps inputCacheNum = PP for MLA (no CP), which matches your “gen side only” scope. Please confirm tests cover mixed PP×CP on sender and PP-only on receiver to avoid off-by-one in cache indexing.
1477-1487
: Repeat: selfPPRank with CP accounted — good
Same correctness note as in splitKVCache.
cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp (1)
195-209
: CP domain calculation is consistent across codebase
PP rank divisor and buffer size logic correctly incorporate mContextParallelism and getBlockNumAccountingForCP in all instances; no changes needed.
cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp (3)
953-961
: Good defensive programming for over-allocation protection
The isBlockOverallocated function properly handles the known limitation of over-allocation in CP-enabled generation ranks. The implementation correctly identifies blocks that exceed the strict CP accounting.
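For reference, a minimal sketch of what strict CP accounting could look like (hypothetical helpers; the actual isBlockOverallocated / getBlockNumAccountingForCP implementations may differ):

```cpp
#include <cassert>

// Strict accounting under round-robin distribution: CP rank r owns global
// blocks r, r + cpSize, r + 2*cpSize, ..., i.e. ceil((totalBlocks - r) / cpSize)
// blocks. Illustrative only, not the test's actual implementation.
int strictBlockNumForCpRank(int totalBlocks, int cpSize, int cpRank)
{
    return (totalBlocks - cpRank + cpSize - 1) / cpSize;
}

// A local block index at or beyond the strict count would be over-allocated.
bool isBlockOverallocatedSketch(int localBlockId, int totalBlocks, int cpSize, int cpRank)
{
    return localBlockId >= strictBlockNumForCpRank(totalBlocks, cpSize, cpRank);
}

int main()
{
    // 5 global blocks, cpSize = 2: rank 0 strictly owns 3 (blocks 0, 2, 4),
    // rank 1 strictly owns 2 (blocks 1, 3); a 3rd local block on rank 1 is extra.
    assert(strictBlockNumForCpRank(5, 2, 0) == 3);
    assert(strictBlockNumForCpRank(5, 2, 1) == 2);
    assert(isBlockOverallocatedSketch(/*localBlockId=*/2, /*totalBlocks=*/5, /*cpSize=*/2, /*cpRank=*/1));
    return 0;
}
```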
979-983
: LGTM: Proper handling of over-allocated blocks
The generation verification correctly skips over-allocated blocks with appropriate logging. This aligns with the PR's known limitations.
1553-1570
: Good test coverage for CP scenarios
The new test instantiation AsymmetricCaseTestWithCPForMLA provides comprehensive coverage for CP-enabled MLA scenarios with various configurations. This aligns well with the PR objectives.
a95c0a6 to 019f8c1 Compare
/bot run |
PR_Github #18090 [ run ] triggered by Bot |
019f8c1 to 8ef531f Compare
/bot run |
PR_Github #18098 [ run ] triggered by Bot |
PR_Github #18090 [ run ] completed with state |
PR_Github #18098 [ run ] completed with state |
/bot run |
PR_Github #18115 [ run ] triggered by Bot |
PR_Github #18115 [ run ] completed with state |
/bot run --disable-fail-fast |
PR_Github #18243 [ run ] triggered by Bot |
b69a4ef to 65ea160 Compare
/bot run --disable-fail-fast |
PR_Github #18318 [ run ] triggered by Bot |
PR_Github #18318 [ run ] completed with state |
PR_Github #18833 [ run ] triggered by Bot |
PR_Github #18833 [ run ] completed with state |
735a31d to 1389582 Compare
/bot run --disable-fail-fast |
PR_Github #18850 [ run ] triggered by Bot |
PR_Github #18850 [ run ] completed with state |
Signed-off-by: Balaram Buddharaju <[email protected]>
1389582 to 9229e04 Compare
/bot run --disable-fail-fast |
PR_Github #19027 [ run ] triggered by Bot |
/bot run --disable-fail-fast |
PR_Github #19037 [ run ] triggered by Bot |
PR_Github #19027 [ run ] completed with state |
/bot run |
PR_Github #19052 [ run ] triggered by Bot |
PR_Github #19037 [ run ] completed with state |
PR_Github #19052 [ run ] completed with state |
/bot run --disable-fail-fast |
PR_Github #19209 [ run ] triggered by Bot |
PR_Github #19209 [ run ] completed with state |
/bot run |
1 similar comment
/bot run |
PR_Github #19354 [ run ] triggered by Bot |
/bot run |
PR_Github #19368 [ run ] triggered by Bot |
PR_Github #19368 [ run ] completed with state |
…de (NVIDIA#7624) Signed-off-by: Balaram Buddharaju <[email protected]>
Description
This MR adds functionality for KV cache transmission with context parallelism on gen side in disaggregated serving.
Current scope:
No updates to MLACacheFormatter::unformat() and concatKvCacheV2Dispatch() in this MR; those would be needed if prefill also has CP.
Known limitations to be addressed in a future MR:
getBlockNumAccountingForCP() allows for additional block allocation while sending; this will also be addressed when the previous limitation is taken care of.
Couple of design decisions:
For example, let's say seq1 has 4 blocks while seq2 has 6 blocks (first 4 being same as seq1).
Round-robin:
seq1 cache:
cpRank0 - block0, block2
cpRank1 - block1, block3
seq2 cache:
cpRank0 - block0, block2, block4 (first 2 reused from seq1)
cpRank1 - block1, block3, block5 (first 2 reused from seq1)
Contiguous:
seq1 cache:
cpRank0 - block0, block1
cpRank1 - block2, block3
seq2 cache:
cpRank0 - block0, block1, block2 (first 2 reused from seq1)
cpRank1 - block3, block4, block5 (no reuse)
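As a rough sketch of the two options (illustrative only, not the actual distributor code), the owner of a global block under each scheme would be:

```cpp
#include <cassert>

// Which CP rank owns global block `blockId` out of `numBlocks` total.
// Illustrative helpers for the design discussion above, not the real code.

// Round-robin: block g goes to rank g % cpSize, so a shared prefix of blocks
// maps to the same ranks regardless of total sequence length (reuse-friendly).
int roundRobinOwner(int blockId, int cpSize)
{
    return blockId % cpSize;
}

// Contiguous: blocks are split into cpSize chunks, so the owner of a block can
// shift as the sequence grows, breaking prefix reuse across sequences.
int contiguousOwner(int blockId, int numBlocks, int cpSize)
{
    int const blocksPerRank = (numBlocks + cpSize - 1) / cpSize;
    return blockId / blocksPerRank;
}

int main()
{
    // seq1 (4 blocks) vs seq2 (6 blocks), cpSize = 2, looking at block2:
    assert(roundRobinOwner(2, 2) == 0);                  // rank 0 for both sequences
    assert(contiguousOwner(2, /*numBlocks=*/4, 2) == 1); // seq1: rank 1
    assert(contiguousOwner(2, /*numBlocks=*/6, 2) == 0); // seq2: rank 0 -> no reuse
    return 0;
}
```

This matches the example above: with round-robin the first four blocks of seq2 land on the same ranks as seq1, while the contiguous split shifts ownership and loses the reuse.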
Please let me know if something sounds amiss.
Test Coverage
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...
Provide a user friendly way for developers to interact with a Jenkins server.
Run
/bot [-h|--help]
to print this help message. See details below for each supported subcommand.
run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]
Launch build/test pipelines. All previously running jobs will be killed.
--reuse-test (optional) pipeline-id
(OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will be always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.
--disable-reuse-test
(OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.
--disable-fail-fast
(OPTIONAL) : Disable fail fast on build/tests/infra failures.
--skip-test
(OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.
--stage-list "A10-PyTorch-1, xxx"
(OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.
--gpu-type "A30, H100_PCIe"
(OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.
--test-backend "pytorch, cpp"
(OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.
--only-multi-gpu-test
(OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.
--disable-multi-gpu-test
(OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.
--add-multi-gpu-test
(OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.
--post-merge
(OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.
--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx"
(OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".
--detailed-log
(OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.
--debug
(OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.
For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md and the scripts/test_to_stage_mapping.py helper.
kill
kill
Kill all running builds associated with pull request.
skip
skip --comment COMMENT
Skip testing for latest commit on pull request.
--comment "Reason for skipping build/test"
is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.reuse-pipeline
reuse-pipeline
Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.
Summary by CodeRabbit
New Features
Tests