Conversation

@moraxu (Collaborator) commented Jul 11, 2025

Description

As per a NIM request, add a runtime flag to fail fast when the attention window is too large to fit at least one sequence in the KV cache.

Test Coverage

For now, tested with:

trtllm-serve ${ENGINE_PATH} \
    --tokenizer ${CKPT_PATH} \
    --max_seq_len 100 \
    --kv_cache_free_gpu_memory_fraction 0.001 \
    --fail_fast_on_attention_window_too_large

and:

python3 examples/run.py \
    --engine_dir ${ENGINE_PATH} \
    --max_output_len 100 \
    --tokenizer_dir ${CKPT_PATH} \
    --input_text "$LONG_INPUT" \
    --kv_cache_free_gpu_memory_fraction 0.001 \
    --fail_fast_on_attention_window_too_large
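Conceptually, the flag toggles the behavior sketched below. This is an illustrative Python sketch, not TensorRT-LLM's actual implementation; the function name, signature, and error message are assumptions made for illustration only.

```python
def check_attention_window(max_attention_window: int,
                           free_kv_cache_tokens: int,
                           fail_fast: bool) -> int:
    """Hypothetical sketch of the decision the new flag controls:
    if the attention window cannot fit even one sequence in the KV
    cache, either raise immediately (fail fast) or fall back to
    clamping the window to the available capacity."""
    if max_attention_window <= free_kv_cache_tokens:
        # The window fits; nothing to do.
        return max_attention_window
    if fail_fast:
        raise RuntimeError(
            f"Attention window ({max_attention_window} tokens) exceeds "
            f"KV cache capacity ({free_kv_cache_tokens} tokens); "
            f"cannot fit even one sequence.")
    # Default behavior: silently shrink the window to what fits.
    return free_kv_cache_tokens

# With fail-fast disabled, an oversized window is clamped:
assert check_attention_window(4096, 1024, fail_fast=False) == 1024
```

The test commands above exercise exactly this path: a tiny `kv_cache_free_gpu_memory_fraction` makes the KV cache too small for the window, so the run aborts with an error instead of proceeding with a silently reduced window.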

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provides a user-friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

run [--disable-fail-fast --skip-test --stage-list "A10-1, xxx" --gpu-type "A30, H100_PCIe" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-[Post-Merge]-1, xxx"]

Launch build/test pipelines. All previously running jobs will be killed.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests. Will also run L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-[Post-Merge]-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-[Post-Merge]-1, xxx".

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md.

kill

Kill all running builds associated with the pull request.

skip --comment COMMENT

Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous, since lack of user care and validation can break the top of tree.

reuse-pipeline

Reuse a previous pipeline to validate the current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous, since lack of user care and validation can break the top of tree.
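For illustration, the `/bot` grammar documented above can be reconstructed as an argparse parser. This is a hypothetical sketch; the real bot is implemented in the CI infrastructure, and the function name here is an assumption.

```python
import argparse

def build_bot_parser() -> argparse.ArgumentParser:
    # Hypothetical reconstruction of the /bot command grammar described
    # in the help text above; not the actual CI-side implementation.
    parser = argparse.ArgumentParser(prog="/bot")
    sub = parser.add_subparsers(dest="command", required=True)

    run = sub.add_parser("run", help="Launch build/test pipelines.")
    run.add_argument("--disable-fail-fast", action="store_true")
    run.add_argument("--skip-test", action="store_true")
    run.add_argument("--stage-list", metavar='"A10-1, xxx"')
    run.add_argument("--gpu-type", metavar='"A30, H100_PCIe"')
    run.add_argument("--add-multi-gpu-test", action="store_true")
    run.add_argument("--only-multi-gpu-test", action="store_true")
    run.add_argument("--disable-multi-gpu-test", action="store_true")
    run.add_argument("--post-merge", action="store_true")
    run.add_argument("--extra-stage", metavar='"H100_PCIe-[Post-Merge]-1, xxx"')

    sub.add_parser("kill", help="Kill all running builds for the PR.")

    skip = sub.add_parser("skip", help="Skip testing for the latest commit.")
    skip.add_argument("--comment", required=True)

    sub.add_parser("reuse-pipeline", help="Reuse a previous pipeline.")
    return parser

args = build_bot_parser().parse_args(["run", "--stage-list", "A10-1"])
print(args.command, args.stage_list)  # run A10-1
```

Note how the grammar encodes the documented constraints: `skip` requires `--comment`, while the `run` stage filters (`--stage-list`, `--gpu-type`, `--extra-stage`) take comma-separated lists as a single quoted string.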

Summary by CodeRabbit

  • New Features
    • Added a new configuration option to fail immediately if the attention window is too large to fit even a single sequence in the KV cache. This option is available via command-line flags, Python APIs, server configurations, and runtime parameters.
    • Introduced corresponding parameters and documentation in both Python and C++ interfaces, including model runners and executor configurations.
  • Documentation
    • Updated help messages, API docs, and serialization logic to describe and support the new fail-fast behavior.

@moraxu (Collaborator, Author) commented Jul 11, 2025

/bot run

@tensorrt-cicd

PR_Github #11674 [ run ] triggered by Bot

@tensorrt-cicd

PR_Github #11674 [ run ] completed with state ABORTED

@moraxu (Collaborator, Author) commented Jul 12, 2025

/bot run

@tensorrt-cicd

PR_Github #11698 [ run ] triggered by Bot

@tensorrt-cicd

PR_Github #11698 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #8660 completed with status: 'SUCCESS'

@moraxu moraxu marked this pull request as ready for review July 14, 2025 02:54
@moraxu moraxu requested review from a team as code owners July 14, 2025 02:54
@moraxu moraxu requested a review from juney-nvidia July 14, 2025 02:54
@moraxu moraxu changed the title [Feature] Add a runtime flag to enable fail fast when attn window is too large to fit at least one sequence in KV cache [nvbug/5374773] chore: Add a runtime flag to enable fail fast when attn window is too large to fit at least one sequence in KV cache Jul 15, 2025
@jaedeok-nvidia (Collaborator) left a comment
This PR looks good. In particular, it provides an explicit option for very memory-limited scenarios.

@netanel-haber (Collaborator) left a comment

Left comments suggesting changes to the wording of doc comments; it is at the author's discretion whether to resolve them as-is or to accept some form of the changes. Otherwise, LGTM.

@QiJune QiJune requested a review from Superjomn July 15, 2025 09:53
@moraxu moraxu requested a review from nv-guomingz July 22, 2025 17:30
@moraxu (Collaborator, Author) commented Jul 22, 2025

@nv-guomingz could you review it while Chunwei is out? Thanks

@moraxu (Collaborator, Author) commented Jul 24, 2025

/bot run

@tensorrt-cicd

PR_Github #12877 [ run ] triggered by Bot

@tensorrt-cicd

PR_Github #12877 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #9599 completed with status: 'SUCCESS'
Pipeline passed with automatic retried tests. Check the rerun report for details.

@nv-guomingz (Collaborator) left a comment

LGTM for the LLM API part.

@nv-guomingz nv-guomingz force-pushed the dev-mguzek-enable-fail-fast-when-attention-window-too-large-v2 branch from c0914ab to 2e454a2 Compare July 25, 2025 16:03
@nv-guomingz

/bot reuse-pipeline

@nv-guomingz nv-guomingz enabled auto-merge (squash) July 25, 2025 16:04
@moraxu (Collaborator, Author) commented Jul 25, 2025

/bot reuse-pipeline

@tensorrt-cicd

PR_Github #13040 [ reuse-pipeline ] triggered by Bot

@tensorrt-cicd

PR_Github #13040 [ reuse-pipeline ] completed with state SUCCESS
Reusing PR_Github #12877 for commit 2e454a2

@nv-guomingz nv-guomingz merged commit 08d5712 into NVIDIA:main Jul 25, 2025
3 checks passed
NVShreyas pushed a commit to NVShreyas/TensorRT-LLM that referenced this pull request Jul 28, 2025
…tn window is too large to fit at least one sequence in KV cache (NVIDIA#5974)

Signed-off-by: moraxu <[email protected]>
Signed-off-by: Shreyas Misra <[email protected]>
Ransiki pushed a commit to Ransiki/TensorRT-LLM that referenced this pull request Jul 29, 2025
…tn window is too large to fit at least one sequence in KV cache (NVIDIA#5974)

Signed-off-by: moraxu <[email protected]>
Signed-off-by: Ransiki Zhang <[email protected]>
@Linda-Stadter (Collaborator)

Can you add the changes that you did to pybind/executor/executorConfig.cpp also to nanobind/executor/executorConfig.cpp? Thank you!

@moraxu (Collaborator, Author) commented Jul 30, 2025
moraxu commented Jul 30, 2025

Can you add the changes that you did to pybind/executor/executorConfig.cpp also to nanobind/executor/executorConfig.cpp? Thank you!

Added in #6491

lancelly pushed a commit to lancelly/TensorRT-LLM that referenced this pull request Aug 6, 2025
…tn window is too large to fit at least one sequence in KV cache (NVIDIA#5974)

Signed-off-by: moraxu <[email protected]>
Signed-off-by: Lanyu Liao <[email protected]>