[TRTLLM-3602][feat] support nvfp4 model and fp8 kv cache for MLA chunked prefill (Blackwell) #5475

jmydurant · 2025-06-25T09:37:43Z

Description

Support NVFP4 models and FP8 KV cache for MLA chunked prefill. This PR contains some commits based on #4651.

Test Coverage

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provide a user-friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

run [--disable-fail-fast --skip-test --stage-list "A10-1, xxx" --gpu-type "A30, H100_PCIe" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-[Post-Merge]-1, xxx"]

Launch build/test pipelines. All previously running jobs will be killed.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests. Will also run L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-[Post-Merge]-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-[Post-Merge]-1, xxx".

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md.

kill

kill

Kill all running builds associated with pull request.

skip

skip --comment COMMENT

Skip testing for latest commit on pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.
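The help text above reads like the output of an argument parser, so the documented subcommands and flags can be illustrated with a small `argparse` sketch. This is only an illustration under stated assumptions: the function names `build_bot_parser`, `parse_bot_comment`, and `split_csv` are hypothetical, and nothing here is taken from the actual bot implementation. Only the subcommand and flag names come from the help text.

```python
import argparse
import shlex


def build_bot_parser() -> argparse.ArgumentParser:
    """Hypothetical parser that mirrors the documented /bot help text."""
    parser = argparse.ArgumentParser(prog="/bot")
    sub = parser.add_subparsers(dest="command", required=True)

    # "run" accepts the optional flags listed in the help text.
    run = sub.add_parser("run")
    for flag in ("--disable-fail-fast", "--skip-test", "--add-multi-gpu-test",
                 "--only-multi-gpu-test", "--disable-multi-gpu-test",
                 "--post-merge"):
        run.add_argument(flag, action="store_true")
    # These flags take a quoted, comma-separated list as a single string.
    for flag in ("--stage-list", "--gpu-type", "--extra-stage"):
        run.add_argument(flag, default=None)

    sub.add_parser("kill")
    skip = sub.add_parser("skip")
    skip.add_argument("--comment", required=True)  # documented as required
    sub.add_parser("reuse-pipeline")
    return parser


def parse_bot_comment(comment: str) -> argparse.Namespace:
    """Parse a PR comment such as '/bot run --disable-fail-fast'."""
    tokens = shlex.split(comment)  # honours the double-quoted values
    if not tokens or tokens[0] != "/bot":
        raise ValueError("not a /bot command")
    return build_bot_parser().parse_args(tokens[1:])


def split_csv(value):
    """Split a value like 'A30, H100_PCIe' into a list of names."""
    return [item.strip() for item in value.split(",")] if value else []
```

With this sketch, `parse_bot_comment('/bot run --gpu-type "A30, H100_PCIe"')` yields a namespace whose `gpu_type` can be passed to `split_csv`, while `/bot skip` without `--comment` is rejected by the parser.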

jmydurant · 2025-06-25T09:38:12Z

/bot run --disable-fail-fast

tensorrt-cicd · 2025-06-25T09:47:34Z

PR_Github #9861 [ run ] triggered by Bot

tensorrt-cicd · 2025-06-25T19:51:11Z

PR_Github #9861 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #7274 completed with status: 'SUCCESS'
Pipeline passed with automatic retried tests. Check the rerun report for details.

PerkzZheng · 2025-06-26T01:44:36Z

@jmydurant can you help rebase this branch onto the latest main? It seems to contain many commits from #4651.

jmydurant · 2025-06-26T04:01:06Z

/bot run --disable-fail-fast

tensorrt-cicd · 2025-06-26T04:06:44Z

PR_Github #9962 [ run ] triggered by Bot

tensorrt-cicd · 2025-06-26T07:49:57Z

PR_Github #9962 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #7350 completed with status: 'SUCCESS'

jmydurant · 2025-06-26T09:40:26Z

/bot run --disable-fail-fast

tensorrt-cicd · 2025-06-26T09:47:26Z

PR_Github #10022 [ run ] triggered by Bot

tensorrt-cicd · 2025-06-26T12:43:39Z

PR_Github #10022 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #7394 completed with status: 'SUCCESS'
Pipeline passed with automatic retried tests. Check the rerun report for details.

jmydurant · 2025-06-26T13:25:28Z

/bot run --disable-fail-fast

tensorrt-cicd · 2025-06-26T13:31:03Z

PR_Github #10037 [ run ] triggered by Bot

kaiyux · 2025-06-26T14:06:38Z

/bot skip --comment "the last commit is just modifying comment, no need to rerun pipeline"

tensorrt-cicd · 2025-06-26T14:12:14Z

PR_Github #10042 [ skip ] triggered by Bot

tensorrt-cicd · 2025-06-26T14:12:16Z

PR_Github #10037 [ run ] completed with state ABORTED

tensorrt-cicd · 2025-06-26T14:18:06Z

PR_Github #10042 [ skip ] completed with state SUCCESS
Skipping testing for commit 3604b9c

…ked prefill (Blackwell) (NVIDIA#5475) Signed-off-by: Mingyang Jiang <[email protected]>

kaiyux mentioned this pull request Jun 26, 2025

feat: chunked prefill for MLA (Blackwell) #4651

Merged

kaiyux requested review from hlu1, yuxianq, zhhuang-nv and PerkzZheng June 26, 2025 01:16

jmydurant added 5 commits June 26, 2025 11:28

draft: support fp8 kvcache for mla chunked prefill

004e15a

Signed-off-by: Mingyang Jiang <[email protected]>

test: add cpp unit test for fp8 kvcache chunked prefill mla

fbf696e

Signed-off-by: Mingyang Jiang <[email protected]>

fix: fix bugs and pass cpp UT

895c31b

Signed-off-by: Mingyang Jiang <[email protected]>

fix: fix bugs and pass pytorch accuracy test

673addf

Signed-off-by: Mingyang Jiang <[email protected]>

chore: modify test case and update some code by latest PR comments

295b1ac

Signed-off-by: Mingyang Jiang <[email protected]>

jmydurant force-pushed the user/WIP/fp8kvcache branch from 9f3e9e3 to 295b1ac June 26, 2025 03:28

chore: disable chunked prefill for other GPU except Blackwell

576b1fa

Signed-off-by: Mingyang Jiang <[email protected]>

kaiyux reviewed Jun 26, 2025

examples/models/core/deepseek_v3/README.md (resolved)

tensorrt_llm/_torch/attention_backend/interface.py (outdated, resolved)

tensorrt_llm/_torch/attention_backend/interface.py (outdated, resolved)

PerkzZheng reviewed Jun 26, 2025

tests/integration/defs/accuracy/test_llm_api_pytorch.py (outdated, resolved)

kaiyux marked this pull request as ready for review June 26, 2025 08:21

kaiyux requested review from a team as code owners June 26, 2025 08:21

kaiyux changed the title ~~[TRTLLM-3602][feat]Draft: support nvfp4 model and fp8 kv cache for MLA chunked prefill (Blackwell)~~ [TRTLLM-3602][feat] support nvfp4 model and fp8 kv cache for MLA chunked prefill (Blackwell) Jun 26, 2025

kaiyux reviewed Jun 26, 2025

tests/integration/defs/accuracy/test_llm_api_pytorch.py (outdated, resolved)

chore: remove unused params, modify test file

e391ce8

Signed-off-by: Mingyang Jiang <[email protected]>

kaiyux approved these changes Jun 26, 2025

PerkzZheng approved these changes Jun 26, 2025

yuxianq reviewed Jun 26, 2025

tensorrt_llm/_torch/attention_backend/interface.py (outdated, resolved)

yuxianq reviewed Jun 26, 2025

tests/integration/defs/accuracy/test_llm_api_pytorch.py (resolved)

chore: modify chunk size description

3604b9c

Signed-off-by: Mingyang Jiang <[email protected]>

yuxianq approved these changes Jun 26, 2025

kaiyux enabled auto-merge (squash) June 26, 2025 14:07

kaiyux merged commit 8836990 into NVIDIA:main Jun 26, 2025
3 checks passed

akhoroshev mentioned this pull request Jul 3, 2025

[feature request] Chunked Prefill for MLA SM90 #5708

Open

dominicshanshan pushed a commit to dominicshanshan/TensorRT-LLM that referenced this pull request Jul 9, 2025

[TRTLLM-3602][feat] support nvfp4 model and fp8 kv cache for MLA chun…

6be8f61

…ked prefill (Blackwell) (NVIDIA#5475) Signed-off-by: Mingyang Jiang <[email protected]>

dominicshanshan pushed a commit to dominicshanshan/TensorRT-LLM that referenced this pull request Jul 10, 2025

[TRTLLM-3602][feat] support nvfp4 model and fp8 kv cache for MLA chun…

0faefaf

…ked prefill (Blackwell) (NVIDIA#5475) Signed-off-by: Mingyang Jiang <[email protected]>

dominicshanshan pushed a commit to dominicshanshan/TensorRT-LLM that referenced this pull request Jul 10, 2025

[TRTLLM-3602][feat] support nvfp4 model and fp8 kv cache for MLA chun…

074c30a

…ked prefill (Blackwell) (NVIDIA#5475) Signed-off-by: Mingyang Jiang <[email protected]>

dominicshanshan pushed a commit to dominicshanshan/TensorRT-LLM that referenced this pull request Jul 10, 2025

[TRTLLM-3602][feat] support nvfp4 model and fp8 kv cache for MLA chun…

c6444b2

…ked prefill (Blackwell) (NVIDIA#5475) Signed-off-by: Mingyang Jiang <[email protected]>

dominicshanshan pushed a commit to dominicshanshan/TensorRT-LLM that referenced this pull request Jul 10, 2025

[TRTLLM-3602][feat] support nvfp4 model and fp8 kv cache for MLA chun…

c77a273

…ked prefill (Blackwell) (NVIDIA#5475) Signed-off-by: Mingyang Jiang <[email protected]>

dominicshanshan pushed a commit to dominicshanshan/TensorRT-LLM that referenced this pull request Jul 11, 2025

[TRTLLM-3602][feat] support nvfp4 model and fp8 kv cache for MLA chun…

f47e24e

…ked prefill (Blackwell) (NVIDIA#5475) Signed-off-by: Mingyang Jiang <[email protected]>

dominicshanshan pushed a commit to dominicshanshan/TensorRT-LLM that referenced this pull request Jul 11, 2025

[TRTLLM-3602][feat] support nvfp4 model and fp8 kv cache for MLA chun…

df69978

…ked prefill (Blackwell) (NVIDIA#5475) Signed-off-by: Mingyang Jiang <[email protected]>

dominicshanshan pushed a commit to dominicshanshan/TensorRT-LLM that referenced this pull request Jul 11, 2025

[TRTLLM-3602][feat] support nvfp4 model and fp8 kv cache for MLA chun…

a820cee

…ked prefill (Blackwell) (NVIDIA#5475) Signed-off-by: Mingyang Jiang <[email protected]>
