feat: chunked prefill for MLA (Blackwell) #4651

jmydurant · 2025-05-26T04:13:03Z

[TRTLLM-3602][feat]Draft: chunked prefill for MLA (Blackwell)

Description

This PR adds support for chunked context (chunked prefill) for MLA. To save GPU memory, the KV cache is split into small chunks, one chunk is processed per round, and the per-chunk attention outputs are merged using their LSE (log-sum-exp) values.

This is a draft version and is still under construction.
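The LSE-based merge mentioned in the description can be illustrated with a minimal pure-Python sketch. It is not the kernel from this PR: the function names (`attention_chunk`, `merge_chunks`, `chunked_attention`) are hypothetical, it handles a single query and a single head, and it leaves out causal masking, softmax scaling, and the MLA latent projections. It only shows why keeping the log-sum-exp of each chunk is enough to combine partial outputs exactly.

```python
import math


def attention_chunk(q, keys, values):
    """Softmax attention of one query over one KV chunk.

    Returns the chunk-local output and the log-sum-exp (LSE) of the scores.
    Hypothetical helper for illustration only.
    """
    scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scores]
    denom = sum(weights)
    dim = len(values[0])
    out = [sum(w * v[d] for w, v in zip(weights, values)) / denom
           for d in range(dim)]
    return out, m + math.log(denom)


def merge_chunks(out_a, lse_a, out_b, lse_b):
    """Combine two partial attention outputs using their LSE values."""
    m = max(lse_a, lse_b)
    w_a = math.exp(lse_a - m)  # relative softmax mass of chunk A
    w_b = math.exp(lse_b - m)  # relative softmax mass of chunk B
    total = w_a + w_b
    merged = [(w_a * a + w_b * b) / total for a, b in zip(out_a, out_b)]
    return merged, m + math.log(total)


def chunked_attention(q, keys, values, chunk_size):
    """Process the KV sequence chunk by chunk, merging after each round."""
    out, lse = attention_chunk(q, keys[:chunk_size], values[:chunk_size])
    for start in range(chunk_size, len(keys), chunk_size):
        end = start + chunk_size
        part_out, part_lse = attention_chunk(q, keys[start:end], values[start:end])
        out, lse = merge_chunks(out, lse, part_out, part_lse)
    return out, lse
```

Under these assumptions, `chunked_attention(q, keys, values, chunk_size)` returns the same output (up to floating-point error) as calling `attention_chunk` once over the full sequence, while only one chunk of keys and values has to be materialized per round.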

Test Coverage

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provide a user friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

run [--disable-fail-fast --skip-test --stage-list "A10-1, xxx" --gpu-type "A30, H100_PCIe" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-[Post-Merge]-1, xxx"]

Launch build/test pipelines. All previously running jobs will be killed.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests. Will also run L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-[Post-Merge]-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-[Post-Merge]-1, xxx".

kill

kill

Kill all running builds associated with pull request.

skip

skip --comment COMMENT

Skip testing for latest commit on pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

poweiw · 2025-06-05T20:30:41Z

Hello @jmydurant! Please ignore if I'm wrong but could you finish the NVIDIA github onboarding process? We're not seeing you in the NVIDIA members and this will cause false failures for community identifiers.

kaiyux · 2025-06-17T01:28:55Z

/bot run

tensorrt-cicd · 2025-06-17T01:34:55Z

PR_Github #9079 [ run ] triggered by Bot

jmydurant · 2025-06-18T06:53:37Z

/bot run --disable-fail-fast

tensorrt-cicd · 2025-06-18T06:56:06Z

PR_Github #9324 [ run ] triggered by Bot

tensorrt-cicd · 2025-06-18T06:58:47Z

PR_Github #9327 [ run ] triggered by Bot

tensorrt-cicd · 2025-06-18T06:58:49Z

PR_Github #9324 [ run ] completed with state ABORTED

jmydurant · 2025-06-18T08:45:34Z

/bot kill

jmydurant · 2025-06-25T02:44:41Z

/bot run --disable-fail-fast

tensorrt-cicd · 2025-06-25T02:54:37Z

PR_Github #9793 [ run ] triggered by Bot

kaiyux · 2025-06-25T03:49:30Z

@NVIDIA/trt-llm-torch-devs can you help review this PR as well? Thanks.

jmydurant · 2025-06-25T05:41:12Z

/bot run --disable-fail-fast

tensorrt-cicd · 2025-06-25T05:46:23Z

PR_Github #9810 [ run ] triggered by Bot

tensorrt-cicd · 2025-06-25T05:46:24Z

PR_Github #9793 [ run ] completed with state ABORTED

tensorrt-cicd · 2025-06-25T11:04:06Z

PR_Github #9810 [ run ] completed with state FAILURE
/LLM/main/L0_MergeRequest_PR pipeline #7237 completed with status: 'FAILURE'

jmydurant · 2025-06-25T11:15:38Z

/bot run --disable-fail-fast

tensorrt-cicd · 2025-06-25T11:21:21Z

PR_Github #9872 [ run ] triggered by Bot

tensorrt-cicd · 2025-06-25T14:53:42Z

PR_Github #9872 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #7284 completed with status: 'SUCCESS'

Shang-Pin · 2025-06-25T21:44:39Z

Very excited about chunked prefill support for MLA. I want to report a bug I saw while testing this branch. The first request returns successfully, but subsequent requests never terminate and keep running forever.

Edit: I think it might be caused by enabling the fp8 kv cache; it works when that is disabled.

kaiyux · 2025-06-26T01:00:34Z

@Shang-Pin Thanks a lot for your attention and for reporting the issue! Support for the fp8 kv cache with chunked MLA on Blackwell is going to be added in #5475.

jmydurant force-pushed the user/mingyangj/mlaChunkedPrefill branch 2 times, most recently from 76c9775 to 4241061 Compare May 29, 2025 03:59

poweiw added the "Community want to contribute" label (PRs initiated from Community) Jun 5, 2025

poweiw removed the "Community want to contribute" label (PRs initiated from Community) Jun 6, 2025

jmydurant force-pushed the user/mingyangj/mlaChunkedPrefill branch 2 times, most recently from 73d5c79 to 13451ea Compare June 11, 2025 10:45

kaiyux marked this pull request as ready for review June 12, 2025 07:01

kaiyux requested review from a team as code owners June 12, 2025 07:01

kaiyux requested review from dongxuy04 and juney-nvidia June 12, 2025 07:01

kaiyux changed the title ~~Draft: chunked prefill for MLA (Blackwell)~~ feat: chunked prefill for MLA (Blackwell) Jun 12, 2025

kaiyux requested review from zongfeijing, PerkzZheng, yuxianq and zhhuang-nv June 12, 2025 07:01

kaiyux reviewed Jun 13, 2025

tensorrt_llm/_torch/modules/attention.py (outdated, resolved)

tensorrt_llm/_torch/pyexecutor/py_executor_creator.py (outdated, resolved)

PerkzZheng reviewed Jun 13, 2025

tensorrt_llm/_torch/modules/attention.py (outdated, resolved)

tensorrt_llm/_torch/modules/attention.py (outdated, resolved)

tensorrt_llm/_torch/pyexecutor/py_executor_creator.py (outdated, resolved)

PerkzZheng approved these changes Jun 13, 2025

zhhuang-nv reviewed Jun 13, 2025

jmydurant force-pushed the user/mingyangj/mlaChunkedPrefill branch from 13451ea to c0c6398 Compare June 16, 2025 04:46

jmydurant force-pushed the user/mingyangj/mlaChunkedPrefill branch from 6f8f2b0 to 6eec709 Compare June 18, 2025 05:45

jmydurant added 2 commits June 25, 2025 10:42

fix: fix after rebase, see 13eef64 for details

8f7e58b

Signed-off-by: Mingyang Jiang <[email protected]>

fix: correct test list

7da4988

Signed-off-by: Mingyang Jiang <[email protected]>

jmydurant force-pushed the user/mingyangj/mlaChunkedPrefill branch from 66ab129 to 7da4988 Compare June 25, 2025 02:43

kaiyux reviewed Jun 25, 2025

tensorrt_llm/_torch/attention_backend/interface.py (resolved)

tensorrt_llm/_torch/attention_backend/interface.py (resolved)

kaiyux reviewed Jun 25, 2025

tests/integration/defs/accuracy/test_llm_api_pytorch.py (resolved)

tests/integration/defs/accuracy/test_llm_api_pytorch.py (resolved)

jmydurant mentioned this pull request Jun 25, 2025

[TRTLLM-3602][feat] support nvfp4 model and fp8 kv cache for MLA chunked prefill (Blackwell) #5475

Merged

kaiyux approved these changes Jun 25, 2025

QiJune approved these changes Jun 26, 2025

kaiyux merged commit 578dbc8 into NVIDIA:main Jun 26, 2025
3 checks passed

akhoroshev mentioned this pull request Jul 3, 2025

[feature request] Chunked Prefill for MLA SM90 #5708

Open

dominicshanshan pushed a commit to dominicshanshan/TensorRT-LLM that referenced this pull request Jul 9, 2025

feat: chunked prefill for MLA (Blackwell) (NVIDIA#4651)

88e7862

Signed-off-by: Mingyang Jiang <[email protected]>

dominicshanshan pushed a commit to dominicshanshan/TensorRT-LLM that referenced this pull request Jul 10, 2025

feat: chunked prefill for MLA (Blackwell) (NVIDIA#4651)

fc1f76a

Signed-off-by: Mingyang Jiang <[email protected]>

dominicshanshan pushed a commit to dominicshanshan/TensorRT-LLM that referenced this pull request Jul 10, 2025

feat: chunked prefill for MLA (Blackwell) (NVIDIA#4651)

7b96d08

Signed-off-by: Mingyang Jiang <[email protected]>

dominicshanshan pushed a commit to dominicshanshan/TensorRT-LLM that referenced this pull request Jul 10, 2025

feat: chunked prefill for MLA (Blackwell) (NVIDIA#4651)

bb1a56e

Signed-off-by: Mingyang Jiang <[email protected]>

dominicshanshan pushed a commit to dominicshanshan/TensorRT-LLM that referenced this pull request Jul 10, 2025

feat: chunked prefill for MLA (Blackwell) (NVIDIA#4651)

fae2314

Signed-off-by: Mingyang Jiang <[email protected]>

dominicshanshan pushed a commit to dominicshanshan/TensorRT-LLM that referenced this pull request Jul 11, 2025

feat: chunked prefill for MLA (Blackwell) (NVIDIA#4651)

5076b18

Signed-off-by: Mingyang Jiang <[email protected]>

dominicshanshan pushed a commit to dominicshanshan/TensorRT-LLM that referenced this pull request Jul 11, 2025

feat: chunked prefill for MLA (Blackwell) (NVIDIA#4651)

4d998ed

Signed-off-by: Mingyang Jiang <[email protected]>

dominicshanshan pushed a commit to dominicshanshan/TensorRT-LLM that referenced this pull request Jul 11, 2025

feat: chunked prefill for MLA (Blackwell) (NVIDIA#4651)

596922c

Signed-off-by: Mingyang Jiang <[email protected]>
