Conversation

@MasterJH5574
Contributor

This PR introduces two changes to the (paged) KV cache:

The first is introducing a RoPE mode to PagedKVCache. Right now there are two modes: normal/inline. In "normal" mode, RoPE is applied to the input Q/K/V data before the K/V data is appended to the cache. In "inline" mode, the input K/V data is appended to the cache directly, and RoPE is applied on the fly inside the attention kernel. The main purpose of introducing the RoPE mode is to balance the need for on-the-fly RoPE (in cases like Mistral, where positions can change) against attention kernel performance.
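
To make the distinction concrete, here is a minimal NumPy sketch (not the actual TVM kernels; all names are illustrative assumptions) of where the rotation happens in each mode: "normal" rotates K once before it is written to the cache, while "inline" stores the raw K and leaves the rotation to the attention kernel, which can use whatever positions are current at attention time.

```python
import numpy as np

def apply_rope(x: np.ndarray, pos: np.ndarray, theta: float = 10000.0) -> np.ndarray:
    """Rotate x of shape (seq_len, num_heads, head_dim) by positions pos of shape (seq_len,)."""
    half = x.shape[-1] // 2
    freqs = theta ** (-np.arange(half) / half)   # (half,)
    angles = pos[:, None] * freqs[None, :]       # (seq_len, half)
    cos = np.cos(angles)[:, None, :]             # broadcast over heads
    sin = np.sin(angles)[:, None, :]
    x1, x2 = x[..., :half], x[..., half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

# "normal" mode: rotate K with its append-time position, then store the rotated K.
def append_kv_normal(k_cache: list, k: np.ndarray, pos: np.ndarray) -> None:
    k_cache.append(apply_rope(k, pos))

# "inline" mode: store the raw K; the attention kernel applies RoPE on the fly
# with whatever positions are current at attention time, which is what
# Mistral-style sliding windows need when positions shift after appending.
def append_kv_inline(k_cache: list, k: np.ndarray) -> None:
    k_cache.append(k)
```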

The second is introducing a new interface `AttentionWithFusedQKV` to the KV cache. This function takes input QKV data fused along the head dimension and splits it into separate Q/K/V internally (note: this requires an external workspace to be passed in). We introduce this function because, in practice, when the RoPE mode is "normal", fusing the QKV split with the RoPE application offers better performance.
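
As a rough illustration of the fused entry point, here is a hedged NumPy sketch (reusing `apply_rope` from the sketch above; the signature and layout are assumptions for illustration, not the actual runtime API) of splitting a head-dimension-fused QKV tensor and applying RoPE to Q/K when the mode is "normal":

```python
import numpy as np

def attention_with_fused_qkv_sketch(qkv: np.ndarray,
                                    num_q_heads: int,
                                    num_kv_heads: int,
                                    pos: np.ndarray,
                                    rope_mode: str = "normal"):
    """qkv has shape (seq_len, num_q_heads + 2 * num_kv_heads, head_dim),
    i.e. Q, K and V fused along the head dimension."""
    q = qkv[:, :num_q_heads, :]
    k = qkv[:, num_q_heads:num_q_heads + num_kv_heads, :]
    v = qkv[:, num_q_heads + num_kv_heads:, :]
    if rope_mode == "normal":
        # The split and the rotation touch the same data, so fusing them into
        # one kernel is what makes the fused-QKV entry point pay off.
        q, k = apply_rope(q, pos), apply_rope(k, pos)
    # ... append k/v to the paged cache and run attention; in "inline" mode the
    # attention kernel applies RoPE on the fly instead ...
    return q, k, v
```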

MasterJH5574 added a commit to MasterJH5574/mlc-llm that referenced this pull request on Jan 23, 2024:
Following apache/tvm#16456, this PR leverages the RoPE mode and AttentionWithFusedQKV function in llama.
MasterJH5574 marked this pull request as draft on January 23, 2024, 16:40.
MasterJH5574 force-pushed the tvm-dev/2024-01-23-kv-cache-rope-mode branch from 12ad561 to 95ccc52 on January 23, 2024, 20:44.
MasterJH5574 marked this pull request as ready for review on January 23, 2024, 21:17.
tqchen merged commit 20b08a5 into apache:main on Jan 24, 2024.
MasterJH5574 added a commit to MasterJH5574/mlc-llm that referenced this pull request on Jan 24, 2024:
Following apache/tvm#16456, this PR leverages the RoPE mode and AttentionWithFusedQKV function in llama.
MasterJH5574 added a commit to MasterJH5574/mlc-llm that referenced this pull request on Jan 25, 2024:
Following apache/tvm#16456, this PR leverages the RoPE mode and AttentionWithFusedQKV function in llama.
MasterJH5574 added a commit to mlc-ai/mlc-llm that referenced this pull request on Jan 25, 2024:
Following apache/tvm#16456, this PR leverages the RoPE mode and AttentionWithFusedQKV function in llama.
smickey040404 added a commit to smickey040404/mlc-llm that referenced this pull request on Feb 11, 2025:
Following apache/tvm#16456, this PR leverages the RoPE mode and AttentionWithFusedQKV function in llama.
tristankincaid added a commit to tristankincaid/mlc-llm that referenced this pull request on Feb 16, 2025:
Following apache/tvm#16456, this PR leverages the RoPE mode and AttentionWithFusedQKV function in llama.