@WoosukKwon WoosukKwon commented May 2, 2023

This PR refactors the attention kernels, making the helper functions more modular and pruning unused code. This should make it easier to add support for new data types such as bfloat16.

In addition, this PR reduces the computation overhead of the attention kernel by computing logits * V in reduced precision (i.e., fp16) instead of full precision. This is consistent with FasterTransformer's implementation.
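As a rough illustration of the two ideas above (helpers parameterized by data type, and a choice of precision for the logits * V accumulation), here is a minimal host-side C++ sketch. It is not the PR's kernel code: `weighted_sum`, `ScalarT`, and `AccT` are hypothetical names, and because standard C++ has no portable fp16 type, `float` stands in for the reduced-precision type and `double` for the full-precision one.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper for illustration only, not taken from the PR.
// ScalarT: storage type of the inputs (fp16 or bfloat16 in a real kernel).
// AccT:    type used for the multiply-accumulate. Picking a narrower AccT
//          trades some accuracy for cheaper arithmetic.
template <typename ScalarT, typename AccT>
AccT weighted_sum(const std::vector<ScalarT>& logits,
                  const std::vector<ScalarT>& values) {
  AccT acc = AccT(0);
  // Iterate over the shorter of the two inputs to stay in bounds.
  const std::size_t n =
      logits.size() < values.size() ? logits.size() : values.size();
  for (std::size_t i = 0; i < n; ++i) {
    acc += static_cast<AccT>(logits[i]) * static_cast<AccT>(values[i]);
  }
  return acc;
}
```

Calling `weighted_sum<float, float>(logits, values)` and `weighted_sum<float, double>(logits, values)` on the same inputs runs the same loop with a narrower or wider accumulator. Supporting another data type in this style would only require a new `ScalarT`, which is the kind of modularity the refactor is aiming for.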

@WoosukKwon WoosukKwon requested a review from zhuohan123 May 2, 2023 07:38
@@ -0,0 +1,5 @@
#pragma once
Member

Let's use the define guard instead of #pragma once per Google's C++ style guide :)

Collaborator Author

Both options have pros and cons. I think it's safe to use #pragma once, because it is commonly used in DL projects such as PyTorch and FasterTransformer.
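For readers unfamiliar with the two options discussed in this thread, the sketch below shows what a define guard does. The macro name `ATTENTION_DTYPES_H_` and the function `guarded_header_marker` are hypothetical, since the header's actual name is not shown here. The guarded body appears twice to simulate the same header being included twice in one translation unit.

```cpp
// Option 1: define guard, as recommended by Google's C++ style guide.
// The macro name is hypothetical.
#ifndef ATTENTION_DTYPES_H_
#define ATTENTION_DTYPES_H_

inline int guarded_header_marker() { return 1; }

#endif  // ATTENTION_DTYPES_H_

// Simulated second #include of the same header: the guard macro is already
// defined, so the preprocessor skips this body and nothing is redefined.
#ifndef ATTENTION_DTYPES_H_
#define ATTENTION_DTYPES_H_

inline int guarded_header_marker() { return 2; }

#endif  // ATTENTION_DTYPES_H_

// Option 2: a header relying on the non-standard but widely supported
// directive would instead begin with the single line:
//   #pragma once
```

Only the first definition survives preprocessing, so `guarded_header_marker()` returns 1. `#pragma once` achieves the same effect without a macro name to keep unique, at the cost of not being part of the C++ standard.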

@WoosukKwon WoosukKwon requested a review from zhuohan123 May 3, 2023 06:27

WoosukKwon commented May 3, 2023

Performance (batch_size=8, context_len=512, num_heads=40, head_size=128):

Before: 83.4 us
After: 82.5 us

There's a slight improvement in kernel performance (about 1%) due to the use of fp16 in logits * values.
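To put the two timings in relative terms, a small helper like the one below can be used. The function name `percent_improvement` is hypothetical; the inputs are the latencies reported above.

```cpp
// Relative reduction in latency, as a percentage of the "before" time.
inline double percent_improvement(double before_us, double after_us) {
  return (before_us - after_us) / before_us * 100.0;
}
```

With `before_us = 83.4` and `after_us = 82.5`, this gives roughly 1.08%, which matches the description of the gain as slight.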

dtrifiro pushed a commit to dtrifiro/vllm that referenced this pull request Jun 18, 2024
yukavio pushed a commit to yukavio/vllm that referenced this pull request Jul 3, 2024
dllehr-amd pushed a commit to dllehr-amd/vllm that referenced this pull request Jul 22, 2024
@alixiaodi alixiaodi mentioned this pull request Aug 2, 2024
wuhuikx pushed a commit to wuhuikx/vllm that referenced this pull request Mar 27, 2025
heheda12345 pushed a commit to heheda12345/vllm that referenced this pull request Sep 29, 2025