[Attention] MLA get rid of materialization #14770
Conversation
Makes sense for the memory savings.
Hello, may I know what the parallel configuration of the baseline and your test is? I am also exploring the comparison before and after enabling matrix absorption preconditioning.
Based on these calculations:
https://docs.google.com/spreadsheets/d/17eoqEbhblvtNsRRlFSjCQnEXZiBxtLgZGKD4IgZUz38/edit?usp=sharing
It's actually better to just not materialize the absorbed W_Q_UK and W_UV_O, since that reduces memory usage (and total FLOPs), and to instead compute using sequential matmuls. One issue is that we do not have an FP8 BMM (which is needed when the absorbed matrices are not materialized; materializing them allowed us to bypass this), so we instead decompress the matrices involved in the BMM to fp16/bf16. This also has the added benefit of dramatically reducing complexity.

This PR is needed for DP attention, since without it the weight materialization eats up too much GPU memory for DP to be beneficial.
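Below is a minimal, self-contained sketch (not code from this PR) of the idea: keep the un-absorbed factors and apply them as two sequential (batched) matmuls at runtime instead of materializing the absorbed weight up front. All names and dimensions here (W_UQ, W_UK, B, H, etc.) are illustrative stand-ins, not vLLM's or DeepSeek's actual parameter layout.

```python
# Sketch: materialized absorbed weight vs. sequential matmuls on the factors.
import torch

torch.manual_seed(0)

B = 4          # decode tokens in the batch
H = 8          # attention heads
D_Q = 64       # per-head "nope" query/key dim
D_KV = 32      # compressed KV latent dim (kv_lora_rank)
D_MODEL = 256  # hidden size

# Factorized weights, kept in their original (un-absorbed) form.
W_UQ = torch.randn(D_MODEL, H, D_Q) / D_MODEL ** 0.5  # hidden -> per-head q_nope
W_UK = torch.randn(H, D_KV, D_Q) / D_KV ** 0.5        # latent -> per-head k_nope

x = torch.randn(B, D_MODEL)

# Option A: materialize the absorbed matrix (what this PR gets rid of).
# W_Q_UK maps hidden states straight into the KV latent space, but it is an
# extra (D_MODEL, H, D_KV) tensor that has to be stored on top of the factors.
W_Q_UK = torch.einsum("dhq,hcq->dhc", W_UQ, W_UK)      # (D_MODEL, H, D_KV)
q_latent_a = torch.einsum("bd,dhc->bhc", x, W_Q_UK)    # (B, H, D_KV)

# Option B: two sequential matmuls on the un-materialized factors.
# If W_UK were stored in FP8 it would be dequantized to bf16/fp16 right before
# the bmm (since there is no FP8 bmm); this sketch just uses fp32 throughout.
q_nope = torch.einsum("bd,dhq->bhq", x, W_UQ)                           # (B, H, D_Q)
q_latent_b = torch.bmm(q_nope.transpose(0, 1), W_UK.transpose(1, 2))    # (H, B, D_KV)
q_latent_b = q_latent_b.transpose(0, 1)                                 # (B, H, D_KV)

torch.testing.assert_close(q_latent_a, q_latent_b, rtol=1e-4, atol=1e-4)
print("materialized and sequential paths agree")
```

The same reasoning applies on the output side (the W_UV / W_O pair): the two paths are numerically equivalent, but the sequential form avoids storing the large absorbed tensor, which is what frees up the GPU memory needed for DP attention.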
This PR (minor regression at short context, but it seems worth it given that the saved memory boosts long-context performance and enables DP attention; the short-context measurements are also a bit noisy)
Baseline (#14769)
Correctness tests: