
Conversation

@LucasWilkinson (Collaborator) commented Mar 13, 2025

Enable VLLM_MLA_PERFORM_MATRIX_ABSORPTION=0 for fp8 by simply up-converting the weights to fp16. Also switch from einsum to bmm, which makes the required kernels more obvious and will make it easier to integrate an fp8 bmm later (we would need block-scale support for 64x128 or 128x64; I still need to work through the details).
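A rough sketch of the einsum→bmm switch described above. Shapes are illustrative only (not the real DeepSeek-R1 dimensions), and float32 is used in place of the fp8→fp16 up-convert so the example runs on any backend:

```python
import torch

# Illustrative shapes only (not the real DeepSeek-R1 dims):
# B tokens, H heads, D = qk_nope_head_dim, L = kv_lora_rank
B, H, D, L = 4, 8, 16, 32

q_nope = torch.randn(B, H, D)
# In this PR the fp8 weight would be up-converted to fp16 before the
# matmul; float32 is used here so the example runs anywhere.
W_UK = torch.randn(H, D, L)

# einsum formulation (the previous style)
out_einsum = torch.einsum("bhd,hdl->bhl", q_nope, W_UK)

# equivalent bmm formulation: fold the head dim into the batch dim,
# making the underlying batched GEMM (and a future fp8 bmm) explicit
out_bmm = torch.bmm(q_nope.transpose(0, 1), W_UK).transpose(0, 1)

assert torch.allclose(out_einsum, out_bmm, atol=1e-5)
```

The bmm form spells out exactly which batched-GEMM kernel is needed, which is the stated motivation for the change.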

Based on these calculations (may be bugged): https://docs.google.com/spreadsheets/d/17eoqEbhblvtNsRRlFSjCQnEXZiBxtLgZGKD4IgZUz38/edit?usp=sharing

VLLM_MLA_PERFORM_MATRIX_ABSORPTION=0 should introduce 143% memory overhead, while VLLM_MLA_PERFORM_MATRIX_ABSORPTION=1 (the current default) should introduce 318% memory overhead.

We will likely want to make VLLM_MLA_PERFORM_MATRIX_ABSORPTION=0 the default.

With VLLM_MLA_PERFORM_MATRIX_ABSORPTION=0:

```
lm_eval --model local-completions --tasks gsm8k --model_args model=/home/vllm-dev/DeepSeek-R1,base_url=http://127.0.0.1:8000/v1/completions,num_concurrent=5,max_retries=3,tokenized_requests=False --limit 100
```

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  | 0.97|±  |0.0171|
|     |       |strict-match    |     5|exact_match|↑  | 0.97|±  |0.0171|
VLLM_MLA_PERFORM_MATRIX_ABSORPTION=0 VLLM_USE_V1=1

```
Data Preview:
  backend  input_tokens  output_tokens  output_toks/s     req/s  median_itl_ms  median_ttft_ms
2    vllm          1000           1000    1269.083834  1.269084      31.285999     2318.670340
1    vllm          5000           1000    1046.350954  1.046351      33.370881     5510.511930
3    vllm         10000           1000     865.023539  0.865024      37.076649     8501.588057
0    vllm         32000           1000     190.611408  0.190611      35.992813   107927.603456
```

VLLM_MLA_PERFORM_MATRIX_ABSORPTION=1 VLLM_USE_V1=1

```
Data Preview:
  backend  input_tokens  output_tokens  output_toks/s     req/s  median_itl_ms  median_ttft_ms
2    vllm          1000           1000    1379.393797  1.379394      30.282677     2025.573374
1    vllm          5000           1000    1038.084492  1.038084      33.778368     5517.865633
3    vllm         10000           1000     571.708739  0.571709      36.978330     8523.583977
0    vllm         32000           1000     161.668698  0.161669      43.343701   115767.377675
```
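Reading the two benchmark tables together, the throughput ratio of the non-materialized path (`=0`) over the absorbed path (`=1`) can be computed directly from the `output_toks/s` columns above:

```python
# output_toks/s taken from the two Data Preview tables above
no_absorb = {1000: 1269.083834, 5000: 1046.350954, 10000: 865.023539, 32000: 190.611408}
absorb    = {1000: 1379.393797, 5000: 1038.084492, 10000: 571.708739, 32000: 161.668698}

# throughput of VLLM_MLA_PERFORM_MATRIX_ABSORPTION=0 relative to =1
speedup = {n: round(no_absorb[n] / absorb[n], 2) for n in no_absorb}
print(speedup)  # {1000: 0.92, 5000: 1.01, 10000: 1.51, 32000: 1.18}
```

So the non-materialized path trades a small slowdown at short prompts for sizable gains at 10k+ input tokens, which is consistent with wanting `=0` as the default.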

Signed-off-by: Lucas Wilkinson <[email protected]>
@github-actions

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only fastcheck CI runs, covering a small and essential subset of tests to quickly catch errors. You can run additional CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@mergify mergify bot added the v1 label Mar 13, 2025
@robertgshaw2-redhat robertgshaw2-redhat deleted the lwilkinson/fp8-no-materialize branch March 24, 2025 18:04
@robertgshaw2-redhat robertgshaw2-redhat restored the lwilkinson/fp8-no-materialize branch March 24, 2025 18:06
@LucasWilkinson (Collaborator, Author)

superseded by: #14770

@LucasWilkinson LucasWilkinson deleted the lwilkinson/fp8-no-materialize branch March 24, 2025 18:09