Conversation

@mgoin (Member) commented Jun 28, 2024

This work expands FP8 support in vLLM from GPUs with native FP8 hardware (Hopper and Ada Lovelace) to GPUs without it (currently Ampere) by introducing FP8 Marlin, a fast kernel that fuses FP8-to-BF16/FP16 weight dequantization into the matmul.

Key features:

  • Enables FP8 quantization on a wider range of GPUs (Ampere, SM 8.0 and 8.6)
  • Improves performance by up to 2x in memory-bound scenarios
  • Maintains accuracy comparable to FP16 baselines
  • Reduces weight memory usage by 2x, allowing larger batches
  • Simple to use: just specify quantization="fp8" at runtime or use pre-quantized FP8 checkpoints (see the usage sketch after this list)
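
For reference, here is a minimal usage sketch with vLLM's Python API (the model name is only a placeholder; any FP16/BF16 checkpoint can be quantized on the fly, or a pre-quantized FP8 checkpoint can be loaded directly):

```python
from vllm import LLM, SamplingParams

# Dynamic FP8 weight quantization on an Ampere GPU (e.g. A10 or A100):
# weights are quantized to FP8 at load time and dequantized on the fly
# by the FP8 Marlin kernel inside each matmul.
llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct", quantization="fp8")

outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```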

Implementation details:

  • Based on existing 8-bit integer support in GPTQ Marlin kernel
  • Packs four FP8 weights into each int32 doubleword (GPTQ format), then permutes the weights into the Marlin format
  • Efficient 4xFP8 to 4xFP16/BF16 dequantization using bit arithmetic and SIMT operations (the bit trick is sketched after this list)
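
To illustrate the principle, below is a simplified scalar sketch (in Python, for clarity) of the E4M3-to-FP16 bit trick; the actual kernel does the equivalent on four packed FP8 values per int32 register with SIMT integer ops, and the channelwise scale is applied separately:

```python
import struct

def fp8_e4m3_to_fp16(byte: int) -> float:
    """Sketch of the FP8 (E4M3) -> FP16 conversion via bit arithmetic.

    Shift the FP8 sign/exponent/mantissa into the corresponding FP16 bit
    positions, reinterpret the 16 bits as FP16, then multiply by
    2^(15 - 7) = 256 to correct for the exponent-bias difference.
    """
    sign = (byte & 0x80) << 8      # FP8 sign bit -> FP16 bit 15
    exp_man = (byte & 0x7F) << 7   # 4-bit exponent + 3-bit mantissa -> bits 13..7
    bits = sign | exp_man
    as_fp16 = struct.unpack("<e", struct.pack("<H", bits))[0]
    return as_fp16 * 256.0         # FP16 bias (15) vs E4M3 bias (7)

# 0x38 encodes 1.0 (exponent field 7 == bias, mantissa 0)
assert fp8_e4m3_to_fp16(0x38) == 1.0
# 0xC1 encodes -2.25 (sign 1, exponent 8, mantissa 1 -> -(1 + 1/8) * 2)
assert fp8_e4m3_to_fp16(0xC1) == -2.25
```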

End-to-end performance and accuracy results:
[Figure: FP8 Marlin A10 E2E Latency in vLLM]
[Figure: FP8 Marlin A100 E2E Latency in vLLM]
[Figure: GSM8k lm-eval with FP8 Marlin in vLLM]
Individual layer sweeps:
[Figure: A10 Layer-wise Sweep, PyTorch FP16 vs FP8 Marlin MatMul]
[Figure: A100 Layer-wise Sweep, PyTorch FP16 vs FP8 Marlin MatMul]

As shown in the graphs, FP8 Marlin can provide significant speedups with minimal accuracy impact. Performance gains are larger on GPUs with lower memory bandwidth (A10, RTX 3090) and for larger models.

Notes:

  • This weight-only approach differs slightly from the existing W8A8 FP8 quantization, offering higher accuracy because activations do not need to be quantized
  • Scales are currently expanded to be channelwise; future work will revert to per-tensor scales
  • This does not include support for MoE models.

Testing:

  • Tested on H100, A100, and A10 GPUs

This enhancement enables more users to benefit from FP8 quantization without hardware restrictions, improving vLLM's performance and efficiency across a broader range of setups!

@mgoin changed the title from "[Kernel] Expand FP8 support to Ampere GPUs using FP8 Marlin #331" to "[Kernel] Expand FP8 support to Ampere GPUs using FP8 Marlin" on Jun 28, 2024
@robertgshaw2-redhat (Collaborator) commented:

This is an awesome feature!

@comaniac (Collaborator) left a comment

Overall LGTM. Thanks!

@mgoin enabled auto-merge (squash) on July 3, 2024 16:30
@mgoin merged commit 47f0954 into vllm-project:main on Jul 3, 2024
@fxmarty commented Jul 19, 2024

@mgoin awesome feature! I suppose the perf benchmark was run with CUDA graphs enabled? Out of curiosity, did you run it without CUDA graphs?

This kernel has been integrated into TGI as well, and there it appears that having CUDA graphs enabled is rather critical to getting speedups during decoding (which I can't quite explain to myself, but I haven't profiled it). During prefill, since CUDA graphs are never used for long enough seqlens, I do see a slight slowdown.

[Figure: prefill benchmark (prefill_gpu)]
[Figure: decode benchmark (decode_gpu)]

I did not benchmark in vLLM, but I suppose the trend is similar. It probably depends on the GPU / TP config / model as well.

related: huggingface/optimum-quanto#241 (comment)

@mgoin (Member, Author) commented Jul 19, 2024

Glad you're enjoying it @fxmarty. Thanks for sharing your analysis. My end-to-end benchmarks were all done with CUDA graphs enabled, as this is the default in vLLM. Note that a slight slowdown at prefill (M > 256) is expected; we trade this off for the improvements at decode.

I'm curious: have you seen the same difference for Marlin int8 or int4? Aside from this, I think there could be additional tuning for A100 problem shapes.

@HPC4AI commented Aug 26, 2024

Hello, I noticed that you used the dequant_8bit function to dequantize FP8 data to FP16 data, but I'm not clear on the underlying principle. Could you please also provide the code for quantizing FP16 to FP8? Thanks.

@AllenDou (Contributor) commented:

> Hello, I noticed that you used the dequant_8bit function to dequantize FP8 data to FP16 data, but I'm not clear on the underlying principle. Could you please also provide the code for quantizing FP16 to FP8? Thanks.

The relevant function is the nv_bfloat16 specialization of dequant_8bit in vLLM's FP8 Marlin kernel (__device__ inline typename ScalarType<nv_bfloat16>::FragB ...). The bit-manipulation approach is based on:

https://github.com/IST-DASLab/marlin/blob/1f25790bdd49fba53106164a24666dade68d7c90/marlin/marlin_cuda_kernel.cu#L131

https://github.com/NVIDIA/FasterTransformer/blob/main/src/fastertransformer/cutlass_extensions/include/cutlass_extensions/interleaved_numeric_conversion.h
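
On the FP16 -> FP8 direction asked about above, a rough PyTorch sketch of per-channel E4M3 weight quantization could look like the following (the helper names are illustrative, not vLLM's actual functions):

```python
import torch

def quantize_weight_fp16_to_fp8(weight: torch.Tensor):
    """Per-output-channel symmetric quantization of an FP16/BF16 weight to E4M3."""
    finfo = torch.finfo(torch.float8_e4m3fn)
    w = weight.float()
    # One scale per output channel so the largest magnitude maps to the E4M3 max (448).
    scale = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-12) / finfo.max
    qweight = (w / scale).clamp(finfo.min, finfo.max).to(torch.float8_e4m3fn)
    return qweight, scale.to(weight.dtype)

def dequantize_fp8_to_fp16(qweight: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Inverse operation: upcast back to the original dtype and rescale."""
    return qweight.to(scale.dtype) * scale

w = torch.randn(512, 1024, dtype=torch.float16)
qw, s = quantize_weight_fp16_to_fp8(w)
w_hat = dequantize_fp8_to_fp16(qw, s)
print((w - w_hat).abs().max())  # small, weight-only quantization error
```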

@yankaifyyy commented:

Does this feature support deploying the FP8 version of DeepSeek-R1 on an A800 server? We "use pre-quantized FP8 checkpoints", but it doesn't work.

@ubergarm commented:

@yankaifyyy

> Does this feature support deploying the FP8 version of DeepSeek-R1 on an A800 server? We "use pre-quantized FP8 checkpoints", but it doesn't work.

Reading the PR notes closely, it says:

> This does not include support for MoE models.

If you want to run the full DeepSeek-R1 671B on an A800, check out ktransformers. Older Ampere hardware still can't make use of the FP8 hybrid quants there, but you can use GGUFs just fine for single-user inference.

Sorry for necro-spamming this old PR, haha; I didn't realize my testing branches' commit messages would auto-tag here...
