
Can vllm become faster? #2327

@godsakurapeng


I found the article Accelerating Generative AI with PyTorch II: GPT, Fast.
The optimizations used in that article are summarized below:
(image from the blog post summarizing the optimizations)
I briefly tried gpt-fast, and the improvement is huge.

codellama-python-7b, 2× A10 (24 GB)

| Setup | Inference speed (tokens/s) |
| --- | --- |
| vLLM fp16 | 45.2 |
| gpt-fast fp16 | 66.5 |
| gpt-fast int8 | 105.1 |
| gpt-fast int4 | 145.9 |

PS: the generation quality with int4 is terrible.
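
For reference, here is a minimal sketch of how a tokens/s number like the vLLM fp16 row could be measured with vLLM's offline `LLM` API. This is not the exact script behind the table; the model id, prompts, batch size, and sampling settings are assumptions.

```python
# Hypothetical throughput measurement with vLLM's offline API; not the exact
# script behind the table above. Prompts and sampling settings are assumptions.
import time

from vllm import LLM, SamplingParams

llm = LLM(
    model="codellama/CodeLlama-7b-Python-hf",  # assumed HF id for codellama-python-7b
    tensor_parallel_size=2,                    # 2x A10 (24 GB)
    dtype="float16",
)
params = SamplingParams(temperature=0.0, max_tokens=512)
prompts = ["# Write a function that reverses a linked list\n"] * 8

start = time.time()
outputs = llm.generate(prompts, params)
elapsed = time.time() - start

generated = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"{generated / elapsed:.1f} tokens/s")
```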

I'm curious: can these optimizations be used in vLLM?
I can see some existing discussion about them, but it doesn't look like they will land in the short term (because of some vLLM-specific problems?):

**torch.compile** (related issues; a minimal usage sketch follows the list)

- +34% higher throughput?
- Compiled model with torch.compile, unfortunately without performance improvements
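
As a reference point, this is the basic torch.compile usage from the blog post applied to a plain Hugging Face model rather than to vLLM's model runner; how (or whether) this maps onto vLLM's custom kernels and paged attention is exactly the open question. The model id and generation settings are assumptions.

```python
# Minimal torch.compile sketch on an eager Hugging Face model (not vLLM).
# gpt-fast compiles its own decode step; this only shows the basic API.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "codellama/CodeLlama-7b-Python-hf"  # assumption
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16
).cuda()

# Compile the whole module; mode="reduce-overhead" would additionally use
# CUDA graphs, which is where much of the decode-time win in the blog post
# comes from.
model = torch.compile(model)

inputs = tok("def quicksort(arr):", return_tensors="pt").to("cuda")
with torch.inference_mode():
    out = model.generate(**inputs, max_new_tokens=64)
print(tok.decode(out[0], skip_special_tokens=True))
```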

**quantization** (a rough sketch of gpt-fast-style int8 weight-only quantization follows the list)

- Add GPTQ support (I tried a version of this before, but it didn't work well)
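
The int8 row in the table above comes from gpt-fast's weight-only scheme: per-output-channel symmetric quantization of the nn.Linear weights, dequantized at matmul time. A rough sketch of that idea (the class and helper names are mine, not gpt-fast's or vLLM's):

```python
# Rough sketch of int8 weight-only quantization in the gpt-fast style:
# per-output-channel symmetric scales, weights dequantized on the fly.
import torch
import torch.nn as nn
import torch.nn.functional as F

class WeightOnlyInt8Linear(nn.Module):
    def __init__(self, linear: nn.Linear):
        super().__init__()
        w = linear.weight.data                                   # [out, in]
        scale = w.abs().amax(dim=1, keepdim=True) / 127.0        # per-channel scale
        scale = scale.clamp(min=1e-8)                            # avoid div-by-zero
        self.register_buffer("weight_int8", torch.round(w / scale).to(torch.int8))
        self.register_buffer("scale", scale.to(w.dtype))
        self.bias = linear.bias

    def forward(self, x):
        w = self.weight_int8.to(x.dtype) * self.scale            # dequantize
        return F.linear(x, w, self.bias)

def quantize_linears(model: nn.Module) -> nn.Module:
    # Recursively swap every nn.Linear for its int8 weight-only version.
    for name, child in model.named_children():
        if isinstance(child, nn.Linear):
            setattr(model, name, WeightOnlyInt8Linear(child))
        else:
            quantize_linears(child)
    return model
```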

**Speculative Decoding** (a toy sketch of the idea follows)

- Speculative Decoding
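
For completeness, a toy sketch of the greedy variant of speculative decoding: a small draft model proposes k tokens, the target model verifies them in a single forward pass, and the longest agreeing prefix is kept. This is plain Hugging Face code with placeholder model ids, not the full rejection-sampling algorithm and not how vLLM would integrate it (the hard part there is presumably making it play well with continuous batching and the paged KV cache):

```python
# Toy greedy speculative decoding: draft proposes, target verifies in one pass.
# Model ids are placeholders; the draft should share the target's tokenizer.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

DRAFT_ID = "JackFram/llama-68m"                  # placeholder small draft model
TARGET_ID = "codellama/CodeLlama-7b-Python-hf"   # placeholder target model

tok = AutoTokenizer.from_pretrained(TARGET_ID)
draft = AutoModelForCausalLM.from_pretrained(DRAFT_ID, torch_dtype=torch.float16).cuda()
target = AutoModelForCausalLM.from_pretrained(TARGET_ID, torch_dtype=torch.float16).cuda()

@torch.inference_mode()
def speculative_generate(prompt: str, max_new_tokens: int = 128, k: int = 4) -> str:
    ids = tok(prompt, return_tensors="pt").input_ids.cuda()
    produced = 0
    while produced < max_new_tokens:
        # 1) Draft proposes k tokens autoregressively (cheap).
        draft_ids = draft.generate(ids, max_new_tokens=k, do_sample=False)
        proposal = draft_ids[:, ids.shape[1]:]                        # [1, k]

        # 2) Target scores prompt + proposal in one forward pass and produces
        #    its own greedy choice at every proposed position.
        logits = target(draft_ids).logits                             # [1, L+k, vocab]
        verify = logits[:, ids.shape[1] - 1 : -1, :].argmax(dim=-1)   # [1, k]

        # 3) Keep the longest prefix where draft and target agree, then append
        #    the target's token at the first mismatch (or a bonus token if
        #    every proposal was accepted).
        matches = (proposal == verify)[0].int()
        n_accept = int(matches.cumprod(dim=0).sum())
        if n_accept == k:
            extra = logits[:, -1:, :].argmax(dim=-1)
        else:
            extra = verify[:, n_accept : n_accept + 1]
        ids = torch.cat([ids, proposal[:, :n_accept], extra], dim=1)
        produced += n_accept + 1
    return tok.decode(ids[0], skip_special_tokens=True)

print(speculative_generate("def quicksort(arr):"))
```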

vLLM is a great project!! I really hope to see these optimizations in vLLM, and I'd also like to understand what difficulties remain :)
