
vLLM cannot run modelopt quantized weights #228

@Pernekhan

Description

Describe the bug

vLLM cannot run ModelOpt-quantized weights. After following the FP8 quantization example in examples/llm_ptq, the FP8 weights were generated successfully, but trying to run them with vLLM fails with an error.

Steps/Code to reproduce bug

# Quantize and export FP8 weights with ModelOpt (examples/llm_ptq)
export HF_PATH=https://huggingface.co/mistralai/Mistral-Small-3.2-24B-Instruct-2506
scripts/huggingface_example.sh --model $HF_PATH --quant fp8 --export_fmt=hf

# Load the exported checkpoint with vLLM and generate
from vllm import LLM

llm_fp8 = LLM(model="<the exported model path>", quantization="modelopt")
print(llm_fp8.generate(["What's the age of the earth? "]))
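For reference, a minimal sanity check of the exported checkpoint before handing it to vLLM. This is only a sketch, assuming the HF-format export writes a config.json (and possibly an hf_quant_config.json) next to the weights; exact file names and keys may vary across ModelOpt versions, and the checkpoint path is a placeholder.

import json
from pathlib import Path

# Placeholder: replace with the real export directory.
ckpt = Path("<the exported model path>")

# List the exported files so missing pieces (tokenizer, quant config) are obvious.
print(sorted(p.name for p in ckpt.iterdir()))

# The HF export is expected to carry quantization metadata either inside
# config.json ("quantization_config") or in a separate hf_quant_config.json.
config = json.loads((ckpt / "config.json").read_text())
print(config.get("quantization_config", "no quantization_config in config.json"))

quant_cfg = ckpt / "hf_quant_config.json"
if quant_cfg.exists():
    print(json.loads(quant_cfg.read_text()))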

Expected behavior

The model should load without failing or crashing and generate a response.

System information

  • Container used (if applicable): ?
  • OS (e.g., Ubuntu 22.04, CentOS 7, Windows 10): Ubuntu 24.04.1 LTS
  • CPU architecture (x86_64, aarch64): x86_64
  • GPU name (e.g. H100, A100, L40S): NVIDIA H100 80GB HBM3
  • GPU memory size: 79.6 GB
  • Number of GPUs: 4
  • Library versions (if applicable):
    • Python: 3.12.3
    • ModelOpt version or commit hash: 0.31.0
    • CUDA: 12.8
    • PyTorch: 2.7.0a0+7c8ec84dab.nv25.03
    • Transformers: 4.51.0
    • TensorRT-LLM: 0.19.0
    • ONNXRuntime: 1.22.0
    • TensorRT: 10.9.0.34
  • Any other details that may help: ?
