Describe the bug
vLLM cannot run ModelOpt-quantized weights. Following the FP8 quantization example in examples/llm_ptq succeeds in generating FP8 weights, but loading the exported checkpoint with vLLM fails with an error.
Steps/Code to reproduce bug
# Quantize the model to FP8 and export in HF format
export HF_PATH=https://huggingface.co/mistralai/Mistral-Small-3.2-24B-Instruct-2506
scripts/huggingface_example.sh --model $HF_PATH --quant fp8 --export_fmt=hf
# Load the exported FP8 checkpoint with vLLM using the ModelOpt quantization backend
from vllm import LLM
llm_fp8 = LLM(model="<the exported model path>", quantization="modelopt")
print(llm_fp8.generate(["What's the age of the earth? "]))
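For triage, here is a minimal sketch to dump the exported quantization config before handing the checkpoint to vLLM. It assumes the ModelOpt unified HF export writes an hf_quant_config.json next to the weights; that filename is an assumption, not confirmed from the export output.
# Minimal sketch: inspect the exported checkpoint's quantization config.
import json
from pathlib import Path

export_dir = Path("<the exported model path>")   # placeholder: same path passed to vLLM
cfg_path = export_dir / "hf_quant_config.json"   # assumed filename from the ModelOpt HF export
if cfg_path.is_file():
    print(json.dumps(json.loads(cfg_path.read_text()), indent=2))
elif export_dir.is_dir():
    print("hf_quant_config.json not found; exported files:")
    for p in sorted(export_dir.iterdir()):
        print(p.name)
else:
    print(f"{export_dir} does not exist")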
Expected behavior
The exported checkpoint loads without errors and generates a response.
System information
- Container used (if applicable): ?
- OS (e.g., Ubuntu 22.04, CentOS 7, Windows 10): Ubuntu 24.04.1 LTS
- CPU architecture (x86_64, aarch64): x86_64
- GPU name (e.g. H100, A100, L40S): NVIDIA H100 80GB HBM3
- GPU memory size: 79.6 GB
- Number of GPUs: 4
- Library versions (if applicable):
- Python: 3.12.3
- ModelOpt version or commit hash: 0.31.0
- CUDA: 12.8
- PyTorch: 2.7.0a0+7c8ec84dab.nv25.03
- Transformers: 4.51.0
- TensorRT-LLM: 0.19.0
- ONNXRuntime: 1.22.0
- TensorRT: 10.9.0.34
- Any other details that may help: ?