Status: Closed
Labels: bug (Something isn't working)
Description
Describe the bug
I built a Pixtral quant using the example script and the git HEAD of llm-compressor. The resulting model cannot be loaded with vLLM at HEAD; loading fails with RuntimeError: start (0) + length (1280) exceeds dimension size (1024).
Expected behavior
The model should load and run correctly in vLLM.
Environment
- OS: Arch Linux
- Python version: 3.12
- LLM Compressor version or commit hash: caee1c8
- ML framework version(s): torch 2.5.1
- Other Python package versions:
  - compressed-tensors 0.9.1
  - numpy 1.26.4
  - vllm 0.7.2.dev59+g998669c7e.d20250205.cu128
- Other relevant environment information: CUDA 12.8, GeForce RTX 3090 Ti, NVIDIA driver 570.86.16
To Reproduce
Exact steps to reproduce the behavior:
Quantize Pixtral (W4A16, group size 128) with the example llm-compressor script, then attempt to load the resulting model in vLLM. Loading fails with the error below.
Errors
INFO 02-05 16:23:33 gpu_model_runner.py:867] Starting to load model /intnvme/models/pixtral-12b-W4A16-G128/...
INFO 02-05 16:23:33 config.py:2993] cudagraph sizes specified by model runner [] is overridden by config []
DEBUG 02-05 16:23:33 decorators.py:109] Inferred dynamic dimensions for forward method of <class 'vllm.model_executor.models.llama.LlamaModel'>: ['input_ids', 'positions', 'intermediate_tensors', 'inputs_embeds']
DEBUG 02-05 16:23:33 compressed_tensors.py:446] Using scheme: CompressedTensorsWNA16 for language_model.model.layers.0.self_attn.qkv_proj
INFO 02-05 16:23:33 compressed_tensors_wNa16.py:85] Using MarlinLinearKernel for CompressedTensorsWNA16
DEBUG 02-05 16:23:33 compressed_tensors.py:446] Using scheme: CompressedTensorsWNA16 for language_model.model.layers.0.self_attn.o_proj
INFO 02-05 16:23:33 cuda.py:158] Using Flash Attention backend on V1 engine.
DEBUG 02-05 16:23:33 compressed_tensors.py:446] Using scheme: CompressedTensorsWNA16 for language_model.model.layers.0.mlp.gate_up_proj
DEBUG 02-05 16:23:33 compressed_tensors.py:446] Using scheme: CompressedTensorsWNA16 for language_model.model.layers.0.mlp.down_proj
[... identical "Using scheme: CompressedTensorsWNA16" lines repeat for qkv_proj, o_proj, gate_up_proj, and down_proj of layers 1 through 39; trimmed for brevity ...]
No ROCm runtime is found, using ROCM_HOME='/opt/rocm'
INFO 02-05 16:23:34 topk_topp_sampler.py:36] Using FlashInfer for top-p & top-k sampling.
DEBUG 02-05 16:23:34 config.py:3408] enabled custom ops: Counter({'rms_norm': 130, 'silu_and_mul': 41, 'rotary_embedding': 1})
DEBUG 02-05 16:23:34 config.py:3410] disabled custom ops: Counter()
DEBUG 02-05 16:23:34 config.py:3408] enabled custom ops: Counter({'rms_norm': 130, 'silu_and_mul': 41, 'rotary_embedding': 1})
DEBUG 02-05 16:23:34 config.py:3410] disabled custom ops: Counter()
Loading safetensors checkpoint shards: 0% Completed | 0/2 [00:00<?, ?it/s]
DEBUG 02-05 16:23:34 utils.py:154] Loaded weight lm_head.weight with shape torch.Size([131072, 5120])
ERROR 02-05 16:23:34 core.py:210] EngineCore hit an exception: Traceback (most recent call last):
ERROR 02-05 16:23:34 core.py:210] File "/home/jeff/.virtualenvs/vllm312/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 202, in run_engine_core
ERROR 02-05 16:23:34 core.py:210] engine_core = EngineCoreProc(*args, **kwargs)
ERROR 02-05 16:23:34 core.py:210] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 02-05 16:23:34 core.py:210] File "/home/jeff/.virtualenvs/vllm312/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 156, in __init__
ERROR 02-05 16:23:34 core.py:210] super().__init__(vllm_config, executor_class)
ERROR 02-05 16:23:34 core.py:210] File "/home/jeff/.virtualenvs/vllm312/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 51, in __init__
ERROR 02-05 16:23:34 core.py:210] self.model_executor = executor_class(vllm_config)
ERROR 02-05 16:23:34 core.py:210] ^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 02-05 16:23:34 core.py:210] File "/home/jeff/.virtualenvs/vllm312/lib/python3.12/site-packages/vllm/executor/executor_base.py", line 51, in __init__
ERROR 02-05 16:23:34 core.py:210] self._init_executor()
ERROR 02-05 16:23:34 core.py:210] File "/home/jeff/.virtualenvs/vllm312/lib/python3.12/site-packages/vllm/executor/uniproc_executor.py", line 42, in _init_executor
ERROR 02-05 16:23:34 core.py:210] self.collective_rpc("load_model")
ERROR 02-05 16:23:34 core.py:210] File "/home/jeff/.virtualenvs/vllm312/lib/python3.12/site-packages/vllm/executor/uniproc_executor.py", line 51, in collective_rpc
ERROR 02-05 16:23:34 core.py:210] answer = run_method(self.driver_worker, method, args, kwargs)
ERROR 02-05 16:23:34 core.py:210] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 02-05 16:23:34 core.py:210] File "/home/jeff/.virtualenvs/vllm312/lib/python3.12/site-packages/vllm/utils.py", line 2220, in run_method
ERROR 02-05 16:23:34 core.py:210] return func(*args, **kwargs)
ERROR 02-05 16:23:34 core.py:210] ^^^^^^^^^^^^^^^^^^^^^
ERROR 02-05 16:23:34 core.py:210] File "/home/jeff/.virtualenvs/vllm312/lib/python3.12/site-packages/vllm/v1/worker/gpu_worker.py", line 143, in load_model
ERROR 02-05 16:23:34 core.py:210] self.model_runner.load_model()
ERROR 02-05 16:23:34 core.py:210] File "/home/jeff/.virtualenvs/vllm312/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 869, in load_model
ERROR 02-05 16:23:34 core.py:210] self.model = get_model(vllm_config=self.vllm_config)
ERROR 02-05 16:23:34 core.py:210] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 02-05 16:23:34 core.py:210] File "/home/jeff/.virtualenvs/vllm312/lib/python3.12/site-packages/vllm/model_executor/model_loader/__init__.py", line 14, in get_model
ERROR 02-05 16:23:34 core.py:210] return loader.load_model(vllm_config=vllm_config)
ERROR 02-05 16:23:34 core.py:210] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 02-05 16:23:34 core.py:210] File "/home/jeff/.virtualenvs/vllm312/lib/python3.12/site-packages/vllm/model_executor/model_loader/loader.py", line 386, in load_model
ERROR 02-05 16:23:34 core.py:210] loaded_weights = model.load_weights(
ERROR 02-05 16:23:34 core.py:210] ^^^^^^^^^^^^^^^^^^^
ERROR 02-05 16:23:34 core.py:210] File "/home/jeff/.virtualenvs/vllm312/lib/python3.12/site-packages/vllm/model_executor/models/llava.py", line 727, in load_weights
ERROR 02-05 16:23:34 core.py:210] return loader.load_weights(weights)
ERROR 02-05 16:23:34 core.py:210] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 02-05 16:23:34 core.py:210] File "/home/jeff/.virtualenvs/vllm312/lib/python3.12/site-packages/vllm/model_executor/models/utils.py", line 235, in load_weights
ERROR 02-05 16:23:34 core.py:210] autoloaded_weights = set(self._load_module("", self.module, weights))
ERROR 02-05 16:23:34 core.py:210] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 02-05 16:23:34 core.py:210] File "/home/jeff/.virtualenvs/vllm312/lib/python3.12/site-packages/vllm/model_executor/models/utils.py", line 196, in _load_module
ERROR 02-05 16:23:34 core.py:210] yield from self._load_module(prefix,
ERROR 02-05 16:23:34 core.py:210] File "/home/jeff/.virtualenvs/vllm312/lib/python3.12/site-packages/vllm/model_executor/models/utils.py", line 173, in _load_module
ERROR 02-05 16:23:34 core.py:210] loaded_params = module_load_weights(weights)
ERROR 02-05 16:23:34 core.py:210] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 02-05 16:23:34 core.py:210] File "/home/jeff/.virtualenvs/vllm312/lib/python3.12/site-packages/vllm/model_executor/models/llama.py", line 567, in load_weights
ERROR 02-05 16:23:34 core.py:210] return loader.load_weights(
ERROR 02-05 16:23:34 core.py:210] ^^^^^^^^^^^^^^^^^^^^
ERROR 02-05 16:23:34 core.py:210] File "/home/jeff/.virtualenvs/vllm312/lib/python3.12/site-packages/vllm/model_executor/models/utils.py", line 235, in load_weights
ERROR 02-05 16:23:34 core.py:210] autoloaded_weights = set(self._load_module("", self.module, weights))
ERROR 02-05 16:23:34 core.py:210] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 02-05 16:23:34 core.py:210] File "/home/jeff/.virtualenvs/vllm312/lib/python3.12/site-packages/vllm/model_executor/models/utils.py", line 196, in _load_module
ERROR 02-05 16:23:34 core.py:210] yield from self._load_module(prefix,
ERROR 02-05 16:23:34 core.py:210] File "/home/jeff/.virtualenvs/vllm312/lib/python3.12/site-packages/vllm/model_executor/models/utils.py", line 173, in _load_module
ERROR 02-05 16:23:34 core.py:210] loaded_params = module_load_weights(weights)
ERROR 02-05 16:23:34 core.py:210] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 02-05 16:23:34 core.py:210] File "/home/jeff/.virtualenvs/vllm312/lib/python3.12/site-packages/vllm/model_executor/models/llama.py", line 427, in load_weights
ERROR 02-05 16:23:34 core.py:210] weight_loader(param, loaded_weight, shard_id)
ERROR 02-05 16:23:34 core.py:210] File "/home/jeff/.virtualenvs/vllm312/lib/python3.12/site-packages/vllm/model_executor/layers/linear.py", line 812, in weight_loader_v2
ERROR 02-05 16:23:34 core.py:210] param.load_qkv_weight(loaded_weight=loaded_weight,
ERROR 02-05 16:23:34 core.py:210] File "/home/jeff/.virtualenvs/vllm312/lib/python3.12/site-packages/vllm/model_executor/parameter.py", line 151, in load_qkv_weight
ERROR 02-05 16:23:34 core.py:210] loaded_weight = loaded_weight.narrow(self.output_dim,
ERROR 02-05 16:23:34 core.py:210] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 02-05 16:23:34 core.py:210] RuntimeError: start (0) + length (1280) exceeds dimension size (1024).
ERROR 02-05 16:23:34 core.py:210]
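For context on the final error: torch.Tensor.narrow raises when the requested slice extends past the end of the given dimension. Here vLLM's QKV weight loader (load_qkv_weight in parameter.py) requests a 1280-element shard from a checkpoint tensor whose output dimension is only 1024, which suggests a mismatch between the shard shapes vLLM computes for this layer and what the quantized checkpoint actually contains. A minimal sketch of the bounds check in plain Python (not the actual torch implementation, just an illustration of the failure condition):

```python
def narrow_bounds(dim_size: int, start: int, length: int) -> tuple:
    """Simplified sketch of the bounds check torch.Tensor.narrow performs."""
    if start + length > dim_size:
        raise RuntimeError(
            f"start ({start}) + length ({length}) exceeds dimension size ({dim_size})."
        )
    return (start, start + length)


# A shard that fits is fine:
narrow_bounds(1024, 0, 1024)  # returns (0, 1024)

# The loader asks for a 1280-wide shard from a 1024-wide dimension:
try:
    narrow_bounds(1024, 0, 1280)
except RuntimeError as e:
    print(e)  # start (0) + length (1280) exceeds dimension size (1024).
```

The hypothetical helper above only mirrors the error message; the real fix would need to reconcile the expected versus serialized shapes for the qkv_proj weights in the quantized checkpoint.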