
Can't load a just-built Pixtral quant; RuntimeError: start (0) + length (1280) exceeds dimension size (1024). #1127

@sjuxax

Description


Describe the bug
I just built a Pixtral quant using the example script and git HEAD of llm-compressor. The resulting model cannot be loaded in vLLM HEAD; loading fails with RuntimeError: start (0) + length (1280) exceeds dimension size (1024).

Expected behavior
Expected the quantized model to load and run correctly.

Environment
Include all relevant environment information:

  1. OS: Arch
  2. Python version: 3.12
  3. LLM Compressor version or commit hash: caee1c8
  4. ML framework version(s): torch 2.5.1
  5. Other Python package versions:
     - compressed-tensors 0.9.1
     - numpy 1.26.4
     - vllm 0.7.2.dev59+g998669c7e.d20250205.cu128
  6. Other relevant environment information: CUDA 12.8, GeForce RTX 3090 Ti, NVIDIA driver 570.86.16

To Reproduce
Exact steps to reproduce the behavior:
  1. Quantize Pixtral with the llm-compressor example script at git HEAD.
  2. Attempt to load the resulting quant in vLLM; loading fails with the RuntimeError above.

Errors

vLLM Traceback
INFO 02-05 16:23:33 gpu_model_runner.py:867] Starting to load model /intnvme/models/pixtral-12b-W4A16-G128/...
INFO 02-05 16:23:33 config.py:2993] cudagraph sizes specified by model runner [] is overridden by config []
DEBUG 02-05 16:23:33 decorators.py:109] Inferred dynamic dimensions for forward method of <class 'vllm.model_executor.models.llama.LlamaModel'>: ['input_ids', 'positions', 'intermediate_tensors', 'inputs_embeds']
DEBUG 02-05 16:23:33 compressed_tensors.py:446] Using scheme: CompressedTensorsWNA16 for language_model.model.layers.0.self_attn.qkv_proj
INFO 02-05 16:23:33 compressed_tensors_wNa16.py:85] Using MarlinLinearKernel for CompressedTensorsWNA16
DEBUG 02-05 16:23:33 compressed_tensors.py:446] Using scheme: CompressedTensorsWNA16 for language_model.model.layers.0.self_attn.o_proj
INFO 02-05 16:23:33 cuda.py:158] Using Flash Attention backend on V1 engine.
DEBUG 02-05 16:23:33 compressed_tensors.py:446] Using scheme: CompressedTensorsWNA16 for language_model.model.layers.0.mlp.gate_up_proj
DEBUG 02-05 16:23:33 compressed_tensors.py:446] Using scheme: CompressedTensorsWNA16 for language_model.model.layers.0.mlp.down_proj
DEBUG 02-05 16:23:33 compressed_tensors.py:446] Using scheme: CompressedTensorsWNA16 for language_model.model.layers.1.self_attn.qkv_proj
[... identical "Using scheme: CompressedTensorsWNA16" DEBUG lines for the remaining qkv_proj/o_proj/gate_up_proj/down_proj modules of layers 1-39 omitted ...]
No ROCm runtime is found, using ROCM_HOME='/opt/rocm'
INFO 02-05 16:23:34 topk_topp_sampler.py:36] Using FlashInfer for top-p & top-k sampling.
DEBUG 02-05 16:23:34 config.py:3408] enabled custom ops: Counter({'rms_norm': 130, 'silu_and_mul': 41, 'rotary_embedding': 1})
DEBUG 02-05 16:23:34 config.py:3410] disabled custom ops: Counter()
DEBUG 02-05 16:23:34 config.py:3408] enabled custom ops: Counter({'rms_norm': 130, 'silu_and_mul': 41, 'rotary_embedding': 1})
DEBUG 02-05 16:23:34 config.py:3410] disabled custom ops: Counter()
Loading safetensors checkpoint shards:   0% Completed | 0/2 [00:00<?, ?it/s]
DEBUG 02-05 16:23:34 utils.py:154] Loaded weight lm_head.weight with shape torch.Size([131072, 5120])
ERROR 02-05 16:23:34 core.py:210] EngineCore hit an exception: Traceback (most recent call last):
ERROR 02-05 16:23:34 core.py:210]   File "/home/jeff/.virtualenvs/vllm312/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 202, in run_engine_core
ERROR 02-05 16:23:34 core.py:210]     engine_core = EngineCoreProc(*args, **kwargs)
ERROR 02-05 16:23:34 core.py:210]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 02-05 16:23:34 core.py:210]   File "/home/jeff/.virtualenvs/vllm312/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 156, in __init__
ERROR 02-05 16:23:34 core.py:210]     super().__init__(vllm_config, executor_class)
ERROR 02-05 16:23:34 core.py:210]   File "/home/jeff/.virtualenvs/vllm312/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 51, in __init__
ERROR 02-05 16:23:34 core.py:210]     self.model_executor = executor_class(vllm_config)
ERROR 02-05 16:23:34 core.py:210]                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 02-05 16:23:34 core.py:210]   File "/home/jeff/.virtualenvs/vllm312/lib/python3.12/site-packages/vllm/executor/executor_base.py", line 51, in __init__
ERROR 02-05 16:23:34 core.py:210]     self._init_executor()
ERROR 02-05 16:23:34 core.py:210]   File "/home/jeff/.virtualenvs/vllm312/lib/python3.12/site-packages/vllm/executor/uniproc_executor.py", line 42, in _init_executor
ERROR 02-05 16:23:34 core.py:210]     self.collective_rpc("load_model")
ERROR 02-05 16:23:34 core.py:210]   File "/home/jeff/.virtualenvs/vllm312/lib/python3.12/site-packages/vllm/executor/uniproc_executor.py", line 51, in collective_rpc
ERROR 02-05 16:23:34 core.py:210]     answer = run_method(self.driver_worker, method, args, kwargs)
ERROR 02-05 16:23:34 core.py:210]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 02-05 16:23:34 core.py:210]   File "/home/jeff/.virtualenvs/vllm312/lib/python3.12/site-packages/vllm/utils.py", line 2220, in run_method
ERROR 02-05 16:23:34 core.py:210]     return func(*args, **kwargs)
ERROR 02-05 16:23:34 core.py:210]            ^^^^^^^^^^^^^^^^^^^^^
ERROR 02-05 16:23:34 core.py:210]   File "/home/jeff/.virtualenvs/vllm312/lib/python3.12/site-packages/vllm/v1/worker/gpu_worker.py", line 143, in load_model
ERROR 02-05 16:23:34 core.py:210]     self.model_runner.load_model()
ERROR 02-05 16:23:34 core.py:210]   File "/home/jeff/.virtualenvs/vllm312/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 869, in load_model
ERROR 02-05 16:23:34 core.py:210]     self.model = get_model(vllm_config=self.vllm_config)
ERROR 02-05 16:23:34 core.py:210]                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 02-05 16:23:34 core.py:210]   File "/home/jeff/.virtualenvs/vllm312/lib/python3.12/site-packages/vllm/model_executor/model_loader/__init__.py", line 14, in get_model
ERROR 02-05 16:23:34 core.py:210]     return loader.load_model(vllm_config=vllm_config)
ERROR 02-05 16:23:34 core.py:210]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 02-05 16:23:34 core.py:210]   File "/home/jeff/.virtualenvs/vllm312/lib/python3.12/site-packages/vllm/model_executor/model_loader/loader.py", line 386, in load_model
ERROR 02-05 16:23:34 core.py:210]     loaded_weights = model.load_weights(
ERROR 02-05 16:23:34 core.py:210]                      ^^^^^^^^^^^^^^^^^^^
ERROR 02-05 16:23:34 core.py:210]   File "/home/jeff/.virtualenvs/vllm312/lib/python3.12/site-packages/vllm/model_executor/models/llava.py", line 727, in load_weights
ERROR 02-05 16:23:34 core.py:210]     return loader.load_weights(weights)
ERROR 02-05 16:23:34 core.py:210]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 02-05 16:23:34 core.py:210]   File "/home/jeff/.virtualenvs/vllm312/lib/python3.12/site-packages/vllm/model_executor/models/utils.py", line 235, in load_weights
ERROR 02-05 16:23:34 core.py:210]     autoloaded_weights = set(self._load_module("", self.module, weights))
ERROR 02-05 16:23:34 core.py:210]                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 02-05 16:23:34 core.py:210]   File "/home/jeff/.virtualenvs/vllm312/lib/python3.12/site-packages/vllm/model_executor/models/utils.py", line 196, in _load_module
ERROR 02-05 16:23:34 core.py:210]     yield from self._load_module(prefix,
ERROR 02-05 16:23:34 core.py:210]   File "/home/jeff/.virtualenvs/vllm312/lib/python3.12/site-packages/vllm/model_executor/models/utils.py", line 173, in _load_module
ERROR 02-05 16:23:34 core.py:210]     loaded_params = module_load_weights(weights)
ERROR 02-05 16:23:34 core.py:210]                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 02-05 16:23:34 core.py:210]   File "/home/jeff/.virtualenvs/vllm312/lib/python3.12/site-packages/vllm/model_executor/models/llama.py", line 567, in load_weights
ERROR 02-05 16:23:34 core.py:210]     return loader.load_weights(
ERROR 02-05 16:23:34 core.py:210]            ^^^^^^^^^^^^^^^^^^^^
ERROR 02-05 16:23:34 core.py:210]   File "/home/jeff/.virtualenvs/vllm312/lib/python3.12/site-packages/vllm/model_executor/models/utils.py", line 235, in load_weights
ERROR 02-05 16:23:34 core.py:210]     autoloaded_weights = set(self._load_module("", self.module, weights))
ERROR 02-05 16:23:34 core.py:210]                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 02-05 16:23:34 core.py:210]   File "/home/jeff/.virtualenvs/vllm312/lib/python3.12/site-packages/vllm/model_executor/models/utils.py", line 196, in _load_module
ERROR 02-05 16:23:34 core.py:210]     yield from self._load_module(prefix,
ERROR 02-05 16:23:34 core.py:210]   File "/home/jeff/.virtualenvs/vllm312/lib/python3.12/site-packages/vllm/model_executor/models/utils.py", line 173, in _load_module
ERROR 02-05 16:23:34 core.py:210]     loaded_params = module_load_weights(weights)
ERROR 02-05 16:23:34 core.py:210]                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 02-05 16:23:34 core.py:210]   File "/home/jeff/.virtualenvs/vllm312/lib/python3.12/site-packages/vllm/model_executor/models/llama.py", line 427, in load_weights
ERROR 02-05 16:23:34 core.py:210]     weight_loader(param, loaded_weight, shard_id)
ERROR 02-05 16:23:34 core.py:210]   File "/home/jeff/.virtualenvs/vllm312/lib/python3.12/site-packages/vllm/model_executor/layers/linear.py", line 812, in weight_loader_v2
ERROR 02-05 16:23:34 core.py:210]     param.load_qkv_weight(loaded_weight=loaded_weight,
ERROR 02-05 16:23:34 core.py:210]   File "/home/jeff/.virtualenvs/vllm312/lib/python3.12/site-packages/vllm/model_executor/parameter.py", line 151, in load_qkv_weight
ERROR 02-05 16:23:34 core.py:210]     loaded_weight = loaded_weight.narrow(self.output_dim,
ERROR 02-05 16:23:34 core.py:210]                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 02-05 16:23:34 core.py:210] RuntimeError: start (0) + length (1280) exceeds dimension size (1024).
ERROR 02-05 16:23:34 core.py:210]
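For context, the final frame of the traceback can be reproduced in isolation: torch.Tensor.narrow raises exactly this error whenever the requested slice extends past the tensor's dimension, which is what happens when vLLM's load_qkv_weight asks for a larger Q shard than the quantized checkpoint tensor actually has. A minimal sketch in plain PyTorch (not vLLM code; the 1024 x 5120 shape is an assumption chosen to match the numbers in the error, not taken from the checkpoint):

```python
import torch

# Hypothetical packed weight whose output dim (1024) is smaller than the
# shard length vLLM computes from the model config (1280).
packed = torch.zeros(1024, 5120)

try:
    # Mirrors loaded_weight.narrow(output_dim, shard_offset, shard_size)
    # in vllm/model_executor/parameter.py::load_qkv_weight.
    packed.narrow(0, 0, 1280)
except RuntimeError as e:
    print(e)  # start (0) + length (1280) exceeds dimension size (1024).
```

This suggests the shapes written by the quantization step and the shard sizes vLLM derives from the Pixtral config disagree, rather than a failure inside the narrow call itself.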


Labels: bug (Something isn't working)