[Bug]: Qwen3 30b a3b awq not working with vllm docker v0.8.5.post1 #17739

@kulievvitaly

Description

Your current environment

Docker image vllm/vllm-openai:v0.8.5.post1, Ubuntu 24.04, RTX 3090 Ti

sudo docker run --ipc=host --log-opt max-size=10m --log-opt max-file=1 --rm -it --gpus '"device=0"' --env "VLLM_USE_MODELSCOPE=True" --env "VLLM_USE_V1=0" -p 8000:8000 --mount type=bind,source=/home/me/.cache,target=/root/.cache vllm/vllm-openai:v0.8.5 --model swift/Qwen3-30B-A3B-AWQ --max-model-len 4000 --dtype half --gpu-memory-utilization 0.92 --disable-log-requests --no-enable-prefix-caching -q awq --tensor-parallel-size 1
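For reference, the same settings reduced to a minimal offline sketch (assumption: the load failure is independent of the Docker/OpenAI-server entrypoint, which has not been separately verified here):

# Minimal offline reproduction sketch mirroring the flags from the docker command above.
from vllm import LLM

llm = LLM(
    model="swift/Qwen3-30B-A3B-AWQ",
    quantization="awq",            # same explicit -q awq as in the server command
    dtype="half",
    max_model_len=4000,
    gpu_memory_utilization=0.92,
    tensor_parallel_size=1,
)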

🐛 Describe the bug

me@am5:~$ sudo docker run --ipc=host --log-opt max-size=10m --log-opt max-file=1 --rm -it --gpus '"device=0"' --env "VLLM_USE_MODELSCOPE=True" --env "VLLM_USE_V1=0" -p 8000:8000 --mount type=bind,source=/home/me/.cache,target=/root/.cache vllm/vllm-openai:v0.8.5.post1 --model swift/Qwen3-30B-A3B-AWQ --max-model-len 4000 --dtype half --gpu-memory-utilization 0.92 --disable-log-requests --no-enable-prefix-caching -q awq --tensor-parallel-size 1
[sudo] password for me:
INFO 05-06 12:20:53 [__init__.py:239] Automatically detected platform cuda.
INFO 05-06 12:20:55 [api_server.py:1043] vLLM API server version 0.8.5.post1
INFO 05-06 12:20:55 [api_server.py:1044] args: Namespace(host=None, port=8000, uvicorn_log_level='info', disable_uvicorn_access_log=False, allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, chat_template_content_format='auto', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, enable_ssl_refresh=False, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_request_id_headers=False, enable_auto_tool_choice=False, tool_call_parser=None, tool_parser_plugin='', model='swift/Qwen3-30B-A3B-AWQ', task='auto', tokenizer=None, hf_config_path=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, allowed_local_media_path=None, load_format='auto', download_dir=None, model_loader_extra_config={}, use_tqdm_on_load=True, config_format=<ConfigFormat.AUTO: 'auto'>, dtype='half', max_model_len=4000, guided_decoding_backend='xgrammar', reasoning_parser=None, logits_processor_pattern=None, model_impl='auto', distributed_executor_backend=None, pipeline_parallel_size=1, tensor_parallel_size=1, data_parallel_size=1, enable_expert_parallel=False, max_parallel_loading_workers=None, ray_workers_use_nsight=False, disable_custom_all_reduce=False, block_size=None, gpu_memory_utilization=0.92, swap_space=4, kv_cache_dtype='auto', num_gpu_blocks_override=None, enable_prefix_caching=False, prefix_caching_hash_algo='builtin', cpu_offload_gb=0, calculate_kv_scales=False, disable_sliding_window=False, use_v2_block_manager=True, seed=None, max_logprobs=20, disable_log_stats=False, quantization='awq', rope_scaling=None, rope_theta=None, hf_token=None, hf_overrides=None, enforce_eager=False, max_seq_len_to_capture=8192, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config={}, limit_mm_per_prompt={}, mm_processor_kwargs=None, disable_mm_preprocessor_cache=False, enable_lora=None, enable_lora_bias=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=None, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', speculative_config=None, ignore_patterns=[], served_model_name=None, qlora_adapter_name_or_path=None, show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, max_num_batched_tokens=None, max_num_seqs=None, max_num_partial_prefills=1, max_long_partial_prefills=1, long_prefill_token_threshold=0, num_lookahead_slots=0, scheduler_delay_factor=0.0, preemption_mode=None, num_scheduler_steps=1, multi_step_stream_outputs=True, scheduling_policy='fcfs', enable_chunked_prefill=None, disable_chunked_mm_input=False, scheduler_cls='vllm.core.scheduler.Scheduler', override_neuron_config=None, override_pooler_config=None, compilation_config=None, kv_transfer_config=None, worker_cls='auto', worker_extension_cls='', generation_config='auto', override_generation_config=None, enable_sleep_mode=False, additional_config=None, enable_reasoning=False, disable_cascade_attn=False, disable_log_requests=True, max_log_len=None, disable_fastapi_docs=False, enable_prompt_tokens_details=False, enable_server_load_tracking=False)
Downloading Model from https://www.modelscope.cn to directory: /root/.cache/modelscope/hub/models/swift/Qwen3-30B-A3B-AWQ
2025-05-06 12:20:59,705 - modelscope - WARNING - Using branch: master as version is unstable, use with caution
Downloading Model from https://www.modelscope.cn to directory: /root/.cache/modelscope/hub/models/swift/Qwen3-30B-A3B-AWQ
2025-05-06 12:21:02,873 - modelscope - WARNING - Using branch: master as version is unstable, use with caution
INFO 05-06 12:21:08 [config.py:717] This model supports multiple tasks: {'generate', 'embed', 'score', 'reward', 'classify'}. Defaulting to 'generate'.
INFO 05-06 12:21:08 [awq_marlin.py:117] Detected that the model can run with awq_marlin, however you specified quantization=awq explicitly, so forcing awq. Use quantization=awq_marlin for faster inference
WARNING 05-06 12:21:08 [config.py:830] awq quantization is not fully optimized yet. The speed can be slower than non-quantized models.
INFO 05-06 12:21:08 [api_server.py:246] Started engine process with PID 81
INFO 05-06 12:21:10 [__init__.py:239] Automatically detected platform cuda.
Downloading Model from https://www.modelscope.cn to directory: /root/.cache/modelscope/hub/models/swift/Qwen3-30B-A3B-AWQ
INFO 05-06 12:21:10 [llm_engine.py:240] Initializing a V0 LLM engine (v0.8.5.post1) with config: model='swift/Qwen3-30B-A3B-AWQ', speculative_config=None, tokenizer='swift/Qwen3-30B-A3B-AWQ', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=4000, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=awq, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=None, served_model_name=swift/Qwen3-30B-A3B-AWQ, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=False, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":256}, use_cached_outputs=True,
Downloading Model from https://www.modelscope.cn to directory: /root/.cache/modelscope/hub/models/swift/Qwen3-30B-A3B-AWQ
Downloading Model from https://www.modelscope.cn to directory: /root/.cache/modelscope/hub/models/swift/Qwen3-30B-A3B-AWQ
2025-05-06 12:21:19,777 - modelscope - WARNING - Using branch: master as version is unstable, use with caution
INFO 05-06 12:21:20 [cuda.py:292] Using Flash Attention backend.
INFO 05-06 12:21:20 [parallel_state.py:1004] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
INFO 05-06 12:21:20 [model_runner.py:1108] Starting to load model swift/Qwen3-30B-A3B-AWQ...
WARNING 05-06 12:21:20 [utils.py:168] The model class Qwen3MoeForCausalLM has not defined packed_modules_mapping, this may lead to incorrect mapping of quantized or ignored modules
ERROR 05-06 12:21:20 [engine.py:448]
Traceback (most recent call last):
File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 436, in run_mp_engine
engine = MQLLMEngine.from_vllm_config(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 128, in from_vllm_config
return cls(
^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 82, in init
self.engine = LLMEngine(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/engine/llm_engine.py", line 275, in init
self.model_executor = executor_class(vllm_config=vllm_config)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/executor/executor_base.py", line 52, in init
self._init_executor()
File "/usr/local/lib/python3.12/dist-packages/vllm/executor/uniproc_executor.py", line 47, in _init_executor
self.collective_rpc("load_model")
File "/usr/local/lib/python3.12/dist-packages/vllm/executor/uniproc_executor.py", line 56, in collective_rpc
answer = run_method(self.driver_worker, method, args, kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/utils.py", line 2456, in run_method
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker.py", line 203, in load_model
self.model_runner.load_model()
File "/usr/local/lib/python3.12/dist-packages/vllm/worker/model_runner.py", line 1111, in load_model
self.model = get_model(vllm_config=self.vllm_config)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/init.py", line 14, in get_model
return loader.load_model(vllm_config=vllm_config)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/loader.py", line 452, in load_model
model = _initialize_model(vllm_config=vllm_config)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/loader.py", line 133, in _initialize_model
return model_class(vllm_config=vllm_config, prefix=prefix)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen3_moe.py", line 488, in init
self.model = Qwen3MoeModel(vllm_config=vllm_config,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/decorators.py", line 151, in init
old_init(self, vllm_config=vllm_config, prefix=prefix, **kwargs)
File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen3_moe.py", line 334, in init
self.start_layer, self.end_layer, self.layers = make_layers(
^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/utils.py", line 610, in make_layers
maybe_offload_to_cpu(layer_fn(prefix=f"{prefix}.{idx}"))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen3_moe.py", line 336, in
lambda prefix: Qwen3MoeDecoderLayer(config=config,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen3_moe.py", line 278, in init
self.mlp = Qwen3MoeSparseMoeBlock(config=config,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen3_moe.py", line 113, in init
self.experts = FusedMoE(num_experts=config.num_experts,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fused_moe/layer.py", line 499, in init
assert self.quant_method is not None
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError
Process SpawnProcess-1:
Traceback (most recent call last):
File "/usr/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
self.run()
File "/usr/lib/python3.12/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 450, in run_mp_engine
raise e
File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 436, in run_mp_engine
engine = MQLLMEngine.from_vllm_config(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 128, in from_vllm_config
return cls(
^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 82, in init
self.engine = LLMEngine(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/engine/llm_engine.py", line 275, in init
self.model_executor = executor_class(vllm_config=vllm_config)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/executor/executor_base.py", line 52, in init
self._init_executor()
File "/usr/local/lib/python3.12/dist-packages/vllm/executor/uniproc_executor.py", line 47, in _init_executor
self.collective_rpc("load_model")
File "/usr/local/lib/python3.12/dist-packages/vllm/executor/uniproc_executor.py", line 56, in collective_rpc
answer = run_method(self.driver_worker, method, args, kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/utils.py", line 2456, in run_method
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker.py", line 203, in load_model
self.model_runner.load_model()
File "/usr/local/lib/python3.12/dist-packages/vllm/worker/model_runner.py", line 1111, in load_model
self.model = get_model(vllm_config=self.vllm_config)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/init.py", line 14, in get_model
return loader.load_model(vllm_config=vllm_config)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/loader.py", line 452, in load_model
model = _initialize_model(vllm_config=vllm_config)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/loader.py", line 133, in _initialize_model
return model_class(vllm_config=vllm_config, prefix=prefix)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen3_moe.py", line 488, in init
self.model = Qwen3MoeModel(vllm_config=vllm_config,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/decorators.py", line 151, in init
old_init(self, vllm_config=vllm_config, prefix=prefix, **kwargs)
File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen3_moe.py", line 334, in init
self.start_layer, self.end_layer, self.layers = make_layers(
^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/utils.py", line 610, in make_layers
maybe_offload_to_cpu(layer_fn(prefix=f"{prefix}.{idx}"))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen3_moe.py", line 336, in
lambda prefix: Qwen3MoeDecoderLayer(config=config,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen3_moe.py", line 278, in init
self.mlp = Qwen3MoeSparseMoeBlock(config=config,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen3_moe.py", line 113, in init
self.experts = FusedMoE(num_experts=config.num_experts,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fused_moe/layer.py", line 499, in init
assert self.quant_method is not None
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError
[rank0]:[W506 12:21:20.872054709 ProcessGroupNCCL.cpp:1496] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
Traceback (most recent call last):
File "", line 198, in _run_module_as_main
File "", line 88, in _run_code
File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 1130, in
uvloop.run(run_server(args))
File "/usr/local/lib/python3.12/dist-packages/uvloop/init.py", line 109, in run
return __asyncio.run(
^^^^^^^^^^^^^^
File "/usr/lib/python3.12/asyncio/runners.py", line 195, in run
return runner.run(main)
^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run
return self._loop.run_until_complete(task)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
File "/usr/local/lib/python3.12/dist-packages/uvloop/init.py", line 61, in wrapper
return await main
^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 1078, in run_server
async with build_async_engine_client(args) as engine_client:
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/contextlib.py", line 210, in aenter
return await anext(self.gen)
^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 146, in build_async_engine_client
async with build_async_engine_client_from_engine_args(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/contextlib.py", line 210, in aenter
return await anext(self.gen)
^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 269, in build_async_engine_client_from_engine_args
raise RuntimeError(
RuntimeError: Engine process failed to start. See stack trace for the root cause.
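Possibly relevant: the awq_marlin.py:117 INFO line above says the checkpoint can run with awq_marlin, and the FusedMoE assertion suggests that the explicitly forced awq method does not provide a quant_method for the MoE expert layers. Below is a retry sketch with only the quantization setting changed; whether awq_marlin's fused-MoE path actually loads this checkpoint is an assumption, not verified here. On the docker command line this corresponds to passing -q awq_marlin (or dropping -q awq so the detected method is used).

# Same sketch as above, with only the quantization argument changed
# (assumption: awq_marlin covers the FusedMoE layers of Qwen3MoeForCausalLM).
from vllm import LLM

llm = LLM(
    model="swift/Qwen3-30B-A3B-AWQ",
    quantization="awq_marlin",     # per the awq_marlin.py:117 hint in the log
    dtype="half",
    max_model_len=4000,
    gpu_memory_utilization=0.92,
    tensor_parallel_size=1,
)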

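The utils.py:168 warning about packed_modules_mapping refers to a class-level attribute that vLLM model implementations use to tell quantization methods how fused modules map back to the original checkpoint modules. The following is a purely illustrative sketch of the shape of such a mapping; the concrete entries follow the usual Qwen-style projection names and are an assumption, not taken from the Qwen3MoeForCausalLM source.

# Illustrative only: the general shape of a packed_modules_mapping attribute on a
# vLLM model class. The specific entries here are assumed, not copied from vLLM.
packed_modules_mapping = {
    # fused module name -> original checkpoint module names packed into it
    "qkv_proj": ["q_proj", "k_proj", "v_proj"],
    "gate_up_proj": ["gate_proj", "up_proj"],
}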
Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.

Metadata

    Labels

    bug (Something isn't working), stale (Over 90 days of inactivity)
