
[Usage]: Qwen3-Reranker with RAGFlow raises an error #25659

@ooodwbooo

Description


Your current environment

Qwen3-Reranker served with vLLM raises an error when called from RAGFlow.

  vllm-openai-8002:
    runtime: nvidia
    # Use GPU 1 only
    deploy:
      resources:
        reservations:
          devices:
            - device_ids: ["1"]
              capabilities: ["gpu"]
              driver: "nvidia"
    environment:
      - CUDA_VISIBLE_DEVICES=1
    # command: --model /models/safetensors/Qwen/Qwen3-Coder-30B-A3B-Instruct-FP8 --served-model-name Qwen/Qwen3-Coder-30B-A3B-Instruct-FP8 --gpu-memory-utilization 0.75 --kv-cache-dtype fp8 --max_model_len 61440 --max-num-batched-tokens 61440
    command: >
      --model /models/safetensors/Qwen/Qwen3-Reranker-4B 
      --served-model-name Qwen/Qwen3-Reranker-4B  
      --gpu-memory-utilization 0.7
      --hf_overrides '{"architectures":["Qwen3ForSequenceClassification"],"classifier_from_token": ["no", "yes"],"is_original_qwen3_reranker": true}'
    volumes:
      - ./models/.cache/huggingface:/root/.cache/huggingface
      - ./models/safetensors:/models/safetensors
    dns:
      - 8.8.8.8
    ports:
      - 8002:8000
    ipc: host
    image: vllm/vllm-openai:v0.10.1.1
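
The failure can also be triggered without RAGFlow by sending a rerank request directly to the server. Below is a minimal sketch in Python, assuming the container above is reachable on localhost:8002 and that the request/response fields follow vLLM's Jina-style /v1/rerank schema; the return_documents field is the one the warning below reports as ignored:

    import requests

    # Minimal rerank request against the vLLM server defined in the compose file above.
    # Assumes the container is reachable on localhost:8002 (see the `ports:` mapping).
    payload = {
        "model": "Qwen/Qwen3-Reranker-4B",
        "query": "What is the capital of France?",
        "documents": [
            "Paris is the capital of France.",
            "The Great Wall is in China.",
        ],
        # RAGFlow also sends this field; vLLM logs it as ignored (see the warning below).
        "return_documents": True,
    }

    resp = requests.post("http://localhost:8002/v1/rerank", json=payload, timeout=60)
    resp.raise_for_status()
    for item in resp.json()["results"]:
        print(item["index"], item["relevance_score"])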
(APIServer pid=1) WARNING 09-25 01:41:57 [protocol.py:81] The following fields were present in the request but ignored: {'return_documents'}
(EngineCore_0 pid=268) ERROR 09-25 01:41:58 [dump_input.py:69] Dumping input data for V1 LLM engine (v0.10.1.1) with config: model='/models/safetensors/Qwen/Qwen3-Reranker-4B', speculative_config=None, tokenizer='/models/safetensors/Qwen/Qwen3-Reranker-4B', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=40960, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=Qwen/Qwen3-Reranker-4B, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=False, pooler_config=PoolerConfig(pooling_type='LAST', normalize=None, dimensions=None, activation=None, softmax=None, step_tag_id=None, returned_token_ids=None, enable_chunked_processing=None, max_embed_len=None), compilation_config={"level":3,"debug_dump_path":"","cache_dir":"/root/.cache/vllm/torch_compile_cache/b51f96a49b","backend":"","custom_ops":[],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output","vllm.mamba_mixer2"],"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"cudagraph_mode":1,"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[512,504,496,488,480,472,464,456,448,440,432,424,416,408,400,392,384,376,368,360,352,344,336,328,320,312,304,296,288,280,272,264,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"pass_config":{},"max_capture_size":512,"local_cache_dir":"/root/.cache/vllm/torch_compile_cache/b51f96a49b/rank_0_0/backbone"}, 

(EngineCore_0 pid=268) ERROR 09-25 01:41:58 [dump_input.py:76] Dumping scheduler output for model execution: SchedulerOutput(scheduled_new_reqs=[NewRequestData(req_id=rerank-cde819ff5e504189aa1a75c300dea704-9,prompt_token_ids_len=103,mm_kwargs=[],mm_hashes=[],mm_positions=[],sampling_params=None,block_ids=([3, 165, 166, 167, 168, 169, 170],),num_computed_tokens=16,lora_request=None), NewRequestData(req_id=rerank-cde819ff5e504189aa1a75c300dea704-10,prompt_token_ids_len=175,mm_kwargs=[],mm_hashes=[],mm_positions=[],sampling_params=None,block_ids=([3, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180],),num_computed_tokens=16,lora_request=None), NewRequestData(req_id=rerank-cde819ff5e504189aa1a75c300dea704-11,prompt_token_ids_len=323,mm_kwargs=[],mm_hashes=[],mm_positions=[],sampling_params=None,block_ids=([3, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194, 195, 196, 197, 198, 199, 200],),num_computed_tokens=16,lora_request=None), NewRequestData(req_id=rerank-cde819ff5e504189aa1a75c300dea704-12,prompt_token_ids_len=187,mm_kwargs=[],mm_hashes=[],mm_positions=[],sampling_params=None,block_ids=([3, 201, 202, 203, 204, 205, 206, 207, 208, 209, 210, 211],),num_computed_tokens=16,lora_request=None), NewRequestData(req_id=rerank-cde819ff5e504189aa1a75c300dea704-13,prompt_token_ids_len=192,mm_kwargs=[],mm_hashes=[],mm_positions=[],sampling_params=None,block_ids=([3, 212, 213, 214, 215, 216, 217, 218, 219, 220, 221, 222],),num_computed_tokens=16,lora_request=None), NewRequestData(req_id=rerank-cde819ff5e504189aa1a75c300dea704-14,prompt_token_ids_len=323,mm_kwargs=[],mm_hashes=[],mm_positions=[],sampling_params=None,block_ids=([3, 223, 224, 225, 226, 227, 228, 229, 230, 231, 232, 233, 234, 235, 236, 237, 238, 239, 240, 241, 242],),num_computed_tokens=16,lora_request=None), NewRequestData(req_id=rerank-cde819ff5e504189aa1a75c300dea704-15,prompt_token_ids_len=216,mm_kwargs=[],mm_hashes=[],mm_positions=[],sampling_params=None,block_ids=([3, 116, 157, 158, 159, 160, 161, 243, 244, 245, 246, 247, 248, 249],),num_computed_tokens=112,lora_request=None), NewRequestData(req_id=rerank-cde819ff5e504189aa1a75c300dea704-16,prompt_token_ids_len=235,mm_kwargs=[],mm_hashes=[],mm_positions=[],sampling_params=None,block_ids=([3, 250, 251, 252, 253, 254, 255, 256, 257, 258, 259, 260, 261, 262, 263],),num_computed_tokens=16,lora_request=None), NewRequestData(req_id=rerank-cde819ff5e504189aa1a75c300dea704-17,prompt_token_ids_len=476,mm_kwargs=[],mm_hashes=[],mm_positions=[],sampling_params=None,block_ids=([3, 264, 265, 266, 267, 268, 269, 270, 271, 272, 273, 274, 275, 276, 277, 278, 279, 280, 281, 282, 283, 284, 285, 286, 287, 288, 289, 290, 291, 292],),num_computed_tokens=16,lora_request=None), NewRequestData(req_id=rerank-cde819ff5e504189aa1a75c300dea704-18,prompt_token_ids_len=306,mm_kwargs=[],mm_hashes=[],mm_positions=[],sampling_params=None,block_ids=([3, 293, 294, 295],),num_computed_tokens=16,lora_request=None)], scheduled_cached_reqs=CachedRequestData(req_ids=['rerank-cde819ff5e504189aa1a75c300dea704-8'], resumed_from_preemption=[false], new_token_ids=[], new_block_ids=[[[164]]], num_computed_tokens=[130]), num_scheduled_tokens={rerank-cde819ff5e504189aa1a75c300dea704-12: 171, rerank-cde819ff5e504189aa1a75c300dea704-15: 104, rerank-cde819ff5e504189aa1a75c300dea704-14: 307, rerank-cde819ff5e504189aa1a75c300dea704-17: 460, rerank-cde819ff5e504189aa1a75c300dea704-11: 307, rerank-cde819ff5e504189aa1a75c300dea704-10: 159, rerank-cde819ff5e504189aa1a75c300dea704-8: 22, 
rerank-cde819ff5e504189aa1a75c300dea704-9: 87, rerank-cde819ff5e504189aa1a75c300dea704-13: 176, rerank-cde819ff5e504189aa1a75c300dea704-18: 36, rerank-cde819ff5e504189aa1a75c300dea704-16: 219}, total_num_scheduled_tokens=2048, scheduled_spec_decode_tokens={}, scheduled_encoder_inputs={}, num_common_prefix_blocks=[1], finished_req_ids=['rerank-cde819ff5e504189aa1a75c300dea704-5', 'rerank-cde819ff5e504189aa1a75c300dea704-2', 'rerank-cde819ff5e504189aa1a75c300dea704-4', 'rerank-cde819ff5e504189aa1a75c300dea704-6', 'rerank-cde819ff5e504189aa1a75c300dea704-7', 'rerank-cde819ff5e504189aa1a75c300dea704-3'], free_encoder_input_ids=[], structured_output_request_ids={}, grammar_bitmask=null, kv_connector_metadata=null)
(EngineCore_0 pid=268) ERROR 09-25 01:51:45 [dump_input.py:79] Dumping scheduler stats: SchedulerStats(num_running_reqs=13, num_waiting_reqs=40, step_counter=0, current_wave=0, kv_cache_usage=0.014106050305914386, prefix_cache_stats=PrefixCacheStats(reset=False, requests=12, queries=2708, hits=320), spec_decoding_stats=None, num_corrupted_reqs=0)

(EngineCore_0 pid=268) ERROR 09-25 01:51:45 [core.py:702] EngineCore encountered a fatal error.

(EngineCore_0 pid=268) ERROR 09-25 01:51:45 [core.py:702] Traceback (most recent call last):

(EngineCore_0 pid=268) ERROR 09-25 01:51:45 [core.py:702]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 693, in run_engine_core

(EngineCore_0 pid=268) ERROR 09-25 01:51:45 [core.py:702]     engine_core.run_busy_loop()

(EngineCore_0 pid=268) ERROR 09-25 01:51:45 [core.py:702]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 720, in run_busy_loop

(EngineCore_0 pid=268) ERROR 09-25 01:51:45 [core.py:702]     self._process_engine_step()

(EngineCore_0 pid=268) ERROR 09-25 01:51:45 [core.py:702]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 745, in _process_engine_step

(EngineCore_0 pid=268) ERROR 09-25 01:51:45 [core.py:702]     outputs, model_executed = self.step_fn()

(EngineCore_0 pid=268) ERROR 09-25 01:51:45 [core.py:702]                               ^^^^^^^^^^^^^^

(EngineCore_0 pid=268) ERROR 09-25 01:51:45 [core.py:702]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 288, in step

(EngineCore_0 pid=268) ERROR 09-25 01:51:45 [core.py:702]     model_output = self.execute_model_with_error_logging(

(EngineCore_0 pid=268) ERROR 09-25 01:51:45 [core.py:702]                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

(EngineCore_0 pid=268) ERROR 09-25 01:51:45 [core.py:702]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 274, in execute_model_with_error_logging

(EngineCore_0 pid=268) ERROR 09-25 01:51:45 [core.py:702]     raise err

(EngineCore_0 pid=268) ERROR 09-25 01:51:45 [core.py:702]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 265, in execute_model_with_error_logging

(EngineCore_0 pid=268) ERROR 09-25 01:51:45 [core.py:702]     return model_fn(scheduler_output)

(EngineCore_0 pid=268) ERROR 09-25 01:51:45 [core.py:702]            ^^^^^^^^^^^^^^^^^^^^^^^^^^

(EngineCore_0 pid=268) ERROR 09-25 01:51:45 [core.py:702]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/abstract.py", line 87, in execute_model

(EngineCore_0 pid=268) ERROR 09-25 01:51:45 [core.py:702]     output = self.collective_rpc("execute_model",

(EngineCore_0 pid=268) ERROR 09-25 01:51:45 [core.py:702]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

(EngineCore_0 pid=268) ERROR 09-25 01:51:45 [core.py:702]   File "/usr/local/lib/python3.12/dist-packages/vllm/executor/uniproc_executor.py", line 58, in collective_rpc

(EngineCore_0 pid=268) ERROR 09-25 01:51:45 [core.py:702]     answer = run_method(self.driver_worker, method, args, kwargs)

(EngineCore_0 pid=268) ERROR 09-25 01:51:45 [core.py:702]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

(EngineCore_0 pid=268) ERROR 09-25 01:51:45 [core.py:702]   File "/usr/local/lib/python3.12/dist-packages/vllm/utils/__init__.py", line 3007, in run_method

(EngineCore_0 pid=268) ERROR 09-25 01:51:45 [core.py:702]     return func(*args, **kwargs)

(EngineCore_0 pid=268) ERROR 09-25 01:51:45 [core.py:702]            ^^^^^^^^^^^^^^^^^^^^^

(EngineCore_0 pid=268) ERROR 09-25 01:51:45 [core.py:702]   File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context

(EngineCore_0 pid=268) ERROR 09-25 01:51:45 [core.py:702]     return func(*args, **kwargs)

(EngineCore_0 pid=268) ERROR 09-25 01:51:45 [core.py:702]            ^^^^^^^^^^^^^^^^^^^^^

(EngineCore_0 pid=268) ERROR 09-25 01:51:45 [core.py:702]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 362, in execute_model

(EngineCore_0 pid=268) ERROR 09-25 01:51:45 [core.py:702]     output = self.model_runner.execute_model(scheduler_output,

(EngineCore_0 pid=268) ERROR 09-25 01:51:45 [core.py:702]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

(EngineCore_0 pid=268) ERROR 09-25 01:51:45 [core.py:702]   File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context

(EngineCore_0 pid=268) ERROR 09-25 01:51:45 [core.py:702]     return func(*args, **kwargs)

(EngineCore_0 pid=268) ERROR 09-25 01:51:45 [core.py:702]            ^^^^^^^^^^^^^^^^^^^^^

(EngineCore_0 pid=268) ERROR 09-25 01:51:45 [core.py:702]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 1522, in execute_model

(EngineCore_0 pid=268) ERROR 09-25 01:51:45 [core.py:702]     max_query_len) = (self._prepare_inputs(scheduler_output))

(EngineCore_0 pid=268) ERROR 09-25 01:51:45 [core.py:702]                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

(EngineCore_0 pid=268) ERROR 09-25 01:51:45 [core.py:702]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 712, in _prepare_inputs

(EngineCore_0 pid=268) ERROR 09-25 01:51:45 [core.py:702]     tokens = [scheduler_output.num_scheduled_tokens[i] for i in req_ids]

(EngineCore_0 pid=268) ERROR 09-25 01:51:45 [core.py:702]               ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^

(EngineCore_0 pid=268) ERROR 09-25 01:51:45 [core.py:702] KeyError: None
(APIServer pid=1) ERROR 09-25 01:51:45 [async_llm.py:430] AsyncLLM output_handler failed.

(APIServer pid=1) ERROR 09-25 01:51:45 [async_llm.py:430] Traceback (most recent call last):

(APIServer pid=1) ERROR 09-25 01:51:45 [async_llm.py:430]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 389, in output_handler

(APIServer pid=1) ERROR 09-25 01:51:45 [async_llm.py:430]     outputs = await engine_core.get_output_async()

(APIServer pid=1) ERROR 09-25 01:51:45 [async_llm.py:430]               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

(APIServer pid=1) ERROR 09-25 01:51:45 [async_llm.py:430]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 843, in get_output_async

(APIServer pid=1) ERROR 09-25 01:51:45 [async_llm.py:430]     raise self._format_exception(outputs) from None

(APIServer pid=1) ERROR 09-25 01:51:45 [async_llm.py:430] vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue. See stack trace (above) for the root cause.

(EngineCore_0 pid=268) Process EngineCore_0:

(EngineCore_0 pid=268) Traceback (most recent call last):

(EngineCore_0 pid=268)   File "/usr/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap

(EngineCore_0 pid=268)     self.run()

(EngineCore_0 pid=268)   File "/usr/lib/python3.12/multiprocessing/process.py", line 108, in run

(EngineCore_0 pid=268)     self._target(*self._args, **self._kwargs)

(EngineCore_0 pid=268)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 704, in run_engine_core

(EngineCore_0 pid=268)     raise e

(EngineCore_0 pid=268)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 693, in run_engine_core

(EngineCore_0 pid=268)     engine_core.run_busy_loop()

(EngineCore_0 pid=268)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 720, in run_busy_loop

(EngineCore_0 pid=268)     self._process_engine_step()

(EngineCore_0 pid=268)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 745, in _process_engine_step

(EngineCore_0 pid=268)     outputs, model_executed = self.step_fn()

(EngineCore_0 pid=268)                               ^^^^^^^^^^^^^^

(EngineCore_0 pid=268)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 288, in step

(EngineCore_0 pid=268)     model_output = self.execute_model_with_error_logging(

(EngineCore_0 pid=268)                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

(EngineCore_0 pid=268)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 274, in execute_model_with_error_logging

(EngineCore_0 pid=268)     raise err

(EngineCore_0 pid=268)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 265, in execute_model_with_error_logging

(EngineCore_0 pid=268)     return model_fn(scheduler_output)

(EngineCore_0 pid=268)            ^^^^^^^^^^^^^^^^^^^^^^^^^^

(EngineCore_0 pid=268)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/abstract.py", line 87, in execute_model

(EngineCore_0 pid=268)     output = self.collective_rpc("execute_model",

(EngineCore_0 pid=268)              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

(EngineCore_0 pid=268)   File "/usr/local/lib/python3.12/dist-packages/vllm/executor/uniproc_executor.py", line 58, in collective_rpc

(EngineCore_0 pid=268)     answer = run_method(self.driver_worker, method, args, kwargs)

(EngineCore_0 pid=268)              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

(EngineCore_0 pid=268)   File "/usr/local/lib/python3.12/dist-packages/vllm/utils/__init__.py", line 3007, in run_method

(EngineCore_0 pid=268)     return func(*args, **kwargs)

(EngineCore_0 pid=268)            ^^^^^^^^^^^^^^^^^^^^^

(EngineCore_0 pid=268)   File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context

(EngineCore_0 pid=268)     return func(*args, **kwargs)

(EngineCore_0 pid=268)            ^^^^^^^^^^^^^^^^^^^^^

(EngineCore_0 pid=268)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 362, in execute_model

(EngineCore_0 pid=268)     output = self.model_runner.execute_model(scheduler_output,

(EngineCore_0 pid=268)              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

(EngineCore_0 pid=268)   File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context

(EngineCore_0 pid=268)     return func(*args, **kwargs)

(EngineCore_0 pid=268)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=268)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 1522, in execute_model

(EngineCore_0 pid=268)     max_query_len) = (self._prepare_inputs(scheduler_output))

(EngineCore_0 pid=268)                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

(EngineCore_0 pid=268)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 712, in _prepare_inputs

(EngineCore_0 pid=268)     tokens = [scheduler_output.num_scheduled_tokens[i] for i in req_ids]

(EngineCore_0 pid=268)               ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^

(EngineCore_0 pid=268) KeyError: None

(APIServer pid=1) INFO:     172.24.0.1:51940 - "POST /v1/rerank HTTP/1.1" 500 Internal Server Error

(APIServer pid=1) INFO:     Shutting down

(APIServer pid=1) INFO:     Waiting for application shutdown.

(APIServer pid=1) INFO:     Application shutdown complete.

(APIServer pid=1) INFO:     Finished server process [1]
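
To check whether the crash also reproduces outside the OpenAI-compatible server path, here is a hedged offline sketch using vLLM's score task with the same model path and hf_overrides as the compose file above; the task="score" / llm.score() interface is assumed from vLLM's Qwen3-Reranker example, not taken from this report:

    from vllm import LLM

    # Offline reproduction sketch with the same overrides as the compose command above.
    # Model path and hf_overrides are copied from the configuration; the score-task
    # interface is assumed from vLLM's Qwen3-Reranker example.
    llm = LLM(
        model="/models/safetensors/Qwen/Qwen3-Reranker-4B",
        task="score",
        gpu_memory_utilization=0.7,
        hf_overrides={
            "architectures": ["Qwen3ForSequenceClassification"],
            "classifier_from_token": ["no", "yes"],
            "is_original_qwen3_reranker": True,
        },
    )

    query = "What is the capital of France?"
    documents = [
        "Paris is the capital of France.",
        "The Great Wall is in China.",
    ]

    # score() pairs the query with each document and returns one relevance score per pair.
    outputs = llm.score(query, documents)
    for doc, out in zip(documents, outputs):
        print(out.outputs.score, doc)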

How would you like to use vllm

I want to run inference of a [specific model](put link here). I don't know how to integrate it with vllm.

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
