Your current environment
Qwen3-Reranker served with vLLM errors out when called from ragflow. The Docker Compose service definition:
vllm-openai-8002:
  runtime: nvidia
  # Use GPU 1 only
  deploy:
    resources:
      reservations:
        devices:
          - device_ids: ["1"]
            capabilities: ["gpu"]
            driver: "nvidia"
  environment:
    - CUDA_VISIBLE_DEVICES=1
  # command: --model /models/safetensors/Qwen/Qwen3-Coder-30B-A3B-Instruct-FP8 --served-model-name Qwen/Qwen3-Coder-30B-A3B-Instruct-FP8 --gpu-memory-utilization 0.75 --kv-cache-dtype fp8 --max_model_len 61440 --max-num-batched-tokens 61440
  command: >
    --model /models/safetensors/Qwen/Qwen3-Reranker-4B
    --served-model-name Qwen/Qwen3-Reranker-4B
    --gpu-memory-utilization 0.7
    --hf_overrides '{"architectures":["Qwen3ForSequenceClassification"],"classifier_from_token": ["no", "yes"],"is_original_qwen3_reranker": true}'
  volumes:
    - ./models/.cache/huggingface:/root/.cache/huggingface
    - ./models/safetensors:/models/safetensors
  dns:
    - 8.8.8.8
  ports:
    - 8002:8000
  ipc: host
  image: vllm/vllm-openai:v0.10.1.1
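The container starts fine; the crash happens at request time. ragflow talks to the Cohere/Jina-style /v1/rerank route (visible in the log below), so the failure can be reproduced without ragflow by posting directly to it. A minimal sketch (query/documents are placeholder values; port 8002 comes from the compose mapping above):

```python
# Standalone reproduction sketch: send the same kind of rerank request
# that ragflow sends to the vLLM server defined above.
import requests

resp = requests.post(
    "http://localhost:8002/v1/rerank",
    json={
        "model": "Qwen/Qwen3-Reranker-4B",
        "query": "What is the capital of France?",   # placeholder query
        "documents": [                               # placeholder documents
            "Paris is the capital of France.",
            "The Rhine flows through Germany.",
        ],
        "return_documents": True,  # Cohere-style field; vLLM ignores it
    },
    timeout=60,
)
print(resp.status_code)
print(resp.json())
```

The `return_documents` field is part of the Cohere-style rerank API; vLLM accepts the request but ignores that field, which is what the WARNING line below refers to. Server output at request time: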
(APIServer pid=1) WARNING 09-25 01:41:57 [protocol.py:81] The following fields were present in the request but ignored: {'return_documents'}
(EngineCore_0 pid=268) ERROR 09-25 01:41:58 [dump_input.py:69] Dumping input data for V1 LLM engine (v0.10.1.1) with config: model='/models/safetensors/Qwen/Qwen3-Reranker-4B', speculative_config=None, tokenizer='/models/safetensors/Qwen/Qwen3-Reranker-4B', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=40960, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=Qwen/Qwen3-Reranker-4B, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=False, pooler_config=PoolerConfig(pooling_type='LAST', normalize=None, dimensions=None, activation=None, softmax=None, step_tag_id=None, returned_token_ids=None, enable_chunked_processing=None, max_embed_len=None), compilation_config={"level":3,"debug_dump_path":"","cache_dir":"/root/.cache/vllm/torch_compile_cache/b51f96a49b","backend":"","custom_ops":[],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output","vllm.mamba_mixer2"],"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"cudagraph_mode":1,"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[512,504,496,488,480,472,464,456,448,440,432,424,416,408,400,392,384,376,368,360,352,344,336,328,320,312,304,296,288,280,272,264,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"pass_config":{},"max_capture_size":512,"local_cache_dir":"/root/.cache/vllm/torch_compile_cache/b51f96a49b/rank_0_0/backbone"},
(EngineCore_0 pid=268) ERROR 09-25 01:41:58 [dump_input.py:76] Dumping scheduler output for model execution: SchedulerOutput(scheduled_new_reqs=[NewRequestData(req_id=rerank-cde819ff5e504189aa1a75c300dea704-9,prompt_token_ids_len=103,mm_kwargs=[],mm_hashes=[],mm_positions=[],sampling_params=None,block_ids=([3, 165, 166, 167, 168, 169, 170],),num_computed_tokens=16,lora_request=None), NewRequestData(req_id=rerank-cde819ff5e504189aa1a75c300dea704-10,prompt_token_ids_len=175,mm_kwargs=[],mm_hashes=[],mm_positions=[],sampling_params=None,block_ids=([3, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180],),num_computed_tokens=16,lora_request=None), NewRequestData(req_id=rerank-cde819ff5e504189aa1a75c300dea704-11,prompt_token_ids_len=323,mm_kwargs=[],mm_hashes=[],mm_positions=[],sampling_params=None,block_ids=([3, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194, 195, 196, 197, 198, 199, 200],),num_computed_tokens=16,lora_request=None), NewRequestData(req_id=rerank-cde819ff5e504189aa1a75c300dea704-12,prompt_token_ids_len=187,mm_kwargs=[],mm_hashes=[],mm_positions=[],sampling_params=None,block_ids=([3, 201, 202, 203, 204, 205, 206, 207, 208, 209, 210, 211],),num_computed_tokens=16,lora_request=None), NewRequestData(req_id=rerank-cde819ff5e504189aa1a75c300dea704-13,prompt_token_ids_len=192,mm_kwargs=[],mm_hashes=[],mm_positions=[],sampling_params=None,block_ids=([3, 212, 213, 214, 215, 216, 217, 218, 219, 220, 221, 222],),num_computed_tokens=16,lora_request=None), NewRequestData(req_id=rerank-cde819ff5e504189aa1a75c300dea704-14,prompt_token_ids_len=323,mm_kwargs=[],mm_hashes=[],mm_positions=[],sampling_params=None,block_ids=([3, 223, 224, 225, 226, 227, 228, 229, 230, 231, 232, 233, 234, 235, 236, 237, 238, 239, 240, 241, 242],),num_computed_tokens=16,lora_request=None), NewRequestData(req_id=rerank-cde819ff5e504189aa1a75c300dea704-15,prompt_token_ids_len=216,mm_kwargs=[],mm_hashes=[],mm_positions=[],sampling_params=None,block_ids=([3, 116, 157, 158, 159, 160, 161, 243, 244, 245, 246, 247, 248, 249],),num_computed_tokens=112,lora_request=None), NewRequestData(req_id=rerank-cde819ff5e504189aa1a75c300dea704-16,prompt_token_ids_len=235,mm_kwargs=[],mm_hashes=[],mm_positions=[],sampling_params=None,block_ids=([3, 250, 251, 252, 253, 254, 255, 256, 257, 258, 259, 260, 261, 262, 263],),num_computed_tokens=16,lora_request=None), NewRequestData(req_id=rerank-cde819ff5e504189aa1a75c300dea704-17,prompt_token_ids_len=476,mm_kwargs=[],mm_hashes=[],mm_positions=[],sampling_params=None,block_ids=([3, 264, 265, 266, 267, 268, 269, 270, 271, 272, 273, 274, 275, 276, 277, 278, 279, 280, 281, 282, 283, 284, 285, 286, 287, 288, 289, 290, 291, 292],),num_computed_tokens=16,lora_request=None), NewRequestData(req_id=rerank-cde819ff5e504189aa1a75c300dea704-18,prompt_token_ids_len=306,mm_kwargs=[],mm_hashes=[],mm_positions=[],sampling_params=None,block_ids=([3, 293, 294, 295],),num_computed_tokens=16,lora_request=None)], scheduled_cached_reqs=CachedRequestData(req_ids=['rerank-cde819ff5e504189aa1a75c300dea704-8'], resumed_from_preemption=[false], new_token_ids=[], new_block_ids=[[[164]]], num_computed_tokens=[130]), num_scheduled_tokens={rerank-cde819ff5e504189aa1a75c300dea704-12: 171, rerank-cde819ff5e504189aa1a75c300dea704-15: 104, rerank-cde819ff5e504189aa1a75c300dea704-14: 307, rerank-cde819ff5e504189aa1a75c300dea704-17: 460, rerank-cde819ff5e504189aa1a75c300dea704-11: 307, rerank-cde819ff5e504189aa1a75c300dea704-10: 159, rerank-cde819ff5e504189aa1a75c300dea704-8: 22, 
rerank-cde819ff5e504189aa1a75c300dea704-9: 87, rerank-cde819ff5e504189aa1a75c300dea704-13: 176, rerank-cde819ff5e504189aa1a75c300dea704-18: 36, rerank-cde819ff5e504189aa1a75c300dea704-16: 219}, total_num_scheduled_tokens=2048, scheduled_spec_decode_tokens={}, scheduled_encoder_inputs={}, num_common_prefix_blocks=[1], finished_req_ids=['rerank-cde819ff5e504189aa1a75c300dea704-5', 'rerank-cde819ff5e504189aa1a75c300dea704-2', 'rerank-cde819ff5e504189aa1a75c300dea704-4', 'rerank-cde819ff5e504189aa1a75c300dea704-6', 'rerank-cde819ff5e504189aa1a75c300dea704-7', 'rerank-cde819ff5e504189aa1a75c300dea704-3'], free_encoder_input_ids=[], structured_output_request_ids={}, grammar_bitmask=null, kv_connector_metadata=null)
(EngineCore_0 pid=268) ERROR 09-25 01:51:45 [dump_input.py:79] Dumping scheduler stats: SchedulerStats(num_running_reqs=13, num_waiting_reqs=40, step_counter=0, current_wave=0, kv_cache_usage=0.014106050305914386, prefix_cache_stats=PrefixCacheStats(reset=False, requests=12, queries=2708, hits=320), spec_decoding_stats=None, num_corrupted_reqs=0)
(EngineCore_0 pid=268) ERROR 09-25 01:51:45 [core.py:702] EngineCore encountered a fatal error.
(EngineCore_0 pid=268) ERROR 09-25 01:51:45 [core.py:702] Traceback (most recent call last):
(EngineCore_0 pid=268) ERROR 09-25 01:51:45 [core.py:702] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 693, in run_engine_core
(EngineCore_0 pid=268) ERROR 09-25 01:51:45 [core.py:702] engine_core.run_busy_loop()
(EngineCore_0 pid=268) ERROR 09-25 01:51:45 [core.py:702] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 720, in run_busy_loop
(EngineCore_0 pid=268) ERROR 09-25 01:51:45 [core.py:702] self._process_engine_step()
(EngineCore_0 pid=268) ERROR 09-25 01:51:45 [core.py:702] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 745, in _process_engine_step
(EngineCore_0 pid=268) ERROR 09-25 01:51:45 [core.py:702] outputs, model_executed = self.step_fn()
(EngineCore_0 pid=268) ERROR 09-25 01:51:45 [core.py:702] ^^^^^^^^^^^^^^
(EngineCore_0 pid=268) ERROR 09-25 01:51:45 [core.py:702] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 288, in step
(EngineCore_0 pid=268) ERROR 09-25 01:51:45 [core.py:702] model_output = self.execute_model_with_error_logging(
(EngineCore_0 pid=268) ERROR 09-25 01:51:45 [core.py:702] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=268) ERROR 09-25 01:51:45 [core.py:702] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 274, in execute_model_with_error_logging
(EngineCore_0 pid=268) ERROR 09-25 01:51:45 [core.py:702] raise err
(EngineCore_0 pid=268) ERROR 09-25 01:51:45 [core.py:702] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 265, in execute_model_with_error_logging
(EngineCore_0 pid=268) ERROR 09-25 01:51:45 [core.py:702] return model_fn(scheduler_output)
(EngineCore_0 pid=268) ERROR 09-25 01:51:45 [core.py:702] ^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=268) ERROR 09-25 01:51:45 [core.py:702] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/abstract.py", line 87, in execute_model
(EngineCore_0 pid=268) ERROR 09-25 01:51:45 [core.py:702] output = self.collective_rpc("execute_model",
(EngineCore_0 pid=268) ERROR 09-25 01:51:45 [core.py:702] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=268) ERROR 09-25 01:51:45 [core.py:702] File "/usr/local/lib/python3.12/dist-packages/vllm/executor/uniproc_executor.py", line 58, in collective_rpc
(EngineCore_0 pid=268) ERROR 09-25 01:51:45 [core.py:702] answer = run_method(self.driver_worker, method, args, kwargs)
(EngineCore_0 pid=268) ERROR 09-25 01:51:45 [core.py:702] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=268) ERROR 09-25 01:51:45 [core.py:702] File "/usr/local/lib/python3.12/dist-packages/vllm/utils/__init__.py", line 3007, in run_method
(EngineCore_0 pid=268) ERROR 09-25 01:51:45 [core.py:702] return func(*args, **kwargs)
(EngineCore_0 pid=268) ERROR 09-25 01:51:45 [core.py:702] ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=268) ERROR 09-25 01:51:45 [core.py:702] File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(EngineCore_0 pid=268) ERROR 09-25 01:51:45 [core.py:702] return func(*args, **kwargs)
(EngineCore_0 pid=268) ERROR 09-25 01:51:45 [core.py:702] ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=268) ERROR 09-25 01:51:45 [core.py:702] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 362, in execute_model
(EngineCore_0 pid=268) ERROR 09-25 01:51:45 [core.py:702] output = self.model_runner.execute_model(scheduler_output,
(EngineCore_0 pid=268) ERROR 09-25 01:51:45 [core.py:702] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=268) ERROR 09-25 01:51:45 [core.py:702] File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(EngineCore_0 pid=268) ERROR 09-25 01:51:45 [core.py:702] return func(*args, **kwargs)
(EngineCore_0 pid=268) ERROR 09-25 01:51:45 [core.py:702] ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=268) ERROR 09-25 01:51:45 [core.py:702] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 1522, in execute_model
(EngineCore_0 pid=268) ERROR 09-25 01:51:45 [core.py:702] max_query_len) = (self._prepare_inputs(scheduler_output))
(EngineCore_0 pid=268) ERROR 09-25 01:51:45 [core.py:702] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=268) ERROR 09-25 01:51:45 [core.py:702] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 712, in _prepare_inputs
(EngineCore_0 pid=268) ERROR 09-25 01:51:45 [core.py:702] tokens = [scheduler_output.num_scheduled_tokens[i] for i in req_ids]
(EngineCore_0 pid=268) ERROR 09-25 01:51:45 [core.py:702] ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^
(EngineCore_0 pid=268) ERROR 09-25 01:51:45 [core.py:702] KeyError: None
(APIServer pid=1) ERROR 09-25 01:51:45 [async_llm.py:430] AsyncLLM output_handler failed.
(APIServer pid=1) ERROR 09-25 01:51:45 [async_llm.py:430] Traceback (most recent call last):
(APIServer pid=1) ERROR 09-25 01:51:45 [async_llm.py:430] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 389, in output_handler
(APIServer pid=1) ERROR 09-25 01:51:45 [async_llm.py:430] outputs = await engine_core.get_output_async()
(APIServer pid=1) ERROR 09-25 01:51:45 [async_llm.py:430] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) ERROR 09-25 01:51:45 [async_llm.py:430] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 843, in get_output_async
(APIServer pid=1) ERROR 09-25 01:51:45 [async_llm.py:430] raise self._format_exception(outputs) from None
(APIServer pid=1) ERROR 09-25 01:51:45 [async_llm.py:430] vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue. See stack trace (above) for the root cause.
(EngineCore_0 pid=268) Process EngineCore_0:
(EngineCore_0 pid=268) Traceback (most recent call last):
(EngineCore_0 pid=268) File "/usr/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
(EngineCore_0 pid=268) self.run()
(EngineCore_0 pid=268) File "/usr/lib/python3.12/multiprocessing/process.py", line 108, in run
(EngineCore_0 pid=268) self._target(*self._args, **self._kwargs)
(EngineCore_0 pid=268) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 704, in run_engine_core
(EngineCore_0 pid=268) raise e
(EngineCore_0 pid=268) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 693, in run_engine_core
(EngineCore_0 pid=268) engine_core.run_busy_loop()
(EngineCore_0 pid=268) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 720, in run_busy_loop
(EngineCore_0 pid=268) self._process_engine_step()
(EngineCore_0 pid=268) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 745, in _process_engine_step
(EngineCore_0 pid=268) outputs, model_executed = self.step_fn()
(EngineCore_0 pid=268) ^^^^^^^^^^^^^^
(EngineCore_0 pid=268) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 288, in step
(EngineCore_0 pid=268) model_output = self.execute_model_with_error_logging(
(EngineCore_0 pid=268) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=268) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 274, in execute_model_with_error_logging
(EngineCore_0 pid=268) raise err
(EngineCore_0 pid=268) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 265, in execute_model_with_error_logging
(EngineCore_0 pid=268) return model_fn(scheduler_output)
(EngineCore_0 pid=268) ^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=268) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/abstract.py", line 87, in execute_model
(EngineCore_0 pid=268) output = self.collective_rpc("execute_model",
(EngineCore_0 pid=268) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=268) File "/usr/local/lib/python3.12/dist-packages/vllm/executor/uniproc_executor.py", line 58, in collective_rpc
(EngineCore_0 pid=268) answer = run_method(self.driver_worker, method, args, kwargs)
(EngineCore_0 pid=268) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=268) File "/usr/local/lib/python3.12/dist-packages/vllm/utils/__init__.py", line 3007, in run_method
(EngineCore_0 pid=268) return func(*args, **kwargs)
(EngineCore_0 pid=268) ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=268) File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(EngineCore_0 pid=268) return func(*args, **kwargs)
(EngineCore_0 pid=268) ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=268) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 362, in execute_model
(EngineCore_0 pid=268) output = self.model_runner.execute_model(scheduler_output,
(EngineCore_0 pid=268) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=268) File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(EngineCore_0 pid=268) return func(*args, **kwargs)
(EngineCore_0 pid=268) ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=268) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 1522, in execute_model
(EngineCore_0 pid=268) max_query_len) = (self._prepare_inputs(scheduler_output))
(EngineCore_0 pid=268) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=268) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 712, in _prepare_inputs
(EngineCore_0 pid=268) tokens = [scheduler_output.num_scheduled_tokens[i] for i in req_ids]
(EngineCore_0 pid=268) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^
(EngineCore_0 pid=268) KeyError: None
(APIServer pid=1) INFO: 172.24.0.1:51940 - "POST /v1/rerank HTTP/1.1" 500 Internal Server Error
(APIServer pid=1) INFO: Shutting down
(APIServer pid=1) INFO: Waiting for application shutdown.
(APIServer pid=1) INFO: Application shutdown complete.
(APIServer pid=1) INFO: Finished server process [1]
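The fatal error itself is the `KeyError: None` in `_prepare_inputs`: the model runner looks up every entry of its cached `req_ids` list in `scheduler_output.num_scheduled_tokens`, and one of those entries is `None` rather than a request id. A minimal sketch of that failure mode (the names mirror the traceback; the stale `None` slot, e.g. left behind when a finished rerank request is removed from the persistent batch without compaction, is my assumption about the cause):

```python
# Sketch of the crash site in gpu_model_runner._prepare_inputs (v0.10.1.1).
# num_scheduled_tokens maps request ids to token counts for this step.
num_scheduled_tokens = {"rerank-...-9": 87, "rerank-...-10": 159}

# The runner's persistent-batch req_ids list; a None slot (assumed to be a
# request removed but not compacted) has no entry in the dict above.
req_ids = ["rerank-...-9", None, "rerank-...-10"]

tokens = [num_scheduled_tokens[i] for i in req_ids]  # raises KeyError: None
```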
How would you like to use vllm
I want to run inference of a [specific model](put link here). I don't know how to integrate it with vllm.
Before submitting a new issue...
- Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.