Description
Your current environment
The environment was set up by pulling the main branch and building the Dockerfile. Hardware was 4xA100 on an Azure instance (Standard NC96ads A100 v4). The server image is ubuntu-hpc (2204).
Startup:
python3 -m vllm.entrypoints.openai.api_server --port=8000 --host=0.0.0.0 --chat-template="/docker_share/models/internVL2-template.jinja" --model="/fine_tunes/internvl2_76b_hermes2_llama3_70b_dynamic_res_2nd_finetune" --tensor-parallel-size=4 --max-model-len=8192 --trust_remote_code --enforce-eager --max-lora-rank 128 --limit-mm-per-prompt image=4
🐛 Describe the bug
I built vLLM from source on the current main branch to run online multi-image inference with InternVL2 76B (fine-tuned). The first few inferences complete without issue; after roughly 10 calls the server crashes with the stack trace below.
The issue occurs with both multithreaded and single-threaded calls.
Oddly, the bug does not occur when I remove --max-lora-rank 128 and set --max-model-len=6000.
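For context on the tensor shapes in the crash: InternVL2's dynamic-resolution preprocessing splits each image into a variable number of 448x448 tiles depending on its aspect ratio, plus a thumbnail tile when the image is split, so two images in one request can yield pixel tensors with different leading dimensions. A simplified sketch of that tile-count logic (hypothetical helper, not vLLM's or InternVL's actual code; the real grid-selection tie-breaking may differ):

```python
def tile_count(width, height, min_num=1, max_num=12, use_thumbnail=True):
    """Approximate number of 448x448 tiles InternVL-style dynamic
    resolution produces for an image of the given size."""
    aspect = width / height
    # All (cols, rows) grids whose tile count lies in [min_num, max_num],
    # generated in lexicographic order so ties resolve deterministically.
    grids = [(i, j) for i in range(1, max_num + 1)
             for j in range(1, max_num + 1)
             if min_num <= i * j <= max_num]
    # Choose the grid whose aspect ratio is closest to the image's.
    cols, rows = min(grids, key=lambda g: abs(aspect - g[0] / g[1]))
    blocks = cols * rows
    # When the image is actually split, a global thumbnail tile is appended.
    return blocks + 1 if use_thumbnail and blocks > 1 else blocks

# A 3:2 image and a 4:3 image land on different grids:
print(tile_count(1200, 800))   # 6 tiles + thumbnail = 7
print(tile_count(1600, 1200))  # 12 tiles + thumbnail = 13
```

Two such images in one prompt give per-image tensors of shape [7, 3, 448, 448] and [13, 3, 448, 448], exactly the mismatch in the traceback below.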
Stack trace
ERROR 09-11 05:24:13 async_llm_engine.py:63] Engine background task failed
ERROR 09-11 05:24:13 async_llm_engine.py:63] Traceback (most recent call last):
ERROR 09-11 05:24:13 async_llm_engine.py:63] File "/usr/local/lib/python3.12/dist-packages/vllm/engine/async_llm_engine.py", line 53, in _log_task_completion
ERROR 09-11 05:24:13 async_llm_engine.py:63] return_value = task.result()
ERROR 09-11 05:24:13 async_llm_engine.py:63] ^^^^^^^^^^^^^
ERROR 09-11 05:24:13 async_llm_engine.py:63] File "/usr/local/lib/python3.12/dist-packages/vllm/engine/async_llm_engine.py", line 939, in run_engine_loop
ERROR 09-11 05:24:13 async_llm_engine.py:63] result = task.result()
ERROR 09-11 05:24:13 async_llm_engine.py:63] ^^^^^^^^^^^^^
ERROR 09-11 05:24:13 async_llm_engine.py:63] File "/usr/local/lib/python3.12/dist-packages/vllm/engine/async_llm_engine.py", line 868, in engine_step
ERROR 09-11 05:24:13 async_llm_engine.py:63] request_outputs = await self.engine.step_async(virtual_engine)
ERROR 09-11 05:24:13 async_llm_engine.py:63] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 09-11 05:24:13 async_llm_engine.py:63] File "/usr/local/lib/python3.12/dist-packages/vllm/engine/async_llm_engine.py", line 345, in step_async
ERROR 09-11 05:24:13 async_llm_engine.py:63] outputs = await self.model_executor.execute_model_async(
ERROR 09-11 05:24:13 async_llm_engine.py:63] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 09-11 05:24:13 async_llm_engine.py:63] File "/usr/local/lib/python3.12/dist-packages/vllm/executor/distributed_gpu_executor.py", line 177, in execute_model_async
ERROR 09-11 05:24:13 async_llm_engine.py:63] return await self._driver_execute_model_async(execute_model_req)
ERROR 09-11 05:24:13 async_llm_engine.py:63] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 09-11 05:24:13 async_llm_engine.py:63] File "/usr/local/lib/python3.12/dist-packages/vllm/executor/multiproc_gpu_executor.py", line 231, in _driver_execute_model_async
ERROR 09-11 05:24:13 async_llm_engine.py:63] return await self.driver_exec_model(execute_model_req)
ERROR 09-11 05:24:13 async_llm_engine.py:63] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 09-11 05:24:13 async_llm_engine.py:63] File "/usr/lib/python3.12/concurrent/futures/thread.py", line 58, in run
ERROR 09-11 05:24:13 async_llm_engine.py:63] result = self.fn(*self.args, **self.kwargs)
ERROR 09-11 05:24:13 async_llm_engine.py:63] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 09-11 05:24:13 async_llm_engine.py:63] File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker_base.py", line 303, in execute_model
ERROR 09-11 05:24:13 async_llm_engine.py:63] inputs = self.prepare_input(execute_model_req)
ERROR 09-11 05:24:13 async_llm_engine.py:63] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 09-11 05:24:13 async_llm_engine.py:63] File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker_base.py", line 291, in prepare_input
ERROR 09-11 05:24:13 async_llm_engine.py:63] return self._get_driver_input_and_broadcast(execute_model_req)
ERROR 09-11 05:24:13 async_llm_engine.py:63] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 09-11 05:24:13 async_llm_engine.py:63] File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker_base.py", line 253, in _get_driver_input_and_broadcast
ERROR 09-11 05:24:13 async_llm_engine.py:63] self.model_runner.prepare_model_input(
ERROR 09-11 05:24:13 async_llm_engine.py:63] File "/usr/local/lib/python3.12/dist-packages/vllm/worker/model_runner.py", line 1380, in prepare_model_input
ERROR 09-11 05:24:13 async_llm_engine.py:63] model_input = self._prepare_model_input_tensors(
ERROR 09-11 05:24:13 async_llm_engine.py:63] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 09-11 05:24:13 async_llm_engine.py:63] File "/usr/local/lib/python3.12/dist-packages/vllm/worker/model_runner.py", line 1038, in _prepare_model_input_tensors
ERROR 09-11 05:24:13 async_llm_engine.py:63] builder.add_seq_group(seq_group_metadata)
ERROR 09-11 05:24:13 async_llm_engine.py:63] File "/usr/local/lib/python3.12/dist-packages/vllm/worker/model_runner.py", line 664, in add_seq_group
ERROR 09-11 05:24:13 async_llm_engine.py:63] per_seq_group_fn(inter_data, seq_group_metadata)
ERROR 09-11 05:24:13 async_llm_engine.py:63] File "/usr/local/lib/python3.12/dist-packages/vllm/worker/model_runner.py", line 636, in _compute_multi_modal_input
ERROR 09-11 05:24:13 async_llm_engine.py:63] mm_kwargs = self.multi_modal_input_mapper(mm_data)
ERROR 09-11 05:24:13 async_llm_engine.py:63] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 09-11 05:24:13 async_llm_engine.py:63] File "/usr/local/lib/python3.12/dist-packages/vllm/multimodal/registry.py", line 125, in map_input
ERROR 09-11 05:24:13 async_llm_engine.py:63] input_dict = plugin.map_input(model_config, data_value)
ERROR 09-11 05:24:13 async_llm_engine.py:63] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 09-11 05:24:13 async_llm_engine.py:63] File "/usr/local/lib/python3.12/dist-packages/vllm/multimodal/base.py", line 265, in map_input
ERROR 09-11 05:24:13 async_llm_engine.py:63] return mapper(InputContext(model_config), data)
ERROR 09-11 05:24:13 async_llm_engine.py:63] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 09-11 05:24:13 async_llm_engine.py:63] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/internvl.py", line 279, in input_mapper_for_internvl
ERROR 09-11 05:24:13 async_llm_engine.py:63] data = torch.stack(data)
ERROR 09-11 05:24:13 async_llm_engine.py:63] ^^^^^^^^^^^^^^^^^
ERROR 09-11 05:24:13 async_llm_engine.py:63] RuntimeError: stack expects each tensor to be equal size, but got [7, 3, 448, 448] at entry 0 and [13, 3, 448, 448] at entry 1
Exception in callback functools.partial(<function _log_task_completion at 0x7f2af3d3e2a0>, error_callback=<bound method AsyncLLMEngine._error_callback of <vllm.engine.async_llm_engine.AsyncLLMEngine object at 0x7f2af054a8a0>>)
handle: <Handle functools.partial(<function _log_task_completion at 0x7f2af3d3e2a0>, error_callback=<bound method AsyncLLMEngine._error_callback of <vllm.engine.async_llm_engine.AsyncLLMEngine object at 0x7f2af054a8a0>>)>
Traceback (most recent call last):
File "/usr/local/lib/python3.12/dist-packages/vllm/engine/async_llm_engine.py", line 53, in _log_task_completion
return_value = task.result()
^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/engine/async_llm_engine.py", line 939, in run_engine_loop
result = task.result()
^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/engine/async_llm_engine.py", line 868, in engine_step
request_outputs = await self.engine.step_async(virtual_engine)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/engine/async_llm_engine.py", line 345, in step_async
outputs = await self.model_executor.execute_model_async(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/executor/distributed_gpu_executor.py", line 177, in execute_model_async
return await self._driver_execute_model_async(execute_model_req)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/executor/multiproc_gpu_executor.py", line 231, in _driver_execute_model_async
return await self.driver_exec_model(execute_model_req)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/concurrent/futures/thread.py", line 58, in run
result = self.fn(*self.args, **self.kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker_base.py", line 303, in execute_model
inputs = self.prepare_input(execute_model_req)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker_base.py", line 291, in prepare_input
return self._get_driver_input_and_broadcast(execute_model_req)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker_base.py", line 253, in _get_driver_input_and_broadcast
self.model_runner.prepare_model_input(
File "/usr/local/lib/python3.12/dist-packages/vllm/worker/model_runner.py", line 1380, in prepare_model_input
model_input = self._prepare_model_input_tensors(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/worker/model_runner.py", line 1038, in _prepare_model_input_tensors
builder.add_seq_group(seq_group_metadata)
File "/usr/local/lib/python3.12/dist-packages/vllm/worker/model_runner.py", line 664, in add_seq_group
per_seq_group_fn(inter_data, seq_group_metadata)
File "/usr/local/lib/python3.12/dist-packages/vllm/worker/model_runner.py", line 636, in _compute_multi_modal_input
mm_kwargs = self.multi_modal_input_mapper(mm_data)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/multimodal/registry.py", line 125, in map_input
input_dict = plugin.map_input(model_config, data_value)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/multimodal/base.py", line 265, in map_input
return mapper(InputContext(model_config), data)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/internvl.py", line 279, in input_mapper_for_internvl
data = torch.stack(data)
^^^^^^^^^^^^^^^^^
RuntimeError: stack expects each tensor to be equal size, but got [7, 3, 448, 448] at entry 0 and [13, 3, 448, 448] at entry 1
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "uvloop/cbhandles.pyx", line 63, in uvloop.loop.Handle._run
File "/usr/local/lib/python3.12/dist-packages/vllm/engine/async_llm_engine.py", line 65, in _log_task_completion
raise AsyncEngineDeadError(
vllm.engine.async_llm_engine.AsyncEngineDeadError: Task finished unexpectedly. This should never happen! Please open an issue on Github. See stack trace above for the actual cause.
ERROR 09-11 05:24:13 client.py:266] Got Unhealthy response from RPC Server
ERROR 09-11 05:24:13 client.py:412] AsyncEngineDeadError('Background loop is stopped.')
ERROR 09-11 05:24:13 client.py:412] Traceback (most recent call last):
ERROR 09-11 05:24:13 client.py:412] File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/rpc/client.py", line 409, in generate
ERROR 09-11 05:24:13 client.py:412] await self.check_health(socket=socket)
ERROR 09-11 05:24:13 client.py:412] File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/rpc/client.py", line 429, in check_health
ERROR 09-11 05:24:13 client.py:412] await self._send_one_way_rpc_request(
ERROR 09-11 05:24:13 client.py:412] File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/rpc/client.py", line 267, in _send_one_way_rpc_request
ERROR 09-11 05:24:13 client.py:412] raise response
ERROR 09-11 05:24:13 client.py:412] vllm.engine.async_llm_engine.AsyncEngineDeadError: Background loop is stopped.
ERROR 09-11 05:24:13 client.py:266] Got Unhealthy response from RPC Server
ERROR 09-11 05:24:13 client.py:412] AsyncEngineDeadError('Background loop is stopped.')
ERROR 09-11 05:24:13 client.py:412] Traceback (most recent call last):
ERROR 09-11 05:24:13 client.py:412] File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/rpc/client.py", line 409, in generate
ERROR 09-11 05:24:13 client.py:412] await self.check_health(socket=socket)
ERROR 09-11 05:24:13 client.py:412] File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/rpc/client.py", line 429, in check_health
ERROR 09-11 05:24:13 client.py:412] await self._send_one_way_rpc_request(
ERROR 09-11 05:24:13 client.py:412] File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/rpc/client.py", line 267, in _send_one_way_rpc_request
ERROR 09-11 05:24:13 client.py:412] raise response
ERROR 09-11 05:24:13 client.py:412] vllm.engine.async_llm_engine.AsyncEngineDeadError: Background loop is stopped.
CRITICAL 09-11 05:24:13 launcher.py:82] AsyncLLMEngine has failed, terminating server process
INFO: 10.151.92.18:51372 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
CRITICAL 09-11 05:24:13 launcher.py:82] AsyncLLMEngine has failed, terminating server process
INFO: 10.151.92.18:51378 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
INFO: Shutting down
INFO: Waiting for application shutdown.
INFO: Application shutdown complete.
INFO: Finished server process [1009]
INFO 09-11 05:24:13 server.py:228] vLLM ZMQ RPC Server was interrupted.
Future exception was never retrieved
future: <Future finished exception=RuntimeError('stack expects each tensor to be equal size, but got [7, 3, 448, 448] at entry 0 and [13, 3, 448, 448] at entry 1')>
Traceback (most recent call last):
File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/rpc/server.py", line 115, in generate
async for request_output in results_generator:
File "/usr/local/lib/python3.12/dist-packages/vllm/engine/async_llm_engine.py", line 1073, in generate
async for output in await self.add_request(
File "/usr/local/lib/python3.12/dist-packages/vllm/engine/async_llm_engine.py", line 111, in generator
raise result
File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/rpc/server.py", line 115, in generate
async for request_output in results_generator:
File "/usr/local/lib/python3.12/dist-packages/vllm/engine/async_llm_engine.py", line 1073, in generate
async for output in await self.add_request(
File "/usr/local/lib/python3.12/dist-packages/vllm/engine/async_llm_engine.py", line 111, in generator
raise result
File "/usr/local/lib/python3.12/dist-packages/vllm/engine/async_llm_engine.py", line 53, in _log_task_completion
return_value = task.result()
^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/engine/async_llm_engine.py", line 939, in run_engine_loop
result = task.result()
^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/engine/async_llm_engine.py", line 868, in engine_step
request_outputs = await self.engine.step_async(virtual_engine)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/engine/async_llm_engine.py", line 345, in step_async
outputs = await self.model_executor.execute_model_async(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/executor/distributed_gpu_executor.py", line 177, in execute_model_async
return await self._driver_execute_model_async(execute_model_req)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/executor/multiproc_gpu_executor.py", line 231, in _driver_execute_model_async
return await self.driver_exec_model(execute_model_req)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/concurrent/futures/thread.py", line 58, in run
result = self.fn(*self.args, **self.kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker_base.py", line 303, in execute_model
inputs = self.prepare_input(execute_model_req)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker_base.py", line 291, in prepare_input
return self._get_driver_input_and_broadcast(execute_model_req)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker_base.py", line 253, in _get_driver_input_and_broadcast
self.model_runner.prepare_model_input(
File "/usr/local/lib/python3.12/dist-packages/vllm/worker/model_runner.py", line 1380, in prepare_model_input
model_input = self._prepare_model_input_tensors(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/worker/model_runner.py", line 1038, in _prepare_model_input_tensors
builder.add_seq_group(seq_group_metadata)
File "/usr/local/lib/python3.12/dist-packages/vllm/worker/model_runner.py", line 664, in add_seq_group
per_seq_group_fn(inter_data, seq_group_metadata)
File "/usr/local/lib/python3.12/dist-packages/vllm/worker/model_runner.py", line 636, in _compute_multi_modal_input
mm_kwargs = self.multi_modal_input_mapper(mm_data)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/multimodal/registry.py", line 125, in map_input
input_dict = plugin.map_input(model_config, data_value)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/multimodal/base.py", line 265, in map_input
return mapper(InputContext(model_config), data)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/internvl.py", line 279, in input_mapper_for_internvl
data = torch.stack(data)
^^^^^^^^^^^^^^^^^
RuntimeError: stack expects each tensor to be equal size, but got [7, 3, 448, 448] at entry 0 and [13, 3, 448, 448] at entry 1
ERROR 09-11 05:24:14 multiproc_worker_utils.py:120] Worker VllmWorkerProcess pid 1215 died, exit code: -15
INFO 09-11 05:24:14 multiproc_worker_utils.py:123] Killing local vLLM worker processes
root@fee87fa97dfb:/vllm-workspace# /usr/lib/python3.12/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
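The failing call is the torch.stack in input_mapper_for_internvl: stacking requires every per-image tile tensor to have the identical shape, but dynamic resolution produced 7 tiles for one image and 13 for another. A plain-Python sketch over shape tuples (hypothetical helpers, no torch required) of why stacking fails while concatenating with recorded per-image tile counts would not:

```python
def stack_shapes(shapes):
    # torch.stack semantics: every input must have the identical shape.
    if len(set(shapes)) > 1:
        raise RuntimeError(
            f"stack expects each tensor to be equal size, got {shapes}")
    return (len(shapes), *shapes[0])

def concat_shapes(shapes):
    # torch.cat-along-dim-0 semantics: differing leading dimensions are
    # fine; keeping the per-image tile counts lets the model split the
    # combined batch back into per-image groups later.
    tile_counts = [s[0] for s in shapes]
    return (sum(tile_counts), *shapes[0][1:]), tile_counts

shapes = [(7, 3, 448, 448), (13, 3, 448, 448)]
# stack_shapes(shapes) raises RuntimeError, mirroring the crash above.
combined, counts = concat_shapes(shapes)
# combined == (20, 3, 448, 448), counts == [7, 13]
```

This is only an illustration of the shape mismatch, not a proposed patch; the actual fix belongs in vLLM's InternVL input mapper.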