
[Bug]: Crash after a few multi-image calls #8369

@Patrick10203

Description


Your current environment

The environment was set up by pulling the main branch and building the Dockerfile. Hardware was 4x A100 on an Azure instance (Standard NC96ads A100 v4). The server image is ubuntu-hpc (2204).

Startup:
python3 -m vllm.entrypoints.openai.api_server --port=8000 --host=0.0.0.0 --chat-template="/docker_share/models/internVL2-template.jinja" --model="/fine_tunes/internvl2_76b_hermes2_llama3_70b_dynamic_res_2nd_finetune" --tensor-parallel-size=4 --max-model-len=8192 --trust_remote_code --enforce-eager --max-lora-rank 128 --limit-mm-per-prompt image=4

🐛 Describe the bug

I have built from source with the current main branch to use online multi-image inference with InternVL2 76B (finetuned). The first few inferences work with no issue. After about 10 calls the server crashes with the following stack trace.

The issue occurs with both multithreaded and single-threaded calls.
Somehow the bug doesn't happen when I remove --max-lora-rank 128 and set --max-model-len=6000.
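
For reference, the requests are standard OpenAI-compatible multi-image chat completions along these lines (a minimal sketch; the prompt and image URLs below are placeholders, not the exact payloads that triggered the crash, and the model name is taken from the startup command above):

from openai import OpenAI

# Point the client at the vLLM OpenAI-compatible server started above.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="/fine_tunes/internvl2_76b_hermes2_llama3_70b_dynamic_res_2nd_finetune",
    messages=[{
        "role": "user",
        "content": [
            # Multiple images in one user turn (up to 4, per --limit-mm-per-prompt image=4).
            {"type": "text", "text": "Compare these two images."},
            {"type": "image_url", "image_url": {"url": "https://example.com/image1.jpg"}},
            {"type": "image_url", "image_url": {"url": "https://example.com/image2.jpg"}},
        ],
    }],
)
print(response.choices[0].message.content)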

Stack trace
ERROR 09-11 05:24:13 async_llm_engine.py:63] Engine background task failed
ERROR 09-11 05:24:13 async_llm_engine.py:63] Traceback (most recent call last):
ERROR 09-11 05:24:13 async_llm_engine.py:63]   File "/usr/local/lib/python3.12/dist-packages/vllm/engine/async_llm_engine.py", line 53, in _log_task_completion
ERROR 09-11 05:24:13 async_llm_engine.py:63]     return_value = task.result()
ERROR 09-11 05:24:13 async_llm_engine.py:63]                    ^^^^^^^^^^^^^
ERROR 09-11 05:24:13 async_llm_engine.py:63]   File "/usr/local/lib/python3.12/dist-packages/vllm/engine/async_llm_engine.py", line 939, in run_engine_loop
ERROR 09-11 05:24:13 async_llm_engine.py:63]     result = task.result()
ERROR 09-11 05:24:13 async_llm_engine.py:63]              ^^^^^^^^^^^^^
ERROR 09-11 05:24:13 async_llm_engine.py:63]   File "/usr/local/lib/python3.12/dist-packages/vllm/engine/async_llm_engine.py", line 868, in engine_step
ERROR 09-11 05:24:13 async_llm_engine.py:63]     request_outputs = await self.engine.step_async(virtual_engine)
ERROR 09-11 05:24:13 async_llm_engine.py:63]                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 09-11 05:24:13 async_llm_engine.py:63]   File "/usr/local/lib/python3.12/dist-packages/vllm/engine/async_llm_engine.py", line 345, in step_async
ERROR 09-11 05:24:13 async_llm_engine.py:63]     outputs = await self.model_executor.execute_model_async(
ERROR 09-11 05:24:13 async_llm_engine.py:63]               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 09-11 05:24:13 async_llm_engine.py:63]   File "/usr/local/lib/python3.12/dist-packages/vllm/executor/distributed_gpu_executor.py", line 177, in execute_model_async
ERROR 09-11 05:24:13 async_llm_engine.py:63]     return await self._driver_execute_model_async(execute_model_req)
ERROR 09-11 05:24:13 async_llm_engine.py:63]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 09-11 05:24:13 async_llm_engine.py:63]   File "/usr/local/lib/python3.12/dist-packages/vllm/executor/multiproc_gpu_executor.py", line 231, in _driver_execute_model_async
ERROR 09-11 05:24:13 async_llm_engine.py:63]     return await self.driver_exec_model(execute_model_req)
ERROR 09-11 05:24:13 async_llm_engine.py:63]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 09-11 05:24:13 async_llm_engine.py:63]   File "/usr/lib/python3.12/concurrent/futures/thread.py", line 58, in run
ERROR 09-11 05:24:13 async_llm_engine.py:63]     result = self.fn(*self.args, **self.kwargs)
ERROR 09-11 05:24:13 async_llm_engine.py:63]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 09-11 05:24:13 async_llm_engine.py:63]   File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker_base.py", line 303, in execute_model
ERROR 09-11 05:24:13 async_llm_engine.py:63]     inputs = self.prepare_input(execute_model_req)
ERROR 09-11 05:24:13 async_llm_engine.py:63]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 09-11 05:24:13 async_llm_engine.py:63]   File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker_base.py", line 291, in prepare_input
ERROR 09-11 05:24:13 async_llm_engine.py:63]     return self._get_driver_input_and_broadcast(execute_model_req)
ERROR 09-11 05:24:13 async_llm_engine.py:63]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 09-11 05:24:13 async_llm_engine.py:63]   File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker_base.py", line 253, in _get_driver_input_and_broadcast
ERROR 09-11 05:24:13 async_llm_engine.py:63]     self.model_runner.prepare_model_input(
ERROR 09-11 05:24:13 async_llm_engine.py:63]   File "/usr/local/lib/python3.12/dist-packages/vllm/worker/model_runner.py", line 1380, in prepare_model_input
ERROR 09-11 05:24:13 async_llm_engine.py:63]     model_input = self._prepare_model_input_tensors(
ERROR 09-11 05:24:13 async_llm_engine.py:63]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 09-11 05:24:13 async_llm_engine.py:63]   File "/usr/local/lib/python3.12/dist-packages/vllm/worker/model_runner.py", line 1038, in _prepare_model_input_tensors
ERROR 09-11 05:24:13 async_llm_engine.py:63]     builder.add_seq_group(seq_group_metadata)
ERROR 09-11 05:24:13 async_llm_engine.py:63]   File "/usr/local/lib/python3.12/dist-packages/vllm/worker/model_runner.py", line 664, in add_seq_group
ERROR 09-11 05:24:13 async_llm_engine.py:63]     per_seq_group_fn(inter_data, seq_group_metadata)
ERROR 09-11 05:24:13 async_llm_engine.py:63]   File "/usr/local/lib/python3.12/dist-packages/vllm/worker/model_runner.py", line 636, in _compute_multi_modal_input
ERROR 09-11 05:24:13 async_llm_engine.py:63]     mm_kwargs = self.multi_modal_input_mapper(mm_data)
ERROR 09-11 05:24:13 async_llm_engine.py:63]                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 09-11 05:24:13 async_llm_engine.py:63]   File "/usr/local/lib/python3.12/dist-packages/vllm/multimodal/registry.py", line 125, in map_input
ERROR 09-11 05:24:13 async_llm_engine.py:63]     input_dict = plugin.map_input(model_config, data_value)
ERROR 09-11 05:24:13 async_llm_engine.py:63]                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 09-11 05:24:13 async_llm_engine.py:63]   File "/usr/local/lib/python3.12/dist-packages/vllm/multimodal/base.py", line 265, in map_input
ERROR 09-11 05:24:13 async_llm_engine.py:63]     return mapper(InputContext(model_config), data)
ERROR 09-11 05:24:13 async_llm_engine.py:63]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 09-11 05:24:13 async_llm_engine.py:63]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/internvl.py", line 279, in input_mapper_for_internvl
ERROR 09-11 05:24:13 async_llm_engine.py:63]     data = torch.stack(data)
ERROR 09-11 05:24:13 async_llm_engine.py:63]            ^^^^^^^^^^^^^^^^^
ERROR 09-11 05:24:13 async_llm_engine.py:63] RuntimeError: stack expects each tensor to be equal size, but got [7, 3, 448, 448] at entry 0 and [13, 3, 448, 448] at entry 1
Exception in callback functools.partial(<function _log_task_completion at 0x7f2af3d3e2a0>, error_callback=<bound method AsyncLLMEngine._error_callback of <vllm.engine.async_llm_engine.AsyncLLMEngine object at 0x7f2af054a8a0>>)
handle: <Handle functools.partial(<function _log_task_completion at 0x7f2af3d3e2a0>, error_callback=<bound method AsyncLLMEngine._error_callback of <vllm.engine.async_llm_engine.AsyncLLMEngine object at 0x7f2af054a8a0>>)>
Traceback (most recent call last):
  File "/usr/local/lib/python3.12/dist-packages/vllm/engine/async_llm_engine.py", line 53, in _log_task_completion
    return_value = task.result()
                   ^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/engine/async_llm_engine.py", line 939, in run_engine_loop
    result = task.result()
             ^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/engine/async_llm_engine.py", line 868, in engine_step
    request_outputs = await self.engine.step_async(virtual_engine)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/engine/async_llm_engine.py", line 345, in step_async
    outputs = await self.model_executor.execute_model_async(
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/executor/distributed_gpu_executor.py", line 177, in execute_model_async
    return await self._driver_execute_model_async(execute_model_req)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/executor/multiproc_gpu_executor.py", line 231, in _driver_execute_model_async
    return await self.driver_exec_model(execute_model_req)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker_base.py", line 303, in execute_model
    inputs = self.prepare_input(execute_model_req)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker_base.py", line 291, in prepare_input
    return self._get_driver_input_and_broadcast(execute_model_req)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker_base.py", line 253, in _get_driver_input_and_broadcast
    self.model_runner.prepare_model_input(
  File "/usr/local/lib/python3.12/dist-packages/vllm/worker/model_runner.py", line 1380, in prepare_model_input
    model_input = self._prepare_model_input_tensors(
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/worker/model_runner.py", line 1038, in _prepare_model_input_tensors
    builder.add_seq_group(seq_group_metadata)
  File "/usr/local/lib/python3.12/dist-packages/vllm/worker/model_runner.py", line 664, in add_seq_group
    per_seq_group_fn(inter_data, seq_group_metadata)
  File "/usr/local/lib/python3.12/dist-packages/vllm/worker/model_runner.py", line 636, in _compute_multi_modal_input
    mm_kwargs = self.multi_modal_input_mapper(mm_data)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/multimodal/registry.py", line 125, in map_input
    input_dict = plugin.map_input(model_config, data_value)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/multimodal/base.py", line 265, in map_input
    return mapper(InputContext(model_config), data)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/internvl.py", line 279, in input_mapper_for_internvl
    data = torch.stack(data)
           ^^^^^^^^^^^^^^^^^
RuntimeError: stack expects each tensor to be equal size, but got [7, 3, 448, 448] at entry 0 and [13, 3, 448, 448] at entry 1

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "uvloop/cbhandles.pyx", line 63, in uvloop.loop.Handle._run
  File "/usr/local/lib/python3.12/dist-packages/vllm/engine/async_llm_engine.py", line 65, in _log_task_completion
    raise AsyncEngineDeadError(
vllm.engine.async_llm_engine.AsyncEngineDeadError: Task finished unexpectedly. This should never happen! Please open an issue on Github. See stack trace above for the actual cause.
ERROR 09-11 05:24:13 client.py:266] Got Unhealthy response from RPC Server
ERROR 09-11 05:24:13 client.py:412] AsyncEngineDeadError('Background loop is stopped.')
ERROR 09-11 05:24:13 client.py:412] Traceback (most recent call last):
ERROR 09-11 05:24:13 client.py:412]   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/rpc/client.py", line 409, in generate
ERROR 09-11 05:24:13 client.py:412]     await self.check_health(socket=socket)
ERROR 09-11 05:24:13 client.py:412]   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/rpc/client.py", line 429, in check_health
ERROR 09-11 05:24:13 client.py:412]     await self._send_one_way_rpc_request(
ERROR 09-11 05:24:13 client.py:412]   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/rpc/client.py", line 267, in _send_one_way_rpc_request
ERROR 09-11 05:24:13 client.py:412]     raise response
ERROR 09-11 05:24:13 client.py:412] vllm.engine.async_llm_engine.AsyncEngineDeadError: Background loop is stopped.
ERROR 09-11 05:24:13 client.py:266] Got Unhealthy response from RPC Server
ERROR 09-11 05:24:13 client.py:412] AsyncEngineDeadError('Background loop is stopped.')
ERROR 09-11 05:24:13 client.py:412] Traceback (most recent call last):
ERROR 09-11 05:24:13 client.py:412]   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/rpc/client.py", line 409, in generate
ERROR 09-11 05:24:13 client.py:412]     await self.check_health(socket=socket)
ERROR 09-11 05:24:13 client.py:412]   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/rpc/client.py", line 429, in check_health
ERROR 09-11 05:24:13 client.py:412]     await self._send_one_way_rpc_request(
ERROR 09-11 05:24:13 client.py:412]   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/rpc/client.py", line 267, in _send_one_way_rpc_request
ERROR 09-11 05:24:13 client.py:412]     raise response
ERROR 09-11 05:24:13 client.py:412] vllm.engine.async_llm_engine.AsyncEngineDeadError: Background loop is stopped.
CRITICAL 09-11 05:24:13 launcher.py:82] AsyncLLMEngine has failed, terminating server process
INFO:     10.151.92.18:51372 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
CRITICAL 09-11 05:24:13 launcher.py:82] AsyncLLMEngine has failed, terminating server process
INFO:     10.151.92.18:51378 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
INFO:     Shutting down
INFO:     Waiting for application shutdown.
INFO:     Application shutdown complete.
INFO:     Finished server process [1009]
INFO 09-11 05:24:13 server.py:228] vLLM ZMQ RPC Server was interrupted.
Future exception was never retrieved
future: <Future finished exception=RuntimeError('stack expects each tensor to be equal size, but got [7, 3, 448, 448] at entry 0 and [13, 3, 448, 448] at entry 1')>
Traceback (most recent call last):
  File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/rpc/server.py", line 115, in generate
    async for request_output in results_generator:
  File "/usr/local/lib/python3.12/dist-packages/vllm/engine/async_llm_engine.py", line 1073, in generate
    async for output in await self.add_request(
  File "/usr/local/lib/python3.12/dist-packages/vllm/engine/async_llm_engine.py", line 111, in generator
    raise result
  File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/rpc/server.py", line 115, in generate
    async for request_output in results_generator:
  File "/usr/local/lib/python3.12/dist-packages/vllm/engine/async_llm_engine.py", line 1073, in generate
    async for output in await self.add_request(
  File "/usr/local/lib/python3.12/dist-packages/vllm/engine/async_llm_engine.py", line 111, in generator
    raise result
  File "/usr/local/lib/python3.12/dist-packages/vllm/engine/async_llm_engine.py", line 53, in _log_task_completion
    return_value = task.result()
                   ^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/engine/async_llm_engine.py", line 939, in run_engine_loop
    result = task.result()
             ^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/engine/async_llm_engine.py", line 868, in engine_step
    request_outputs = await self.engine.step_async(virtual_engine)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/engine/async_llm_engine.py", line 345, in step_async
    outputs = await self.model_executor.execute_model_async(
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/executor/distributed_gpu_executor.py", line 177, in execute_model_async
    return await self._driver_execute_model_async(execute_model_req)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/executor/multiproc_gpu_executor.py", line 231, in _driver_execute_model_async
    return await self.driver_exec_model(execute_model_req)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker_base.py", line 303, in execute_model
    inputs = self.prepare_input(execute_model_req)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker_base.py", line 291, in prepare_input
    return self._get_driver_input_and_broadcast(execute_model_req)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker_base.py", line 253, in _get_driver_input_and_broadcast
    self.model_runner.prepare_model_input(
  File "/usr/local/lib/python3.12/dist-packages/vllm/worker/model_runner.py", line 1380, in prepare_model_input
    model_input = self._prepare_model_input_tensors(
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/worker/model_runner.py", line 1038, in _prepare_model_input_tensors
    builder.add_seq_group(seq_group_metadata)
  File "/usr/local/lib/python3.12/dist-packages/vllm/worker/model_runner.py", line 664, in add_seq_group
    per_seq_group_fn(inter_data, seq_group_metadata)
  File "/usr/local/lib/python3.12/dist-packages/vllm/worker/model_runner.py", line 636, in _compute_multi_modal_input
    mm_kwargs = self.multi_modal_input_mapper(mm_data)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/multimodal/registry.py", line 125, in map_input
    input_dict = plugin.map_input(model_config, data_value)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/multimodal/base.py", line 265, in map_input
    return mapper(InputContext(model_config), data)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/internvl.py", line 279, in input_mapper_for_internvl
    data = torch.stack(data)
           ^^^^^^^^^^^^^^^^^
RuntimeError: stack expects each tensor to be equal size, but got [7, 3, 448, 448] at entry 0 and [13, 3, 448, 448] at entry 1
ERROR 09-11 05:24:14 multiproc_worker_utils.py:120] Worker VllmWorkerProcess pid 1215 died, exit code: -15
INFO 09-11 05:24:14 multiproc_worker_utils.py:123] Killing local vLLM worker processes
root@fee87fa97dfb:/vllm-workspace# /usr/lib/python3.12/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
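
For context, the RuntimeError in the trace suggests that InternVL's dynamic-resolution preprocessing produced a different number of 448x448 tiles for each image in the request (7 vs. 13 here), which the torch.stack call in input_mapper_for_internvl cannot combine. A minimal sketch reproducing just that failure mode, with the tile counts taken from the log above:

import torch

# Two images tiled to different numbers of 448x448 patches, as in the log.
tiles_a = torch.zeros(7, 3, 448, 448)
tiles_b = torch.zeros(13, 3, 448, 448)

# torch.stack requires all input tensors to have identical shapes, so this raises
# "RuntimeError: stack expects each tensor to be equal size, ..."
torch.stack([tiles_a, tiles_b])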
