[Bug]: Follow-up shutdown and logging issues #16667

### 🐛 Describe the bug

This issue is a parking lot for edge cases related to shutdown and logging that will require additional changes to be handled correctly by vLLM v1, even after #11737 lands. The goal is that when vLLM shuts down, whether intentionally or due to an internal failure, the cause of the shutdown is logged with a useful level of detail and the server's resources (especially GPU memory) are freed.

* Process monitor for the engine core process. #11737 adds this for the TP workers, but currently I don't think things will shut down cleanly if the engine core process is killed without warning.

* A bug not addressed by #11737: when an `LLM` instance is created with multiprocessing disabled, deleting the `LLM` instance using `del` does not free the engine's weight memory on the GPU, resulting in OOM errors for subsequent tests. This appears to happen because the in-process engine core client does not free weight memory as part of shutdown. It may also be the case that the worker does not have any logic for explicitly deleting the PyTorch model layers. In contrast, with multiprocessing enabled, GPU weight memory is freed when the worker process(es) get killed.

* While #11737 mostly addresses clean shutdown of `AsyncLLM` when it is garbage collected, it's not yet completely robust. Removing the explicit calls to shutdown in `test_async_llm.py` now works most of the time but occasionally doesn't (which causes subsequent tests to fail with OOM).

* Not technically a bug, but some of the exception stack traces associated with shutdown scenarios are extremely verbose and redundant, and could be suppressed without reducing usefulness to the user. This is especially true as of #11737, which adds more error-handling logic around shutdown scenarios.

* Edge cases which are not unit-tested as of #11737, but should be:
  * Add shutdown unit tests for handling hard-kill of worker processes and engine core proc (latter in TP and non-TP cases)
  * End-to-end shutdown tests against API endpoint
  * Shutdown unit tests for data-parallel (DP) scenario
  * Tentatively: error during utility call, error during abort, handle errors in IPC mechanisms
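
To make the process-monitor and hard-kill test items above concrete, here is a minimal sketch of the general technique. It is not vLLM's implementation. The names `monitor_process`, `_engine_core_stub`, and `demo_hard_kill` are hypothetical, the stub stands in for the real engine core loop, and the sketch assumes a POSIX system (it uses the `fork` start method and `SIGKILL`).

```python
import multiprocessing as mp
import os
import signal
import threading
import time
from multiprocessing.connection import wait


def _engine_core_stub() -> None:
    # Hypothetical stand-in for the engine core busy loop.
    while True:
        time.sleep(0.1)


def monitor_process(proc, on_death) -> threading.Thread:
    """Call on_death(exitcode) once when proc exits for any reason."""

    def _watch() -> None:
        # The sentinel becomes ready when the process ends, including
        # when it is killed by a signal it cannot handle.
        wait([proc.sentinel])
        proc.join()  # reap the child so that exitcode is populated
        on_death(proc.exitcode)

    thread = threading.Thread(target=_watch, daemon=True, name="proc-monitor")
    thread.start()
    return thread


def demo_hard_kill() -> int:
    """Start a stub child, kill it without warning, return its exit code."""
    observed = []
    died = threading.Event()

    def on_death(exitcode) -> None:
        # A real handler would log the cause and trigger resource cleanup.
        observed.append(exitcode)
        died.set()

    # Assumption: "fork" is available (POSIX only).
    ctx = mp.get_context("fork")
    proc = ctx.Process(target=_engine_core_stub, daemon=True)
    proc.start()
    monitor_process(proc, on_death)

    os.kill(proc.pid, signal.SIGKILL)  # simulate a hard kill
    if not died.wait(timeout=10):
        raise TimeoutError("monitor did not report the child's death")
    return observed[0]


if __name__ == "__main__":
    print("child exit code:", demo_hard_kill())
```

A unit test for the hard-kill case could assert that the callback fires within a timeout and that the reported exit code is the negated signal number, which is how `multiprocessing` reports a child terminated by a signal. The callback is where logging the cause of shutdown and freeing resources would hook in.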

### Before submitting a new issue...

- [x] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the [documentation page](https://docs.vllm.ai/en/latest/), which can answer lots of frequently asked questions.
