
[Bug]: Follow-up shutdown and logging issues #16667

@afeldman-nm

🐛 Describe the bug

This issue is a parking lot for edge cases related to shutdown and logging that will require additional changes to be handled correctly by vLLM v1, even after #11737 lands. The goal is that when vLLM shuts down, whether intentionally or due to an internal failure, the cause of the shutdown is logged with a useful level of detail and the server's resources (especially GPU memory) are freed.

  • Process monitor for the engine core process. #11737 ([V1][Frontend] Improve Shutdown And Logs) adds this for the TP workers, but currently I don't think things will shut down cleanly if the engine core process is killed without warning.

  • A bug not addressed by #11737: when an LLM instance is created with multiprocessing disabled, deleting the LLM instance using del does not free the engine's weight memory on the GPU, resulting in OOM errors for subsequent tests. This appears to happen because the in-process engine core client does not free weight memory as part of shutdown. It may also be the case that the worker does not have any logic for explicitly deleting the PyTorch model layers. In contrast, with multiprocessing enabled, GPU weight memory is freed when the worker process(es) get killed.

  • While #11737 mostly addresses clean shutdown of AsyncLLM when it is garbage collected, it is not yet completely robust. Removing the explicit calls to shutdown in test_async_llm.py now works most of the time but occasionally fails, which causes subsequent tests to fail with OOM.

  • Not technically a bug, but some of the exception stack traces associated with shutdown scenarios are extremely verbose and redundant, and could be suppressed without reducing their usefulness to the user. This is especially true as of #11737, which adds more error-handling logic around shutdown scenarios.

  • Edge cases that are not unit-tested as of #11737, but should be:

    • Shutdown unit tests for handling a hard kill of the worker processes and of the engine core process (the latter in both TP and non-TP cases)
    • End-to-end shutdown tests against the API endpoint
    • Shutdown unit tests for the data-parallel (DP) scenario
    • Tentatively: errors during a utility call, errors during abort, and errors in IPC mechanisms
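
To make the process-monitor and hard-kill test ideas above concrete, here is a minimal sketch using only the Python standard library. It is not vLLM's implementation: `monitor_process`, `demo_hard_kill`, and the sleeping child that stands in for the engine core are all hypothetical names chosen for illustration. The idea is that a daemon thread blocks on the child's exit and invokes a callback, so an unannounced kill is observed rather than silently ignored.

```python
import subprocess
import sys
import threading
from typing import Callable


def monitor_process(proc: subprocess.Popen,
                    on_death: Callable[[int], None]) -> threading.Thread:
    """Start a daemon thread that calls on_death(returncode) when proc exits.

    Hypothetical helper: a real monitor would log the cause and trigger
    an orderly shutdown of the parent instead of just calling a callback.
    """

    def _watch() -> None:
        returncode = proc.wait()  # blocks until the child terminates
        on_death(returncode)

    thread = threading.Thread(target=_watch, daemon=True,
                              name="engine-core-monitor")
    thread.start()
    return thread


def demo_hard_kill() -> int:
    """Spawn a stand-in child, kill it without warning, return its exit code."""
    # The child just sleeps; it stands in for a long-running engine core.
    proc = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(60)"])

    died = threading.Event()
    exit_codes = []

    def on_death(code: int) -> None:
        exit_codes.append(code)
        died.set()

    monitor_process(proc, on_death)
    proc.kill()  # SIGKILL on POSIX, TerminateProcess on Windows

    if not died.wait(timeout=10):
        raise TimeoutError("monitor did not observe the process death")
    return exit_codes[0]
```

A unit test for the hard-kill case could call `demo_hard_kill()` and assert that the returned exit code is nonzero, then additionally check that whatever cleanup the callback is supposed to trigger actually happened.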

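For the garbage-collection case, one common standard-library pattern is `weakref.finalize`, which runs a cleanup callback at most once, either when the owner is collected or when shutdown is called explicitly. The sketch below only illustrates that pattern; `Engine` and `FakeResource` are hypothetical stand-ins, and nothing here is claimed to match how AsyncLLM is actually structured.

```python
import gc
import weakref


class FakeResource:
    """Hypothetical stand-in for GPU memory or a child-process handle."""

    def __init__(self) -> None:
        self.released = False

    def release(self) -> None:
        self.released = True


class Engine:
    """Hypothetical owner whose resource must be freed on shutdown or GC."""

    def __init__(self, resource: FakeResource) -> None:
        self._resource = resource
        # The callback must not reference self, or the object could never
        # be collected. Here it is a bound method of the resource instead.
        self._finalizer = weakref.finalize(self, resource.release)

    def shutdown(self) -> None:
        # Idempotent: a finalizer runs at most once, and calling it again
        # after it has run is a no-op.
        self._finalizer()


# Cleanup through garbage collection.
res_a = FakeResource()
engine_a = Engine(res_a)
del engine_a
gc.collect()  # also covers the case where the owner sits in a reference cycle

# Cleanup through an explicit, repeated shutdown call.
res_b = FakeResource()
engine_b = Engine(res_b)
engine_b.shutdown()
engine_b.shutdown()
```

The pattern only helps if nothing else keeps the owner alive. A lingering reference from a background thread or task would prevent collection, which is one possible (unconfirmed) explanation for cleanup that works most of the time but not always.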


Labels: bug (Something isn't working), stale (Over 90 days of inactivity)
