feat: add health_generate route to openai serving #3856
Conversation
@kaiyux wdyt?
For the added entrypoint function, it is acceptable to check whether the runtime/executor/engine is alive. But similar logic could also live in the client code instead of the server side: set a timeout on a simple request, then check whether the server returns any response and what its status is. Do we have any reason to implement the logic on the server side? @dsingal0
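For reference, a client-side version of that check could look roughly like the sketch below. This is only an illustration of the idea, not project code; the base URL, model name, and timeout are assumptions.

```python
# Hypothetical client-side liveness check: send a tiny completion request
# with a timeout and treat a missing or failed response as "engine hung".
import requests


def server_can_generate(base_url: str = "http://localhost:8000",
                        timeout_s: float = 30.0) -> bool:
    try:
        resp = requests.post(
            f"{base_url}/v1/completions",
            json={"model": "default", "prompt": "ping", "max_tokens": 1},
            timeout=timeout_s,
        )
    except requests.RequestException:
        return False  # no response within the timeout -> assume hung or dead
    return resp.status_code == 200
```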
FYI: vLLM monitors executor health with a daemon thread, which looks more structured and reliable to me, but we would need to implement similar checking at the executor level. @kaiyux
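For context, that daemon-thread style of monitoring could be sketched roughly as follows; `executor.is_alive()` and the `on_unhealthy` callback are illustrative assumptions, not an existing TensorRT-LLM or vLLM API.

```python
# Hypothetical executor health monitor running as a daemon thread,
# similar in spirit to the vLLM approach mentioned above.
import threading
import time


def start_health_monitor(executor, on_unhealthy, interval_s: float = 10.0):
    def _loop():
        while True:
            # `is_alive` is an assumed method exposing executor liveness.
            if not executor.is_alive():
                on_unhealthy()
                return
            time.sleep(interval_s)

    thread = threading.Thread(target=_loop, daemon=True)
    thread.start()
    return thread
```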
So you mean K8s is using a server-side health check for error detection and auto-scaling? If so, I think it is reasonable to add such a WAR; we can work on a more structured solution afterwards. @kaiyux for opinions.
@kaiyux @penli9, as we have no such check yet, I think a
@dsingal0 Can you also help fix the DCO check? https://github.com/NVIDIA/TensorRT-LLM/pull/3856/checks?check_run_id=41920082141
You can follow the guidance here to sign off https://github.com/NVIDIA/TensorRT-LLM/blob/main/CONTRIBUTING.md#signing-your-work. Please let us know if you have any questions, thanks!
@kaiyux done |
@dsingal0 Thanks for fixing the DCO check. Mind squashing the commits? Not sure why so many unrelated commits were introduced; the diff looks good though.
@kaiyux I think rebasing to add the sign-off added those commits to the PR; can we squash on merge instead? I added the test and removed the mtp.py change.
alternatively, I could open a new PR |
/bot run |
PR_Github #4694 [ run ] triggered by Bot |
/bot run |
PR_Github #4705 [ run ] triggered by Bot |
PR_Github #4694 [ run ] completed with state |
PR_Github #4705 [ run ] completed with state |
@dsingal0 Can you help fix the failed test case?
@kaiyux I believe that fixed it; the test was failing because I had it in the LLM API server, not the OpenAI server.
/bot run |
PR_Github #4873 [ run ] triggered by Bot |
PR_Github #4873 [ run ] completed with state |
@dsingal0 Can you help fix the style check? :) I would like to fix it myself, but I noticed the PR is based on your main branch, and I think I should not push anything onto that.
Amazing, thank you so much @kaiyux |
No problem, thanks a lot for the great support!
Co-authored-by: Dhruv Singal <[email protected]> Signed-off-by: Kaiyu Xie <[email protected]>
…#4349) Cherry-pick #3856 Signed-off-by: Kaiyu Xie <[email protected]> Co-authored-by: Dhruv Singal <[email protected]>
…#3856) (NVIDIA#4349) Cherry-pick NVIDIA#3856 Signed-off-by: Kaiyu Xie <[email protected]> Co-authored-by: Dhruv Singal <[email protected]> Signed-off-by: darraghdog <[email protected]>
Adds /health_generate so Kubernetes can detect when the runtime is hung and restart the container/pod, even if the server's /health still returns a 200.
This is helpful when a long-context request from a user has caused the LLM runtime to hang or die while the FastAPI server is still running.
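To make the mechanism concrete, here is a minimal sketch of what such a route can look like; it is not the exact implementation merged in this PR, and `generate_one_token` plus the 30-second timeout are assumptions for illustration.

```python
# Hypothetical /health_generate route: only report healthy if the engine
# can actually produce output within a bounded time.
import asyncio

from fastapi import FastAPI, Response

app = FastAPI()


async def generate_one_token() -> None:
    """Assumed helper that asks the engine for a single token on a trivial prompt."""
    ...


@app.get("/health_generate")
async def health_generate() -> Response:
    try:
        # If the runtime is hung, this times out instead of blocking forever.
        await asyncio.wait_for(generate_one_token(), timeout=30.0)
    except Exception:
        # Engine is dead or hung: report unhealthy so the orchestrator restarts the pod.
        return Response(status_code=503)
    return Response(status_code=200)
```

A Kubernetes livenessProbe pointed at /health_generate (with a generous timeoutSeconds) can then restart the pod when generation stalls, even while the plain /health endpoint keeps answering.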