Conversation

dsingal0 (Contributor)

Adds a /health_generate route so Kubernetes can detect when the runtime is hung and restart the container/pod, even if the server's /health endpoint still returns a 200.
This is helpful in cases where a long-context request from a user has caused the LLM runtime to hang or die while the FastAPI server is still running.
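
For illustration, a minimal sketch of what such a route can look like on a FastAPI app; the `app.state.llm.generate_async(...)` handle and its arguments are placeholders for whatever generation entry point the server actually holds, not the exact code in this PR:

```python
from fastapi import FastAPI, Response

app = FastAPI()

@app.get("/health_generate")
async def health_generate() -> Response:
    """Probe the runtime itself by running a tiny generation, not just the HTTP server."""
    try:
        # Hypothetical handle to the serving engine; a single-token request keeps the probe cheap.
        await app.state.llm.generate_async("ping", max_tokens=1)
        return Response(status_code=200)
    except Exception:
        # Any runtime error marks the pod unhealthy; a hang is caught by the probe's own timeout.
        return Response(status_code=503)
```

Kubernetes can then keep /health as a cheap readiness probe and point the liveness probe at /health_generate with a generous timeoutSeconds.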

@dsingal0 (Contributor Author)

@kaiyux wdyt?

@juney-nvidia changed the title from "[feat] add health_generate route to openai serving" to "feat: add health_generate route to openai serving" Apr 26, 2025
@juney-nvidia added the "Community want to contribute" (PRs initiated from Community) and "Community Engagement" (help/insights needed from community) labels Apr 26, 2025
@dsingal0 requested a review from a team as a code owner April 29, 2025 18:43
@kaiyux requested review from Superjomn, kaiyux and LinPoly and removed the request for a team May 1, 2025 15:35
@kaiyux (Member) commented May 1, 2025

@kaiyux wdyt?

@dsingal0 Thanks for the suggestion. Some members of the team are on public holiday and will return next week; we will keep you posted.

At first glance, I think we should be careful when introducing a new API that is outside the OpenAI API scope, and I'll take a closer look at that.

@LinPoly (Collaborator) left a comment

For the added entrypoint function, it is acceptable to check whether the runtime/executor/engine is alive. But similar logic could also be added to the client code instead of the server side: set a timeout for a simple request, then check whether the server returns a response and what its status code is. Do we have any reason to implement the logic on the server side? @dsingal0
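
A minimal sketch of that client-side alternative, assuming the server exposes the usual OpenAI-compatible /v1/completions route; the base URL, model name, and timeout values are illustrative:

```python
import requests

def runtime_alive(base_url: str, model: str, timeout_s: float = 30.0) -> bool:
    """Send a tiny completion request and treat a timely 200 as 'runtime alive'."""
    try:
        resp = requests.post(
            f"{base_url}/v1/completions",
            json={"model": model, "prompt": "ping", "max_tokens": 1},
            timeout=timeout_s,
        )
        return resp.status_code == 200
    except requests.RequestException:
        # Timeouts and connection errors both count as an unhealthy runtime.
        return False
```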

@LinPoly (Collaborator) commented May 6, 2025

FYI: vLLM monitors executor health with a daemon thread, which seems more structured and reliable to me, but we would need to implement similar checking at the executor level. @kaiyux
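
For reference, a minimal sketch of that daemon-thread pattern; `executor.is_alive()` and the failure callback are hypothetical placeholders rather than existing TensorRT-LLM or vLLM APIs:

```python
import threading
import time

class HealthMonitor:
    """Polls an executor-level liveness check from a daemon thread and flags failures."""

    def __init__(self, executor, interval_s: float = 10.0, on_failure=None):
        self._executor = executor                  # assumed to expose is_alive()
        self._interval_s = interval_s
        self._on_failure = on_failure or (lambda: None)
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self) -> None:
        self._thread.start()

    def _run(self) -> None:
        while True:
            try:
                healthy = self._executor.is_alive()
            except Exception:
                healthy = False
            if not healthy:
                # e.g. flip a status flag that /health reports, or terminate the process.
                self._on_failure()
            time.sleep(self._interval_s)
```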

@dsingal0 (Contributor Author) commented May 6, 2025

@LinPoly

Re: implementing the check server-side vs. client-side
For deployment in Kubernetes, we need liveness and readiness probes to detect when a pod is healthy and ready to receive traffic during autoscaling, and when it is unhealthy and needs to be restarted. There is no client code to add this logic to in that case. Currently trtllm-serve exposes /health, but that only checks whether the server is up, not the health of the runtime. It would be great to have this at the executor/runtime level if possible.

@LinPoly (Collaborator) commented May 7, 2025

@LinPoly

Re: implementing the check server-side vs. client-side
For deployment in Kubernetes, we need liveness and readiness probes to detect when a pod is healthy and ready to receive traffic during autoscaling, and when it is unhealthy and needs to be restarted. There is no client code to add this logic to in that case. Currently trtllm-serve exposes /health, but that only checks whether the server is up, not the health of the runtime. It would be great to have this at the executor/runtime level if possible.

So you mean K8s uses the server-side health check for error detection and auto-scaling? If so, I think it is reasonable to add such a workaround (WAR); we can work on a more structured solution afterwards. @kaiyux for opinions.

@Superjomn (Collaborator) commented May 9, 2025

@kaiyux @penli9, as we have no such check yet, I think a /health_generate is reasonable; we can refine the implementation during further iterations. BTW, we are going to introduce a heartbeat mechanism at the LLM-API level, which could facilitate the implementation of /health -- it will check and update the status of the system periodically in a cheap way, but I think /health_generate can always return a real-time status once it is invoked.
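
As a rough sketch of that distinction, assuming a hypothetical heartbeat status that a background task refreshes periodically (the names here are illustrative, not the actual LLM-API heartbeat interface): /health can stay cheap by reading the cached status, while /health_generate exercises the runtime on demand as shown earlier.

```python
import time
from fastapi import FastAPI, Response

app = FastAPI()
heartbeat = {"healthy": True, "updated_at": time.time()}  # refreshed by a background task

@app.get("/health")
async def health() -> Response:
    # Cheap: report the most recent heartbeat instead of touching the runtime.
    stale = time.time() - heartbeat["updated_at"] > 60.0
    ok = heartbeat["healthy"] and not stale
    return Response(status_code=200 if ok else 503)
```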

@kaiyux (Member) commented May 9, 2025

@dsingal0 Can you also help fix the DCO check? https://github.com/NVIDIA/TensorRT-LLM/pull/3856/checks?check_run_id=41920082141

Summary
Commit sha: 2cd150e, Author: Dhruv Singal, Committer: Dhruv Singal; The sign-off is missing.
Commit sha: a12154a, Author: Dhruv Singal, Committer: Dhruv Singal; The sign-off is missing.
Commit sha: 06f3d21, Author: Dhruv Singal, Committer: Dhruv Singal; The sign-off is missing.

You can follow the guidance here to sign off https://github.com/NVIDIA/TensorRT-LLM/blob/main/CONTRIBUTING.md#signing-your-work.

Please let us know if you have any questions, thanks!

@dsingal0 (Contributor Author) commented May 9, 2025

@kaiyux done

@kaiyux (Member) commented May 9, 2025

@dsingal0 Thanks for fixing the DCO check. Mind squashing the commits? Not sure why a lot of unrelated commits were introduced; the diff looks good though.

@dsingal0 (Contributor Author) commented May 9, 2025

@kaiyux I think rebasing to add the sign-off added those commits to the PR; can we squash on merge instead? I added the test and removed the mtp.py change.

@dsingal0 (Contributor Author) commented May 9, 2025

Alternatively, I could open a new PR.

@kaiyux (Member) commented May 9, 2025

@kaiyux I think rebasing to add the sign-off added those commits to the PR; can we squash on merge instead? I added the test and removed the mtp.py change.

Thanks a lot! I think we should be fine merging this one since we will squash the commits when merging, @chzblych do you think differently?

@kaiyux (Member) commented May 9, 2025

/bot run

@tensorrt-cicd (Collaborator)

PR_Github #4694 [ run ] triggered by Bot

@kaiyux (Member) commented May 9, 2025

/bot run

@tensorrt-cicd (Collaborator)

PR_Github #4705 [ run ] triggered by Bot

@tensorrt-cicd (Collaborator)

PR_Github #4694 [ run ] completed with state ABORTED

@tensorrt-cicd (Collaborator)

PR_Github #4705 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #3393 completed with status: 'FAILURE'

@kaiyux (Member) commented May 11, 2025

@dsingal0 Can you help fix the failed test case?

[2025-05-09T14:06:26.867Z] ../../unittest/llmapi/apps/_test_llm_server.py .F....                    [100%]
[2025-05-09T14:06:26.867Z] 
[2025-05-09T14:06:26.867Z] =================================== FAILURES ===================================
[2025-05-09T14:06:26.867Z] _____________________________ test_health_generate _____________________________
[2025-05-09T14:06:26.867Z] 
[2025-05-09T14:06:26.867Z] client = <starlette.testclient.TestClient object at 0x7f632c50bf50>
[2025-05-09T14:06:26.867Z] 
[2025-05-09T14:06:26.867Z]     def test_health_generate(client):
[2025-05-09T14:06:26.867Z]         response = client.get("/health_generate")
[2025-05-09T14:06:26.867Z] >       assert response.status_code == 200
[2025-05-09T14:06:26.867Z] E       assert 404 == 200
[2025-05-09T14:06:26.867Z] E        +  where 404 = <Response [404 Not Found]>.status_code
[2025-05-09T14:06:26.867Z] 
[2025-05-09T14:06:26.867Z] ../../unittest/llmapi/apps/_test_llm_server.py:40: AssertionError

@dsingal0 (Contributor Author)

@kaiyux I believe that fixed it; the test was failing because I had added the route to the LLM API server, not the OpenAI server.

@kaiyux (Member) commented May 12, 2025

/bot run

@tensorrt-cicd (Collaborator)

PR_Github #4873 [ run ] triggered by Bot

@tensorrt-cicd (Collaborator)

PR_Github #4873 [ run ] completed with state FAILURE
/LLM/main/L0_MergeRequest_PR pipeline #3531 completed with status: 'FAILURE'

@kaiyux (Member) commented May 13, 2025

@dsingal0 Can you help fix the style check? :) I would like to fix it myself, but I noticed the PR is based on your main branch and I don't think I should push onto that.

@kaiyux (Member) commented May 15, 2025

Hi @dsingal0, I fixed the style check and cherry-picked the changes to #4349, marking you as co-author. Please let me know if you have any questions.

@dsingal0 (Contributor Author)

Amazing, thank you so much @kaiyux

@kaiyux (Member) commented May 15, 2025

Amazing, thank you so much @kaiyux

No problem, thanks a lot for the great support!

@dsingal0 closed this May 20, 2025
kaiyux added a commit to kaiyux/TensorRT-LLM that referenced this pull request May 21, 2025
Co-authored-by: Dhruv Singal <[email protected]>
Signed-off-by: Kaiyu Xie <[email protected]>
kaiyux added a commit that referenced this pull request May 22, 2025
darraghdog pushed a commit to darraghdog/TensorRT-LLM that referenced this pull request Jun 3, 2025
…#3856) (NVIDIA#4349)

Cherry-pick NVIDIA#3856

Signed-off-by: Kaiyu Xie <[email protected]>
Co-authored-by: Dhruv Singal <[email protected]>
Signed-off-by: darraghdog <[email protected]>
Labels
Community Engagement (help/insights needed from community), Community want to contribute (PRs initiated from Community)
6 participants