feat: add health_generate route to openai serving #3856
Conversation
@kaiyux wdyt?
For the added entrypoint function, it is acceptable to check whether the runtime/executor/engine is alive. But similar logic could also live in the client code instead of the server side: set a timeout on a simple request, then check whether the server returns any response and what its status is. Do we have any reason to implement the logic on the server side? @dsingal0
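For reference, a client-side version of that check could look roughly like the sketch below. This is only an illustration of the idea, not project code; the base URL, model name, and timeout are assumptions.

```python
# Hypothetical client-side liveness check: send a tiny completion request
# with a timeout and treat a missing or failed response as "engine hung".
import requests


def server_can_generate(base_url: str = "http://localhost:8000",
                        timeout_s: float = 30.0) -> bool:
    try:
        resp = requests.post(
            f"{base_url}/v1/completions",
            json={"model": "default", "prompt": "ping", "max_tokens": 1},
            timeout=timeout_s,
        )
    except requests.RequestException:
        return False  # no response within the timeout -> assume hung or dead
    return resp.status_code == 200
```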
FYI: vLLM monitors executor health with a daemon thread, which looks more structured and reliable to me, but we would need to implement similar checking at the executor level. @kaiyux
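For context, that daemon-thread style of monitoring could be sketched roughly as follows; `executor.is_alive()` and the `on_unhealthy` callback are illustrative assumptions, not an existing TensorRT-LLM or vLLM API.

```python
# Hypothetical executor health monitor running as a daemon thread,
# similar in spirit to the vLLM approach mentioned above.
import threading
import time


def start_health_monitor(executor, on_unhealthy, interval_s: float = 10.0):
    def _loop():
        while True:
            # `is_alive` is an assumed method exposing executor liveness.
            if not executor.is_alive():
                on_unhealthy()
                return
            time.sleep(interval_s)

    thread = threading.Thread(target=_loop, daemon=True)
    thread.start()
    return thread
```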
So you mean K8s is using a server-side health check for error detection and auto-scaling? If so, I think it is reasonable to add such a WAR; we can work on a more structured solution afterwards. @kaiyux for opinions.
@kaiyux @penli9, as we have no such check yet, I think a
@dsingal0 Can you also help fix the DCO check? https://github.com/NVIDIA/TensorRT-LLM/pull/3856/checks?check_run_id=41920082141
You can follow the guidance here to sign off https://github.com/NVIDIA/TensorRT-LLM/blob/main/CONTRIBUTING.md#signing-your-work. Please let us know if you have any questions, thanks!
@kaiyux done |
@dsingal0 Thanks for fixing the DCO check. Mind squashing the commits? Not sure why so many unrelated commits were introduced; the diff looks good though.
@kaiyux I think rebasing to add the sign-off added those commits to the PR; can we squash on merge instead? I added the test and removed the mtp.py change.
alternatively, I could open a new PR |
/bot run |
PR_Github #4694 [ run ] triggered by Bot |
/bot run |
PR_Github #4705 [ run ] triggered by Bot |
PR_Github #4694 [ run ] completed with state |
PR_Github #4705 [ run ] completed with state |
@dsingal0 Can you help fix the failed test case?
@kaiyux I believe that fixed it; the test was failing because I had it in the LLM API server, not the OpenAI server.
/bot run |
PR_Github #4873 [ run ] triggered by Bot |
PR_Github #4873 [ run ] completed with state |
@dsingal0 Can you help fix the style check? :) I would like to fix it myself, but I noticed the PR is based on your main branch, and I think I should not push anything onto that.
Amazing, thank you so much @kaiyux |
No problem, thanks a lot for the great support!
Co-authored-by: Dhruv Singal <[email protected]> Signed-off-by: Kaiyu Xie <[email protected]>
…#4349) Cherry-pick #3856 Signed-off-by: Kaiyu Xie <[email protected]> Co-authored-by: Dhruv Singal <[email protected]>
…#3856) (NVIDIA#4349) Cherry-pick NVIDIA#3856 Signed-off-by: Kaiyu Xie <[email protected]> Co-authored-by: Dhruv Singal <[email protected]> Signed-off-by: darraghdog <[email protected]>
Adds /health_generate so Kubernetes can detect when the runtime is hung and restart the container/pod, even if the server's /health still returns a 200.
This is helpful when a long-context request from a user has caused the LLM runtime to hang or die while the FastAPI server is still running.
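To make the mechanism concrete, here is a minimal sketch of what such a route can look like; it is not the exact implementation merged in this PR, and `generate_one_token` plus the 30-second timeout are assumptions for illustration.

```python
# Hypothetical /health_generate route: only report healthy if the engine
# can actually produce output within a bounded time.
import asyncio

from fastapi import FastAPI, Response

app = FastAPI()


async def generate_one_token() -> None:
    """Assumed helper that asks the engine for a single token on a trivial prompt."""
    ...


@app.get("/health_generate")
async def health_generate() -> Response:
    try:
        # If the runtime is hung, this times out instead of blocking forever.
        await asyncio.wait_for(generate_one_token(), timeout=30.0)
    except Exception:
        # Engine is dead or hung: report unhealthy so the orchestrator restarts the pod.
        return Response(status_code=503)
    return Response(status_code=200)
```

A Kubernetes livenessProbe pointed at /health_generate (with a generous timeoutSeconds) can then restart the pod when generation stalls, even while the plain /health endpoint keeps answering.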