Better constructor failure handling in ray serve #16922
Conversation
- Class to help ServeController to manage desirable states of replica.
+ Class to help ServeController to manage desired states of replica.
This comment is a leaky abstraction; the goal manager has nothing to do with managing replica states.
Would it be safer to do self._pending_goals.get(goal_id, asyncio.Future()).set_exception(e) to avoid the situation where this except block is called after the future has already been removed?
I see what you're suggesting .. but I wonder if it adds value to create a new Future instance from the event loop's API, since no one would be waiting for its return value. Following the pattern in the current implementation, we can get the actual future object before the try/except, so we're always awaiting and setting the exception on the same object regardless of whether it's already been popped from pending goals.
I think what we want to do is get a reference to the future outside the try-except then use it instead of doing another lookup in the dictionary:
fut = self._pending_goals[goal_id]
try:
    await fut
except Exception as e:
    fut.set_exception(e)
I don't think this is going to work? Because once await self._pending_goals[goal_id] returns, the future has already completed and has a result, and you can't set_exception on a completed future.
This block should just be await self._pending_goals[goal_id].
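For reference, this is easy to confirm outside of Serve: once a future has a result, asyncio rejects set_exception. A minimal standalone sketch (plain asyncio, not Serve code):

```python
import asyncio

async def main():
    fut = asyncio.get_running_loop().create_future()
    fut.set_result("done")
    print(await fut)  # awaiting a completed future returns immediately
    try:
        fut.set_exception(RuntimeError("too late"))
    except asyncio.InvalidStateError as e:
        # A future that already holds a result cannot take an exception.
        print(f"set_exception rejected: {e}")

asyncio.run(main())
```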
I think the actual behavior for this file isn't surfaced yet because this commit directly calls delete_deployment, but I totally agree with the comments here. Let me change this a bit and add prints as the test plan on this PR so we can see the complete flow of what happens.
Just to confirm, I'm proposing that the ideal behavior for async goals in the case of constructor failure should be:
- We throw the exception from backend_worker's async function, which goes through backend_state_manager first;
- But then we should let the exception eventually surface to the controller's async goal manager, so it can set the future tied to the original goal_id and therefore surface it back to the user script that called deploy(), since the goal_id represents the blocking await for the deploy and it returned with an exception.
Does this make sense as our deploy() API error handling?
Remove
remove
remove
Maybe do:
replica.set_should_stop(0)
replicas.add(ReplicaState.SHOULD_STOP, replica)
You may need to add some sort of "if already dead" logic to BackendReplica::stop
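If it helps, here's a rough, self-contained sketch of what an "if already dead" guard on stop() could look like; the class and enum below are stand-ins for illustration, not the actual BackendReplica/ReplicaState code:

```python
from enum import Enum

class ReplicaState(Enum):
    STARTING = 1
    RUNNING = 2
    SHOULD_STOP = 3
    STOPPING = 4

class FakeReplica:
    """Toy stand-in for BackendReplica, only to illustrate the guard."""

    def __init__(self):
        self._state = ReplicaState.STARTING

    def stop(self):
        # "Already dead" guard: if the replica never came up or is already
        # being torn down, make stop() a no-op instead of raising.
        if self._state == ReplicaState.STOPPING:
            return
        self._state = ReplicaState.STOPPING
        # ...actual actor teardown would happen here...

replica = FakeReplica()
replica.stop()
replica.stop()  # second call is safely ignored
```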
I think deleting just the replica is a good idea because it is definitely possible for a constructor to fail at first, but succeed at a later time.
Yep, I tried this for an earlier PR as well, but it didn't play nicely without leveraging asyncio.Future to set exceptions. Let me give it another try; it's definitely preferable to only take action at the replica level rather than shutting down the entire deployment.
So this needs more changes in addition to marking its state as SHOULD_STOP, since we still think target_replica should be 1, which still causes it to loop forever. Let me send a follow-up commit to also change target_replica to 0 and let update() shut it down through the controller's event loop.
python/ray/serve/backend_worker.py
Outdated
Can you move this break to directly after await sync_to_async(_callable.__init__)(*init_args)
This overall looks pretty good! Would it be easy to test this with an actor that fails at first, but succeeds later? It could be something as simple as using the presence of a file to determine if this is the first or second call:
import os

def __init__(self):
    # Succeed on the second attempt: the marker file exists after the first try.
    if os.path.exists("file.name"):
        return
    with open("file.name", "w") as f:
        f.write("ONE")
    raise RuntimeError("First attempt is sad")
I'd prefer to return a boolean indicating constructor failure rather than using the exception as the signaling mechanism here. That would avoid errors where an unrelated exception in this code triggers us to treat it as a constructor failure.
Using the exception as the signaling mechanism actually matches what I saw in the current codebase, and I'd like to keep it consistent :)
I wonder if there are better ways to surface the initial exception to the deploy() call's wait_for_goal(), though. I assume that ideally we want to surface and set the original exception from the constructor as the value of the goal_id (which is an asyncio.Future now)? But update() in backend_state handles it first, and we will lose it before reaching user code. Maybe we can have some controller-level global exception handling for cases like this, where we want to set the exception on the goal_id we're waiting for?
Hmm, don't we also need to somehow set the exception in the goal manager here? Otherwise how will this get propagated to the future?
and move set_exception here
+1, I think it's better to have a single place to complete a goal. Maybe we should also let the caller pass in the right exception object instead of using a bare Exception?
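Agreed that a single completion path would be cleaner. A toy, self-contained sketch of what that could look like, with the caller supplying the real exception; the names (create_goal, complete_goal, wait_for_goal) follow this thread, but the implementation is an assumption, not the actual AsyncGoalManager:

```python
import asyncio
from typing import Optional
from uuid import uuid4

class ToyGoalManager:
    def __init__(self):
        self._pending_goals = {}

    def create_goal(self) -> str:
        goal_id = uuid4().hex
        self._pending_goals[goal_id] = asyncio.get_event_loop().create_future()
        return goal_id

    def complete_goal(self, goal_id: str,
                      exception: Optional[Exception] = None) -> None:
        # The single place where a goal reaches a terminal state.
        fut = self._pending_goals.get(goal_id)
        if fut is None or fut.done():
            return
        if exception is not None:
            fut.set_exception(exception)  # caller passes the real error
        else:
            fut.set_result(None)

    async def wait_for_goal(self, goal_id: str) -> None:
        await self._pending_goals[goal_id]  # re-raises the stored exception, if any

async def demo():
    mgr = ToyGoalManager()
    goal_id = mgr.create_goal()
    mgr.complete_goal(goal_id, RuntimeError("constructor failed"))
    try:
        await mgr.wait_for_goal(goal_id)
    except RuntimeError as e:
        print(f"deploy() would see: {e}")

asyncio.run(demo())
```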
Force-pushed from 18dbab3 to 1211860.
Hmm what's the difference between this and RUNNING?
RUNNING means the deployment is available with at least one replica running the desired model version (technically we could also return from deploy() at that point so it doesn't block user code, and let the controller scale to the target number eventually), while RUNNING_HEALTHY suggests it reached its final desired state such that no further action is needed.
- They could have better names; this is mostly a draft for now
- They don't necessarily need to be exposed to the user; it's rather an enum to keep track of the possible internal states of a running deployment, which we already sort of do in the logic of update() (rough sketch below)
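For concreteness, here's a rough, hypothetical sketch of the kind of enum I mean; the member names and meanings are placeholders drawn from this thread, not the final design:

```python
from enum import Enum

class DeploymentState(Enum):
    DEPLOYING = 1        # replicas being created, constructors still running
    RUNNING = 2          # at least one replica serving the desired version
    RUNNING_HEALTHY = 3  # reached the target replica count; nothing left to do
    UPGRADING = 4        # rolling over to a new version after reaching RUNNING
    FAILED = 5           # e.g. constructors failed past the retry cap
```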
Hmm it's kind of weird that these states are hierarchical/composable (e.g., what does it mean that it's both RUNNING and UPGRADING?). Can we make it so that a deployment is only in a single state at any given moment?
Ah sorry for the confusion, I meant it should be only one at a time. Line 52 primarily means all subsequent states require the deployment to at least reach the RUNNING state first, so deployment states have their own transition DAG and precedence.
I see, makes sense. I think adding the DeploymentState here is fine but I'd prefer to avoid adding states that are not yet being used at all (in general I'm unilaterally opposed to checking in "dead code").
python/ray/serve/backend_worker.py
Outdated
It seems like we should probably do this retrying at the controller level instead of inside the replica, given that we'll likely need retries there anyways and it'd be best to only have it in one place. Thoughts?
That's probably a better idea with the current setup, now that we're introducing deployment-level states and per-replica status reporting. We will need to keep track of each replica's retry count and only decide to abort if all replicas failed to start, since some of them may be newly spawned replicas retrying. For example:
With a target of 10 replicas: if a total of 10 unique replicas each failed 3 times without ever reaching the running state, that's when we can abort rather than spawning replica #11. But if we had 1 running replica, then even if we accumulated 10 total replicas that failed to start, it shouldn't prevent us from spawning #11, since the deployment has reached the "RUNNING" state.
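A small, self-contained sketch of that abort rule; the constant, function, and dictionary shape are made up for illustration:

```python
from typing import Dict

MAX_RETRIES_PER_REPLICA = 3  # assumed per-replica retry cap

def should_abort_deploy(target_replicas: int,
                        failed_retry_counts: Dict[str, int],
                        ever_reached_running: bool) -> bool:
    # If the deployment ever had a running replica, keep retrying/spawning.
    if ever_reached_running:
        return False
    # Otherwise abort once enough distinct replicas have exhausted their retries.
    exhausted = sum(1 for count in failed_retry_counts.values()
                    if count >= MAX_RETRIES_PER_REPLICA)
    return exhausted >= target_replicas

# Example from above: target of 10, 10 unique replicas each failed 3 times.
failures = {f"replica-{i}": 3 for i in range(10)}
print(should_abort_deploy(10, failures, ever_reached_running=False))  # True -> abort
print(should_abort_deploy(10, failures, ever_reached_running=True))   # False -> keep going
```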
Please add details about the failure handling logic here. What is the behavior and what are the guarantees if a goal is errored? How should the client handle this?
ping
@jiaodong I left a few high-level comments after a quick scan. A full review will take me some more time. Is this in a working state btw?
Thanks for the comments. It's in a working state (I updated the console outputs in the test plan in a gist), but it needs more polishing, and it's probably better to move replica retry to the backend state / controller level.
See comments; mostly nits, but one concern about the behavior: we're marking the deployment as successful when any replica succeeds in starting. We should only be returning once the deployment has either failed or is fully successful.
python/ray/serve/backend_state.py
Outdated
Returns:
    (ready, past_slow_startup_threshold)
    state (ReplicaState): Most recent state of replica by
- state (ReplicaState): Most recent state of replica by
+ state (ReplicaStartingStatus): Most recent state of replica by
python/ray/serve/backend_state.py
Outdated
| STOPPING = 5 | ||
|
|
||
|
|
||
| class ReplicaStartingStatus(Enum): |
- class ReplicaStartingStatus(Enum):
+ class ReplicaStartupStatus(Enum):
nit
python/ray/serve/backend_state.py
Outdated
| f"*not* {self._state}.") | ||
|
|
||
| if self._actor.check_ready(): | ||
| start_status = self._actor.check_ready() |
- start_status = self._actor.check_ready()
+ status = self._actor.check_ready()
python/ray/serve/backend_state.py
Outdated
self._replica_failed_to_start_counter: Dict[BackendTag,
                                             int] = defaultdict(int)
# Keep track of backend info in case of failed deploy() rollback
self._cur_backend_info: Dict[BackendTag, BackendInfo] = dict()
isn't this a duplicate of self._backend_metadata?
python/ray/serve/backend_state.py
Outdated
if backend_tag not in self._replicas:
    self._replicas[backend_tag] = ReplicaStateContainer()

# Set deploy() retry counter and previous
didn't finish this sentence?
python/ray/serve/backend_state.py
Outdated
(complete_goal_ids, failed_goal_ids) = self._check_completed_goals()

for backend_tag, goal_id in complete_goal_ids:
    self._replica_failed_to_start_counter[backend_tag] = -1
Shouldn't we be setting this counter to -1 as soon as any replica successfully starts up? Not sure why we're doing it here
That's one way of doing it too. It's done here in this revision because we make the call when the retry cap is reached and consider the deployment successful, and this is where we handle goal ids that reached a terminal state.
I think technically this counter can be set in different places and still be correct; it purely depends on how and where we make the call.
python/ray/serve/backend_state.py
Outdated
| logger.error("Reverting backend {backend_tag} to previous " | ||
| f"backend info. {self._prev_backend_info}") |
Logging the backend info object will not be useful to users. How about we log the version number instead?
also "backend" should be "deployment" in user-facing logs (confusing to keep track, I know)
I think that's mostly because we don't have a repr for BackendInfo; I was expecting it would somehow print a high-level description of the deployment setup .. but the version number makes more sense for sure.
python/ray/serve/backend_state.py
Outdated
| logger.error("Reverting backend to previous backend " | ||
| f"info by deleting backend {backend_tag}") |
confusing error message. just say "initial deployment failed, deleting the deployment"
return self.healthy


class MockAsyncGoalManager:
I find it strange that we need to mock this class, I'd prefer if we could avoid needing to do this (by not using asyncio.Future unless necessary)
Nice :) Just a few last nits and looks good to merge!
python/ray/serve/backend_state.py
Outdated
| # import ipdb | ||
| # ipdb.set_trace() |
remove
await async_goal.event.wait()
logger.debug(
    f"Waiting for goal {goal_id} took {time.time() - start} seconds")

if async_goal.exception is not None:
    return async_goal.exception
nit: consider adding a method here to avoid directly accessing the goal and exception: return await async_goal.wait()
That was my initial attempt too, but I ran into quite a few pickling issues when directly sending the AsyncGoal object over .remote() calls, so I only kept it minimal with the exception object. But I can wrap the event access within an exposed wait() so that we don't need to touch it directly.
oh I was thinking that .wait() would just return a single object -- None if success or Exception if it errored
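Something like this, roughly; a self-contained sketch where wait() hides the event/exception pair and returns a single value (None on success, the Exception on failure). The shape mirrors the snippet above, but the method names are assumptions:

```python
import asyncio
from typing import Optional

class AsyncGoal:
    def __init__(self):
        self.event = asyncio.Event()
        self.exception: Optional[Exception] = None

    def complete(self, exception: Optional[Exception] = None):
        self.exception = exception
        self.event.set()

    async def wait(self) -> Optional[Exception]:
        # Callers never touch .event or .exception directly.
        await self.event.wait()
        return self.exception

async def demo():
    goal = AsyncGoal()
    goal.complete(RuntimeError("constructor failed"))
    print(await goal.wait())  # RuntimeError(...) on failure, None on success

asyncio.run(demo())
```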
python/ray/serve/backend_state.py
Outdated
self._goal_manager.complete_goal(
    goal_id,
    RuntimeError(
        "Deployment failed, reverting to previous version "
add deployment name to the log
Deployment {backend_tag} failed due to constructor failure, reverting to previous version {version} asynchronously.
| RuntimeError(f"Deployment failed, deleting {backend_tag} " | ||
| "asynchronously.")) |
add deployment name to the log
I thought backend_tag is the same as the deployment name?
oops....
the log msg is a bit confusing, how about: Deployment {backend_tag} failed due to constructor failure, deleting it asynchronously.
Oops, @ijrsvt can you rubber stamp to remove your change request?
Looks like there might have been an actual test failure in test_deploy also:
LGTM !
@jiaodong unfortunately the test_constructor_failure test didn't fully run on the PR. You need to add a
See here, it ran in 0.7s: https://buildkite.com/ray-project/ray-builders-pr/builds/12312#9dafed28-5c32-4b17-8102-12f5907f5112/333-3318
@jiaodong I pushed a commit to make the change ^
Why are these changes needed?
Currently, when an exception is thrown in a deployment constructor, which is expected to happen in the case of transient failures or import errors, we retry forever in the controller. It's not handled nicely and has repeatedly led to user-reported errors.
In addition, it's blocking #15820, which makes the HTTP proxy a regular Serve deployment.
Changes in this PR
Common flow of replica status tracking
- `RayServeWrappedReplica` will throw an exception with a logged error
- `ActorReplicaWrapper`'s `check_ready()` will pick it up via `ray.get` and return `ReplicaState.FAILED_TO_START`
- `BackendReplica`'s `check_started()` will increase the deployment retry counter by 1 (if we're still in a valid tracking state) and return a `ReplicaStartingStatus` back to `update()`
- Replicas whose constructor succeeds reach the `ReplicaState.RUNNING` state and proceed as usual.

Sample test output: https://gist.github.com/jiaodong/bde8fa2432cb02d8c66934fbf23614a8
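A simplified, hypothetical sketch of the check_ready() translation described above: the constructor error surfaces through ray.get on the readiness call and gets mapped to a status instead of crashing the controller loop. The class, function body, and status names here are illustration only, not the actual ActorReplicaWrapper:

```python
import time
from enum import Enum

import ray

class ReplicaStartupStatus(Enum):
    PENDING = 1
    SUCCEEDED = 2
    FAILED_TO_START = 3

@ray.remote
class FlakyReplica:
    def __init__(self):
        raise RuntimeError("constructor failed")

    def ready(self):
        return "ok"

def check_ready(ready_ref) -> ReplicaStartupStatus:
    done, _ = ray.wait([ready_ref], timeout=0)
    if not done:
        return ReplicaStartupStatus.PENDING
    try:
        ray.get(ready_ref)  # re-raises the constructor failure as a RayActorError
        return ReplicaStartupStatus.SUCCEEDED
    except Exception:
        return ReplicaStartupStatus.FAILED_TO_START

ray.init()
replica = FlakyReplica.remote()
ready_ref = replica.ready.remote()
status = check_ready(ready_ref)
while status == ReplicaStartupStatus.PENDING:  # poll, as update() would each loop
    time.sleep(0.1)
    status = check_ready(ready_ref)
print(status)  # ReplicaStartupStatus.FAILED_TO_START
```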
Common flow of failed to start replica handling
- In `_check_completed_goals()`, we factor in the number of failed-to-start replicas and added a few more termination states for a deploy() goal
- Failed replicas are left in the `ReplicaState.FAILED_TO_START` state, as they failed and have nothing to do with subsequent deploy() goals anymore

A sample deploy() flow, in case of consistent failure on one replica, but eventually with at least one replica running at the target version:
A sample deploy() flow, in case of consistent failure on ALL replicas with no replica successfully started:
=== New in latest commit ===
If deploy() failed with an exception, we track the previous backend info and will call `deploy_backend()` if we have a previous state, or `delete_backend()` if the current call is the first one made; thus each deploy() is all-or-nothing with automatic rollback.

Related issue number
Closes #16597
Closes #16114
Checks
Full log
https://gist.github.com/jiaodong/3fd59057bca340a33e774f674479ec6b
I've run `scripts/format.sh` to lint the changes in this PR.