
Conversation

@jiaodong
Member

@jiaodong jiaodong commented Jul 7, 2021

Why are these changes needed?

Currently, when an exception is thrown in a deployment constructor (which is expected to happen in the case of transient failures or import errors), we retry forever in the controller. This isn't handled gracefully and has repeatedly led to user-reported errors.

In addition, it's blocking #15820, which makes the HTTP proxy a regular Serve deployment.

Changes in this PR

Common flow of replica status tracking

  • Register actor definitions and the target state, and return a goal id as usual.
  • Asynchronously kick off the constructor in RayServeWrappedReplica, which throws an exception with a logged error.
    • ActorReplicaWrapper's check_ready() picks it up via ray.get and returns ReplicaState.FAILED_TO_START.
      • BackendReplica's check_started() increments the deployment retry counter by 1 (if we're still in a valid tracking state) and returns a ReplicaStartingStatus back to update() (see the sketch below).
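
A minimal sketch of this check_ready() pattern (not the actual Ray Serve code; the wrapper class, enum members, and the ready ObjectRef are illustrative assumptions):

import ray
from enum import Enum


class ReplicaStartingStatus(Enum):
    PENDING = 1
    SUCCEEDED = 2
    FAILED = 3


class ActorReplicaWrapperSketch:
    """Illustrative wrapper around a replica actor's async constructor call."""

    def __init__(self, ready_obj_ref: ray.ObjectRef):
        # ObjectRef for the remote call that runs the replica's constructor.
        self._ready_obj_ref = ready_obj_ref

    def check_ready(self) -> ReplicaStartingStatus:
        # Non-blocking poll of the constructor call.
        ready, _ = ray.wait([self._ready_obj_ref], timeout=0)
        if not ready:
            return ReplicaStartingStatus.PENDING
        try:
            # ray.get re-raises the constructor's exception, if any.
            ray.get(self._ready_obj_ref)
        except Exception:
            return ReplicaStartingStatus.FAILED
        return ReplicaStartingStatus.SUCCEEDED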

Common flow of failed to start replica handling

  • In each update() loop's _check_completed_goals(), we factor in the number of failed-to-start replicas and add a few more termination states for a deploy() goal (a simplified sketch of this check follows the list below):
    • (1) target_replica_count == running_at_target_version_replica: clear success, complete the goal.
    • (2) failed_to_start_replica > 0 and running_at_target_version_replica > 0: partial success, complete the goal, and also clear all tracked replicas in the ReplicaState.FAILED_TO_START state, since they have failed and are no longer relevant to subsequent deploy() goals.
      • NOTE: It might be worth changing this to something stricter, like failed_to_start_replica + running_at_target_version_replica == target_replica_count, in case we're still in the process of spawning all replicas.
    • (3) failed_to_start_replica >= retry threshold and target_replica_count > 0: clear failure, complete the goal with an exception.
    • (4) Goals completed with either success or failure reset the deployment retry counter to -1, which prevents the controller from increasing it further.
  • A sample deploy() flow, in the case of consistent failure on one replica, where eventually at least one replica runs at the target version:

  • A sample deploy() flow, in the case of consistent failure on ALL replicas with no replica successfully started:
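
A simplified sketch of the goal-completion decision described above (not the real _check_completed_goals(); the retry constant, signature, and return convention are assumptions):

MAX_CONSTRUCTOR_RETRIES = 3  # assumed per-deployment retry threshold


def check_deploy_goal(target_replica_count: int,
                      running_at_target_version: int,
                      failed_to_start: int):
    """Return (goal_completed, succeeded) for a single deploy() goal."""
    # (1) All replicas running at the target version: clear success.
    if running_at_target_version == target_replica_count:
        return True, True
    # (2) Some replicas failed but at least one is running: partial success.
    if failed_to_start > 0 and running_at_target_version > 0:
        return True, True
    # (3) Retry threshold reached with nothing running yet: clear failure.
    if (failed_to_start >= MAX_CONSTRUCTOR_RETRIES
            and target_replica_count > 0
            and running_at_target_version == 0):
        return True, False
    # Otherwise the deployment is still converging; keep the goal pending.
    return False, False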

=== New in latest commit ===

If deploy() fails with an exception, we track the previous backend info and call deploy_backend() if we have a previous state, or delete_backend() if the current call is the first one made. Thus each deploy() is all-or-nothing, with automatic rollback.
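
A sketch of that rollback decision (illustrative only; the deploy_backend() / delete_backend() calls are simplified stand-ins for the real controller code):

from typing import Optional


def handle_failed_deploy(backend_state, backend_tag: str,
                         prev_backend_info: Optional[object]) -> None:
    """Roll back a failed deploy(): restore the previous version or delete."""
    if prev_backend_info is not None:
        # A previously working version exists, so redeploy it.
        backend_state.deploy_backend(backend_tag, prev_backend_info)
    else:
        # This deploy() created the deployment, so remove it entirely.
        backend_state.delete_backend(backend_tag)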

Related issue number

Closes #16597
Closes #16114

Checks

Full log

https://gist.github.com/jiaodong/3fd59057bca340a33e774f674479ec6b

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Contributor

Suggested change
Class to help ServeController to manage desirable states of replica.
Class to help ServeController to manage desired states of replica.

Collaborator

This comment is a leaky abstraction, the goal manager has nothing to do with managing replica states

Contributor

Would it be safer to do self._pending_goals.get(goal_id, asyncio.Future()).set_exception(e) to avoid a situation where this except block is reached after the future has already been removed?

Member Author

I see what you're suggesting, but I wonder whether creating a new Future instance from the event loop's API adds value, since no one would be waiting for its return value. Following the pattern in the current implementation, we can get the actual future object before the try/except, so we always await and set the exception on the same object regardless of whether it has already been popped from pending goals.

Collaborator

I think what we want to do is get a reference to the future outside the try-except then use it instead of doing another lookup in the dictionary:

fut = self._pending_goals[goal_id]
try:
    await fut
except Exception as e:
    fut.set_exception(e)

Contributor

I don't think this is going to work, because await self._pendings[goal_id] means the future has already completed and has a result, and you can't set_exception on a completed future.
This block should just be await self._pending_goals
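
A standalone snippet illustrating the point: once a future has completed, set_exception raises asyncio.InvalidStateError (plain asyncio behavior, not code from this PR):

import asyncio


async def main():
    fut = asyncio.get_running_loop().create_future()
    fut.set_exception(RuntimeError("constructor failed"))
    try:
        await fut  # re-raises the exception that completed the future
    except RuntimeError:
        pass
    try:
        fut.set_exception(RuntimeError("too late"))
    except asyncio.InvalidStateError:
        print("cannot set an exception on an already-completed future")


asyncio.run(main())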

Member Author

I think the actual behavior for this file isn't surfaced yet because this commit directly calls delete_deployment, but I totally agree with the comments here. Let me change this a bit, with prints as the test plan on this PR, so we can see the complete flow of what happens.

Just to confirm, I'm proposing that the ideal flow for async goals in the case of constructor failure should be:

  • We throw an exception from backend_worker's async function, which goes through backend_state_manager first;
  • Then we let the exception eventually surface to the controller's async goal manager, so it can set it on the future tied to the original goal_id and surface it back to the user script that called deploy(), since the blocking await on that goal_id then returns with the exception.

Does this make sense as our deploy() API error handling?

Comment on lines 867 to 869
Contributor

Remove

Comment on lines 891 to 964
Contributor

remove

Comment on lines 906 to 980
Contributor

remove

Comment on lines 1005 to 1006
Contributor

Maybe do:

                    replica.set_should_stop(0)
                    replicas.add(ReplicaState.SHOULD_STOP, replica)

You may need to add some sort of 'if already dead' logic to BackendReplica::stop

Contributor

I think deleting just the replica is a good idea because it is definitely possible for a constructor to fail at first, but succeed at a later time.

Member Author

Yep, I tried this for an earlier PR as well, but it didn't play nicely without leveraging asyncio.Future to set exceptions. Let me give it another try; it's definitely preferable to only take action at the replica level rather than shutting down the entire deployment.

Member Author

So more changes are needed in addition to marking its state as SHOULD_STOP, since we still think target_replica should be 1, which still causes it to loop forever. Let me send a follow-up commit that further changes target_replica to 0 and lets update() shut it down through the controller's event loop.

Contributor

Can you move this break to directly after await sync_to_async(_callable.__init__)(*init_args)?

Contributor

@ijrsvt ijrsvt left a comment

This overall looks pretty good! Would it be easy to test this with an actor that fails at first, but succeeds later? It could be something as simple as using the presence of a file to determine if this is the first or second call:

def __init__(self):
    if os.path.exists("file.name"):
        # Marker file exists: this is the second attempt, so succeed.
        return
    # First attempt: leave a marker file behind, then fail.
    with open("file.name", "w") as f:
        f.write("ONE")
    raise RuntimeError("First attempt is sad")
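
For reference, a rough sketch of how such an integration test might look, assuming the serve API of that era (@serve.deployment / .deploy()); the marker-file path, class name, and handler are illustrative:

import os
import tempfile

import ray
from ray import serve

MARKER = os.path.join(tempfile.gettempdir(), "constructor_marker")


@serve.deployment
class FailOnFirstAttempt:
    def __init__(self):
        if os.path.exists(MARKER):
            return  # second attempt: marker exists, succeed
        with open(MARKER, "w") as f:
            f.write("ONE")
        raise RuntimeError("First attempt is sad")

    def __call__(self, request):
        return "ok"


if __name__ == "__main__":
    ray.init()
    serve.start()
    # Should succeed once the controller retries the constructor.
    FailOnFirstAttempt.deploy()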


Comment on lines 121 to 124
Collaborator

I'd prefer to return a boolean indicating constructor failure rather than using the exception as the signaling mechanism here. That would more closely match the intent and would avoid errors where an exception in this code triggers us to treat it as a constructor failure.

Member Author

Using the exception as the signaling mechanism indeed matches what I saw in the current codebase, and I would like to keep it consistent :)

I wonder if there are better ways to propagate the initial exception to the deploy() call's wait_for_goal(), though. I assume that ideally we want to surface the original exception from the constructor and set it as the value of the goal_id (which is an asyncio.Future now)? But update() in backend_state handles it first, and we lose it before it reaches user code. Maybe we can mix in controller-level global exception handling for cases like this, where we want to set the exception on the goal_id we're waiting for?
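
A minimal sketch of the idea being discussed here, where the constructor's exception is set on the goal's asyncio.Future so the deploy() caller sees it re-raised (not the real AsyncGoalManager; names and structure are assumptions):

import asyncio
from typing import Dict, Optional


class GoalManagerSketch:
    def __init__(self) -> None:
        self._pending_goals: Dict[str, asyncio.Future] = {}

    def create_goal(self, goal_id: str) -> None:
        # Assumes we're running inside the controller's event loop.
        self._pending_goals[goal_id] = asyncio.get_event_loop().create_future()

    def complete_goal(self, goal_id: str,
                      exception: Optional[Exception] = None) -> None:
        fut = self._pending_goals.get(goal_id)
        if fut is None or fut.done():
            return
        if exception is not None:
            # e.g. the constructor failure observed by backend_state.update()
            fut.set_exception(exception)
        else:
            fut.set_result(None)

    async def wait_for_goal(self, goal_id: str) -> None:
        # Re-raises the exception set above, surfacing it to deploy().
        await self._pending_goals[goal_id]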

Collaborator

Hmm, don't we also need to somehow set the exception in the goal manager here? Otherwise how will this get propagated to the future?


Contributor

and move set_exception here

Collaborator

+1, I think it's better to have a single place to complete a goal. Maybe we should also let the caller pass in the right exception object instead of using a bare Exception?

@edoakes
Collaborator

edoakes commented Jul 7, 2021

  • Please add unit tests for the logic to test_backend_state and integration tests to test_deploy.
  • We discussed rolling back to a previous version if an upgrade to a deployment fails, do you plan to handle that in a separate PR?
  • Right now we're treating a single constructor failure as a failed deployment and deleting the deployment if it fails. In some cases this might be undesirable behavior. For example, if the deployment is scaled up and a single constructor fails, we'd delete the entire deployment (which seems very dangerous). At a minimum, we should only do this deletion/rollback behavior when the deployment is being created for the first time or upgraded to a new version.

@jiaodong jiaodong changed the title [Ready for review] Better constructor failure handling in ray serve [WIP] Better constructor failure handling in ray serve Jul 7, 2021
@jiaodong jiaodong added @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. serve Ray Serve Related Issue labels Jul 7, 2021
@jiaodong jiaodong force-pushed the serve_constructor_new branch from 18dbab3 to 1211860 Compare July 29, 2021 00:44
@jiaodong jiaodong removed the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Jul 29, 2021
@jiaodong jiaodong requested review from edoakes, ijrsvt and simon-mo July 29, 2021 00:50
Collaborator

Hmm what's the difference between this and RUNNING?

Member Author

RUNNING means the deployment is available, with at least 1 replica running at the desired version (technically we could also return from deploy() at that point so it doesn't block user code, and let the controller scale to the target number eventually), while RUNNING_HEALTHY means it has reached its final desired state and no further action is needed.

  1. They could have better names; this is mostly a draft for now.
  2. They don't necessarily need to be exposed to the user; rather, they're an enum to keep track of all the possible internal states of a running deployment, which we already sort of do in the logic of update() (rough sketch of such an enum below).
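
A rough sketch of the kind of enum being described (member names other than RUNNING, RUNNING_HEALTHY, and UPGRADING are guesses, not taken from the PR):

from enum import Enum


class DeploymentState(Enum):
    DEPLOYING = 1        # replicas still being started
    RUNNING = 2          # at least one replica serving the target version
    RUNNING_HEALTHY = 3  # all target replicas running; no further action needed
    UPGRADING = 4        # transitioning to a new version (requires RUNNING first)
    FAILED = 5           # constructor retries exhausted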

Collaborator

Hmm it's kind of weird that these states are hierarchical/composable (e.g., what does it mean that it's both RUNNING and UPGRADING?). Can we make it so that a deployment is only in a single state at any given moment?

Member Author

Ah, sorry for the confusion; I meant it should only be one at a time. Line 52 primarily means all subsequent states require the deployment to at least reach the RUNNING state first, so deployment states have their own transition DAG and precedence.

Collaborator

I see, makes sense. I think adding the DeploymentState here is fine but I'd prefer to avoid adding states that are not yet being used at all (in general I'm unilaterally opposed to checking in "dead code").

Comment on lines 73 to 76
Collaborator

It seems like we should probably do this retrying at the controller level instead of inside the replica, given that we'll likely need retries there anyway and it'd be best to only have it in one place. Thoughts?

Member Author

That's probably a better idea with the current setup, now that we're introducing deployment-level states and per-replica status reporting. We will need to keep track of each replica's retry count and decide to abort if all replicas failed to start, even though the failures may come from newly spawned replicas retrying. For example:

With a target of 10 replicas, if a total of 10 unique replicas each failed 3 times without ever reaching the running state, that's when we can abort rather than spawning replica #11. But if we had 1 running replica, even if we accumulated 10 total replicas that failed to start, that shouldn't prevent us from spawning #11, since the deployment has reached the "RUNNING" state.
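
Expressed as a sketch of the decision rule in that example (the constant and function signature are assumptions, not the PR's code):

MAX_RETRIES_PER_REPLICA = 3  # assumed


def should_abort(target_replicas: int, total_failed_starts: int,
                 has_running_replica: bool) -> bool:
    # Never abort once any replica has reached RUNNING; keep spawning instead.
    if has_running_replica:
        return False
    # Abort only after the deployment-wide retry budget is exhausted,
    # e.g. 10 target replicas * 3 attempts each = 30 failed starts.
    return total_failed_starts >= target_replicas * MAX_RETRIES_PER_REPLICA


assert should_abort(10, 30, has_running_replica=False)
assert not should_abort(10, 30, has_running_replica=True)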

Comment on lines 11 to 31
Collaborator

Please add details about the failure handling logic here. What is the behavior and what are the guarantees if a goal is errored? How should the client handle this?

Collaborator

ping

@edoakes
Collaborator

edoakes commented Jul 29, 2021

@jiaodong I left a few high-level comments after a quick scan. A full review will take me some more time. Is this in a working state btw?

@jiaodong
Member Author

> @jiaodong I left a few high-level comments after a quick scan. A full review will take me some more time. Is this in a working state btw?

Thanks for the comments. It's in a working state; I updated the console outputs in the test-plan gist, but it needs more polishing, and it's probably better to move the replica retry to the backend state / controller level.

Collaborator

@edoakes edoakes left a comment

See comments. Mostly nits, but one concern about the behavior: we're marking the deployment as successful when any replica successfully starts. We should only return once the deployment has either failed or is fully successful.

Returns:
(ready, past_slow_startup_threshold)
state (ReplicaState): Most recent state of replica by
Collaborator

Suggested change
state (ReplicaState): Most recent state of replica by
state (ReplicaStartingStatus): Most recent state of replica by

STOPPING = 5


class ReplicaStartingStatus(Enum):
Collaborator

Suggested change
class ReplicaStartingStatus(Enum):
class ReplicaStartupStatus(Enum):

nit

f"*not* {self._state}.")

if self._actor.check_ready():
start_status = self._actor.check_ready()
Collaborator

Suggested change
start_status = self._actor.check_ready()
status = self._actor.check_ready()

self._replica_failed_to_start_counter: Dict[BackendTag,
int] = defaultdict(int)
# Keep track of backend info in case of failed deploy() rollback
self._cur_backend_info: Dict[BackendTag, BackendInfo] = dict()
Collaborator

isn't this a duplicate of self._backend_metadata?

if backend_tag not in self._replicas:
self._replicas[backend_tag] = ReplicaStateContainer()

# Set deploy() retry counter and previous
Collaborator

didn't finish this sentence?

(complete_goal_ids, failed_goal_ids) = self._check_completed_goals()

for backend_tag, goal_id in complete_goal_ids:
self._replica_failed_to_start_counter[backend_tag] = -1
Collaborator

Shouldn't we be setting this counter to -1 as soon as any replica successfully starts up? Not sure why we're doing it here

Member Author

That's one way of doing it too. It's done here in this revision because we make the call when the retry cap is reached and consider the deployment successful, and this is where we handle goal ids that have reached a terminal state.

I think technically this counter can be set in different places and still be correct; it purely depends on how and where we make the call.

Comment on lines 1177 to 1178
logger.error("Reverting backend {backend_tag} to previous "
f"backend info. {self._prev_backend_info}")
Collaborator

Logging the backend info object will not be useful to users. How about we log the version number instead?

Collaborator

also "backend" should be "deployment" in user-facing logs (confusing to keep track, I know)

Member Author

I think it's mostly because we don't have a repr for BackendInfo; I was expecting it would somehow print a high-level description of the deployment setup. But the version number makes more sense for sure.

Comment on lines 1182 to 1183
logger.error("Reverting backend to previous backend "
f"info by deleting backend {backend_tag}")
Collaborator

Confusing error message. Just say "initial deployment failed, deleting the deployment".

return self.healthy


class MockAsyncGoalManager:
Collaborator

I find it strange that we need to mock this class, I'd prefer if we could avoid needing to do this (by not using asyncio.Future unless necessary)

@jiaodong jiaodong added tests-ok The tagger certifies test failures are unrelated and assumes personal liability. and removed @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. labels Aug 26, 2021
@jiaodong jiaodong requested a review from edoakes August 26, 2021 06:06
Collaborator

@edoakes edoakes left a comment

Nice :) Just a few last nits and looks good to merge!

Comment on lines 798 to 799
# import ipdb
# ipdb.set_trace()
Collaborator

remove

Comment on lines 156 to 161
await async_goal.event.wait()
logger.debug(
    f"Waiting for goal {goal_id} took {time.time() - start} seconds")

if async_goal.exception is not None:
    return async_goal.exception
Collaborator

nit: consider adding a method here to avoid directly accessing the goal and exception: return await async_goal.wait()

Member Author

That was my initial attempt too, but I ran into quite a few pickling issues when directly sending the AsyncGoal object over .remote() calls, so I kept it minimal with just the exception object. But I can wrap the event within an exposed wait() so that we don't need to touch it directly.

Collaborator

oh I was thinking that .wait() would just return a single object -- None if success or Exception if it errored
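
A small sketch of that wrapper (the event and exception fields follow the diff context above; the wait() body is an assumption):

import asyncio
from typing import Optional


class AsyncGoal:
    def __init__(self) -> None:
        self.event = asyncio.Event()
        self.exception: Optional[Exception] = None

    async def wait(self) -> Optional[Exception]:
        # Returns None on success, or the recorded exception on failure,
        # so callers never need to touch .event or .exception directly.
        await self.event.wait()
        return self.exception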

self._goal_manager.complete_goal(
    goal_id,
    RuntimeError(
        "Deployment failed, reverting to previous version "
Collaborator

add deployment name to the log

Collaborator

Deployment {backend_tag} failed due to constructor failure, reverting to previous version {version} asynchronously.

Comment on lines +1202 to +1203
RuntimeError(f"Deployment failed, deleting {backend_tag} "
"asynchronously."))
Collaborator

add deployment name to the log

Member Author

I thought backend_tag is the same as the deployment name?

Collaborator

oops....

Collaborator

The log message is a bit confusing; how about: Deployment {backend_tag} failed due to constructor failure, deleting it asynchronously.

@edoakes
Collaborator

edoakes commented Aug 26, 2021

Oops, @ijrsvt can you rubber stamp to remove your change request?

@edoakes
Collaborator

edoakes commented Aug 26, 2021

Looks like there might have been an actual test failure in test_deploy also:
https://buildkite.com/ray-project/ray-builders-pr/builds/12257#d0e7ba05-7f57-4e2d-a421-f48351773bab/431-4618

@edoakes edoakes added @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. and removed tests-ok The tagger certifies test failures are unrelated and assumes personal liability. labels Aug 26, 2021
Contributor

@ijrsvt ijrsvt left a comment

LGTM !

@jiaodong jiaodong added tests-ok The tagger certifies test failures are unrelated and assumes personal liability. and removed @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. labels Aug 27, 2021
@edoakes
Collaborator

edoakes commented Aug 27, 2021

@jiaodong unfortunately the test_constructor_failure test didn't fully run on the PR. You need to add an if __name__ == "__main__" clause to actually run the tests, due to a quirk in our test setup (see any of the other tests for an example).

See here, it ran in 0.7s: https://buildkite.com/ray-project/ray-builders-pr/builds/12312#9dafed28-5c32-4b17-8102-12f5907f5112/333-3318
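
For reference, the kind of clause being described (a minimal example of the pattern, not copied from the test file):

import sys

import pytest

if __name__ == "__main__":
    sys.exit(pytest.main(["-v", __file__]))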

@edoakes
Collaborator

edoakes commented Aug 27, 2021

@jiaodong I pushed a commit to make the change ^

@edoakes edoakes merged commit c7e38ce into ray-project:master Aug 27, 2021
