Skip to content

Conversation

@davepacheco
Copy link
Collaborator

@davepacheco davepacheco commented May 24, 2023

For background, see #26. This is a first step towards resolving that. The basic plan here is that if an undo action fails, then the executor will stop dispatching any new tasks, wait for any outstanding tasks to complete, and then come to rest. The result looks similar to what happens if the saga finishes. You can tell from the result type (see SagaResultErr) whether this is what happened.

For an example, see the output from the new smoke test.

I wrestled a little bit about whether "Stuck" should be a new terminal state (separate from "Done"). I tried that at first but it just made the code trickier and wasn't adding anything. It may make sense to separate these better in the final View type but I figured let's see how this plays out in practice.

Here's what's still to-do here:

  • look at removing SagaCachedState::Stuck
  • add a test for recovering a stuck saga (but maybe this isn't meaningful if I do the above)
  • update the changelog
  • finish the Omicron side of the integration (update steno dep to handle undo actions failing omicron#3211)
  • get approval on this PR
  • get approval on Omicron PR (I'll probably wait to get both approvals before landing either one)
  • fix up the version number in Cargo.toml

After this, I think the remaining part of fixing #26 will be to change the return type for undo actions to UndoActionError so that callers are forced to write UndoActionError::PermanentError. I hope that will guide people away from returning an error here out of convenience, since the impact of doing so is significant.

@davepacheco davepacheco requested a review from andrewjstone May 24, 2023 03:45
Copy link
Contributor

@andrewjstone andrewjstone left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks correct to me, and should alleviate some pain in deployments hopefully. I expect we'll get better at writing sagas where undo steps cannot fail over time.

live_state
.undo_errors
.insert(self.node_id, self.state.0.clone())
.expect_none("undo node failed twice (storing error)");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NIt: The assert(!live_state.undo_errors.contains_key(&self.node_id)) above is an equivalent assertion. I'd probably remove the former.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, good call. Fixed.

.undo_errors
.insert(node_id, error.clone())
.expect_none(
"recovered node twice (undo failure case)",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same as above with duplicate assertions

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed. Thanks for catching these.

+-- undone: (subsaga end) (produces "server_alloc")
+-- undone: InstanceConfigure (produces "instance_configure")
+-- undone: VolumeAttach (produces "volume_attach")
+-- failed: InstanceBoot (produces "instance_boot")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to clarify my understanding here: If InstanceBoot fails because of an undo failure, the rest of the actions above it won't actually be undone, right? I got confused, because they are all marked with "undone".

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question. In this example, InstanceBoot failed because of an action failure. Then, while unwinding, InstanceCreate's undo action failed. So those middle ones (like VolumeAttach) will have been undone.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh I see. Thanks!

@davepacheco davepacheco marked this pull request as ready for review May 25, 2023 20:21
@davepacheco davepacheco enabled auto-merge (squash) May 25, 2023 20:21
@davepacheco davepacheco merged commit be8cd18 into main May 25, 2023
@davepacheco davepacheco deleted the dap/dont-panic branch May 25, 2023 20:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants