Sagas need better handling of undo actions that fail

The current implementation of sagas [unwraps any failures](https://github.com/oxidecomputer/steno/blob/578498bbb43f9061c636b938dff98e7ece0b0efc/src/saga_exec.rs#L1014) from an undo action, so any error returned by an undo action becomes a panic. This is not great for distributed systems, where saga actions cannot always control the state of the system they're operating on. For example, one might run saga recovery with a different version of the software than the one that ran the saga in the first place. In these cases, we'd probably like more nuanced error handling that distinguishes between types of operational errors, indicates whether each is fatal or retryable, and maybe more.
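To make the fatal-versus-retryable distinction concrete, here is a minimal sketch of what such an error type and retry policy might look like. Everything in it (`UndoActionError`, `UndoOutcome`, `run_undo_with_policy`, the `max_attempts` parameter) is hypothetical and not part of steno's API; it only illustrates the shape of the idea.

```rust
use std::fmt;

/// Hypothetical classification of an undo-action failure.
/// This type is an assumption for illustration; steno does not define it.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum UndoActionError {
    /// Transient failure: running the undo action again may succeed.
    Retryable { message: String },
    /// Permanent failure: retrying will not help.
    Fatal { message: String },
}

impl fmt::Display for UndoActionError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            UndoActionError::Retryable { message } => {
                write!(f, "retryable undo failure: {}", message)
            }
            UndoActionError::Fatal { message } => {
                write!(f, "fatal undo failure: {}", message)
            }
        }
    }
}

impl std::error::Error for UndoActionError {}

/// Hypothetical result of trying to undo a single saga node.
#[derive(Debug, PartialEq, Eq)]
pub enum UndoOutcome {
    /// The undo action eventually succeeded.
    Undone,
    /// The undo action could not be completed; the saga needs attention,
    /// but the process itself keeps running.
    Stuck { attempts: u32, message: String },
}

/// Runs `undo` until it succeeds, fails fatally, or has been attempted
/// `max_attempts` times. At least one attempt is always made. The closure
/// receives the 1-based attempt number.
pub fn run_undo_with_policy<F>(max_attempts: u32, mut undo: F) -> UndoOutcome
where
    F: FnMut(u32) -> Result<(), UndoActionError>,
{
    let mut attempt = 0;
    loop {
        attempt += 1;
        match undo(attempt) {
            Ok(()) => return UndoOutcome::Undone,
            // A fatal error is reported immediately, without retrying.
            Err(UndoActionError::Fatal { message }) => {
                return UndoOutcome::Stuck { attempts: attempt, message };
            }
            // A retryable error is retried until the attempt budget runs out.
            Err(UndoActionError::Retryable { message }) => {
                if attempt >= max_attempts {
                    return UndoOutcome::Stuck { attempts: attempt, message };
                }
            }
        }
    }
}
```

The point of returning `UndoOutcome::Stuck` instead of panicking is that a failed undo is recorded as a state of that one saga, so the executor can keep running other sagas. A real design would presumably also need backoff between attempts and a way to persist the stuck state, both of which this sketch leaves out.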

It's also not clear how sagas should handle invariants that their actions would like to `assert`. A failed assertion would normally just abort or unwind the program, depending on the panic strategy it was built with. One could imagine catching these panics and having some policy around retrying the operation, potentially up to some count specified at saga creation time. It'll take some care to make sure one failing saga doesn't block other sagas or, worse, prevent later sagas from ever running to completion.
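As a rough illustration of the "catch and retry" idea, the sketch below wraps an action in `std::panic::catch_unwind` and retries it up to a fixed count. The helper names `run_catching_panics` and `panic_message` and the `max_attempts` parameter are made up for this example and are not steno APIs. It also only works when the program is built with the unwinding panic strategy; with `panic = "abort"` there is nothing to catch.

```rust
use std::any::Any;
use std::panic::{self, AssertUnwindSafe};

/// Runs `action`, catching panics (for example from a failed `assert!`)
/// and retrying up to `max_attempts` times. Returns the 1-based attempt
/// number that succeeded, or the message from the last panic.
/// Hypothetical helper for illustration only.
pub fn run_catching_panics<F>(max_attempts: u32, mut action: F) -> Result<u32, String>
where
    F: FnMut(),
{
    let mut last_message = String::from("action was never attempted");
    for attempt in 1..=max_attempts {
        // AssertUnwindSafe tells the compiler we accept that `action` may
        // observe partially updated state after an earlier panic.
        match panic::catch_unwind(AssertUnwindSafe(|| action())) {
            Ok(()) => return Ok(attempt),
            Err(payload) => last_message = panic_message(&*payload),
        }
    }
    Err(last_message)
}

/// Extracts a readable message from a panic payload, which is usually
/// either a `&'static str` or a `String`.
fn panic_message(payload: &(dyn Any + Send)) -> String {
    if let Some(s) = payload.downcast_ref::<&str>() {
        s.to_string()
    } else if let Some(s) = payload.downcast_ref::<String>() {
        s.clone()
    } else {
        "non-string panic payload".to_string()
    }
}
```

Because the failure comes back as an ordinary `Err`, the caller can mark that one saga as failed and move on to the next, which is one way to keep a misbehaving saga from blocking the others. Whether it is safe to retry an action after an invariant violation is a separate question that this sketch does not answer.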

Sagas need better handling of undo actions that fail #26