Propagate fatal worker errors #188
Merged
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
What was changed
Workerto cancel the current task akin to what https://docs.python.org/3/library/asyncio-task.html#asyncio.timeout does. However, that is only in 3.11 and relies on https://docs.python.org/3/library/asyncio-task.html#asyncio.Task.uncancel to prevent inadvertent nesting/re-raise. We don't get that benefit, but I have observed no issues with the current implementation yet.Worker.is_runningandWorker.is_shutdown. We could have a "status"/"state" enum like TS, but this was simpler for now and doesn't keep us from doing so in the future if someone needs to know, for example, it's in the process of shutting down.Worker.runto be cancel-safe and to re-raise any polling errors returned by coreWorker.shutdownto be idempotent and even be able to be called beforerunstarts to alleviate issues like panic ifRunandStoprace sdk-go#868, Worker fatal error can cause double worker stop sdk-go#903, etcon_fatal_errorcallback that users can register. This doesn't have much value since we throw out ofrunanyways, but the nice thing is this one is called before we start the shutdown process if they want to do anything (I can't imagine what though).Core does not have a way to cause an immediate fatal error (everything is retried for a minute) so I mock core in my tests, but I have confirmed that this does actually propagate the error after a little while if, say, you delete the namespace with the operator service.
Checklist