Skip to content

Conversation

@cretz
Copy link
Member

@cretz cretz commented Nov 30, 2022

What was changed

Checklist

  1. Closes Clarify story around sync activity cancellation #146

@cretz cretz requested a review from a team November 30, 2022 23:48
@bergundy
Copy link
Member

bergundy commented Dec 1, 2022

Looks like worker shutdown is disconnected from thread raise. I would merge it with cancellation so activity authors have one less thing to worry about. In TS worker shutdown notifications are delivered via cancellation.

@cretz
Copy link
Member Author

cretz commented Dec 1, 2022

Looks like worker shutdown is disconnected from thread raise. I would merge it with cancellation so activity authors have one less thing to worry about. In TS worker shutdown notifications are delivered via cancellation.

Only slightly disconnected. Just like in TS, worker shutdown in Python is also delivered via cancellation, it's just that we offer a grace period before that where we populate an event. But the grace period by default is 0 anyways and most don't use it. It's only for advanced uses needing to know worker shutdown before the cancel occurs. We don't throw an exception or cancel the async task during this grace period.

@bergundy
Copy link
Member

bergundy commented Dec 1, 2022

Only slightly disconnected. Just like in TS, worker shutdown in Python is also delivered via cancellation, it's just that we offer a grace period before that where we populate an event. But the grace period by default is 0 anyways and most don't use it. It's only for advanced uses needing to know worker shutdown before the cancel occurs. We don't throw an exception or cancel the async task during this grace period.

I see, in TS we immediately cancel the activities and wait for the grace period before throwing an exceptions in case activities don't respond to shutdown.
I think that's more practical than having the worker stuck waiting on non-responsive activities and traditionally what I would expect from graceful shutdown (same goes for HTTP servers), at some point you have to give up.

Again, not for this PR, it came up because I saw the worker shutdown event while doing the code review.

@cretz
Copy link
Member Author

cretz commented Dec 1, 2022

I see, in TS we immediately cancel the activities and wait for the grace period before throwing an exceptions in case activities don't respond to shutdown.

The whole reason for the grace period is so that activities can react differently than to cancellation. This concept also is in the Go SDK and people do ask for it and leverage it.

in case activities don't respond to shutdown. I think that's more practical than having the worker stuck waiting on non-responsive activities

Hrmm, not sure. We simply don't shutdown in Python if activities don't respond to the cancellation. Same for Go (and maybe same for Java too, unsure).

at some point you have to give up.

Not true. A user can choose to give up, but we don't have to and don't in multiple SDKs. In multithreaded/async situations in many languages, you can't force a stop from the outside, you can only request it. So we guarantee no leaks after "shutdown" returns. But if we were to return from shutdown without guaranteeing the threads/tasks actually finished (e.g. they catch and swallow cancel/interrupt), you are basically encouraging leaks.

@bergundy
Copy link
Member

bergundy commented Dec 1, 2022

I see, in TS we immediately cancel the activities and wait for the grace period before throwing an exceptions in case activities don't respond to shutdown.

The whole reason for the grace period is so that activities can react differently than to cancellation. This concept also is in the Go SDK and people do ask for it and leverage it.

They can do this with the cancellation approach, it's just a matter of checking the cancellation reason if they need to customize the logic.

in case activities don't respond to shutdown. I think that's more practical than having the worker stuck waiting on non-responsive activities

Hrmm, not sure. We simply don't shutdown in Python if activities don't respond to the cancellation. Same for Go (and maybe same for Java too, unsure).

I don't know what Java does here either but I'm not convinced this is the right approach.

at some point you have to give up.

Not true. A user can choose to give up, but we don't have to and don't in multiple SDKs. In multithreaded/async situations in many languages, you can't force a stop from the outside, you can only request it. So we guarantee no leaks after "shutdown" returns. But if we were to return from shutdown without guaranteeing the threads/tasks actually finished (e.g. they catch and swallow cancel/interrupt), you are basically encouraging leaks.

User can set the shutdown grace period to some very high number if they want to avoid the leak.
The whole point of a grace period is that after it ends things become ungraceful even if that means that resources leak.
Users might have other (non Temporal) steps in their shutdown sequence that they want to ensure they have enough time to execute before their process gets killed (for example if being evicted from a spot instance). The grace period ensures that shutdown duration is bounded, without it, users need to wrap the worker run call with their own timer so they can continue the shutdown process.

@bergundy
Copy link
Member

bergundy commented Dec 1, 2022

From what I've seen in HTTP servers, there's typically a grace period for draining all of the existing connections after which the server will forcefully shut down.

@cretz
Copy link
Member Author

cretz commented Dec 1, 2022

From what I've seen in HTTP servers, there's typically a grace period for draining all of the existing connections after which the server will forcefully shut down.

This confuses libraries with processes I think. If we're talking about a Temporal worker process, yes we should give up and kill the process. But we're talking about a library here. The giving up and killing of the process is the user's choice, not ours. Giving up and returning (or erroring) as though the user can continue on without leaks would be bad.

https://pkg.go.dev/net/http#Server.Shutdown for example, the Go HTTP server, would never give up unless the user asks them to. If you have an HTTP handler doing work forever and you don't timeout the call from the caller side, that call will never return.

@bergundy
Copy link
Member

bergundy commented Dec 1, 2022

From what I've seen in HTTP servers, there's typically a grace period for draining all of the existing connections after which the server will forcefully shut down.

This confuses libraries with processes I think. If we're talking about a Temporal worker process, yes we should give up and kill the process. But we're talking about a library here. The giving up and killing of the process is the user's choice, not ours. Giving up and returning (or erroring) as though the user can continue on without leaks would be bad.

I don't think there's any confusion, by giving the user a configurable grace period we make it the user's choice.

https://pkg.go.dev/net/http#Server.Shutdown for example, the Go HTTP server, would never give up unless the user asks them to. If you have an HTTP handler doing work forever and you don't timeout the call from the caller side, that call will never return.

From that page:

If the provided context expires before the shutdown is complete, Shutdown returns the context's error, otherwise it returns any error returned from closing the Server's underlying Listener(s).

That's basically the graceful period that we have in TS.
If context was a thing in JS I'd use that instead of adding a graceful period option.

@cretz
Copy link
Member Author

cretz commented Dec 1, 2022

That's basically the graceful period that we have in TS.

Why have that graceful period then? Why not let the user just abandon the promise after their own timeout? If you're just abandoning shutdown, the user can too.

If context was a thing in JS I'd use that instead of adding a graceful period option.

Cancellation/abandoning the shutdown call is an option for users too. They can easily asyncio.wait_for(worker.shutdown(), 10) to wait only a max of 10s to shutdown. This is the equivalent of the context in Go.

@cretz cretz merged commit 6a43919 into temporalio:main Dec 1, 2022
@cretz cretz deleted the sync-activity-raise branch December 1, 2022 14:48
@bergundy
Copy link
Member

bergundy commented Dec 2, 2022

Why have that graceful period then? Why not let the user just abandon the promise after their own timeout? If you're just abandoning shutdown, the user can too.

We do some cleanup but I agree there's probably not that much value there.

Cancellation/abandoning the shutdown call is an option for users too. They can easily asyncio.wait_for(worker.shutdown(), 10) to wait only a max of 10s to shutdown. This is the equivalent of the context in Go.

In TS SDK you can't wait for shutdown, it doesn't return a promise and I'm not convinced it should.
Graceful shutdown is there to save users the headache of figuring out how to implement this themselves, it's not as simple as asyncio.wait_for in TS, sort of like runUntil it's not trivial to implement but it's super useful.

@bergundy
Copy link
Member

bergundy commented Dec 2, 2022

I understand that in Python there might be less value for implementing the version of graceful shutdown that TS has.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Clarify story around sync activity cancellation

2 participants