[SYCL] Fix asynchronous exception behavior #20274

steffenlarsen · 2025-10-03T14:12:42Z

This commit makes the following changes to the behavior of asynchronous exception handling:

The death of a queue should not consume asynchronous exceptions.
Calling wait_and_throw on an event after the associated queue has died should still consume exceptions that were originally associated with the queue. This should respect the async_handler priority to the best of its ability.
Calling wait_and_throw or throw_asynchronous on a queue without an async_handler should fall back to using the async_handler of the associated context, then the default async_handler if none were attached to the context.
Throwing asynchronous exceptions from anywhere will now consume all unconsumed asynchronous exceptions previously reported, no matter the event/queue/context/device.

Additionally, this lays the ground work for
#20266 by moving the tracking of unconsumed asynchronous exception to the devices.

This commit makes the following changes to the behavior of asynchronous exception handling: 1. The death of a queue should not consume asynchronous exceptions. 2. Calling wait_and_throw on an event after the associated queue has died should still consume exceptions that were originally associated with the queue. This should respect the async_handler priority to the best of its ability. 3. Calling wait_and_throw or throw_asynchronous on a queue without an async_handler should fall back to using the async_handler of the associated context, then the default async_handler if none were attached to the context. Additionally, this lays the ground work for intel#20266 by moving the tracking of unconsumed asynchronous exception to the devices. Signed-off-by: Larsen, Steffen <[email protected]>

Signed-off-by: Larsen, Steffen <[email protected]>

gmlueck · 2025-10-03T17:18:14Z

sycl/test-e2e/AsyncHandler/custom_async_handler.cpp

@@ -0,0 +1,217 @@
+// RUN: %{build} -o %t.out


I reviewed the tests, and they look good. Thanks!

Should we consider adding this test (custom_async_handler.cpp) to the CTS instead? Evidently, the CTS isn't covering this, or we would have seen a failure before. I think the other test (default_async_handler.cpp) cannot be added to the CTS because it tests our implementation-specific behavior of the default async handler.

I will look into it!

Signed-off-by: Larsen, Steffen <[email protected]>

sycl/source/detail/queue_impl.hpp

sycl/source/detail/device_impl.hpp

KseniyaTikhomirova · 2025-10-13T14:45:57Z

sycl/source/detail/event_impl.cpp

  wait();

-  if (std::shared_ptr<queue_impl> SubmittedQueue = MSubmittedQueue.lock())
+  if (std::shared_ptr<queue_impl> SubmittedQueue = MSubmittedQueue.lock()) {


event::wait_and_throw() description states
"At least all unconsumed [asynchronous errors](https://registry.khronos.org/SYCL/specs/sycl-2020/html/sycl-2020.html#async-error) held by queues (or their associated contexts) which were used to enqueue commands associated with this event and any dependent events, are passed to the appropriate [async_handler](https://registry.khronos.org/SYCL/specs/sycl-2020/html/sycl-2020.html#async-handler) as described in [Section 4.13.1.3](https://registry.khronos.org/SYCL/specs/sycl-2020/html/sycl-2020.html#subsubsec:async.handler.priorities)."

Does it sound like we should somehow handle errors from more than 1 queue if we have some cross-queue/context dependencies? I get it as we should have even more complex algorithm.
I understand that our former impl is not accurate at all and also reports errors for one queue only but I am curious - should we be even more precise?

Good eye! I think your reading is right, though it may be overkill to handle it in this PR. In fact, I worry it'll be awkward to handle in general, but definitely something we should investigate if @gmlueck also agrees with that reading of the spec.

I agree, it sounds pretty complex to handle.

I think you're right that this is what the spec says. If you call event::wait_and_throw on some event E1, I think we are required to report async errors for all ancestor events too, following the dependency chain from E1. On the surface, that sounds pretty complex. Is this something we can reasonably implement without adding a ton of bookkeeping overhead?

I think there is no simple and lightweight way to do that. It will have to track all dependency graph so I'd prefer to avoid handling of it.
I am also concerned if it is even informative. For example if we have E1 of command submitted to Q1 and E2 of command submitted to Q2 and E2 depends on E1. till the moment user calls E2.wait_and_throw() user can submit other commands to Q1 that may fail and produce async errors. If I get it right Spec state that we should report all unconsumed async errors held by queues participating in the submissions of E2 and its dependencies. But doesn't require us to report only related errors (I believe because it was clear how much handling it require). But in this case it may doesn't have sense because reported errors may have no relation to E2 at all. Then this data may be more confusing than useful.

It's hard to say for sure what the CUDA behavior is here without testing it on a multi-device system, but my reading of the documentation is that cudaGetLastError would return errors from all devices. My guess is it works like this for a cross-device dependency:

Suppose the application does cudaEventSynchronize on event E1 which indirectly depends on event E2 to complete. Suppose E1 is recording a stream on device D1 and E2 is recording a stream on device D2.

When E2 completes, I guess that any errors will be captured in the context for device D2.

When E1 completes, I guess that any errors will be captured in the context for device D1.

Suppose the application calls cudaGetLastError after cudaEventSynchronize returns.

The description of cudaGetLastError says that it returns the last error "produced by any of the runtime calls in the same instance of the CUDA Runtime library". This doesn't say anything about a particular device or context, so I assume it checks for errors across all devices and contexts.

Your proposal to store errors per-device but still report them all via event::wait_and_throw sounds reasonable. However, I think we want to avoid looping over all devices / platforms on each call to wait_and_throw. To avoid this, we could have a global boolean flag that tells whether there is any async error anywhere. Thus, you would only need to check a single boolean flag in the call to wait_and_throw.

It is also an option to go back to the Khronos workgroup and propose a change to event::wait_and_throw. If we want this, we should have a good justification about why it is hard to implement the currently specified behavior.

Your proposal to store errors per-device but still report them all via event::wait_and_throw sounds reasonable. However, I think we want to avoid looping over all devices / platforms on each call to wait_and_throw. To avoid this, we could have a global boolean flag that tells whether there is any async error anywhere. Thus, you would only need to check a single boolean flag in the call to wait_and_throw.

That could be an option, though we would have to check and set it atomically, so it won't be free either, but maybe it would indeed be better than iterating the full device set. I am not too worried about the price incurred by this at the time of asynchronous exception throwing, so it would only really be the atomic read. Would we want it per-platform or fully global?

It is also an option to go back to the Khronos workgroup and propose a change to event::wait_and_throw. If we want this, we should have a good justification about why it is hard to implement the currently specified behavior.

It depends? Does what we are currently discussing seem within the bounds of the expected behavior? If not, I would argue that it would be better to not require the asynchronous exception tracking to be dependent keeping track of the entire dependency chain. If the runtime can live without that sort of book-keeping in general, it seems silly to require it to have specialized book-keeping for something like asynchronous exceptions, which may be on the critical path of a lot of applications (through wait_and_throw.) I also don't think it's a crazy idea to say that when throwing asynchronous exceptions, everything is flushed. That is, a user calling any throw_asynchronous will cause all asynchronous exceptions to start triggering. Would that be too aggressive? Seems simpler to me, but I am also not sure what application code using async errors expect.

I also don't think it's a crazy idea to say that when throwing asynchronous exceptions, everything is flushed. That is, a user calling any throw_asynchronous will cause all asynchronous exceptions to start triggering.

I also think this makes sense, and this is what I was advocating above. I think the current spec wording allows this. Do you disagree?

I am not too worried about the price incurred by this at the time of asynchronous exception throwing, so it would only really be the atomic read. Would we want it per-platform or fully global?

I agree that performance is not that important when an async error is thrown. The critical thing is that the check to see if there are any async errors needs to be fast. As you say, this will be on the critical path for any application that calls wait_and_throw. It seems to me that the fastest way to do this is to maintain one global boolean flag that tells whether there are any async errors at all. That way the call to wait_and_throw will just need one atomic read to tell whether there are any. Of course, I'm not looking at the implementation, so you might have some better ideas.

I think the current spec wording allows this. Do you disagree?

I agree with that reading. It is a little loose, but it seems to offer us that freedom.

It seems to me that the fastest way to do this is to maintain one global boolean flag that tells whether there are any async errors at all. That way the call to wait_and_throw will just need one atomic read to tell whether there are any. Of course, I'm not looking at the implementation, so you might have some better ideas.

If all waits will just flush all async exceptions, we could just store all of them globally. That way, the implementation can just get the lock and empty out the exceptions.

The PR has been updated to now store the exceptions globally (in the scheduler) and flush them all when asynchronous exceptions are to be thrown from anywhere.

Signed-off-by: Larsen, Steffen <[email protected]>

This commit removes a test case where the default handler would be used if the queue dies. This test case was unreliable as there was no way to ensure that the runtime was not keeping the queue alive for internal reasons. Signed-off-by: Larsen, Steffen <[email protected]>

…ption_behavior

Signed-off-by: Larsen, Steffen <[email protected]>

gmlueck · 2025-10-28T16:44:46Z

steffenlarsen requested review from KseniyaTikhomirova and gmlueck 1 hour ago

There's no spec change in this PR, and I don't usually review code. What would you like me to review?

It turns out that there is an update from the Khronos WG discussions. The WG has agree that calling event::wait_and_throw or event::throw_asynchronous will produce UB if there are any asynchronous errors that need to be reported. This was a compromise decision, that was not by first choice. Our original plan to call the default async handler in this scenario is still valid, so we can still do that if we want. Another option is to trigger an assert if there are any unreported async errors in this scenario.

steffenlarsen · 2025-10-29T06:29:59Z

There's no spec change in this PR, and I don't usually review code. What would you like me to review?

I just want to make sure we are one the same page about the newest approach. Approval is not strictly needed, but the last sanity-check would be appreciated.

gmlueck

I assume one of the other reviewers will review the code. I approve of the direction summarized here.

sycl/source/detail/event_impl.hpp

sycl/source/detail/scheduler/scheduler.cpp

Signed-off-by: Larsen, Steffen <[email protected]>

KseniyaTikhomirova

LGTM

KornevNikita · 2025-11-07T15:49:54Z

@steffenlarsen since this Nightly sycl-cts/test_event started to fail on all platforms. I checked locally, looks like this patch is guilty. Could you please check?

steffenlarsen · 2025-11-10T07:00:06Z

Thank you for making me aware of this new failure, @KornevNikita! Fix is in KhronosGroup/SYCL-CTS#1155.

steffenlarsen requested a review from a team as a code owner October 3, 2025 14:12

steffenlarsen requested review from KseniyaTikhomirova and gmlueck October 3, 2025 14:12

steffenlarsen had a problem deploying to WindowsCILock October 3, 2025 14:12 — with GitHub Actions Failure

steffenlarsen had a problem deploying to WindowsCILock October 3, 2025 14:38 — with GitHub Actions Failure

Reintroduce the pipe to file

b1dc735

Signed-off-by: Larsen, Steffen <[email protected]>

steffenlarsen temporarily deployed to WindowsCILock October 3, 2025 15:03 — with GitHub Actions Inactive

steffenlarsen had a problem deploying to WindowsCILock October 3, 2025 15:30 — with GitHub Actions Failure

steffenlarsen temporarily deployed to WindowsCILock October 3, 2025 15:30 — with GitHub Actions Inactive

gmlueck reviewed Oct 3, 2025

View reviewed changes

Merge branch 'sycl' into steffen/fix_async_exception_behavior

b00c9e6

steffenlarsen had a problem deploying to WindowsCILock October 13, 2025 07:22 — with GitHub Actions Failure

Fix new failure

ab2e8d1

Signed-off-by: Larsen, Steffen <[email protected]>

steffenlarsen temporarily deployed to WindowsCILock October 13, 2025 07:53 — with GitHub Actions Inactive

steffenlarsen temporarily deployed to WindowsCILock October 13, 2025 08:22 — with GitHub Actions Inactive

KseniyaTikhomirova reviewed Oct 13, 2025

View reviewed changes

Fix comment

f8237d4

Signed-off-by: Larsen, Steffen <[email protected]>

steffenlarsen temporarily deployed to WindowsCILock October 14, 2025 11:55 — with GitHub Actions Inactive

steffenlarsen temporarily deployed to WindowsCILock October 14, 2025 12:24 — with GitHub Actions Inactive

steffenlarsen temporarily deployed to WindowsCILock October 14, 2025 13:53 — with GitHub Actions Inactive

steffenlarsen had a problem deploying to WindowsCILock October 14, 2025 14:22 — with GitHub Actions Failure

steffenlarsen added 2 commits October 16, 2025 03:56

Merge remote-tracking branch 'intel/sycl' into steffen/fix_async_exce…

8389503

…ption_behavior

Fix removed test case run

e76da36

Signed-off-by: Larsen, Steffen <[email protected]>

steffenlarsen temporarily deployed to WindowsCILock October 16, 2025 10:57 — with GitHub Actions Inactive

steffenlarsen temporarily deployed to WindowsCILock October 16, 2025 11:48 — with GitHub Actions Inactive

Switch to global async flushing

5e43acc

Signed-off-by: Larsen, Steffen <[email protected]>

steffenlarsen temporarily deployed to WindowsCILock October 28, 2025 08:05 — with GitHub Actions Inactive

steffenlarsen temporarily deployed to WindowsCILock October 28, 2025 08:38 — with GitHub Actions Inactive

steffenlarsen had a problem deploying to WindowsCILock October 28, 2025 08:38 — with GitHub Actions Failure

steffenlarsen added 2 commits October 28, 2025 03:50

Fix formatting

c395f87

Signed-off-by: Larsen, Steffen <[email protected]>

Fix unittest failures

4a24036

Signed-off-by: Larsen, Steffen <[email protected]>

steffenlarsen temporarily deployed to WindowsCILock October 28, 2025 11:21 — with GitHub Actions Inactive

steffenlarsen temporarily deployed to WindowsCILock October 28, 2025 11:47 — with GitHub Actions Inactive

steffenlarsen requested review from KseniyaTikhomirova and gmlueck October 28, 2025 14:59

gmlueck approved these changes Oct 29, 2025

View reviewed changes

KseniyaTikhomirova reviewed Oct 31, 2025

View reviewed changes

sycl/source/detail/event_impl.hpp Outdated Show resolved Hide resolved

sycl/source/detail/event_impl.hpp Outdated Show resolved Hide resolved

sycl/source/detail/scheduler/scheduler.cpp Outdated Show resolved Hide resolved

steffenlarsen added 2 commits November 2, 2025 23:18

Remove leftovers

0e37140

Signed-off-by: Larsen, Steffen <[email protected]>

Add context to the async exception key

72136b6

Signed-off-by: Larsen, Steffen <[email protected]>

steffenlarsen temporarily deployed to WindowsCILock November 3, 2025 07:56 — with GitHub Actions Inactive

steffenlarsen requested a review from KseniyaTikhomirova November 3, 2025 07:56

steffenlarsen temporarily deployed to WindowsCILock November 3, 2025 08:21 — with GitHub Actions Inactive

steffenlarsen had a problem deploying to WindowsCILock November 3, 2025 08:21 — with GitHub Actions Failure

steffenlarsen temporarily deployed to WindowsCILock November 3, 2025 09:25 — with GitHub Actions Inactive

KseniyaTikhomirova approved these changes Nov 3, 2025

View reviewed changes

steffenlarsen merged commit e1962d2 into intel:sycl Nov 3, 2025
31 of 32 checks passed

steffenlarsen deleted the steffen/fix_async_exception_behavior branch November 10, 2025 07:00

[SYCL] Fix asynchronous exception behavior #20274

[SYCL] Fix asynchronous exception behavior #20274

Uh oh!

Conversation

steffenlarsen commented Oct 3, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

KseniyaTikhomirova Oct 13, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

KseniyaTikhomirova Oct 14, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

gmlueck commented Oct 28, 2025

Uh oh!

steffenlarsen commented Oct 29, 2025

Uh oh!

gmlueck left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

KseniyaTikhomirova left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

KornevNikita commented Nov 7, 2025

Uh oh!

steffenlarsen commented Nov 10, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

steffenlarsen commented Oct 3, 2025 •

edited

Loading

KseniyaTikhomirova Oct 13, 2025 •

edited

Loading

KseniyaTikhomirova Oct 14, 2025 •

edited

Loading