Skip to content

Conversation

@steffenlarsen
Copy link
Contributor

@steffenlarsen steffenlarsen commented Oct 3, 2025

This commit makes the following changes to the behavior of asynchronous exception handling:

  1. The death of a queue should not consume asynchronous exceptions.
  2. Calling wait_and_throw on an event after the associated queue has died should still consume exceptions that were originally associated with the queue. This should respect the async_handler priority to the best of its ability.
  3. Calling wait_and_throw or throw_asynchronous on a queue without an async_handler should fall back to using the async_handler of the associated context, then the default async_handler if none were attached to the context.
  4. Throwing asynchronous exceptions from anywhere will now consume all unconsumed asynchronous exceptions previously reported, no matter the event/queue/context/device.

Additionally, this lays the ground work for
#20266 by moving the tracking of unconsumed asynchronous exception to the devices.

This commit makes the following changes to the behavior of asynchronous
exception handling:

 1. The death of a queue should not consume asynchronous exceptions.
 2. Calling wait_and_throw on an event after the associated queue has
    died should still consume exceptions that were originally associated
    with the queue. This should respect the async_handler priority to
    the best of its ability.
 3. Calling wait_and_throw or throw_asynchronous on a queue without an
    async_handler should fall back to using the async_handler of the
    associated context, then the default async_handler if none were
    attached to the context.

Additionally, this lays the ground work for
intel#20266 by moving the tracking of
unconsumed asynchronous exception to the devices.

Signed-off-by: Larsen, Steffen <[email protected]>
Signed-off-by: Larsen, Steffen <[email protected]>
@@ -0,0 +1,217 @@
// RUN: %{build} -o %t.out
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I reviewed the tests, and they look good. Thanks!

Should we consider adding this test (custom_async_handler.cpp) to the CTS instead? Evidently, the CTS isn't covering this, or we would have seen a failure before. I think the other test (default_async_handler.cpp) cannot be added to the CTS because it tests our implementation-specific behavior of the default async handler.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will look into it!

Signed-off-by: Larsen, Steffen <[email protected]>
wait();

if (std::shared_ptr<queue_impl> SubmittedQueue = MSubmittedQueue.lock())
if (std::shared_ptr<queue_impl> SubmittedQueue = MSubmittedQueue.lock()) {
Copy link
Contributor

@KseniyaTikhomirova KseniyaTikhomirova Oct 13, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

event::wait_and_throw() description states
"At least all unconsumed [asynchronous errors](https://registry.khronos.org/SYCL/specs/sycl-2020/html/sycl-2020.html#async-error) held by queues (or their associated contexts) which were used to enqueue commands associated with this event and any dependent events, are passed to the appropriate [async_handler](https://registry.khronos.org/SYCL/specs/sycl-2020/html/sycl-2020.html#async-handler) as described in [Section 4.13.1.3](https://registry.khronos.org/SYCL/specs/sycl-2020/html/sycl-2020.html#subsubsec:async.handler.priorities)."

Does it sound like we should somehow handle errors from more than 1 queue if we have some cross-queue/context dependencies? I get it as we should have even more complex algorithm.
I understand that our former impl is not accurate at all and also reports errors for one queue only but I am curious - should we be even more precise?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good eye! I think your reading is right, though it may be overkill to handle it in this PR. In fact, I worry it'll be awkward to handle in general, but definitely something we should investigate if @gmlueck also agrees with that reading of the spec.

Copy link
Contributor

@KseniyaTikhomirova KseniyaTikhomirova Oct 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, it sounds pretty complex to handle.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you're right that this is what the spec says. If you call event::wait_and_throw on some event E1, I think we are required to report async errors for all ancestor events too, following the dependency chain from E1. On the surface, that sounds pretty complex. Is this something we can reasonably implement without adding a ton of bookkeeping overhead?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there is no simple and lightweight way to do that. It will have to track all dependency graph so I'd prefer to avoid handling of it.
I am also concerned if it is even informative. For example if we have E1 of command submitted to Q1 and E2 of command submitted to Q2 and E2 depends on E1. till the moment user calls E2.wait_and_throw() user can submit other commands to Q1 that may fail and produce async errors. If I get it right Spec state that we should report all unconsumed async errors held by queues participating in the submissions of E2 and its dependencies. But doesn't require us to report only related errors (I believe because it was clear how much handling it require). But in this case it may doesn't have sense because reported errors may have no relation to E2 at all. Then this data may be more confusing than useful.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's hard to say for sure what the CUDA behavior is here without testing it on a multi-device system, but my reading of the documentation is that cudaGetLastError would return errors from all devices. My guess is it works like this for a cross-device dependency:

  • Suppose the application does cudaEventSynchronize on event E1 which indirectly depends on event E2 to complete. Suppose E1 is recording a stream on device D1 and E2 is recording a stream on device D2.
  • When E2 completes, I guess that any errors will be captured in the context for device D2.
  • When E1 completes, I guess that any errors will be captured in the context for device D1.
  • Suppose the application calls cudaGetLastError after cudaEventSynchronize returns.
  • The description of cudaGetLastError says that it returns the last error "produced by any of the runtime calls in the same instance of the CUDA Runtime library". This doesn't say anything about a particular device or context, so I assume it checks for errors across all devices and contexts.

Your proposal to store errors per-device but still report them all via event::wait_and_throw sounds reasonable. However, I think we want to avoid looping over all devices / platforms on each call to wait_and_throw. To avoid this, we could have a global boolean flag that tells whether there is any async error anywhere. Thus, you would only need to check a single boolean flag in the call to wait_and_throw.

It is also an option to go back to the Khronos workgroup and propose a change to event::wait_and_throw. If we want this, we should have a good justification about why it is hard to implement the currently specified behavior.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your proposal to store errors per-device but still report them all via event::wait_and_throw sounds reasonable. However, I think we want to avoid looping over all devices / platforms on each call to wait_and_throw. To avoid this, we could have a global boolean flag that tells whether there is any async error anywhere. Thus, you would only need to check a single boolean flag in the call to wait_and_throw.

That could be an option, though we would have to check and set it atomically, so it won't be free either, but maybe it would indeed be better than iterating the full device set. I am not too worried about the price incurred by this at the time of asynchronous exception throwing, so it would only really be the atomic read. Would we want it per-platform or fully global?

It is also an option to go back to the Khronos workgroup and propose a change to event::wait_and_throw. If we want this, we should have a good justification about why it is hard to implement the currently specified behavior.

It depends? Does what we are currently discussing seem within the bounds of the expected behavior? If not, I would argue that it would be better to not require the asynchronous exception tracking to be dependent keeping track of the entire dependency chain. If the runtime can live without that sort of book-keeping in general, it seems silly to require it to have specialized book-keeping for something like asynchronous exceptions, which may be on the critical path of a lot of applications (through wait_and_throw.) I also don't think it's a crazy idea to say that when throwing asynchronous exceptions, everything is flushed. That is, a user calling any throw_asynchronous will cause all asynchronous exceptions to start triggering. Would that be too aggressive? Seems simpler to me, but I am also not sure what application code using async errors expect.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also don't think it's a crazy idea to say that when throwing asynchronous exceptions, everything is flushed. That is, a user calling any throw_asynchronous will cause all asynchronous exceptions to start triggering.

I also think this makes sense, and this is what I was advocating above. I think the current spec wording allows this. Do you disagree?

I am not too worried about the price incurred by this at the time of asynchronous exception throwing, so it would only really be the atomic read. Would we want it per-platform or fully global?

I agree that performance is not that important when an async error is thrown. The critical thing is that the check to see if there are any async errors needs to be fast. As you say, this will be on the critical path for any application that calls wait_and_throw. It seems to me that the fastest way to do this is to maintain one global boolean flag that tells whether there are any async errors at all. That way the call to wait_and_throw will just need one atomic read to tell whether there are any. Of course, I'm not looking at the implementation, so you might have some better ideas.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the current spec wording allows this. Do you disagree?

I agree with that reading. It is a little loose, but it seems to offer us that freedom.

It seems to me that the fastest way to do this is to maintain one global boolean flag that tells whether there are any async errors at all. That way the call to wait_and_throw will just need one atomic read to tell whether there are any. Of course, I'm not looking at the implementation, so you might have some better ideas.

If all waits will just flush all async exceptions, we could just store all of them globally. That way, the implementation can just get the lock and empty out the exceptions.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PR has been updated to now store the exceptions globally (in the scheduler) and flush them all when asynchronous exceptions are to be thrown from anywhere.

Signed-off-by: Larsen, Steffen <[email protected]>
This commit removes a test case where the default handler would be used
if the queue dies. This test case was unreliable as there was no way to
ensure that the runtime was not keeping the queue alive for internal
reasons.

Signed-off-by: Larsen, Steffen <[email protected]>
Signed-off-by: Larsen, Steffen <[email protected]>
Signed-off-by: Larsen, Steffen <[email protected]>
Signed-off-by: Larsen, Steffen <[email protected]>
@gmlueck
Copy link
Contributor

gmlueck commented Oct 28, 2025

steffenlarsen requested review from KseniyaTikhomirova and gmlueck 1 hour ago

There's no spec change in this PR, and I don't usually review code. What would you like me to review?

It turns out that there is an update from the Khronos WG discussions. The WG has agree that calling event::wait_and_throw or event::throw_asynchronous will produce UB if there are any asynchronous errors that need to be reported. This was a compromise decision, that was not by first choice. Our original plan to call the default async handler in this scenario is still valid, so we can still do that if we want. Another option is to trigger an assert if there are any unreported async errors in this scenario.

@steffenlarsen
Copy link
Contributor Author

There's no spec change in this PR, and I don't usually review code. What would you like me to review?

I just want to make sure we are one the same page about the newest approach. Approval is not strictly needed, but the last sanity-check would be appreciated.

Copy link
Contributor

@gmlueck gmlueck left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume one of the other reviewers will review the code. I approve of the direction summarized here.

Signed-off-by: Larsen, Steffen <[email protected]>
Copy link
Contributor

@KseniyaTikhomirova KseniyaTikhomirova left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@steffenlarsen steffenlarsen merged commit e1962d2 into intel:sycl Nov 3, 2025
31 of 32 checks passed
@KornevNikita
Copy link
Contributor

@steffenlarsen since this Nightly sycl-cts/test_event started to fail on all platforms. I checked locally, looks like this patch is guilty. Could you please check?

@steffenlarsen
Copy link
Contributor Author

Thank you for making me aware of this new failure, @KornevNikita! Fix is in KhronosGroup/SYCL-CTS#1155.

@steffenlarsen steffenlarsen deleted the steffen/fix_async_exception_behavior branch November 10, 2025 07:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants