Skip to content

Conversation

@VSadov
Copy link
Member

@VSadov VSadov commented Aug 26, 2022

Fixes:#74678

There is not a lot of value in gracefully removing the main thread from the thread list. By the time it gets to detaching of the main thread it is typically the only one still running. Besides some thread may be terminated while holding the thread list spinlock, so attempt to do shutdown for the main thread will likely deadlock.

@ghost ghost assigned VSadov Aug 26, 2022
@VSadov VSadov requested a review from jkotas August 26, 2022 19:52
// initialization and false on failure.
REDHAWK_PALEXPORT bool REDHAWK_PALAPI PalInit()
{
g_mainThreadId = GetCurrentThreadId();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The initialization can run on any thread when NativeAOT is used to produce libraries.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is inconvenient.

There is no easy way to identify the main thread.
Online samples suggest iterating through all threads and look for the earliest start time.

Copy link
Member Author

@VSadov VSadov Aug 26, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A part of the problem is that the thread object in NativeAOT lives directly in a TSL slot and thus is owned by the thread, so we must unlink the threads from the list before the thread dies and that requires taking a lock.

@VSadov
Copy link
Member Author

VSadov commented Aug 26, 2022

The alternative solution that I tried was to switch Windows to the same plan as on Unix with TlsDestructionMonitor

That did not work. The reason is that on Windows TLS destructors run while holding system loader lock and we want to call managed cleanup routines, which makes it prone to deadlock.
One scenario:

  • thread is detached while GC starts.
  • TLS destructor tries to call into managed code and reverse PInvoke blocks untill GC ends
  • GC wants to start a background GC thread
  • GC waits in a join for the BGC thread to start.
  • the BGC thread cannot start since the loader lock is taken

// The current fiber is the home fiber of a thread, so the thread is shutting down
RuntimeThreadShutdown(lpFlsData);
// Do not do shutdown for the main thread though.
// other threads are terminated before the main one and could leave TLS locked,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not how shutdown works. The shutdown first terminates all threads except the thread that called ExitProcess. ExitProcess can be called by any thread.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a lot of vague information on the details of process/thread shutdown. It looks like in a completely broad sense it is hard to handle as there is always some crazy scenario such as someone starting new threads in a DLLMain detach handler.

I think we should strive for handling "common" cases.
However using NativeAOT in a library and calling ExitProcess from non-main thread would seem reasonable.

@jkotas
Copy link
Member

jkotas commented Aug 26, 2022

The reason is that on Windows TLS destructors run while holding system loader lock and we want to call managed cleanup routines, which makes it prone to deadlock.

What would it take to not call any managed code during thread exit? Calling managed code during process exit sounds like an endless source of problems.

@VSadov
Copy link
Member Author

VSadov commented Aug 26, 2022

What would it take to not call any managed code during thread exit? Calling managed code during process exit sounds like an endless source of problems.

I agree on this. It seems to be more useful on Unix than on Windows.

On Unix we need to tell the managed wait subsystem to release all locks associated with the thread. We do not have a problem on Unix though.
Let me check if we do much on Windows.

@VSadov
Copy link
Member Author

VSadov commented Aug 26, 2022

There is some stuff on Windows too. Less than on Unix, but we still decrement a number of foreground threads, possibly release an event used in WaitForForegroundThreads.
I wonder if this could be handed over to finalizer thread somehow.

@jkotas
Copy link
Member

jkotas commented Aug 26, 2022

Or just do that via a bit of unmanaged code.

@VSadov
Copy link
Member Author

VSadov commented Aug 26, 2022

It is actually trickier than it seems. We need to interact with managed code eventually (more on Unix, less on Windows) to inform that the thread has stopped, and we need to do it fairly promptly upon the thread's exit, thus finalizer might not work well due to unknown latency.

I see how the current scheme was attractive - as long as the calling code does not hold system locks, calling a managed helper from the native callback is a natural solution, not much different from other cases when managed code is called from native.
It looks like fiber detach does not hold locks so it was chosen as such callback.
The only problem is that a thread may be terminated inside the fiber callback during shutdown.

@VSadov
Copy link
Member Author

VSadov commented Aug 26, 2022

The only problem is that a thread may be terminated inside the fiber callback during shutdown.

and not just that. If all threads were killed during shutdown, it would probably be ok, but the last one is not killed, so if it blocks, the process does not exit.

@VSadov
Copy link
Member Author

VSadov commented Aug 26, 2022

Another possibility is to keep a count of attached threads and if we are in shutdown mode, do not perform cleanup for the last one.

@VSadov
Copy link
Member Author

VSadov commented Aug 27, 2022

I think we can specialcase the thread that called RhpShutdown instead, assuming that this is the thread that does the shutdown or exited from main(). This will cover most if not all scenarios that we support for 7.0

For v.Next we can think of a better solution that would handle cases like NativeAOT libraries, ExitProcess called from native code or multiple times, and preferably find a way to not call managed code at process shutdown.

Although, if fiber detach on Windows and TLS dtor on Unix are guaranteed to not hold locks, they may still be a part of the solution.

Copy link
Member

@jkotas jkotas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not a complete fix for the problem, but it is definitely improvement. We should look at the full fix in .NET 8 as you have said. Thank you!

@VSadov
Copy link
Member Author

VSadov commented Aug 27, 2022

Thanks!

@VSadov VSadov merged commit 0a60d95 into dotnet:main Aug 27, 2022
@VSadov VSadov deleted the mainThr branch August 27, 2022 15:22
@jkotas
Copy link
Member

jkotas commented Aug 27, 2022

@VSadov I think that this is candidate for .NET 7 backport. Do you agree?

@VSadov
Copy link
Member Author

VSadov commented Aug 27, 2022

@VSadov I think that this is candidate for .NET 7 backport. Do you agree?

yes, I will initiate a backport.
I wanted to file a follow-up issue first, so we will not forget to revisit.

@VSadov
Copy link
Member Author

VSadov commented Aug 27, 2022

/backport to release/7.0

@github-actions
Copy link
Contributor

Started backporting to release/7.0: https://github.com/dotnet/runtime/actions/runs/2939886329

@ghost ghost locked as resolved and limited conversation to collaborators Sep 26, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants