Description
When running MonteCarlo sims with more jobs than slaves, the forked sim slaves occasionally hang in ThreadBase::join_thread(). As a result, the jobs are eventually cancelled by Master if the default timeout is set. It looks like join was being called on threads that had already been cleaned up, since the threads in question didn't appear in GDB. The hang also always seemed to occur on the VariableServerListen thread.
I don't have reliable steps to reproduce this. @Pherring04 has promised to testify on my behalf. I was able to stop my forked sims from hanging and failing to return data to Master by commenting out the body of join_thread(), like so:
int Trick::ThreadBase::join_thread()
{
    /*
    if ( pthread_id != 0 ) {
        if ((errno = pthread_join(pthread_id, NULL)) != 0) {
            std::string msg = "Thread " + name + " had an error in join";
            perror(msg.c_str());
        } else {
            pthread_id = 0;
        }
    }
    */
    return(0) ;
}
Another interesting thing about this is that the first batch of jobs always ran fine. The behavior only started once the slaves forked again.
In this case, join_thread() is called from SysThread::ensureAllShutdown():
// Join all threads
for (SysThread * thread : all_sys_threads()) {
    thread->join_thread();
}