
ThreadBase intermittently hangs in join_thread #1984

@MarRHerr

Description

When running MonteCarlo sims with more jobs than slaves, the forked sim slaves occasionally hang in ThreadBase::join_thread(). As a result, Master eventually cancels the jobs if the default timeout is set. It looks like join was being called on threads that had already been cleaned up, since the threads in question didn't appear in GDB. The hang also always seemed to occur on the VariableServerListen thread.

I have no idea what the steps to reproduce this are. @Pherring04 has promised to testify on my behalf. I was able to stop my forked sims from hanging (and failing to return data to Master) by commenting out the body of join_thread(), like so:

int Trick::ThreadBase::join_thread()
{
    /*
    if ( pthread_id != 0 ) {
        if ((errno = pthread_join(pthread_id, NULL)) != 0) {
            std::string msg = "Thread " + name + " had an error in join";
            perror(msg.c_str());
        } else {
            pthread_id = 0;
        }
    }
    */
    return(0) ;
}

Another interesting thing about this is that the first batch of jobs always ran fine. It wasn't until the Slaves forked again that this behavior started happening.

The join_thread() in this case is called from SysThread::ensureAllShutdown():

    // Join all threads
    for (SysThread * thread : all_sys_threads()) {
        thread->join_thread();
    }
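
One possible explanation, offered only as a hypothesis: fork() copies just the calling thread into the child, so a forked slave can inherit a nonzero pthread_id for a thread that does not exist in its own process, and joining on that stale id may never return. The sketch below shows a less drastic alternative to commenting out the join: forget inherited thread handles in the child via pthread_atfork(). DemoThread, g_listener, forget_after_fork() and fork_and_join_in_child() are hypothetical names for illustration, not Trick APIs, and a separate joinable flag is used in place of comparing pthread_id to 0.

```cpp
#include <pthread.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
#include <atomic>
#include <cstdio>
#include <cstring>

// Hypothetical stand-in for Trick::ThreadBase; not the real class.
class DemoThread {
public:
    int start() {
        stop_.store(false);
        int rc = pthread_create(&id_, nullptr, &DemoThread::run, this);
        if (rc == 0) joinable_ = true;
        return rc;
    }
    // Joins only a thread that this process actually owns.
    int join_thread() {
        if (!joinable_) return 0;
        int rc = pthread_join(id_, nullptr);  // returns the error code directly
        if (rc != 0) {
            std::fprintf(stderr, "join failed: %s\n", std::strerror(rc));
            return rc;
        }
        joinable_ = false;
        return 0;
    }
    void request_stop() { stop_.store(true); }
    // Assumption: called in the child after fork(), where the thread is gone.
    void forget_after_fork() { joinable_ = false; }
    bool joinable() const { return joinable_; }

private:
    static void* run(void* self) {
        DemoThread* t = static_cast<DemoThread*>(self);
        while (!t->stop_.load()) usleep(1000);  // idle until asked to stop
        return nullptr;
    }
    pthread_t id_{};
    bool joinable_ = false;
    std::atomic<bool> stop_{false};
};

static DemoThread g_listener;

// Runs in the child right after fork(): drop handles inherited from the parent.
static void child_after_fork() { g_listener.forget_after_fork(); }

// Returns the child's exit status: 0 if join_thread() returned in the child.
int fork_and_join_in_child() {
    static bool registered = false;
    if (!registered) {
        pthread_atfork(nullptr, nullptr, child_after_fork);
        registered = true;
    }
    if (g_listener.start() != 0) return -1;

    pid_t pid = fork();
    if (pid < 0) return -1;
    if (pid == 0) {
        alarm(5);  // safety net: terminate the child if the join hangs anyway
        _exit(g_listener.join_thread() == 0 ? 0 : 1);
    }

    int status = 0;
    waitpid(pid, &status, 0);
    g_listener.request_stop();  // the parent still owns the real thread
    g_listener.join_thread();
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Unlike the commented-out workaround, this keeps the join in the process that created the thread and skips it only where the handle is known to be stale. Whether this matches what actually happens in the forked slaves has not been confirmed.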
