- 
                Notifications
    You must be signed in to change notification settings 
- Fork 3.8k
mac,thread: fix pthread_barrier_wait serial thread #2003
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| Ping @cjihrig @bnoordhuis @saghul Would you mind taking a peek at this PR? | 
| Hi @ofrobots! 
 Plausible. But from this I understand that you don't have proof, just a guess? Is the implementation wrong?
 Let us say instead, it is not safe for any other thread to successfully destroy the barrier until all threads have exited it safely. The  The corresponding libuv polyfill for  As a result, my understanding is that: 
 (I might be wrong, just sharing my perspective on the code/spec). What should the implementation do?The return value  | 
| 
 Am I reasonably certain? Yes. The crashes that I observe match the thread races as I have described them. With my change, the crashes get fixed. 
 I can buy that argument. The more interesting question is whether  
 The only documentation for  With that in mind, I stand by the fix I provided here, but I am open to alternative suggestions if you have them. | 
| Discussion
 Maybe I misunderstood the behavior you are experiencing. 
 
 The libuv docs say here that "The API largely follows the pthreads API", which is why I was looking at the POSIX spec. Your patchI am concerned about your implementation in the case of a barrier initialized with a limit of 1. | 
        
          
                src/unix/thread.c
              
                Outdated
          
        
      |  | ||
| pthread_mutex_unlock(&barrier->mutex); | ||
| return PTHREAD_BARRIER_SERIAL_THREAD; | ||
| return 0; | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In a 1-thread barrier, shouldn't this return PTHREAD_BARRIER_SERIAL_THREAD?
Can we put the do-while loop under an else so that all threads share the same exit path?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍Fixed.
        
          
                src/unix/thread.c
              
                Outdated
          
        
      |  | ||
| int pthread_barrier_wait(pthread_barrier_t* barrier) { | ||
| int rc; | ||
| char last; | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why did you use char instead of int?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
        
          
                src/unix/thread.c
              
                Outdated
          
        
      | if ((rc = pthread_cond_wait(&barrier->cond, &barrier->mutex)) != 0) | ||
| break; | ||
| } while (barrier->in != 0); | ||
| assert(rc == 0); | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should probably be just:
do {
  rc = pthread_cond_wait(&barrier->cond, &barrier->mutex);
  assert(rc == 0);
} while (barrier->in != 0);There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now that I think about it... everywhere else we call abort() when pthread functions fail when they logically can't.  Would be good to do that here too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done. There are a few more places where we should be aborting if a pthread function fails illogically. I didn't include those in this PR to keep the review delta small. I think they can be addressed in a follow-on.
        
          
                src/unix/thread.c
              
                Outdated
          
        
      | /* Mark thread exit. If this the last thread to exit, return | ||
| PTHREAD_BARRIER_SERIAL_THREAD. */ | ||
| last = (--barrier->out == 0); | ||
| pthread_cond_signal(&barrier->cond); | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Optimization: if this is the last thread, don't call pthread_cond_signal(); there isn't anyone to signal.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
cf025f6    to
    fb3cfab      
    Compare
  
    | 
 I have added a commit with a test ( | 
fb3cfab    to
    e34cd4a      
    Compare
  
    There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, mostly LGTM.  Is the third commit (the one that registers barrier_serial_thread_single) supposed to be squashed with the first one?
        
          
                test/test-barrier.c
              
                Outdated
          
        
      | } | ||
|  | ||
| static void serial_worker(void* data) { | ||
| uv_barrier_t* barrier = (uv_barrier_t*) data; | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unnecessary cast.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
        
          
                test/test-barrier.c
              
                Outdated
          
        
      | exited the barrier. If this value is returned too early and the barrier is | ||
| destroyed prematurely, then this test may see a crash. */ | ||
| TEST_IMPL(barrier_serial_thread) { | ||
| uv_thread_t threads[NUM_WORKERS]; | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It'd be slightly more idiomatic to declare this as uv_thread_t threads[4]; and then use ARRAY_SIZE(threads) everywhere.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
        
          
                test/test-barrier.c
              
                Outdated
          
        
      |  | ||
| ASSERT(0 == uv_barrier_init(&barrier, NUM_WORKERS + 1)); | ||
|  | ||
| for (int i = 0; i < NUM_WORKERS; ++i) { | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you declare int i at the top?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
| 
 I have (at last!) identified the cause. Maybe this was obvious to everyone already, but just in case... I have been studying the source of the libuv polyfills, and I can't find a thread schedule that has  But you're calling  The  In summary: 
 I don't like the prospect of memory leaks, so I support your patch. | 
e34cd4a    to
    01e18bf      
    Compare
  
    | 
 It is not clear (to me) which spec applies here anyway.  I suggest we move this discussion to another issue as it is not necessary to resolve the bug that I am trying to fix. @bnoordhuis I've squashed the commit and address the rest of the comments. Thank you both for your review and insights! | 
| Should this be targeting v1.x? | 
| Agreed, @ofrobots can you rebase? | 
01e18bf    to
    a989af1      
    Compare
  
    There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. I'm still interested in opinions about the assert in uv_barrier_destroy, but this is not a blocker for me.
| @davisjam can we fork that discussion into a separate issue? I would this change to land. This fix is blocking a cascade of PRs in Node.js. | 
| What are the next steps here, is there a CI somewhere that can be run? | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is there a CI somewhere that can be run?
Indeed there is: https://ci.nodejs.org/view/libuv/job/libuv-test-commit/1047/
        
          
                src/unix/thread.c
              
                Outdated
          
        
      |  | ||
| int pthread_barrier_wait(pthread_barrier_t* barrier) { | ||
| int rc; | ||
| int rc, last; | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor style issue: one declaration per line.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
The thread returning PTHREAD_BARRIER_SERIAL_THREAD should be the one last to exit the barrier, otherwise it is unsafe to destroy the barrier – other threads may have yet to exit.
a989af1    to
    67130df      
    Compare
  
    | Windows test failure seems unrelated. AIX failure is relevant and needs to be investigated. I have opened nodejs/build#1514 to get login access to AIX to be able to investigate this. | 
| It seems that the AIX implementation of  This means that that default example of  if (uv_barrier_wait(&barrier) > 0)
    uv_barrier_destroy(&barrier);I am forced to conclude that the  uv_barrier_wait(&barrier);
if (0 == uv_barrier_destroy(&barrier)) {
  // any additional cleanup, e.g. free of the barrier.
} | 
| I also noticed that we have AIX 6.1 in the CI. As per https://www-01.ibm.com/support/docview.wss?uid=isg3T1012517, AIX 6.1 may already be EOL since last year? It is possible that the behaviour is different/fixed on AIX 7. Looking at node supported platforms: https://github.com/nodejs/node/blob/master/BUILDING.md#supported-platforms-1, AIX 7 is the minimum supported. libuv claims to support AIX 6 though. | 
It was pointed out that pthread_barrier_wait() behaves slightly different from other platforms. Switch to libuv's own thread barrier for uniformity of behavior. Perhaps we'll do that for more platforms in the future. Refs: libuv#2003 (comment)
| Maybe we should use our own barrier implementation everywhere for the sake of uniform behavior, or at least switch to it on AIX. I've opened #2019 and included the changes from this PR. | 
| 👍. Deferring to #2019. | 
It was pointed out that pthread_barrier_wait() behaves slightly different from other platforms. Switch to libuv's own thread barrier for uniformity of behavior. Perhaps we'll do that for more platforms in the future. PR-URL: libuv#2019 Refs: libuv#2003 (comment) Reviewed-By: Santiago Gimeno <[email protected]>
Libuv's own thread barrier implementation signaled completion to the
first waiter that saw the threshold being reached, contrary to what
some native pthreads barrier implementations do, which is to signal
it to the _last_ waiter.
Libuv's behavior is not strictly non-conforming but it's inconvenient
because it means this snippet (that appears in the libuv documentation)
has a race condition in it:
    if (uv_barrier_wait(&barrier) > 0)
      uv_barrier_destroy(&barrier);  // can still have waiters
This issue was discovered and fixed by Ali Ijaz Sheikh, a.k.a @ofrobots,
but some refactoring introduced conflicts in his pull request and I
didn't have the heart to ask him to redo it from scratch. :-)
PR-URL: libuv#2019
Refs: libuv#2003
Reviewed-By: Santiago Gimeno <[email protected]>
    
Hi Everyone 👋. This is my first PR here. While trying to fix nodejs/node#23065 using
uv_barriers, I noticed a thread race on my mac that, I believe, stems from a buggy polyfill ofpthread_barrier_waitin libuv.Here's the documentation on
uv_barrier_wait:The actual implementation returned
PTHREAD_BARRIER_SERIAL_THREAD(i.e. > 0) on the last thread to enter the wait. IMHO this value should be returned on the last thread to exit the wait. It is not safe for any other thread to destroy the barrier as there may be threads yet to exit the barrier.Checking with implementations elsewhere, refer to the the implementation of
absl::Barrier::Blockfrom the excellent abseil libraries, and they seem to agree with my assessment.I didn't know about
pthread_barrier_tuntil yesterday so it is possible I am figuring things wrong. PTAL.