RFR: 8130728: Disable WorkAroundNPTLTimedWaitHang by default

David Holmes david.holmes at oracle.com
Thu Jul 9 09:27:43 UTC 2015


https://bugs.openjdk.java.net/browse/JDK-8130728

The workaround triggered by WorkAroundNPTLTimedWaitHang (an experimental 
flag set to 1 by default) was to alleviate a hang that could occur on a 
pthread_cond_timedwait on Linux, if the timeout value was a time in the 
past - see JDK-6314875. This glibc bug was fixed in 2005 in glibc 2.3.5-3

https://lists.debian.org/debian-glibc/2005/08/msg00201.html

but the workaround persists and was (unfortunately) copied into the BSD 
and AIX ports.

It is time to remove that workaround but before we can do that we need 
to be sure that we are not in fact hitting the workaround code. To that 
end I propose to use this bug to switch the flag's value to 0 to verify 
correct operation.

If that goes well then we can remove the code either later in JDK 9 or 
early in JDK 10.

To be clear the flag impacts both the wait and the signal part of the 
condvar usage (park and unpark). On the wait side we have:

     status = pthread_cond_timedwait(_cond, _mutex, &abst);
     if (status != 0 && WorkAroundNPTLTimedWaitHang) {
       pthread_cond_destroy(_cond);
       pthread_cond_init(_cond, os::Linux::condAttr());
     }

so we will now skip this potential recovery code.

On the signal side we signal while holding the mutex if the workaround 
is enabled, and we signal after dropping the mutex otherwise. So this 
code will now be using a new path. Here is the code in PlatformEvent::unpark

   assert(AnyWaiters == 0 || AnyWaiters == 1, "invariant");
   if (AnyWaiters != 0 && WorkAroundNPTLTimedWaitHang) {
     AnyWaiters = 0;
     pthread_cond_signal(_cond);
   }
   status = pthread_mutex_unlock(_mutex);
   assert_status(status == 0, status, "mutex_unlock");
   if (AnyWaiters != 0) {
     // Note that we signal() *after* dropping the lock for "immortal" 
Events.
     // This is safe and avoids a common class of  futile wakeups.  In rare
     // circumstances this can cause a thread to return prematurely from
     // cond_{timed}wait() but the spurious wakeup is benign and the victim
     // will simply re-test the condition and re-park itself.
     // This provides particular benefit if the underlying platform does not
     // provide wait morphing.
     status = pthread_cond_signal(_cond);
     assert_status(status == 0, status, "cond_signal");
   }

and similarly in Parker::unpark

      if (WorkAroundNPTLTimedWaitHang) {
         status = pthread_cond_signal(&_cond[_cur_index]);
         assert(status == 0, "invariant");
         status = pthread_mutex_unlock(_mutex);
         assert(status == 0, "invariant");
       } else {
         status = pthread_mutex_unlock(_mutex);
         assert(status == 0, "invariant");
         status = pthread_cond_signal(&_cond[_cur_index]);
         assert(status == 0, "invariant");
       }

This may cause performance perturbations that will need to be 
investigated - but in theory, as per the comment above, signalling 
outside the lock should be beneficial in the linux case because there is 
no wait-morphing (where a signal simply takes a waiting thread and 
places it into the mutex queue thereby avoiding the need to awaken the 
thread so it can enqueue itself).

The change itself is of course trivial:

http://cr.openjdk.java.net/~dholmes/8130728/webrev/

I'd appreciate any comments/concerns particularly from the BSD and AIX 
folk who acquired this unnecessary workaround.

If deemed necessary we could add a flag that controls whether the signal 
happens with or without the lock held.

Thanks,
David


More information about the hotspot-runtime-dev mailing list