RFR: 8130728: Disable WorkAroundNPTLTimedWaitHang by default

Mon Jul 13 02:56:14 UTC 2015

Thanks Dan.

As there are no other opinions/concerns with this I will push it.

David

On 9/07/2015 11:32 PM, Daniel D. Daugherty wrote:
>  > http://cr.openjdk.java.net/~dholmes/8130728/webrev/
>
> src/share/vm/runtime/globals.hpp
>      No comments.
>
> Thumbs up.
>
> Time to see what shakes out with this option turned off.
>
> The BSD/MacOS X blame belongs to me. Our thinking at the time was
> to keep the "bsd" files matching the "linux" version unless we
> had a specific reason for changing the code. Of course, we thought
> we would come back later and clean things up... which just didn't
> happen... I don't know the AIX history.
>
> Dan
>
>
> On 7/9/15 3:27 AM, David Holmes wrote:
>> https://bugs.openjdk.java.net/browse/JDK-8130728
>>
>> The workaround triggered by WorkAroundNPTLTimedWaitHang (an
>> experimental flag set to 1 by default) was to alleviate a hang that
>> could occur on a pthread_cond_timedwait on Linux, if the timeout value
>> was a time in the past - see JDK-6314875. This glibc bug was fixed in
>> 2005 in glibc 2.3.5-3
>>
>> https://lists.debian.org/debian-glibc/2005/08/msg00201.html
>>
>> but the workaround persists and was (unfortunately) copied into the
>> BSD and AIX ports.
>>
>> It is time to remove that workaround but before we can do that we need
>> to be sure that we are not in fact hitting the workaround code. To
>> that end I propose to use this bug to switch the flag's value to 0 to
>> verify correct operation.
>>
>> If that goes well then we can remove the code either later in JDK 9 or
>> early in JDK 10.
>>
>> To be clear the flag impacts both the wait and the signal part of the
>> condvar usage (park and unpark). On the wait side we have:
>>
>>     status = pthread_cond_timedwait(_cond, _mutex, &abst);
>>     if (status != 0 && WorkAroundNPTLTimedWaitHang) {
>>       pthread_cond_destroy(_cond);
>>       pthread_cond_init(_cond, os::Linux::condAttr());
>>     }
>>
>> so we will now skip this potential recovery code.
>>
>> On the signal side we signal while holding the mutex if the workaround
>> is enabled, and we signal after dropping the mutex otherwise. So this
>> code will now be using a new path. Here is the code in
>> PlatformEvent::unpark
>>
>>   assert(AnyWaiters == 0 || AnyWaiters == 1, "invariant");
>>   if (AnyWaiters != 0 && WorkAroundNPTLTimedWaitHang) {
>>     AnyWaiters = 0;
>>     pthread_cond_signal(_cond);
>>   }
>>   status = pthread_mutex_unlock(_mutex);
>>   assert_status(status == 0, status, "mutex_unlock");
>>   if (AnyWaiters != 0) {
>>     // Note that we signal() *after* dropping the lock for "immortal"
>> Events.
>>     // This is safe and avoids a common class of  futile wakeups. In rare
>>     // circumstances this can cause a thread to return prematurely from
>>     // cond_{timed}wait() but the spurious wakeup is benign and the
>> victim
>>     // will simply re-test the condition and re-park itself.
>>     // This provides particular benefit if the underlying platform
>> does not
>>     // provide wait morphing.
>>     status = pthread_cond_signal(_cond);
>>     assert_status(status == 0, status, "cond_signal");
>>   }
>>
>> and similarly in Parker::unpark
>>
>>      if (WorkAroundNPTLTimedWaitHang) {
>>         status = pthread_cond_signal(&_cond[_cur_index]);
>>         assert(status == 0, "invariant");
>>         status = pthread_mutex_unlock(_mutex);
>>         assert(status == 0, "invariant");
>>       } else {
>>         status = pthread_mutex_unlock(_mutex);
>>         assert(status == 0, "invariant");
>>         status = pthread_cond_signal(&_cond[_cur_index]);
>>         assert(status == 0, "invariant");
>>       }
>>
>> This may cause performance perturbations that will need to be
>> investigated - but in theory, as per the comment above, signalling
>> outside the lock should be beneficial in the linux case because there
>> is no wait-morphing (where a signal simply takes a waiting thread and
>> places it into the mutex queue thereby avoiding the need to awaken the
>> thread so it can enqueue itself).
>>
>> The change itself is of course trivial:
>>
>> http://cr.openjdk.java.net/~dholmes/8130728/webrev/
>>
>> I'd appreciate any comments/concerns particularly from the BSD and AIX
>> folk who acquired this unnecessary workaround.
>>
>> If deemed necessary we could add a flag that controls whether the
>> signal happens with or without the lock held.
>>
>> Thanks,
>> David
>