RFR (XS): 8236177: assert(status == 0) failed: error ETIMEDOUT(60), cond_wait

David Holmes david.holmes at oracle.com
Fri Mar 27 22:53:49 UTC 2020


Hi Gerard,

On 28/03/2020 2:23 am, gerard ziemski wrote:
> Thank you David for your feedback!
> 
> On 3/25/20 6:43 PM, David Holmes wrote:
>> Hi Gerard,
>>
>> On 26/03/2020 4:03 am, gerard ziemski wrote:
>>> hi all,
>>>
>>> Please review this "workaround" for now, which can not be called an 
>>> actual fix just yet, designed to figure out why on Mac OS X, we get 
>>> (very rarely) ETIMEDOUT when calling pthread_cond_wait() API. On the 
>>> other hand, it might actually fix it.
>>
>> The ETIMEDOUT should be treated as a "spurious wakeup" and we will 
>> naturally retry the wait if the condition is not yet met. All we have 
>> to do to our code is adjust the assert so that ETIMEDOUT doesn't cause 
>> it to fail.
> 
> My initial approach, as I noted in the bug, was to do exactly that, but 
> when you commented in the bug, that we used to run into these kinds of 
> issues before, and used to have workarounds in place for that, it made 
> me worry that we will add the workaround again, remove it again at some 
> point in the future, and be back to square one.

I'm sorry my comment misled you.

> I was hoping that now we do something to get a better sense of what the 
> underlying issue is, which is why I proposed this change instead.
> 
> Is that OK?

There is no need for that kind of workaround. The effect of the bug is 
at worst a spurious wakeup (and we can't tell if it is spurious or not 
without more investigation - not that it matters.) We just need to fix 
the assert.
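
As a rough sketch of the idea (the helper names below are made up for illustration and are not the actual HotSpot code), the assert accepts ETIMEDOUT as well as 0, and the enclosing loop re-checks the condition, so the unexpected return behaves like any other spurious wakeup:

```cpp
#include <cassert>
#include <cerrno>
#include <pthread.h>

// Hypothetical helper, not HotSpot code: a pthread_cond_wait() status is
// acceptable if it is success or the unexpected ETIMEDOUT seen on macOS.
static bool cond_wait_status_ok(int status) {
  return status == 0 || status == ETIMEDOUT;
}

// Hypothetical wait loop: the predicate is re-checked after every return
// from pthread_cond_wait(), so an ETIMEDOUT return simply causes a retry.
static void wait_until_set(pthread_mutex_t* mutex, pthread_cond_t* cond,
                           const bool* flag) {
  int status = pthread_mutex_lock(mutex);
  assert(status == 0);
  while (!*flag) {
    status = pthread_cond_wait(cond, mutex);
    assert(cond_wait_status_ok(status));  // relaxed: was status == 0
  }
  status = pthread_mutex_unlock(mutex);
  assert(status == 0);
}
```

Any other error code still trips the assert, so real misuse of the condvar or mutex is not hidden.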

> If you really want us to do what you suggest, then would you mind 
> sharing what you remember about the previous workaround, and why it was 
> removed? If anyone else remembers anything about it, then please share 
> here as well.

There have been a number of issues as NPTL evolved. Here's a key one:

// Beware -- Some versions of NPTL embody a flaw where pthread_cond_timedwait() can
// hang indefinitely.  For instance NPTL 0.60 on 2.4.21-4ELsmp is vulnerable.
// For specifics regarding the bug see GLIBC BUGID 261237 :
//    http://www.mail-archive.com/debian-glibc@lists.debian.org/msg10837.html.
// Briefly, pthread_cond_timedwait() calls with an expiry time that's not in the future
// will either hang or corrupt the condvar, resulting in subsequent hangs if the condvar
// is used.

This one was addressed via the WorkAroundNPTLTimedWaitHang flag, for 
which we did this hack:

     status = os::Linux::safe_cond_timedwait(_cond, _mutex, &abst);
     if (status != 0 && WorkAroundNPTLTimedWaitHang) {
       pthread_cond_destroy (_cond);
       pthread_cond_init (_cond, NULL) ;
     }

Note however I was mis-remembering the structure here. The workaround 
was in park() etc., not inside os::Linux::safe_cond_timedwait (which was 
needed to hide the selection of NPTL or LinuxThreads logic).
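
For anyone who wants to see the shape of that hack in isolation, here is a minimal standalone sketch. The function name and the boolean standing in for the flag are hypothetical, and it assumes the caller holds the mutex and that no other thread is waiting on the condvar when it is destroyed:

```cpp
#include <cerrno>
#include <pthread.h>
#include <time.h>

// Hypothetical stand-in for the WorkAroundNPTLTimedWaitHang flag.
static bool work_around_timedwait_hang = true;

// Hypothetical wrapper: performs a timed wait and, if it returns any
// non-zero status (including ETIMEDOUT), re-creates the condvar in case
// the buggy NPTL left it corrupted. The caller must hold 'mutex'.
static int timed_wait_with_reinit(pthread_cond_t* cond,
                                  pthread_mutex_t* mutex,
                                  const timespec* abstime) {
  int status = pthread_cond_timedwait(cond, mutex, abstime);
  if (status != 0 && work_around_timedwait_hang) {
    pthread_cond_destroy(cond);     // discard the possibly corrupted condvar
    pthread_cond_init(cond, NULL);  // start again with default attributes
  }
  return status;
}
```

The status is still returned to the caller, so the normal timeout handling is unchanged; only the condvar itself is refreshed.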

These workarounds eventually get removed when we no longer have any 
concerns about running on a platform that would still have the bug.

Thanks,
David

> 
> cheers
> 


More information about the hotspot-runtime-dev mailing list