RFR: 8218975: Bug in macOSX kernel's pthread support

Daniel D. Daugherty daniel.daugherty at oracle.com
Mon Mar 18 16:53:07 UTC 2019


I forgot to ask the meta question: how are we going to decide when
this freelist workaround can be retired so that MacOSX has the same
code as the other platforms?

Dan


On 3/18/19 12:50 PM, Daniel D. Daugherty wrote:
> On 3/15/19 7:03 PM, Kim Barrett wrote:
>> Please review this fix for intermittent Mac-only crashes involving the
>> OWSTTaskTerminator.
>>
>> This is joint work with Patricio Chilano Mateo.
>
> Very impressive bug hunt!
>
>
>> This turned out to be a bug in the macOSX kernel's pthread support
>> (which may be fixed in OSX 10.14 (Mojave)). When using a condvar, an
>> associated broadcast entry may be added to a kernel table and left
>> there after the last use of the condvar. If we then destroy and delete
>> the condvar and later happen to allocate and init a mutex at the same
>> address (this really does happen), that stale condvar entry in the
>> kernel table confuses operations on the mutex, causing crashes and
>> hangs.
>>
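
To make the sequence described above concrete, here is a minimal sketch
of the address-reuse pattern. It is not one of the reproducers attached
to the CR: the names SyncStorage and reuse_condvar_address_as_mutex are
made up for illustration, and a union stands in for the allocator
happening to hand back the same address. On a system without the kernel
bug, every call should succeed and the function should return 0.

```cpp
#include <pthread.h>

// Hypothetical storage: the union guarantees that the mutex is placed at
// the same address the condvar used to occupy.
union SyncStorage {
  pthread_cond_t  cond;
  pthread_mutex_t mutex;
};

// Returns 0 if every pthread call succeeded, else the first error code.
int reuse_condvar_address_as_mutex(SyncStorage* s) {
  int rc;
  // Phase 1: the storage is used as a condvar, then destroyed.
  if ((rc = pthread_cond_init(&s->cond, nullptr)) != 0) return rc;
  if ((rc = pthread_cond_broadcast(&s->cond)) != 0) return rc;  // no waiters
  if ((rc = pthread_cond_destroy(&s->cond)) != 0) return rc;
  // Phase 2: the same address is initialized and used as a mutex.
  if ((rc = pthread_mutex_init(&s->mutex, nullptr)) != 0) return rc;
  if ((rc = pthread_mutex_lock(&s->mutex)) != 0) return rc;
  if ((rc = pthread_mutex_unlock(&s->mutex)) != 0) return rc;
  return pthread_mutex_destroy(&s->mutex);
}
```

This only shows the ordering of operations; hitting the actual failure
presumably also needs real waiters and the right timing.
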
>> We have been able to reproduce those failure modes with standalone
>> programs (no JDK code involved; see the CR for reproducers), and have
>> reported the problem to Apple.  This reproducer didn't fail when
>> tested on Mojave, but did fail on all previous OS versions we tried.
>>
>> The OWSTTaskTerminator is subject to this problem because a new
>> HotSpot Monitor object is allocated for each terminator. Terminators
>> are allocated for various parallel tasks during garbage collection, so
>> there are quite a few Monitor objects (and their contained
>> PlatformMonitors) being created and deleted, giving many opportunities
>> to get the kernel into the bad state and then later reuse a previous
>> condvar address for a mutex.  That was enough for the stress test
>> (gc/stress/TestStressIHOPMultiThread.java) to hit the problem
>> occasionally.
>>
>> This problem would not have shown up before JDK-8210832, which was
>> made shortly before the first sighting of the stress test failure.
>> Before that change we re-used park events in the implementation of
>> HotSpot Mutex/Monitor.  Because of that re-use, there wasn't an
>> opportunity to allocate a pthread_mutex_t at the same address as a
>> former pthread_cond_t.
>>
>> We work around this problem by allocating from a freelist the
>> mutex/condvar pairs needed by the macOSX PlatformMonitor.
>>
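
For readers who have not opened the webrev, the freelist idea can be
sketched roughly as follows. This is an illustration under assumptions,
not the code under review: SyncPair, allocate_pair, and release_pair are
hypothetical names, and the real change lives in os_posix.{hpp,cpp}. The
point is that a pair is never destroyed or freed, so memory that once
held a condvar is never reinitialized as a mutex.

```cpp
#include <pthread.h>

// Hypothetical mutex/condvar pair; not the type used in the webrev.
struct SyncPair {
  pthread_mutex_t mutex;
  pthread_cond_t  cond;
  SyncPair*       next;   // link used only while the pair is on the freelist
};

static pthread_mutex_t freelist_lock = PTHREAD_MUTEX_INITIALIZER;
static SyncPair*       freelist_head = nullptr;

// Pop a recycled pair if one exists; otherwise create and init a new one.
SyncPair* allocate_pair() {
  pthread_mutex_lock(&freelist_lock);
  SyncPair* p = freelist_head;
  if (p != nullptr) freelist_head = p->next;
  pthread_mutex_unlock(&freelist_lock);
  if (p == nullptr) {
    p = new SyncPair();
    pthread_mutex_init(&p->mutex, nullptr);
    pthread_cond_init(&p->cond, nullptr);
  }
  p->next = nullptr;
  return p;
}

// Push the pair back without destroying or freeing it, so the condvar's
// address is never handed out again as a mutex.
void release_pair(SyncPair* p) {
  pthread_mutex_lock(&freelist_lock);
  p->next = freelist_head;
  freelist_head = p;
  pthread_mutex_unlock(&freelist_lock);
}
```

The trade-off in this sketch is that the pool only grows: it settles at
the peak number of pairs in use at one time and is never trimmed.
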
>> An alternative workaround that was explored is to (for macOSX only)
>> add a short timedwait on the condvar when deleting a PlatformMonitor.
>> The idea is that (1) if there is a lingering kernel table entry for the
>> condvar, the timedwait will eat it, and (2) if there isn't such an
>> entry the timedwait will (almost) immediately expire.  This would use
>> pthread_cond_timedwait_relative_np to avoid unneeded clock accesses.
>> This approach isn't being taken because it might be sensitive to
>> implementation details that could vary between OS versions.
>>
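
The rejected alternative might look something like the sketch below.
It is only an illustration: drain_and_destroy_cond is a hypothetical
name, the 1 ms timeout is an arbitrary value, and the non-Apple branch
exists solely so that the sketch compiles elsewhere (the workaround was
proposed for macOSX only).

```cpp
#include <pthread.h>
#include <errno.h>
#include <time.h>

// Wait very briefly on the condvar, then destroy it. Returns the result
// of the timed wait: normally ETIMEDOUT, or 0 if the wait was woken
// (spurious wakeups are permitted).
int drain_and_destroy_cond(pthread_cond_t* cond, pthread_mutex_t* mutex) {
  const long kWaitNanos = 1000000;  // 1 ms; arbitrary illustrative value
  pthread_mutex_lock(mutex);
#ifdef __APPLE__
  // Relative timeout: no need to read the clock first.
  struct timespec rel = { 0, kWaitNanos };
  int rc = pthread_cond_timedwait_relative_np(cond, mutex, &rel);
#else
  // Portable fallback: build an absolute deadline from the current time.
  struct timespec deadline;
  clock_gettime(CLOCK_REALTIME, &deadline);
  deadline.tv_nsec += kWaitNanos;
  if (deadline.tv_nsec >= 1000000000L) {
    deadline.tv_sec += 1;
    deadline.tv_nsec -= 1000000000L;
  }
  int rc = pthread_cond_timedwait(cond, mutex, &deadline);
#endif
  pthread_mutex_unlock(mutex);
  pthread_cond_destroy(cond);
  return rc;
}
```

Whether such a wait really consumes a lingering kernel table entry is
exactly the implementation detail the review text is wary of relying on.
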
>> CR:
>> https://bugs.openjdk.java.net/browse/JDK-8218975
>>
>> Webrev:
>> http://cr.openjdk.java.net/~kbarrett/8218975/open.01/
>
> src/hotspot/os/posix/os_posix.hpp
>     Good job on making the freelist workaround only apply
>     to MacOSX.
>
> src/hotspot/os/posix/os_posix.cpp
>     No comments.
>
> src/hotspot/os/posix/os_posix.inline.hpp
>     No comments.
>
> I have no objection to your use of the '_ptr' suffix.
>
> Thumbs up.
>
> Dan
>
>
>
>>
>> Testing:
>> mach5 tier1-5
>> mach5 the specific test, 2000 times with no failures.  Without this
>>    change, typical failure rate seems to be on the order of 0.5-1.0%.
>> Performance testing on Mac found no regressions.
>>
>
>


