RFR: 8218975: Bug in macOSX kernel's pthread support

Kim Barrett kim.barrett at oracle.com
Fri Mar 15 23:03:33 UTC 2019


Please review this fix for intermittent Mac-only crashes involving the
OWSTTaskTerminator.

This is joint work with Patricio Chilano Mateo. 

This turned out to be a bug in the macOSX kernel's pthread support,
which may be fixed in OSX 10.14 (Mojave). When using a condvar, an
associated broadcast entry may be added to a kernel table and left
there after the last use of the condvar. If we then destroy and delete
the condvar and later happen to allocate and init a mutex at the same
address (this really does happen), that stale condvar entry in the
kernel table confuses operations on the mutex, causing crashes and
hangs.
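
The sequence of calls described above can be sketched as follows. This
is not one of the reproducers from the CR and is not expected to
trigger the kernel bug (it has no waiting threads); it only illustrates
a pthread_mutex_t being initialized at the address of a destroyed
pthread_cond_t. The function name is hypothetical, and the shared
storage is used here just to force the address reuse that a general
allocator may produce by chance.

```cpp
#include <pthread.h>
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdlib>

// Hypothetical illustration: place a mutex where a condvar used to live.
// Returns true if every pthread call succeeded.
bool reuse_condvar_storage_for_mutex() {
  // One block large enough for either object; malloc alignment suits both.
  const std::size_t size =
      std::max(sizeof(pthread_cond_t), sizeof(pthread_mutex_t));
  void* storage = std::malloc(size);
  if (storage == nullptr) return false;

  // 1. Init, use, and destroy a condvar in the block.
  pthread_cond_t* cond = static_cast<pthread_cond_t*>(storage);
  bool ok = pthread_cond_init(cond, nullptr) == 0;
  ok = ok && pthread_cond_broadcast(cond) == 0;
  ok = ok && pthread_cond_destroy(cond) == 0;

  // 2. Init a mutex at the same address and operate on it.
  pthread_mutex_t* mutex = static_cast<pthread_mutex_t*>(storage);
  assert(static_cast<void*>(mutex) == static_cast<void*>(cond));
  ok = ok && pthread_mutex_init(mutex, nullptr) == 0;
  ok = ok && pthread_mutex_lock(mutex) == 0;
  ok = ok && pthread_mutex_unlock(mutex) == 0;
  ok = ok && pthread_mutex_destroy(mutex) == 0;

  std::free(storage);
  return ok;
}
```

On a system without the bug this runs cleanly; per the description
above, the failures arise when the kernel still holds a stale entry for
the condvar at the time the mutex is used.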

We have been able to reproduce those failure modes with standalone
programs that involve no JDK code (see the CR for reproducers), and have
reported the problem to Apple.  This reproducer didn't fail when
tested on Mojave, but did fail on all previous OS versions we tried.

The OWSTTaskTerminator is subject to this problem because a new
HotSpot Monitor object is allocated for each terminator. Terminators
are allocated for various parallel tasks during garbage collection, so
quite a few Monitor objects (and their contained PlatformMonitors) are
created and deleted. That gives many opportunities to get the kernel
into the bad state and then later reuse a previous condvar address for
a mutex.  That was enough for the stress test
(gc/stress/TestStressIHOPMultiThread.java) to hit the problem
occasionally.

This problem would not have shown up before JDK-8210832, which was
made shortly before the first sighting of the stress test failure.
Before that change we re-used park events in the implementation of
HotSpot Mutex/Monitor.  Because of that re-use, there wasn't an
opportunity to allocate a pthread_mutex_t at the same address as a
former pthread_cond_t.

We work around this problem by allocating the mutex/condvar pairs
needed by the macOSX PlatformMonitor from a freelist, so that a
pthread_mutex_t is not initialized at an address that previously held
a pthread_cond_t.
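
A minimal sketch of the freelist idea, assuming the pairs are
initialized once and then recycled without being destroyed or freed.
This is not the code in the webrev; MutexCondPair, PairFreeList,
acquire and release are hypothetical names chosen for illustration.

```cpp
#include <pthread.h>
#include <cassert>
#include <mutex>

// Hypothetical pair type: the mutex and condvar always stay at the same
// offsets within the block, for as long as the process runs.
struct MutexCondPair {
  pthread_mutex_t mutex;
  pthread_cond_t  cond;
  MutexCondPair*  next;  // freelist link, meaningful only while free
};

// Hypothetical freelist: pairs are recycled, never destroyed or freed.
class PairFreeList {
  std::mutex     _lock;            // protects _head
  MutexCondPair* _head = nullptr;

 public:
  // Reuse a free pair if one exists, otherwise create and init a new one.
  MutexCondPair* acquire() {
    {
      std::lock_guard<std::mutex> guard(_lock);
      if (_head != nullptr) {
        MutexCondPair* pair = _head;
        _head = pair->next;
        pair->next = nullptr;
        return pair;
      }
    }
    MutexCondPair* pair = new MutexCondPair();
    int status = pthread_mutex_init(&pair->mutex, nullptr);
    assert(status == 0);
    status = pthread_cond_init(&pair->cond, nullptr);
    assert(status == 0);
    (void)status;  // silence unused warning when asserts are disabled
    pair->next = nullptr;
    return pair;
  }

  // Put the pair back on the list; no pthread_*_destroy, no delete.
  void release(MutexCondPair* pair) {
    std::lock_guard<std::mutex> guard(_lock);
    pair->next = _head;
    _head = pair;
  }
};
```

Because a released pair keeps its layout, memory that once held a
condvar is only ever handed out again as that same condvar. The cost is
that the pool never shrinks below its high-water mark.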

An alternative workaround that was explored is to (for macOSX only)
add a short timedwait on the condvar when deleting a PlatformMonitor.
The idea is that (1) if there is a lingering kernel table entry for the
condvar, the timedwait will eat it, and (2) if there isn't such an
entry the timedwait will (almost) immediately expire.  This would use
pthread_cond_timedwait_relative_np to avoid unneeded clock accesses.
This approach isn't being taken because it might be sensitive to
implementation details that could vary between OS versions.
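
For reference, the rejected alternative might look roughly like the
sketch below. The helper name drain_condvar and the 1 ms timeout are
assumptions, not values from the explored patch. On non-Apple platforms
the sketch falls back to the portable pthread_cond_timedwait with an
absolute deadline so that it still compiles; that fallback needs the
clock access the relative variant would avoid.

```cpp
#include <pthread.h>
#include <cassert>
#include <cerrno>
#include <ctime>

// Hypothetical helper: wait briefly on `cond` before it is destroyed.
// Returns the status of the timed wait (normally ETIMEDOUT, or 0 if the
// wait was woken).
int drain_condvar(pthread_mutex_t* mutex, pthread_cond_t* cond) {
  const long wait_ns = 1000L * 1000L;  // 1 ms; an assumed value

  int status = pthread_mutex_lock(mutex);
  assert(status == 0);

#ifdef __APPLE__
  // Relative timeout: no need to read the clock first.
  struct timespec rel;
  rel.tv_sec = 0;
  rel.tv_nsec = wait_ns;
  int wait_status = pthread_cond_timedwait_relative_np(cond, mutex, &rel);
#else
  // Portable fallback: compute an absolute deadline.
  struct timespec deadline;
  clock_gettime(CLOCK_REALTIME, &deadline);
  deadline.tv_nsec += wait_ns;
  if (deadline.tv_nsec >= 1000000000L) {
    deadline.tv_sec += 1;
    deadline.tv_nsec -= 1000000000L;
  }
  int wait_status = pthread_cond_timedwait(cond, mutex, &deadline);
#endif

  status = pthread_mutex_unlock(mutex);
  assert(status == 0);
  (void)status;  // silence unused warning when asserts are disabled
  return wait_status;
}
```

A caller would invoke this just before pthread_cond_destroy when
tearing down the monitor. Whether the wait actually clears a lingering
kernel entry depends on kernel behavior, which is the reason given
above for not taking this approach.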

CR:
https://bugs.openjdk.java.net/browse/JDK-8218975

Webrev:
http://cr.openjdk.java.net/~kbarrett/8218975/open.01/

Testing:
mach5 tier1-5
mach5 runs of the specific test, 2000 times with no failures.  Without
  this change, the typical failure rate seems to be on the order of
  0.5-1.0%.
Performance testing on Mac found no regressions.


