RFR: 8218975: Bug in macOSX kernel's pthread support
Patricio Chilano
patricio.chilano.mateo at oracle.com
Sat Mar 16 02:43:12 UTC 2019
Hi Kim,
Change looks good to me!
Thanks,
Patricio
On 3/15/19 7:03 PM, Kim Barrett wrote:
> Please review this fix for intermittent Mac-only crashes involving the
> OWSTTaskTerminator.
>
> This is joint work with Patricio Chilano Mateo.
>
> This turned out to be a bug in the macOSX kernel's pthread support
> (which may be fixed in OSX 10.14 (Mojave)). When using a condvar, an
> associated broadcast entry may be added to a kernel table and left
> there after the last use of the condvar. If we then destroy and delete
> the condvar and later happen to allocate and init a mutex at the same
> address (this really does happen), that stale condvar entry in the
> kernel table confuses operations on the mutex, causing crashes and
> hangs.
>
> We have been able to reproduce those failure modes with standalone
> programs that involve no JDK code (see the CR for the reproducers), and
> have reported the problem to Apple. The reproducers didn't fail when
> tested on Mojave, but did fail on all previous OS versions we tried.
>
> The OWSTTaskTerminator is subject to this problem because a new
> HotSpot Monitor object is allocated for each terminator. Terminators
> are allocated for various parallel tasks during garbage collection, so
> quite a few Monitor objects (and their contained PlatformMonitors) are
> created and deleted. That gives many opportunities to get the kernel
> into the bad state and then later reuse a previous condvar address for
> a mutex. That was enough for the stress test
> (gc/stress/TestStressIHOPMultiThread.java) to hit the problem
> occasionally.
>
> This problem would not have shown up before JDK-8210832, which was
> made shortly before the first sighting of the stress test failure.
> Before that change we re-used park events in the implementation of
> HotSpot Mutex/Monitor. Because of that re-use, there wasn't an
> opportunity to allocate a pthread_mutex_t at the same address as a
> former pthread_cond_t.
>
> We work around this problem by allocating the mutex/condvar pairs
> needed by the macOSX PlatformMonitor from a freelist.
>
> An alternative workaround that was explored is to add, for macOSX
> only, a short timedwait on the condvar when deleting a PlatformMonitor.
> The idea is that (1) if there is a lingering kernel table entry for the
> condvar, the timedwait will eat it, and (2) if there isn't such an
> entry the timedwait will (almost) immediately expire. This would use
> pthread_cond_timedwait_relative_np to avoid unneeded clock accesses.
> This approach isn't being taken because it might be sensitive to
> implementation details that could vary between OS versions.
>
> CR:
> https://bugs.openjdk.java.net/browse/JDK-8218975
>
> Webrev:
> http://cr.openjdk.java.net/~kbarrett/8218975/open.01/
>
> Testing:
> mach5 tier1-5
> mach5 the specific test, 2000 times with no failures. Without this
> change, typical failure rate seems to be on the order of 0.5-1.0%.
> Performance testing on Mac found no regressions.
>
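For readers following along, the freelist workaround described in the
quoted message can be sketched roughly as below. This is a minimal
illustration, not the code in the webrev: the names MutexCondPair and
PairFreeList are hypothetical, and the sketch assumes the point of the
freelist is that a pair, once initialized, is never destroyed, so the
address of a former pthread_cond_t is never reinitialized as a
pthread_mutex_t.

```cpp
#include <cassert>
#include <pthread.h>

// Hypothetical names for illustration; not the identifiers used in HotSpot.
struct MutexCondPair {
  pthread_mutex_t mutex;
  pthread_cond_t cond;
  MutexCondPair* next;  // freelist link, meaningful only while the pair is free
};

class PairFreeList {
  pthread_mutex_t _lock;  // guards _head
  MutexCondPair* _head;   // singly linked list of released pairs

public:
  PairFreeList() : _head(nullptr) {
    int status = pthread_mutex_init(&_lock, nullptr);
    assert(status == 0);
    (void)status;
  }

  // Reuse a previously released pair if one exists; otherwise create and
  // initialize a new one.
  MutexCondPair* allocate() {
    pthread_mutex_lock(&_lock);
    MutexCondPair* p = _head;
    if (p != nullptr) {
      _head = p->next;
    }
    pthread_mutex_unlock(&_lock);

    if (p == nullptr) {
      p = new MutexCondPair();
      int status = pthread_mutex_init(&p->mutex, nullptr);
      assert(status == 0);
      status = pthread_cond_init(&p->cond, nullptr);
      assert(status == 0);
      (void)status;
    }
    p->next = nullptr;
    return p;
  }

  // Return a pair to the list without calling pthread_*_destroy or delete.
  // The mutex slot stays a mutex and the condvar slot stays a condvar, so
  // the memory is never reinterpreted as the other primitive.
  void release(MutexCondPair* p) {
    pthread_mutex_lock(&_lock);
    p->next = _head;
    _head = p;
    pthread_mutex_unlock(&_lock);
  }
};
```

In this sketch a monitor would call allocate() when it is constructed and
release() when it is deleted. The trade-off is that memory for pairs is
never handed back to the allocator, which is assumed to be acceptable
because the number of live monitors at any one time is bounded.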