RFR (S): 8228857: Refactor PlatformMonitor into PlatformMutex and PlatformMonitor

Kim Barrett kim.barrett at oracle.com
Tue Aug 6 21:20:56 UTC 2019


> On Aug 6, 2019, at 7:06 AM, David Holmes <david.holmes at oracle.com> wrote:
> 
> Hi Per,
> 
> On 6/08/2019 7:12 pm, Per Liden wrote:
>> Hi David,
>> On 8/6/19 10:06 AM, David Holmes wrote:
>>> Hi Per,
>>> 
>>> On 6/08/2019 5:42 pm, Per Liden wrote:
>>>> Hi David,
>>>> 
>>>> On 8/2/19 1:03 AM, David Holmes wrote:
>>>>> Bug: https://bugs.openjdk.java.net/browse/JDK-8228857
>>>>> webrev: http://cr.openjdk.java.net/~dholmes/8228857/webrev/
>>>> 
>>>> src/hotspot/os/posix/os_posix.hpp
>>>> ---------------------------------
>>>> As far as I recall, the macos bug is only related to the lifecycle of condvars (not mutexes). Is that correct?
>>> 
>>> Not quite. The issue is the allocation of a pthread_mutex_t at an address that previously held a pthread_cond_t.
>>> 
>>>> If so, PlatformMutex could just always own a plain pthread_mutex_t, and the PLATFORM_MONITOR_IMPL_INDIRECT stuff could be moved to PlatformMonitor and only deal with the lifecycle of the pthread_cond_t. That way PlatformMutex would not carry the burden/cost of working around the condvar bug, which would make PlatformMutex even more attractive as a replacement/backing for ZLocks.
>>>> How does that sound?
>>> 
>>> I deliberately took the simplest path to just leave the macOS workaround as is, knowing it means we still pay for the condvar slot on that platform. :)
>>> 
>>> Because the bug involves both types I don't think I can constrain the workaround to only the PlatformMonitor class, as I'd need to control the allocation in the PlatformMutex as well. I could do that with defining and instantiating a different Impl type in PlatformMonitor to be installed by PlatformMutex, but the overall complexity increase doesn't seem like a good trade off to me. Perhaps I've missed a simple way to deal with this?
>> Ok. So we can't have a mutex_t at an address where a cond_t has lived in the past. We guarantee that this cannot happen by never actually deallocating any cond_t and instead having our own freelist, essentially protecting those cond_t addresses from ever being reused for any other type. Even if this bug involves both types, just constraining cond_t should be enough to solve the problem, right?
>> Just to illustrate what I'm thinking, I adjusted your patch:
>> http://cr.openjdk.java.net/~pliden/8228857/webrev.0
>> And here's just the diff against your patch:
>> http://cr.openjdk.java.net/~pliden/8228857/webrev.0-diff
>> This keeps the code pretty similar, but puts the cost on PlatformMonitor. Can you spot a problem with this?
> 
> First thanks for taking the time to do this - much appreciated.
> 
> I had considered doing something like this, but then was concerned as to whether the pthread_mutex_t could by chance end up in a location that had once been a pthread_cond_t and so still hit the bug. But I think I was overthinking this as the same could be true for the allocation of the Impl instances.
> 
> I'll take your proposed changes for a test run and in particular see if I can re-run the tests that demonstrated the kernel bug.

I was planning to review this, but haven't gotten to it yet. I
initially had the same thought as Per. However, I changed my mind
about it, with the following reasoning.

It is true that only a pthread_cond_t that has been allocated and
later destroyed can cause problems. However, we (the VM) might not be
the only code doing that; native code, library code, etc. might also
destroy condvars. Even the current mechanism doesn't fully protect us
from that. This was something that Patricio and I recognized when we
did the MacOSX workaround; it's not completely bulletproof.

However, with the current mechanism of allocating pthread_mutex/cond_t
pairs and freelisting them, we reach a high-water mark pretty quickly.
We don't have many places that actually release them. (I don't
remember whether there's anything other than the task terminator
mechanism.) And once that high-water mark is reached, the VM is safe
from any further condvar destructions by non-VM code. So the only risk
is from native code using pthread_cond_t's before that high-water mark
is reached.

If the VM only freelists its pthread_cond_t's and not its
pthread_mutex_t's, instead allocating and deleting the latter normally
(as Per suggested and I had originally intended to suggest), then the
VM's pthread_mutex_t's are subject to getting scrambled by ongoing
non-VM pthread_cond_t usage.
