RFR (S): 8228857: Refactor PlatformMonitor into PlatformMutex and PlatformMonitor
David Holmes
david.holmes at oracle.com
Tue Aug 6 22:10:18 UTC 2019
Hi Kim,
On 7/08/2019 7:20 am, Kim Barrett wrote:
>> On Aug 6, 2019, at 7:06 AM, David Holmes <david.holmes at oracle.com> wrote:
>>
>> Hi Per,
>>
>> On 6/08/2019 7:12 pm, Per Liden wrote:
>>> Hi David,
>>> On 8/6/19 10:06 AM, David Holmes wrote:
>>>> Hi Per,
>>>>
>>>> On 6/08/2019 5:42 pm, Per Liden wrote:
>>>>> Hi David,
>>>>>
>>>>> On 8/2/19 1:03 AM, David Holmes wrote:
>>>>>> Bug: https://bugs.openjdk.java.net/browse/JDK-8228857
>>>>>> webrev: http://cr.openjdk.java.net/~dholmes/8228857/webrev/
>>>>>
>>>>> src/hotspot/os/posix/os_posix.hpp
>>>>> ---------------------------------
>>>>> As far as I recall, the macos bug is only related to the lifecycle of condvars (not mutexes). Is that correct?
>>>>
>>>> Not quite. The issue is the allocation of a pthread_mutex_t at an address that previously held a pthread_cond_t.
>>>>
>>>>> If so, PlatformMutex could just always own a plain pthread_mutex_t, and the PLATFORM_MONITOR_IMPL_INDIRECT stuff could be moved to PlatformMonitor and only deal with the lifecycle of the pthread_cond_t. That way PlatformMutex would not carry the burden/cost of working around the condvar bug, which would make PlatformMutex even more attractive as a replacement/backing for ZLocks.
>>>>> How does that sound?
>>>>
>>>> I deliberately took the simplest path to just leave the macOS workaround as is, knowing it means we still pay for the condvar slot on that platform. :)
>>>>
>>>> Because the bug involves both types I don't think I can constrain the workaround to only the PlatformMonitor class, as I'd need to control the allocation in the PlatformMutex as well. I could do that with defining and instantiating a different Impl type in PlatformMonitor to be installed by PlatformMutex, but the overall complexity increase doesn't seem like a good trade off to me. Perhaps I've missed a simple way to deal with this?
>>> Ok. So we can't have a mutex_t at an address where a cond_t has lived in the past. We guarantee that this cannot happen by never actually deallocating any cond_t, instead keeping our own freelist, essentially protecting those cond_t addresses from ever being reused for any other type. Even if this bug involves both types, just constraining cond_t should be enough to solve the problem, right?
>>> Just to illustrate what I'm thinking, I adjusted your patch:
>>> http://cr.openjdk.java.net/~pliden/8228857/webrev.0
>>> And here's just the diff against your patch:
>>> http://cr.openjdk.java.net/~pliden/8228857/webrev.0-diff
>>> This keeps the code pretty similar, but puts the cost on PlatformMonitor. Can you spot a problem with this?
>>
>> First thanks for taking the time to do this - much appreciated.
>>
>> I had considered doing something like this, but then was concerned as to whether the pthread_mutex_t could by chance end up in a location that had once been a pthread_cond_t and so still hit the bug. But I think I was overthinking this as the same could be true for the allocation of the Impl instances.
>>
>> I'll take your proposed changes for a test run and in particular see if I can re-run the tests that demonstrated the kernel bug.
>
> I was planning to review this, but haven't gotten to it yet. I
> initially had the same thought as Per. However, I changed my mind
> about it, with the following reasoning.
>
> It is true that only a pthread_cond_t that has been allocated and
> later destroyed can cause problems. However, we (the VM) might not be
> the only ones doing that. Native code, library code, etc. might also
> do so. Even the current mechanism doesn't fully protect us from
> that. This was something that Patricio and I recognized when we did
> the MacOSX workaround; it's not completely bullet-proof.
>
> However, with the current mechanism of allocating pthread_mutex/cond_t
> pairs and freelisting them, we highwater pretty quickly. We don't have
> many places that actually release them. (I don't remember whether
> there's anything other than the task terminator mechanism.) And once
> we've highwatered, the VM is safe from any further condvar
> destructions by non-VM code. So the only risk is from native code
> using pthread_cond_t's before that highwater mark is reached.
>
> If the VM only freelists its pthread_cond_t's and not its
> pthread_mutex_t's, instead allocating and deleting the latter normally
> (as Per suggested and I had originally intended to suggest), then the
> VM's pthread_mutex_t's are subject to getting scrambled by ongoing
> non-VM pthread_cond_t usage.
Yes, you are right. I was only thinking about VM usage, but if the VM is 
hosted then mutexes/condvars could be in heavy use elsewhere, and not 
using the freelist for the mutex would greatly increase the risk of 
reintroducing the bug.
As you say, the freelist approach itself is not bullet-proof, and I'm now 
worried about what complex hosting applications may encounter! :( I 
think we may need to try to engage with Apple to get the full details 
on this bug so that we know what our exposure may be. I hope they fixed 
it with an interim patch and not only with a full release update!
So in terms of this RFE I'll leave things as they stand, with the 
mutex/cond freelist being used in PlatformMutex. I hope the slight space 
wastage is not significant enough to mean that ZLocks will not consider 
using PlatformMutex.
Thanks,
David