RFR 8240902: JDI shared memory connector can use already closed Handles
David Holmes
david.holmes at oracle.com
Thu Mar 19 07:50:55 UTC 2020
Hi Patricio,
Incremental changes look good.
Thanks,
David
On 19/03/2020 4:18 pm, Patricio Chilano wrote:
> Hi David,
>
> On 3/18/20 8:10 PM, David Holmes wrote:
>> Hi Patricio,
>>
>> On 19/03/2020 6:44 am, Patricio Chilano wrote:
>>> Hi David,
>>>
>>> On 3/18/20 4:27 AM, David Holmes wrote:
>>>> Hi Patricio,
>>>>
>>>> On 18/03/2020 6:14 am, Patricio Chilano wrote:
>>>>> Hi all,
>>>>>
>>>>> Please review the following patch:
>>>>>
>>>>> Bug: https://bugs.openjdk.java.net/browse/JDK-8240902
>>>>> Webrev: http://cr.openjdk.java.net/~pchilanomate/8240902/v1/webrev/
>>>>>
>>>>> Calling closeConnection() on an already created/opened connection
>>>>> includes calls to CloseHandle() on objects that can still be used
>>>>> by other threads. This can lead to either undefined behavior or, as
>>>>> detailed in the bug comments, changes of state of unrelated objects.
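To illustrate why a stale handle can end up changing the state of an unrelated object (Windows recycles handle values), here is a small stand-alone demo; it is not taken from the connector code and only shows the reuse hazard in general:

#include <stdio.h>
#include <windows.h>

int main(void) {
    HANDLE e1 = CreateEvent(NULL, TRUE, FALSE, NULL);
    CloseHandle(e1);                  /* the numeric handle value is now stale */

    /* Another object created afterwards will often be handed the same slot. */
    HANDLE e2 = CreateEvent(NULL, TRUE, FALSE, NULL);
    printf("e1=%p e2=%p reused=%d\n", (void*)e1, (void*)e2, e1 == e2);

    if (e1 == e2) {
        /* A thread still holding the "dead" handle silently signals the
         * unrelated new object -- the "changes of state of unrelated
         * objects" mentioned above. */
        SetEvent(e1);
        printf("new event signaled via stale handle: %d\n",
               WaitForSingleObject(e2, 0) == WAIT_OBJECT_0);
    }
    CloseHandle(e2);
    return 0;
}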
>>>>
>>>> This was a really great find!
>>> Thanks! : )
>>>
>>>>> This issue was found while debugging the reason behind some jshell
>>>>> test failures seen after pushing 8230594. Not as important, but
>>>>> there are also calls to closeStream() from
>>>>> createStream()/openStream() when failing to create/open a stream
>>>>> that will return after executing "CHECK_ERROR(enterMutex(stream,
>>>>> NULL));" without closing the intended resources. Then, calling
>>>>> closeConnection() could assert if the reason for the previous
>>>>> failure was that the stream's mutex failed to be created/opened.
>>>>> This patch aims to address these issues too.
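A plausible shape of the closeStream() problem described here; every type, helper, and value below is a stand-in inferred from this thread, not the real shmemBase.c code:

#include <stdio.h>

typedef int jint;      /* stand-in; the real typedef comes from jni.h   */
#define SYS_OK   0     /* assumed values; the transport defines its own */
#define SYS_ERR (-1)

#define CHECK_ERROR(expr)           \
    do {                            \
        jint error_ = (expr);       \
        if (error_ != SYS_OK) {     \
            return error_;          \
        }                           \
    } while (0)

typedef struct {
    int hasMutex;      /* 0 if creating/opening the stream mutex failed */
    int hasEvents;     /* resources that still need to be released      */
} Stream;

static jint enterMutex(Stream *stream, void *reserved) {
    (void)reserved;
    return stream->hasMutex ? SYS_OK : SYS_ERR;
}

static jint closeStream(Stream *stream) {
    CHECK_ERROR(enterMutex(stream, NULL));   /* early return: events never closed */
    stream->hasEvents = 0;                   /* ... close events, mutex, etc. ... */
    return SYS_OK;
}

int main(void) {
    Stream s = { /* hasMutex */ 0, /* hasEvents */ 1 };
    jint rc = closeStream(&s);
    printf("closeStream returned %d, events still open: %d\n", rc, s.hasEvents);
    return 0;
}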
>>>>
>>>> Patch looks good in general. The internal reference count guards
>>>> deletion of the internal resources, and is itself safe because we never
>>>> actually delete the connection. Thanks for adding the comment about
>>>> this aspect.
>>>>
>>>> A few items:
>>>>
>>>> Please update copyright year before pushing.
>>> Done.
>>>
>>>> Please align ENTER_CONNECTION/LEAVE_CONNECTION macros the same way
>>>> as STREAM_INVARIANT.
>>> Done.
>>>
>>>> 170 unsigned int refcount;
>>>> 171 jint state;
>>>>
>>>> I'm unclear about the use of stream->state and connection->state as
>>>> guards - unless accessed under a mutex these would seem to at least
>>>> need acquire/release semantics.
>>>>
>>>> Additionally the reads of refcount would also seem to need some
>>>> form of memory synchronization - though the Windows docs for the
>>>> Interlocked* API does not show how to simply read such a variable!
>>>> Though I note that the RtlFirstEntrySList method for the
>>>> "Interlocked Singly Linked Lists" API does state "Access to the list
>>>> is synchronized on a multiprocessor system." which suggests a read
>>>> of such a variable does require some form of memory synchronization!
>>> In the case of the stream struct, the state field is protected by the
>>> mutex field. It is set to STATE_CLOSED while holding the mutex, and
>>> threads that read it must acquire the mutex first through
>>> sysIPMutexEnter(). For the cases where sysIPMutexEnter() didn't
>>> acquire the mutex, we will return something different from SYS_OK and
>>> the call will exit anyway. All this behaves as before, I didn't
>>> change it.
>>
>> Thanks for clarifying.
>>
>>> The refcount and state that I added to the SharedMemoryConnection
>>> struct work together. For a thread closing the connection, setting
>>> the connection state to STATE_CLOSED has to happen before reading the
>>> refcount (more on the atomicity of that read later). That's why I
>>> added the MemoryBarrier() call, which I see would be better placed
>>> right after setting the connection state to closed. For the
>>> threads accessing the connection, incrementing the refcount has to
>>> happen before reading the connection state. That's already provided
>>> by the InterlockedIncrement(), which uses a full memory barrier. In
>>> this way, if the thread closing the connection reads a refcount of 0,
>>> then we know it's safe to release the resources, since other threads
>>> accessing the connection will see that the state is closed after
>>> incrementing the refcount. If the read of refcount is not 0, it
>>> could be that a thread is accessing the connection, or it could just
>>> have read a connection state of STATE_CLOSED after incrementing the
>>> refcount; we don't know, so we can't release anything. Similarly, if
>>> the thread accessing the connection reads that the state is not
>>> closed, then we know it's safe to access the stream, since anybody
>>> closing the connection will still have to read refcount, which will be
>>> at least 1.
>>> As for the atomicity of the read of refcount,
>>> https://docs.microsoft.com/en-us/windows/win32/sync/interlocked-variable-access
>>> states that "simple reads and writes to properly-aligned 32-bit
>>> variables are atomic operations". Maybe I should declare refcount
>>> explicitly as DWORD32?
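For reference, here is a minimal sketch of the protocol described above. The struct layout, names, and constants are assumptions taken from this description, not the actual shmemBase.c declarations:

#include <windows.h>

#define STATE_OPEN   0
#define STATE_CLOSED 1

typedef struct {
    volatile LONG refcount;   /* threads currently using the connection */
    volatile LONG state;      /* STATE_OPEN or STATE_CLOSED */
    /* ... events, mutex, shared memory ... */
} Connection;

/* Accessor: publish the increment first, then check the state.  The
 * InterlockedIncrement() is a full barrier, so a closer that later reads
 * refcount is guaranteed to see this thread. */
static int enterConnection(Connection *c) {
    InterlockedIncrement(&c->refcount);
    if (c->state == STATE_CLOSED) {
        InterlockedDecrement(&c->refcount);
        return -1;                       /* closed: do not touch the handles */
    }
    return 0;                            /* caller runs leaveConnection() when done */
}

static void leaveConnection(Connection *c) {
    InterlockedDecrement(&c->refcount);
}

/* Closer: mark the connection closed, make that store visible, then look at
 * refcount.  Reading 0 means every later accessor must observe STATE_CLOSED,
 * so releasing the handles is safe. */
static int maySafelyRelease(Connection *c) {
    c->state = STATE_CLOSED;
    MemoryBarrier();                     /* order the store before the read below */
    return c->refcount == 0;
}

The key point is the ordering: the accessor's increment is a full barrier, and the closer's store to state is made visible before it reads refcount, so the two sides cannot both conclude they won the race.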
>>
>> It isn't the atomicity in question with the naked read but the
>> visibility. Any latency in the visibility of the store done by the
>> Interlocked*() function should be handled by the retry loop, but what
>> is to stop the C++ compiler from hoisting the read of refcount out of
>> the loop? It isn't even volatile (which has a stronger meaning in VS
>> than regular C++).
> I see what you mean now, I was thinking about atomicity and order of
> operations but didn't consider the visibility of that read. Yes, if the
> compiler decides to be smart and hoist the read out of the loop, we
> might never notice that it is safe to release those resources and we
> would leak them for no reason. I see from the Windows docs
> (https://docs.microsoft.com/en-us/cpp/c-language/type-qualifiers)
> that declaring it volatile as you pointed out should be enough to
> prevent that.
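A minimal sketch of the kind of drain loop being discussed, assuming the bounded-attempts loop and the SYS_OK/SYS_ERR codes mentioned elsewhere in the thread (not the actual shmemBase.c code):

#include <windows.h>

typedef long jint;     /* stand-in for the JNI typedef from jni.h        */
#define SYS_OK   0     /* assumed values; the transport defines its own  */
#define SYS_ERR (-1)

/* The volatile qualifier is what keeps the compiler from hoisting the read
 * of refcount out of the loop; the Interlocked* writes done by other
 * threads then become visible on some later iteration.  If the count never
 * drains, the caller gives up and leaks the handles rather than closing
 * them while they may still be in use. */
static jint waitForUsersToLeave(volatile DWORD32 *refcount, int attempts) {
    while (attempts > 0) {
        if (*refcount == 0) {
            return SYS_OK;    /* safe to CloseHandle() the events and mutex */
        }
        Sleep(20);            /* give in-flight users a chance to leave */
        attempts--;
    }
    return SYS_ERR;           /* never drained: leak instead of freeing unsafely */
}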
>
>>> Instead of having a refcount we could have done something similar to
>>> the stream struct and protected access to the connection through a
>>> mutex. To avoid serializing all threads we could have used SRW locks,
>>> and only the one closing the connection would do
>>> AcquireSRWLockExclusive(). It would change the state of the
>>> connection to STATE_CLOSED, close all handles, and then release the
>>> mutex. ENTER_CONNECTION() and LEAVE_CONNECTION() would acquire and
>>> release the mutex in shared mode. But other than maybe being easier
>>> to read, I don't think the change would be smaller.
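A sketch of that SRW-lock alternative, under the same naming assumptions (this is not the approach the patch takes):

#include <windows.h>

#define STATE_OPEN   0
#define STATE_CLOSED 1

typedef struct {
    SRWLOCK lock;          /* initialize with SRWLOCK_INIT or InitializeSRWLock() */
    int     state;         /* STATE_OPEN or STATE_CLOSED */
    HANDLE  mutexHandle;   /* ... plus the other handles owned by the connection */
} Connection;

/* Readers take the lock in shared mode, so they can proceed concurrently. */
static int enterConnection(Connection *c) {
    AcquireSRWLockShared(&c->lock);
    if (c->state == STATE_CLOSED) {
        ReleaseSRWLockShared(&c->lock);
        return -1;                      /* connection already closed */
    }
    return 0;                           /* caller must ReleaseSRWLockShared() when done */
}

/* The closer takes the lock exclusively, which waits until every shared
 * holder has left, so closing the handles cannot race with their use. */
static void closeConnection(Connection *c) {
    AcquireSRWLockExclusive(&c->lock);
    c->state = STATE_CLOSED;
    CloseHandle(c->mutexHandle);
    ReleaseSRWLockExclusive(&c->lock);
}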
>>>
>>>> 413 while (attempts>0) {
>>>>
>>>> spaces around >
>>> Done.
>>>
>>>> If the loop at 413 never encounters a zero reference_count then it
>>>> doesn't close the events or the mutex but still returns SYS_OK. That
>>>> seems wrong but I'm not sure what the right behaviour is here.
>>> I can change the return value to be SYS_ERR, but I don't think there
>>> is much we can do about it unless we want to wait forever until we
>>> can release those resources.
>>
>> SYS_ERR would look better, but I see now that the return value is
>> completely ignored anyway. So we're just going to leak resources if
>> the loop "times out". I guess this is the best we can do.
> Here is v2 with the corrections:
>
> Full: http://cr.openjdk.java.net/~pchilanomate/8240902/v2/webrev/
> Inc: http://cr.openjdk.java.net/~pchilanomate/8240902/v2/inc/webrev/
> (not sure
> why the indent fixes are not highlighted as changes but the Frames view
> does show they changed)
>
> I'll give it a run on mach5 adding tier5 as Serguei suggested.
>
>
> Thanks,
> Patricio
>> Thanks,
>> David
>>
>>>
>>>> And please wait for serviceability folk to review this.
>>> Sounds good.
>>>
>>>
>>> Thanks for looking at this David! I will move the MemoryBarrier() and
>>> change the refcount to be DWORD32 if you are okay with that.
>>>
>>>
>>> Thanks,
>>> Patricio
>>>> Thanks,
>>>> David
>>>> -----
>>>>
>>>>> Tested in mach5 with the current baseline, tiers 1-3, and several
>>>>> runs of open/test/langtools/:tier1, which includes the jshell tests
>>>>> where this connector is used. I also applied patch
>>>>> http://cr.openjdk.java.net/~pchilanomate/8240902/triggerbug/webrev
>>>>> mentioned in the comments of the bug, on top of the baseline, and
>>>>> ran the langtools tests with and without this fix. Without the fix,
>>>>> running around 30 repetitions already shows failures in tests
>>>>> jdk/jshell/FailOverExecutionControlTest.java and
>>>>> jdk/jshell/FailOverExecutionControlHangingLaunchTest.java. With the
>>>>> fix I ran several hundred runs and saw no failures. Let me know if
>>>>> there is any additional testing I should do.
>>>>>
>>>>> As a side note, I see there are a couple of open issues related
>>>>> to jshell failures (8209848) which could be related to this bug
>>>>> and therefore might be fixed by this patch.
>>>>>
>>>>> Thanks,
>>>>> Patricio
>>>>>
>>>
>