RFR 8240902: JDI shared memory connector can use already closed Handles
Patricio Chilano
patricio.chilano.mateo at oracle.com
Thu Mar 19 14:43:27 UTC 2020
On 3/19/20 11:22 AM, Daniel D. Daugherty wrote:
> > Inc: http://cr.openjdk.java.net/~pchilanomate/8240902/v2/inc/webrev/
> > (not sure why the indent fixes are not highlighted as changes but
> the Frames view does show they changed)
>
> By default, webrev ignores leading and trailing whitespace changes. Use:
>
> -b: Do not ignore changes in the amount of white space.
>
> if you want to see them. I'm okay that they are not there in most of
> the views. If you want to see them, look at the patch.
Thanks! I didn't know that option.
>
> src/jdk.jdi/share/native/libdt_shmem/shmemBase.c
> No comments.
>
> Thumbs up.
Thanks for looking at this Dan!
Patricio
> Dan
>
>
> On 3/19/20 2:18 AM, Patricio Chilano wrote:
>> Hi David,
>>
>> On 3/18/20 8:10 PM, David Holmes wrote:
>>> Hi Patricio,
>>>
>>> On 19/03/2020 6:44 am, Patricio Chilano wrote:
>>>> Hi David,
>>>>
>>>> On 3/18/20 4:27 AM, David Holmes wrote:
>>>>> Hi Patricio,
>>>>>
>>>>> On 18/03/2020 6:14 am, Patricio Chilano wrote:
>>>>>> Hi all,
>>>>>>
>>>>>> Please review the following patch:
>>>>>>
>>>>>> Bug: https://bugs.openjdk.java.net/browse/JDK-8240902
>>>>>> Webrev: http://cr.openjdk.java.net/~pchilanomate/8240902/v1/webrev/
>>>>>>
>>>>>> Calling closeConnection() on an already created/opened connection
>>>>>> includes calls to CloseHandle() on objects that can still be used
>>>>>> by other threads. This can lead to either undefined behavior or,
>>>>>> as detailed in the bug comments, changes of state of unrelated
>>>>>> objects.
>>>>>
>>>>> This was a really great find!
>>>> Thanks! : )
>>>>
>>>>>> This issue was found while debugging the reason behind some
>>>>>> jshell test failures seen after pushing 8230594. Not as
>>>>>> important, but there are also calls to closeStream() from
>>>>>> createStream()/openStream() when failing to create/open a stream
>>>>>> that will return after executing "CHECK_ERROR(enterMutex(stream,
>>>>>> NULL));" without closing the intended resources. Then, calling
>>>>>> closeConnection() could assert if the reason for the previous
>>>>>> failure was that the stream's mutex failed to be created/opened.
>>>>>> This patch aims to address these issues too.
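The early-return hazard described above can be sketched in a few lines. This is a minimal, self-contained illustration with made-up names (the Stream struct and enterMutex stub are hypothetical), not the actual shmemBase.c source:

```c
#include <stdbool.h>

#define SYS_OK   0
#define SYS_ERR -1

/* Early-return-on-error pattern, as used throughout shmemBase.c. */
#define CHECK_ERROR(expr) \
    do { int err_ = (expr); if (err_ != SYS_OK) return err_; } while (0)

typedef struct {
    bool has_mutex;        /* mutex creation may have failed earlier */
    bool resources_freed;
} Stream;

/* Stand-in for enterMutex(stream, NULL): fails if there is no mutex. */
static int enterMutex(Stream *s) {
    return s->has_mutex ? SYS_OK : SYS_ERR;
}

/* Sketch of the hazard: if the stream's mutex was never created,
   the CHECK_ERROR line returns immediately and the cleanup below
   is never reached, so the other resources stay open. */
static int closeStream(Stream *s) {
    CHECK_ERROR(enterMutex(s));
    s->resources_freed = true;   /* skipped on the failure path */
    return SYS_OK;
}
```

Calling this sketch's closeStream() on a stream whose mutex creation failed returns SYS_ERR without freeing anything, which is the leak the paragraph above describes.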
>>>>>
>>>>> Patch looks good in general. The internal reference count guards
>>>>> deletion of the internal resources, and is itself safe because we
>>>>> never actually delete the connection. Thanks for adding the
>>>>> comment about this aspect.
>>>>>
>>>>> A few items:
>>>>>
>>>>> Please update copyright year before pushing.
>>>> Done.
>>>>
>>>>> Please align ENTER_CONNECTION/LEAVE_CONNECTION macros the same way
>>>>> as STREAM_INVARIANT.
>>>> Done.
>>>>
>>>>> 170 unsigned int refcount;
>>>>> 171 jint state;
>>>>>
>>>>> I'm unclear about the use of stream->state and connection->state
>>>>> as guards - unless accessed under a mutex these would seem to at
>>>>> least need acquire/release semantics.
>>>>>
>>>>> Additionally the reads of refcount would also seem to need some
>>>>> form of memory synchronization - though the Windows docs for the
>>>>> Interlocked* API does not show how to simply read such a variable!
>>>>> Though I note that the RtlFirstEntrySList method for the
>>>>> "Interlocked Singly Linked Lists" API does state "Access to the
>>>>> list is synchronized on a multiprocessor system." which suggests a
>>>>> read of such a variable does require some form of memory
>>>>> synchronization!
>>>> In the case of the stream struct, the state field is protected by
>>>> the mutex field. It is set to STATE_CLOSED while holding the mutex,
>>>> and threads that read it must acquire the mutex first through
>>>> sysIPMutexEnter(). For the cases where sysIPMutexEnter() didn't
>>>> acquire the mutex, we will return something different than SYS_OK
>>>> and the call will exit anyways. All this behaves as before, I
>>>> didn't change it.
>>>
>>> Thanks for clarifying.
>>>
>>>> The refcount and state that I added to the SharedMemoryConnection
>>>> struct work together. For a thread closing the connection, setting
>>>> the connection state to STATE_CLOSED has to happen before reading
>>>> the refcount (more on the atomicity of that read later). That's why
>>>> I added the MemoryBarrier() call, which I now see is better placed
>>>> right after setting the connection state to closed. For the
>>>> threads accessing the connection, incrementing the refcount has to
>>>> happen before reading the connection state. That's already provided
>>>> by the InterlockedIncrement() which uses a full memory barrier. In
>>>> this way if the thread closing the connection reads a refcount of
>>>> 0, then we know it's safe to release the resources, since other
>>>> threads accessing the connection will see that the state is closed
>>>> after incrementing the refcount. If the read of refcount is not 0,
>>>> then a thread may or may not be accessing the connection
>>>> (it could have read a connection state of STATE_CLOSED after
>>>> incrementing the refcount), we don't know, so we can't release
>>>> anything. Similarly if the thread accessing the connection reads
>>>> that the state is not closed, then we know it's safe to access the
>>>> stream since anybody closing the connection will still have to read
>>>> refcount which will be at least 1.
>>>> As for the atomicity of the read of refcount, from
>>>> https://docs.microsoft.com/en-us/windows/win32/sync/interlocked-variable-access,
>>>> it states that "simple reads and writes to properly-aligned 32-bit
>>>> variables are atomic operations". Maybe I should declare refcount
>>>> explicitly as DWORD32?
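The refcount/state protocol described above can be sketched portably. Here C11 atomics stand in for the Windows InterlockedIncrement()/MemoryBarrier() primitives, and all names (Connection, enter_connection, and so on) are illustrative, not the actual shmemBase.c identifiers:

```c
#include <stdatomic.h>
#include <stdbool.h>

#define STATE_OPEN   1
#define STATE_CLOSED 2

typedef struct {
    atomic_uint refcount;
    atomic_int  state;
} Connection;

/* Accessor side: ENTER_CONNECTION analogue. */
static bool enter_connection(Connection *c) {
    /* Increment first; the seq_cst RMW plays the role of
       InterlockedIncrement's full barrier, so the state load below
       cannot be reordered before the increment. */
    atomic_fetch_add(&c->refcount, 1);
    if (atomic_load(&c->state) == STATE_CLOSED) {
        atomic_fetch_sub(&c->refcount, 1);  /* back out; connection closed */
        return false;
    }
    return true;  /* safe to use the handles until leave_connection() */
}

/* LEAVE_CONNECTION analogue. */
static void leave_connection(Connection *c) {
    atomic_fetch_sub(&c->refcount, 1);
}

/* Closer side: mark closed, then wait for the refcount to drain. */
static bool close_connection(Connection *c) {
    /* seq_cst store plays the role of the MemoryBarrier() in the
       patch: it is ordered before the refcount reads below. */
    atomic_store(&c->state, STATE_CLOSED);
    for (int attempts = 10; attempts > 0; attempts--) {
        if (atomic_load(&c->refcount) == 0) {
            /* No accessor can pass the state check anymore:
               the handles could be released here. */
            return true;
        }
        /* The real code would sleep briefly between retries. */
    }
    return false;  /* give up and leak rather than close in-use handles */
}
```

If close_connection() ever sees refcount == 0 after publishing STATE_CLOSED, any later enter_connection() must observe the closed state, which is the invariant argued for above.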
>>>
>>> It isn't the atomicity in question with the naked read but the
>>> visibility. Any latency in the visibility of the store done by the
>>> InterLocked*() function should be handled by the retry loop, but
>>> what is to stop the C++ compiler from hoisting the read of refcount
>>> out of the loop? It isn't even volatile (which has a stronger
>>> meaning in VS than regular C++).
>> I see what you mean now, I was thinking about atomicity and order of
>> operations but didn't consider the visibility of that read. Yes, if
>> the compiler decides to be smart and hoist the read out of the loop
>> we might never notice that it is safe to release those resources and
>> we would leak them for no reason. I see from the Windows docs
>> (https://docs.microsoft.com/en-us/cpp/c-language/type-qualifiers)
>> that declaring it volatile as you pointed out should be enough to
>> prevent that.
>>
>>>> Instead of having a refcount we could have done something similar
>>>> to the stream struct and protect access to the connection through a
>>>> mutex. To avoid serializing all threads we could have used SRW
>>>> locks and only the one closing the connection would do
>>>> AcquireSRWLockExclusive(). It would change the state of the
>>>> connection to STATE_CLOSED, close all handles, and then release the
>>>> mutex. ENTER_CONNECTION() and LEAVE_CONNECTION() would acquire and
>> release the mutex in shared mode. But other than maybe being easier
>> to read, I don't think the change would be smaller.
>>>>
>>>>> 413 while (attempts>0) {
>>>>>
>>>>> spaces around >
>>>> Done.
>>>>
>>>>> If the loop at 413 never encounters a zero reference_count then it
>>>>> doesn't close the events or the mutex but still returns SYS_OK.
>>>>> That seems wrong but I'm not sure what the right behaviour is here.
>>>> I can change the return value to be SYS_ERR, but I don't think
>>>> there is much we can do about it unless we want to wait forever
>>>> until we can release those resources.
>>>
>>> SYS_ERR would look better, but I see now that the return value is
>>> completely ignored anyway. So we're just going to leak resources if
>>> the loop "times out". I guess this is the best we can do.
>> Here is v2 with the corrections:
>>
>> Full: http://cr.openjdk.java.net/~pchilanomate/8240902/v2/webrev/
>> Inc: http://cr.openjdk.java.net/~pchilanomate/8240902/v2/inc/webrev/
>> (not sure why the indent fixes are not highlighted as changes but
>> the Frames view does show they changed)
>>
>> I'll give it a run on mach5 adding tier5 as Serguei suggested.
>>
>>
>> Thanks,
>> Patricio
>>> Thanks,
>>> David
>>>
>>>>
>>>>> And please wait for serviceability folk to review this.
>>>> Sounds good.
>>>>
>>>>
>>>> Thanks for looking at this David! I will move the MemoryBarrier()
>>>> and change the refcount to be DWORD32 if you are okay with that.
>>>>
>>>>
>>>> Thanks,
>>>> Patricio
>>>>> Thanks,
>>>>> David
>>>>> -----
>>>>>
>>>>>> Tested in mach5 with the current baseline, tiers 1-3 and several
>>>>>> runs of open/test/langtools/:tier1 which includes the jshell
>>>>>> tests where this connector is used. I also applied patch
>>>>>> http://cr.openjdk.java.net/~pchilanomate/8240902/triggerbug/webrev
>>>>>> mentioned in the comments of the bug, on top of the baseline and
>>>>>> ran the langtools tests with and without this fix. Without the
>>>>>> fix, around 30 repetitions already show failures in tests
>>>>>> jdk/jshell/FailOverExecutionControlTest.java and
>>>>>> jdk/jshell/FailOverExecutionControlHangingLaunchTest.java. With
>>>>>> the fix I did several hundred runs and saw no failures. Let me
>>>>>> know if there is any additional testing I should do.
>>>>>>
>>>>>> As a side note, I see there are a couple of open issues related
>>>>>> to jshell failures (8209848) which could be related to this bug
>>>>>> and therefore might be fixed by this patch.
>>>>>>
>>>>>> Thanks,
>>>>>> Patricio
>>>>>>
>>>>
>>
>