RFR 8240902: JDI shared memory connector can use already closed Handles
Patricio Chilano
patricio.chilano.mateo at oracle.com
Thu Mar 19 14:43:27 UTC 2020
On 3/19/20 11:22 AM, Daniel D. Daugherty wrote:
> > Inc: http://cr.openjdk.java.net/~pchilanomate/8240902/v2/inc/webrev/
> > (not sure why the indent fixes are not highlighted as changes but
> the Frames view does show they changed)
>
> By default, webrev ignores leading and trailing whitespace changes. Use:
>
> -b: Do not ignore changes in the amount of white space.
>
> if you want to see them. I'm okay that they are not there in most of
> the views. If you want to see them, look at the patch.
Thanks! I didn't know that option.
>
> src/jdk.jdi/share/native/libdt_shmem/shmemBase.c
> No comments.
>
> Thumbs up.
Thanks for looking at this Dan!
Patricio
> Dan
>
>
> On 3/19/20 2:18 AM, Patricio Chilano wrote:
>> Hi David,
>>
>> On 3/18/20 8:10 PM, David Holmes wrote:
>>> Hi Patricio,
>>>
>>> On 19/03/2020 6:44 am, Patricio Chilano wrote:
>>>> Hi David,
>>>>
>>>> On 3/18/20 4:27 AM, David Holmes wrote:
>>>>> Hi Patricio,
>>>>>
>>>>> On 18/03/2020 6:14 am, Patricio Chilano wrote:
>>>>>> Hi all,
>>>>>>
>>>>>> Please review the following patch:
>>>>>>
>>>>>> Bug: https://bugs.openjdk.java.net/browse/JDK-8240902
>>>>>> Webrev: http://cr.openjdk.java.net/~pchilanomate/8240902/v1/webrev/
>>>>>>
>>>>>> Calling closeConnection() on an already created/opened connection
>>>>>> includes calls to CloseHandle() on objects that can still be used
>>>>>> by other threads. This can lead to either undefined behavior or,
>>>>>> as detailed in the bug comments, changes of state of unrelated
>>>>>> objects.
>>>>>
>>>>> This was a really great find!
>>>> Thanks! : )
>>>>
>>>>>> This issue was found while debugging the reason behind some
>>>>>> jshell test failures seen after pushing 8230594. Not as
>>>>>> important, but there are also calls to closeStream() from
>>>>>> createStream()/openStream() when failing to create/open a stream
>>>>>> that will return after executing "CHECK_ERROR(enterMutex(stream,
>>>>>> NULL));" without closing the intended resources. Then, calling
>>>>>> closeConnection() could assert if the reason for the previous
>>>>>> failure was that the stream's mutex failed to be created/opened.
>>>>>> This patch aims to address these issues too.
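The early-return hazard described above can be sketched in a few lines. This is a minimal, self-contained illustration with made-up names (the Stream struct and enterMutex stub are hypothetical), not the actual shmemBase.c source:

```c
#include <stdbool.h>

#define SYS_OK   0
#define SYS_ERR -1

/* Early-return-on-error pattern, as used throughout shmemBase.c. */
#define CHECK_ERROR(expr) \
    do { int err_ = (expr); if (err_ != SYS_OK) return err_; } while (0)

typedef struct {
    bool has_mutex;        /* mutex creation may have failed earlier */
    bool resources_freed;
} Stream;

/* Stand-in for enterMutex(stream, NULL): fails if there is no mutex. */
static int enterMutex(Stream *s) {
    return s->has_mutex ? SYS_OK : SYS_ERR;
}

/* Sketch of the hazard: if the stream's mutex was never created,
   the CHECK_ERROR line returns immediately and the cleanup below
   is never reached, so the other resources stay open. */
static int closeStream(Stream *s) {
    CHECK_ERROR(enterMutex(s));
    s->resources_freed = true;   /* skipped on the failure path */
    return SYS_OK;
}
```

Calling this sketch's closeStream() on a stream whose mutex creation failed returns SYS_ERR without freeing anything, which is the leak the paragraph above describes.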
>>>>>
>>>>> Patch looks good in general. The internal reference count guards
>>>>> deletion of the internal resources, and is itself safe because we
>>>>> never actually delete the connection. Thanks for adding the
>>>>> comment about this aspect.
>>>>>
>>>>> A few items:
>>>>>
>>>>> Please update copyright year before pushing.
>>>> Done.
>>>>
>>>>> Please align ENTER_CONNECTION/LEAVE_CONNECTION macros the same way
>>>>> as STREAM_INVARIANT.
>>>> Done.
>>>>
>>>>> 170 unsigned int refcount;
>>>>> 171 jint state;
>>>>>
>>>>> I'm unclear about the use of stream->state and connection->state
>>>>> as guards - unless accessed under a mutex these would seem to at
>>>>> least need acquire/release semantics.
>>>>>
>>>>> Additionally the reads of refcount would also seem to need some
>>>>> form of memory synchronization - though the Windows docs for the
>>>>> Interlocked* API does not show how to simply read such a variable!
>>>>> Though I note that the RtlFirstEntrySList method for the
>>>>> "Interlocked Singly Linked Lists" API does state "Access to the
>>>>> list is synchronized on a multiprocessor system." which suggests a
>>>>> read of such a variable does require some form of memory
>>>>> synchronization!
>>>> In the case of the stream struct, the state field is protected by
>>>> the mutex field. It is set to STATE_CLOSED while holding the mutex,
>>>> and threads that read it must acquire the mutex first through
>>>> sysIPMutexEnter(). For the cases where sysIPMutexEnter() didn't
>>>> acquire the mutex, we will return something different than SYS_OK
>>>> and the call will exit anyways. All this behaves as before, I
>>>> didn't change it.
>>>
>>> Thanks for clarifying.
>>>
>>>> The refcount and state that I added to the SharedMemoryConnection
>>>> struct work together. For a thread closing the connection, setting
>>>> the connection state to STATE_CLOSED has to happen before reading
>>>> the refcount (more on the atomicity of that read later). That's why
>>>> I added the MemoryBarrier() call, which I now see is better placed
>>>> right after setting the connection state to closed. For the
>>>> threads accessing the connection, incrementing the refcount has to
>>>> happen before reading the connection state. That's already provided
>>>> by the InterlockedIncrement() which uses a full memory barrier. In
>>>> this way if the thread closing the connection reads a refcount of
>>>> 0, then we know it's safe to release the resources, since other
>>>> threads accessing the connection will see that the state is closed
>>>> after incrementing the refcount. If the read of refcount is not 0,
>>>> then a thread may or may not be accessing the connection
>>>> (it could have read a connection state of STATE_CLOSED after
>>>> incrementing the refcount), we don't know, so we can't release
>>>> anything. Similarly if the thread accessing the connection reads
>>>> that the state is not closed, then we know it's safe to access the
>>>> stream since anybody closing the connection will still have to read
>>>> refcount which will be at least 1.
>>>> As for the atomicity of the read of refcount, from
>>>> https://docs.microsoft.com/en-us/windows/win32/sync/interlocked-variable-access,
>>>> it states that "simple reads and writes to properly-aligned 32-bit
>>>> variables are atomic operations". Maybe I should declare refcount
>>>> explicitly as DWORD32?
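The refcount/state protocol described above can be sketched portably. Here C11 atomics stand in for the Windows InterlockedIncrement()/MemoryBarrier() primitives, and all names (Connection, enter_connection, and so on) are illustrative, not the actual shmemBase.c identifiers:

```c
#include <stdatomic.h>
#include <stdbool.h>

#define STATE_OPEN   1
#define STATE_CLOSED 2

typedef struct {
    atomic_uint refcount;
    atomic_int  state;
} Connection;

/* Accessor side: ENTER_CONNECTION analogue. */
static bool enter_connection(Connection *c) {
    /* Increment first; the seq_cst RMW plays the role of
       InterlockedIncrement's full barrier, so the state load below
       cannot be reordered before the increment. */
    atomic_fetch_add(&c->refcount, 1);
    if (atomic_load(&c->state) == STATE_CLOSED) {
        atomic_fetch_sub(&c->refcount, 1);  /* back out; connection closed */
        return false;
    }
    return true;  /* safe to use the handles until leave_connection() */
}

/* LEAVE_CONNECTION analogue. */
static void leave_connection(Connection *c) {
    atomic_fetch_sub(&c->refcount, 1);
}

/* Closer side: mark closed, then wait for the refcount to drain. */
static bool close_connection(Connection *c) {
    /* seq_cst store plays the role of the MemoryBarrier() in the
       patch: it is ordered before the refcount reads below. */
    atomic_store(&c->state, STATE_CLOSED);
    for (int attempts = 10; attempts > 0; attempts--) {
        if (atomic_load(&c->refcount) == 0) {
            /* No accessor can pass the state check anymore:
               the handles could be released here. */
            return true;
        }
        /* The real code would sleep briefly between retries. */
    }
    return false;  /* give up and leak rather than close in-use handles */
}
```

If close_connection() ever sees refcount == 0 after publishing STATE_CLOSED, any later enter_connection() must observe the closed state, which is the invariant argued for above.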
>>>
>>> It isn't the atomicity in question with the naked read but the
>>> visibility. Any latency in the visibility of the store done by the
>>> InterLocked*() function should be handled by the retry loop, but
>>> what is to stop the C++ compiler from hoisting the read of refcount
>>> out of the loop? It isn't even volatile (which has a stronger
>>> meaning in VS than regular C++).
>> I see what you mean now, I was thinking about atomicity and order of
>> operations but didn't consider the visibility of that read. Yes, if
>> the compiler decides to be smart and hoist the read out of the loop
>> we might never notice that it is safe to release those resources and
>> we would leak them for no reason. I see from the Windows docs
>> (https://docs.microsoft.com/en-us/cpp/c-language/type-qualifiers)
>> that declaring it volatile as you pointed out should be enough to
>> prevent that.
>>
>>>> Instead of having a refcount we could have done something similar
>>>> to the stream struct and protect access to the connection through a
>>>> mutex. To avoid serializing all threads we could have used SRW
>>>> locks and only the one closing the connection would do
>>>> AcquireSRWLockExclusive(). It would change the state of the
>>>> connection to STATE_CLOSED, close all handles, and then release the
>>>> mutex. ENTER_CONNECTION() and LEAVE_CONNECTION() would acquire and
>> release the mutex in shared mode. But other than maybe being easier
>> to read, I don't think the change would be smaller.
>>>>
>>>>> 413 while (attempts>0) {
>>>>>
>>>>> spaces around >
>>>> Done.
>>>>
>>>>> If the loop at 413 never encounters a zero reference_count then it
>>>>> doesn't close the events or the mutex but still returns SYS_OK.
>>>>> That seems wrong but I'm not sure what the right behaviour is here.
>>>> I can change the return value to be SYS_ERR, but I don't think
>>>> there is much we can do about it unless we want to wait forever
>>>> until we can release those resources.
>>>
>>> SYS_ERR would look better, but I see now that the return value is
>>> completely ignored anyway. So we're just going to leak resources if
>>> the loop "times out". I guess this is the best we can do.
>> Here is v2 with the corrections:
>>
>> Full: http://cr.openjdk.java.net/~pchilanomate/8240902/v2/webrev/
>> Inc: http://cr.openjdk.java.net/~pchilanomate/8240902/v2/inc/webrev/
>> (not sure why the indent fixes are not highlighted as changes but
>> the Frames view does show they changed)
>>
>> I'll give it a run on mach5 adding tier5 as Serguei suggested.
>>
>>
>> Thanks,
>> Patricio
>>> Thanks,
>>> David
>>>
>>>>
>>>>> And please wait for serviceability folk to review this.
>>>> Sounds good.
>>>>
>>>>
>>>> Thanks for looking at this David! I will move the MemoryBarrier()
>>>> and change the refcount to be DWORD32 if you are okay with that.
>>>>
>>>>
>>>> Thanks,
>>>> Patricio
>>>>> Thanks,
>>>>> David
>>>>> -----
>>>>>
>>>>>> Tested in mach5 with the current baseline, tiers 1-3 and several
>>>>>> runs of open/test/langtools/:tier1 which includes the jshell
>>>>>> tests where this connector is used. I also applied patch
>>>>>> http://cr.openjdk.java.net/~pchilanomate/8240902/triggerbug/webrev
>>>>>> mentioned in the comments of the bug, on top of the baseline and
>>>>>> ran the langtools tests with and without this fix. Without the
>>>>>> fix, around 30 repetitions already show failures in tests
>>>>>> jdk/jshell/FailOverExecutionControlTest.java and
>>>>>> jdk/jshell/FailOverExecutionControlHangingLaunchTest.java. With
>>>>>> the fix I did several hundred runs and saw no failures. Let me
>>>>>> know if there is any additional testing I should do.
>>>>>>
>>>>>> As a side note, I see there are a couple of open issues related
>>>>>> to jshell failures (8209848) which could be related to this bug
>>>>>> and therefore might be fixed by this patch.
>>>>>>
>>>>>> Thanks,
>>>>>> Patricio
>>>>>>
>>>>
>>
>