RFR 8240902: JDI shared memory connector can use already closed Handles
David Holmes
david.holmes at oracle.com
Thu Mar 19 07:50:55 UTC 2020
Hi Patricio,
Incremental changes look good.
Thanks,
David
On 19/03/2020 4:18 pm, Patricio Chilano wrote:
> Hi David,
>
> On 3/18/20 8:10 PM, David Holmes wrote:
>> Hi Patricio,
>>
>> On 19/03/2020 6:44 am, Patricio Chilano wrote:
>>> Hi David,
>>>
>>> On 3/18/20 4:27 AM, David Holmes wrote:
>>>> Hi Patricio,
>>>>
>>>> On 18/03/2020 6:14 am, Patricio Chilano wrote:
>>>>> Hi all,
>>>>>
>>>>> Please review the following patch:
>>>>>
>>>>> Bug: https://bugs.openjdk.java.net/browse/JDK-8240902
>>>>> Webrev: http://cr.openjdk.java.net/~pchilanomate/8240902/v1/webrev/
>>>>>
>>>>> Calling closeConnection() on an already created/opened connection
>>>>> includes calls to CloseHandle() on objects that can still be used
>>>>> by other threads. This can lead to either undefined behavior or, as
>>>>> detailed in the bug comments, changes of state of unrelated objects.
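To illustrate why a stale handle can end up changing the state of an unrelated object (Windows recycles handle values), here is a small stand-alone demo; it is not taken from the connector code and only shows the reuse hazard in general:

#include <stdio.h>
#include <windows.h>

int main(void) {
    HANDLE e1 = CreateEvent(NULL, TRUE, FALSE, NULL);
    CloseHandle(e1);                  /* the numeric handle value is now stale */

    /* Another object created afterwards will often be handed the same slot. */
    HANDLE e2 = CreateEvent(NULL, TRUE, FALSE, NULL);
    printf("e1=%p e2=%p reused=%d\n", (void*)e1, (void*)e2, e1 == e2);

    if (e1 == e2) {
        /* A thread still holding the "dead" handle silently signals the
         * unrelated new object -- the "changes of state of unrelated
         * objects" mentioned above. */
        SetEvent(e1);
        printf("new event signaled via stale handle: %d\n",
               WaitForSingleObject(e2, 0) == WAIT_OBJECT_0);
    }
    CloseHandle(e2);
    return 0;
}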
>>>>
>>>> This was a really great find!
>>> Thanks! : )
>>>
>>>>> This issue was found while debugging the reason behind some jshell
>>>>> test failures seen after pushing 8230594. Not as important, but
>>>>> there are also calls to closeStream() from
>>>>> createStream()/openStream() when failing to create/open a stream
>>>>> that will return after executing "CHECK_ERROR(enterMutex(stream,
>>>>> NULL));" without closing the intended resources. Then, calling
>>>>> closeConnection() could assert if the reason for the previous
>>>>> failure was that the stream's mutex failed to be created/opened.
>>>>> This patch aims to address these issues too.
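A plausible shape of the closeStream() problem described here; every type, helper, and value below is a stand-in inferred from this thread, not the real shmemBase.c code:

#include <stdio.h>

typedef int jint;      /* stand-in; the real typedef comes from jni.h   */
#define SYS_OK   0     /* assumed values; the transport defines its own */
#define SYS_ERR (-1)

#define CHECK_ERROR(expr)           \
    do {                            \
        jint error_ = (expr);       \
        if (error_ != SYS_OK) {     \
            return error_;          \
        }                           \
    } while (0)

typedef struct {
    int hasMutex;      /* 0 if creating/opening the stream mutex failed */
    int hasEvents;     /* resources that still need to be released      */
} Stream;

static jint enterMutex(Stream *stream, void *reserved) {
    (void)reserved;
    return stream->hasMutex ? SYS_OK : SYS_ERR;
}

static jint closeStream(Stream *stream) {
    CHECK_ERROR(enterMutex(stream, NULL));   /* early return: events never closed */
    stream->hasEvents = 0;                   /* ... close events, mutex, etc. ... */
    return SYS_OK;
}

int main(void) {
    Stream s = { /* hasMutex */ 0, /* hasEvents */ 1 };
    jint rc = closeStream(&s);
    printf("closeStream returned %d, events still open: %d\n", rc, s.hasEvents);
    return 0;
}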
>>>>
>>>> Patch looks good in general. The internal reference count guards
>>>> deletion of the internal resources, and is itself safe because we never
>>>> actually delete the connection. Thanks for adding the comment about
>>>> this aspect.
>>>>
>>>> A few items:
>>>>
>>>> Please update copyright year before pushing.
>>> Done.
>>>
>>>> Please align ENTER_CONNECTION/LEAVE_CONNECTION macros the same way
>>>> as STREAM_INVARIANT.
>>> Done.
>>>
>>>> 170 unsigned int refcount;
>>>> 171 jint state;
>>>>
>>>> I'm unclear about the use of stream->state and connection->state as
>>>> guards - unless accessed under a mutex these would seem to at least
>>>> need acquire/release semantics.
>>>>
>>>> Additionally the reads of refcount would also seem to need some
>>>> form of memory synchronization - though the Windows docs for the
>>>> Interlocked* API does not show how to simply read such a variable!
>>>> Though I note that the RtlFirstEntrySList method for the
>>>> "Interlocked Singly Linked Lists" API does state "Access to the list
>>>> is synchronized on a multiprocessor system." which suggests a read
>>>> of such a variable does require some form of memory synchronization!
>>> In the case of the stream struct, the state field is protected by the
>>> mutex field. It is set to STATE_CLOSED while holding the mutex, and
>>> threads that read it must acquire the mutex first through
>>> sysIPMutexEnter(). For the cases where sysIPMutexEnter() didn't
>>> acquire the mutex, we will return something different from SYS_OK and
>>> the call will exit anyway. All this behaves as before, I didn't
>>> change it.
>>
>> Thanks for clarifying.
>>
>>> The refcount and state that I added to the SharedMemoryConnection
>>> struct work together. For a thread closing the connection, setting
>>> the connection state to STATE_CLOSED has to happen before reading the
>>> refcount (more on the atomicity of that read later). That's why I
>>> added the MemoryBarrier() call, which I see would be better placed
>>> right after setting the connection state to closed. For the
>>> threads accessing the connection, incrementing the refcount has to
>>> happen before reading the connection state. That's already provided
>>> by the InterlockedIncrement(), which uses a full memory barrier. In
>>> this way, if the thread closing the connection reads a refcount of 0,
>>> then we know it's safe to release the resources, since other threads
>>> accessing the connection will see that the state is closed after
>>> incrementing the refcount. If the read of refcount is not 0, it
>>> could be that a thread is accessing the connection, or it could just
>>> have read a connection state of STATE_CLOSED after incrementing the
>>> refcount; we don't know, so we can't release anything. Similarly, if
>>> the thread accessing the connection reads that the state is not
>>> closed, then we know it's safe to access the stream, since anybody
>>> closing the connection will still have to read refcount, which will be
>>> at least 1.
>>> As for the atomicity of the read of refcount,
>>> https://docs.microsoft.com/en-us/windows/win32/sync/interlocked-variable-access
>>> states that "simple reads and writes to properly-aligned 32-bit
>>> variables are atomic operations". Maybe I should declare refcount
>>> explicitly as DWORD32?
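For reference, here is a minimal sketch of the protocol described above. The struct layout, names, and constants are assumptions taken from this description, not the actual shmemBase.c declarations:

#include <windows.h>

#define STATE_OPEN   0
#define STATE_CLOSED 1

typedef struct {
    volatile LONG refcount;   /* threads currently using the connection */
    volatile LONG state;      /* STATE_OPEN or STATE_CLOSED */
    /* ... events, mutex, shared memory ... */
} Connection;

/* Accessor: publish the increment first, then check the state.  The
 * InterlockedIncrement() is a full barrier, so a closer that later reads
 * refcount is guaranteed to see this thread. */
static int enterConnection(Connection *c) {
    InterlockedIncrement(&c->refcount);
    if (c->state == STATE_CLOSED) {
        InterlockedDecrement(&c->refcount);
        return -1;                       /* closed: do not touch the handles */
    }
    return 0;                            /* caller runs leaveConnection() when done */
}

static void leaveConnection(Connection *c) {
    InterlockedDecrement(&c->refcount);
}

/* Closer: mark the connection closed, make that store visible, then look at
 * refcount.  Reading 0 means every later accessor must observe STATE_CLOSED,
 * so releasing the handles is safe. */
static int maySafelyRelease(Connection *c) {
    c->state = STATE_CLOSED;
    MemoryBarrier();                     /* order the store before the read below */
    return c->refcount == 0;
}

The key point is the ordering: the accessor's increment is a full barrier, and the closer's store to state is made visible before it reads refcount, so the two sides cannot both conclude they won the race.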
>>
>> It isn't the atomicity in question with the naked read but the
>> visibility. Any latency in the visibility of the store done by the
>> Interlocked*() function should be handled by the retry loop, but what
>> is to stop the C++ compiler from hoisting the read of refcount out of
>> the loop? It isn't even volatile (which has a stronger meaning in VS
>> than regular C++).
> I see what you mean now, I was thinking about atomicity and order of
> operations but didn't consider the visibility of that read. Yes, if the
> compiler decides to be smart and hoist the read out of the loop, we
> might never notice that it is safe to release those resources and we
> would leak them for no reason. I see from the Windows docs
> (https://docs.microsoft.com/en-us/cpp/c-language/type-qualifiers)
> that declaring it volatile as you pointed out should be enough to
> prevent that.
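A minimal sketch of the kind of drain loop being discussed, assuming the bounded-attempts loop and the SYS_OK/SYS_ERR codes mentioned elsewhere in the thread (not the actual shmemBase.c code):

#include <windows.h>

typedef long jint;     /* stand-in for the JNI typedef from jni.h        */
#define SYS_OK   0     /* assumed values; the transport defines its own  */
#define SYS_ERR (-1)

/* The volatile qualifier is what keeps the compiler from hoisting the read
 * of refcount out of the loop; the Interlocked* writes done by other
 * threads then become visible on some later iteration.  If the count never
 * drains, the caller gives up and leaks the handles rather than closing
 * them while they may still be in use. */
static jint waitForUsersToLeave(volatile DWORD32 *refcount, int attempts) {
    while (attempts > 0) {
        if (*refcount == 0) {
            return SYS_OK;    /* safe to CloseHandle() the events and mutex */
        }
        Sleep(20);            /* give in-flight users a chance to leave */
        attempts--;
    }
    return SYS_ERR;           /* never drained: leak instead of freeing unsafely */
}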
>
>>> Instead of having a refcount we could have done something similar to
>>> the stream struct and protected access to the connection through a
>>> mutex. To avoid serializing all threads we could have used SRW locks,
>>> and only the one closing the connection would do
>>> AcquireSRWLockExclusive(). It would change the state of the
>>> connection to STATE_CLOSED, close all handles, and then release the
>>> mutex. ENTER_CONNECTION() and LEAVE_CONNECTION() would acquire and
>>> release the mutex in shared mode. But other than maybe being easier
>>> to read, I don't think the change would be smaller.
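A sketch of that SRW-lock alternative, under the same naming assumptions (this is not the approach the patch takes):

#include <windows.h>

#define STATE_OPEN   0
#define STATE_CLOSED 1

typedef struct {
    SRWLOCK lock;          /* initialize with SRWLOCK_INIT or InitializeSRWLock() */
    int     state;         /* STATE_OPEN or STATE_CLOSED */
    HANDLE  mutexHandle;   /* ... plus the other handles owned by the connection */
} Connection;

/* Readers take the lock in shared mode, so they can proceed concurrently. */
static int enterConnection(Connection *c) {
    AcquireSRWLockShared(&c->lock);
    if (c->state == STATE_CLOSED) {
        ReleaseSRWLockShared(&c->lock);
        return -1;                      /* connection already closed */
    }
    return 0;                           /* caller must ReleaseSRWLockShared() when done */
}

/* The closer takes the lock exclusively, which waits until every shared
 * holder has left, so closing the handles cannot race with their use. */
static void closeConnection(Connection *c) {
    AcquireSRWLockExclusive(&c->lock);
    c->state = STATE_CLOSED;
    CloseHandle(c->mutexHandle);
    ReleaseSRWLockExclusive(&c->lock);
}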
>>>
>>>> 413 while (attempts>0) {
>>>>
>>>> spaces around >
>>> Done.
>>>
>>>> If the loop at 413 never encounters a zero reference_count then it
>>>> doesn't close the events or the mutex but still returns SYS_OK. That
>>>> seems wrong but I'm not sure what the right behaviour is here.
>>> I can change the return value to be SYS_ERR, but I don't think there
>>> is much we can do about it unless we want to wait forever until we
>>> can release those resources.
>>
>> SYS_ERR would look better, but I see now that the return value is
>> completely ignored anyway. So we're just going to leak resources if
>> the loop "times out". I guess this is the best we can do.
> Here is v2 with the corrections:
>
> Full: http://cr.openjdk.java.net/~pchilanomate/8240902/v2/webrev/
> Inc: http://cr.openjdk.java.net/~pchilanomate/8240902/v2/inc/webrev/
> (not sure
> why the indent fixes are not highlighted as changes but the Frames view
> does show they changed)
>
> I'll give it a run on mach5 adding tier5 as Serguei suggested.
>
>
> Thanks,
> Patricio
>> Thanks,
>> David
>>
>>>
>>>> And please wait for serviceability folk to review this.
>>> Sounds good.
>>>
>>>
>>> Thanks for looking at this David! I will move the MemoryBarrier() and
>>> change the refcount to be DWORD32 if you are okay with that.
>>>
>>>
>>> Thanks,
>>> Patricio
>>>> Thanks,
>>>> David
>>>> -----
>>>>
>>>>> Tested in mach5 with the current baseline, tiers 1-3, and several
>>>>> runs of open/test/langtools/:tier1, which includes the jshell tests
>>>>> where this connector is used. I also applied patch
>>>>> http://cr.openjdk.java.net/~pchilanomate/8240902/triggerbug/webrev
>>>>> mentioned in the comments of the bug, on top of the baseline, and
>>>>> ran the langtools tests with and without this fix. Without the fix,
>>>>> running around 30 repetitions already shows failures in tests
>>>>> jdk/jshell/FailOverExecutionControlTest.java and
>>>>> jdk/jshell/FailOverExecutionControlHangingLaunchTest.java. With the
>>>>> fix I ran several hundred runs and saw no failures. Let me know if
>>>>> there is any additional testing I should do.
>>>>>
>>>>> As a side note, I see there are a couple of open issues related
>>>>> to jshell failures (8209848) which could be related to this bug
>>>>> and therefore might be fixed by this patch.
>>>>>
>>>>> Thanks,
>>>>> Patricio
>>>>>
>>>
>