RFR 8240902: JDI shared memory connector can use already closed Handles
David Holmes
david.holmes at oracle.com
Wed Mar 18 23:10:44 UTC 2020
Hi Patricio,
On 19/03/2020 6:44 am, Patricio Chilano wrote:
> Hi David,
>
> On 3/18/20 4:27 AM, David Holmes wrote:
>> Hi Patricio,
>>
>> On 18/03/2020 6:14 am, Patricio Chilano wrote:
>>> Hi all,
>>>
>>> Please review the following patch:
>>>
>>> Bug: https://bugs.openjdk.java.net/browse/JDK-8240902
>>> Webrev: http://cr.openjdk.java.net/~pchilanomate/8240902/v1/webrev/
>>>
>>> Calling closeConnection() on an already created/opened connection
>>> includes calls to CloseHandle() on objects that can still be used by
>>> other threads. This can lead to either undefined behavior or, as
>>> detailed in the bug comments, changes of state of unrelated objects.
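To make the hazard concrete: Windows is free to recycle a handle value as
soon as CloseHandle() returns, so a racing thread that still holds the raw
value may end up operating on whatever unrelated object now owns it. A
sketch of the race, with invented field names rather than the connector's
actual code:

    /* Thread A, tearing down the connection: */
    CloseHandle(connection->event);   /* the OS may immediately reuse this
                                         handle value for a new object */

    /* Thread B, racing with the teardown: */
    SetEvent(connection->event);      /* if the value was recycled, this
                                         signals some unrelated object */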
>>
>> This was a really great find!
> Thanks! : )
>
>>> This issue was found while debugging the reason behind some jshell
>>> test failures seen after pushing 8230594. Not as important, but there
>>> are also calls to closeStream() from createStream()/openStream() when
>>> failing to create/open a stream that will return after executing
>>> "CHECK_ERROR(enterMutex(stream, NULL));" without closing the intended
>>> resources. Then, calling closeConnection() could assert if the reason
>>> for the previous failure was that the stream's mutex failed to be
>>> created/opened. This patch aims to address these issues too.
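The shape of that second problem is an early-return macro guarding the
cleanup path, so a stream whose mutex was never created/opened bails out of
closeStream() before releasing anything. Paraphrased, not the exact source:

    static jint closeStream(Stream *stream) {
        /* If the stream's mutex was never created/opened, enterMutex()
         * fails and CHECK_ERROR returns right here... */
        CHECK_ERROR(enterMutex(stream, NULL));
        /* ...so the CloseHandle() calls on the stream's events and
         * mutex further down are never reached. */
        /* ... */
    }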
>>
>> Patch looks good in general. The internal reference count guards
>> deletion of the internal resources, and is itself safe because we never
>> actually delete the connection. Thanks for adding the comment about
>> this aspect.
>>
>> A few items:
>>
>> Please update copyright year before pushing.
> Done.
>
>> Please align ENTER_CONNECTION/LEAVE_CONNECTION macros the same way as
>> STREAM_INVARIANT.
> Done.
>
>> 170 unsigned int refcount;
>> 171 jint state;
>>
>> I'm unclear about the use of stream->state and connection->state as
>> guards - unless accessed under a mutex these would seem to at least
>> need acquire/release semantics.
>>
>> Additionally the reads of refcount would also seem to need to some
>> form of memory synchronization - though the Windows docs for the
>> Interlocked* API does not show how to simply read such a variable!
>> Though I note that the RtlFirstEntrySList method for the "Interlocked
>> Singly Linked Lists" API does state "Access to the list is
>> synchronized on a multiprocessor system." which suggests a read of
>> such a variable does require some form of memory synchronization!
> In the case of the stream struct, the state field is protected by the
> mutex field. It is set to STATE_CLOSED while holding the mutex, and
> threads that read it must acquire the mutex first through
> sysIPMutexEnter(). For the cases where sysIPMutexEnter() didn't acquire
> the mutex, we will return something other than SYS_OK and the call
> will exit anyway. All this behaves as before; I didn't change it.
Thanks for clarifying.
> The refcount and state that I added to the SharedMemoryConnection struct
> work together. For a thread closing the connection, setting the
> connection state to STATE_CLOSED has to happen before reading the
> refcount (more on the atomicity of that read later). That's why I added
> the MemoryBarrier() call, which I now see is better placed right after
> setting the connection state to closed. For the threads accessing
> the connection, incrementing the refcount has to happen before reading
> the connection state. That's already provided by the
> InterlockedIncrement() which uses a full memory barrier. In this way if
> the thread closing the connection reads a refcount of 0, then we know
> it's safe to release the resources, since other threads accessing the
> connection will see that the state is closed after incrementing the
> refcount. If the read of refcount is not 0, then it could be that a
> thread is accessing the connection or not (it could have read a
> connection state of STATE_CLOSED after incrementing the refcount); we don't
> know, so we can't release anything. Similarly if the thread accessing
> the connection reads that the state is not closed, then we know it's
> safe to access the stream since anybody closing the connection will
> still have to read refcount which will be at least 1.
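In outline, the protocol you describe looks like this (my sketch with
simplified names, not the patch itself):

    /* Accessor side, roughly what ENTER_CONNECTION expands to: */
    InterlockedIncrement(&connection->refcount);   /* full barrier */
    if (connection->state == STATE_CLOSED) {       /* read after the increment */
        InterlockedDecrement(&connection->refcount);
        return SYS_ERR;            /* the closer may already be releasing */
    }
    /* ... use the connection; LEAVE_CONNECTION decrements refcount ... */

    /* Closer side: */
    connection->state = STATE_CLOSED;
    MemoryBarrier();               /* order the store before reading refcount */
    if (connection->refcount == 0) {
        /* No accessor can get past the state check anymore, so the
         * internal handles can be closed safely. */
    }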
> As for the atomicity of the read of refcount,
> https://docs.microsoft.com/en-us/windows/win32/sync/interlocked-variable-access
> states that "simple reads and writes to properly-aligned 32-bit
> variables are atomic operations". Maybe I should declare refcount
> explicitly as DWORD32?
It isn't the atomicity that's in question with the naked read but the
visibility. Any latency in the visibility of the store done by the
Interlocked*() function should be handled by the retry loop, but what is
to stop the C++ compiler from hoisting the read of refcount out of the
loop? It isn't even volatile (which has a stronger meaning in VS than
standard C++).
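One way to address both points would be to give the field the
volatile-qualified type that the Interlocked* API itself expects, so every
read in the retry loop is a fresh load. A sketch, assuming the field and
loop from the patch:

    typedef struct SharedMemoryConnection {
        /* ... */
        volatile LONG refcount;   /* InterlockedIncrement() takes volatile
                                     LONG*; volatile also stops the compiler
                                     from hoisting reads out of the loop */
        jint state;
    } SharedMemoryConnection;

    /* The closer's retry loop then re-reads refcount on every iteration: */
    while (attempts > 0 && connection->refcount > 0) {
        Sleep(10);                /* hypothetical backoff */
        attempts--;
    }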
> Instead of having a refcount we could have done something similar to the
> stream struct and protect access to the connection through a mutex. To
> avoid serializing all threads we could have used SRW locks and only the
> one closing the connection would do AcquireSRWLockExclusive(). It would
> change the state of the connection to STATE_CLOSED, close all handles,
> and then release the mutex. ENTER_CONNECTION() and LEAVE_CONNECTION()
> would acquire and release the mutex in shared mode. But other than maybe
> being easier to read, I don't think the change would be smaller.
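For comparison, that SRW-based alternative would look roughly like this
(sketch only; the Slim Reader/Writer lock API is in windows.h):

    static SRWLOCK connLock = SRWLOCK_INIT;

    /* ENTER_CONNECTION / LEAVE_CONNECTION as shared acquires: */
    AcquireSRWLockShared(&connLock);
    if (connection->state == STATE_CLOSED) {
        ReleaseSRWLockShared(&connLock);
        return SYS_ERR;
    }
    /* ... use the connection ... */
    ReleaseSRWLockShared(&connLock);

    /* The closer acquires exclusively, so it waits for all accessors: */
    AcquireSRWLockExclusive(&connLock);
    connection->state = STATE_CLOSED;
    /* CloseHandle() the connection's events and mutex here */
    ReleaseSRWLockExclusive(&connLock);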
>
>> 413 while (attempts>0) {
>>
>> spaces around >
> Done.
>
>> If the loop at 413 never encounters a zero refcount then it
>> doesn't close the events or the mutex but still returns SYS_OK. That
>> seems wrong but I'm not sure what the right behaviour is here.
> I can change the return value to be SYS_ERR, but I don't think there is
> much we can do about it unless we want to wait forever until we can
> release those resources.
SYS_ERR would look better, but I see now that the return value is
completely ignored anyway. So we're just going to leak resources if the
loop "times out". I guess this is the best we can do.
Thanks,
David
>
>> And please wait for serviceability folk to review this.
> Sounds good.
>
>
> Thanks for looking at this David! I will move the MemoryBarrier() and
> change the refcount to be DWORD32 if you are okay with that.
>
>
> Thanks,
> Patricio
>> Thanks,
>> David
>> -----
>>
>>> Tested in mach5 with the current baseline, tiers1-3 and several runs
>>> of open/test/langtools/:tier1 which includes the jshell tests where
>>> this connector is used. I also applied patch
>>> http://cr.openjdk.java.net/~pchilanomate/8240902/triggerbug/webrev
>>> mentioned in the comments of the bug, on top of the baseline and ran
>>> the langtools tests with and without this fix. Without the fix, around
>>> 30 repetitions were already enough to show failures in
>>> jdk/jshell/FailOverExecutionControlTest.java and
>>> jdk/jshell/FailOverExecutionControlHangingLaunchTest.java. With the
>>> fix I did several hundred runs and saw no failures. Let me know if
>>> there is any additional testing I should do.
>>>
>>> As a side note, I see there are a couple of open issues related to
>>> jshell failures (8209848) which could be related to this bug and
>>> therefore might be fixed by this patch.
>>>
>>> Thanks,
>>> Patricio
>>>
>