RFR: 8032901: WaitForMultipleObjects() return value not handled appropriately

Aleksej Efimov aleksej.efimov at oracle.com
Wed Jun 4 11:43:05 UTC 2014


Thank you David.
Thank you Staffan.

On 06/04/2014 01:46 PM, David Holmes wrote:
> Me too.
>
> Thanks,
> David
>
> On 4/06/2014 5:23 PM, Staffan Larsen wrote:
>> Looks ok to me.
>>
>> /Staffan
>>
>> On 3 jun 2014, at 15:49, Aleksej Efimov <aleksej.efimov at oracle.com> 
>> wrote:
>>
>>> Staffan, David,
>>>
>>> Returning back to our WaitForMultipleObjects()/WAIT_ABANDONED 
>>> discussions:
>>> Thank you for your comments and remarks. I can't disagree with 
>>> motivation that it's better to have a fatal error during the 
>>> incorrect mutex handling then data corruption (the consequence of 
>>> the previous fix). In case of such error it'll be much more easier 
>>> to debug/catch it (but as Staffan said - we have tried to check all 
>>> call paths and don't think that we'll encounter such behavior).
>>> I have modified the discussed code according to your suggestions: 
>>> http://cr.openjdk.java.net/~aefimov/8032901/9/webrev.01/
>>> To abort the process the 'exitTransportWithError' function was 
>>> utilized.
>>> Also I have tried to check that behavior isn't changed by running 
>>> "svc" regression tests set. There was no related test failures 
>>> observed.
>>>
>>> Best Regards,
>>> Aleksej
>>>
>>> On 05/15/2014 01:11 PM, Staffan Larsen wrote:
>>>> On 15 maj 2014, at 03:48, David Holmes <david.holmes at oracle.com> 
>>>> wrote:
>>>>
>>>>> On 14/05/2014 11:18 PM, Aleksej Efimov wrote:
>>>>>> David, Vitaly,
>>>>>>
>>>>>> I totally agree with Vitaly's explanation (Vitaly - thank you for 
>>>>>> that)
>>>>>> and code in shmemBase.c (the usage of enterMutex() function for
>>>>>> sending/receiving bytes through shared memory connection) 
>>>>>> illustrates on
>>>>>> how the connection shutdown event is used as a "cancellation token".
>>>>> Thanks for clarifying that.
>>>>>
>>>>> So if we were to encounter an abandoned mutex the code would 
>>>>> presently have acquired the mutex but return an error, thus 
>>>>> preventing a subsequent release, and preventing other threads from 
>>>>> acquiring (but allowing current thread to recurisevely acquire. So 
>>>>> this could both hang and cause data corruption.
>>>>>
>>>>> The new code will still return an error but release the mutex. So 
>>>>> no more hangs (other than by conditions caused by data corruption) 
>>>>> but more opportunity for data corruption.
>>>>>
>>>>> Obviously it depends on exactly what happens in the critical 
>>>>> sections guarded by this mutex, but in general I don't agree with 
>>>>> this rationale for making the change:
>>>>>
>>>>> 204     /* If the mutex is abandoned we want to return an error
>>>>> 205      * and also release the mutex so that we don't cause
>>>>> 206      * other threads to be blocked. If a mutex was abandoned
>>>>> 207      * we are in scary state. Data may be corrupted or 
>>>>> inconsistent
>>>>> 208      * but it is still better to let other threads run (and 
>>>>> possibly
>>>>> 209      * crash) than having them blocked (on the premise that a 
>>>>> crash
>>>>> 210      * is always easier to debug than a hang).
>>>>>
>>>>> Considering something has to have gone drastically wrong for the 
>>>>> mutex to become abandoned, I'm more inclined to consider this a 
>>>>> fatal error and abort.
>>>>>
>>>>> But I'll let the serviceability folk chime in here.
>>>> I was involved in fixing this and writing the comment, so obviously 
>>>> I thought it a good solution :-)
>>>>
>>>> I do agree that it would probably be a good idea to consider this a 
>>>> fatal error and abort. At that point in the code we don’t have any 
>>>> really nice ways of doing that, though. We could just print an 
>>>> error and call abort(). What we are doing now is returning an error 
>>>> from sysIPMutexEnter() and delegating the error handling to the 
>>>> caller. We have tried to check all call paths to verify that they 
>>>> do “the right thing” in the face of an error. It is obviously hard 
>>>> to verify, but it looks like they all terminate the connection with 
>>>> some kind of error message.
>>>>
>>>> /Staffan
>>>>
>>>>
>>>>> Thanks,
>>>>> David
>>>>>
>>>>>
>>>>>> Thank you,
>>>>>> -Aleksej
>>>>>>
>>>>>>
>>>>>> On 05/14/2014 01:05 PM, David Holmes wrote:
>>>>>>> On 14/05/2014 11:06 AM, Vitaly Davidovich wrote:
>>>>>>>> In windows, you acquire a mutex by waiting on it using one of 
>>>>>>>> the wait
>>>>>>>> functions, one of them employed in the code in question.  If
>>>>>>>> WaitForMultipleObjects succeeds and returns the index of the 
>>>>>>>> mutex,
>>>>>>>> current thread has ownership now.
>>>>>>> Yes I understand the basic mechanics :)
>>>>>>>
>>>>>>>> It's also common to use multi wait functions where the event is a
>>>>>>>> "cancelation token", e.g. manual reset event; this allows 
>>>>>>>> someone to
>>>>>>>> cancel waiting on mutex acquisition and return from the wait 
>>>>>>>> function.
>>>>>>>> Presumably that's the case here, but I'll let Aleksej confirm; 
>>>>>>>> just
>>>>>>>> wanted to throw this out there in the meantime :).
>>>>>>> Ah I see - yes cancellable lock acquisition would make sense.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> David
>>>>>>>
>>>>>>>> Sent from my phone
>>>>>>>>
>>>>>>>> On May 13, 2014 6:46 PM, "David Holmes" <david.holmes at oracle.com
>>>>>>>> <mailto:david.holmes at oracle.com>> wrote:
>>>>>>>>
>>>>>>>>     Hi Aleksej,
>>>>>>>>
>>>>>>>>     Thanks for the doc references regarding abandonment.
>>>>>>>>
>>>>>>>>     Let me rephrase my question. What is this logic trying to 
>>>>>>>> achieve by
>>>>>>>>     waiting on both a mutex and an event? Do we already own the 
>>>>>>>> mutex
>>>>>>>>     when this function is called?
>>>>>>>>
>>>>>>>>     David
>>>>>>>>
>>>>>>>>     On 13/05/2014 11:19 PM, Aleksej Efimov wrote:
>>>>>>>>
>>>>>>>>         David,
>>>>>>>>
>>>>>>>>         The Windows has a different terminology for mutex 
>>>>>>>> objects (much
>>>>>>>>         differs
>>>>>>>>         from the POSIX one). This one link gave me some 
>>>>>>>> understanding of
>>>>>>>>         it [1].
>>>>>>>>
>>>>>>>>         Here is the MSDN [1] description of what "abandoned 
>>>>>>>> mutex" is:
>>>>>>>>         " If a thread terminates without releasing its 
>>>>>>>> ownership of a
>>>>>>>> mutex
>>>>>>>>         object, the mutex object is considered to be abandoned. A
>>>>>>>>         waiting thread
>>>>>>>>         can acquire ownership of an abandoned mutex object, but 
>>>>>>>> the wait
>>>>>>>>         function will return*WAIT_ABANDONED*to indicate that 
>>>>>>>> the mutex
>>>>>>>>         object is
>>>>>>>>         abandoned. An abandoned mutex object indicates that an 
>>>>>>>> error has
>>>>>>>>         occurred and that any shared resource being protected 
>>>>>>>> by the
>>>>>>>> mutex
>>>>>>>>         object is in an undefined state. If the thread proceeds as
>>>>>>>>         though the
>>>>>>>>         mutex object had not been abandoned, it is no longer 
>>>>>>>> considered
>>>>>>>>         abandoned after the thread releases its ownership. This 
>>>>>>>> restores
>>>>>>>>         normal
>>>>>>>>         behavior if a handle to the mutex object is subsequently
>>>>>>>>         specified in a
>>>>>>>>         wait function."
>>>>>>>>
>>>>>>>>
>>>>>>>>         What does it mean to wait on mutex and ownership of the 
>>>>>>>> mutex
>>>>>>>>         object:
>>>>>>>>         "Any thread with a handle to a mutex object can use one of
>>>>>>>> thewait
>>>>>>>>         functions
>>>>>>>> <http://msdn.microsoft.com/en-__gb/library/windows/desktop/__ms687069%28v=vs.85%29.aspx 
>>>>>>>>
>>>>>>>> <http://msdn.microsoft.com/en-gb/library/windows/desktop/ms687069%28v=vs.85%29.aspx>>to 
>>>>>>>>
>>>>>>>>         request ownership of the mutex object. If the mutex 
>>>>>>>> object is
>>>>>>>>         owned by
>>>>>>>>         another thread, the wait function blocks the requesting 
>>>>>>>> thread
>>>>>>>>         until the
>>>>>>>>         owning thread releases the mutex object using 
>>>>>>>> the*ReleaseMutex*
>>>>>>>> <http://msdn.microsoft.com/en-__gb/library/windows/desktop/__ms685066%28v=vs.85%29.aspx 
>>>>>>>>
>>>>>>>> <http://msdn.microsoft.com/en-gb/library/windows/desktop/ms685066%28v=vs.85%29.aspx>>__function." 
>>>>>>>>
>>>>>>>>
>>>>>>>>         How we can release mutex and wait on already owned mutex:
>>>>>>>>         " After a thread obtains ownership of a mutex, it can 
>>>>>>>> specify
>>>>>>>>         the same
>>>>>>>>         mutex in repeated calls to the wait-functions
>>>>>>>> <http://msdn.microsoft.com/en-__gb/library/windows/desktop/__ms687069%28v=vs.85%29.aspx 
>>>>>>>>
>>>>>>>> <http://msdn.microsoft.com/en-gb/library/windows/desktop/ms687069%28v=vs.85%29.aspx>>__without 
>>>>>>>>
>>>>>>>>         blocking its execution. This prevents a thread from 
>>>>>>>> deadlocking
>>>>>>>>         itself
>>>>>>>>         while waiting for a mutex that it already owns. To 
>>>>>>>> release its
>>>>>>>>         ownership
>>>>>>>>         under such circumstances, the thread must 
>>>>>>>> call*ReleaseMutex*
>>>>>>>> <http://msdn.microsoft.com/en-__gb/library/windows/desktop/__ms685066%28v=vs.85%29.aspx 
>>>>>>>>
>>>>>>>> <http://msdn.microsoft.com/en-gb/library/windows/desktop/ms685066%28v=vs.85%29.aspx>>__once 
>>>>>>>>
>>>>>>>>         for each time that the mutex satisfied the conditions 
>>>>>>>> of a wait
>>>>>>>>         function."
>>>>>>>>
>>>>>>>>         [1]
>>>>>>>> http://msdn.microsoft.com/en-__gb/library/windows/desktop/__ms684266(v=vs.85).aspx 
>>>>>>>>
>>>>>>>> <http://msdn.microsoft.com/en-gb/library/windows/desktop/ms684266(v=vs.85).aspx> 
>>>>>>>>
>>>>>>>>
>>>>>>>>         -Aleksej
>>>>>>>>
>>>>>>>>         On 05/13/2014 04:00 PM, David Holmes wrote:
>>>>>>>>
>>>>>>>>             I don't understand this one at all. What is an 
>>>>>>>> "abandoned
>>>>>>>>             mutex"? For
>>>>>>>>             that matter why does the code wait on a mutex and 
>>>>>>>> an event?
>>>>>>>>             Do we
>>>>>>>>             already own the mutex? If so what does it mean to 
>>>>>>>> wait on
>>>>>>>>             it? If not
>>>>>>>>             then how can we release it?
>>>>>>>>
>>>>>>>>             ???
>>>>>>>>
>>>>>>>>             Thanks,
>>>>>>>>             David
>>>>>>>>
>>>>>>>>
>>>>>>>>             On 13/05/2014 8:57 PM, Alan Bateman wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>>                 This is debugger's shared memory transport so 
>>>>>>>> cc'ing
>>>>>>>>                 serviceability-dev
>>>>>>>>                 as that is there this code is maintained.
>>>>>>>>
>>>>>>>>                 Is there a test case or any outline of the 
>>>>>>>> conditions
>>>>>>>>                 that cause this? I
>>>>>>>>                 think that would be useful to understand the issue
>>>>>>>> further.
>>>>>>>>
>>>>>>>>                 -Alan
>>>>>>>>
>>>>>>>>                 On 13/05/2014 11:46, Aleksej Efimov wrote:
>>>>>>>>
>>>>>>>>                     Hi,
>>>>>>>>
>>>>>>>>                     Can I have a review for 8032901 bug [1] fix 
>>>>>>>> [2].
>>>>>>>>                     There is a possible
>>>>>>>>                     case when 'WaitForMultipleObjects' function 
>>>>>>>> can
>>>>>>>>                     return the
>>>>>>>>                     WAIT_ABANDONED_0 [3] error value.
>>>>>>>>                     In such case it's better to release the 
>>>>>>>> mutex and
>>>>>>>>                     return error value.
>>>>>>>>                     This will prevent other threads to be 
>>>>>>>> blocked on
>>>>>>>>                     abandoned mutex.
>>>>>>>>
>>>>>>>>                     Thank you,
>>>>>>>>                     Aleksej
>>>>>>>>
>>>>>>>>                     [1]
>>>>>>>> https://bugs.openjdk.java.net/__browse/JDK-8032901
>>>>>>>> <https://bugs.openjdk.java.net/browse/JDK-8032901>
>>>>>>>>                     [2]
>>>>>>>> http://cr.openjdk.java.net/~__aefimov/8032901/9/webrev.00/
>>>>>>>> <http://cr.openjdk.java.net/~aefimov/8032901/9/webrev.00/>
>>>>>>>>                     [3]
>>>>>>>> http://msdn.microsoft.com/en-__gb/library/windows/desktop/__ms687025(v=vs.85).aspx 
>>>>>>>>
>>>>>>>> <http://msdn.microsoft.com/en-gb/library/windows/desktop/ms687025(v=vs.85).aspx> 
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>
>>



More information about the serviceability-dev mailing list