RFR: 8032901: WaitForMultipleObjects() return value not handled appropriately
Aleksej Efimov
aleksej.efimov at oracle.com
Wed Jun 4 11:43:05 UTC 2014
Thank you David.
Thank you Staffan.
On 06/04/2014 01:46 PM, David Holmes wrote:
> Me too.
>
> Thanks,
> David
>
> On 4/06/2014 5:23 PM, Staffan Larsen wrote:
>> Looks ok to me.
>>
>> /Staffan
>>
>> On 3 jun 2014, at 15:49, Aleksej Efimov <aleksej.efimov at oracle.com>
>> wrote:
>>
>>> Staffan, David,
>>>
>>> Returning back to our WaitForMultipleObjects()/WAIT_ABANDONED
>>> discussions:
>>> Thank you for your comments and remarks. I can't disagree with
>>> motivation that it's better to have a fatal error during the
>>> incorrect mutex handling then data corruption (the consequence of
>>> the previous fix). In case of such error it'll be much more easier
>>> to debug/catch it (but as Staffan said - we have tried to check all
>>> call paths and don't think that we'll encounter such behavior).
>>> I have modified the discussed code according to your suggestions:
>>> http://cr.openjdk.java.net/~aefimov/8032901/9/webrev.01/
>>> To abort the process the 'exitTransportWithError' function was
>>> utilized.
>>> Also I have tried to check that behavior isn't changed by running
>>> "svc" regression tests set. There was no related test failures
>>> observed.
>>>
>>> Best Regards,
>>> Aleksej
>>>
>>> On 05/15/2014 01:11 PM, Staffan Larsen wrote:
>>>> On 15 maj 2014, at 03:48, David Holmes <david.holmes at oracle.com>
>>>> wrote:
>>>>
>>>>> On 14/05/2014 11:18 PM, Aleksej Efimov wrote:
>>>>>> David, Vitaly,
>>>>>>
>>>>>> I totally agree with Vitaly's explanation (Vitaly - thank you for
>>>>>> that)
>>>>>> and code in shmemBase.c (the usage of enterMutex() function for
>>>>>> sending/receiving bytes through shared memory connection)
>>>>>> illustrates on
>>>>>> how the connection shutdown event is used as a "cancellation token".
>>>>> Thanks for clarifying that.
>>>>>
>>>>> So if we were to encounter an abandoned mutex the code would
>>>>> presently have acquired the mutex but return an error, thus
>>>>> preventing a subsequent release, and preventing other threads from
>>>>> acquiring (but allowing current thread to recurisevely acquire. So
>>>>> this could both hang and cause data corruption.
>>>>>
>>>>> The new code will still return an error but release the mutex. So
>>>>> no more hangs (other than by conditions caused by data corruption)
>>>>> but more opportunity for data corruption.
>>>>>
>>>>> Obviously it depends on exactly what happens in the critical
>>>>> sections guarded by this mutex, but in general I don't agree with
>>>>> this rationale for making the change:
>>>>>
>>>>> 204 /* If the mutex is abandoned we want to return an error
>>>>> 205 * and also release the mutex so that we don't cause
>>>>> 206 * other threads to be blocked. If a mutex was abandoned
>>>>> 207 * we are in scary state. Data may be corrupted or
>>>>> inconsistent
>>>>> 208 * but it is still better to let other threads run (and
>>>>> possibly
>>>>> 209 * crash) than having them blocked (on the premise that a
>>>>> crash
>>>>> 210 * is always easier to debug than a hang).
>>>>>
>>>>> Considering something has to have gone drastically wrong for the
>>>>> mutex to become abandoned, I'm more inclined to consider this a
>>>>> fatal error and abort.
>>>>>
>>>>> But I'll let the serviceability folk chime in here.
>>>> I was involved in fixing this and writing the comment, so obviously
>>>> I thought it a good solution :-)
>>>>
>>>> I do agree that it would probably be a good idea to consider this a
>>>> fatal error and abort. At that point in the code we don’t have any
>>>> really nice ways of doing that, though. We could just print an
>>>> error and call abort(). What we are doing now is returning an error
>>>> from sysIPMutexEnter() and delegating the error handling to the
>>>> caller. We have tried to check all call paths to verify that they
>>>> do “the right thing” in the face of an error. It is obviously hard
>>>> to verify, but it looks like they all terminate the connection with
>>>> some kind of error message.
>>>>
>>>> /Staffan
>>>>
>>>>
>>>>> Thanks,
>>>>> David
>>>>>
>>>>>
>>>>>> Thank you,
>>>>>> -Aleksej
>>>>>>
>>>>>>
>>>>>> On 05/14/2014 01:05 PM, David Holmes wrote:
>>>>>>> On 14/05/2014 11:06 AM, Vitaly Davidovich wrote:
>>>>>>>> In windows, you acquire a mutex by waiting on it using one of
>>>>>>>> the wait
>>>>>>>> functions, one of them employed in the code in question. If
>>>>>>>> WaitForMultipleObjects succeeds and returns the index of the
>>>>>>>> mutex,
>>>>>>>> current thread has ownership now.
>>>>>>> Yes I understand the basic mechanics :)
>>>>>>>
>>>>>>>> It's also common to use multi wait functions where the event is a
>>>>>>>> "cancelation token", e.g. manual reset event; this allows
>>>>>>>> someone to
>>>>>>>> cancel waiting on mutex acquisition and return from the wait
>>>>>>>> function.
>>>>>>>> Presumably that's the case here, but I'll let Aleksej confirm;
>>>>>>>> just
>>>>>>>> wanted to throw this out there in the meantime :).
>>>>>>> Ah I see - yes cancellable lock acquisition would make sense.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> David
>>>>>>>
>>>>>>>> Sent from my phone
>>>>>>>>
>>>>>>>> On May 13, 2014 6:46 PM, "David Holmes" <david.holmes at oracle.com
>>>>>>>> <mailto:david.holmes at oracle.com>> wrote:
>>>>>>>>
>>>>>>>> Hi Aleksej,
>>>>>>>>
>>>>>>>> Thanks for the doc references regarding abandonment.
>>>>>>>>
>>>>>>>> Let me rephrase my question. What is this logic trying to
>>>>>>>> achieve by
>>>>>>>> waiting on both a mutex and an event? Do we already own the
>>>>>>>> mutex
>>>>>>>> when this function is called?
>>>>>>>>
>>>>>>>> David
>>>>>>>>
>>>>>>>> On 13/05/2014 11:19 PM, Aleksej Efimov wrote:
>>>>>>>>
>>>>>>>> David,
>>>>>>>>
>>>>>>>> The Windows has a different terminology for mutex
>>>>>>>> objects (much
>>>>>>>> differs
>>>>>>>> from the POSIX one). This one link gave me some
>>>>>>>> understanding of
>>>>>>>> it [1].
>>>>>>>>
>>>>>>>> Here is the MSDN [1] description of what "abandoned
>>>>>>>> mutex" is:
>>>>>>>> " If a thread terminates without releasing its
>>>>>>>> ownership of a
>>>>>>>> mutex
>>>>>>>> object, the mutex object is considered to be abandoned. A
>>>>>>>> waiting thread
>>>>>>>> can acquire ownership of an abandoned mutex object, but
>>>>>>>> the wait
>>>>>>>> function will return*WAIT_ABANDONED*to indicate that
>>>>>>>> the mutex
>>>>>>>> object is
>>>>>>>> abandoned. An abandoned mutex object indicates that an
>>>>>>>> error has
>>>>>>>> occurred and that any shared resource being protected
>>>>>>>> by the
>>>>>>>> mutex
>>>>>>>> object is in an undefined state. If the thread proceeds as
>>>>>>>> though the
>>>>>>>> mutex object had not been abandoned, it is no longer
>>>>>>>> considered
>>>>>>>> abandoned after the thread releases its ownership. This
>>>>>>>> restores
>>>>>>>> normal
>>>>>>>> behavior if a handle to the mutex object is subsequently
>>>>>>>> specified in a
>>>>>>>> wait function."
>>>>>>>>
>>>>>>>>
>>>>>>>> What does it mean to wait on mutex and ownership of the
>>>>>>>> mutex
>>>>>>>> object:
>>>>>>>> "Any thread with a handle to a mutex object can use one of
>>>>>>>> thewait
>>>>>>>> functions
>>>>>>>> <http://msdn.microsoft.com/en-__gb/library/windows/desktop/__ms687069%28v=vs.85%29.aspx
>>>>>>>>
>>>>>>>> <http://msdn.microsoft.com/en-gb/library/windows/desktop/ms687069%28v=vs.85%29.aspx>>to
>>>>>>>>
>>>>>>>> request ownership of the mutex object. If the mutex
>>>>>>>> object is
>>>>>>>> owned by
>>>>>>>> another thread, the wait function blocks the requesting
>>>>>>>> thread
>>>>>>>> until the
>>>>>>>> owning thread releases the mutex object using
>>>>>>>> the*ReleaseMutex*
>>>>>>>> <http://msdn.microsoft.com/en-__gb/library/windows/desktop/__ms685066%28v=vs.85%29.aspx
>>>>>>>>
>>>>>>>> <http://msdn.microsoft.com/en-gb/library/windows/desktop/ms685066%28v=vs.85%29.aspx>>__function."
>>>>>>>>
>>>>>>>>
>>>>>>>> How we can release mutex and wait on already owned mutex:
>>>>>>>> " After a thread obtains ownership of a mutex, it can
>>>>>>>> specify
>>>>>>>> the same
>>>>>>>> mutex in repeated calls to the wait-functions
>>>>>>>> <http://msdn.microsoft.com/en-__gb/library/windows/desktop/__ms687069%28v=vs.85%29.aspx
>>>>>>>>
>>>>>>>> <http://msdn.microsoft.com/en-gb/library/windows/desktop/ms687069%28v=vs.85%29.aspx>>__without
>>>>>>>>
>>>>>>>> blocking its execution. This prevents a thread from
>>>>>>>> deadlocking
>>>>>>>> itself
>>>>>>>> while waiting for a mutex that it already owns. To
>>>>>>>> release its
>>>>>>>> ownership
>>>>>>>> under such circumstances, the thread must
>>>>>>>> call*ReleaseMutex*
>>>>>>>> <http://msdn.microsoft.com/en-__gb/library/windows/desktop/__ms685066%28v=vs.85%29.aspx
>>>>>>>>
>>>>>>>> <http://msdn.microsoft.com/en-gb/library/windows/desktop/ms685066%28v=vs.85%29.aspx>>__once
>>>>>>>>
>>>>>>>> for each time that the mutex satisfied the conditions
>>>>>>>> of a wait
>>>>>>>> function."
>>>>>>>>
>>>>>>>> [1]
>>>>>>>> http://msdn.microsoft.com/en-__gb/library/windows/desktop/__ms684266(v=vs.85).aspx
>>>>>>>>
>>>>>>>> <http://msdn.microsoft.com/en-gb/library/windows/desktop/ms684266(v=vs.85).aspx>
>>>>>>>>
>>>>>>>>
>>>>>>>> -Aleksej
>>>>>>>>
>>>>>>>> On 05/13/2014 04:00 PM, David Holmes wrote:
>>>>>>>>
>>>>>>>> I don't understand this one at all. What is an
>>>>>>>> "abandoned
>>>>>>>> mutex"? For
>>>>>>>> that matter why does the code wait on a mutex and
>>>>>>>> an event?
>>>>>>>> Do we
>>>>>>>> already own the mutex? If so what does it mean to
>>>>>>>> wait on
>>>>>>>> it? If not
>>>>>>>> then how can we release it?
>>>>>>>>
>>>>>>>> ???
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> David
>>>>>>>>
>>>>>>>>
>>>>>>>> On 13/05/2014 8:57 PM, Alan Bateman wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>> This is debugger's shared memory transport so
>>>>>>>> cc'ing
>>>>>>>> serviceability-dev
>>>>>>>> as that is there this code is maintained.
>>>>>>>>
>>>>>>>> Is there a test case or any outline of the
>>>>>>>> conditions
>>>>>>>> that cause this? I
>>>>>>>> think that would be useful to understand the issue
>>>>>>>> further.
>>>>>>>>
>>>>>>>> -Alan
>>>>>>>>
>>>>>>>> On 13/05/2014 11:46, Aleksej Efimov wrote:
>>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> Can I have a review for 8032901 bug [1] fix
>>>>>>>> [2].
>>>>>>>> There is a possible
>>>>>>>> case when 'WaitForMultipleObjects' function
>>>>>>>> can
>>>>>>>> return the
>>>>>>>> WAIT_ABANDONED_0 [3] error value.
>>>>>>>> In such case it's better to release the
>>>>>>>> mutex and
>>>>>>>> return error value.
>>>>>>>> This will prevent other threads to be
>>>>>>>> blocked on
>>>>>>>> abandoned mutex.
>>>>>>>>
>>>>>>>> Thank you,
>>>>>>>> Aleksej
>>>>>>>>
>>>>>>>> [1]
>>>>>>>> https://bugs.openjdk.java.net/__browse/JDK-8032901
>>>>>>>> <https://bugs.openjdk.java.net/browse/JDK-8032901>
>>>>>>>> [2]
>>>>>>>> http://cr.openjdk.java.net/~__aefimov/8032901/9/webrev.00/
>>>>>>>> <http://cr.openjdk.java.net/~aefimov/8032901/9/webrev.00/>
>>>>>>>> [3]
>>>>>>>> http://msdn.microsoft.com/en-__gb/library/windows/desktop/__ms687025(v=vs.85).aspx
>>>>>>>>
>>>>>>>> <http://msdn.microsoft.com/en-gb/library/windows/desktop/ms687025(v=vs.85).aspx>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>
>>
More information about the serviceability-dev
mailing list