RFR 8059533: (process) Make exiting process wait for exiting threads [win]

Wed Oct 29 09:24:25 UTC 2014

Thanks David for the comments!

On 28.10.2014 6:06, David Holmes wrote:
> Thanks for the explanation Ivan, but I don't see any comments added to 
> the code.
>
I've updated the webrev to include the comment that explain the logic of 
the workaround
http://cr.openjdk.java.net/~igerasim/8059533/1/webrev/

Would you please take a look to see if they look sensible?

> So this latest fix all hinges on whether the part of the exit logic 
> that corrupts the process exit value, happens before or after the 
> logic that will cause a thread waiting on a terminating thread's 
> handle to unblock. If after then we still have the race - but at least 
> the exiting thread has a slight head start over the process 
> terminating thread.
>

I have a couple of thought about this:

First, the thread that ends the process waits for all the exiting 
threads to complete. By the time it returns from WaitForMultipleObjects, 
those exiting threads aren't running anymore, so the race is avoided 
(unless it's the race with scheduler).

Second, the previous attempt, which utilized abandoning mutexes, reduced 
the frequency of this bug occurrences.
The latest proposal extends the portion of synchronized code (I've 
checked that abandoning the mutex owned by an exiting thread normally 
happens before WaitFor(... exiting thread...) returns).
Thus, if the race had happened during this time window, it should be 
eliminated now.

Sincerely yours,
Ivan

> Thanks,
> David
>
> On 27/10/2014 6:35 PM, Ivan Gerasimov wrote:
>>
>> On 27.10.2014 3:36, David Holmes wrote:
>>> On 27/10/2014 1:15 AM, Ivan Gerasimov wrote:
>>>> David, would you approve this fix?
>>>
>>> Sorry Ivan I'm having trouble following the logic this time - could
>>> you add some comments about what we are checking at each step.
>>
>> Yes, sure.
>>
>> The main idea is to make the thread that ends the process wait for the
>> threads that had finished so far.
>> Thus, we have an array for storing the thread handles.
>> Any thread that is on thread-exit path, first tries to remove the
>> completed threads from the array (to keep the list smaller), and then
>> adds its own handle to the end of the array.
>> The thread that is on process-exit path, calls exit (or _exit), while
>> still owning the critical section.
>> This way we make sure, no other threads execute any exit-related code at
>> the same time.
>>
>> Here's a typical scenario:
>> 1) First thread that decided to end itself calls
>> exit_process_or_thread() -- let's assume it is on thread-exit path.
>> Initializes the critical section.
>> 2) Grabs the ownership of the crit. section
>> 3) The list of thread handles is initially empty, so the thread adds a
>> duplicate of its handle to the array.
>> 4) Releases the crit. section
>> 5) Calls _endthreadex() to terminate itself
>>
>> 6) Another thread enters exit_process_or_thread() -- let it be on
>> thread-exit path as well.
>> 7) Grabs the ownership of the crit. section
>> 8) In a loop checks if any previously ended thread has completed.
>> Here we call WaitForSingleObject with zero timeout, so we don't block.
>> All the handles of completed threads are closed.
>> 9) If there's is a free slot in the array, the thread adds its handle to
>> the end
>> 10) If the array is full (which is very unlikely), the thread waits for
>> ANY thread to complete, and then adds itself to the array.
>> 11) Releases the crit. section
>> 12) Calls _endthreadex() to terminate itself
>>
>> 13) Some thread enters exit_process_or_thread() in order to end the
>> whole process.
>> 14) Grabs the ownership of the crit. section
>> 15) Waits on all the threads that have added their handles to the array
>> (typically there will be only one such thread handle).
>> Since the ownership of the critical section is held, no other threads
>> will execute any exit-related code at this time.
>> 16) Once all the threads from the list have completed, the thread closes
>> the handles and calls exit() (or _exit()), holding the crit. section
>> ownership.
>>
>> We're done.
>>
>> Error handling: in a case of errors, we report them, and proceed with
>> exiting as usual.
>> - If initialization of critical section fails, we'll just call the
>> corresponding exit routine.
>> - If we failed, waiting for an exiting thread to complete, close its
>> handle as if it has completed.
>> - If we failed, waiting for any thread to complete withing a time-out
>> (array is full), close all the handles and continue as if there were no
>> threads exited before.
>> - If we couldn't duplicate the handle, ignore it (don't add it to the
>> array), so no one will wait for it later.
>> - If the thread on the process-exit path failed to wait for the threads
>> to complete withing the time-out, proceed to the exit anyway.
>>
>> All these errors should never happen during normal execution, but if
>> they do, we still try to end threads/process in a way it's done now.
>> In this, later case, we are at risk of observing a race condition.
>> However, the chances of this happening are much lesser, and in addition
>> we'll have a waring message to analyze.
>>
>> Possible bottlenecks.
>> 1) All the threads have to obtain the ownership of the critical section,
>> which effectively serializes all the exiting threads.
>> However, this doesn't appear to make things too much slower, as all the
>> threads already do similar thing in _endthreadex().
>> 2) Normally, the threads don't block having ownership of the crit. 
>> section.
>> The block can only happen if there's no free slot in the array of 
>> handles.
>> This can only happen if MAX_EXIT_HANDLES (== 16) threads have just
>> called _endthreadex(), and none of them completed.
>> 3) When the thread at process-exit path waits for all the exiting
>> threads to complete, the time-out of 1 second is specified.
>> If any of those threads do not complete, this can lead to that the
>> application is delayed at the exit.
>> However, we don't block forever, and the delay can only be observed upon
>> a failure.
>>
>>
>>> Also we seem to exit while still holding the critical section - how
>>> does that work?
>>>
>> Right.
>> We make the thread at the process-exit path call exit() from withing
>> critical section block.
>> This way it is ensured no other exit-related code is executed at the
>> same moment, and a race is avoided.
>>
>> Sincerely yours,
>> Ivan
>>
>>> Thanks,
>>> David
>>>
>>>> Sincerely yours,
>>>> Ivan
>>>>
>>>> On 26.10.2014 19:01, Daniel D. Daugherty wrote:
>>>>> On 10/25/14 12:23 PM, Ivan Gerasimov wrote:
>>>>>>
>>>>>> On 25.10.2014 3:06, Daniel D. Daugherty wrote:
>>>>>>> On 10/1/14 3:07 AM, Ivan Gerasimov wrote:
>>>>>>>> Hello!
>>>>>>>>
>>>>>>>> The tests that continue to fail with wrong exit codes suggest that
>>>>>>>> the fix for JDK-8057744 wasn't sufficient.
>>>>>>>> Here's another proposal, which expands the synchronized portion of
>>>>>>>> the code.
>>>>>>>> It is proposed to make the exiting process wait for the threads
>>>>>>>> that have already started exiting.
>>>>>>>> This should help to make sure that no thread is executing any
>>>>>>>> potentially racy code concurrently with the exiting process.
>>>>>>>>
>>>>>>>> BUGURL: https://bugs.openjdk.java.net/browse/JDK-8059533
>>>>>>>> WEBREV: http://cr.openjdk.java.net/~igerasim/8059533/0/webrev/
>>>>>>>
>>>>>>> Finally got a chance to look at the official version of fix.
>>>>>>>
>>>>>>> Thumbs up!
>>>>>>>
>>>>>>> src/os/windows/vm/os_windows.cpp
>>>>>>>     No comments.
>>>>>>>
>>>>>> Thank you Daniel!
>>>>>>
>>>>>> I assume the change needs the second hotspot reviewer?
>>>>>
>>>>> Yes, HotSpot changes always need two reviewers. David Holmes
>>>>> chimed in on this thread. You should ask him if he can be
>>>>> counted as a reviewer.
>>>>>
>>>>>
>>>>>> What would be the best time for pushing this fix?
>>>>>
>>>>> Let's go for Wednesday again so we have a full week of testing
>>>>> to evaluate this latest tweak.
>>>>>
>>>>> Dan
>>>>>
>>>>>
>>>>>>
>>>>>> Sincerely yours,
>>>>>> Ivan
>>>>>>
>>>>>>> Dan
>>>>>>>
>>>>>>> P.S.
>>>>>>> We had another sighting of an exit_code == 60115 test failure
>>>>>>> this past week so while your previous fix greatly reduced the
>>>>>>> odds of this race, I'm looking forward to seeing this new
>>>>>>> version in action...
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> Comments, suggestion are welcome!
>>>>>>>>
>>>>>>>> Sincerely yours,
>>>>>>>> Ivan
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>
>>>
>>
>
>