RFR 8059533: (process) Make exiting process wait for exiting threads [win]

Mon Oct 27 08:35:42 UTC 2014

On 27.10.2014 3:36, David Holmes wrote:
> On 27/10/2014 1:15 AM, Ivan Gerasimov wrote:
>> David, would you approve this fix?
>
> Sorry Ivan I'm having trouble following the logic this time - could 
> you add some comments about what we are checking at each step.

Yes, sure.

The main idea is to make the thread that ends the process wait for the 
threads that had finished so far.
Thus, we have an array for storing the thread handles.
Any thread that is on thread-exit path, first tries to remove the 
completed threads from the array (to keep the list smaller), and then 
adds its own handle to the end of the array.
The thread that is on process-exit path, calls exit (or _exit), while 
still owning the critical section.
This way we make sure, no other threads execute any exit-related code at 
the same time.

Here's a typical scenario:
1) First thread that decided to end itself calls 
exit_process_or_thread() -- let's assume it is on thread-exit path.
Initializes the critical section.
2) Grabs the ownership of the crit. section
3) The list of thread handles is initially empty, so the thread adds a 
duplicate of its handle to the array.
4) Releases the crit. section
5) Calls _endthreadex() to terminate itself

6) Another thread enters exit_process_or_thread() -- let it be on 
thread-exit path as well.
7) Grabs the ownership of the crit. section
8) In a loop checks if any previously ended thread has completed.
Here we call WaitForSingleObject with zero timeout, so we don't block.
All the handles of completed threads are closed.
9) If there's is a free slot in the array, the thread adds its handle to 
the end
10) If the array is full (which is very unlikely), the thread waits for 
ANY thread to complete, and then adds itself to the array.
11) Releases the crit. section
12) Calls _endthreadex() to terminate itself

13) Some thread enters exit_process_or_thread() in order to end the 
whole process.
14) Grabs the ownership of the crit. section
15) Waits on all the threads that have added their handles to the array 
(typically there will be only one such thread handle).
Since the ownership of the critical section is held, no other threads 
will execute any exit-related code at this time.
16) Once all the threads from the list have completed, the thread closes 
the handles and calls exit() (or _exit()), holding the crit. section 
ownership.

We're done.

Error handling: in a case of errors, we report them, and proceed with 
exiting as usual.
- If initialization of critical section fails, we'll just call the 
corresponding exit routine.
- If we failed, waiting for an exiting thread to complete, close its 
handle as if it has completed.
- If we failed, waiting for any thread to complete withing a time-out 
(array is full), close all the handles and continue as if there were no 
threads exited before.
- If we couldn't duplicate the handle, ignore it (don't add it to the 
array), so no one will wait for it later.
- If the thread on the process-exit path failed to wait for the threads 
to complete withing the time-out, proceed to the exit anyway.

All these errors should never happen during normal execution, but if 
they do, we still try to end threads/process in a way it's done now.
In this, later case, we are at risk of observing a race condition.
However, the chances of this happening are much lesser, and in addition 
we'll have a waring message to analyze.

Possible bottlenecks.
1) All the threads have to obtain the ownership of the critical section, 
which effectively serializes all the exiting threads.
However, this doesn't appear to make things too much slower, as all the 
threads already do similar thing in _endthreadex().
2) Normally, the threads don't block having ownership of the crit. section.
The block can only happen if there's no free slot in the array of handles.
This can only happen if MAX_EXIT_HANDLES (== 16) threads have just 
called _endthreadex(), and none of them completed.
3) When the thread at process-exit path waits for all the exiting 
threads to complete, the time-out of 1 second is specified.
If any of those threads do not complete, this can lead to that the 
application is delayed at the exit.
However, we don't block forever, and the delay can only be observed upon 
a failure.

> Also we seem to exit while still holding the critical section - how 
> does that work?
>
Right.
We make the thread at the process-exit path call exit() from withing 
critical section block.
This way it is ensured no other exit-related code is executed at the 
same moment, and a race is avoided.

Sincerely yours,
Ivan

> Thanks,
> David
>
>> Sincerely yours,
>> Ivan
>>
>> On 26.10.2014 19:01, Daniel D. Daugherty wrote:
>>> On 10/25/14 12:23 PM, Ivan Gerasimov wrote:
>>>>
>>>> On 25.10.2014 3:06, Daniel D. Daugherty wrote:
>>>>> On 10/1/14 3:07 AM, Ivan Gerasimov wrote:
>>>>>> Hello!
>>>>>>
>>>>>> The tests that continue to fail with wrong exit codes suggest that
>>>>>> the fix for JDK-8057744 wasn't sufficient.
>>>>>> Here's another proposal, which expands the synchronized portion of
>>>>>> the code.
>>>>>> It is proposed to make the exiting process wait for the threads
>>>>>> that have already started exiting.
>>>>>> This should help to make sure that no thread is executing any
>>>>>> potentially racy code concurrently with the exiting process.
>>>>>>
>>>>>> BUGURL: https://bugs.openjdk.java.net/browse/JDK-8059533
>>>>>> WEBREV: http://cr.openjdk.java.net/~igerasim/8059533/0/webrev/
>>>>>
>>>>> Finally got a chance to look at the official version of fix.
>>>>>
>>>>> Thumbs up!
>>>>>
>>>>> src/os/windows/vm/os_windows.cpp
>>>>>     No comments.
>>>>>
>>>> Thank you Daniel!
>>>>
>>>> I assume the change needs the second hotspot reviewer?
>>>
>>> Yes, HotSpot changes always need two reviewers. David Holmes
>>> chimed in on this thread. You should ask him if he can be
>>> counted as a reviewer.
>>>
>>>
>>>> What would be the best time for pushing this fix?
>>>
>>> Let's go for Wednesday again so we have a full week of testing
>>> to evaluate this latest tweak.
>>>
>>> Dan
>>>
>>>
>>>>
>>>> Sincerely yours,
>>>> Ivan
>>>>
>>>>> Dan
>>>>>
>>>>> P.S.
>>>>> We had another sighting of an exit_code == 60115 test failure
>>>>> this past week so while your previous fix greatly reduced the
>>>>> odds of this race, I'm looking forward to seeing this new
>>>>> version in action...
>>>>>
>>>>>
>>>>>
>>>>>>
>>>>>> Comments, suggestion are welcome!
>>>>>>
>>>>>> Sincerely yours,
>>>>>> Ivan
>>>>>
>>>>>
>>>>>
>>>>
>>>
>>>
>>>
>>
>
>