RFR 8069048: (process) Suspend finishing threads when process exits [win]

Fri Jan 16 07:26:27 UTC 2015

Thank you Daniel for your review!

On 16.01.2015 3:01, Daniel D. Daugherty wrote:
> On 1/15/15 5:09 AM, Ivan Gerasimov wrote:
>> Hello everyone!
>>
>> This is yet another iteration in the attempt to solve the 'wrong exit 
>> code' bug on Windows [1].
>> The wrong exit code has been observed once again with one of the 
>> regression tests.
>>
>> The suspicion is that this might be due to the critical section had 
>> been destroyed before all the children threads were terminated.
>> In that case, one of the threads had been unblocked and called 
>> _endthreadex(), which resulted in a race.
>>
>> To address this scenario, it is proposed to make the thread that is 
>> about to exit the process raise a flag.
>> If the flag is raised, any threads wishing to exit should instead 
>> suspend themselves.
>>
>> BUGURL: https://bugs.openjdk.java.net/browse/JDK-8069048
>> WEBREV: http://cr.openjdk.java.net/~igerasim/8069048/0/webrev/
>
> src/os/windows/vm/os_windows.cpp
>     line 3895: // don't let the current thread to proceed to 
> _endthreadex()
>         Typo: 'let the current thread to proceed to'
>            -> 'let the current thread proceed to'
>
>     Just making sure that I understand the revised algorithm:
>
>     - before the EPT_PROCESS thread gets here, EPT_THREAD threads
>       will work as before and call line 3909 _endthreadex()
>
>     - after the EPT_PROCESS thread gets here and sets the flag
>       on line 3886: OrderAccess::release_store(&process_exiting, 1);
>
>       - an EPT_THREAD thread may have made it past flag check on line
>         3802: } else if (OrderAccess::load_acquire(&process_exiting) 
> == 0) {
>         but it will be blocked on line 3803: 
> EnterCriticalSection(&crit_sect);
>
>       - an EPT_THREAD thread that sees the flag set on line 3802 will
>         drop into the self-suspend block on lines 3892-3900
>
>     - after the EPT_PROCESS thread exits the critical section, then
>       any EPT_THREAD threads that were blocked trying to acquire
>       the critical section will now see the flag set on line 3805:
>       if (what == EPT_THREAD && 
> OrderAccess::load_acquire(&process_exiting) == 0) {
>       and drop into the self-suspend block on lines 3892-3900
>
>     Short version: any EPT_THREAD threads that arrive after the
>     EPT_PROCESS thread owns the critical section will never call
>     line 3909 _endthreadex() because they self-suspend.
>

Yes, the logic is meant to be precisely as you're describing.

> OK, I concur that this new algorithm looks correct and will reduce
>     the number of threads racing through line 3909 _endthreadex() while
>     the EPT_PROCESS thread is trying to call exit().
>
>     One possible hole remains that we've discussed before. If an
>     EPT_THREAD thread calls _endthreadex() before the EPT_PROCESS
>     thread gets here, and if the EPT_THREAD thread stalls in
>     _endthreadex(), then it's still possible for that EPT_THREAD
>     thread to mess up the exit code if it unblocks after the
>     EPT_PROCESS thread has set the exit code.
>
The EPT_PROCESS thread is waiting for those EPT_THREAD threads that had 
called _endthreadex() at
3874 res = WaitForMultipleObjects(handle_count, handles, TRUE, 
EXIT_TIMEOUT).
This can timeout , of course, so yes, the window for a race is still open.

> We've discussed this
>     before and there's nothing we can do about other than try and
>     reduce the probability by reducing the number of EPT_THREAD
>     threads that are calling _endthreadex().
>
>     Thumbs up!
>
>
> Side note: A new failure of this mechanism was filed recently:
>
> JDK-8069068 VM warning: WaitForMultipleObjects timed out (0) ...
> https://bugs.openjdk.java.net/browse/JDK-8069068
>
>     The bug was filed against JDK9-B45 so it has the most recent
>     fix (https://bugs.openjdk.java.net/browse/JDK-8066863). The
>     weird part is that WaitForMultipleObjects() timed out without
>     an error code being set. Don't know if that means anything in
>     particular in the Win* APIS...
>
>     This fix (8069048) may also reduce the probability of this
>     failure mode because we'll be queueing fewer threads on the
>     handle list...
>
Right. If this failure had happened during the app termination, the new 
logic might have helped.

It still looks surprising though, as the warning indicates that none of 
64 threads that have already called _endthreadex() could not complete 
execution during 4 second!
And one of those threads had the priority above normal.
Very strange. I need to try to reproduce this kind of failure locally.

Sincerely yours,
Ivan