RFR 8160892: VM warning: WaitForMultipleObjects timed out

Thu Jul 14 21:31:12 UTC 2016

On 7/14/16 12:46 PM, Ivan Gerasimov wrote:
> Thank you David for looking into this!
>
> Here's the webrev updated in accordance with your and Daniel's 
> suggestions:
> http://cr.openjdk.java.net/~igerasim/8160892/01/webrev/

src/os/windows/vm/os_windows.cpp
     L3900:     bool registered = false;
         Please consider adding a comment above this variable:

         // We only attempt to register threads until a process exiting
         // thread manages to set the process_exiting flag. Any threads
         // that come through here after the process_exiting flag is set
         // are unregistered and will be caught in the SuspendThread()
         // infinite loop below.

         My request is different than David's. I prefer the comment up
         here by the 'registered' variable because that variable is
         key for getting into the SuspendThread() infinite loop. The
         comment also occurs before all the code paths that you have
         trace out with the different Ept types and flags states.

I'm really hoping that this is the last tweak we need to make to
clearly comment what we're doing to alleviate this OS race.
The number of occurrences of the underlying bug keep going down
with every iteration so we're getting close.

Dan

>
> Please see my answers inline
>
>
>> Nit: can we change 'registered_itself" to just "registered" please.
> Done.
>
>>
>> Can you explain under what conditions a thread will now reach the 
>> self-suspension code. Is that only if an error occurred such that we 
>> were unable to register our handle for the process-exiting thread to 
>> wait on? If so some commentary on that block seems appropriate - 
>> perhaps more appropriate there than back up at the place where it 
>> failed to get the handle (as Dan requested).
>>
> There are three kinds of threads, which can be caught in that 
> self-suspension loop:
> 1) All threads that want to end (by calling _endthreadex()) *after* 
> some process-exiting thread raised the flag `process_exiting`.
> The rationale here is that we know that the whole process is going to 
> be terminated quite soon, so we do not allow any thread to call 
> _endthreadex(), which seems to have the concurrency bug.
> 2) Any thread that wants to end the whole process, after some other 
> thread raised the flag `process_exiting`.
> If more than one threads want to end the process, we let to do it only 
> the thread that could raise the flag `process_exiting`.  All other 
> such threads will have to suspend themselves.
> 3) (Unlikely to happen in practice) Any thread that wants to end by 
> calling _endthreadex(), but which failed to register itself due to 
> failure of DuplicateHandle().
> Here we still have a race, which can result in a wrong exit code of 
> the process.
>
>> Given we keep missing conditions I'm only cautiously optimistic about 
>> this.
>> And I'd like to understand how we still sometimes end up exiting with 
>> an "error code" that seems to be the value of an exception! :(
>
> The last time the sentinel exit code =20115 was reported almost a year 
> ago.
> After that the fix for JDK-8145127 had gone in, and I didn't see any 
> more reports about wrong exit codes since then.
> In particular, that fix worked around the situation when more than one 
> threads concurrently call System.exit(), which might have caused a race.
>
> With kind regards,
> Ivan
>
>>
>> Thanks,
>> David
>>
>>> With kind regards,
>>> Ivan
>>>
>>
>