RFR(S): 8199811: com/sun/jdi/ProcessAttachTest.java fails intermittently: Remote thread failed for unknown reason

David Holmes david.holmes at oracle.com
Mon Aug 6 07:31:01 UTC 2018


Hi Chris,

On 6/08/2018 2:28 PM, Chris Plummer wrote:
> Hi David,
> 
> On 8/5/18 5:33 PM, David Holmes wrote:
>> Hi Chris,
>>
>> On 4/08/2018 4:23 AM, Chris Plummer wrote:
>>> Hello,
>>>
>>> Please review the following fix for JDK12:
>>>
>>> http://cr.openjdk.java.net/~cjplummer/8199811/webrev.00
>>>
>>> https://bugs.openjdk.java.net/browse/JDK-8199811
>>>
>>> The root of the problem is that there is no code in the solaris or 
>>> windows AttachListener support that ensures that the listener is done 
>>> being initialized before attaching and  attempting to enqueue the 
>>> first command. The enqueue operation fails when is sees that the 
>>> listener is not attached yet.
>>>
>>> I was able to force this failure to happen every time by adding a 10 
>>> second sleep in attach_listener_thread_entry() just before the call 
>>> to AttachListener::set_initialized(). This did not cause macosx or 
>>> linux to fail, but did make solaris fail (failures had not been noted 
>>> previously) and windows to fail (failures previously had been 
>>> observed, but very rarely).
>>
>> I'm having trouble seeing the complete code paths here to understand 
>> the control flow for initialization and subsequent use. How do we get 
>> to the enqueue logic (that fails) if the initialization logic has not 
>> yet completed? Is the init logic asynchronous? (If so I would expect 
>> many more failures of this nature.)
> It's seems to be very convoluted on Windows, and somewhat on Solaris 
> too. Took me a while to figure it out enough to fix this issue. Yes, it 
> appears the init logic is async w.r.t. the enqueue logic. On Windows the 
> init logic is always triggered during VM startup. I had previously 
> inserted some pns2() calls to see how it gets initialized. Here's what 
> it produced:
> 
> V  [jvm.dll+0x30b7fd]  AttachListener::init+0x2dd  (attachlistener.cpp:466)
> V  [jvm.dll+0xd59513]  Threads::create_vm+0x873  (thread.cpp:3817)
> V  [jvm.dll+0x862e71]  JNI_CreateJavaVM_inner+0xd1  (jni.cpp:3945)
> V  [jvm.dll+0x8667af]  JNI_CreateJavaVM+0x1f  (jni.cpp:4036)
> C  [java.exe+0x36df]
> C  [java.exe+0x19fcc]
> C  [KERNEL32.DLL+0x13d2]
> C  [ntdll.dll+0x154f4]
> 
> V  [jvm.dll+0x30c473]  AttachListener::pd_init+0x23  
> (attachlistener_windows.cpp:394)
> V  [jvm.dll+0x30a6fc]  attach_listener_thread_entry+0x17c  
> (attachlistener.cpp:352)
> V  [jvm.dll+0xd622ac]  JavaThread::run+0x3ec  (thread.cpp:1732)
> V  [jvm.dll+0xba191e]  thread_native_entry+0x11e  (os_windows.cpp:456)
> C  [ucrtbase.DLL+0x1d885]
> C  [KERNEL32.DLL+0x13d2]
> C  [ntdll.dll+0x154f4]
> 
> So AttachListener::init() starts a new thread with 
> attach_listener_thread_entry() as the entrypoint. That's also the method 
> that eventually calls  AttachListener::set_initialized(), and also where 
> I stuck the 10 second delay before the call to 
> AttachListener::set_initialized() to make it fail every time on Windows 
> and Solaris. AttachListener::init() only starts the listener thread. It 
> does not wait for it to become initialized.

Got it. With init-on-startup you would expect it would not take very 
long for the attach thread to complete initialization but of course 
there is no guarantee at all that it will do so.

> The enqeue() is done in a very strange way on Windows that I don't fully 
> understand. I could never get a native stack trace since it's done from 
> a thread that is not attached to the JVM.  From the attaching side  you 
> get the following exception when the enqueue() fails:
> 
> java.lang.InternalError: Remote thread failed for unknown reason (100)
>      at jdk.attach/sun.tools.attach.VirtualMachineImpl.enqueue(Native 
> Method)
>      at 
> jdk.attach/sun.tools.attach.VirtualMachineImpl.execute(VirtualMachineImpl.java:106) 
> 
>      at 
> jdk.attach/sun.tools.attach.HotSpotVirtualMachine.executeCommand(HotSpotVirtualMachine.java:309) 
> 
>      at 
> jdk.attach/sun.tools.attach.HotSpotVirtualMachine.getAgentProperties(HotSpotVirtualMachine.java:202) 
> 
>      at 
> jdk.jdi/com.sun.tools.jdi.ProcessAttachingConnector.attach(ProcessAttachingConnector.java:103) 
> 
>      at ProcessAttachTest.tryDebug(ProcessAttachTest.java:107)
>      at ProcessAttachTest.runTest(ProcessAttachTest.java:87)
>      at ProcessAttachTest.main(ProcessAttachTest.java:66)
>      at 
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native 
> Method)
>      at 
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) 
> 
>      at 
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) 
> 
>      at java.base/java.lang.reflect.Method.invoke(Method.java:566)
>      at 
> com.sun.javatest.regtest.agent.MainActionHelper$SameVMRunnable.run(MainActionHelper.java:229) 
> 
>      at java.base/java.lang.Thread.run(Thread.java:834)
> 
> If you look at the windows 
> Java_sun_tools_attach_VirtualMachineImpl_enqueue() method, you'll see 
> how it creates a process in the target VM, and has it execute 
> JVM_EnqueueOperation(), which calls Win32AttachListener::enqueue(), 
> which is where the return code of 100 (ATTACH_ERROR_DISABLED) is 
> produced, and also where I put my sleep code to give initialization a 
> chance to complete.

Yeah I still find this path hard to work through. :)

>>
>>> The proposed fix is to have the enqueue code sleep for up to 20 
>>> seconds (in 1 second intervals) waiting for initialization to be 
>>> complete. I found this fixed the problem, even with the 10 second 
>>> sleep in attach_listener_thread_entry() still in place. A shorter 
>>> sleep is probably fine. I'm open to suggestions. Since this timing 
>>> issue was so rare, my guess is that a single 1 second sleep is likely 
>>> to always fix it, but since it is so hard to reproduce (without the 
>>> 10 second sleep in place), I can't say for sure.
>>
>> That seems reasonable.
>>
>> Not sure what the interruption issue is that you and Gary discussed. 
>> The os-level sleep function can only be interrupted by signals, and 
>> this thread shouldn't be receiving any signals in general.So it's not 
>> something I would be concerned about.
> Ok.
>>
>>> Another approach to fixing this would be to use some sort of 
>>> synchronization between the init and enqueue code, like a condition 
>>> variable. I think I know how to do this with pthread_cond_wait() and 
>>> pthread_cond_signal(), although it gets to be a bit tricky since I'd 
>>> probably have to make the enqueue code create the condvar if 
>>> initialization is not yet complete, and then have the initialization 
>>> code check for the existence of the condvar when initialization is 
>>> complete, and signal on it if it exists. I'm pretty sure there's a 
>>> potential for race condition in there. I haven't thought it through 
>>> enough to say for sure. I also looked a bit at condition variable 
>>> support on windows, and it looks like I could do something similar 
>>> there too. However, I think the sleep approach I have implement is 
>>> far more straight forward and less error prone, so I'd prefer to 
>>> stick with it if others approve.
>>
>> Can't comment on this without understanding exactly where the race is.
> Ok. Let me know if you need any more details.

It would be easy to use a Semaphore to sync the thread start with 
completion of the AttachListener initialization - but that would incur a 
startup penalty that we definitely would not want to do.

Synchronizing the first "enqueue" requires a much more detailed 
understanding of the enqueuing process than what I currently have, or 
have time to glean. So the pseudo-synchronization using sleep seems 
quite adequate.

Thanks,
David

> thanks,
> 
> Chris
>>
>> Thanks,
>> David
>>
>>>
>>> thanks,
>>>
>>> Chris
>>>
>>>
> 
> 


More information about the serviceability-dev mailing list