RFR(S) 8249293: Unsafe stackwalk in VM_GetOrSetLocal::doit_prologue()

Reingruber, Richard richard.reingruber at sap.com
Tue Aug 11 09:50:13 UTC 2020


Hi David,

thanks for looking at this. I've prepared a new webrev based on your feedback.

Webrev: http://cr.openjdk.java.net/~rrich/webrevs/8249293/webrev.3/
Delta:  http://cr.openjdk.java.net/~rrich/webrevs/8249293/webrev.3.inc/

I'm answering your points inline below.

Thanks,
Richard.

> On 31/07/2020 5:28 pm, Reingruber, Richard wrote:
> > Hi,
> > 
> > I rebase the fix after JDK-8250042.
> > 
> > New webrev: http://cr.openjdk.java.net/~rrich/webrevs/8249293/webrev.2/

> The general fix for this seems good. A minor nit:

>   588     if (!is_assignable(signature, ob_k, Thread::current())) {

> You know that the current thread is the VMThread so can use 
> VMThread::vm_thread().

> Similarly for this existing code:

>   694     Thread* current_thread = Thread::current();

> ---

Done.

> Looking at the test code ... I'm less clear on exactly what is happening

The @comment in GetLocalWithoutSuspendTest.java describes the intent.

> and the use of spin-waits raises some red-flags for me in terms of test 
> reliability on different platforms. The "while (--waitCycles > 0)" loop 
> in particular offers no certainty that the agent thread is executing 
> anything in particular. And the use of the spin_count as a guide to 
> future waiting time seems somewhat arbitrary. In all seriousness I got 
> a headache trying to work out how the test was expecting to operate. 
> Some parts could be simplified using raw monitors, I think. But there's 
> no sure way to know the agent thread is in the midst of the stackwalk 
> when the target thread wants to leave the native code. So I understand 
> what you are trying to achieve here, I'm just not sure how reliably it 
> will actually achieve it.

I've replaced 2 of 3 spin waits with wait calls on a raw monitor. The state of
the test is kept in the global variable test_state. Agent and target thread use
it to synchronize. I hope this helps to make clear what is happening.

As you noticed the remaining spin wait cannot be replaced with a monitor
wait. It should not be eliminated either, though. Without it the target thread
might always return from the native call before the unsafe stack walk was
started. I guess you have noticed that the target thread doubles the spin cycles
in each iteration searching the sweet spot :)

About reliability: on my Windows notebook and on a Linux server the test has
never succeeded without the fix. The probability for failure could be increased
by using a thread with a larger stack, but I don't think this is necessary.

On uniprocessor systems the test might not work very reliably. I experimented
with pinning the jvm to one cpu. Executed like this the test reliably fails
without the fix as well, though.

Without modification of the jvm the test cannot be made 100% reliable. But there
are many tests like this I reckon.

> test/hotspot/jtreg/serviceability/jvmti/GetLocalVariable/libGetLocalWithoutSuspendTest.cpp

>   32 static volatile jlong spinn_count     = 0;

> Using a 64-bit counter seems like it will be a problem on 32-bit systems.

> Should be spin_count not spinn_count.

This is actually a dummy variable. I have renamed it to dummy_counter. Its
purpose is to prevent the c++ compiler from eliminating the spin wait loop.
I have replaced all longs with ints in the test.


-----Original Message-----
From: David Holmes <david.holmes at oracle.com> 
Sent: Montag, 10. August 2020 07:35
To: Reingruber, Richard <richard.reingruber at sap.com>; serguei.spitsyn at oracle.com; serviceability-dev at openjdk.java.net
Subject: Re: RFR(S) 8249293: Unsafe stackwalk in VM_GetOrSetLocal::doit_prologue()

Hi Richard,

On 31/07/2020 5:28 pm, Reingruber, Richard wrote:
> Hi,
> 
> I rebase the fix after JDK-8250042.
> 
> New webrev: http://cr.openjdk.java.net/~rrich/webrevs/8249293/webrev.2/

The general fix for this seems good. A minor nit:

  588     if (!is_assignable(signature, ob_k, Thread::current())) {

You know that the current thread is the VMThread so can use 
VMThread::vm_thread().

Similarly for this existing code:

  694     Thread* current_thread = Thread::current();

---

Looking at the test code ... I'm less clear on exactly what is happening 
and the use of spin-waits raises some red-flags for me in terms of test 
reliability on different platforms. The "while (--waitCycles > 0)" loop 
in particular offers no certainty that the agent thread is executing 
anything in particular. And the use of the spin_count as a guide to 
future waiting time seems somewhat arbitrary. In all seriousness I got 
a headache trying to work out how the test was expecting to operate. 
Some parts could be simplified using raw monitors, I think. But there's 
no sure way to know the agent thread is in the midst of the stackwalk 
when the target thread wants to leave the native code. So I understand 
what you are trying to achieve here, I'm just not sure how reliably it 
will actually achieve it.

test/hotspot/jtreg/serviceability/jvmti/GetLocalVariable/libGetLocalWithoutSuspendTest.cpp

  32 static volatile jlong spinn_count     = 0;

Using a 64-bit counter seems like it will be a problem on 32-bit systems.

Should be spin_count not spinn_count.

  36 // Agent thread waits for value != 0, then performas the JVMTI call 
to get local variable.

typo: performas

Thanks,
David
-----

> Thanks, Richard.
> 
> -----Original Message-----
> From: serviceability-dev <serviceability-dev-retn at openjdk.java.net> On Behalf Of Reingruber, Richard
> Sent: Montag, 27. Juli 2020 09:45
> To: serguei.spitsyn at oracle.com; serviceability-dev at openjdk.java.net
> Subject: [CAUTION] RE: RFR(S) 8249293: Unsafe stackwalk in VM_GetOrSetLocal::doit_prologue()
> 
> Hi Serguei,
> 
>    > I tested it on Linux and Windows but not yet on MacOS.
> 
> The test succeeded now on all platforms.
> 
> Thanks, Richard.
> 
> -----Original Message-----
> From: Reingruber, Richard
> Sent: Freitag, 24. Juli 2020 15:04
> To: serguei.spitsyn at oracle.com; serviceability-dev at openjdk.java.net
> Subject: RE: RFR(S) 8249293: Unsafe stackwalk in VM_GetOrSetLocal::doit_prologue()
> 
> Hi Serguei,
> 
>> The fix itself looks good to me.
> 
> thanks for looking at the fix.
> 
>> I still need another look at new test.
>> Could you, please, convert the agent of new test to C++?
>> It will make it a little bit simpler.
> 
> Sure, here is the new webrev.1 with a C++ version of the test agent:
> 
> http://cr.openjdk.java.net/~rrich/webrevs/8249293/webrev.1/
> 
> I tested it on Linux and Windows but not yet on MacOS.
> 
> Thanks,
> Richard.
> 
> -----Original Message-----
> From: serguei.spitsyn at oracle.com <serguei.spitsyn at oracle.com>
> Sent: Freitag, 24. Juli 2020 00:00
> To: Reingruber, Richard <richard.reingruber at sap.com>; serviceability-dev at openjdk.java.net
> Subject: Re: RFR(S) 8249293: Unsafe stackwalk in VM_GetOrSetLocal::doit_prologue()
> 
> Hi Richard,
> 
> Thank you for filing the CR and taking care about it!
> The fix itself looks good to me.
> I still need another look at new test.
> Could you, please, convert the agent of new test to C++?
> It will make it a little bit simpler.
> 
> Thanks,
> Serguei
> 
> 
> On 7/20/20 01:15, Reingruber, Richard wrote:
>> Hi,
>>
>> please help review this fix for VM_GetOrSetLocal. It moves the unsafe stackwalk from the vm
>> operation prologue before the safepoint into the doit() method executed at the safepoint.
>>
>> Webrev: http://cr.openjdk.java.net/~rrich/webrevs/8249293/webrev.0/index.html
>> Bug:    https://bugs.openjdk.java.net/browse/JDK-8249293
>>
>> According to the JVMTI spec on local variable access it is not required to suspend the target thread
>> T [1]. The operation will simply fail with JVMTI_ERROR_NO_MORE_FRAMES if T is executing
>> bytecodes. It will succeed though if T is blocked because of synchronization or executing some native
>> code.
>>
>> The issue is that in the latter case the stack walk in VM_GetOrSetLocal::doit_prologue() to prepare
>> the access to the local variable is unsafe, because it is done before the safepoint and it races
>> with T returning to execute bytecodes making its stack not walkable. The included test shows that
>> this can crash the VM if T wins the race.
>>
>> Manual testing:
>>
>>     - new test test/hotspot/jtreg/serviceability/jvmti/GetLocalVariable/GetLocalWithoutSuspendTest.java
>>     - test/hotspot/jtreg/vmTestbase/nsk/jvmti
>>     - test/hotspot/jtreg/serviceability/jvmti
>>
>> Nightly regression tests @SAP: JCK and JTREG, also in Xcomp mode, SPECjvm2008, SPECjbb2015,
>> Renaissance Suite, SAP specific tests with fastdebug and release builds on all platforms
>>
>> Thanks, Richard.
>>
>> [1] https://docs.oracle.com/en/java/javase/14/docs/specs/jvmti.html#local
> 


More information about the serviceability-dev mailing list