RFR(S) 8249293: Unsafe stackwalk in VM_GetOrSetLocal::doit_prologue()

Reingruber, Richard richard.reingruber at sap.com
Fri Aug 14 14:06:53 UTC 2020


Hi Serguei,

thanks for the feedback. I have implemented your suggestions and created a new
webrev:

Webrev: http://cr.openjdk.java.net/~rrich/webrevs/8249293/webrev.4/
Delta:  http://cr.openjdk.java.net/~rrich/webrevs/8249293/webrev.4.inc/

Please find my replies to your comments below.

Best regards,
Richard.

> http://cr.openjdk.java.net/~rrich/webrevs/8249293/webrev.3.inc/test/hotspot/jtreg/serviceability/jvmti/GetLocalVariable/GetLocalWithoutSuspendTest.java.frames.html

>   33  *          the stack walk. The target thread's stack is walkable while in native. After sending the notification it
>   ...

>   54      * @param depth Depth of target frame for GetLocalObject() call. Should be large value to prolong the unsafe stack walk.
>   55      * @param waitTimeInNativeAfterNotify Time to wait after notify with walkable stack before returning an becoming unsafe again.
>   ...

>   71      * Wait time in native, i.e. with walkable stack, after notifying agent thread to do GetLocalObject() call.
>   ...

>   89             msg((now -start) + " ms  Iteration : " + iterations + "  waitTimeInNativeAfterNotify : " + waitTimeInNativeAfterNotify);

> Could you, please, re-balance the lines above to make them shorter?

Ok, done.


>   90             int newTargetDepth = recursiveMethod(0, targetDepth);
>   91             if (newTargetDepth < targetDepth) {
>   92                 msg("StackOverflowError during test.");
>   93                 msg("Old target depth: " + targetDepth);
>   94                 msg("Retry with new target depth: " + newTargetDepth);
>   95                 targetDepth = newTargetDepth;
>   96             }
> A comment is needed to explain why a StackOverflowError is not desired.
> At least, it is not obvious initially.
>   73     public int waitTimeInNativeAfterNotify;

> This name is unreasonably long which makes the code less readable.
> I'd suggest to reduce it to waitTime.

Ok, done.

>  103         notifyAgentToGetLocalAndWaitShortly(-1, 1);
>  ...
>  119                 notifyAgentToGetLocalAndWaitShortly(depth - 100, waitTimeInNativeAfterNotify);

> It is better to provide a short comment before each call explaining what it is doing.
> For instance, it is not clear why the call at the line 103 is needed.
> Why do we need to notify the agent to GetLocal for the second time?

The test is repeated TEST_ITERATIONS times. In each iteration the agent calls
GetLocal racing the target thread returning from the native call. The last call
in line 103 ist the shutdown signal.

> Can it be refactored into a separate native method?

I've made the shutdown process more explicit with the new native method
shutDown() which sets thest_state to ShutDown.

> Then the the function name can be reduced to 'notifyAgentToGetLocal'.
> This long name does not give enough context anyway.

Ok, done.

>   85         long iterations = 0;
>   87         do {
>   ...
>   97             iterations++;
>   ...
>  102         } while (iterations < TEST_ITERATIONS);

> Why a more explicit 'for' or 'while' loop is not used here? :
> for (long iter = 0; iter < TEST_ITERATIONS; iter++) { 
 
I have converted the loop into a for loop.

> http://cr.openjdk.java.net/~rrich/webrevs/8249293/webrev.3.inc/test/hotspot/jtreg/serviceability/jvmti/GetLocalVariable/libGetLocalWithoutSuspendTest.cpp.frames.html

> The indent in this file varies. It is better to keep it the same: 4 or 2.

Yes, I noticed this. I have not corrected it yet, because I didn't want to
pullute the incremental webrev with that change. Would you like me to fix the
indentation now to 2 spaces or do it as a last step?

>   60   AgentCallingGetLocalObject // The target thread waits for the agent to call
> I'd suggest to rename the constant to 'AgentInGetLocal'.

Ok, done.

>  150 GetLocalWithoutSuspendTestThreadLoop(jvmtiEnv * jvmti, JNIEnv* env, void * arg) {

> It is better rename the function to TestThreadLoop.

Would AgentThreadLoop be ok too?

> You can add a comment before to explain some basic about what it is doing.

Ok, done.

>  167     printf("*** AGENT: GetLocalWithoutSuspendTestThreadLoop thread started. Polling thread '%s' for local variables\n",
> It is better to get rid of leading stars in all messages.

Ok, done.

>  176         // the native method Java_GetLocalWithoutSuspendTest_notifyAgentToGetLocalAndWaitShortly
> The part 'Java_GetLocalWithoutSuspendTest_' can be removed from the function name.

Ok, done.

---
From: serguei.spitsyn at oracle.com <serguei.spitsyn at oracle.com> 
Sent: Freitag, 14. August 2020 10:11
To: Reingruber, Richard <richard.reingruber at sap.com>; David Holmes <david.holmes at oracle.com>; serviceability-dev at openjdk.java.net
Subject: Re: RFR(S) 8249293: Unsafe stackwalk in VM_GetOrSetLocal::doit_prologue()

Hi Richard,

http://cr.openjdk.java.net/~rrich/webrevs/8249293/webrev.3.inc/test/hotspot/jtreg/serviceability/jvmti/GetLocalVariable/GetLocalWithoutSuspendTest.java.frames.html

  33  *          the stack walk. The target thread's stack is walkable while in native. After sending the notification it
  ...

  54      * @param depth Depth of target frame for GetLocalObject() call. Should be large value to prolong the unsafe stack walk.
  55      * @param waitTimeInNativeAfterNotify Time to wait after notify with walkable stack before returning an becoming unsafe again.
  ...

  71      * Wait time in native, i.e. with walkable stack, after notifying agent thread to do GetLocalObject() call.
  ...

  89             msg((now -start) + " ms  Iteration : " + iterations + "  waitTimeInNativeAfterNotify : " + waitTimeInNativeAfterNotify);

Could you, please, re-balance the lines above to make them shorter?


  90             int newTargetDepth = recursiveMethod(0, targetDepth);
  91             if (newTargetDepth < targetDepth) {
  92                 msg("StackOverflowError during test.");
  93                 msg("Old target depth: " + targetDepth);
  94                 msg("Retry with new target depth: " + newTargetDepth);
  95                 targetDepth = newTargetDepth;
  96             }
A comment is needed to explain why a StackOverflowError is not desired.
At least, it is not obvious initially.
  73     public int waitTimeInNativeAfterNotify;

This name is unreasonably long which makes the code less readable.
I'd suggest to reduce it to waitTime.

 103         notifyAgentToGetLocalAndWaitShortly(-1, 1);
 ...
 119                 notifyAgentToGetLocalAndWaitShortly(depth - 100, waitTimeInNativeAfterNotify);

It is better to provide a short comment before each call explaining what it is doing.
For instance, it is not clear why the call at the line 103 is needed.
Why do we need to notify the agent to GetLocal for the second time?
Can it be refactored into a separate native method?
Then the the function name can be reduced to 'notifyAgentToGetLocal'.
This long name does not give enough context anyway.
  85         long iterations = 0;
  87         do {
  ...
  97             iterations++;
  ...
 102         } while (iterations < TEST_ITERATIONS);

Why a more explicit 'for' or 'while' loop is not used here? :
for (long iter = 0; iter < TEST_ITERATIONS; iter++) { 
 

http://cr.openjdk.java.net/~rrich/webrevs/8249293/webrev.3.inc/test/hotspot/jtreg/serviceability/jvmti/GetLocalVariable/libGetLocalWithoutSuspendTest.cpp.frames.html

The indent in this file varies. It is better to keep it the same: 4 or 2. 

  60   AgentCallingGetLocalObject // The target thread waits for the agent to call
I'd suggest to rename the constant to 'AgentInGetLocal'.
 150 GetLocalWithoutSuspendTestThreadLoop(jvmtiEnv * jvmti, JNIEnv* env, void * arg) {

It is better rename the function to TestThreadLoop.
You can add a comment before to explain some basic about what it is doing. 
 167     printf("*** AGENT: GetLocalWithoutSuspendTestThreadLoop thread started. Polling thread '%s' for local variables\n",
It is better to get rid of leading stars in all messages.
 176         // the native method Java_GetLocalWithoutSuspendTest_notifyAgentToGetLocalAndWaitShortly
The part 'Java_GetLocalWithoutSuspendTest_' can be removed from the function name.


I'm still reviewing the test native agent code.


Thanks,
Serguei


On 8/11/20 03:02, Reingruber, Richard wrote:
Hi David and Serguei,

On 11/08/2020 3:21 am, mailto:serguei.spitsyn at oracle.com wrote:
Hi Richard and David,

The implementation looks good to me.

But I do not understand what the test is doing with all this counters 
and recursions.

For instance, these fragments:

   86         recursions = 0;
   87         try {
   88             recursiveMethod(1<<20);
   89         } catch (StackOverflowError e) {
   90             msg("Caught StackOverflowError as expected");
   91         }
   92         int target_depth = recursions-100;    // spaces are missed around the '-' sigh

It is not obvious that the 'recursion' is updated in the recursiveMethod.
I would suggestto make it more explicit:
    recursiveMethod(M);
    int target_depth = M - 100;

Then the variable 'recursions' can be removed or become local.

The recursiveMethod takes in the maximum recursions to try and updates 
the recursions variable to record how many recursions were possible - so:

target_depth = <actual recursions> - 100;

Possibly recursiveMethod could return the actual recursions instead of 
using the global variables?

I've eliminated the static 'recursions' variable. recursiveMethod() now returns
the depth at which the recursion was ended. I hesitated doing this, because I
had to handle the StackOverflowError with all those frames still on stack. But
the handler is empty, so it should not cause problems.

This is the new webrev (as posted previously):

Webrev: http://cr.openjdk.java.net/~rrich/webrevs/8249293/webrev.3/
Delta:  http://cr.openjdk.java.net/~rrich/webrevs/8249293/webrev.3.inc/

Thanks, Richard.

-----Original Message-----
From: David Holmes mailto:david.holmes at oracle.com 
Sent: Dienstag, 11. August 2020 04:00
To: mailto:serguei.spitsyn at oracle.com; Reingruber, Richard mailto:richard.reingruber at sap.com; mailto:serviceability-dev at openjdk.java.net
Subject: Re: RFR(S) 8249293: Unsafe stackwalk in VM_GetOrSetLocal::doit_prologue()

Hi Serguei,

On 11/08/2020 3:21 am, mailto:serguei.spitsyn at oracle.com wrote:
Hi Richard and David,

The implementation looks good to me.

But I do not understand what the test is doing with all this counters 
and recursions.

For instance, these fragments:

   86         recursions = 0;
   87         try {
   88             recursiveMethod(1<<20);
   89         } catch (StackOverflowError e) {
   90             msg("Caught StackOverflowError as expected");
   91         }
   92         int target_depth = recursions-100;    // spaces are missed around the '-' sigh

It is not obvious that the 'recursion' is updated in the recursiveMethod.
I would suggestto make it more explicit:
    recursiveMethod(M);
    int target_depth = M - 100;

Then the variable 'recursions' can be removed or become local.

The recursiveMethod takes in the maximum recursions to try and updates 
the recursions variable to record how many recursions were possible - so:

target_depth = <actual recursions> - 100;

Possibly recursiveMethod could return the actual recursions instead of 
using the global variables?

David
-----

This method will be:

   47     private static final int M = 1 << 20;
  ...
  121     public long recursiveMethod(int depth) {
  123         if (depth == 0) {
  124             notifyAgentToGetLocalAndWaitShortly(M - 100, waitTimeInNativeAfterNotify);
  126         } else {
  127             recursiveMethod(--depth);
  128         }
  129     }


At least, he test is missing the comments explaining all these.

Thanks,
Serguei



On 8/9/20 22:35, David Holmes wrote:
Hi Richard,

On 31/07/2020 5:28 pm, Reingruber, Richard wrote:
Hi,

I rebase the fix after JDK-8250042.

New webrev: http://cr.openjdk.java.net/~rrich/webrevs/8249293/webrev.2/

The general fix for this seems good. A minor nit:

 588     if (!is_assignable(signature, ob_k, Thread::current())) {

You know that the current thread is the VMThread so can use 
VMThread::vm_thread().

Similarly for this existing code:

 694     Thread* current_thread = Thread::current();

---

Looking at the test code ... I'm less clear on exactly what is 
happening and the use of spin-waits raises some red-flags for me in 
terms of test reliability on different platforms. The "while 
(--waitCycles > 0)" loop in particular offers no certainty that the 
agent thread is executing anything in particular. And the use of the 
spin_count as a guide to future waiting time seems somewhat arbitrary. 
In all seriousness I got a headache trying to work out how the test 
was expecting to operate. Some parts could be simplified using raw 
monitors, I think. But there's no sure way to know the agent thread is 
in the midst of the stackwalk when the target thread wants to leave 
the native code. So I understand what you are trying to achieve here, 
I'm just not sure how reliably it will actually achieve it.

test/hotspot/jtreg/serviceability/jvmti/GetLocalVariable/libGetLocalWithoutSuspendTest.cpp 


 32 static volatile jlong spinn_count     = 0;

Using a 64-bit counter seems like it will be a problem on 32-bit systems.

Should be spin_count not spinn_count.

 36 // Agent thread waits for value != 0, then performas the JVMTI 
call to get local variable.

typo: performas

Thanks,
David
-----

Thanks, Richard.

-----Original Message-----
From: serviceability-dev mailto:serviceability-dev-retn at openjdk.java.net 
On Behalf Of Reingruber, Richard
Sent: Montag, 27. Juli 2020 09:45
To: mailto:serguei.spitsyn at oracle.com; mailto:serviceability-dev at openjdk.java.net
Subject: [CAUTION] RE: RFR(S) 8249293: Unsafe stackwalk in 
VM_GetOrSetLocal::doit_prologue()

Hi Serguei,

   > I tested it on Linux and Windows but not yet on MacOS.

The test succeeded now on all platforms.

Thanks, Richard.

-----Original Message-----
From: Reingruber, Richard
Sent: Freitag, 24. Juli 2020 15:04
To: mailto:serguei.spitsyn at oracle.com; mailto:serviceability-dev at openjdk.java.net
Subject: RE: RFR(S) 8249293: Unsafe stackwalk in 
VM_GetOrSetLocal::doit_prologue()

Hi Serguei,

The fix itself looks good to me.

thanks for looking at the fix.

I still need another look at new test.
Could you, please, convert the agent of new test to C++?
It will make it a little bit simpler.

Sure, here is the new webrev.1 with a C++ version of the test agent:

http://cr.openjdk.java.net/~rrich/webrevs/8249293/webrev.1/

I tested it on Linux and Windows but not yet on MacOS.

Thanks,
Richard.

-----Original Message-----
From: mailto:serguei.spitsyn at oracle.com mailto:serguei.spitsyn at oracle.com
Sent: Freitag, 24. Juli 2020 00:00
To: Reingruber, Richard mailto:richard.reingruber at sap.com; 
mailto:serviceability-dev at openjdk.java.net
Subject: Re: RFR(S) 8249293: Unsafe stackwalk in 
VM_GetOrSetLocal::doit_prologue()

Hi Richard,

Thank you for filing the CR and taking care about it!
The fix itself looks good to me.
I still need another look at new test.
Could you, please, convert the agent of new test to C++?
It will make it a little bit simpler.

Thanks,
Serguei


On 7/20/20 01:15, Reingruber, Richard wrote:
Hi,

please help review this fix for VM_GetOrSetLocal. It moves the 
unsafe stackwalk from the vm
operation prologue before the safepoint into the doit() method 
executed at the safepoint.

Webrev: 
http://cr.openjdk.java.net/~rrich/webrevs/8249293/webrev.0/index.html
Bug: https://bugs.openjdk.java.net/browse/JDK-8249293

According to the JVMTI spec on local variable access it is not 
required to suspend the target thread
T [1]. The operation will simply fail with 
JVMTI_ERROR_NO_MORE_FRAMES if T is executing
bytecodes. It will succeed though if T is blocked because of 
synchronization or executing some native
code.

The issue is that in the latter case the stack walk in 
VM_GetOrSetLocal::doit_prologue() to prepare
the access to the local variable is unsafe, because it is done 
before the safepoint and it races
with T returning to execute bytecodes making its stack not walkable. 
The included test shows that
this can crash the VM if T wins the race.

Manual testing:

    - new test 
test/hotspot/jtreg/serviceability/jvmti/GetLocalVariable/GetLocalWithoutSuspendTest.java
    - test/hotspot/jtreg/vmTestbase/nsk/jvmti
    - test/hotspot/jtreg/serviceability/jvmti

Nightly regression tests @SAP: JCK and JTREG, also in Xcomp mode, 
SPECjvm2008, SPECjbb2015,
Renaissance Suite, SAP specific tests with fastdebug and release 
builds on all platforms

Thanks, Richard.

[1] 
https://docs.oracle.com/en/java/javase/14/docs/specs/jvmti.html#local





More information about the serviceability-dev mailing list