Conflicting use of StackWatermark in StackWalker vs GC?

Tue Feb 9 22:44:24 UTC 2021

On 2021-02-09 16:08, Roman Kennke wrote:
> Hi Stefan,
>
>> It's interesting that fetchNextBatch processes the entire stack in 
>> preparation for filling in the information about the frames:
>>
>>      // If we have to get back here for even more frames, then 1) the user did not supply
>>      // an accurate hint suggesting the depth of the stack walk, and 2) we are not just
>>      // peeking  at a few frames. Take the cost of flushing out any pending deferred GC
>>      // processing of the stack.
>>      StackWatermarkSet::finish_processing(jt, NULL /* context */, StackWatermarkKind::gc);
>>
>> but further down in fill_in_frames => LiveFrameStream::fill_frame => 
>> fill_live_stackframe, we perform object allocation, which could 
>> safepoint for a GC that would reset the watermark. After leaving that 
>> safepoint we will have processed the top-most frames, but we won't 
>> have processed down to the current frame the StackWalker is looking 
>> at. This is my guess of what's happening, but I haven't been able to 
>> reproduce the problem, so it's a bit hard to verify that this is 
>> what's happening.
>
> That sounds plausible.
>
> What would be a way out of this? Scan the stack and collect all 
> relevant information without allocating any Java objects yet, and fill 
> in the Java frames array after the stack scan, maybe?

We have a way to deal with similar situations:

// Use this class to mark a remote thread you are currently interested
// in examining the entire stack, without it slipping into an unprocessed
// state at safepoint polls.
class KeepStackGCProcessedMark : public StackObj {

It links the current thread's watermark to the other thread's watermark, 
and whenever we hit a safepoint, that thread's entire stack is processed. 
See:

void StackWatermark::on_safepoint() {
   start_processing();
   StackWatermark* linked_watermark = _linked_watermark;
   if (linked_watermark != NULL) {
     linked_watermark->finish_processing(NULL /* context */);
   }
}

KeepStackGCProcessedMark isn't reentrant, so we would have to watch out 
for that.
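
To illustrate the idea, here is a minimal standalone toy model. It is not 
HotSpot code: ToyWatermark, ToyKeepProcessedMark and 
frame_safe_after_safepoint are hypothetical names, and the assumption that 
a new GC cycle eagerly processes only the two top-most frames is arbitrary. 
It only sketches why a deep frame can become unsafe after a safepoint, and 
how a linked watermark would avoid that:

```cpp
#include <cassert>
#include <cstddef>

// Toy stand-in for a per-thread stack watermark (hypothetical, not HotSpot).
// Frames are numbered from the top of the stack (0) downwards; frames with
// an index below processed_depth count as already processed by the GC.
struct ToyWatermark {
  std::size_t stack_depth;          // total number of frames on this stack
  std::size_t processed_depth = 0;  // frames [0, processed_depth) are safe
  ToyWatermark* linked = nullptr;   // optional link to another watermark

  explicit ToyWatermark(std::size_t depth) : stack_depth(depth) {}

  // A new GC cycle invalidates earlier processing. Assumption: only the
  // two top-most frames are processed eagerly.
  void start_processing() {
    processed_depth = stack_depth < 2 ? stack_depth : 2;
  }

  void finish_processing() { processed_depth = stack_depth; }

  bool is_frame_safe(std::size_t frame) const {
    return frame < processed_depth;
  }

  // Same shape as the StackWatermark::on_safepoint() quoted above.
  void on_safepoint() {
    start_processing();
    if (linked != nullptr) {
      linked->finish_processing();
    }
  }
};

// Toy RAII mark: links our watermark to theirs for the current scope.
struct ToyKeepProcessedMark {
  ToyWatermark& ours;
  ToyKeepProcessedMark(ToyWatermark& ours_, ToyWatermark& theirs)
      : ours(ours_) {
    assert(ours.linked == nullptr && "mark is not reentrant");
    ours.linked = &theirs;
  }
  ~ToyKeepProcessedMark() { ours.linked = nullptr; }
};

// Simulates a walker that has fully processed the target stack, then
// allocates and hits a safepoint where a new GC cycle starts. Returns
// whether `frame` of the target stack is still safe to look at afterwards.
bool frame_safe_after_safepoint(bool use_mark, std::size_t frame) {
  ToyWatermark walker(4), target(10);
  target.finish_processing();   // up-front processing before the walk
  auto safepoint = [&] {
    target.start_processing();  // new GC cycle resets the target stack
    walker.on_safepoint();      // the walker polls at the allocation
  };
  if (use_mark) {
    ToyKeepProcessedMark mark(walker, target);
    safepoint();
    return target.is_frame_safe(frame);
  }
  safepoint();
  return target.is_frame_safe(frame);
}
```

In this model, frame 7 is no longer safe after the safepoint when no mark 
is installed, but stays safe when the mark is in scope. The assert in the 
constructor models the reentrancy restriction mentioned above.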

StefanK

>
> Roman
>
>
>> StefanK
>>
>> On 2021-02-09 15:08, Roman Kennke wrote:
>>> I am getting the same failure with ZGC:
>>>
>>> CONF=linux-x86_64-server-fastdebug make run-test 
>>> TEST=java/lang/StackWalker 
>>> TEST_VM_OPTS="-XX:+UnlockExperimentalVMOptions -XX:+UseZGC 
>>> -XX:ZCollectionInterval=0.01"
>>>
>>>
>>>> Hello all,
>>>>
>>>> When running StackWalker tests with 'aggressive' Shenandoah mode 
>>>> (i.e. run GCs all the time, even if there is no work), I observe 
>>>> crashes like this:
>>>>
>>>> #  Internal Error (/home/rkennke/src/openjdk/jdk/src/hotspot/share/runtime/stackWatermark.cpp:178), pid=549168, tid=549230
>>>> #  assert(is_frame_safe(f)) failed: Frame must be safe
>>>>
>>>> Full hs_err:
>>>> http://cr.openjdk.java.net/~rkennke/hs_err_pid549168.log
>>>>
>>>> I strongly suspect that this is happening because of StackWalker's 
>>>> use of StackWatermark, which conflicts with the GC's own use of 
>>>> StackWatermark. IOW, it asserts that the frame has been processed, but 
>>>> the GC is still on it.
>>>>
>>>> Are we missing some coordination between StackWalker and the GC here?
>>>>
>>>> It can be reproduced using:
>>>> CONF=linux-x86_64-server-fastdebug make run-test 
>>>> TEST=java/lang/StackWalker 
>>>> TEST_VM_OPTS="-XX:+UnlockExperimentalVMOptions -XX:+UseShenandoahGC 
>>>> -XX:ShenandoahGCHeuristics=aggressive"
>>>>
>>>> Thanks,
>>>> Roman
>>>
>>
>