RFR(M): 8244383: jhsdb/HeapDumpTestWithActiveProcess.java fails with "AssertionFailure: illegal bci"
Daniel D. Daugherty
daniel.daugherty at oracle.com
Wed Jun 24 02:07:03 UTC 2020
Just one more comment on this part:
> > L220: System.out.println("CurrentFrameGuess: choosing interpreter
> frame: sp = " +
> > L221: spFound + ", fpFound = " +
> fp + ", pcFound = " + pc);
> > This debug output doesn't make sense to me:
> >
> > "sp = " label and 'spFound' value
> > "fpFound = " label and 'fp' value
> > "pcFound = " label and 'pc' value
>
> but I may not have enough context...
> From the point of view of the person reading the output, they want to
> know the values for sp, fp, and pc. But within the code these values
> are stored in the "found" variables.
In that case, the code is wrong for the 'fp' and 'pc' outputs
since you changed the labels and not the variables.
Dan
On 6/23/20 7:28 PM, Chris Plummer wrote:
> On 6/23/20 2:44 PM, Daniel D. Daugherty wrote:
>> On 6/18/20 8:54 PM, Chris Plummer wrote:
>>> [I've added runtime-dev to this SA review since understanding
>>> interpreter invokes (code generated by
>>> TemplateInterpreterGenerator::generate_normal_entry()) and stack
>>> walking is probably more important than understanding SA.]
>>>
>>> Hello,
>>>
>>> Please help review the following:
>>>
>>> https://bugs.openjdk.java.net/browse/JDK-8244383
>>> http://cr.openjdk.java.net/~cjplummer/8244383/webrev.00/index.html
>>
> Thanks for helping!
>> Sorry for the delay in reviewing this one. I've come back to it a couple
>> of times because code like this is very hard to review.
>>
>>
>> General comment:
>> This fix reminds of the crazy things that AsyncGetCallTrace has to
>> do in order to gather call trace data. I'm guessing that SA is
>> attaching to the VM in an asynchronous manner and that's why it
>> can observe things like partially constructed frames. If that's a
>> correct guess, then how is SA stopping/suspending the threads?
>> I'm just curious here.
> On linux SA uses ptrace. I'm not familiar with the details of how it
> works. I'm not sure where ptrace allows suspends to happen, but
> certainly it has no knowledge of JVM safepoints or other
> synchronization that the JVM does. So from the JVM and SA point of
> view the suspend can happen at any arbitrary JVM instruction.
>
> From what I can gather, PTRACE_ATTACH suspends the entire process, so
> that means all threads are suspended once you attach. However,
> PTRACE_GETREGS can be called on individual threads (LWPs), but I don't
> see any indication in the SA code that you need to attach to each LWP
> first.
>>
>> Or this might be a case where SA is examining a core file in
>> which case the various threads stacks are not necessarily at
>> good/safepoint-safe pause points.
> For this bug and test it's a live process, but I think the bug being
> addressed here can happen just as well with a core file. Unfortunately
> we have very little core file testing support. I'm actually in the
> middle of addressing that right now.
>>
>> src/jdk.hotspot.agent/share/classes/sun/jvm/hotspot/runtime/x86/X86Frame.java
>>
>> No comments.
>>
>> src/jdk.hotspot.agent/share/classes/sun/jvm/hotspot/debugger/bsd/amd64/BsdAMD64CFrame.java
>>
>> No comments.
>>
>> src/jdk.hotspot.agent/share/classes/sun/jvm/hotspot/runtime/amd64/AMD64CurrentFrameGuess.java
>>
>> L104: // two locations, then we canot determine the frame.
>> typo: s/canot/cannot/
> ok
>>
>> L127: // it's validity will help us determine the state of
>> the new frame push.
>> typo: s/it's/its/
> ok
>>
>> L148: System.out.println("CurrentFrameGuess: frame pushed
>> but not initaliazed.");
>> typo: s/initaliazed/initialized/
> ok
>>
>> L220: System.out.println("CurrentFrameGuess:
>> choosing interpreter frame: sp = " +
>> L221: spFound + ", fpFound = " +
>> fp + ", pcFound = " + pc);
>> This debug output doesn't make sense to me:
>>
>> "sp = " label and 'spFound' value
>> "fpFound = " label and 'fp' value
>> "pcFound = " label and 'pc' value
>>
>> but I may not have enough context...
> From the point of view of the person reading the output, they want to
> know the values for sp, fp, and pc. But within the code these values
> are stored in the "found" variables.
>>
>> With code like this, it's really hard to figure out if you've covered
>> all the cases unless you've been in the observer seat yourself and
>> even then your test runs may not hit all the possible cases. All you
>> can really do is start with a set of adaptive changes, run with those
>> for a while and tweak them as you gather more observations.
> Yes, and I know there is still a very tiny gap or two in coverage that
> are maybe one or two instructions long, but they aren't worth dealing
> with. This bug was already very rare, and with the fixes I've done I
> don't see any issues now. SA is a debugger, so perfection in this
> regard is not expected.
>>
>> Chris, nice job with this bit of insanity!
> Thanks! I mostly stuck with this one to help with my SA expertise.
> Otherwise it wouldn't have been worth the time.
>
> Chris
>>
>> Thumbs up!
>>
>> Dan
>>
>>
>>
>>>
>>> The crux of the bug is when doing stack walking the topmost frame is
>>> in an inconsistent state because we are in the middle of pushing a
>>> new interpreter frame. Basically we are executing code generated by
>>> TemplateInterpreterGenerator::generate_normal_entry(). Since the PC
>>> register is in this code, SA assumes the topmost frame is an
>>> interpreter frame.
>>>
>>> The first issue with this interpreter frame assumption is if we
>>> haven't actually pushed the frame yet, then the current frame is the
>>> caller's frame, and could be compiled. But since SA thinks it's
>>> interpreted, later on it tries to convert the frame->bcp to a BCI,
>>> but frame->bcp is only valid for interpreter frames. Thus the
>>> "illegal BCI" failures. If the previous frame happened to be
>>> interpreted, then the existing SA code works fine.
>>>
>>> The other state of frame pushing that was problematic was when the
>>> new frame had been pushed, but frame->method and frame->bcp were not
>>> setup yet. This also would lead to "illegal BCI" later on because
>>> garbage would be stored in these locations.
>>>
>>> Fixing the above problems requires trying to determine the state of
>>> the frame push through a series of checks, and then adapting what is
>>> considered to be the current frame based on the outcome of the
>>> checks. The first things checked is that frame->method is valid (we
>>> can successfully instantiate a wrapper for the Method* without
>>> failure) and that frame->bcp is within the method. If both these
>>> pass then we can use the frame as-is.
>>>
>>> If the above checks fail, then we try to determine whether the issue
>>> is that the frame is not yet pushed and the current frame is
>>> actually compiled, or the frame has been pushed but not yet
>>> initialized. This is done by first getting the return address from
>>> the stack or RAX (it's location depends on how far along we are in
>>> the entry code) and comparing this to what is stored in
>>> frame->return_addr. If they are the same, then we have pushed the
>>> frame but not yet initialized it. In this case we use the previous
>>> frame (senderSP() and senderFP()) as the current frame since the
>>> current frame is not yet initialized. If the return address check
>>> fails, then we assume the new frame is not yet pushed, and and treat
>>> the current frame as compiled, even though PC points into the
>>> interpreter (we replace PC with RAX in this case).
>>>
>>> Comments in the code pretty well explain all the above, so it is
>>> probably easier to follow the logic in the code along with the
>>> comments rather than apply my above description to the code.
>>>
>>> I should add that it's very rare that we ever get into this special
>>> error handling code. This bug was very hard to reproduce initially.
>>> I was only able to make progress with reproducing and debugging by
>>> inserting delay loops in various spots in the code generated by
>>> TemplateInterpreterGenerator::generate_normal_entry(). By doing this
>>> I was able to reproduce the issue quite easily and hit all the logic
>>> in the new code I've added.
>>>
>>> The fix is basically entirely contained within
>>> AMD64CurrentFrameGuess.java. The rest of the changes are minor:
>>>
>>> src/jdk.hotspot.agent/share/classes/sun/jvm/hotspot/runtime/amd64/AMD64CurrentFrameGuess.java
>>>
>>> -Main fix for CR
>>>
>>> src/jdk.hotspot.agent/share/classes/sun/jvm/hotspot/runtime/x86/X86Frame.java
>>>
>>> -Added getInterpreterFrameBCP(), which is now needed by
>>> AMD64CurrentFrameGuess.java
>>> -I also simplified some code by using the existing
>>> getInterpreterFrameMethod()
>>> rather than replicating inline what it does.
>>>
>>> src/jdk.hotspot.agent/share/classes/sun/jvm/hotspot/debugger/bsd/amd64/BsdAMD64CFrame.java
>>>
>>> -I noticed the windows version of this code had some extra checks
>>> that were missing
>>> from the bsd version. I then looked at the linux version, but it
>>> had been heavily modified
>>> a short while back to leverage DWARF info to determine frames. So I
>>> looked at the previous
>>> rev and it too had these extra checks. I decided to add them to the
>>> BSD port. I'm not sure
>>> if it helps at all, but it certainly doesn't seem to do any harm.
>>>
>>> thanks,
>>>
>>> Chris
>>>
>>
>
>
More information about the serviceability-dev
mailing list