RFR(M): 8244383: jhsdb/HeapDumpTestWithActiveProcess.java fails with "AssertionFailure: illegal bci"
Chris Plummer
chris.plummer at oracle.com
Wed Jun 24 03:04:50 UTC 2020
On 6/23/20 7:07 PM, Daniel D. Daugherty wrote:
> Just one more comment on this part:
>
>> > L220: System.out.println("CurrentFrameGuess: choosing
>> interpreter frame: sp = " +
>> > L221: spFound + ", fpFound = "
>> + fp + ", pcFound = " + pc);
>> > This debug output doesn't make sense to me:
>> >
>> > "sp = " label and 'spFound' value
>> > "fpFound = " label and 'fp' value
>> > "pcFound = " label and 'pc' value
>>
>> but I may not have enough context...
>> From the point of view of the person reading the output, they want to
>> know the values for sp, fp, and pc. But within the code these values
>> are stored in the "found" variables.
>
> In that case, the code is wrong for the 'fp' and 'pc' outputs
> since you changed the labels and not the variables.
Yes, you are correct. I'll fix the output for fp and pc.
thanks,
Chris
>
> Dan
>
>
>
> On 6/23/20 7:28 PM, Chris Plummer wrote:
>> On 6/23/20 2:44 PM, Daniel D. Daugherty wrote:
>>> On 6/18/20 8:54 PM, Chris Plummer wrote:
>>>> [I've added runtime-dev to this SA review since understanding
>>>> interpreter invokes (code generated by
>>>> TemplateInterpreterGenerator::generate_normal_entry()) and stack
>>>> walking is probably more important than understanding SA.]
>>>>
>>>> Hello,
>>>>
>>>> Please help review the following:
>>>>
>>>> https://bugs.openjdk.java.net/browse/JDK-8244383
>>>> http://cr.openjdk.java.net/~cjplummer/8244383/webrev.00/index.html
>>>
>> Thanks for helping!
>>> Sorry for the delay in reviewing this one. I've come back to it a
>>> couple
>>> of times because code like this is very hard to review.
>>>
>>>
>>> General comment:
>>> This fix reminds of the crazy things that AsyncGetCallTrace has to
>>> do in order to gather call trace data. I'm guessing that SA is
>>> attaching to the VM in an asynchronous manner and that's why it
>>> can observe things like partially constructed frames. If that's a
>>> correct guess, then how is SA stopping/suspending the threads?
>>> I'm just curious here.
>> On linux SA uses ptrace. I'm not familiar with the details of how it
>> works. I'm not sure where ptrace allows suspends to happen, but
>> certainly it has no knowledge of JVM safepoints or other
>> synchronization that the JVM does. So from the JVM and SA point of
>> view the suspend can happen at any arbitrary JVM instruction.
>>
>> From what I can gather, PTRACE_ATTACH suspends the entire process, so
>> that means all threads are suspended once you attach. However,
>> PTRACE_GETREGS can be called on individual threads (LWPs), but I
>> don't see any indication in the SA code that you need to attach to
>> each LWP first.
>>>
>>> Or this might be a case where SA is examining a core file in
>>> which case the various threads stacks are not necessarily at
>>> good/safepoint-safe pause points.
>> For this bug and test it's a live process, but I think the bug being
>> addressed here can happen just as well with a core file.
>> Unfortunately we have very little core file testing support. I'm
>> actually in the middle of addressing that right now.
>>>
>>> src/jdk.hotspot.agent/share/classes/sun/jvm/hotspot/runtime/x86/X86Frame.java
>>>
>>> No comments.
>>>
>>> src/jdk.hotspot.agent/share/classes/sun/jvm/hotspot/debugger/bsd/amd64/BsdAMD64CFrame.java
>>>
>>> No comments.
>>>
>>> src/jdk.hotspot.agent/share/classes/sun/jvm/hotspot/runtime/amd64/AMD64CurrentFrameGuess.java
>>>
>>> L104: // two locations, then we canot determine the frame.
>>> typo: s/canot/cannot/
>> ok
>>>
>>> L127: // it's validity will help us determine the state of
>>> the new frame push.
>>> typo: s/it's/its/
>> ok
>>>
>>> L148: System.out.println("CurrentFrameGuess: frame
>>> pushed but not initaliazed.");
>>> typo: s/initaliazed/initialized/
>> ok
>>>
>>> L220: System.out.println("CurrentFrameGuess:
>>> choosing interpreter frame: sp = " +
>>> L221: spFound + ", fpFound = "
>>> + fp + ", pcFound = " + pc);
>>> This debug output doesn't make sense to me:
>>>
>>> "sp = " label and 'spFound' value
>>> "fpFound = " label and 'fp' value
>>> "pcFound = " label and 'pc' value
>>>
>>> but I may not have enough context...
>> From the point of view of the person reading the output, they want to
>> know the values for sp, fp, and pc. But within the code these values
>> are stored in the "found" variables.
>>>
>>> With code like this, it's really hard to figure out if you've covered
>>> all the cases unless you've been in the observer seat yourself and
>>> even then your test runs may not hit all the possible cases. All you
>>> can really do is start with a set of adaptive changes, run with those
>>> for a while and tweak them as you gather more observations.
>> Yes, and I know there is still a very tiny gap or two in coverage
>> that are maybe one or two instructions long, but they aren't worth
>> dealing with. This bug was already very rare, and with the fixes I've
>> done I don't see any issues now. SA is a debugger, so perfection in
>> this regard is not expected.
>>>
>>> Chris, nice job with this bit of insanity!
>> Thanks! I mostly stuck with this one to help with my SA expertise.
>> Otherwise it wouldn't have been worth the time.
>>
>> Chris
>>>
>>> Thumbs up!
>>>
>>> Dan
>>>
>>>
>>>
>>>>
>>>> The crux of the bug is when doing stack walking the topmost frame
>>>> is in an inconsistent state because we are in the middle of pushing
>>>> a new interpreter frame. Basically we are executing code generated
>>>> by TemplateInterpreterGenerator::generate_normal_entry(). Since the
>>>> PC register is in this code, SA assumes the topmost frame is an
>>>> interpreter frame.
>>>>
>>>> The first issue with this interpreter frame assumption is if we
>>>> haven't actually pushed the frame yet, then the current frame is
>>>> the caller's frame, and could be compiled. But since SA thinks it's
>>>> interpreted, later on it tries to convert the frame->bcp to a BCI,
>>>> but frame->bcp is only valid for interpreter frames. Thus the
>>>> "illegal BCI" failures. If the previous frame happened to be
>>>> interpreted, then the existing SA code works fine.
>>>>
>>>> The other state of frame pushing that was problematic was when the
>>>> new frame had been pushed, but frame->method and frame->bcp were
>>>> not setup yet. This also would lead to "illegal BCI" later on
>>>> because garbage would be stored in these locations.
>>>>
>>>> Fixing the above problems requires trying to determine the state of
>>>> the frame push through a series of checks, and then adapting what
>>>> is considered to be the current frame based on the outcome of the
>>>> checks. The first things checked is that frame->method is valid (we
>>>> can successfully instantiate a wrapper for the Method* without
>>>> failure) and that frame->bcp is within the method. If both these
>>>> pass then we can use the frame as-is.
>>>>
>>>> If the above checks fail, then we try to determine whether the
>>>> issue is that the frame is not yet pushed and the current frame is
>>>> actually compiled, or the frame has been pushed but not yet
>>>> initialized. This is done by first getting the return address from
>>>> the stack or RAX (it's location depends on how far along we are in
>>>> the entry code) and comparing this to what is stored in
>>>> frame->return_addr. If they are the same, then we have pushed the
>>>> frame but not yet initialized it. In this case we use the previous
>>>> frame (senderSP() and senderFP()) as the current frame since the
>>>> current frame is not yet initialized. If the return address check
>>>> fails, then we assume the new frame is not yet pushed, and and
>>>> treat the current frame as compiled, even though PC points into the
>>>> interpreter (we replace PC with RAX in this case).
>>>>
>>>> Comments in the code pretty well explain all the above, so it is
>>>> probably easier to follow the logic in the code along with the
>>>> comments rather than apply my above description to the code.
>>>>
>>>> I should add that it's very rare that we ever get into this special
>>>> error handling code. This bug was very hard to reproduce initially.
>>>> I was only able to make progress with reproducing and debugging by
>>>> inserting delay loops in various spots in the code generated by
>>>> TemplateInterpreterGenerator::generate_normal_entry(). By doing
>>>> this I was able to reproduce the issue quite easily and hit all the
>>>> logic in the new code I've added.
>>>>
>>>> The fix is basically entirely contained within
>>>> AMD64CurrentFrameGuess.java. The rest of the changes are minor:
>>>>
>>>> src/jdk.hotspot.agent/share/classes/sun/jvm/hotspot/runtime/amd64/AMD64CurrentFrameGuess.java
>>>>
>>>> -Main fix for CR
>>>>
>>>> src/jdk.hotspot.agent/share/classes/sun/jvm/hotspot/runtime/x86/X86Frame.java
>>>>
>>>> -Added getInterpreterFrameBCP(), which is now needed by
>>>> AMD64CurrentFrameGuess.java
>>>> -I also simplified some code by using the existing
>>>> getInterpreterFrameMethod()
>>>> rather than replicating inline what it does.
>>>>
>>>> src/jdk.hotspot.agent/share/classes/sun/jvm/hotspot/debugger/bsd/amd64/BsdAMD64CFrame.java
>>>>
>>>> -I noticed the windows version of this code had some extra checks
>>>> that were missing
>>>> from the bsd version. I then looked at the linux version, but it
>>>> had been heavily modified
>>>> a short while back to leverage DWARF info to determine frames. So
>>>> I looked at the previous
>>>> rev and it too had these extra checks. I decided to add them to
>>>> the BSD port. I'm not sure
>>>> if it helps at all, but it certainly doesn't seem to do any harm.
>>>>
>>>> thanks,
>>>>
>>>> Chris
>>>>
>>>
>>
>>
>
More information about the serviceability-dev
mailing list