RFR(M): 8244383: jhsdb/HeapDumpTestWithActiveProcess.java fails with "AssertionFailure: illegal bci"

Daniel D. Daugherty daniel.daugherty at oracle.com
Wed Jun 24 16:07:53 UTC 2020


On 6/23/20 11:04 PM, Chris Plummer wrote:
> On 6/23/20 7:07 PM, Daniel D. Daugherty wrote:
>> Just one more comment on this part:
>>
>>> >    L220: System.out.println("CurrentFrameGuess: choosing 
>>> interpreter frame: sp = " +
>>> >    L221:                                  spFound + ", fpFound = " 
>>> + fp + ", pcFound = " + pc);
>>> >        This debug output doesn't make sense to me:
>>> >
>>> >            "sp = " label and 'spFound' value
>>> >            "fpFound = " label and 'fp' value
>>> >            "pcFound = " label and 'pc' value
>>>
>>>         but I may not have enough context...
>>> From the point of view of the person reading the output, they want 
>>> to know the values for sp, fp, and pc. But within the code these 
>>> values are stored in the "found" variables. 
>>
>> In that case, the code is wrong for the 'fp' and 'pc' outputs
>> since you changed the labels and not the variables.
> Yes, you are correct. I'll fix the output for fp and pc.

I don't need another webrev for any of my comments.

Thanks for your work on making SA more stable. The CI appreciates it!

Dan


>
> thanks,
>
> Chris
>>
>> Dan
>>
>>
>>
>> On 6/23/20 7:28 PM, Chris Plummer wrote:
>>> On 6/23/20 2:44 PM, Daniel D. Daugherty wrote:
>>>> On 6/18/20 8:54 PM, Chris Plummer wrote:
>>>>> [I've added runtime-dev to this SA review since understanding 
>>>>> interpreter invokes (code generated by 
>>>>> TemplateInterpreterGenerator::generate_normal_entry()) and stack 
>>>>> walking is probably more important than understanding SA.]
>>>>>
>>>>> Hello,
>>>>>
>>>>> Please help review the following:
>>>>>
>>>>> https://bugs.openjdk.java.net/browse/JDK-8244383
>>>>> http://cr.openjdk.java.net/~cjplummer/8244383/webrev.00/index.html
>>>>
>>> Thanks for helping!
>>>> Sorry for the delay in reviewing this one. I've come back to it a 
>>>> couple
>>>> of times because code like this is very hard to review.
>>>>
>>>>
>>>> General comment:
>>>>     This fix reminds of the crazy things that AsyncGetCallTrace has to
>>>>     do in order to gather call trace data. I'm guessing that SA is
>>>>     attaching to the VM in an asynchronous manner and that's why it
>>>>     can observe things like partially constructed frames. If that's a
>>>>     correct guess, then how is SA stopping/suspending the threads?
>>>>     I'm just curious here.
>>> On linux SA uses ptrace. I'm not familiar with the details of how it 
>>> works. I'm not sure where ptrace allows suspends to happen, but 
>>> certainly it has no knowledge of JVM safepoints or other 
>>> synchronization that the JVM does. So from the JVM and SA point of 
>>> view the suspend can happen at any arbitrary JVM instruction.
>>>
>>> From what I can gather, PTRACE_ATTACH suspends the entire process, 
>>> so that means all threads are suspended once you attach. However, 
>>> PTRACE_GETREGS can be called on individual threads (LWPs), but I 
>>> don't see any indication in the SA code that you need to attach to 
>>> each LWP first.
>>>>
>>>>     Or this might be a case where SA is examining a core file in
>>>>     which case the various threads stacks are not necessarily at
>>>>     good/safepoint-safe pause points.
>>> For this bug and test it's a live process, but I think the bug being 
>>> addressed here can happen just as well with a core file. 
>>> Unfortunately we have very little core file testing support. I'm 
>>> actually in the middle of addressing that right now.
>>>>
>>>> src/jdk.hotspot.agent/share/classes/sun/jvm/hotspot/runtime/x86/X86Frame.java 
>>>>
>>>>     No comments.
>>>>
>>>> src/jdk.hotspot.agent/share/classes/sun/jvm/hotspot/debugger/bsd/amd64/BsdAMD64CFrame.java 
>>>>
>>>>     No comments.
>>>>
>>>> src/jdk.hotspot.agent/share/classes/sun/jvm/hotspot/runtime/amd64/AMD64CurrentFrameGuess.java 
>>>>
>>>>     L104:     // two locations, then we canot determine the frame.
>>>>         typo: s/canot/cannot/
>>> ok
>>>>
>>>>     L127:     // it's validity will help us determine the state of 
>>>> the new frame push.
>>>>         typo: s/it's/its/
>>> ok
>>>>
>>>>     L148:         System.out.println("CurrentFrameGuess: frame 
>>>> pushed but not initaliazed.");
>>>>         typo: s/initaliazed/initialized/
>>> ok
>>>>
>>>>     L220: System.out.println("CurrentFrameGuess: choosing 
>>>> interpreter frame: sp = " +
>>>>     L221:                                  spFound + ", fpFound = " 
>>>> + fp + ", pcFound = " + pc);
>>>>         This debug output doesn't make sense to me:
>>>>
>>>>             "sp = " label and 'spFound' value
>>>>             "fpFound = " label and 'fp' value
>>>>             "pcFound = " label and 'pc' value
>>>>
>>>>         but I may not have enough context...
>>> From the point of view of the person reading the output, they want 
>>> to know the values for sp, fp, and pc. But within the code these 
>>> values are stored in the "found" variables.
>>>>
>>>> With code like this, it's really hard to figure out if you've covered
>>>> all the cases unless you've been in the observer seat yourself and
>>>> even then your test runs may not hit all the possible cases. All you
>>>> can really do is start with a set of adaptive changes, run with those
>>>> for a while and tweak them as you gather more observations.
>>> Yes, and I know there is still a very tiny gap or two in coverage 
>>> that are maybe one or two instructions long, but they aren't worth 
>>> dealing with. This bug was already very rare, and with the fixes 
>>> I've done I don't see any issues now. SA is a debugger, so 
>>> perfection in this regard is not expected.
>>>>
>>>> Chris, nice job with this bit of insanity!
>>> Thanks! I mostly stuck with this one to help with my SA expertise. 
>>> Otherwise it wouldn't have been worth the time.
>>>
>>> Chris
>>>>
>>>> Thumbs up!
>>>>
>>>> Dan
>>>>
>>>>
>>>>
>>>>>
>>>>> The crux of the bug is when doing stack walking the topmost frame 
>>>>> is in an inconsistent state because we are in the middle of 
>>>>> pushing a new interpreter frame. Basically we are executing code 
>>>>> generated by 
>>>>> TemplateInterpreterGenerator::generate_normal_entry(). Since the 
>>>>> PC register is in this code, SA assumes the topmost frame is an 
>>>>> interpreter frame.
>>>>>
>>>>> The first issue with this interpreter frame assumption is if we 
>>>>> haven't actually pushed the frame yet, then the current frame is 
>>>>> the caller's frame, and could be compiled. But since SA thinks 
>>>>> it's interpreted, later on it tries to convert the frame->bcp to a 
>>>>> BCI, but frame->bcp is only valid for interpreter frames. Thus the 
>>>>> "illegal BCI" failures. If the previous frame happened to be 
>>>>> interpreted, then the existing SA code works fine.
>>>>>
>>>>> The other state of frame pushing that was problematic was when the 
>>>>> new frame had been pushed, but frame->method and frame->bcp were 
>>>>> not setup yet. This also would lead to "illegal BCI" later on 
>>>>> because garbage would be stored in these locations.
>>>>>
>>>>> Fixing the above problems requires trying to determine the state 
>>>>> of the frame push through a series of checks, and then adapting 
>>>>> what is considered to be the current frame based on the outcome of 
>>>>> the checks. The first things checked is that frame->method is 
>>>>> valid (we can successfully instantiate a wrapper for the Method* 
>>>>> without failure) and that frame->bcp is within the method. If both 
>>>>> these pass then we can use the frame as-is.
>>>>>
>>>>> If the above checks fail, then we try to determine whether the 
>>>>> issue is that the frame is not yet pushed and the current frame is 
>>>>> actually compiled, or the frame has been pushed but not yet 
>>>>> initialized. This is done by first getting the return address from 
>>>>> the stack or RAX (it's location depends on how far along we are in 
>>>>> the entry code) and comparing this to what is stored in 
>>>>> frame->return_addr. If they are the same, then we have pushed the 
>>>>> frame but not yet initialized it. In this case we use the previous 
>>>>> frame (senderSP() and senderFP()) as the current frame since the 
>>>>> current frame is not yet initialized. If the return address check 
>>>>> fails, then we assume the new frame is not yet pushed, and and 
>>>>> treat the current frame as compiled, even though PC points into 
>>>>> the interpreter (we replace PC with RAX in this case).
>>>>>
>>>>> Comments in the code pretty well explain all the above, so it is 
>>>>> probably easier to follow the logic in the code along with the 
>>>>> comments rather than apply my above description to the code.
>>>>>
>>>>> I should add that it's very rare that we ever get into this 
>>>>> special error handling code. This bug was very hard to reproduce 
>>>>> initially. I was only able to make progress with reproducing and 
>>>>> debugging by inserting delay loops in various spots in the code 
>>>>> generated by 
>>>>> TemplateInterpreterGenerator::generate_normal_entry(). By doing 
>>>>> this I was able to reproduce the issue quite easily and hit all 
>>>>> the logic in the new code I've added.
>>>>>
>>>>> The fix is basically entirely contained within 
>>>>> AMD64CurrentFrameGuess.java. The rest of the changes are minor:
>>>>>
>>>>> src/jdk.hotspot.agent/share/classes/sun/jvm/hotspot/runtime/amd64/AMD64CurrentFrameGuess.java 
>>>>>
>>>>> -Main fix for CR
>>>>>
>>>>> src/jdk.hotspot.agent/share/classes/sun/jvm/hotspot/runtime/x86/X86Frame.java 
>>>>>
>>>>> -Added getInterpreterFrameBCP(), which is now needed by 
>>>>> AMD64CurrentFrameGuess.java
>>>>> -I also simplified some code by using the existing 
>>>>> getInterpreterFrameMethod()
>>>>>  rather than replicating inline what it does.
>>>>>
>>>>> src/jdk.hotspot.agent/share/classes/sun/jvm/hotspot/debugger/bsd/amd64/BsdAMD64CFrame.java 
>>>>>
>>>>> -I noticed the windows version of this code had some extra checks 
>>>>> that were missing
>>>>>  from the bsd version. I then looked at the linux version, but it 
>>>>> had been heavily modified
>>>>>  a short while back to leverage DWARF info to determine frames. So 
>>>>> I looked at the previous
>>>>>  rev and it too had these extra checks. I decided to add them to 
>>>>> the BSD port. I'm not sure
>>>>>  if it helps at all, but it certainly doesn't seem to do any harm.
>>>>>
>>>>> thanks,
>>>>>
>>>>> Chris
>>>>>
>>>>
>>>
>>>
>>
>
>



More information about the hotspot-runtime-dev mailing list