RFR(M): 8244383: jhsdb/HeapDumpTestWithActiveProcess.java fails with "AssertionFailure: illegal bci"

Wed Jun 24 03:04:50 UTC 2020

On 6/23/20 7:07 PM, Daniel D. Daugherty wrote:
> Just one more comment on this part:
>
>> >    L220: System.out.println("CurrentFrameGuess: choosing 
>> interpreter frame: sp = " +
>> >    L221:                                  spFound + ", fpFound = " 
>> + fp + ", pcFound = " + pc);
>> >        This debug output doesn't make sense to me:
>> >
>> >            "sp = " label and 'spFound' value
>> >            "fpFound = " label and 'fp' value
>> >            "pcFound = " label and 'pc' value
>>
>>         but I may not have enough context...
>> From the point of view of the person reading the output, they want to 
>> know the values for sp, fp, and pc. But within the code these values 
>> are stored in the "found" variables. 
>
> In that case, the code is wrong for the 'fp' and 'pc' outputs
> since you changed the labels and not the variables.
Yes, you are correct. I'll fix the output for fp and pc.

thanks,

Chris
>
> Dan
>
>
>
> On 6/23/20 7:28 PM, Chris Plummer wrote:
>> On 6/23/20 2:44 PM, Daniel D. Daugherty wrote:
>>> On 6/18/20 8:54 PM, Chris Plummer wrote:
>>>> [I've added runtime-dev to this SA review since understanding 
>>>> interpreter invokes (code generated by 
>>>> TemplateInterpreterGenerator::generate_normal_entry()) and stack 
>>>> walking is probably more important than understanding SA.]
>>>>
>>>> Hello,
>>>>
>>>> Please help review the following:
>>>>
>>>> https://bugs.openjdk.java.net/browse/JDK-8244383
>>>> http://cr.openjdk.java.net/~cjplummer/8244383/webrev.00/index.html
>>>
>> Thanks for helping!
>>> Sorry for the delay in reviewing this one. I've come back to it a 
>>> couple
>>> of times because code like this is very hard to review.
>>>
>>>
>>> General comment:
>>>     This fix reminds of the crazy things that AsyncGetCallTrace has to
>>>     do in order to gather call trace data. I'm guessing that SA is
>>>     attaching to the VM in an asynchronous manner and that's why it
>>>     can observe things like partially constructed frames. If that's a
>>>     correct guess, then how is SA stopping/suspending the threads?
>>>     I'm just curious here.
>> On linux SA uses ptrace. I'm not familiar with the details of how it 
>> works. I'm not sure where ptrace allows suspends to happen, but 
>> certainly it has no knowledge of JVM safepoints or other 
>> synchronization that the JVM does. So from the JVM and SA point of 
>> view the suspend can happen at any arbitrary JVM instruction.
>>
>> From what I can gather, PTRACE_ATTACH suspends the entire process, so 
>> that means all threads are suspended once you attach. However, 
>> PTRACE_GETREGS can be called on individual threads (LWPs), but I 
>> don't see any indication in the SA code that you need to attach to 
>> each LWP first.
>>>
>>>     Or this might be a case where SA is examining a core file in
>>>     which case the various threads stacks are not necessarily at
>>>     good/safepoint-safe pause points.
>> For this bug and test it's a live process, but I think the bug being 
>> addressed here can happen just as well with a core file. 
>> Unfortunately we have very little core file testing support. I'm 
>> actually in the middle of addressing that right now.
>>>
>>> src/jdk.hotspot.agent/share/classes/sun/jvm/hotspot/runtime/x86/X86Frame.java 
>>>
>>>     No comments.
>>>
>>> src/jdk.hotspot.agent/share/classes/sun/jvm/hotspot/debugger/bsd/amd64/BsdAMD64CFrame.java 
>>>
>>>     No comments.
>>>
>>> src/jdk.hotspot.agent/share/classes/sun/jvm/hotspot/runtime/amd64/AMD64CurrentFrameGuess.java 
>>>
>>>     L104:     // two locations, then we canot determine the frame.
>>>         typo: s/canot/cannot/
>> ok
>>>
>>>     L127:     // it's validity will help us determine the state of 
>>> the new frame push.
>>>         typo: s/it's/its/
>> ok
>>>
>>>     L148:         System.out.println("CurrentFrameGuess: frame 
>>> pushed but not initaliazed.");
>>>         typo: s/initaliazed/initialized/
>> ok
>>>
>>>     L220:               System.out.println("CurrentFrameGuess: 
>>> choosing interpreter frame: sp = " +
>>>     L221:                                  spFound + ", fpFound = " 
>>> + fp + ", pcFound = " + pc);
>>>         This debug output doesn't make sense to me:
>>>
>>>             "sp = " label and 'spFound' value
>>>             "fpFound = " label and 'fp' value
>>>             "pcFound = " label and 'pc' value
>>>
>>>         but I may not have enough context...
>> From the point of view of the person reading the output, they want to 
>> know the values for sp, fp, and pc. But within the code these values 
>> are stored in the "found" variables.
>>>
>>> With code like this, it's really hard to figure out if you've covered
>>> all the cases unless you've been in the observer seat yourself and
>>> even then your test runs may not hit all the possible cases. All you
>>> can really do is start with a set of adaptive changes, run with those
>>> for a while and tweak them as you gather more observations.
>> Yes, and I know there is still a very tiny gap or two in coverage 
>> that are maybe one or two instructions long, but they aren't worth 
>> dealing with. This bug was already very rare, and with the fixes I've 
>> done I don't see any issues now. SA is a debugger, so perfection in 
>> this regard is not expected.
>>>
>>> Chris, nice job with this bit of insanity!
>> Thanks! I mostly stuck with this one to help with my SA expertise. 
>> Otherwise it wouldn't have been worth the time.
>>
>> Chris
>>>
>>> Thumbs up!
>>>
>>> Dan
>>>
>>>
>>>
>>>>
>>>> The crux of the bug is when doing stack walking the topmost frame 
>>>> is in an inconsistent state because we are in the middle of pushing 
>>>> a new interpreter frame. Basically we are executing code generated 
>>>> by TemplateInterpreterGenerator::generate_normal_entry(). Since the 
>>>> PC register is in this code, SA assumes the topmost frame is an 
>>>> interpreter frame.
>>>>
>>>> The first issue with this interpreter frame assumption is if we 
>>>> haven't actually pushed the frame yet, then the current frame is 
>>>> the caller's frame, and could be compiled. But since SA thinks it's 
>>>> interpreted, later on it tries to convert the frame->bcp to a BCI, 
>>>> but frame->bcp is only valid for interpreter frames. Thus the 
>>>> "illegal BCI" failures. If the previous frame happened to be 
>>>> interpreted, then the existing SA code works fine.
>>>>
>>>> The other state of frame pushing that was problematic was when the 
>>>> new frame had been pushed, but frame->method and frame->bcp were 
>>>> not setup yet. This also would lead to "illegal BCI" later on 
>>>> because garbage would be stored in these locations.
>>>>
>>>> Fixing the above problems requires trying to determine the state of 
>>>> the frame push through a series of checks, and then adapting what 
>>>> is considered to be the current frame based on the outcome of the 
>>>> checks. The first things checked is that frame->method is valid (we 
>>>> can successfully instantiate a wrapper for the Method* without 
>>>> failure) and that frame->bcp is within the method. If both these 
>>>> pass then we can use the frame as-is.
>>>>
>>>> If the above checks fail, then we try to determine whether the 
>>>> issue is that the frame is not yet pushed and the current frame is 
>>>> actually compiled, or the frame has been pushed but not yet 
>>>> initialized. This is done by first getting the return address from 
>>>> the stack or RAX (it's location depends on how far along we are in 
>>>> the entry code) and comparing this to what is stored in 
>>>> frame->return_addr. If they are the same, then we have pushed the 
>>>> frame but not yet initialized it. In this case we use the previous 
>>>> frame (senderSP() and senderFP()) as the current frame since the 
>>>> current frame is not yet initialized. If the return address check 
>>>> fails, then we assume the new frame is not yet pushed, and and 
>>>> treat the current frame as compiled, even though PC points into the 
>>>> interpreter (we replace PC with RAX in this case).
>>>>
>>>> Comments in the code pretty well explain all the above, so it is 
>>>> probably easier to follow the logic in the code along with the 
>>>> comments rather than apply my above description to the code.
>>>>
>>>> I should add that it's very rare that we ever get into this special 
>>>> error handling code. This bug was very hard to reproduce initially. 
>>>> I was only able to make progress with reproducing and debugging by 
>>>> inserting delay loops in various spots in the code generated by 
>>>> TemplateInterpreterGenerator::generate_normal_entry(). By doing 
>>>> this I was able to reproduce the issue quite easily and hit all the 
>>>> logic in the new code I've added.
>>>>
>>>> The fix is basically entirely contained within 
>>>> AMD64CurrentFrameGuess.java. The rest of the changes are minor:
>>>>
>>>> src/jdk.hotspot.agent/share/classes/sun/jvm/hotspot/runtime/amd64/AMD64CurrentFrameGuess.java 
>>>>
>>>> -Main fix for CR
>>>>
>>>> src/jdk.hotspot.agent/share/classes/sun/jvm/hotspot/runtime/x86/X86Frame.java 
>>>>
>>>> -Added getInterpreterFrameBCP(), which is now needed by 
>>>> AMD64CurrentFrameGuess.java
>>>> -I also simplified some code by using the existing 
>>>> getInterpreterFrameMethod()
>>>>  rather than replicating inline what it does.
>>>>
>>>> src/jdk.hotspot.agent/share/classes/sun/jvm/hotspot/debugger/bsd/amd64/BsdAMD64CFrame.java 
>>>>
>>>> -I noticed the windows version of this code had some extra checks 
>>>> that were missing
>>>>  from the bsd version. I then looked at the linux version, but it 
>>>> had been heavily modified
>>>>  a short while back to leverage DWARF info to determine frames. So 
>>>> I looked at the previous
>>>>  rev and it too had these extra checks. I decided to add them to 
>>>> the BSD port. I'm not sure
>>>>  if it helps at all, but it certainly doesn't seem to do any harm.
>>>>
>>>> thanks,
>>>>
>>>> Chris
>>>>
>>>
>>
>>
>