RFR(S): JDK-8231635: SA Stackwalking code stuck in BasicTypeDataBase.findDynamicTypeForAddress()

Thu Nov 7 22:01:50 UTC 2019

Hi,

Please review the following fix for JDK-8231635:

https://bugs.openjdk.java.net/browse/JDK-8231635
http://cr.openjdk.java.net/~cjplummer/8231635/webrev.00/

I've tried to explain below to the best of my ability what's is going 
on, but keep in mind that I basically had no background in this area 
before looking into this CR, so this is all new to me. Please feel free 
to chime in with corrections to my explanation, or any additional 
insight that might help to further understanding of this code.

When doing a thread stack dump, SA has to figure out the SP for the 
current frame when it may not in fact be stored anywhere. So it goes 
through a series of guesses, starting with the current value of SP. See 
AMD64CurrentFrameGuess.run():

     Address sp  = context.getRegisterAsAddress(AMD64ThreadContext.RSP);

There are a number of checks done to see if this is the SP for the 
actual current frame, one of the checks being (and kind of a last 
resort) to follow the frame links and see if they eventually lead to the 
first entry frame:

             while (frame != null) {
               if (frame.isEntryFrame() && frame.entryFrameIsFirst()) {
                  ...
                  return true;
               }
               frame = frame.sender(map);
             }

If this fails, there is an outer loop to try the next address:

         for (long offset = 0;
              offset < regionInBytesToSearch;
              offset += vm.getAddressSize()) {

Note that offset is added to the initial SP value that was fetched from 
RSP. This approach is fraught with danger, because SP could be 
incorrect, and you can easily follow a bad frame link to an invalid 
address. So the body of this loop is in a try block that catches all 
Exceptions, and simply retries with the next offset if one is caught. 
Exceptions could be ones like UnalignedAddressException or 
UnmappedAddressException.

The bug in question turns up with the following harmless looking line:

               frame = frame.sender(map);

This is fine if you know that "frame" is valid, but what if it is not 
(which is very commonly the case). The frame values (SP, FP, and PC) in 
the returned frame could be just about anything, including being the 
same as the previous frame. This is what will happen if the SP stored in 
"frame" is the same as the SP that was used to initialize "frame" in the 
first place. This can certainly happen when SP is not valid to start 
with, and is indeed what caused this bug. The end result is the inner 
while loop gets stuck in an infinite loop traversing the same frame. So 
the fix is to add a check for this to make sure to break out of the 
while loop if this happens. Initially I did this with an Address.equal() 
call, and that seemed to fix the problem, but then I realized it would 
be possible to traverse through one or more sender frames and eventually 
end up returning to a previously visited frame, thus still an infinite 
loop. So I decided on checking for Address.lessThanOrEqual() instead 
since the send frame's SP should always be greater than the current 
frame's (referred to as oldFrame) SP. As long as we always move in one 
direction (towards a higher frame address), you can't have an infinite 
loop in this code.

I applied this fix to x86. Although not tested, it is built (all 
platform support is always built with SA). The x86 and amd64 versions 
are identical except for x86/amd64 references, so I thought it best to 
go ahead and do the update to x86. I did not touch ppc, but would be 
willing to update if someone passes along a fix that is tested.

One final bit of clarification. The bug synopsis mentions getting stuck 
in BasicTypeDataBase.findDynamicTypeForAddress(). This turns out to not 
actually be the case, but every stack trace I initially looked when I 
filed this CR was showing the thread being in this frame and at the same 
line number. This appears to be the next available safepoint where the 
thread can be suspended for stack dumping. When debugging this some more 
and adding a lot of println() calls in a lot of different locations, I 
started to see different frames in the stacktrace, presumably because 
the println() calls where adding additional safepoints.

thanks,

Chris