RFR(S): JDK-8231635: SA Stackwalking code stuck in BasicTypeDataBase.findDynamicTypeForAddress()
Chris Plummer
chris.plummer at oracle.com
Fri Nov 8 21:27:12 UTC 2019
Ping!
On 11/7/19 2:01 PM, Chris Plummer wrote:
> Hi,
>
> Please review the following fix for JDK-8231635:
>
> https://bugs.openjdk.java.net/browse/JDK-8231635
> http://cr.openjdk.java.net/~cjplummer/8231635/webrev.00/
>
> I've tried to explain below to the best of my ability what's is going
> on, but keep in mind that I basically had no background in this area
> before looking into this CR, so this is all new to me. Please feel
> free to chime in with corrections to my explanation, or any additional
> insight that might help to further understanding of this code.
>
> When doing a thread stack dump, SA has to figure out the SP for the
> current frame when it may not in fact be stored anywhere. So it goes
> through a series of guesses, starting with the current value of SP.
> See AMD64CurrentFrameGuess.run():
>
> Address sp = context.getRegisterAsAddress(AMD64ThreadContext.RSP);
>
> There are a number of checks done to see if this is the SP for the
> actual current frame, one of the checks being (and kind of a last
> resort) to follow the frame links and see if they eventually lead to
> the first entry frame:
>
> while (frame != null) {
> if (frame.isEntryFrame() && frame.entryFrameIsFirst()) {
> ...
> return true;
> }
> frame = frame.sender(map);
> }
>
> If this fails, there is an outer loop to try the next address:
>
> for (long offset = 0;
> offset < regionInBytesToSearch;
> offset += vm.getAddressSize()) {
>
> Note that offset is added to the initial SP value that was fetched
> from RSP. This approach is fraught with danger, because SP could be
> incorrect, and you can easily follow a bad frame link to an invalid
> address. So the body of this loop is in a try block that catches all
> Exceptions, and simply retries with the next offset if one is caught.
> Exceptions could be ones like UnalignedAddressException or
> UnmappedAddressException.
>
> The bug in question turns up with the following harmless looking line:
>
> frame = frame.sender(map);
>
> This is fine if you know that "frame" is valid, but what if it is not
> (which is very commonly the case). The frame values (SP, FP, and PC)
> in the returned frame could be just about anything, including being
> the same as the previous frame. This is what will happen if the SP
> stored in "frame" is the same as the SP that was used to initialize
> "frame" in the first place. This can certainly happen when SP is not
> valid to start with, and is indeed what caused this bug. The end
> result is the inner while loop gets stuck in an infinite loop
> traversing the same frame. So the fix is to add a check for this to
> make sure to break out of the while loop if this happens. Initially I
> did this with an Address.equal() call, and that seemed to fix the
> problem, but then I realized it would be possible to traverse through
> one or more sender frames and eventually end up returning to a
> previously visited frame, thus still an infinite loop. So I decided on
> checking for Address.lessThanOrEqual() instead since the send frame's
> SP should always be greater than the current frame's (referred to as
> oldFrame) SP. As long as we always move in one direction (towards a
> higher frame address), you can't have an infinite loop in this code.
>
> I applied this fix to x86. Although not tested, it is built (all
> platform support is always built with SA). The x86 and amd64 versions
> are identical except for x86/amd64 references, so I thought it best to
> go ahead and do the update to x86. I did not touch ppc, but would be
> willing to update if someone passes along a fix that is tested.
>
> One final bit of clarification. The bug synopsis mentions getting
> stuck in BasicTypeDataBase.findDynamicTypeForAddress(). This turns out
> to not actually be the case, but every stack trace I initially looked
> when I filed this CR was showing the thread being in this frame and at
> the same line number. This appears to be the next available safepoint
> where the thread can be suspended for stack dumping. When debugging
> this some more and adding a lot of println() calls in a lot of
> different locations, I started to see different frames in the
> stacktrace, presumably because the println() calls where adding
> additional safepoints.
>
> thanks,
>
> Chris
>
More information about the serviceability-dev
mailing list