RFR(S): JDK-8231635: SA Stackwalking code stuck in BasicTypeDataBase.findDynamicTypeForAddress()
Chris Plummer
chris.plummer at oracle.com
Thu Nov 7 22:01:50 UTC 2019
Hi,
Please review the following fix for JDK-8231635:
https://bugs.openjdk.java.net/browse/JDK-8231635
http://cr.openjdk.java.net/~cjplummer/8231635/webrev.00/
I've tried to explain below to the best of my ability what's is going
on, but keep in mind that I basically had no background in this area
before looking into this CR, so this is all new to me. Please feel free
to chime in with corrections to my explanation, or any additional
insight that might help to further understanding of this code.
When doing a thread stack dump, SA has to figure out the SP for the
current frame when it may not in fact be stored anywhere. So it goes
through a series of guesses, starting with the current value of SP. See
AMD64CurrentFrameGuess.run():
Address sp = context.getRegisterAsAddress(AMD64ThreadContext.RSP);
There are a number of checks done to see if this is the SP for the
actual current frame, one of the checks being (and kind of a last
resort) to follow the frame links and see if they eventually lead to the
first entry frame:
while (frame != null) {
if (frame.isEntryFrame() && frame.entryFrameIsFirst()) {
...
return true;
}
frame = frame.sender(map);
}
If this fails, there is an outer loop to try the next address:
for (long offset = 0;
offset < regionInBytesToSearch;
offset += vm.getAddressSize()) {
Note that offset is added to the initial SP value that was fetched from
RSP. This approach is fraught with danger, because SP could be
incorrect, and you can easily follow a bad frame link to an invalid
address. So the body of this loop is in a try block that catches all
Exceptions, and simply retries with the next offset if one is caught.
Exceptions could be ones like UnalignedAddressException or
UnmappedAddressException.
The bug in question turns up with the following harmless looking line:
frame = frame.sender(map);
This is fine if you know that "frame" is valid, but what if it is not
(which is very commonly the case). The frame values (SP, FP, and PC) in
the returned frame could be just about anything, including being the
same as the previous frame. This is what will happen if the SP stored in
"frame" is the same as the SP that was used to initialize "frame" in the
first place. This can certainly happen when SP is not valid to start
with, and is indeed what caused this bug. The end result is the inner
while loop gets stuck in an infinite loop traversing the same frame. So
the fix is to add a check for this to make sure to break out of the
while loop if this happens. Initially I did this with an Address.equal()
call, and that seemed to fix the problem, but then I realized it would
be possible to traverse through one or more sender frames and eventually
end up returning to a previously visited frame, thus still an infinite
loop. So I decided on checking for Address.lessThanOrEqual() instead
since the send frame's SP should always be greater than the current
frame's (referred to as oldFrame) SP. As long as we always move in one
direction (towards a higher frame address), you can't have an infinite
loop in this code.
I applied this fix to x86. Although not tested, it is built (all
platform support is always built with SA). The x86 and amd64 versions
are identical except for x86/amd64 references, so I thought it best to
go ahead and do the update to x86. I did not touch ppc, but would be
willing to update if someone passes along a fix that is tested.
One final bit of clarification. The bug synopsis mentions getting stuck
in BasicTypeDataBase.findDynamicTypeForAddress(). This turns out to not
actually be the case, but every stack trace I initially looked when I
filed this CR was showing the thread being in this frame and at the same
line number. This appears to be the next available safepoint where the
thread can be suspended for stack dumping. When debugging this some more
and adding a lot of println() calls in a lot of different locations, I
started to see different frames in the stacktrace, presumably because
the println() calls where adding additional safepoints.
thanks,
Chris
More information about the serviceability-dev
mailing list