RFR(S): JDK-8231635: SA Stackwalking code stuck in BasicTypeDataBase.findDynamicTypeForAddress()
Daniil Titov
daniil.x.titov at oracle.com
Wed Nov 13 05:08:07 UTC 2019
Hi Chris,
The change looks good to me.
Thanks!
--Daniil
On 11/12/19, 11:06 AM, "serviceability-dev on behalf of Chris Plummer" <serviceability-dev-bounces at openjdk.java.net on behalf of chris.plummer at oracle.com> wrote:
Thanks Serguei!
Can I get one more review please?
thanks,
Chris
On 11/8/19 4:00 PM, serguei.spitsyn at oracle.com wrote:
> Hi Chris,
>
> This seems to be a good fix to have in any case.
> This check and bail out is right thing to do and should not break
> anything.
> I understand, this also fixes the test failures.
>
> I only had some experience a long time ago with the support of pstack
> and DTrace jstack action implementation which also does such SP
> recovering because the ebp can be used by JIT compiler as a general
> purpose register. There is no such a problem on sparc.
>
> Thanks,
> Serguei
>
>
> On 11/7/19 14:01, Chris Plummer wrote:
>> Hi,
>>
>> Please review the following fix for JDK-8231635:
>>
>> https://bugs.openjdk.java.net/browse/JDK-8231635
>> http://cr.openjdk.java.net/~cjplummer/8231635/webrev.00/
>>
>> I've tried to explain below to the best of my ability what's is going
>> on, but keep in mind that I basically had no background in this area
>> before looking into this CR, so this is all new to me. Please feel
>> free to chime in with corrections to my explanation, or any
>> additional insight that might help to further understanding of this
>> code.
>>
>> When doing a thread stack dump, SA has to figure out the SP for the
>> current frame when it may not in fact be stored anywhere. So it goes
>> through a series of guesses, starting with the current value of SP.
>> See AMD64CurrentFrameGuess.run():
>>
>> Address sp = context.getRegisterAsAddress(AMD64ThreadContext.RSP);
>>
>> There are a number of checks done to see if this is the SP for the
>> actual current frame, one of the checks being (and kind of a last
>> resort) to follow the frame links and see if they eventually lead to
>> the first entry frame:
>>
>> while (frame != null) {
>> if (frame.isEntryFrame() && frame.entryFrameIsFirst()) {
>> ...
>> return true;
>> }
>> frame = frame.sender(map);
>> }
>>
>> If this fails, there is an outer loop to try the next address:
>>
>> for (long offset = 0;
>> offset < regionInBytesToSearch;
>> offset += vm.getAddressSize()) {
>>
>> Note that offset is added to the initial SP value that was fetched
>> from RSP. This approach is fraught with danger, because SP could be
>> incorrect, and you can easily follow a bad frame link to an invalid
>> address. So the body of this loop is in a try block that catches all
>> Exceptions, and simply retries with the next offset if one is caught.
>> Exceptions could be ones like UnalignedAddressException or
>> UnmappedAddressException.
>>
>> The bug in question turns up with the following harmless looking line:
>>
>> frame = frame.sender(map);
>>
>> This is fine if you know that "frame" is valid, but what if it is not
>> (which is very commonly the case). The frame values (SP, FP, and PC)
>> in the returned frame could be just about anything, including being
>> the same as the previous frame. This is what will happen if the SP
>> stored in "frame" is the same as the SP that was used to initialize
>> "frame" in the first place. This can certainly happen when SP is not
>> valid to start with, and is indeed what caused this bug. The end
>> result is the inner while loop gets stuck in an infinite loop
>> traversing the same frame. So the fix is to add a check for this to
>> make sure to break out of the while loop if this happens. Initially I
>> did this with an Address.equal() call, and that seemed to fix the
>> problem, but then I realized it would be possible to traverse through
>> one or more sender frames and eventually end up returning to a
>> previously visited frame, thus still an infinite loop. So I decided
>> on checking for Address.lessThanOrEqual() instead since the send
>> frame's SP should always be greater than the current frame's
>> (referred to as oldFrame) SP. As long as we always move in one
>> direction (towards a higher frame address), you can't have an
>> infinite loop in this code.
>>
>> I applied this fix to x86. Although not tested, it is built (all
>> platform support is always built with SA). The x86 and amd64 versions
>> are identical except for x86/amd64 references, so I thought it best
>> to go ahead and do the update to x86. I did not touch ppc, but would
>> be willing to update if someone passes along a fix that is tested.
>>
>> One final bit of clarification. The bug synopsis mentions getting
>> stuck in BasicTypeDataBase.findDynamicTypeForAddress(). This turns
>> out to not actually be the case, but every stack trace I initially
>> looked when I filed this CR was showing the thread being in this
>> frame and at the same line number. This appears to be the next
>> available safepoint where the thread can be suspended for stack
>> dumping. When debugging this some more and adding a lot of println()
>> calls in a lot of different locations, I started to see different
>> frames in the stacktrace, presumably because the println() calls
>> where adding additional safepoints.
>>
>> thanks,
>>
>> Chris
>>
>
More information about the serviceability-dev
mailing list