RFR(M): 8133749, 8133747, 8133740: NMT detail stack trace cleanup
Chris Plummer
chris.plummer at oracle.com
Fri Aug 5 07:05:55 UTC 2016
Hi David,
If fixing os::current_frame() to have a better name and also make it go
up one less frame makes these changes more palatable, I'm willing to
make that change. I would prefer to do it with a follow up CR (it would
probably have to be an RFE), but will do it with these changes if
necessary. I still pull hairs over the proper name for this method, even
if it is modified to return the frame of whoever called it. Usually the
meaning conveyed by a method's name does not change based on whether you
choose the caller's or callee's point of view, but in this case it does,
and I'm not sure which point of view makes more sense. If we choose the
caller's point of view, then the proper name remains
os::current_frame(). If we choose the callee's point of view, then it
should be os::callers_frame(). Maybe there's a name that is agnostic and
means the same thing from both view points. I just haven't thought of
one yet.
With respect to ALWAYSINLINE, it does not work for solaris and windows
slowdebug builds. Note the special case in the test I wrote to allow for
AllocateHeap() in the stack trace in this case, even though it shouldn't
be there because it uses ALWAYSINLINE. I could have made changes in the
source to get rid of it from the stack trace, but I didn't feel the
source code disruption was worth it for a slowdebug build, especially
since there are only a allocation call sites where it is a problem. I
could use ALWAYSINLINE for the cases where it will work to inline
_get_previous_fp, but I don't really see that as being any more reliable
than what is there now.
As for making _get_previous_fp() a macro, that's made more complicated
because it has #ifdefs already. I could move its implementation directly
into os::current_frame(). That would fix the inlining problem. I think
it could also use some cleanup with the #ifdefs. For example, for
linux-x86 do we have to worry about the SPARC_WORKS and __clang__ cases?
And yes, even with my changes the code is no less fragile, and no less
misdirected in its approach to getting a consistent allocation back
trace. As I see it, there are 3 options:
(1) Do nothing, and leave it both broken and fragile.
(2) Do the cleanup I've done to at least correct the known stack trace
issues.
(3) Find another solution that doesn't suffer from these fragility issues.
Note that (3) does not preclude doing (2) first, and (2) seems a better
alternative than leaving it in its broken state (1). That's why I have
pursued these changes even though I know things will still be fragile.
thanks,
Chris
On 8/4/16 9:47 PM, David Holmes wrote:
> Hi Chris,
>
> On 5/08/2016 7:53 AM, Chris Plummer wrote:
>> Ping!
>
> I took another look at this and my earlier comments from JDK-8133749.
> I hate to see the functionality "fixed" yet still have a completely
> confusing and mis-named API. I'm still far from convinced that
> returning the callers caller wasn't an "error" that was done due to
> the lack of inlining and the appearance of an unexpected stackframe.
> You've now made things consistent - but os::current_frame() is
> completely mis-leading in name. And I'm still concerned that
> correctness here depends on C compiler inlining choices, with no way
> to verify at build time that they were indeed inlined or not! Don't we
> have ALWAYSINLINE to mark things like _get_previous_fp ? For that
> matter shouldn't _get_previous_fp be a macro so inlining plays no role ?
>
> Sorry but this code seems to simply limp from one broken state to
> another due to its fragility.
>
> Thanks,
> David
> -----
>
>> On 8/2/16 1:31 PM, Chris Plummer wrote:
>>> Hello,
>>>
>>> Please review the following:
>>>
>>> webrev:
>>> http://cr.openjdk.java.net/~cjplummer/8133749-8133747-8133740/webrev-01/webrev.hotspot/
>>>
>>>
>>>
>>> Bugs fixed:
>>>
>>> JDK-8133749: os::current_frame() is not returning the proper frame on
>>> ARM and solaris-x64
>>> https://bugs.openjdk.java.net/browse/JDK-8133749
>>>
>>> JDK-8133747: NMT includes an extra stack frame due to assumption NMT
>>> is making on tail calls being used
>>> https://bugs.openjdk.java.net/browse/JDK-8133747
>>>
>>> JDK-8133740: NMT for Linux/x86/x64 and bsd/x64 slowdebug builds
>>> includes NativeCallStack::NativeCallStack() frame in backtrace
>>> https://bugs.openjdk.java.net/browse/JDK-8133740
>>>
>>> The above bugs all result in the NMT detail stack traces including
>>> extra frames in the stack traces. Certain frames are suppose to be
>>> skipped, but sometimes are not. The frames that show up are:
>>>
>>> NativeCallStack::NativeCallStack
>>> os::get_native_stack
>>>
>>> These are both methods used to generate the stack trace, and therefore
>>> should not be included it. However, under some (most) circumstances,
>>> they were.
>>>
>>> Also, there was no test to make sure that any NMT detail output is
>>> generated, or that it is correct. I've added one with this webrev. Of
>>> the 27 possible builds (9 platforms * 3 build flavors), only 9 of the
>>> 27 initially passed this new test. They were the product and fastdebug
>>> builds for solaris-sparc, bsd-x64, and linux-x64; and the slowdebug
>>> builds for solaris-x64, windows-x86, and windows-x64. All the rest
>>> failed. They now all pass with my fixes in place.
>>>
>>> Here's a summary of the changes:
>>>
>>> src/os/posix/vm/os_posix.cpp
>>> src/os/windows/vm/os_windows.cpp
>>>
>>> JDK-8133747 fixes: There was some frame skipping logic here which was
>>> sort of correct, but was misplace. There are no extra frames being
>>> added in os::get_native_stack() due to lack of inlining or lack of a
>>> tail call, so no need for toSkip++ here. The logic has been moved to
>>> NativeCallStack::NativeCallStack, which is where the tail call is
>>> (sometimes) made, and also corrected (see nativeCallStack.cpp below).
>>>
>>> src/share/vm/utilities/nativeCallStack.cpp
>>>
>>> JDK-8133747 fixes: The frame skipping logic that was moved here
>>> assumed that NativeCallStack::NativeCallStack would not appear in the
>>> call stack (due to a tail call be using to call os::get_native_stack)
>>> except in slow debug builds. However, some platforms also don't use a
>>> tail call even when optimized. From what I can tell that is the case
>>> for 32-bit platforms and for windows.
>>>
>>> src/os_cpu/bsd_x86/vm/os_bsd_x86.cpp
>>> src/os_cpu/windows_x86/vm/os_windows_x86.cpp
>>> src/os_cpu/linux_x86/vm/os_linux_x86.cpp
>>>
>>> JDK-8133740 fixes: When _get_previous_fp is not inlined, we need to
>>> skip one extra frame
>>>
>>> src/os_cpu/solaris_x86/vm/os_solaris_x86.cpp
>>>
>>> JDK-8133749 fixes: os:current_frame() was not consistent with other
>>> platforms and needs to skip one more frame. This means it returns the
>>> frame for the caller's caller. So when called by
>>> os:get_native_stack(), it returns the frame for whoever called
>>> os::get_native_stack(). Although not intuitive, this is what
>>> os:get_native_stack() expects. Probably a method rename and/or a
>>> behavior change is justified here, but I would prefer to do that with
>>> a followup CR if anyone has a good suggestion on what to do.
>>>
>>> test/runtime/NMT/CheckForProperDetailStackTrace.java
>>>
>>> This is the new NTM detail test. It checks for frames that shouldn't
>>> be present and validates at least one stack trace is what is expected.
>>>
>>> I verified that the above test now passes on all supported platforms,
>>> and also did a full jprt "-testset hotpot" run. I plan on doing some
>>> RBT testing with NMT detail enabled before committing.
>>>
>>> Regarding the community contributed ports that Oracle does not
>>> support, I didn't make any changes there, but it looks like some of
>>> these bugs do exist. Notably:
>>>
>>> -linux-aarch64: Looks like it suffers from JDK-8133740. The changes
>>> done to the
>>> os_linux_x86.cp should also be applied here.
>>> -linux-ppc: Hard to say for sure since the implementation of
>>> os::current_frame is
>>> different than others, but it looks to me like it suffers from both
>>> JDK-8133749
>>> and JDK-8133740.
>>> -aix-ppc: Looks to be the same implementation as linux-ppc, so would
>>> need the
>>> same changes.
>>>
>>> These ports may also be suffering from JDK-8133747, but that fix is in
>>> shared code (nativeCallStack.cpp). My changes there will need some
>>> tweaking for these ports they don't use a tail call to call
>>> os::get_native_stack().
>>>
>>> If the maintainers of these ports could send me some NMT detail
>>> output, I can advise better on what changes are needed. Then you can
>>> implement and test them, and then send them back to me and I'll
>>> include them with my changes. What I need is the following command run
>>> on product and slowdebug builds. Initially run without any of my
>>> changes applied. If needed I may followup with a request that they be
>>> run with the changes applied:
>>>
>>> bin/java -XX:+UnlockDiagnosticVMOptions
>>> -XX:NativeMemoryTracking=detail -XX:+PrintNMTStatistics -version
>>>
>>> thanks,
>>>
>>> Chris
>>>
>>
More information about the hotspot-runtime-dev
mailing list