RFR: JDK-8296437: NMT incurs costs if disabled [v2]
Thomas Stuefe
stuefe at openjdk.org
Sat Nov 12 07:05:32 UTC 2022
> While investigating the performance of the os::malloc wrapper, I noticed that we spend a lot of cycles copying empty callstacks around, even if NMT is disabled.
>
> The CURRENT_PC and CALLER_PC macros are used to create `NativeCallStack` objects out of thin air :
>
>
> #define CURRENT_PC ((MemTracker::tracking_level() == NMT_detail) ? \
> NativeCallStack(0) : NativeCallStack::empty_stack())
> #define CALLER_PC ((MemTracker::tracking_level() == NMT_detail) ? \
> NativeCallStack(1) : NativeCallStack::empty_stack())
>
>
> and feed them to a callee routine, which usually has the argument defined via const reference, e.g. os::malloc:
>
>
> void* os::malloc(size_t size, MEMFLAGS memflags, const NativeCallStack& stack);
>
>
> In CURRENT|CALLER_PC, the left hand of the ':' operator handles the detail mode, when we actually do collect a stack. In that case, the stack sits on the thread stack as an automatic anonymous variable and is filled by the stack walker. The right-hand of ':' handles the case when we don't want a stack. In that case, the intent is to hand down the reference to a pre-created "empty stack" singleton (NativeCallStack::empty_stack()).
>
> However, that does not work as intended. The C++ compiler - at least gcc on linux - interprets these as copy-by-value and generates code that always laboriously copies the content of the empty stack singleton onto the thread stack. It uses four SSE instructions - two 16byte loads, and two 16byte moves (the NMT stacks are by default 4 frames, so 4 pointer-sized slots):
>
>
> 0000000000cb9a60 <_ZN2os6mallocEm8MEMFLAGS>:
> ...
> # Load tracking level
> cb9a77: 48 8d 1d 02 35 78 00 lea 0x783502(%rip),%rbx # 143cf80 <_ZN10MemTracker15_tracking_levelE>
> cb9a7e: 8b 03 mov (%rbx),%eax
> # detail (3) tracking?
> cb9a80: 83 f8 03 cmp $0x3,%eax
> # yes: go and collect callstack
> cb9a83: 0f 84 57 01 00 00 je cb9be0 <_ZN2os6mallocEm8MEMFLAGS+0x180>
> # no: copy the content of NativeCallStack::_empty_stack to the local stack, in 16 byte intervals:
> cb9a89: 48 8d 05 30 44 78 00 lea 0x784430(%rip),%rax # 143dec0 <_ZN15NativeCallStack12_empty_stackE>
> cb9a90: f3 0f 6f 00 movdqu (%rax),%xmm0
> cb9a94: f3 0f 6f 48 10 movdqu 0x10(%rax),%xmm1
> cb9a99: 0f 11 45 c0 movups %xmm0,-0x40(%rbp)
> cb9a9d: 0f 11 4d d0 movups %xmm1,-0x30(%rbp)
> ...
> # do the actual malloc:
> cb9af8: e8 c3 40 5d ff callq 28dbc0 <malloc at plt>
>
> # call MallocTracker::record_malloc() and hand down pointer to NMT stack (4th argument->RCX):
> cb9b0f: 48 8d 4d c0 lea -0x40(%rbp),%rcx
> ...
> cb9b19: e8 f2 b7 f3 ff callq bf5310 <_ZN13MallocTracker13record_mallocEPvm8MEMFLAGSRK15NativeCallStack>
>
>
> This is completely unnecessary, since if NMT mode != detail, the stack is never used. This hits every call site where these macros are used, and we pay if NMT is disabled.
>
> ---------------------
>
> The patch changes the macros to avoid initialization of `NativeCallStack` if NMT is off or in summary mode only.
>
> This was a bit tricky to do, since I wanted the compiler to not do anything if NMT is disabled, and of course I did not want to change the semantics of CALLER|CURRENT_PC.
>
> In the end I settled for exchanging the explicit calls to `NativeCallStack::empty_stack()` to calls to the default constructor. I changed the default constructor to a no-op. So the NativeCallStack object is not initialized, the compiler optimizes the empty constructor call away. In NMT=off, we are done; in NMT=summary mode, we now just hand down the pointer to the uninitialized NativeCallStack to MallocTracker::record_malloc(), which will ignore it anyway:
>
>
> 0000000000cb98f0 <_ZN2os6mallocEm8MEMFLAGS>:
> ...
> # load tracking level
> cb9907: 48 8d 1d 72 46 78 00 lea 0x784672(%rip),%rbx # 143df80 <_ZN10MemTracker15_tracking_levelE>
> cb990e: 8b 03 mov (%rbx),%eax
> # detail (3) tracking?
> cb9910: 83 f8 03 cmp $0x3,%eax
> # yes: go and collect callstack
> cb9913: 0f 84 37 01 00 00 je cb9a50 <_ZN2os6mallocEm8MEMFLAGS+0x160>
> # no: nothing more to do ...
> ...
> # do the actual malloc:
> cb9af8: e8 c3 40 5d ff callq 28dbc0 <malloc at plt>
> ...
> # call MallocTracker::record_malloc() and hand down pointer to NMT stack (4th argument->RCX). The stack remains uninitialized, that is fine, since the MallocTracker will ignore it anyway:
> cb9987: 48 8d 4d c0 lea -0x40(%rbp),%rcx
> ..
> cb9991: e8 ba b8 f3 ff callq bf5250 <_ZN13MallocTracker13record_mallocEPvm8MEMFLAGSRK15NativeCallStack>
>
>
> There were only two callers of the default constructor that used it, and I changed them to use `NativeCallStack ncs(NULL, 0);` which is functionally equivalent.
>
> --------------
>
> Results:
>
> When profiling, I see os::malloc now needs less cycles, and the hotspot around the xmm instructions is not there anymore.
Thomas Stuefe has updated the pull request incrementally with two additional commits since the last revision:
- reduce unnecessary diffs
- explicit constructor for fake callstacks; revert default ctor
-------------
Changes:
- all: https://git.openjdk.org/jdk/pull/11040/files
- new: https://git.openjdk.org/jdk/pull/11040/files/a9327485..8657b2d4
Webrevs:
- full: https://webrevs.openjdk.org/?repo=jdk&pr=11040&range=01
- incr: https://webrevs.openjdk.org/?repo=jdk&pr=11040&range=00-01
Stats: 35 lines in 5 files changed: 20 ins; 2 del; 13 mod
Patch: https://git.openjdk.org/jdk/pull/11040.diff
Fetch: git fetch https://git.openjdk.org/jdk pull/11040/head:pull/11040
PR: https://git.openjdk.org/jdk/pull/11040
More information about the hotspot-dev
mailing list