RFR: JDK-8296437: NMT incurs costs if disabled
Thomas Stuefe
stuefe at openjdk.org
Tue Nov 8 17:58:06 UTC 2022
While investigating the performance of the os::malloc wrapper, I noticed that we spend a lot of cycles copying empty callstacks around, even if NMT is disabled.
The CURRENT_PC and CALLER_PC macros are used to create `NativeCallStack` objects out of thin air:
#define CURRENT_PC ((MemTracker::tracking_level() == NMT_detail) ? \
NativeCallStack(0) : NativeCallStack::empty_stack())
#define CALLER_PC ((MemTracker::tracking_level() == NMT_detail) ? \
NativeCallStack(1) : NativeCallStack::empty_stack())
and feed them to a callee routine, which usually has the argument defined via const reference, e.g. os::malloc:
void* os::malloc(size_t size, MEMFLAGS memflags, const NativeCallStack& stack);
In CURRENT|CALLER_PC, the operand to the left of the ':' handles detail mode, when we actually do collect a stack. In that case, the stack sits on the thread stack as an anonymous automatic variable and is filled by the stack walker. The operand to the right of the ':' handles the case where we don't want a stack. There, the intent is to hand down a reference to a pre-created "empty stack" singleton (NativeCallStack::empty_stack()).
However, that does not work as intended. The C++ compiler (at least gcc on Linux) treats the expression as copy-by-value and generates code that always laboriously copies the content of the empty stack singleton onto the thread stack. It uses four SSE instructions: two 16-byte loads and two 16-byte stores (NMT stacks hold 4 frames by default, so 4 pointer-sized slots):
0000000000cb9a60 <_ZN2os6mallocEm8MEMFLAGS>:
...
# Load tracking level
cb9a77: 48 8d 1d 02 35 78 00 lea 0x783502(%rip),%rbx # 143cf80 <_ZN10MemTracker15_tracking_levelE>
cb9a7e: 8b 03 mov (%rbx),%eax
# detail (3) tracking?
cb9a80: 83 f8 03 cmp $0x3,%eax
# yes: go and collect callstack
cb9a83: 0f 84 57 01 00 00 je cb9be0 <_ZN2os6mallocEm8MEMFLAGS+0x180>
# no: copy the content of NativeCallStack::_empty_stack to the local stack, in 16 byte intervals:
cb9a89: 48 8d 05 30 44 78 00 lea 0x784430(%rip),%rax # 143dec0 <_ZN15NativeCallStack12_empty_stackE>
cb9a90: f3 0f 6f 00 movdqu (%rax),%xmm0
cb9a94: f3 0f 6f 48 10 movdqu 0x10(%rax),%xmm1
cb9a99: 0f 11 45 c0 movups %xmm0,-0x40(%rbp)
cb9a9d: 0f 11 4d d0 movups %xmm1,-0x30(%rbp)
...
# do the actual malloc:
cb9af8: e8 c3 40 5d ff callq 28dbc0 <malloc@plt>
# call MallocTracker::record_malloc() and hand down pointer to NMT stack (4th argument->RCX):
cb9b0f: 48 8d 4d c0 lea -0x40(%rbp),%rcx
...
cb9b19: e8 f2 b7 f3 ff callq bf5310 <_ZN13MallocTracker13record_mallocEPvm8MEMFLAGSRK15NativeCallStack>
This is completely unnecessary, since the stack is never used unless NMT is in detail mode. It affects every call site where these macros are used, so we pay the cost even if NMT is disabled.
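The copy can be reproduced outside HotSpot with a minimal sketch. `Stack`, `consume` and `PC_OLD` below are hypothetical stand-ins for `NativeCallStack`, the callee, and the macros; none of them is the real code. My understanding is that when one operand of the conditional operator is a temporary and the other is an lvalue of the same class, the result is a temporary, so the lvalue gets copied before the const reference binds to it. The counter makes that copy visible:

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for NativeCallStack: four pointer-sized slots,
// plus a counter that records every copy-constructor call.
struct Stack {
  static int copies;
  void* frames[4];

  explicit Stack(int /*skip_frames*/) : frames{} {}
  Stack(const Stack& other) {
    for (std::size_t i = 0; i < 4; i++) frames[i] = other.frames[i];
    copies++;
  }
  // Pre-created singleton, handed out by reference.
  static const Stack& empty() {
    static const Stack instance(0);
    return instance;
  }
};
int Stack::copies = 0;

// Callee takes a const reference, like os::malloc in the post.
bool consume(const Stack& s) { return s.frames[0] == nullptr; }

// Same shape as the original macros: temporary on the left, lvalue on the right.
#define PC_OLD(detail) ((detail) ? Stack(0) : Stack::empty())

// Number of copy-constructor calls when going through the macro.
int copies_for_old_macro() {
  Stack::copies = 0;
  bool ok = consume(PC_OLD(false));
  (void)ok;
  return Stack::copies;
}

// Number of copy-constructor calls when passing the singleton directly.
int copies_for_direct_reference() {
  Stack::copies = 0;
  bool ok = consume(Stack::empty());
  (void)ok;
  return Stack::copies;
}
```

Passing `Stack::empty()` directly binds the reference without a copy; routing it through the conditional expression adds one copy per call, which matches the loads and stores in the listing above.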
---------------------
The patch changes the macros to avoid initialization of `NativeCallStack` if NMT is off or in summary mode only.
This was a bit tricky to do, since I wanted the compiler to not do anything if NMT is disabled, and of course I did not want to change the semantics of CALLER|CURRENT_PC.
In the end I settled on replacing the explicit calls to `NativeCallStack::empty_stack()` with calls to the default constructor, which I changed to a no-op. The NativeCallStack object is therefore left uninitialized, and the compiler optimizes the empty constructor call away. With NMT=off, we are done; with NMT=summary, we now just hand down the pointer to the uninitialized NativeCallStack to MallocTracker::record_malloc(), which ignores it anyway:
0000000000cb98f0 <_ZN2os6mallocEm8MEMFLAGS>:
...
# load tracking level
cb9907: 48 8d 1d 72 46 78 00 lea 0x784672(%rip),%rbx # 143df80 <_ZN10MemTracker15_tracking_levelE>
cb990e: 8b 03 mov (%rbx),%eax
# detail (3) tracking?
cb9910: 83 f8 03 cmp $0x3,%eax
# yes: go and collect callstack
cb9913: 0f 84 37 01 00 00 je cb9a50 <_ZN2os6mallocEm8MEMFLAGS+0x160>
# no: nothing more to do ...
...
# do the actual malloc:
cb9af8: e8 c3 40 5d ff callq 28dbc0 <malloc@plt>
...
# call MallocTracker::record_malloc() and hand down pointer to NMT stack (4th argument->RCX). The stack remains uninitialized, that is fine, since the MallocTracker will ignore it anyway:
cb9987: 48 8d 4d c0 lea -0x40(%rbp),%rcx
...
cb9991: e8 ba b8 f3 ff callq bf5250 <_ZN13MallocTracker13record_mallocEPvm8MEMFLAGSRK15NativeCallStack>
Only two callers of the default constructor actually relied on it producing an initialized, empty stack. I changed them to use `NativeCallStack ncs(NULL, 0);`, which is functionally equivalent.
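The shape of the change can be sketched in isolation as follows. This is an illustration under assumptions, not the patch itself: `Level`, `Stack`, `record`, `PC_NEW` and `marker` are hypothetical names, and the counter only tracks stores made by the capturing constructor. Both operands of the conditional are now temporaries of the same type, so no existing object needs to be copied (C++17 guarantees that; older standards merely allow the compiler to elide the temporary).

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for the NMT tracking levels.
enum Level { level_off, level_summary, level_detail };

static int marker;  // dummy address used to fill "captured" frames

// Hypothetical stand-in for NativeCallStack after the change.
struct Stack {
  static int slot_writes;  // counts stores done by the capturing constructor
  void* frames[4];

  Stack() {}  // no-op: frames[] is deliberately left uninitialized
  explicit Stack(int /*skip_frames*/) {  // stand-in for the stack walker
    for (std::size_t i = 0; i < 4; i++) {
      frames[i] = &marker;
      slot_writes++;
    }
  }
};
int Stack::slot_writes = 0;

// Callee reads the stack only in detail mode, as the post says
// MallocTracker::record_malloc() does.
int record(Level level, const Stack& s) {
  if (level != level_detail) return 0;  // never touches s
  int captured = 0;
  for (std::size_t i = 0; i < 4; i++) {
    if (s.frames[i] != nullptr) captured++;
  }
  return captured;
}

// Both operands are temporaries of the same type.
#define PC_NEW(level) (((level) == level_detail) ? Stack(0) : Stack())

// Number of frame slots written while building the argument for one call.
int writes_for(Level level) {
  Stack::slot_writes = 0;
  record(level, PC_NEW(level));
  return Stack::slot_writes;
}
```

The sketch is only safe because `record` never reads the stack outside detail mode. Any callee that inspects the stack unconditionally would need an explicitly initialized object instead, which is what the two adjusted call sites use.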
--------------
Results:
When profiling, I see that os::malloc now needs fewer cycles, and the hot spot around the xmm instructions is gone.
-------------
Commit messages:
- JDK-8296437-CURRENT_PC_costly-even-if-NMT-off
Changes: https://git.openjdk.org/jdk/pull/11040/files
Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=11040&range=00
Issue: https://bugs.openjdk.org/browse/JDK-8296437
Stats: 20 lines in 5 files changed: 11 ins; 0 del; 9 mod
Patch: https://git.openjdk.org/jdk/pull/11040.diff
Fetch: git fetch https://git.openjdk.org/jdk pull/11040/head:pull/11040
PR: https://git.openjdk.org/jdk/pull/11040