RFR: JDK-8296437: NMT incurs costs if disabled

Thomas Stuefe stuefe at openjdk.org
Sat Nov 12 07:12:39 UTC 2022


On Thu, 10 Nov 2022 20:41:49 GMT, Ioi Lam <iklam at openjdk.org> wrote:

>> While investigating the performance of the os::malloc wrapper, I noticed that we spend a lot of cycles copying empty callstacks around, even if NMT is disabled.
>> 
>> The CURRENT_PC and CALLER_PC macros are used to create `NativeCallStack` objects out of thin air :
>> 
>> 
>> #define CURRENT_PC ((MemTracker::tracking_level() == NMT_detail) ? \
>>                     NativeCallStack(0) : NativeCallStack::empty_stack())
>> #define CALLER_PC ((MemTracker::tracking_level() == NMT_detail) ? \
>>                     NativeCallStack(1) : NativeCallStack::empty_stack())
>> 
>> 
>> and feed them to a callee routine, which usually has the argument defined via const reference, e.g. os::malloc:
>> 
>> 
>> void* os::malloc(size_t size, MEMFLAGS memflags, const NativeCallStack& stack);
>> 
>> 
>> In CURRENT|CALLER_PC, the left hand of the ':' operator handles the detail mode, when we actually do collect a stack. In that case, the stack sits on the thread stack as an automatic anonymous variable and is filled by the stack walker. The right-hand of ':' handles the case when we don't want a stack. In that case, the intent is to hand down the reference to a pre-created "empty stack" singleton (NativeCallStack::empty_stack()).
>> 
>> However, that does not work as intended. The C++ compiler - at least gcc on linux - interprets these as copy-by-value and generates code that always laboriously copies the content of the empty stack singleton onto the thread stack. It uses four SSE instructions - two 16byte loads, and two 16byte moves (the NMT stacks are by default 4 frames, so 4 pointer-sized slots):
>> 
>> 
>> 0000000000cb9a60 <_ZN2os6mallocEm8MEMFLAGS>:
>> ...
>> # Load tracking level
>>   cb9a77:	48 8d 1d 02 35 78 00 	lea    0x783502(%rip),%rbx        # 143cf80 <_ZN10MemTracker15_tracking_levelE>
>>   cb9a7e:	8b 03                	mov    (%rbx),%eax
>> # detail (3) tracking?  
>>   cb9a80:	83 f8 03             	cmp    $0x3,%eax
>> # yes: go and collect callstack    
>>   cb9a83:	0f 84 57 01 00 00    	je     cb9be0 <_ZN2os6mallocEm8MEMFLAGS+0x180>
>> # no: copy the content of NativeCallStack::_empty_stack to the local stack, in 16 byte intervals:
>>   cb9a89:	48 8d 05 30 44 78 00 	lea    0x784430(%rip),%rax        # 143dec0 <_ZN15NativeCallStack12_empty_stackE>
>>   cb9a90:	f3 0f 6f 00          	movdqu (%rax),%xmm0
>>   cb9a94:	f3 0f 6f 48 10       	movdqu 0x10(%rax),%xmm1
>>   cb9a99:	0f 11 45 c0          	movups %xmm0,-0x40(%rbp)
>>   cb9a9d:	0f 11 4d d0          	movups %xmm1,-0x30(%rbp)
>>   ...  
>> # do the actual malloc:
>>   cb9af8:	e8 c3 40 5d ff       	callq  28dbc0 <malloc at plt>
>> 
>> # call MallocTracker::record_malloc() and hand down pointer to NMT stack (4th argument->RCX):
>>   cb9b0f:	48 8d 4d c0          	lea    -0x40(%rbp),%rcx
>>   ...
>>   cb9b19:	e8 f2 b7 f3 ff       	callq  bf5310 <_ZN13MallocTracker13record_mallocEPvm8MEMFLAGSRK15NativeCallStack>  
>> 
>> 
>> This is completely unnecessary, since if NMT mode != detail, the stack is never used. This hits every call site where these macros are used, and we pay if NMT is disabled.
>> 
>> ---------------------
>> 
>> The patch changes the macros to avoid initialization of `NativeCallStack` if NMT is off or in summary mode only.
>> 
>> This was a bit tricky to do, since I wanted the compiler to not do anything if NMT is disabled, and of course I did not want to change the semantics of CALLER|CURRENT_PC.
>> 
>> In the end I settled for exchanging the explicit calls to `NativeCallStack::empty_stack()` to calls to the default constructor. I changed the default constructor to a no-op. So the NativeCallStack object is not initialized, the compiler optimizes the empty constructor call away. In NMT=off, we are done; in NMT=summary mode, we now just hand down the pointer to the uninitialized NativeCallStack to MallocTracker::record_malloc(), which will ignore it anyway:
>> 
>> 
>> 0000000000cb98f0 <_ZN2os6mallocEm8MEMFLAGS>:
>> ...
>> # load tracking level
>>   cb9907:	48 8d 1d 72 46 78 00 	lea    0x784672(%rip),%rbx        # 143df80 <_ZN10MemTracker15_tracking_levelE>
>>   cb990e:	8b 03                	mov    (%rbx),%eax
>> # detail (3) tracking?  
>>   cb9910:	83 f8 03             	cmp    $0x3,%eax
>> # yes: go and collect callstack  
>>   cb9913:	0f 84 37 01 00 00    	je     cb9a50 <_ZN2os6mallocEm8MEMFLAGS+0x160>
>> # no: nothing more to do ...
>>   ...
>> # do the actual malloc:
>>   cb9af8:	e8 c3 40 5d ff       	callq  28dbc0 <malloc at plt>
>> ...
>> # call MallocTracker::record_malloc() and hand down pointer to NMT stack (4th argument->RCX). The stack remains uninitialized, that is fine, since the MallocTracker will ignore it anyway:
>>   cb9987:	48 8d 4d c0          	lea    -0x40(%rbp),%rcx 
>> ..  
>>   cb9991:	e8 ba b8 f3 ff       	callq  bf5250 <_ZN13MallocTracker13record_mallocEPvm8MEMFLAGSRK15NativeCallStack>
>> 
>> 
>> There were only two callers of the default constructor that used it, and I changed them to use `NativeCallStack ncs(NULL, 0);` which is functionally equivalent.
>> 
>> --------------
>> 
>> Results:
>> 
>> When profiling, I see os::malloc now needs less cycles, and the hotspot around the xmm instructions is not there anymore.
>
> (I commented before but for some reason it's been lost). 
> 
> There's one use of the default constructor that you've missed (I found that by removing the body of the NativeCallStack() constructor):
> 
> 
>   ReservedMemoryRegion(const ReservedMemoryRegion& rr) :
>     VirtualMemoryRegion(rr.base(), rr.size()) {
>     *this = rr;
>   }
> 
> 
> I think it will be much safer to leave the existing default constructor, and have something like:
> 
> 
> private:
> NativeCallStack(int dummy) {
>   _dummy[0] = NULL;
> }
> 
> public:
> inline static NativeCallStack fake_stack() {
>   NativeCallStack fake(0);
>   return fake;
> }
> 
> 
> This will keep the behavior the same as before. Note that your patch will change the behavior if the fake stack is actually used. E.g., for this function:
> 
> 
>   inline bool is_empty() const {
>     return _stack[0] == NULL;
>   }
> 
> 
> If you are absolutely sure that the fake stacks are never used, and really want to get rid of the `_stack[0] = NULL`,  I would suggest adding a new debug-only field, and add asserts like this in all public functions:
> 
> 
>   inline bool is_empty() const {
>     assert(!is_fake, "sanity");
>     return _stack[0] == NULL;
>   }

After thinking about @iklam s feedback, I changed the patch and introduced an explicit constructor to build fake callstacks for CALLER_PC and CURRENT_PC. That leaves the current mechanics of empty callstacks unchanged. 

The fake constructor will leave the object uninitialized in release builds, so no construction costs there. It will zap it with a test pattern in debug builds. That test pattern will be asserted to make sure we don't accidentally use them.

I refrained from making any further changes, but plan to do future improvements in follow up RFEs.

-------------

PR: https://git.openjdk.org/jdk/pull/11040


More information about the hotspot-dev mailing list