RFR: JDK-8314890: Reduce number of loads for Klass decoding in static code [v12]

Wed Jan 17 12:48:57 UTC 2024

On Wed, 15 Nov 2023 14:50:48 GMT, Thomas Stuefe <stuefe at openjdk.org> wrote:

>> Small change that reduces the number of loads generated by the C++ compiler for a narrow Klass decoding operation (`CompressedKlassPointers::decode_xxx()`.
>> 
>> Stock: three loads (with two probably sharing a cache line) - UseCompressedClassPointers, encoding base and shift.
>> 
>> 
>>   8b7b62:   48 8d 05 7f 1b c3 00    lea    0xc31b7f(%rip),%rax        # 14e96e8 <UseCompressedClassPointers>
>>   8b7b69:   0f b6 00                movzbl (%rax),%eax
>>   8b7b6c:   84 c0                   test   %al,%al
>>   8b7b6e:   0f 84 9c 00 00 00       je     8b7c10 <_ZN10HeapRegion14object_iterateEP13ObjectClosure+0x260>
>>   8b7b74:   48 8d 15 05 62 c6 00    lea    0xc66205(%rip),%rdx        # 151dd80 <_ZN23CompressedKlassPointers6_shiftE>
>>   8b7b7b:   8b 7b 08                mov    0x8(%rbx),%edi
>>   8b7b7e:   8b 0a                   mov    (%rdx),%ecx
>>   8b7b80:   48 8d 15 01 62 c6 00    lea    0xc66201(%rip),%rdx        # 151dd88 <_ZN23CompressedKlassPointers5_baseE>
>>   8b7b87:   48 d3 e7                shl    %cl,%rdi
>>   8b7b8a:   48 03 3a                add    (%rdx),%rdi
>> 
>> 
>>   Patched: one load loads all three. Since shift occupies the lowest 8 bits, compiled code uses 8bit register; ditto the UseCompressedOops flag.
>> 
>> 
>>   8ba302:   48 8d 05 97 9c c2 00    lea    0xc29c97(%rip),%rax        # 14e3fa0 <_ZN23CompressedKlassPointers6_comboE>
>>   8ba309:   48 8b 08                mov    (%rax),%rcx
>>   8ba30c:   f6 c5 01                test   $0x1,%ch        # use compressed klass pointers?
>>   8ba30f:   0f 84 9b 00 00 00       je     8ba3b0 <_ZN10HeapRegion14object_iterateEP13ObjectClosure+0x260>
>>   8ba315:   8b 7b 08                mov    0x8(%rbx),%edi
>>   8ba318:   48 d3 e7                shl    %cl,%rdi        # shift 
>>   8ba31b:   66 31 c9                xor    %cx,%cx         # zero out lower 16 bits of base
>>   8ba31e:   48 01 cf                add    %rcx,%rdi       # add base 
>>   8ba321:   8b 4f 08                mov    0x8(%rdi),%ecx
>> 
>> ---
>> 
>> Performance measurements: 
>> 
>> G1, doing a full GC over a heap filled with 256 mio life j.l.Object instances.
>> 
>> I see a reduction of Full Pause times between 1.2% and 5%. I am unsure how reliable these numbers are since, despite my efforts (running tests on isolated CPUs etc.), the standard deviation was quite high at ˜4%. Still, in general, numbers seemed to go down rather than up.
>> 
>> ---
>> 
>> Future extensions: 
>> 
>> This patch uses the fact that the encoding base is aligned to metaspace reser...
>
> Thomas Stuefe has updated the pull request incrementally with three additional commits since the last revision:
> 
>  - Update src/hotspot/share/oops/compressedKlass.cpp
>    
>    Co-authored-by: Aleksey Shipilëv <shipilev at amazon.de>
>  - Update src/hotspot/share/oops/compressedKlass.cpp
>    
>    Co-authored-by: Aleksey Shipilëv <shipilev at amazon.de>
>  - Update src/hotspot/share/oops/compressedKlass.hpp
>    
>    Co-authored-by: Aleksey Shipilëv <shipilev at amazon.de>

> (And regarding the bitfield discussion - I _would_ like to handle object header fields as bitfield, eventually. Not sure if it makes sense though or if we ever get there...)

Needed to remind myself that bitfields are actually funky under C++ memory model: the adjacent bitfields of non-zero length are basically taken as a [single memory location](https://en.cppreference.com/w/cpp/language/memory_model#Memory_location), which raises all sort of questions what happens under concurrent access to them. Let's avoid walking into that mess :)

-------------

PR Comment: https://git.openjdk.org/jdk/pull/15389#issuecomment-1895739624