RFR: JDK-8314890: Reduce number of loads for Klass decoding in static code

Wed Aug 23 14:46:41 UTC 2023

Small change that reduces the number of loads generated by the C++ compiler for a narrow Klass decoding operation (`CompressedKlassPointers::decode_xxx()`).

Stock: three loads (with two probably sharing a cache line) - UseCompressedClassPointers, encoding base and shift.

  8b7b62:   48 8d 05 7f 1b c3 00    lea    0xc31b7f(%rip),%rax        # 14e96e8 <UseCompressedClassPointers>
  8b7b69:   0f b6 00                movzbl (%rax),%eax
  8b7b6c:   84 c0                   test   %al,%al
  8b7b6e:   0f 84 9c 00 00 00       je     8b7c10 <_ZN10HeapRegion14object_iterateEP13ObjectClosure+0x260>
  8b7b74:   48 8d 15 05 62 c6 00    lea    0xc66205(%rip),%rdx        # 151dd80 <_ZN23CompressedKlassPointers6_shiftE>
  8b7b7b:   8b 7b 08                mov    0x8(%rbx),%edi
  8b7b7e:   8b 0a                   mov    (%rdx),%ecx
  8b7b80:   48 8d 15 01 62 c6 00    lea    0xc66201(%rip),%rdx        # 151dd88 <_ZN23CompressedKlassPointers5_baseE>
  8b7b87:   48 d3 e7                shl    %cl,%rdi
  8b7b8a:   48 03 3a                add    (%rdx),%rdi

Patched: a single load fetches all three. Since the shift occupies the lowest 8 bits, the compiled code uses an 8-bit register (%cl) for it; ditto the UseCompressedClassPointers flag (%ch).

  8ba302:   48 8d 05 97 9c c2 00    lea    0xc29c97(%rip),%rax        # 14e3fa0 <_ZN23CompressedKlassPointers6_comboE>
  8ba309:   48 8b 08                mov    (%rax),%rcx
  8ba30c:   f6 c5 01                test   $0x1,%ch        # use compressed klass pointers?
  8ba30f:   0f 84 9b 00 00 00       je     8ba3b0 <_ZN10HeapRegion14object_iterateEP13ObjectClosure+0x260>
  8ba315:   8b 7b 08                mov    0x8(%rbx),%edi
  8ba318:   48 d3 e7                shl    %cl,%rdi        # shift 
  8ba31b:   66 31 c9                xor    %cx,%cx         # zero out lower 16 bits of base
  8ba31e:   48 01 cf                add    %rcx,%rdi       # add base 
  8ba321:   8b 4f 08                mov    0x8(%rdi),%ecx
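
For readers who prefer C++ to disassembly, here is a minimal sketch of the packing idea. The bit layout is inferred from the patched disassembly above (shift in the low byte, flag in bit 8, low 16 bits masked off to recover the base). The names `make_combo`, `combo_use_ccp` and `decode_with_combo` are hypothetical and are not taken from the patch.

```cpp
#include <cassert>
#include <cstdint>

// Assumed layout of the combined word (inferred, not copied from the patch):
//   bits 0..7    shift
//   bit  8       "use compressed class pointers" flag
//   bits 16..63  encoding base (its low 16 bits are zero thanks to alignment)
constexpr uint64_t kShiftMask = 0xFF;
constexpr uint64_t kUseCcpBit = uint64_t{1} << 8;
constexpr uint64_t kBaseMask  = ~uint64_t{0xFFFF};

// Pack base, shift and flag into one 64-bit word.
inline uint64_t make_combo(uint64_t base, unsigned shift, bool use_ccp) {
  assert((base & 0xFFFF) == 0);  // base must leave the low 16 bits free
  assert(shift <= 0xFF);
  return base | (use_ccp ? kUseCcpBit : 0) | shift;
}

// Corresponds to "test $0x1,%ch" in the patched code.
inline bool combo_use_ccp(uint64_t combo) {
  return (combo & kUseCcpBit) != 0;
}

// Corresponds to "shl %cl,%rdi; xor %cx,%cx; add %rcx,%rdi".
inline uint64_t decode_with_combo(uint64_t combo, uint32_t narrow_klass) {
  const unsigned shift = static_cast<unsigned>(combo & kShiftMask);
  const uint64_t base  = combo & kBaseMask;
  return base + (uint64_t{narrow_klass} << shift);
}
```

All three values come out of a single 64-bit load of the combined word, which is what lets the compiler drop two of the three memory accesses.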

---

Performance measurements: 

G1, doing a full GC over a heap filled with 256 million live java.lang.Object instances.

I see a reduction in full GC pause times of between 1.2% and 5%. I am unsure how reliable these numbers are since, despite my efforts (running tests on isolated CPUs etc.), the standard deviation was quite high at ~4%. Still, in general, numbers seemed to go down rather than up.

---

Future extensions: 

This patch uses the fact that the encoding base is aligned to the metaspace reserve alignment (16 MB), so its lowest 24 bits are always zero. We only use 16 of those 24 bits of alignment shadow and could use more.
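
The bit budget can be checked with a little compile-time arithmetic. This is only a sketch; the constant and function names are made up for illustration, and the 16 MB figure is taken from the paragraph above.

```cpp
#include <cstdint>

// 16 MB reserve alignment, as stated above (hypothetical constant name).
constexpr uint64_t kReserveAlignment = uint64_t{16} << 20;

// Number of low bits the combined word currently reuses (shift + flags).
constexpr unsigned kBitsUsed = 16;

// log2 for an exact power of two, evaluated at compile time.
constexpr unsigned log2_exact(uint64_t v) {
  unsigned n = 0;
  while (v > 1) {
    v >>= 1;
    ++n;
  }
  return n;
}

constexpr unsigned kAlignmentBits = log2_exact(kReserveAlignment);
constexpr unsigned kSpareBits     = kAlignmentBits - kBitsUsed;

static_assert(kAlignmentBits == 24, "16 MB alignment leaves 24 zero bits");
static_assert(kSpareBits == 8, "8 bits of alignment shadow remain unused");
```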

A future UseCOH flag can easily nestle alongside the UseCCP bit in the second byte, and then we could test that second byte of the combo for either 0, 1 or 3 (all possible valid permutations of these flags).
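
A sketch of what such a test might look like, assuming the UseCCP bit sits at bit 0 of the second byte and a future UseCOH bit at bit 1 (the latter is implied by the value 3 but is otherwise an assumption). The function names are hypothetical.

```cpp
#include <cstdint>

// Assumed flag positions within the second byte of the combined word.
constexpr uint8_t kFlagUseCcp = 0x1;  // compressed class pointers
constexpr uint8_t kFlagUseCoh = 0x2;  // hypothetical future flag

// Extract the second byte (bits 8..15) of the combined word.
inline uint8_t combo_flag_byte(uint64_t combo) {
  return static_cast<uint8_t>(combo >> 8);
}

// Valid values per the text above: 0 (neither), 1 (CCP only), 3 (both).
// The value 2 (COH without CCP) is not listed as a valid combination.
inline bool flag_byte_is_valid(uint8_t flags) {
  return flags == 0 ||
         flags == kFlagUseCcp ||
         flags == (kFlagUseCcp | kFlagUseCoh);
}
```

Because both flags live in the same byte, one byte-sized compare on the already-loaded word can distinguish all three modes.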

-------------

Commit messages:
 - use 16 bit alignment
 - with raw bit ops

Changes: https://git.openjdk.org/jdk/pull/15389/files
 Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=15389&range=00
  Issue: https://bugs.openjdk.org/browse/JDK-8314890
  Stats: 45 lines in 3 files changed: 35 ins; 0 del; 10 mod
  Patch: https://git.openjdk.org/jdk/pull/15389.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/15389/head:pull/15389

PR: https://git.openjdk.org/jdk/pull/15389