RFR: 8309953: Strengthen and optimize oopDesc age methods

Wed Jun 14 10:02:56 UTC 2023

On Tue, 13 Jun 2023 20:04:48 GMT, Aleksey Shipilev <shade at openjdk.org> wrote:

> See the RFE for discussion. Basically, there is little reason to do two loads of mark word, when we can do one. 
> 
> Additional testing:
>  - [x] Eyeballing generated code
>  - [x] Linux x86_64 fastdebug `tier1 tier2 tier3`
>  - [x] Linux AArch64 fastdebug `tier1 tier2 tier3`

Moving perf observations into a separate comment.

Sample generated code (AArch64) for `oopDesc::age` can be seen if we turn that method from `inline` into a regular method:

# Before

000000000080f440 <oopDesc::age>:
  80f440: ff 83 00 d1   sub     sp, sp, #32
  80f444: fd 7b 01 a9   stp     x29, x30, [sp, #16]
  80f448: fd 43 00 91   add     x29, sp, #16
  80f44c: 08 00 40 f9   ldr     x8, [x0]          ; <-- first mark load
  80f450: 89 27 00 d0   adrp    x9, 0xd01000 
  80f454: 1f 20 03 d5   nop     
  80f458: 29 95 4a b9   ldr     w9, [x9, #2708]
  80f45c: 0a 05 40 92   and     x10, x8, #0x3
  80f460: 5f 09 00 f1   cmp     x10, #2
  80f464: ea 17 9f 1a   cset    w10, eq
  80f468: 1f 01 40 f2   tst     x8, #0x1
  80f46c: e8 17 9f 1a   cset    w8, eq
  80f470: 3f 09 00 71   cmp     w9, #2
  80f474: 48 01 88 1a   csel    w8, w10, w8, eq
  80f478: 1f 05 00 71   cmp     w8, #1
  80f47c: 21 01 00 54   b.ne    0x80f4a0
  80f480: 08 00 40 f9   ldr     x8, [x0]          ; <-- second mark load
  80f484: e8 07 00 f9   str     x8, [sp, #8]
  80f488: e0 23 00 91   add     x0, sp, #8
  80f48c: c4 ed fd 97   bl      0x78ab9c
  80f490: 00 18 03 53   ubfx    w0, w0, #3, #4
  80f494: fd 7b 41 a9   ldp     x29, x30, [sp, #16]
  80f498: ff 83 00 91   add     sp, sp, #32
  80f49c: c0 03 5f d6   ret     
  80f4a0: 00 00 40 f9   ldr     x0, [x0]
  80f4a4: 00 18 03 53   ubfx    w0, w0, #3, #4
  80f4a8: fd 7b 41 a9   ldp     x29, x30, [sp, #16]
  80f4ac: ff 83 00 91   add     sp, sp, #32
  80f4b0: c0 03 5f d6   ret    

# After

000000000080f480 <oopDesc::age>:
  80f480: ff 83 00 d1   sub     sp, sp, #32
  80f484: fd 7b 01 a9   stp     x29, x30, [sp, #16]
  80f488: fd 43 00 91   add     x29, sp, #16
  80f48c: 00 00 40 f9   ldr     x0, [x0]          ; <-- load mark once
  80f490: e0 07 00 f9   str     x0, [sp, #8]
  80f494: 88 27 00 d0   adrp    x8, 0xd01000  
  80f498: 1f 20 03 d5   nop     
  80f49c: 08 95 4a b9   ldr     w8, [x8, #2708]
  80f4a0: 09 04 40 92   and     x9, x0, #0x3
  80f4a4: 3f 09 00 f1   cmp     x9, #2
  80f4a8: e9 17 9f 1a   cset    w9, eq
  80f4ac: 1f 00 40 f2   tst     x0, #0x1
  80f4b0: ea 17 9f 1a   cset    w10, eq
  80f4b4: 1f 09 00 71   cmp     w8, #2
  80f4b8: 28 01 8a 1a   csel    w8, w9, w10, eq
  80f4bc: 1f 05 00 71   cmp     w8, #1
  80f4c0: 61 00 00 54   b.ne    0x80f4cc 
  80f4c4: e0 23 00 91   add     x0, sp, #8
  80f4c8: c5 ed fd 97   bl      0x78abdc
  80f4cc: 00 18 03 53   ubfx    w0, w0, #3, #4
  80f4d0: fd 7b 41 a9   ldp     x29, x30, [sp, #16]
  80f4d4: ff 83 00 91   add     sp, sp, #32 
  80f4d8: c0 03 5f d6   ret     

Note how the method suffix (the epilogue) also folds: both exits now share a single `ubfx`/`ldp`/`add`/`ret` tail, which saves 24 bytes in the instruction stream.
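For readers who do not want to trace the disassembly, here is a minimal Java model of the two shapes. It is only a sketch, not HotSpot code: the class `MarkWordSketch`, the `Header` load counter, the `mode` parameter (standing in for the global that is compared against 2) and the `displaced` callback (standing in for the `bl` target) are all hypothetical names. The bit layout is read off the listing: `ubfx w0, w0, #3, #4` takes a 4-bit field starting at bit 3, and the `and`/`cmp`/`tst`/`csel` sequence picks between `(mark & 3) == 2` and `(mark & 1) == 0`.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.LongUnaryOperator;

// Illustrative model only; names and structure are assumptions, not HotSpot code.
final class MarkWordSketch {
    // Mirrors `ubfx w0, w0, #3, #4`: a 4-bit field starting at bit 3.
    static int age(long mark) {
        return (int) ((mark >>> 3) & 0xF);
    }

    // Mirrors the and/cmp/tst/csel sequence: with mode == 2 the test is
    // (mark & 3) == 2, otherwise it is (mark & 1) == 0.
    static boolean hasDisplacedMark(long mark, int mode) {
        return mode == 2 ? (mark & 0x3) == 2 : (mark & 0x1) == 0;
    }

    // Header word that counts how often it is read.
    static final class Header {
        private final AtomicLong mark;
        int loads;

        Header(long initial) {
            mark = new AtomicLong(initial);
        }

        long load() {
            loads++;
            return mark.get();
        }
    }

    // "Before" shape: test one load, then extract the age from a fresh load.
    static int ageTwoLoads(Header h, int mode, LongUnaryOperator displaced) {
        if (hasDisplacedMark(h.load(), mode)) {
            return age(displaced.applyAsLong(h.load()));
        }
        return age(h.load());
    }

    // "After" shape: one load feeds both the test and the extraction.
    static int ageOneLoad(Header h, int mode, LongUnaryOperator displaced) {
        long m = h.load();
        return age(hasDisplacedMark(m, mode) ? displaced.applyAsLong(m) : m);
    }

    public static void main(String[] args) {
        long mark = (5L << 3) | 0x1;          // age 5, low bit set
        LongUnaryOperator displaced = m -> m; // placeholder for the out-of-line call
        Header a = new Header(mark);
        Header b = new Header(mark);
        System.out.println("two loads: age=" + ageTwoLoads(a, 1, displaced) + " loads=" + a.loads);
        System.out.println("one load:  age=" + ageOneLoad(b, 1, displaced) + " loads=" + b.loads);
    }
}
```

Besides saving a load, the one-load shape guarantees that the test and the extraction see the same value, even if another thread updates the header in between.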

On Linux x86_64, this gives us an edge on a benchmark that stresses Serial Full GC:

public class Retain {
	static final int RETAINED = Integer.getInteger("retained", 10_000_000);
	static final int GCS      = Integer.getInteger("gcs", 100);

	static Object[] OBJECTS = new Object[RETAINED];

	public static void main(String... args) {
		for (int t = 0; t < GCS; t++) {
			for (int c = 0; c < RETAINED; c++) {
				OBJECTS[c] = new Object();
			}
			System.gc();
		}
	}
}

# build/baseline/bin/java -Xmx1g -Xlog:gc -XX:+UseSerialGC Retain.java
...
[36.315s][info][gc] GC(87) Pause Full (System.gc()) 349M->193M(989M) 372.745ms
[36.721s][info][gc] GC(88) Pause Full (System.gc()) 349M->193M(989M) 372.359ms
[37.127s][info][gc] GC(89) Pause Full (System.gc()) 349M->193M(989M) 372.485ms
[37.533s][info][gc] GC(90) Pause Full (System.gc()) 349M->193M(989M) 372.297ms
[37.939s][info][gc] GC(91) Pause Full (System.gc()) 349M->193M(989M) 372.208ms
[38.345s][info][gc] GC(92) Pause Full (System.gc()) 349M->193M(989M) 372.413ms
[38.751s][info][gc] GC(93) Pause Full (System.gc()) 349M->193M(989M) 372.380ms
[39.157s][info][gc] GC(94) Pause Full (System.gc()) 349M->193M(989M) 372.505ms
[39.562s][info][gc] GC(95) Pause Full (System.gc()) 349M->193M(989M) 372.383ms
...

# build/linux-x86_64-server-release/images/jdk/bin/java -Xmx1g -Xlog:gc -XX:+UseSerialGC Retain.java
...
[37.331s][info][gc] GC(91) Pause Full (System.gc()) 349M->193M(989M) 366.610ms
[37.731s][info][gc] GC(92) Pause Full (System.gc()) 349M->193M(989M) 366.164ms
[38.130s][info][gc] GC(93) Pause Full (System.gc()) 349M->193M(989M) 365.986ms
[38.530s][info][gc] GC(94) Pause Full (System.gc()) 349M->193M(989M) 365.886ms
[38.930s][info][gc] GC(95) Pause Full (System.gc()) 349M->193M(989M) 365.634ms
[39.330s][info][gc] GC(96) Pause Full (System.gc()) 349M->193M(989M) 366.460ms
[39.729s][info][gc] GC(97) Pause Full (System.gc()) 349M->193M(989M) 366.118ms
...

So, Full GC is about 2% faster (mean pause roughly 372.4 ms vs 366.1 ms in the samples above), for a little change :)
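To check the arithmetic, a small helper can average the pause times from the log excerpts above and print the relative difference. This is a sketch: `PauseStats` and its methods are hypothetical names, and the regex assumes each line ends in a `<number>ms` pause time as in the `-Xlog:gc` output shown.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; assumes each log line ends with "<pause>ms".
final class PauseStats {
    private static final Pattern PAUSE_MS = Pattern.compile("(\\d+\\.\\d+)ms\\s*$");

    // Mean of the trailing pause times found in the given log lines.
    static double meanPauseMs(String... logLines) {
        double sum = 0;
        int n = 0;
        for (String line : logLines) {
            Matcher m = PAUSE_MS.matcher(line);
            if (m.find()) {
                sum += Double.parseDouble(m.group(1));
                n++;
            }
        }
        if (n == 0) {
            throw new IllegalArgumentException("no pause times found");
        }
        return sum / n;
    }

    // Relative reduction in pause time, in percent of the baseline.
    static double improvementPercent(double baselineMs, double patchedMs) {
        return 100.0 * (baselineMs - patchedMs) / baselineMs;
    }

    public static void main(String[] args) {
        double baseline = meanPauseMs(
            "[36.315s][info][gc] GC(87) Pause Full (System.gc()) 349M->193M(989M) 372.745ms",
            "[36.721s][info][gc] GC(88) Pause Full (System.gc()) 349M->193M(989M) 372.359ms",
            "[37.127s][info][gc] GC(89) Pause Full (System.gc()) 349M->193M(989M) 372.485ms",
            "[37.533s][info][gc] GC(90) Pause Full (System.gc()) 349M->193M(989M) 372.297ms",
            "[37.939s][info][gc] GC(91) Pause Full (System.gc()) 349M->193M(989M) 372.208ms",
            "[38.345s][info][gc] GC(92) Pause Full (System.gc()) 349M->193M(989M) 372.413ms",
            "[38.751s][info][gc] GC(93) Pause Full (System.gc()) 349M->193M(989M) 372.380ms",
            "[39.157s][info][gc] GC(94) Pause Full (System.gc()) 349M->193M(989M) 372.505ms",
            "[39.562s][info][gc] GC(95) Pause Full (System.gc()) 349M->193M(989M) 372.383ms");
        double patched = meanPauseMs(
            "[37.331s][info][gc] GC(91) Pause Full (System.gc()) 349M->193M(989M) 366.610ms",
            "[37.731s][info][gc] GC(92) Pause Full (System.gc()) 349M->193M(989M) 366.164ms",
            "[38.130s][info][gc] GC(93) Pause Full (System.gc()) 349M->193M(989M) 365.986ms",
            "[38.530s][info][gc] GC(94) Pause Full (System.gc()) 349M->193M(989M) 365.886ms",
            "[38.930s][info][gc] GC(95) Pause Full (System.gc()) 349M->193M(989M) 365.634ms",
            "[39.330s][info][gc] GC(96) Pause Full (System.gc()) 349M->193M(989M) 366.460ms",
            "[39.729s][info][gc] GC(97) Pause Full (System.gc()) 349M->193M(989M) 366.118ms");
        System.out.printf("baseline %.3f ms, patched %.3f ms, improvement %.2f%%%n",
            baseline, patched, improvementPercent(baseline, patched));
    }
}
```

The samples are only the excerpted GC cycles, so treat the result as a sanity check on the headline number rather than a rigorous measurement.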

-------------

PR Comment: https://git.openjdk.org/jdk/pull/14456#issuecomment-1590882906