RFR: 8309953: Strengthen and optimize oopDesc age methods
Aleksey Shipilev
shade at openjdk.org
Wed Jun 14 10:02:56 UTC 2023
On Tue, 13 Jun 2023 20:04:48 GMT, Aleksey Shipilev <shade at openjdk.org> wrote:
> See the RFE for discussion. Basically, there is little reason to do two loads of mark word, when we can do one.
>
> Additional testing:
> - [x] Eyeballing generated code
> - [x] Linux x86_64 fastdebug `tier1 tier2 tier3`
> - [x] Linux AArch64 fastdebug `tier1 tier2 tier3`
Moving perf observations into a separate comment.
Sample generated AArch64 code for `oopDesc::age` can be seen if we turn that method from `inline` into a regular method:
# Before
000000000080f440 <oopDesc::age>:
80f440: ff 83 00 d1 sub sp, sp, #32
80f444: fd 7b 01 a9 stp x29, x30, [sp, #16]
80f448: fd 43 00 91 add x29, sp, #16
80f44c: 08 00 40 f9 ldr x8, [x0] ; <-- first mark load
80f450: 89 27 00 d0 adrp x9, 0xd01000
80f454: 1f 20 03 d5 nop
80f458: 29 95 4a b9 ldr w9, [x9, #2708]
80f45c: 0a 05 40 92 and x10, x8, #0x3
80f460: 5f 09 00 f1 cmp x10, #2
80f464: ea 17 9f 1a cset w10, eq
80f468: 1f 01 40 f2 tst x8, #0x1
80f46c: e8 17 9f 1a cset w8, eq
80f470: 3f 09 00 71 cmp w9, #2
80f474: 48 01 88 1a csel w8, w10, w8, eq
80f478: 1f 05 00 71 cmp w8, #1
80f47c: 21 01 00 54 b.ne 0x80f4a0
80f480: 08 00 40 f9 ldr x8, [x0] ; <-- second mark load
80f484: e8 07 00 f9 str x8, [sp, #8]
80f488: e0 23 00 91 add x0, sp, #8
80f48c: c4 ed fd 97 bl 0x78ab9c
80f490: 00 18 03 53 ubfx w0, w0, #3, #4
80f494: fd 7b 41 a9 ldp x29, x30, [sp, #16]
80f498: ff 83 00 91 add sp, sp, #32
80f49c: c0 03 5f d6 ret
80f4a0: 00 00 40 f9 ldr x0, [x0]
80f4a4: 00 18 03 53 ubfx w0, w0, #3, #4
80f4a8: fd 7b 41 a9 ldp x29, x30, [sp, #16]
80f4ac: ff 83 00 91 add sp, sp, #32
80f4b0: c0 03 5f d6 ret
# After
000000000080f480 <oopDesc::age>:
80f480: ff 83 00 d1 sub sp, sp, #32
80f484: fd 7b 01 a9 stp x29, x30, [sp, #16]
80f488: fd 43 00 91 add x29, sp, #16
80f48c: 00 00 40 f9 ldr x0, [x0] ; <-- load mark once
80f490: e0 07 00 f9 str x0, [sp, #8]
80f494: 88 27 00 d0 adrp x8, 0xd01000
80f498: 1f 20 03 d5 nop
80f49c: 08 95 4a b9 ldr w8, [x8, #2708]
80f4a0: 09 04 40 92 and x9, x0, #0x3
80f4a4: 3f 09 00 f1 cmp x9, #2
80f4a8: e9 17 9f 1a cset w9, eq
80f4ac: 1f 00 40 f2 tst x0, #0x1
80f4b0: ea 17 9f 1a cset w10, eq
80f4b4: 1f 09 00 71 cmp w8, #2
80f4b8: 28 01 8a 1a csel w8, w9, w10, eq
80f4bc: 1f 05 00 71 cmp w8, #1
80f4c0: 61 00 00 54 b.ne 0x80f4cc
80f4c4: e0 23 00 91 add x0, sp, #8
80f4c8: c5 ed fd 97 bl 0x78abdc
80f4cc: 00 18 03 53 ubfx w0, w0, #3, #4
80f4d0: fd 7b 41 a9 ldp x29, x30, [sp, #16]
80f4d4: ff 83 00 91 add sp, sp, #32
80f4d8: c0 03 5f d6 ret
Note how the method suffix also folds, which saves about 24 bytes in the instruction stream.
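For readers who prefer source to disassembly, the difference between the two listings can be sketched as a small standalone model. This is not the HotSpot code: `MarkModel`, `ObjectModel`, `age_two_loads` and `age_one_load` are hypothetical names, the bit layout (age in 4 bits starting at bit 3, lock state in the low 2 bits) is only read off the `ubfx`/`and`/`tst` instructions above, and the `mode == 2` test stands in for the global that the listing compares against 2. The displaced mark is assumed to live in a slot whose address is the mark with the low 2 bits cleared.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for a mark word; layout inferred from the listing.
struct MarkModel {
    uint64_t bits;

    // ubfx w0, w0, #3, #4: extract 4 bits starting at bit 3.
    unsigned age() const { return static_cast<unsigned>((bits >> 3) & 0xF); }

    // Mirrors the and/cmp/tst/csel sequence: which check applies depends on mode.
    bool has_displaced(int mode) const {
        if (mode == 2) return (bits & 0x3) == 0x2;  // low bits say "monitor"
        return (bits & 0x1) == 0;                   // "unlocked" bit is clear
    }

    // Assumption of this model: the rest of the word points at the displaced mark.
    MarkModel displaced() const {
        auto* slot = reinterpret_cast<const uint64_t*>(
            static_cast<uintptr_t>(bits & ~uint64_t{0x3}));
        assert(slot != nullptr);
        return MarkModel{*slot};
    }
};

// Hypothetical object header that counts how often the mark is loaded.
struct ObjectModel {
    std::atomic<uint64_t> mark{0};
    mutable unsigned loads = 0;

    MarkModel load_mark() const {
        ++loads;
        return MarkModel{mark.load(std::memory_order_relaxed)};
    }
};

// "Before" shape: one load for the check, another load for the value used.
unsigned age_two_loads(const ObjectModel& o, int mode) {
    if (o.load_mark().has_displaced(mode)) {
        return o.load_mark().displaced().age();
    }
    return o.load_mark().age();
}

// "After" shape: load once, then check and use the same value.
unsigned age_one_load(const ObjectModel& o, int mode) {
    MarkModel m = o.load_mark();
    return (m.has_displaced(mode) ? m.displaced() : m).age();
}
```

Besides saving a load, the single-load shape checks and decodes the same value, so in this model a concurrent update between the two loads cannot make the check and the result disagree.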
On Linux x86_64, it gives us an edge on a benchmark that stresses Serial Full GC:
public class Retain {
    static final int RETAINED = Integer.getInteger("retained", 10_000_000);
    static final int GCS = Integer.getInteger("gcs", 100);
    static Object[] OBJECTS = new Object[RETAINED];

    public static void main(String... args) {
        for (int t = 0; t < GCS; t++) {
            for (int c = 0; c < RETAINED; c++) {
                OBJECTS[c] = new Object();
            }
            System.gc();
        }
    }
}
# build/baseline/bin/java -Xmx1g -Xlog:gc -XX:+UseSerialGC Retain.java
...
[36.315s][info][gc] GC(87) Pause Full (System.gc()) 349M->193M(989M) 372.745ms
[36.721s][info][gc] GC(88) Pause Full (System.gc()) 349M->193M(989M) 372.359ms
[37.127s][info][gc] GC(89) Pause Full (System.gc()) 349M->193M(989M) 372.485ms
[37.533s][info][gc] GC(90) Pause Full (System.gc()) 349M->193M(989M) 372.297ms
[37.939s][info][gc] GC(91) Pause Full (System.gc()) 349M->193M(989M) 372.208ms
[38.345s][info][gc] GC(92) Pause Full (System.gc()) 349M->193M(989M) 372.413ms
[38.751s][info][gc] GC(93) Pause Full (System.gc()) 349M->193M(989M) 372.380ms
[39.157s][info][gc] GC(94) Pause Full (System.gc()) 349M->193M(989M) 372.505ms
[39.562s][info][gc] GC(95) Pause Full (System.gc()) 349M->193M(989M) 372.383ms
...
# build/linux-x86_64-server-release/images/jdk/bin/java -Xmx1g -Xlog:gc -XX:+UseSerialGC Retain.java
...
[37.331s][info][gc] GC(91) Pause Full (System.gc()) 349M->193M(989M) 366.610ms
[37.731s][info][gc] GC(92) Pause Full (System.gc()) 349M->193M(989M) 366.164ms
[38.130s][info][gc] GC(93) Pause Full (System.gc()) 349M->193M(989M) 365.986ms
[38.530s][info][gc] GC(94) Pause Full (System.gc()) 349M->193M(989M) 365.886ms
[38.930s][info][gc] GC(95) Pause Full (System.gc()) 349M->193M(989M) 365.634ms
[39.330s][info][gc] GC(96) Pause Full (System.gc()) 349M->193M(989M) 366.460ms
[39.729s][info][gc] GC(97) Pause Full (System.gc()) 349M->193M(989M) 366.118ms
...
So, Full GC is about 2% faster (roughly 372.4 ms down to 366.1 ms per pause), for a little change :)
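The speedup can be rechecked from the pause times in the two log excerpts above. The sketch below averages only the lines that are shown (the elided `...` lines are not included), and the names `kBaselineMs`, `kPatchedMs`, `mean` and `improvement_percent` are made up for this illustration.

```cpp
#include <numeric>
#include <vector>

// Pause times in ms, copied from the baseline log excerpt.
const std::vector<double> kBaselineMs = {
    372.745, 372.359, 372.485, 372.297, 372.208,
    372.413, 372.380, 372.505, 372.383};

// Pause times in ms, copied from the patched-build log excerpt.
const std::vector<double> kPatchedMs = {
    366.610, 366.164, 365.986, 365.886, 365.634, 366.460, 366.118};

// Arithmetic mean; callers are expected to pass a non-empty vector.
double mean(const std::vector<double>& xs) {
    return std::accumulate(xs.begin(), xs.end(), 0.0) / xs.size();
}

// Relative reduction in mean pause time, in percent of the baseline mean.
double improvement_percent(const std::vector<double>& baseline,
                           const std::vector<double>& patched) {
    const double b = mean(baseline);
    const double p = mean(patched);
    return 100.0 * (b - p) / b;
}
```

On the visible samples, `improvement_percent(kBaselineMs, kPatchedMs)` comes out near 1.7, in line with the rounded figure quoted above.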
-------------
PR Comment: https://git.openjdk.org/jdk/pull/14456#issuecomment-1590882906