RFR: 8334060: Implementation of Late Barrier Expansion for G1 [v2]
Martin Doerr
mdoerr at openjdk.org
Sun Jul 21 08:39:39 UTC 2024
On Thu, 20 Jun 2024 04:17:30 GMT, Roberto Castañeda Lozano <rcastanedalo at openjdk.org> wrote:
>> This changeset implements JEP 475 (Late Barrier Expansion for G1), including support for the x64 and aarch64 platforms. See the [JEP description](https://openjdk.org/jeps/475) for further detail.
>>
>> We aim to integrate this work in JDK 24. The purpose of this pull request is double-fold:
>>
>> - to allow maintainers of the arm (32-bit), ppc, riscv, s390, and x86 (32-bit) ports to contribute a port of these platforms in time for JDK 24; and
>> - to allow reviewers to review the platform-independent, x64 and aarch64, and test changes in parallel with the porting work.
>>
>> ## Summary of the Changes
>>
>> ### Platform-Independent Changes (`src/hotspot/share`)
>>
>> These consist mainly of:
>>
>> - a complete rewrite of `G1BarrierSetC2`, to instruct C2 to expand G1 barriers late instead of early;
>> - a few minor changes to C2 itself, to support removal of redundant decompression operations and to address an OopMap construction issue triggered by this JEP's increased usage of ADL `TEMP` operands; and
>> - temporary support for porting the JEP to the remaining platforms.
>>
>> The temporary support code (guarded by the pre-processor flag `G1_LATE_BARRIER_MIGRATION_SUPPORT`) will **not** be part of the final pull request, and hence does not need to be reviewed.
>>
>> ### Platform-Dependent Changes (`src/hotspot/cpu`)
>>
>> These include changes to the ADL instruction definitions and the `G1BarrierSetAssembler` class of the x64 and aarch64 platforms.
>>
>> #### ADL Changes
>>
>> The changeset uses ADL predicates to force C2 to implement memory accesses tagged with barrier information using G1-specific, barrier-aware instruction versions (e.g. `g1StoreP` instead of the GC-agnostic `storeP`). These new instruction versions generate machine code accordingly to the corresponding tagged barrier information, relying on the G1 barrier implementations provided by the `G1BarrierSetAssembler` class. In the aarch64 platform, the bulk of the ADL code is generated from a higher-level version using m4, to reduce redundancy.
>>
>> #### `G1BarrierSetAssembler` Changes
>>
>> Both platforms basically reuse the barrier implementation for the bytecode interpreter, with the different barrier tests and operations refactored into dedicated functions. Besides this, `G1BarrierSetAssembler` is extended with assembly-stub routines that implement the out-of-line, slow path of the barriers. These routines include calls from the barrier into the JVM, which require support for saving and restoring live ...
>
> Roberto Castañeda Lozano has updated the pull request incrementally with one additional commit since the last revision:
>
> Build barrier data in G1BarrierSetC2::get_store_barrier() by adding, rather than removing, barrier tags
I have looked at the x86 implementation and I have some performance tuning ideas. Please take a look. I guess at least some of your code is performance critical.
src/hotspot/cpu/x86/gc/g1/g1_x86_64.ad line 86:
> 84: // an indirect memory operand) to reduce C2's scheduling and register
> 85: // allocation pressure (fewer Mach nodes). The same holds for g1StoreN and
> 86: // g1EncodePAndStoreN.
I'm not convinced that this is beneficial. We're wasting a temp register just for an addition?
src/hotspot/cpu/x86/gc/g1/g1_x86_64.ad line 123:
> 121: if ((barrier_data() & G1C2BarrierPost) != 0) {
> 122: __ movl($tmp2$$Register, $src$$Register);
> 123: if ((barrier_data() & G1C2BarrierPostNotNull) == 0) {
`decode_heap_oop` contains a null check in some cases which makes some of your code redundant. Optimization idea: In case of `(((barrier_data() & G1C2BarrierPostNotNull) == 0) && CompressedOops::base() != nullptr)` use a null check and bail out because there's nothing left to do if it's null. After that, we can always use `decode_heap_oop_not_null`.
src/hotspot/cpu/x86/gc/g1/g1_x86_64.ad line 182:
> 180: $tmp2$$Register /* pre_val */,
> 181: $tmp3$$Register /* tmp */,
> 182: RegSet::of($mem$$Register, $newval$$Register, $oldval$$Register) /* preserve */);
The only value which can get overwritten is `oldval`. Optimization idea: Pass `oldval` to the SATB barrier. There is no load of the old value required.
src/hotspot/cpu/x86/gc/g1/g1_x86_64.ad line 301:
> 299: RegSet::of($mem$$Register, $newval$$Register) /* preserve */);
> 300: __ movq($tmp1$$Register, $newval$$Register);
> 301: __ xchgq($newval$$Register, Address($mem$$Register, 0));
Optimization idea: Despite its name, `g1_pre_write_barrier` can be moved after the xchg operation because there's no safepoint within this MachNode. This allows avoiding loading the old value twice.
-------------
PR Review: https://git.openjdk.org/jdk/pull/19746#pullrequestreview-2190271351
PR Review Comment: https://git.openjdk.org/jdk/pull/19746#discussion_r1685680587
PR Review Comment: https://git.openjdk.org/jdk/pull/19746#discussion_r1685682308
PR Review Comment: https://git.openjdk.org/jdk/pull/19746#discussion_r1685683332
PR Review Comment: https://git.openjdk.org/jdk/pull/19746#discussion_r1685683768
More information about the hotspot-gc-dev
mailing list