RFR: 8334060: Implementation of Late Barrier Expansion for G1
Roberto Castañeda Lozano
rcastanedalo at openjdk.org
Mon Jun 17 12:13:51 UTC 2024
This changeset implements JEP 475 (Late Barrier Expansion for G1), including support for the x64 and aarch64 platforms. See the [JEP description](https://openjdk.org/jeps/475) for further detail.
We aim to integrate this work in JDK 24. The purpose of this pull request is twofold:
- to allow maintainers of the arm (32-bit), ppc, riscv, s390, and x86 (32-bit) ports to contribute ports to these platforms in time for JDK 24; and
- to allow reviewers to review the platform-independent, x64 and aarch64, and test changes in parallel with the porting work.
## Summary of the Changes
### Platform-Independent Changes (`src/hotspot/share`)
These consist mainly of:
- a complete rewrite of `G1BarrierSetC2`, to instruct C2 to expand G1 barriers late instead of early;
- a few minor changes to C2 itself, to support removal of redundant decompression operations and to address an OopMap construction issue triggered by this JEP's increased usage of ADL `TEMP` operands; and
- temporary support for porting the JEP to the remaining platforms.
The temporary support code (guarded by the pre-processor flag `G1_LATE_BARRIER_MIGRATION_SUPPORT`) will **not** be part of the final pull request, and hence does not need to be reviewed.
### Platform-Dependent Changes (`src/hotspot/cpu`)
These include changes to the ADL instruction definitions and the `G1BarrierSetAssembler` class of the x64 and aarch64 platforms.
#### ADL Changes
The changeset uses ADL predicates to force C2 to implement memory accesses tagged with barrier information using G1-specific, barrier-aware instruction versions (e.g. `g1StoreP` instead of the GC-agnostic `storeP`). These new instruction versions generate machine code according to the corresponding tagged barrier information, relying on the G1 barrier implementations provided by the `G1BarrierSetAssembler` class. On the aarch64 platform, the bulk of the ADL code is generated from a higher-level version using m4, to reduce redundancy.
#### `G1BarrierSetAssembler` Changes
Both platforms largely reuse the barrier implementation for the bytecode interpreter, with the different barrier tests and operations refactored into dedicated functions. Besides this, `G1BarrierSetAssembler` is extended with assembly-stub routines that implement the out-of-line, slow path of the barriers. These routines include calls from the barrier into the JVM, which require support for saving and restoring live registers, provided by the `SaveLiveRegisters` class. This class is already available on all platforms that support ZGC (see [JDK-8330685](https://bugs.openjdk.org/browse/JDK-8330685)).
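The save/restore pattern around the slow-path call can be illustrated with a toy model. This is only a sketch: `ToyAssembler`, `ToySaveLiveRegisters`, and `generate_toy_slow_path` are hypothetical names invented here, the emitted "instructions" are plain strings, and the scoped (RAII-style) design is an assumption rather than a description of HotSpot's actual `SaveLiveRegisters` class.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Hypothetical assembler that records pseudo-instructions as strings.
struct ToyAssembler {
    std::vector<std::string> code;
    void emit(const std::string& insn) { code.push_back(insn); }
};

// Hypothetical scoped helper: saves the given live registers on construction
// and restores them in reverse order on destruction.
class ToySaveLiveRegisters {
public:
    ToySaveLiveRegisters(ToyAssembler& masm, std::vector<std::string> live)
        : _masm(masm), _live(std::move(live)) {
        for (const auto& reg : _live) {
            _masm.emit("push " + reg);
        }
    }
    ~ToySaveLiveRegisters() {
        for (auto it = _live.rbegin(); it != _live.rend(); ++it) {
            _masm.emit("pop " + *it);
        }
    }
private:
    ToyAssembler& _masm;
    std::vector<std::string> _live;
};

// Emits a toy slow path: save live registers, call the runtime, restore.
std::vector<std::string> generate_toy_slow_path(const std::vector<std::string>& live) {
    ToyAssembler masm;
    {
        ToySaveLiveRegisters save(masm, live);  // registers live at the barrier point
        masm.emit("call runtime");              // stands in for the call into the JVM
    }                                           // restore happens when the scope closes
    return masm.code;
}
```

Calling `generate_toy_slow_path({"r1", "r2"})` yields the saves, the call, and then the restores in reverse order, which is the shape a real port has to reproduce according to its platform's calling convention.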
### Test Changes (`test`)
The changeset includes:
- a comprehensive set of tests that verify, using the [IR Test Framework](https://github.com/openjdk/jdk/blob/master/test/hotspot/jtreg/compiler/lib/ir_framework/README.md), that barriers are generated and optimized as expected in different scenarios and configurations;
- a test that triggers a latent issue in C2's OopMap building logic (addressed by this changeset), where undefined values generated by C2's implementation of ADL `TEMP` operands are included in OopMaps;
- removal of memory limits in tests where, before this changeset, C2 would instead bail out silently;
- adjustment of the OptoAssembly output expected by `compiler.c2.aarch64.TestVolatiles`; and
- relaxation of the expectations in a case of `compiler.c2.irTests.scalarReplacement.AllocationMergesTests` where C2 is now able to reduce an allocation.
## Notes for Port Maintainers
Porting this JEP to a different platform involves the following tasks:
- Predicate all existing ADL instructions that may match a C2 memory access operation so that the match is only enabled if the operations do not include barrier information (`barrier_data() == 0`). The relevant memory access operations are (for `X` in `{P, N}`): `StoreX`, `CompareAndExchangeX`, `CompareAndSwapX`, `WeakCompareAndSwapX`, `GetAndSetX`, and `LoadX`.
- Create a new ADL file (with suggested name `src/hotspot/cpu/$PLATFORM/gc/g1/g1_$PLATFORM.ad`) where all G1-specific memory access instructions are defined and predicated with (`UseG1GC && barrier_data() != 0`). It is important to use the same instruction naming convention for all platforms (`g1StoreX`, `g1CompareAndExchangeX`, etc.), to support running the new IR Test Framework tests.
The instruction implementations are responsible for generating the appropriate barrier code as well as the memory access itself. Generating the barrier code typically involves reserving temporary registers to support OOP encoding/decoding and the barrier operations themselves, creating a `G1PreBarrierStubC2` or `G1PostBarrierStubC2`, notifying the stub object which registers are and aren't live at the barrier point (besides those that are live out of the entire instruction), and calling into `G1BarrierSetAssembler` to generate the actual barrier machine code.
- Generalize, and possibly refactor, the logic already existing in `G1BarrierSetAssembler` to generate barrier code from the newly introduced ADL instructions.
- Implement `G1BarrierSetAssembler::generate_c2_pre_barrier_stub()` and `G1BarrierSetAssembler::generate_c2_post_barrier_stub()` to generate the out-of-line, slow path of the barriers. The slow path includes a call into the JVM, which is supported by the `SaveLiveRegisters` class, typically implemented in `src/hotspot/cpu/$PLATFORM/gc/shared/barrierSetAssembler_$PLATFORM.*`.
- For the arm (32-bit), s390, and x86 (32-bit) platforms, implement the `SaveLiveRegisters` class (or similar support). Essentially, this class implements saving and restoring registers given by the barrier stub class, according to the calling convention of the platform.
- Optionally, extend `Matcher::pd_clone_node()` and introduce an additional `g1EncodePAndStoreN` ADL instruction to support the removal of redundant decompression operations (see the [JEP description](https://openjdk.org/jeps/475), subsection "Candidate optimizations").
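The predicate split described in the first two tasks can be modelled outside of ADL. The following is a minimal sketch, not ADL and not HotSpot code: `MemAccess`, `matches_gc_agnostic`, `matches_g1`, and `select_store_rule` are hypothetical names, and only the two conditions (`barrier_data() == 0` and `UseG1GC && barrier_data() != 0`) and the instruction names `storeP` and `g1StoreP` come from the description above.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Toy stand-in for a C2 memory access node; only the barrier tag matters here.
struct MemAccess {
    std::uint8_t barrier_data;  // 0 means no barrier information is attached
};

// Hypothetical global mirroring the UseG1GC flag.
static bool UseG1GC = true;

// Predicate for an existing, GC-agnostic instruction such as storeP.
bool matches_gc_agnostic(const MemAccess& n) {
    return n.barrier_data == 0;
}

// Predicate for a G1-specific instruction such as g1StoreP.
bool matches_g1(const MemAccess& n) {
    return UseG1GC && n.barrier_data != 0;
}

// Picks the rule name for a pointer store. In this toy model, "none" means
// that neither predicate holds (for example, a tagged access under another GC).
std::string select_store_rule(const MemAccess& n) {
    if (matches_g1(n)) return "g1StoreP";
    if (matches_gc_agnostic(n)) return "storeP";
    return "none";
}
```

The two predicates are mutually exclusive by construction, so a tagged access can never be matched by the GC-agnostic rule and silently lose its barrier.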
To ease these tasks, this changeset includes the following additional functionality, which is guarded by the pre-processor flag `G1_LATE_BARRIER_MIGRATION_SUPPORT` and will be removed after all ports are merged into this pull request and before integration:
- `G1UseLateBarrierExpansion` JVM flag to enable/disable late barrier expansion. This flag can be useful to diagnose issues in the generated barrier code by comparing to that generated for early barrier expansion.
- `G1StressBarriers` JVM flag to run G1 under an extreme configuration that exercises otherwise rarely executed barrier paths. This flag can be useful to improve test coverage and find low-frequency bugs in the barrier implementation.
## Testing
### Functionality
| tests | default configuration | `-XX:-UseCompressedOops` | `-XX:+G1StressBarriers` | `-XX:-UseCompressedOops -XX:+G1StressBarriers` |
| ----- | --------------------- | ------------------------ | ----------------------- | ---------------------------------------------- |
| tier1-tier3 | all Oracle-supported platforms * | all Oracle-supported platforms | all Oracle-supported platforms | all Oracle-supported platforms |
| tier4-tier5 | all Oracle-supported platforms | linux-x64, linux-aarch64 | linux-x64, linux-aarch64 | - |
| tier6-tier8 | all Oracle-supported platforms | - | - | - |
| jcstress | linux-x64, linux-aarch64 | linux-x64, linux-aarch64 | linux-x64, linux-aarch64 | linux-x64, linux-aarch64 |
* _all Oracle-supported platforms_: linux-x64, windows-x64, macosx-x64, linux-aarch64, macosx-aarch64
### C2 Execution Time
On average, this changeset reduces C2's execution time, when using the G1 collector, by 15% (on x64) and 18% (on aarch64) across all [DaCapo 23.11-chopin](https://www.dacapobench.org/) benchmarks. C2's execution time is measured, on both compared JVM versions, using the HotSpot options `-Xbatch -XX:-TieredCompilation -XX:+CITime`.
### Quality of C2-Generated Code
#### Speed
Over a wide range of standard benchmark suites (including DaCapo 9.12-bach, DaCapo 23.11-chopin, Renaissance, SPECjbb2015, and SPECjvm2008) run across all Oracle-supported platforms, this changeset yields 101 statistically significant speedups (including four double-digit speedups, up to 17%) and 16 statistically significant regressions (down to -6%). This general speedup can be explained by a combination of the following three factors: a positive net effect in the compiler inlining heuristics (due to differences in how these handle barrier code), the effect of a lower C2 overhead in short-running and/or CPU-saturating benchmarks, and, to a lesser extent, C2 optimizations enabled by late barrier expansion (such as [barrier elision](https://mail.openjdk.org/pipermail/hotspot-gc-dev/2024-March/046283.html) and [allocation reduction](https://github.com/openjdk/jdk/compare/master...robcasloz:jdk:JDK-8334060-g1-late-barrier-expansion#diff-bd6f2e9621b343c4322e740e187e1874dd809bfb51af621ae5782522bc17ae8fR1358-R1363)).
#### Size
The changeset does not cause any statistically significant difference in the size of the code generated by C2 for any [DaCapo 23.11-chopin](https://www.dacapobench.org/) benchmark on the x64 or aarch64 platforms. The size of C2-generated code is measured using the same HotSpot options as for C2 execution time, and compared after normalizing to the total number of bytecodes compiled (i.e. the unit of comparison is C2-compiled bytes per C2-compiled bytecode).
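The normalization can be written out as a short calculation. This is only an illustrative sketch: `CompileStats`, `bytes_per_bytecode`, and `relative_change` are hypothetical names, and any numbers passed to them are made up for the example rather than taken from the measurements above.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical totals collected from one benchmark run.
struct CompileStats {
    std::uint64_t code_bytes;  // bytes of machine code emitted by C2
    std::uint64_t bytecodes;   // bytecodes compiled by C2
};

// Unit of comparison: C2-compiled bytes per C2-compiled bytecode.
double bytes_per_bytecode(const CompileStats& s) {
    if (s.bytecodes == 0) return 0.0;  // nothing was compiled
    return static_cast<double>(s.code_bytes) / static_cast<double>(s.bytecodes);
}

// Relative change of the normalized size from baseline to changed JVM
// (0.0 means no difference, 0.5 means 50% larger).
double relative_change(const CompileStats& baseline, const CompileStats& changed) {
    const double base = bytes_per_bytecode(baseline);
    if (base == 0.0) return 0.0;
    return (bytes_per_bytecode(changed) - base) / base;
}
```

Normalizing this way keeps the comparison meaningful even when the two JVM versions end up compiling a different number of bytecodes in the same benchmark.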
### Comprehensibility to Non-C2 Developers
In parallel with the JEP implementation, two Oracle GC engineers have successfully and independently prototyped non-trivial G1 barrier enhancements using the late barrier expansion model. According to their feedback, these enhancements would have been significantly harder to prototype, if at all possible, using early barrier expansion without requiring assistance from a C2 engineer.
-------------
Commit messages:
- Implement JEP 475
Changes: https://git.openjdk.org/jdk/pull/19746/files
Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=19746&range=00
Issue: https://bugs.openjdk.org/browse/JDK-8334060
Stats: 3656 lines in 35 files changed: 3355 ins; 174 del; 127 mod
Patch: https://git.openjdk.org/jdk/pull/19746.diff
Fetch: git fetch https://git.openjdk.org/jdk.git pull/19746/head:pull/19746
PR: https://git.openjdk.org/jdk/pull/19746