RFR: 8318446: C2: implement StoreNode::Ideal_merge_stores
Emanuel Peter
epeter at openjdk.org
Tue Jan 16 15:13:32 UTC 2024
This is a feature requiested by @RogerRiggs and @cl4es .
**Idea**
Merging multiple consecutive small stores (e.g. 8 byte stores) into larger stores (e.g. one long store) can lead to speedup.
Recently, @cl4es and @RogerRiggs had to review a few PR's where people would try to get
speedups by using
Unsafe (e.g. `Unsafe.putLongUnaligned`), or
ByteArrayLittleEndian (e.g. `ByteArrayLittleEndian.setLong`).
They have asked if we can do such an optimization in C2, rather than in the Java library code, or even user code.
This patch here supports a few simple use-cases, like these:
Merge consecutive array stores, with constants. We can combine the separate constants into a larger constant:
https://github.com/openjdk/jdk/blob/adca9e220822884d95d73c7f070adeee2632130d/test/hotspot/jtreg/compiler/c2/TestMergeStores.java#L383-L395
Merge consecutive array stores, with a variable that was split (using shifts). We can essentially undo the
splitting (i.e. shifting and truncation), and directly store the variable:
https://github.com/openjdk/jdk/blob/adca9e220822884d95d73c7f070adeee2632130d/test/hotspot/jtreg/compiler/c2/TestMergeStores.java#L444-L456
The idea is that this would allow the introduction of a very simple API, without any "heavy" dependencies (Unsafe or ByteArrayLittleEndian):
https://github.com/openjdk/jdk/blob/adca9e220822884d95d73c7f070adeee2632130d/test/hotspot/jtreg/compiler/c2/TestMergeStores.java#L327-L338
https://github.com/openjdk/jdk/blob/adca9e220822884d95d73c7f070adeee2632130d/test/hotspot/jtreg/compiler/c2/TestMergeStores.java#L467-L472
**Details**
This draft currently implements the optimization in an additional special IGVN phase:
https://github.com/openjdk/jdk/blob/adca9e220822884d95d73c7f070adeee2632130d/src/hotspot/share/opto/compile.cpp#L2479-L2485
We first collect all `StoreB|C|I`, and put them in the IGVN worklist (see `Compile::gather_nodes_for_merge_stores`).
During IGVN, we call `StoreNode::Ideal_merge_stores` at the end (after all other optimizations) of `StoreNode::Ideal`.
We essentially try to establish a chain of mergable stores:
https://github.com/openjdk/jdk/blob/adca9e220822884d95d73c7f070adeee2632130d/src/hotspot/share/opto/memnode.cpp#L2802-L2806
Mergable stores must have the same Opcode (implies they have the same element type and hence size).
Further, mergable stores must have the same control (or be separated by only a RangeCheck).
Further, they must either both store constants, or adjacent segments of a larger value (i.e. a larger value right-shifted by a constant offset, see`is_con_RShift`).
Further, we must be able to prove that the stores reference adjacent memory (i.e. the address is shifted by the element size).
For two mergable stores (one `use`, one `def`), the def-store should not have any other use than the use-store,
so that we only merge stores that are in the same basic block. With the only exceptions of merging through
RangeChecks (which can have MergeMem nodes on the memory path, and hence such MergeMem are allowed
as secondary uses of the def-node).
I made this optimization a new phase, and placed it after loop-opts for these reasons:
- I do not want it to interfere with loop-opts, in particular with the autovectorizer (SuperWord).
- I don't want it to interfere with any other memory optimizations, this should just improve things if nothing else worked.
- Checking if two memory addresses are adjacent is much simpler after loop-opts, when some of the `CastII` nodes have disappeared, and the address expression becomes much simpler (in particular, the constants from the integer index can only sink through the CastI2L after loop-opts). We could do adjacency checking with a more complicated algorithm, such as `VPointer` in the current autovectorizer.
**Performance**
TODO
**Testing**
Tier 1-6 + stress-testing.
Performance testing: no significant difference.
-------------
Commit messages:
- fix flag initialization issue
- merge manually
- Merge branch 'master' into JDK-8318446
- make sure the unsafe accesses are always recognized
- swap number and type
- add short and int examples to bench
- rename array
- add LE-API to bench
- improved the benchmark
- fix bad if problem
- ... and 20 more: https://git.openjdk.org/jdk/compare/7e0a4ed6...adca9e22
Changes: https://git.openjdk.org/jdk/pull/16245/files
Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=16245&range=00
Issue: https://bugs.openjdk.org/browse/JDK-8318446
Stats: 2007 lines in 11 files changed: 2006 ins; 0 del; 1 mod
Patch: https://git.openjdk.org/jdk/pull/16245.diff
Fetch: git fetch https://git.openjdk.org/jdk.git pull/16245/head:pull/16245
PR: https://git.openjdk.org/jdk/pull/16245
More information about the hotspot-compiler-dev
mailing list