RFR: 8318446: C2: optimize stores into primitive arrays by combining values into larger store

Fri Jan 19 15:23:29 UTC 2024

On Wed, 18 Oct 2023 13:04:35 GMT, Emanuel Peter <epeter at openjdk.org> wrote:

> This is a feature requiested by @RogerRiggs and @cl4es .
> 
> **Idea**
> 
> Merging multiple consecutive small stores (e.g. 8 byte stores) into larger stores (e.g. one long store) can lead to speedup. Recently, @cl4es and @RogerRiggs had to review a few PR's where people would try to get speedups by using Unsafe (e.g. `Unsafe.putLongUnaligned`), or ByteArrayLittleEndian (e.g. `ByteArrayLittleEndian.setLong`). They have asked if we can do such an optimization in C2, rather than in the Java library code, or even user code.
> 
> This patch here supports a few simple use-cases, like these:
> 
> Merge consecutive array stores, with constants. We can combine the separate constants into a larger constant:
> https://github.com/openjdk/jdk/blob/adca9e220822884d95d73c7f070adeee2632130d/test/hotspot/jtreg/compiler/c2/TestMergeStores.java#L383-L395
> 
> Merge consecutive array stores, with a variable that was split (using shifts). We can essentially undo the splitting (i.e. shifting and truncation), and directly store the variable:
> https://github.com/openjdk/jdk/blob/adca9e220822884d95d73c7f070adeee2632130d/test/hotspot/jtreg/compiler/c2/TestMergeStores.java#L444-L456
> 
> The idea is that this would allow the introduction of a very simple API, without any "heavy" dependencies (Unsafe or ByteArrayLittleEndian):
> 
> https://github.com/openjdk/jdk/blob/adca9e220822884d95d73c7f070adeee2632130d/test/hotspot/jtreg/compiler/c2/TestMergeStores.java#L327-L338
> https://github.com/openjdk/jdk/blob/adca9e220822884d95d73c7f070adeee2632130d/test/hotspot/jtreg/compiler/c2/TestMergeStores.java#L467-L472
> 
> **Details**
> 
> This draft currently implements the optimization in an additional special IGVN phase:
> https://github.com/openjdk/jdk/blob/adca9e220822884d95d73c7f070adeee2632130d/src/hotspot/share/opto/compile.cpp#L2479-L2485
> 
> We first collect all `StoreB|C|I`, and put them in the IGVN worklist (see `Compile::gather_nodes_for_merge_stores`). During IGVN, we call `StoreNode::Ideal_merge_stores` at the end (after all other optimizations) of `StoreNode::Ideal`. We essentially try to establish a chain of mergable stores:
> 
> https://github.com/openjdk/jdk/blob/adca9e220822884d95d73c7f070adeee2632130d/src/hotspot/share/opto/memnode.cpp#L2802-L2806
> 
> Mergable stores must have the same Opcode (implies they have the same element type and hence size). Further, mergable stores must have the same control (or be separated by only a RangeCheck). Further, they must either both store constants, or adjacent segments of a larger value ...

There already are internal APIs and `VarHandles` to enable similar optimizations, see e.g. `jdk.internal.util.ByteArray/-LittleEndian::putInt/-Long`.

The very point of this RFE was to opportunistically enable similar optimizations more automatically in idiomatic java code without the need to bring out the big guns. Of course such automatic transformations will have some level of fragility and you might accidentally disable the optimization in a variety of ways (since C2 needs to draw the line somewhere) - but that's  the case for many other heuristic and opportunistic optimizations. Should we not optimize counted loops or do loop unrolling because it's easy to add something that makes C2 bail out on you?

Having this optimization in C2 also allows us to avoid dependencies on `VarHandles` in bootstrap sensitive code and still enable the optimization. It might also have a benefit on startup/warmup characteristics.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/16245#issuecomment-1900612869