RFR: 8318446: C2: optimize stores into primitive arrays by combining values into larger store
Emanuel Peter
epeter at openjdk.org
Mon Jan 29 07:48:28 UTC 2024
On Tue, 16 Jan 2024 23:08:37 GMT, Vladimir Kozlov <kvn at openjdk.org> wrote:
>> This is a feature requested by @RogerRiggs and @cl4es .
>>
>> **Idea**
>>
>> Merging multiple consecutive small stores (e.g. 8 byte stores) into larger stores (e.g. one long store) can lead to speedup. Recently, @cl4es and @RogerRiggs had to review a few PR's where people would try to get speedups by using Unsafe (e.g. `Unsafe.putLongUnaligned`), or ByteArrayLittleEndian (e.g. `ByteArrayLittleEndian.setLong`). They have asked if we can do such an optimization in C2, rather than in the Java library code, or even user code.
>>
>> This patch here supports a few simple use-cases, like these:
>>
>> Merge consecutive array stores, with constants. We can combine the separate constants into a larger constant:
>> https://github.com/openjdk/jdk/blob/adca9e220822884d95d73c7f070adeee2632130d/test/hotspot/jtreg/compiler/c2/TestMergeStores.java#L383-L395
>>
>> Merge consecutive array stores, with a variable that was split (using shifts). We can essentially undo the splitting (i.e. shifting and truncation), and directly store the variable:
>> https://github.com/openjdk/jdk/blob/adca9e220822884d95d73c7f070adeee2632130d/test/hotspot/jtreg/compiler/c2/TestMergeStores.java#L444-L456
>>
>> The idea is that this would allow the introduction of a very simple API, without any "heavy" dependencies (Unsafe or ByteArrayLittleEndian):
>>
>> https://github.com/openjdk/jdk/blob/adca9e220822884d95d73c7f070adeee2632130d/test/hotspot/jtreg/compiler/c2/TestMergeStores.java#L327-L338
>> https://github.com/openjdk/jdk/blob/adca9e220822884d95d73c7f070adeee2632130d/test/hotspot/jtreg/compiler/c2/TestMergeStores.java#L467-L472
>>
>> **Details**
>>
>> This draft currently implements the optimization in an additional special IGVN phase:
>> https://github.com/openjdk/jdk/blob/adca9e220822884d95d73c7f070adeee2632130d/src/hotspot/share/opto/compile.cpp#L2479-L2485
>>
>> We first collect all `StoreB|C|I`, and put them on the IGVN worklist (see `Compile::gather_nodes_for_merge_stores`). During IGVN, we call `StoreNode::Ideal_merge_stores` at the end of `StoreNode::Ideal` (after all other optimizations). We essentially try to establish a chain of mergable stores:
>>
>> https://github.com/openjdk/jdk/blob/adca9e220822884d95d73c7f070adeee2632130d/src/hotspot/share/opto/memnode.cpp#L2802-L2806
>>
>> Mergable stores must have the same Opcode (implies they have the same element type and hence size). Further, mergable stores must have the same control (or be separated by only a RangeCheck). Further, they must either bot...
>
> About changes. Maybe you can use something similar to ClearArrayNode: collect all stores into one node, and the corresponding Mach (machine) nodes will implement it using available instructions, instead of C2 deciding the size of the combined store.
>
> One drawback I see with these changes is that you may use a lot more registers to keep all the values.
>
> For constants you need to keep in mind the memory byte order (little- or big-endian).
@vnkozlov exactly, it is some kind of straight-line vectorization, at least if we generalize the pattern. And I asked about that earlier:
> @merykitty @cl4es @RogerRiggs @vnkozlov I wonder if you think that the approach of this PR is good, and if you have any suggestions about it?
> Is a separate phase ok?
> Is this PR in a sweet-spot that reaches the goals of the library folks, but is not too complex?
> Would you prefer a more general solution, like a straight-line SLP algorithm, that can merge (even vectorize) any load/store sequences, even merge accesses with different element sizes and with gaps/padding?
Currently, we only vectorize loops, but this patch also optimizes straight-line code. Of course performance mostly matters in loops, but sometimes not everything gets inlined, and then a straight-line optimization is still helpful.
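To make the target pattern concrete, here is a minimal sketch (class and method names are hypothetical, not from the patch) of the kind of straight-line code this optimization aims at: one int value split with shifts into four consecutive byte stores, which C2 could then merge back into a single 4-byte store on a little-endian platform.

```java
// Hypothetical example of the store pattern targeted by the patch:
// four consecutive byte stores of one int, split via shifts.
// The optimization may merge these into a single StoreI.
public class MergePattern {
    static void putIntLE(byte[] a, int off, int v) {
        a[off]     = (byte)  v;          // least-significant byte first
        a[off + 1] = (byte) (v >>  8);
        a[off + 2] = (byte) (v >> 16);
        a[off + 3] = (byte) (v >> 24);   // most-significant byte last
    }

    public static void main(String[] args) {
        byte[] a = new byte[8];
        putIntLE(a, 2, 0x04030201);
        // Bytes land in ascending addresses: 1 2 3 4
        System.out.println(a[2] + " " + a[3] + " " + a[4] + " " + a[5]);
    }
}
```

Whether the merge actually happens depends on the stores being adjacent, having the same opcode and control, and the platform supporting the wider unaligned store.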
It would be nice to include similar patterns in the auto-vectorizer. But that would require detecting patterns that cross vector lanes, which we currently only do for reductions. And for storing constants, we currently only allow it if all lanes use the same value, so that it can be broadcast. I have been thinking about all these things, but I will have to see what is feasible, and I think they are a bit lower priority for now.
TLDR: this RFE deals with straight-line code, while the vectorizer (so far) deals only with loops. So they are two optimizations that can stand on their own.
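On the endianness point raised above: when the merged stores are constants, the separate byte constants fold into one wider constant, and the byte order of the machine determines how. A small sketch (method name hypothetical) of the little-endian case, where the byte stored at the lowest address becomes the least-significant byte of the merged constant:

```java
// Sketch of folding four byte constants into one int constant on a
// little-endian machine: a[0]=0x11, a[1]=0x22, a[2]=0x33, a[3]=0x44
// would become a single 4-byte store of 0x44332211.
public class MergeConstants {
    static int mergeLE(byte b0, byte b1, byte b2, byte b3) {
        return (b0 & 0xFF)
             | (b1 & 0xFF) <<  8
             | (b2 & 0xFF) << 16
             | (b3 & 0xFF) << 24;
    }

    public static void main(String[] args) {
        System.out.printf("0x%08X%n",
            mergeLE((byte) 0x11, (byte) 0x22, (byte) 0x33, (byte) 0x44));
        // prints 0x44332211
    }
}
```

On a big-endian machine the fold runs the other way, which is exactly why the compiler, not the library code, is the right place to pick the merged constant.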
-------------
PR Comment: https://git.openjdk.org/jdk/pull/16245#issuecomment-1914131085
More information about the hotspot-compiler-dev mailing list