RFR: 8318446: C2: optimize stores into primitive arrays by combining values into larger store [v33]

Fri Apr 5 19:02:14 UTC 2024

On Fri, 5 Apr 2024 10:49:30 GMT, Emanuel Peter <epeter at openjdk.org> wrote:

>> This is a feature requiested by @RogerRiggs and @cl4es .
>> 
>> **Idea**
>> 
>> Merging multiple consecutive small stores (e.g. 8 byte stores) into larger stores (e.g. one long store) can lead to speedup. Recently, @cl4es and @RogerRiggs had to review a few PR's where people would try to get speedups by using Unsafe (e.g. `Unsafe.putLongUnaligned`), or ByteArrayLittleEndian (e.g. `ByteArrayLittleEndian.setLong`). They have asked if we can do such an optimization in C2, rather than in the Java library code, or even user code.
>> 
>> This patch here supports a few simple use-cases, like these:
>> 
>> Merge consecutive array stores, with constants. We can combine the separate constants into a larger constant:
>> https://github.com/openjdk/jdk/blob/adca9e220822884d95d73c7f070adeee2632130d/test/hotspot/jtreg/compiler/c2/TestMergeStores.java#L383-L395
>> 
>> Merge consecutive array stores, with a variable that was split (using shifts). We can essentially undo the splitting (i.e. shifting and truncation), and directly store the variable:
>> https://github.com/openjdk/jdk/blob/adca9e220822884d95d73c7f070adeee2632130d/test/hotspot/jtreg/compiler/c2/TestMergeStores.java#L444-L456
>> 
>> The idea is that this would allow the introduction of a very simple API, without any "heavy" dependencies (Unsafe or ByteArrayLittleEndian):
>> 
>> https://github.com/openjdk/jdk/blob/adca9e220822884d95d73c7f070adeee2632130d/test/hotspot/jtreg/compiler/c2/TestMergeStores.java#L327-L338
>> https://github.com/openjdk/jdk/blob/adca9e220822884d95d73c7f070adeee2632130d/test/hotspot/jtreg/compiler/c2/TestMergeStores.java#L467-L472
>> 
>> **Details**
>> 
>> This draft currently implements the optimization in an additional special IGVN phase:
>> https://github.com/openjdk/jdk/blob/adca9e220822884d95d73c7f070adeee2632130d/src/hotspot/share/opto/compile.cpp#L2479-L2485
>> 
>> We first collect all `StoreB|C|I`, and put them in the IGVN worklist (see `Compile::gather_nodes_for_merge_stores`). During IGVN, we call `StoreNode::Ideal_merge_stores` at the end (after all other optimizations) of `StoreNode::Ideal`. We essentially try to establish a chain of mergable stores:
>> 
>> https://github.com/openjdk/jdk/blob/adca9e220822884d95d73c7f070adeee2632130d/src/hotspot/share/opto/memnode.cpp#L2802-L2806
>> 
>> Mergable stores must have the same Opcode (implies they have the same element type and hence size). Further, mergable stores must have the same control (or be separated by only a RangeCheck). Further, they must either bot...
>
> Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision:
> 
>   handle UseUnalignedAccesses in test, and a few cosmetics

src/hotspot/share/opto/memnode.cpp line 2880:

> 2878: // the optimization, if this RangeCheck[i+1] fails, then we execute only StoreB[i+0], and then trap. After
> 2879: // the optimization, the new StoreI[i+0] is on the passing path of RangeCheck[i+1], and StoreB[i+0] on the
> 2880: // failing path.

Can we detect presence of RangeCheck which may cause us to move some stores on fail path and bailout the optimization. I don't think it is frequent case.  I assume you will get RC on each store or not at all ("main" part of counted loop). Am I wrong here?
I don't remember, does C2 optimize RangeCheck nodes in linear code (it does in loops)?

test/micro/org/openjdk/bench/vm/compiler/MergeStores.java line 2:

> 1: /*
> 2:  * Copyright (c) 2023, Oracle and/or its affiliates. All rights reserved.

New Year ;)

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/16245#discussion_r1554169054
PR Review Comment: https://git.openjdk.org/jdk/pull/16245#discussion_r1554156929