RFR: 8345485: C2 MergeLoads: merge adjacent array/native memory loads into larger load [v12]

Thu Apr 17 10:16:58 UTC 2025

On Mon, 7 Apr 2025 02:24:03 GMT, kuaiwei <duke at openjdk.org> wrote:

>> In this patch, I extent the merge stores optimization to merge adjacents loads. Tier1 tests are passed in my machine.
>> 
>> The benchmark result of MergeLoadBench.java
>> AMD EPYC 9T24 96-Core Processor:
>> 
>> |name   | -MergeLoads   | +MergeLoads   |delta|
>> |---|---|---|---|
>> |MergeLoadBench.getCharB        |4352.150       |4407.435       | 55.29 |
>> |MergeLoadBench.getCharBU       |4075.320       |4084.663       | 9.34 |
>> |MergeLoadBench.getCharBV       |3221.302       |3221.528       | 0.23 |
>> |MergeLoadBench.getCharC        |2235.433       |2238.796       | 3.36 |
>> |MergeLoadBench.getCharL        |4363.244       |4372.281       | 9.04 |
>> |MergeLoadBench.getCharLU       |4072.550       |4075.744       | 3.19 |
>> |MergeLoadBench.getCharLV       |2227.825       |2231.612       | 3.79 |
>> |MergeLoadBench.getIntB |11199.935      |6869.030       | -4330.90 |
>> |MergeLoadBench.getIntBU        |6853.862       |2763.923       | -4089.94 |
>> |MergeLoadBench.getIntBV        |306.953        |309.911        | 2.96 |
>> |MergeLoadBench.getIntL |10426.843      |6523.716       | -3903.13 |
>> |MergeLoadBench.getIntLU        |6740.847       |2602.701       | -4138.15 |
>> |MergeLoadBench.getIntLV        |2233.151       |2231.745       | -1.41 |
>> |MergeLoadBench.getIntRB        |11335.756      |8980.619       | -2355.14 |
>> |MergeLoadBench.getIntRBU       |7439.873       |3190.208       | -4249.66 |
>> |MergeLoadBench.getIntRL        |16323.040      |7786.842       | -8536.20 |
>> |MergeLoadBench.getIntRLU       |7457.745       |3364.140       | -4093.61 |
>> |MergeLoadBench.getIntRU        |2512.621       |2511.668       | -0.95 |
>> |MergeLoadBench.getIntU |2501.064       |2500.629       | -0.43 |
>> |MergeLoadBench.getLongB        |21175.442      |21103.660      | -71.78 |
>> |MergeLoadBench.getLongBU       |14042.046      |2512.784       | -11529.26 |
>> |MergeLoadBench.getLongBV       |606.448        |606.171        | -0.28 |
>> |MergeLoadBench.getLongL        |23142.178      |23217.785      | 75.61 |
>> |MergeLoadBench.getLongLU       |14112.972      |2237.659       | -11875.31 |
>> |MergeLoadBench.getLongLV       |2230.416       |2231.224       | 0.81 |
>> |MergeLoadBench.getLongRB       |21152.558      |21140.583      | -11.98 |
>> |MergeLoadBench.getLongRBU      |14031.178      |2520.317       | -11510.86 |
>> |MergeLoadBench.getLongRL       |23248.506      |23136.410      | -112.10 |
>> |MergeLoadBench.getLongRLU      |14125.032      |2240.481       | -11884.55 |
>> |Merg...
>
> kuaiwei has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 14 commits:
> 
>  - Merge remote-tracking branch 'origin/master' into dev/merge_loads
>  - Remove unused code
>  - Move code to addnode.cpp and add more tests
>  - Merge remote-tracking branch 'origin/master' into dev/merge_loads
>  - Fix test
>  - Add more tests
>  - Enable StressIGVN and riscv platform
>  - Change tests as review comments
>  - Fix test failure and change for review comments
>  - Revert extract value and add more tests
>  - ... and 4 more: https://git.openjdk.org/jdk/compare/660b17a6...f6518b26

Even more coming later...

src/hotspot/share/opto/addnode.cpp line 816:

> 814:   LoadNode* const    _load;
> 815:   Node*     const _combine;
> 816:   int       const   _shift;

Suggestion:

  jint       const   _shift;

I prefer using `jint` etc when we are talking about values that correlate to java types. With the C/C++ numerical types, there could always be issues on different platforms with different bit sizes.

src/hotspot/share/opto/addnode.cpp line 823:

> 821:   void dump() {
> 822:     tty->print_cr("MergeLoadInfo: load: %d, combine: %d, shift: %d",
> 823:                   _load->_idx, _combine->_idx, _shift);

It would be nice if you also printed the Node name. e.g. `_load->Name()`. For me that is often more helpful.

src/hotspot/share/opto/addnode.cpp line 912:

> 910:     return false;
> 911:   }
> 912:   return true;

This feels inverted. I would do a switch case here anyway, and return `false` in the default case.

Hmm. LoadS is supposed to be implicitly sign extended. Can those upper sign bits not lead to issues when OR-ing without a mask?

src/hotspot/share/opto/addnode.cpp line 916:

> 914: 
> 915: bool MergePrimitiveLoads::is_supported_combine_opcode(int opc) {
> 916:   return opc == Op_OrI || opc == Op_OrL;

Ah, and here you do it "positively". I would also recommend using a switch case, it is nicer to extend later for AND and XOR.

src/hotspot/share/opto/addnode.cpp line 920:

> 918: 
> 919: // Go through ConvI2L which is unique output of input node
> 920: const Node* MergePrimitiveLoads::by_pass_i2l(const Node* n) {

Suggestion:

const Node* MergePrimitiveLoads::bypass_ConvI2L(const Node* n) {

I think `bypass` is a single verb, not two wordy ;)

src/hotspot/share/opto/addnode.cpp line 933:

> 931:  *
> 932:  * Load -> OrI/OrL
> 933:  *  It has no LShift usage and the And node is optimized out in previous optimization

This is a bit confusing... I thought what you wrote further up was much better, where you also had the ascii art. This here feels like a duplication of that up there. If there is anything here that is missing up there, then I would suggest that you just move it up there.

src/hotspot/share/opto/addnode.cpp line 946:

> 944:  *            | ((UNSAFE.getByte(array, address + 3) & 0xff) << 24);
> 945:  */
> 946: bool MergePrimitiveLoads::is_merged_load_candidate() const {

What is your definition of candidate here? It seems to have something to do about not having a `LShift`, why? Maybe you can give a good definition here or somewhere else?

src/hotspot/share/opto/addnode.cpp line 950:

> 948:   const Node* check = by_pass_i2l(_combine);
> 949:   if (check->outcnt() == 1 && check->unique_out()->Opcode() == _combine->Opcode()) {
> 950:     // It's in the middle of combine operators

Ah, I see. This is about checking that there is nothing further down to merge with....

But are you sure this is a good idea to check that there is no `OR` below? I mean there could be valid uses of `OrI` below that do some OR-ing with something else... Now you are forbidding these cases.

Look at this:

int x = (... merge load pattern with OR ...);
int y = ... some other value ...
int z = x | y;

I would expect that we could merge the loads here too, but your pattern matching seems to forbid it, right?

-------------

Changes requested by epeter (Reviewer).

PR Review: https://git.openjdk.org/jdk/pull/24023#pullrequestreview-2775212830
PR Review Comment: https://git.openjdk.org/jdk/pull/24023#discussion_r2048628508
PR Review Comment: https://git.openjdk.org/jdk/pull/24023#discussion_r2048631063
PR Review Comment: https://git.openjdk.org/jdk/pull/24023#discussion_r2048634906
PR Review Comment: https://git.openjdk.org/jdk/pull/24023#discussion_r2048635926
PR Review Comment: https://git.openjdk.org/jdk/pull/24023#discussion_r2048656876
PR Review Comment: https://git.openjdk.org/jdk/pull/24023#discussion_r2048641180
PR Review Comment: https://git.openjdk.org/jdk/pull/24023#discussion_r2048647672
PR Review Comment: https://git.openjdk.org/jdk/pull/24023#discussion_r2048652111