RFR: 8342393: Promote commutative vector IR node sharing [v23]

Wed Feb 5 09:51:28 UTC 2025

On Thu, 30 Jan 2025 14:49:35 GMT, Jatin Bhateja <jbhateja at openjdk.org> wrote:

>> Patch promotes the sharing of commutative vector IR with the same inputs but different input ordering.
>> Similar to scalar IR where we perform edge swapping by [sorting inputs](https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/addnode.cpp#L122) based on node indices during IR idealization.
>> 
>> Following are the performance stats for JMH micro included with the patch.
>> 
>> 
>> Granite Rapids (P-core Xeon Server)
>> Baseline : 
>> Benchmark                                                                (size)   Mode  Cnt      Score   Error   Units
>> VectorCommutativeOperSharingBenchmark.commutativeByteOperationShairing     1024  thrpt    2   8982.549          ops/ms
>> VectorCommutativeOperSharingBenchmark.commutativeIntOperationShairing      1024  thrpt    2   6072.773          ops/ms
>> VectorCommutativeOperSharingBenchmark.commutativeLongOperationShairing     1024  thrpt    2   2368.856          ops/ms
>> VectorCommutativeOperSharingBenchmark.commutativeShortOperationShairing    1024  thrpt    2  15215.087          ops/ms
>> 
>> Withopt:
>> Benchmark                                                                (size)   Mode  Cnt      Score   Error   Units
>> VectorCommutativeOperSharingBenchmark.commutativeByteOperationShairing     1024  thrpt    2  11963.554          ops/ms
>> VectorCommutativeOperSharingBenchmark.commutativeIntOperationShairing      1024  thrpt    2   7036.088          ops/ms
>> VectorCommutativeOperSharingBenchmark.commutativeLongOperationShairing     1024  thrpt    2   2906.731          ops/ms
>> VectorCommutativeOperSharingBenchmark.commutativeShortOperationShairing    1024  thrpt    2  17148.131          ops/ms
>> 
>> Sierra Forest (E-core Xeon Server)
>> Baseline:
>> Benchmark                                                                (size)   Mode  Cnt     Score   Error   Units
>> VectorCommutativeOperSharingBenchmark.commutativeByteOperationShairing     1024  thrpt    2  2444.359          ops/ms
>> VectorCommutativeOperSharingBenchmark.commutativeIntOperationShairing      1024  thrpt    2  1710.256          ops/ms
>> VectorCommutativeOperSharingBenchmark.commutativeLongOperationShairing     1024  thrpt    2   308.766          ops/ms
>> VectorCommutativeOperSharingBenchmark.commutativeShortOperationShairing    1024  thrpt    2  3902.179          ops/ms
>> 
>> Withopt:
>> Benchmark                                                                (size)   Mode  Cnt     Score   Error   Units
>> VectorCommutativeOperSharingBenchmark.com...
>
> Jatin Bhateja has refreshed the contents of this pull request, and previous commits have been removed. The incremental views will show differences compared to the previous content of the PR. The pull request contains one new commit since the last revision:
> 
>   Lowering feature check to IR annotation level

Changes requested by epeter (Reviewer).

src/hotspot/share/opto/vectornode.cpp line 1045:

> 1043: }
> 1044: 
> 1045: bool VectorNode::should_swap_inputs() {

Suggestion:

// If we have a AddVB(v1, v2) and AddVB(v2, v1), we want to swap the edges of one of them
// so that they become identical, and can common in global value numbering.
bool VectorNode::should_swap_inputs_to_help_global_value_numbering() {

src/hotspot/share/opto/vectornode.cpp line 1058:

> 1056:   if (is_predicated_vector()) {
> 1057:     return false;
> 1058:   }

Hmm. Can you give me a concrete example of a masked operation that would be filtered out?

Can it for example be a `AddVI`? But that only has 2 inputs for `VEC1` and `VEC2`. Where would the mask be located - and why does that not get us to `req() > 3`?

Ah, I see it can be added in `VectorNode::try_to_gen_masked_vector`, with `add_req`, but then we should have `req() > 3`.

Ok, this looks a bit complicated, but it looks like we are doing this.

  // Generate a vector mask for vector operation whose vector length is lower than the
  // hardware supported max vector length.

Ok, fine.

It could be good to add a comment here though, explaining why the operation seemingly has 3 inputs, but we don't exit at `req() != 3` above.

src/hotspot/share/opto/vectornode.cpp line 1105:

> 1103: 
> 1104:   // Sort inputs of commutative non-predicated vector operations to help value numbering.
> 1105:   if (should_swap_inputs()) {

Suggestion:

  if (should_swap_inputs_to_help_global_value_numbering()) {

src/hotspot/share/opto/vectornode.hpp line 91:

> 89:   static bool is_minmax_opcode(int opc);
> 90: 
> 91:   bool should_swap_inputs();

Suggestion:

  bool should_swap_inputs_to_help_global_value_numbering();

-------------

PR Review: https://git.openjdk.org/jdk/pull/22863#pullrequestreview-2595115845
PR Review Comment: https://git.openjdk.org/jdk/pull/22863#discussion_r1942531991
PR Review Comment: https://git.openjdk.org/jdk/pull/22863#discussion_r1942553076
PR Review Comment: https://git.openjdk.org/jdk/pull/22863#discussion_r1942531523
PR Review Comment: https://git.openjdk.org/jdk/pull/22863#discussion_r1942532338