RFR: 8342393: Promote commutative vector IR node sharing [v4]

Wed Jan 8 13:05:40 UTC 2025

On Wed, 8 Jan 2025 10:45:41 GMT, Jatin Bhateja <jbhateja at openjdk.org> wrote:

>> Patch promotes the sharing of commutative vector IR with the same inputs but different input ordering.
>> Unlike scalar IR where we perform edge swapping by [sorting inputs](https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/addnode.cpp#L122) based on node indices during IR idealization, for vector IR we chose a simpler approach to decorate commutative operations with a special node-level flag during IR construction thus
>> obviating any dependency on explicit idealization routines. This flag is later used during GVN hashing to enable node sharing.  
>> 
>> Following are the performance stats for JMH micro included with the patch.
>> 
>> 
>> Granite Rapids (P-core Xeon Server)
>> Baseline : 
>> Benchmark                                                                (size)   Mode  Cnt      Score   Error   Units
>> VectorCommutativeOperSharingBenchmark.commutativeByteOperationShairing     1024  thrpt    2   8982.549          ops/ms
>> VectorCommutativeOperSharingBenchmark.commutativeIntOperationShairing      1024  thrpt    2   6072.773          ops/ms
>> VectorCommutativeOperSharingBenchmark.commutativeLongOperationShairing     1024  thrpt    2   2368.856          ops/ms
>> VectorCommutativeOperSharingBenchmark.commutativeShortOperationShairing    1024  thrpt    2  15215.087          ops/ms
>> 
>> Withopt:
>> Benchmark                                                                (size)   Mode  Cnt      Score   Error   Units
>> VectorCommutativeOperSharingBenchmark.commutativeByteOperationShairing     1024  thrpt    2  11963.554          ops/ms
>> VectorCommutativeOperSharingBenchmark.commutativeIntOperationShairing      1024  thrpt    2   7036.088          ops/ms
>> VectorCommutativeOperSharingBenchmark.commutativeLongOperationShairing     1024  thrpt    2   2906.731          ops/ms
>> VectorCommutativeOperSharingBenchmark.commutativeShortOperationShairing    1024  thrpt    2  17148.131          ops/ms
>> 
>> Sierra Forest (E-core Xeon Server)
>> Baseline:
>> Benchmark                                                                (size)   Mode  Cnt     Score   Error   Units
>> VectorCommutativeOperSharingBenchmark.commutativeByteOperationShairing     1024  thrpt    2  2444.359          ops/ms
>> VectorCommutativeOperSharingBenchmark.commutativeIntOperationShairing      1024  thrpt    2  1710.256          ops/ms
>> VectorCommutativeOperSharingBenchmark.commutativeLongOperationShairing     1024  thrpt    2   308.766          ops/ms
>> VectorCommutativeOperSharingBenc...
>
> Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision:
> 
>   strict assertion check on commutative operation input count

Looks promising, thanks for the work!

src/hotspot/share/opto/phaseX.cpp line 148:

> 146:         if (k->in(0) == n->in(0) &&
> 147:             (k->in(1) == n->in(1) || k->in(1) == n->in(2)) &&
> 148:             (k->in(2) == n->in(1) || k->in(2) == n->in(2))) {

Would be good to have IR tests that verify the commoning of all permutations.
And that wrong patterns do not common: `add(x,y)` and `add(x,x)`.

src/hotspot/share/opto/phaseX.cpp line 157:

> 155:           if( n->in(i)!=k->in(i)) // Different inputs?
> 156:             goto collision;       // "goto" is a speed hack...
> 157:       }

I'm also not a fan of the `goto` hacks. Looks like bad design. I would vote for creating helper functions and calling those instead...

test/hotspot/jtreg/compiler/vectorapi/VectorCommutativeOperSharingTest.java line 139:

> 137:         }
> 138:     }
> 139: }

As mentioned above, it would be good to have all combinations of 2 ops, with inputs x and y:

add(x,x)  with add(x,x)
add(y,x)  with add(x,x)
add(x,y)  with add(x,x)
add(y,y)  with add(x,x)
add(x,x)  with add(y,x)
add(y,x)  with add(y,x)
add(x,y)  with add(y,x)
add(y,y)  with add(y,x)
add(x,x)  with add(x,y)
add(y,x)  with add(x,y)
add(x,y)  with add(x,y)
add(y,y)  with add(x,y)
add(x,x)  with add(y,y)
add(y,x)  with add(y,y)
add(x,y)  with add(y,y)
add(y,y)  with add(y,y)

At least do this for one operator. All may be a bit much to write... Templates would really be fantastic here... coming soon.

-------------

Changes requested by epeter (Reviewer).

PR Review: https://git.openjdk.org/jdk/pull/22863#pullrequestreview-2537058290
PR Review Comment: https://git.openjdk.org/jdk/pull/22863#discussion_r1907139452
PR Review Comment: https://git.openjdk.org/jdk/pull/22863#discussion_r1907141030
PR Review Comment: https://git.openjdk.org/jdk/pull/22863#discussion_r1907149667