RFR: 8342393: Promote commutative vector IR node sharing [v8]

Thu Jan 16 06:46:36 UTC 2025

On Wed, 15 Jan 2025 04:13:23 GMT, Jatin Bhateja <jbhateja at openjdk.org> wrote:

>> Patch promotes the sharing of commutative vector IR with the same inputs but different input ordering.
>> Unlike scalar IR where we perform edge swapping by [sorting inputs](https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/addnode.cpp#L122) based on node indices during IR idealization, for vector IR we chose a simpler approach to decorate commutative operations with a special node-level flag during IR construction thus
>> obviating any dependency on explicit idealization routines. This flag is later used during GVN hashing to enable node sharing.  
>> 
>> Following are the performance stats for JMH micro included with the patch.
>> 
>> 
>> Granite Rapids (P-core Xeon Server)
>> Baseline : 
>> Benchmark                                                                (size)   Mode  Cnt      Score   Error   Units
>> VectorCommutativeOperSharingBenchmark.commutativeByteOperationShairing     1024  thrpt    2   8982.549          ops/ms
>> VectorCommutativeOperSharingBenchmark.commutativeIntOperationShairing      1024  thrpt    2   6072.773          ops/ms
>> VectorCommutativeOperSharingBenchmark.commutativeLongOperationShairing     1024  thrpt    2   2368.856          ops/ms
>> VectorCommutativeOperSharingBenchmark.commutativeShortOperationShairing    1024  thrpt    2  15215.087          ops/ms
>> 
>> Withopt:
>> Benchmark                                                                (size)   Mode  Cnt      Score   Error   Units
>> VectorCommutativeOperSharingBenchmark.commutativeByteOperationShairing     1024  thrpt    2  11963.554          ops/ms
>> VectorCommutativeOperSharingBenchmark.commutativeIntOperationShairing      1024  thrpt    2   7036.088          ops/ms
>> VectorCommutativeOperSharingBenchmark.commutativeLongOperationShairing     1024  thrpt    2   2906.731          ops/ms
>> VectorCommutativeOperSharingBenchmark.commutativeShortOperationShairing    1024  thrpt    2  17148.131          ops/ms
>> 
>> Sierra Forest (E-core Xeon Server)
>> Baseline:
>> Benchmark                                                                (size)   Mode  Cnt     Score   Error   Units
>> VectorCommutativeOperSharingBenchmark.commutativeByteOperationShairing     1024  thrpt    2  2444.359          ops/ms
>> VectorCommutativeOperSharingBenchmark.commutativeIntOperationShairing      1024  thrpt    2  1710.256          ops/ms
>> VectorCommutativeOperSharingBenchmark.commutativeLongOperationShairing     1024  thrpt    2   308.766          ops/ms
>> VectorCommutativeOperSharingBenc...
>
> Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision:
> 
>   Extension covering masked vector operations

test/hotspot/jtreg/compiler/vectorapi/VectorCommutativeOperSharingTest.java line 418:

> 416:         // predicated (vec1 + vec2) + (vec2 + vec1)
> 417:         vec1.lanewise(VectorOperators.ADD, vec2, vmask)
> 418:             .lanewise(VectorOperators.ADD, vec2.lanewise(VectorOperators.ADD, vec1, vmask))

Thanks for your updating @jatin-bhateja!  

But I don't think it's right here. Do you mean `vec1.lanewise(VectorOperators.ADD, vec2, vmake)` equals to `vec2.lanewise(VectorOperators.ADD, vec1, vmask)` ?  The inactive lane values are not equal, right?  Should it keep the lane value of first input for inactive lanes? Thanks!

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/22863#discussion_r1917828758