RFR: 8302652: [SuperWord] Reduction should happen after loop, when possible

Fri May 5 07:39:30 UTC 2023

On Wed, 22 Mar 2023 22:17:29 GMT, Sandhya Viswanathan <sviswanathan at openjdk.org> wrote:

>> @jatin-bhateja @sviswa7
>> What do you think about the performance numbers I measured? Do they make sense to you?
>> 
>> A few questions:
>>  - `long min/max`: Why do we require `avx512vlbwdq` in `Matcher::match_rule_supported_vector`? Would `avx512f` not be sufficient? `C2_MacroAssembler::reduceL` leads me to `vextracti64x4` (that should only require `avx512f`) and `reduce_operation_256` (where `vpminsq` only requires `avx2` via assert).
>>  - What do you think about the `double min/max` performance? What do you think could be the reason it is not similar to the behavior of `float min/max`?
>
> @eme64 For long min/max, `Math.min(long, long)` is currently not intrinsified; only the int/float/double variants are. Without scalar intrinsification of `Math.min(long, long)`, no scalar `MinL` node is generated, and so there is no vectorization and no reduction.

> @sviswa7 thanks for your quick response!
> 
> I can confirm: we do not "intrinsify" (i.e. turn into `MinL/MaxL`), rather we just inline the `java.lang.Math::min/max` methods, implemented with `CmpL` / `If`-branching. Do you think this makes sense, or should we intrinsify, at least when the hardware supports it?

@eme64 We should intrinsify MinL/MaxL when the hardware supports it.
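
For illustration, here is a minimal sketch of the kind of `long` min reduction loop under discussion. The class and method names are hypothetical and not taken from the PR. The first method uses `Math.min(long, long)`, and the second spells out the equivalent compare-and-branch form.

```java
public class LongMinReduction {
    // Reduction written with Math.min(long, long). Per the discussion above,
    // this call is currently inlined as CmpL / If branching rather than being
    // turned into a MinL node.
    static long minWithMathMin(long[] a) {
        long m = Long.MAX_VALUE;
        for (int i = 0; i < a.length; i++) {
            m = Math.min(m, a[i]);
        }
        return m;
    }

    // The same reduction written as an explicit compare-and-branch,
    // which computes the same result as the method above.
    static long minWithBranch(long[] a) {
        long m = Long.MAX_VALUE;
        for (int i = 0; i < a.length; i++) {
            if (a[i] < m) {
                m = a[i];
            }
        }
        return m;
    }

    public static void main(String[] args) {
        long[] data = {7L, -3L, 42L, 0L};
        System.out.println(minWithMathMin(data));
        System.out.println(minWithBranch(data));
    }
}
```

Both methods return the same value for any input; the sketch only shows the source-level shape of the loop and makes no claim about the code C2 generates for it.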

-------------

PR Comment: https://git.openjdk.org/jdk/pull/13056#issuecomment-1481977784