RFR: 8320794: Emulate rest of vblendvp[sd] on ECore [v2]

Thu Mar 28 00:07:32 UTC 2024

On Tue, 19 Mar 2024 15:02:32 GMT, Volodymyr Paprotski <duke at openjdk.org> wrote:

>> Replace vblendvp[sd] with macro assembler call and test in:
>> - `C2_MacroAssembler::vector_cast_float_to_int_special_cases_avx` (insufficient registers for 1 of 2 blends)
>> - `C2_MacroAssembler::vector_cast_double_to_int_special_cases_avx`
>> - `C2_MacroAssembler::vector_count_leading_zeros_int_avx`
>> 
>> Functional testing with existing and new tests:
>> `make test TEST="test/hotspot/jtreg/compiler/vectorapi/reshape test/hotspot/jtreg/compiler/vectorization/runner/BasicIntOpTest.java"`
>> 
>> Benchmarking with existing and new tests:
>> 
>> make test TEST="micro:org.openjdk.bench.jdk.incubator.vector.VectorFPtoIntCastOperations.microFloat256ToInteger256"
>> make test TEST="micro:org.openjdk.bench.jdk.incubator.vector.VectorFPtoIntCastOperations.microDouble256ToInteger256"
>> make test TEST="micro:org.openjdk.bench.vm.compiler.VectorBitCount.WithSuperword.intLeadingZeroCount"
>> 
>> 
>> Performance before:
>> 
>> Benchmark                                               (SIZE)   Mode  Cnt      Score     Error   Units
>> VectorFPtoIntCastOperations.microDouble256ToInteger256     512  thrpt    5  17271.078 ± 184.140  ops/ms
>> VectorFPtoIntCastOperations.microDouble256ToInteger256    1024  thrpt    5   9310.507 ±  88.136  ops/ms
>> VectorFPtoIntCastOperations.microFloat256ToInteger256      512  thrpt    5  11137.594 ±  19.009  ops/ms
>> VectorFPtoIntCastOperations.microFloat256ToInteger256     1024  thrpt    5   5425.001 ±   3.136  ops/ms
>> VectorBitCount.WithSuperword.intLeadingZeroCount    1024       0  thrpt    4  0.994 ± 0.002  ops/us
>> 
>> 
>> Performance after:
>> 
>> Benchmark                                               (SIZE)   Mode  Cnt      Score     Error   Units
>> VectorFPtoIntCastOperations.microDouble256ToInteger256     512  thrpt    5  19222.048 ±  87.622  ops/ms
>> VectorFPtoIntCastOperations.microDouble256ToInteger256    1024  thrpt    5   9233.245 ± 123.493  ops/ms
>> VectorFPtoIntCastOperations.microFloat256ToInteger256      512  thrpt    5  11672.806 ±  10.854  ops/ms
>> VectorFPtoIntCastOperations.microFloat256ToInteger256     1024  thrpt    5   6009.735 ±  12.173  ops/ms
>> VectorBitCount.WithSuperword.intLeadingZeroCount    1024       0  thrpt    4  1.039 ± 0.004  ops/us
>
> Volodymyr Paprotski has updated the pull request incrementally with one additional commit since the last revision:
> 
>   Fix double pasted test

src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 4871:

> 4869:   vpxor(xtmp3, xtmp2, xtmp4, vec_enc);
> 4870: 
> 4871:   vblendvps(dst, dst, xtmp1, xtmp3, vec_enc, true, xtmp4);

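The vblendvps at line 4861 could also be emulated. Its mask comes from vcmpps, which already sets every lane to all ones or all zeros, so compute_mask can be false.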
From:
  vpxor(xtmp4, xtmp4, xtmp4, vec_enc);
  vcmpps(xtmp3, src, src, Assembler::UNORD_Q, vec_enc);
  vblendvps(dst, dst, xtmp4, xtmp3, vec_enc);

To:
  vpxor(xtmp4, xtmp4, xtmp4, vec_enc);
  vcmpps(xtmp3, src, src, Assembler::UNORD_Q, vec_enc);
  vblendvps(dst, dst, xtmp4, xtmp3, vec_enc, false, xtmp4);
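
For reference, a scalar sketch of why compute_mask = false should be safe for a compare-generated mask. It models a single 32-bit lane and assumes the emulation is the usual and/andn/or select, with the sign bit spread across the lane when compute_mask is set. The function names below are illustrative only and are not HotSpot APIs.

```cpp
#include <cstdint>

// One 32-bit lane of a vector register, modelled as a plain integer.
// All names here are hypothetical helpers for illustration.

// Hardware-style blendv: only the sign bit of the mask lane selects src2.
constexpr std::uint32_t blendv_lane(std::uint32_t src1, std::uint32_t src2,
                                    std::uint32_t mask) {
  return (mask & 0x80000000u) ? src2 : src1;
}

// Models spreading the sign bit over the whole lane (the compute_mask step).
constexpr std::uint32_t spread_sign(std::uint32_t mask) {
  return (mask & 0x80000000u) ? 0xFFFFFFFFu : 0u;
}

// Bitwise select: (mask & src2) | (~mask & src1).
constexpr std::uint32_t select_bits(std::uint32_t src1, std::uint32_t src2,
                                    std::uint32_t mask) {
  return (mask & src2) | (~mask & src1);
}

// Assumed shape of the emulated blend for one lane.
constexpr std::uint32_t emulated_blend_lane(std::uint32_t src1, std::uint32_t src2,
                                            std::uint32_t mask, bool compute_mask) {
  return select_bits(src1, src2, compute_mask ? spread_sign(mask) : mask);
}
```

With an all-ones or all-zeros mask lane (what a packed compare produces), the emulated form agrees with blendv_lane even when compute_mask is false. A mask that only has the sign bit set needs compute_mask = true to match. In the sequence above, src2 is the zeroed xtmp4, so lanes where src is NaN end up as zero.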

src/hotspot/cpu/x86/macroAssembler_x86.cpp line 3524:

> 3522:   bool blend_emulation = EnableX86ECoreOpts && UseAVX > 1;
> 3523:   bool scratch_available = scratch != xnoreg && scratch != src1 && scratch != src2 && scratch != dst;
> 3524:   bool dst_available = (dst != mask || compute_mask) && (dst != src1 || dst != src2);

There are two paths here:
Path 1: When compute_mask == true
  scratch_available = (scratch != xnoreg) && (scratch != src1) && (scratch != src2) && (scratch != dst);
  dst_available = (dst != mask) && (dst != src1 || dst != src2);
Path 2: When compute_mask == false
  scratch_available = (scratch != xnoreg) && (scratch != dst);
  dst_available = (dst != mask) && (dst != src1 || dst != src2);
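
A small standalone sketch that may help when comparing these conditions against specific call sites. It copies the two predicates exactly as quoted at lines 3523-3524 and adds the Path 2 form of scratch_available listed above. Registers are modelled as plain integers; BlendRegs, kNoReg and the function names are hypothetical stand-ins, not HotSpot types.

```cpp
// Registers are modelled as small integers; kNoReg stands in for xnoreg.
// Everything here is an illustrative assumption, not HotSpot code.
constexpr int kNoReg = -1;

struct BlendRegs {
  int dst, src1, src2, mask, scratch;
};

// scratch_available as quoted at line 3523.
constexpr bool scratch_available(const BlendRegs& r) {
  return r.scratch != kNoReg && r.scratch != r.src1 &&
         r.scratch != r.src2 && r.scratch != r.dst;
}

// scratch_available in the Path 2 form (compute_mask == false).
constexpr bool scratch_available_path2(const BlendRegs& r) {
  return r.scratch != kNoReg && r.scratch != r.dst;
}

// dst_available as quoted at line 3524.
constexpr bool dst_available(const BlendRegs& r, bool compute_mask) {
  return (r.dst != r.mask || compute_mask) &&
         (r.dst != r.src1 || r.dst != r.src2);
}
```

Taking dst = 0, xtmp1 = 1, xtmp3 = 3, xtmp4 = 4: the call at line 4871 is {0, 0, 1, 3, 4} and passes both quoted predicates. The suggested call vblendvps(dst, dst, xtmp4, xtmp3, vec_enc, false, xtmp4) is {0, 0, 4, 3, 4}, where scratch equals src2. It fails scratch_available as quoted but passes the Path 2 form.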

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/18310#discussion_r1542167094
PR Review Comment: https://git.openjdk.org/jdk/pull/18310#discussion_r1542164547