RFR: 8322768: Optimize non-subword vector compress and expand APIs for AVX2 target. [v2]

Thu Jan 4 13:45:26 UTC 2024

On Thu, 4 Jan 2024 05:39:01 GMT, Jatin Bhateja <jbhateja at openjdk.org> wrote:

>> Hi,
>> 
>> Patch optimizes non-subword vector compress and expand APIs for x86 AVX2 only targets.
>> Upcoming E-core Xeons (Sierra Forest) and Hybrid CPUs only support AVX2 instruction set.
>> These are very frequently used APIs in columnar database filter operation.
>> 
>> Implementation uses a lookup table to record permute indices. Table index is computed using
>> mask argument of compress/expand operation.
>> 
>> Following are the performance number of JMH micro included with the patch.
>> 
>> 
>> System : Intel(R) Xeon(R) Platinum 8480+ (Sapphire Rapids)
>> 
>> Baseline:
>> Benchmark                                 (size)   Mode  Cnt    Score   Error   Units
>> ColumnFilterBenchmark.filterDoubleColumn    1024  thrpt    2  142.767          ops/ms
>> ColumnFilterBenchmark.filterDoubleColumn    2047  thrpt    2   71.436          ops/ms
>> ColumnFilterBenchmark.filterDoubleColumn    4096  thrpt    2   35.992          ops/ms
>> ColumnFilterBenchmark.filterFloatColumn     1024  thrpt    2  182.151          ops/ms
>> ColumnFilterBenchmark.filterFloatColumn     2047  thrpt    2   91.096          ops/ms
>> ColumnFilterBenchmark.filterFloatColumn     4096  thrpt    2   44.757          ops/ms
>> ColumnFilterBenchmark.filterIntColumn       1024  thrpt    2  184.099          ops/ms
>> ColumnFilterBenchmark.filterIntColumn       2047  thrpt    2   91.981          ops/ms
>> ColumnFilterBenchmark.filterIntColumn       4096  thrpt    2   45.170          ops/ms
>> ColumnFilterBenchmark.filterLongColumn      1024  thrpt    2  148.017          ops/ms
>> ColumnFilterBenchmark.filterLongColumn      2047  thrpt    2   73.516          ops/ms
>> ColumnFilterBenchmark.filterLongColumn      4096  thrpt    2   36.844          ops/ms
>> 
>> Withopt:
>> Benchmark                                 (size)   Mode  Cnt     Score   Error   Units
>> ColumnFilterBenchmark.filterDoubleColumn    1024  thrpt    2  2051.707          ops/ms
>> ColumnFilterBenchmark.filterDoubleColumn    2047  thrpt    2   914.072          ops/ms
>> ColumnFilterBenchmark.filterDoubleColumn    4096  thrpt    2   489.898          ops/ms
>> ColumnFilterBenchmark.filterFloatColumn     1024  thrpt    2  5324.195          ops/ms
>> ColumnFilterBenchmark.filterFloatColumn     2047  thrpt    2  2587.229          ops/ms
>> ColumnFilterBenchmark.filterFloatColumn     4096  thrpt    2  1278.665          ops/ms
>> ColumnFilterBenchmark.filterIntColumn       1024  thrpt    2  4149.384          ops/ms
>> ColumnFilterBenchmark.filterIntColumn       2047  thrpt  ...
>
> Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision:
> 
>   Updating copyright year of modified files.

@jatin-bhateja this looks like a great improvement!
I have a few questions and requests below.

FYI, this feels very inspiring. I'm dreaming of a day where we could do this filtering in the auto-vectorizer directly.

src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 5303:

> 5301:     // Blend the results with zero vector using permute vector as mask, its
> 5302:     // non-participating lanes holds a -1 value.
> 5303:     vblendvps(dst, dst, xtmp, permv, vec_enc);

Would you mind adding a few more comments to explain what happens here? I would also really appreciate more expressive register/variable names.

I think you are basically converting the `mask` to a permutation `permv`, by a lookup in the table.
Then you permute the `src` and blend it with a -1 vector, so that the unused (high) lanes are -1.

xtmp -> min_one
rtmp -> table_index
rscratch -> table_adr
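
To make my reading concrete, here is a minimal scalar sketch of the table-lookup idea. It is not the patch's code: the class name, the 4-lane width, and the table layout are assumptions for illustration, and it follows the quoted source comment in zeroing the lanes whose permute index is -1.

```java
import java.util.Arrays;

// Hypothetical scalar model of compress via a permute lookup table.
final class CompressTableSketch {
    // Assumed lane count, e.g. 4 x 64-bit lanes in a 256-bit vector.
    static final int LANES = 4;

    // One row per mask value. Row m lists the source lanes selected by m,
    // packed to the front; -1 marks the unused trailing lanes.
    static final int[][] PERM_TABLE = buildTable();

    static int[][] buildTable() {
        int[][] table = new int[1 << LANES][LANES];
        for (int mask = 0; mask < table.length; mask++) {
            Arrays.fill(table[mask], -1);
            int out = 0;
            for (int lane = 0; lane < LANES; lane++) {
                if ((mask & (1 << lane)) != 0) {
                    table[mask][out++] = lane;
                }
            }
        }
        return table;
    }

    // Permute src by the table row, then zero every lane whose index is -1.
    static long[] compress(long[] src, int mask) {
        int[] perm = PERM_TABLE[mask];
        long[] dst = new long[LANES];
        for (int lane = 0; lane < LANES; lane++) {
            dst[lane] = perm[lane] < 0 ? 0L : src[perm[lane]];
        }
        return dst;
    }
}
```

With this model, the mask plays the role of `table_index`, the selected row plays the role of `permv`, and the final conditional stands in for the blend.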

src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 5307:

> 5305:     assert(bt == T_LONG || bt == T_DOUBLE, "");
> 5306:     vmovmskpd(rtmp, mask, vec_enc);
> 5307:     shlq(rtmp, 5);

Might this need to be 6? If I understand correctly, you want a 64-bit stride, hence 2^6, right?
If that is correct, then your tests did not catch it, and you need a regression test anyway.
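
The arithmetic that would settle this can be sketched as follows. The shift is log2 of the row size in bytes, so the answer depends on the table row layout, which I am only assuming here; the class and method names are hypothetical and not from the patch.

```java
// Hypothetical helper: derives the shift used to turn a mask value
// into a byte offset into a lookup table with power-of-two row size.
final class TableStrideSketch {
    // Shift amount = log2(row size in bytes).
    static int shiftFor(int indicesPerRow, int bytesPerIndex) {
        int rowBytes = indicesPerRow * bytesPerIndex;
        if (Integer.bitCount(rowBytes) != 1) {
            throw new IllegalArgumentException("row size must be a power of two");
        }
        return Integer.numberOfTrailingZeros(rowBytes);
    }

    // Byte offset of the row selected by the given mask value.
    static int rowOffset(int mask, int indicesPerRow, int bytesPerIndex) {
        return mask << shiftFor(indicesPerRow, bytesPerIndex);
    }
}
```

Under the assumption of four 8-byte indices per row (or eight 4-byte indices), a row is 32 bytes and the shift is 5; a shift of 6 would correspond to a 64-byte row.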

src/hotspot/cpu/x86/c2_MacroAssembler_x86.hpp line 488:

> 486:                           KRegister ktmp1, int vec_enc);
> 487: 
> 488: 

Remove the redundant empty line.

src/hotspot/cpu/x86/stubGenerator_x86_64.cpp line 957:

> 955:   __ align(CodeEntryAlignment);
> 956:   StubCodeMark mark(this, "StubRoutines", stub_name);
> 957:   address start = __ pc();

Could you please add some comments here explaining why you fill the data like this?
Presumably you are emitting 32-bit and 64-bit entries respectively, right? So the cells have different sizes, correct?

test/micro/org/openjdk/bench/jdk/incubator/vector/ColumnFilterBenchmark.java line 76:

> 74:         longinCol = new long[size];
> 75:         longoutCol = new long[size];
> 76:         lpivot = size / 2;

I'd be interested to see what happens if you raise or lower the "density" of elements that you accept. Would a simple scalar loop relying on branch prediction be faster if the density is low enough, i.e. if we take almost no elements?

Though maybe that is not a compiler problem but a user problem?
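
A minimal sketch of what I have in mind, assuming the input values are uniformly distributed in a fixed range so that the pivot directly controls the accepted fraction. All names here are hypothetical and not part of the benchmark in the patch; the scalar loop is the branchy baseline one would compare against.

```java
import java.util.Random;

// Hypothetical helpers for a density-parameterized filter benchmark.
final class DensitySketch {
    // Assumed value range: inputs are drawn from [0, RANGE).
    static final int RANGE = 1000;

    // Pivot such that about `density` of uniform values in [0, RANGE)
    // satisfy `value > pivot`.
    static int pivotFor(double density) {
        return RANGE - 1 - (int) Math.round(density * RANGE);
    }

    // Uniformly distributed input column with a fixed seed.
    static int[] randomColumn(int size, long seed) {
        Random rnd = new Random(seed);
        int[] col = new int[size];
        for (int i = 0; i < size; i++) {
            col[i] = rnd.nextInt(RANGE);
        }
        return col;
    }

    // Branchy scalar filter: copies accepted values to the front of `out`
    // and returns how many were accepted.
    static int filterScalar(int[] in, int[] out, int pivot) {
        int j = 0;
        for (int v : in) {
            if (v > pivot) {
                out[j++] = v;
            }
        }
        return j;
    }
}
```

The density could then become an extra benchmark parameter next to `size`, with the pivot derived from it in the setup method.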

test/micro/org/openjdk/bench/jdk/incubator/vector/ColumnFilterBenchmark.java line 94:

> 92:            IntVector vec = IntVector.fromArray(ispecies, intinCol, i);
> 93:            VectorMask<Integer> pred = vec.compare(VectorOperators.GT, ipivot);
> 94:            vec.compress(pred).intoArray(intoutCol, j);

Could there be equivalent `expand` tests?
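
For such a test, a scalar reference to verify against could look like the sketch below. It models the lane semantics as I understand them from the Vector API documentation (expand places the leading source lanes into the set mask positions and zeroes the rest; compress is the inverse packing); the class and method names are hypothetical.

```java
// Hypothetical scalar reference for one vector's worth of lanes.
final class ExpandReferenceSketch {
    // Source lanes 0, 1, 2, ... go, in order, to the lanes where mask is set.
    // Lanes where mask is unset become zero.
    static int[] expand(int[] src, boolean[] mask) {
        int[] dst = new int[src.length];
        int next = 0;
        for (int lane = 0; lane < src.length; lane++) {
            if (mask[lane]) {
                dst[lane] = src[next++];
            }
        }
        return dst;
    }

    // Lanes where mask is set are packed to the front; the tail is zero.
    static int[] compress(int[] src, boolean[] mask) {
        int[] dst = new int[src.length];
        int next = 0;
        for (int lane = 0; lane < src.length; lane++) {
            if (mask[lane]) {
                dst[next++] = src[lane];
            }
        }
        return dst;
    }
}
```

A round trip is a handy property to check: compressing an expanded vector with the same mask gives back the leading lanes of the original input.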

-------------

Changes requested by epeter (Reviewer).

PR Review: https://git.openjdk.org/jdk/pull/17261#pullrequestreview-1804121213
PR Review Comment: https://git.openjdk.org/jdk/pull/17261#discussion_r1441749005
PR Review Comment: https://git.openjdk.org/jdk/pull/17261#discussion_r1441761312
PR Review Comment: https://git.openjdk.org/jdk/pull/17261#discussion_r1441724949
PR Review Comment: https://git.openjdk.org/jdk/pull/17261#discussion_r1441759984
PR Review Comment: https://git.openjdk.org/jdk/pull/17261#discussion_r1441753158
PR Review Comment: https://git.openjdk.org/jdk/pull/17261#discussion_r1441729256