RFR: 8322768: Optimize non-subword vector compress and expand APIs for AVX2 target. [v2]
Emanuel Peter
epeter at openjdk.org
Thu Jan 4 13:45:26 UTC 2024
On Thu, 4 Jan 2024 05:39:01 GMT, Jatin Bhateja <jbhateja at openjdk.org> wrote:
>> Hi,
>>
>> Patch optimizes non-subword vector compress and expand APIs for x86 AVX2 only targets.
>> Upcoming E-core Xeons (Sierra Forest) and Hybrid CPUs only support AVX2 instruction set.
>> These are very frequently used APIs in columnar database filter operation.
>>
>> Implementation uses a lookup table to record permute indices. Table index is computed using
>> mask argument of compress/expand operation.
>>
>> Following are the performance number of JMH micro included with the patch.
>>
>>
>> System : Intel(R) Xeon(R) Platinum 8480+ (Sapphire Rapids)
>>
>> Baseline:
>> Benchmark (size) Mode Cnt Score Error Units
>> ColumnFilterBenchmark.filterDoubleColumn 1024 thrpt 2 142.767 ops/ms
>> ColumnFilterBenchmark.filterDoubleColumn 2047 thrpt 2 71.436 ops/ms
>> ColumnFilterBenchmark.filterDoubleColumn 4096 thrpt 2 35.992 ops/ms
>> ColumnFilterBenchmark.filterFloatColumn 1024 thrpt 2 182.151 ops/ms
>> ColumnFilterBenchmark.filterFloatColumn 2047 thrpt 2 91.096 ops/ms
>> ColumnFilterBenchmark.filterFloatColumn 4096 thrpt 2 44.757 ops/ms
>> ColumnFilterBenchmark.filterIntColumn 1024 thrpt 2 184.099 ops/ms
>> ColumnFilterBenchmark.filterIntColumn 2047 thrpt 2 91.981 ops/ms
>> ColumnFilterBenchmark.filterIntColumn 4096 thrpt 2 45.170 ops/ms
>> ColumnFilterBenchmark.filterLongColumn 1024 thrpt 2 148.017 ops/ms
>> ColumnFilterBenchmark.filterLongColumn 2047 thrpt 2 73.516 ops/ms
>> ColumnFilterBenchmark.filterLongColumn 4096 thrpt 2 36.844 ops/ms
>>
>> Withopt:
>> Benchmark (size) Mode Cnt Score Error Units
>> ColumnFilterBenchmark.filterDoubleColumn 1024 thrpt 2 2051.707 ops/ms
>> ColumnFilterBenchmark.filterDoubleColumn 2047 thrpt 2 914.072 ops/ms
>> ColumnFilterBenchmark.filterDoubleColumn 4096 thrpt 2 489.898 ops/ms
>> ColumnFilterBenchmark.filterFloatColumn 1024 thrpt 2 5324.195 ops/ms
>> ColumnFilterBenchmark.filterFloatColumn 2047 thrpt 2 2587.229 ops/ms
>> ColumnFilterBenchmark.filterFloatColumn 4096 thrpt 2 1278.665 ops/ms
>> ColumnFilterBenchmark.filterIntColumn 1024 thrpt 2 4149.384 ops/ms
>> ColumnFilterBenchmark.filterIntColumn 2047 thrpt ...
>
> Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision:
>
> Updating copyright year of modified files.
@jatin-bhateja this looks like a great improvement!
I have a few questions and requests below.
FYI, this feels very inspiring. I'm dreaming of a day where we could do this filtering in the auto-vectorizer directly.
src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 5303:
> 5301: // Blend the results with zero vector using permute vector as mask, its
> 5302: // non-participating lanes holds a -1 value.
> 5303: vblendvps(dst, dst, xtmp, permv, vec_enc);
would you mind adding a few more comments to explain what happens here? I would also really appreciate more expressive register/variable names.
I think you are basically converting the `mask` to a permutation `permv`, by a lookup in the table.
Then you permute the `src` and blend it with a -1 vector, so that the unused (high) lanes are -1.
xtmp -> min_one
rtmp -> table_index
rscratch -> table_adr
src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 5307:
> 5305: assert(bt == T_LONG || bt == T_DOUBLE, "");
> 5306: vmovmskpd(rtmp, mask, vec_enc);
> 5307: shlq(rtmp, 5);
Might this need to be 6? If I understand right, then you want to have a 64bit stride, hence 2^6, right?
If that is correct, then this did not show in your tests, and you need a regression test anyway.
src/hotspot/cpu/x86/c2_MacroAssembler_x86.hpp line 488:
> 486: KRegister ktmp1, int vec_enc);
> 487:
> 488:
Remove useless empty line
src/hotspot/cpu/x86/stubGenerator_x86_64.cpp line 957:
> 955: __ align(CodeEntryAlignment);
> 956: StubCodeMark mark(this, "StubRoutines", stub_name);
> 957: address start = __ pc();
Could you please add some comments here why you are filling the data like this?
Presumably, you are emitting 32 bits and 64 bits respectively, right? So the cells have different size, correct?
test/micro/org/openjdk/bench/jdk/incubator/vector/ColumnFilterBenchmark.java line 76:
> 74: longinCol = new long[size];
> 75: longoutCol = new long[size];
> 76: lpivot = size / 2;
I'd be interested to see what happens if you move up or down the "density" of elements that you accept. Would the simple branch prediction be faster if the density is low enough, i.e. we almost take no element.
Though maybe that is not compiler problem but a user-problem?
test/micro/org/openjdk/bench/jdk/incubator/vector/ColumnFilterBenchmark.java line 94:
> 92: IntVector vec = IntVector.fromArray(ispecies, intinCol, i);
> 93: VectorMask<Integer> pred = vec.compare(VectorOperators.GT, ipivot);
> 94: vec.compress(pred).intoArray(intoutCol, j);
Could there be equivalent `expand` tests?
-------------
Changes requested by epeter (Reviewer).
PR Review: https://git.openjdk.org/jdk/pull/17261#pullrequestreview-1804121213
PR Review Comment: https://git.openjdk.org/jdk/pull/17261#discussion_r1441749005
PR Review Comment: https://git.openjdk.org/jdk/pull/17261#discussion_r1441761312
PR Review Comment: https://git.openjdk.org/jdk/pull/17261#discussion_r1441724949
PR Review Comment: https://git.openjdk.org/jdk/pull/17261#discussion_r1441759984
PR Review Comment: https://git.openjdk.org/jdk/pull/17261#discussion_r1441753158
PR Review Comment: https://git.openjdk.org/jdk/pull/17261#discussion_r1441729256
More information about the hotspot-compiler-dev
mailing list