RFR: 8322768: Optimize non-subword vector compress and expand APIs for AVX2 target. [v7]
Sandhya Viswanathan
sviswanathan at openjdk.org
Sat Jan 20 01:18:29 UTC 2024
On Fri, 19 Jan 2024 19:03:31 GMT, Jatin Bhateja <jbhateja at openjdk.org> wrote:
>> Hi,
>>
>> This patch optimizes the non-subword vector compress and expand APIs for x86 AVX2-only targets.
>> Upcoming E-core Xeons (Sierra Forest) and hybrid CPUs support only the AVX2 instruction set.
>> These APIs are used very frequently in columnar database filter operations.
>>
>> The implementation uses a lookup table to record permute indices. The table index is computed from
>> the mask argument of the compress/expand operation.
>>
>> The following are the performance numbers for the JMH microbenchmark included with the patch.
>>
>>
>> System : Intel(R) Xeon(R) Platinum 8480+ (Sapphire Rapids)
>>
>> Baseline:
>> Benchmark                                 (size)   Mode  Cnt     Score  Error  Units
>> ColumnFilterBenchmark.filterDoubleColumn    1024  thrpt    2   142.767         ops/ms
>> ColumnFilterBenchmark.filterDoubleColumn    2047  thrpt    2    71.436         ops/ms
>> ColumnFilterBenchmark.filterDoubleColumn    4096  thrpt    2    35.992         ops/ms
>> ColumnFilterBenchmark.filterFloatColumn     1024  thrpt    2   182.151         ops/ms
>> ColumnFilterBenchmark.filterFloatColumn     2047  thrpt    2    91.096         ops/ms
>> ColumnFilterBenchmark.filterFloatColumn     4096  thrpt    2    44.757         ops/ms
>> ColumnFilterBenchmark.filterIntColumn       1024  thrpt    2   184.099         ops/ms
>> ColumnFilterBenchmark.filterIntColumn       2047  thrpt    2    91.981         ops/ms
>> ColumnFilterBenchmark.filterIntColumn       4096  thrpt    2    45.170         ops/ms
>> ColumnFilterBenchmark.filterLongColumn      1024  thrpt    2   148.017         ops/ms
>> ColumnFilterBenchmark.filterLongColumn      2047  thrpt    2    73.516         ops/ms
>> ColumnFilterBenchmark.filterLongColumn      4096  thrpt    2    36.844         ops/ms
>>
>> With optimization:
>> Benchmark                                 (size)   Mode  Cnt     Score  Error  Units
>> ColumnFilterBenchmark.filterDoubleColumn    1024  thrpt    2  2051.707         ops/ms
>> ColumnFilterBenchmark.filterDoubleColumn    2047  thrpt    2   914.072         ops/ms
>> ColumnFilterBenchmark.filterDoubleColumn    4096  thrpt    2   489.898         ops/ms
>> ColumnFilterBenchmark.filterFloatColumn     1024  thrpt    2  5324.195         ops/ms
>> ColumnFilterBenchmark.filterFloatColumn     2047  thrpt    2  2587.229         ops/ms
>> ColumnFilterBenchmark.filterFloatColumn     4096  thrpt    2  1278.665         ops/ms
>> ColumnFilterBenchmark.filterIntColumn       1024  thrpt    2  4149.384         ops/ms
>> ColumnFilterBenchmark.filterIntColumn       2047  thrpt  ...
>
> Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision:
>
> Modified code comment for clarity.
src/hotspot/cpu/x86/stubGenerator_x86_64.cpp line 985:
> 983: for (int j = 0; j < 4; j++) {
> 984: if (mask & (1 << j)) {
> 985: __ emit_data64(j, relocInfo::none);
This could instead be something like __ emit_data(2*j, relocInfo::none); __ emit_data(2*j+1, relocInfo::none); so that the table holds doubleword permute masks to begin with.
Then we don't need the extra instructions in vector_compress_expand_avx2() that generate doubleword permute masks from the long masks.
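
To make the suggestion concrete, here is a rough standalone C++ sketch of the two table layouts for four 64-bit lanes. It only models the table contents on the host and does not use the MacroAssembler API. The function names (compress_row_qword, compress_row_dword, compress_longs) are hypothetical, and the -1 filler for unused slots is an assumption made for illustration, not something taken from the patch.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

constexpr int kLanes = 4;        // four 64-bit lanes in a 256-bit vector
constexpr int32_t kUnused = -1;  // assumed filler for slots past the selected lanes

// One table row for a 4-bit mask, stored as quadword lane indices
// (the layout produced by emitting j as 64-bit data).
std::array<int64_t, kLanes> compress_row_qword(unsigned mask) {
  assert(mask < (1u << kLanes));
  std::array<int64_t, kLanes> row;
  row.fill(kUnused);
  int out = 0;
  for (int j = 0; j < kLanes; j++) {
    if (mask & (1u << j)) {
      row[out++] = j;
    }
  }
  return row;
}

// The same row stored directly as doubleword indices:
// selected lane j becomes the pair (2*j, 2*j+1).
std::array<int32_t, 2 * kLanes> compress_row_dword(unsigned mask) {
  assert(mask < (1u << kLanes));
  std::array<int32_t, 2 * kLanes> row;
  row.fill(kUnused);
  int out = 0;
  for (int j = 0; j < kLanes; j++) {
    if (mask & (1u << j)) {
      row[out++] = 2 * j;
      row[out++] = 2 * j + 1;
    }
  }
  return row;
}

// Scalar model of a compress driven by the doubleword row.
// Slots marked kUnused are left as zero.
std::array<uint64_t, kLanes> compress_longs(const std::array<uint64_t, kLanes>& src,
                                            unsigned mask) {
  const std::array<int32_t, 2 * kLanes> idx = compress_row_dword(mask);
  std::array<uint64_t, kLanes> dst{};
  for (int k = 0; k < kLanes; k++) {
    if (idx[2 * k] == kUnused) {
      continue;
    }
    // idx[2*k] names the low doubleword of the source lane, so the lane is idx / 2.
    dst[k] = src[idx[2 * k] / 2];
  }
  return dst;
}
```

With the doubleword layout, a row can be fed to a doubleword permute as loaded, which is the point of storing 2*j and 2*j+1 in the table rather than deriving them at run time.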
-------------
PR Review Comment: https://git.openjdk.org/jdk/pull/17261#discussion_r1460113427