RFR: 8322768: Optimize non-subword vector compress and expand APIs for AVX2 target. [v3]

Fri Jan 5 10:05:22 UTC 2024

On Fri, 5 Jan 2024 07:08:35 GMT, Jatin Bhateja <jbhateja at openjdk.org> wrote:

>> Hi,
>> 
>> Patch optimizes non-subword vector compress and expand APIs for x86 AVX2 only targets.
>> Upcoming E-core Xeons (Sierra Forest) and Hybrid CPUs only support AVX2 instruction set.
>> These are very frequently used APIs in columnar database filter operation.
>> 
>> Implementation uses a lookup table to record permute indices. Table index is computed using
>> mask argument of compress/expand operation.
>> 
>> Following are the performance number of JMH micro included with the patch.
>> 
>> 
>> System : Intel(R) Xeon(R) Platinum 8480+ (Sapphire Rapids)
>> 
>> Baseline:
>> Benchmark                                 (size)   Mode  Cnt    Score   Error   Units
>> ColumnFilterBenchmark.filterDoubleColumn    1024  thrpt    2  142.767          ops/ms
>> ColumnFilterBenchmark.filterDoubleColumn    2047  thrpt    2   71.436          ops/ms
>> ColumnFilterBenchmark.filterDoubleColumn    4096  thrpt    2   35.992          ops/ms
>> ColumnFilterBenchmark.filterFloatColumn     1024  thrpt    2  182.151          ops/ms
>> ColumnFilterBenchmark.filterFloatColumn     2047  thrpt    2   91.096          ops/ms
>> ColumnFilterBenchmark.filterFloatColumn     4096  thrpt    2   44.757          ops/ms
>> ColumnFilterBenchmark.filterIntColumn       1024  thrpt    2  184.099          ops/ms
>> ColumnFilterBenchmark.filterIntColumn       2047  thrpt    2   91.981          ops/ms
>> ColumnFilterBenchmark.filterIntColumn       4096  thrpt    2   45.170          ops/ms
>> ColumnFilterBenchmark.filterLongColumn      1024  thrpt    2  148.017          ops/ms
>> ColumnFilterBenchmark.filterLongColumn      2047  thrpt    2   73.516          ops/ms
>> ColumnFilterBenchmark.filterLongColumn      4096  thrpt    2   36.844          ops/ms
>> 
>> Withopt:
>> Benchmark                                 (size)   Mode  Cnt     Score   Error   Units
>> ColumnFilterBenchmark.filterDoubleColumn    1024  thrpt    2  2051.707          ops/ms
>> ColumnFilterBenchmark.filterDoubleColumn    2047  thrpt    2   914.072          ops/ms
>> ColumnFilterBenchmark.filterDoubleColumn    4096  thrpt    2   489.898          ops/ms
>> ColumnFilterBenchmark.filterFloatColumn     1024  thrpt    2  5324.195          ops/ms
>> ColumnFilterBenchmark.filterFloatColumn     2047  thrpt    2  2587.229          ops/ms
>> ColumnFilterBenchmark.filterFloatColumn     4096  thrpt    2  1278.665          ops/ms
>> ColumnFilterBenchmark.filterIntColumn       1024  thrpt    2  4149.384          ops/ms
>> ColumnFilterBenchmark.filterIntColumn       2047  thrpt  ...
>
> Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision:
> 
>   Review comments resolutions.

Thanks for the updates!

One more idea: Your AVX2 solution has a lot of cost for converting the mask to a permutation. Might it make sense to split this off into a separate vector-node, so that it can float out of a loop if the mask is invariant?

src/hotspot/cpu/x86/stubGenerator_x86_64.cpp line 963:

> 961:     // or a -1 (default) value.
> 962:     for (int i = 0; i < 256; i++) {
> 963:       int tmp = i;

why is `tmp` needed? Would it not be better to replace `i` with `mask` (i.e. the bit pattern that is then translated to a permutation)?

src/hotspot/cpu/x86/stubGenerator_x86_64.cpp line 966:

> 964:       int ctr = 0;
> 965:       for (int j = 0; j < 8; j++) {
> 966:         if (tmp & (1 << j)) {

Suggestion:

        if (mask & (1 << j)) {

would be much more readable

-------------

Changes requested by epeter (Reviewer).

PR Review: https://git.openjdk.org/jdk/pull/17261#pullrequestreview-1805616736
PR Review Comment: https://git.openjdk.org/jdk/pull/17261#discussion_r1442664755
PR Review Comment: https://git.openjdk.org/jdk/pull/17261#discussion_r1442668939