RFR: 8322768: Optimize non-subword vector compress and expand APIs for AVX2 target. [v2]

Emanuel Peter epeter at openjdk.org
Tue Jan 9 14:16:27 UTC 2024


On Tue, 9 Jan 2024 06:13:44 GMT, Jatin Bhateja <jbhateja at openjdk.org> wrote:

>> Yes, if it is vectorized, then there is no difference between high and low density. My concern was rather whether vectorization is preferable to the scalar alternative in the low-density case, where branch prediction is more stable.
>
> At runtime we need to scan the entire mask to pick the compressible lane corresponding to each set mask bit. The loop overhead of the mask compare (masks are held in a vector register on AVX2 targets) and the jump is therefore incurred in any case. In addition, for a sparsely populated mask we may incur an extra misprediction penalty for not taking the if block, which extracts an element from the appropriate source vector lane and inserts it into the destination vector lane. Overall, the vector solution will win for the most common cases with varying masks, and also for very sparsely populated masks. Here is the result of setting just a single mask bit:
> 
> 
>     @Benchmark
>     public void fuzzyFilterIntColumn() {
>         int i = 0;
>         int j = 0;
>         long maskctr = 1;
>         int endIndex = ispecies.loopBound(size);
>         for (; i < endIndex; i += ispecies.length()) {
>             IntVector vec = IntVector.fromArray(ispecies, intinCol, i);
>             // Mask with only lane 0 set (a single mask bit).
>             VectorMask<Integer> pred = VectorMask.fromLong(ispecies, 1);
>             vec.compress(pred).intoArray(intoutCol, j);
>             j += pred.trueCount();
>         }
>     }
> 
> 
> Baseline:
> Benchmark                                   (size)   Mode  Cnt     Score  Error   Units
> ColumnFilterBenchmark.fuzzyFilterIntColumn    1024  thrpt    2   379.059         ops/ms
> ColumnFilterBenchmark.fuzzyFilterIntColumn    2047  thrpt    2   188.355         ops/ms
> ColumnFilterBenchmark.fuzzyFilterIntColumn    4096  thrpt    2    95.315         ops/ms
>
>
> With optimization:
> Benchmark                                   (size)   Mode  Cnt     Score  Error   Units
> ColumnFilterBenchmark.fuzzyFilterIntColumn    1024  thrpt    2  7390.074         ops/ms
> ColumnFilterBenchmark.fuzzyFilterIntColumn    2047  thrpt    2  3483.247         ops/ms
> ColumnFilterBenchmark.fuzzyFilterIntColumn    4096  thrpt    2  1823.817         ops/ms

Nice, thanks for the data!
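
For readers following along, the per-lane scalar work described in the quoted text (scan the mask, and for each set bit copy the source lane into the next free destination slot) can be sketched in plain Java roughly as below. This is only an illustration of the semantics under discussion, not the HotSpot baseline or the code in this PR. The class name ScalarCompressSketch and the method compressLanes are hypothetical, the mask is assumed to fit in a long (at most 64 lanes), and the sketch models only the packed elements, not the zeroing of the remaining lanes.

```java
// Illustrative scalar reference for a compress operation; NOT the HotSpot
// implementation. Class and method names are hypothetical.
final class ScalarCompressSketch {

    /**
     * Copies src[srcOffset + lane] to consecutive slots of dst, starting at
     * dstOffset, for every lane whose bit is set in maskBits.
     * Assumes lanes <= 64. Returns the number of elements written,
     * which equals the number of set mask bits among the first 'lanes' bits.
     */
    static int compressLanes(int[] src, int srcOffset, long maskBits,
                             int lanes, int[] dst, int dstOffset) {
        int written = 0;
        for (int lane = 0; lane < lanes; lane++) {
            // Data-dependent branch: taken once per set mask bit.
            if ((maskBits & (1L << lane)) != 0) {
                dst[dstOffset + written] = src[srcOffset + lane];
                written++;
            }
        }
        return written;
    }

    public static void main(String[] args) {
        int[] src = {10, 20, 30, 40, 50, 60, 70, 80};
        int[] dst = new int[src.length];
        // Single set bit (lane 0), mirroring the sparse mask in the benchmark.
        int n = compressLanes(src, 0, 1L, 8, dst, 0);
        System.out.println(n + " element(s) written, first = " + dst[0]);
    }
}
```

The loop runs over every lane regardless of how many bits are set, which is the fixed scan overhead mentioned in the quoted text; only the body of the if depends on mask density.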

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/17261#discussion_r1446138902


More information about the hotspot-compiler-dev mailing list