RFR: 8322768: Optimize non-subword vector compress and expand APIs for AVX2 target. [v2]

Emanuel Peter epeter at openjdk.org
Mon Jan 8 10:36:25 UTC 2024


On Mon, 8 Jan 2024 06:06:20 GMT, Jatin Bhateja <jbhateja at openjdk.org> wrote:

>> You are using `VectorMask<Integer> pred = VectorMask.fromLong(ispecies, maskctr++);`.
>> That basically systematically iterates over all masks, which is nice for a correctness test.
>> But that would use different densities within one test run, right? The average over the whole loop is still at `50%`, correct?
>> 
>> I was thinking more a run where the percentage over the whole loop is lower than maybe `1%`. That would get us to a point where maybe the branch prediction of non-vectorized code might be faster, what do you think?
>
> An imperative loop for compression will check each mask bit to select the compressible lanes. Therefore, masks with a low or high density of set bits should show similar performance.

Yes, IF it is vectorized, then there is no difference between high and low density. My concern was more whether vectorization is preferable to the scalar alternative in the low-density case, where branch prediction is more stable.
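
For concreteness, this is roughly the kind of comparison I have in mind (only a sketch, not code from the PR; the class and method names, the `DENSITY` constant, and the array sizes are made up for illustration):

    // Sketch only: illustrates a sparse-mask compress, scalar vs. Vector API.
    // Requires a JDK with the incubator module: --add-modules jdk.incubator.vector
    import jdk.incubator.vector.IntVector;
    import jdk.incubator.vector.VectorMask;
    import jdk.incubator.vector.VectorSpecies;
    import java.util.Random;

    public class CompressDensitySketch {
        static final VectorSpecies<Integer> SPECIES = IntVector.SPECIES_256;
        static final int SIZE = 1024;
        static final double DENSITY = 0.01; // ~1% of elements selected

        // Scalar baseline: one branch per element. At ~1% density the branch
        // is almost always not taken, so prediction is very stable.
        static int compressScalar(int[] src, boolean[] pred, int[] dst) {
            int k = 0;
            for (int i = 0; i < src.length; i++) {
                if (pred[i]) {
                    dst[k++] = src[i];
                }
            }
            return k;
        }

        // Vector API version: branch-free per vector; the per-iteration cost
        // is the same whether the mask is mostly empty or mostly full.
        static int compressVector(int[] src, boolean[] pred, int[] dst) {
            int k = 0;
            int i = 0;
            for (; i < SPECIES.loopBound(src.length); i += SPECIES.length()) {
                VectorMask<Integer> m = VectorMask.fromArray(SPECIES, pred, i);
                IntVector v = IntVector.fromArray(SPECIES, src, i);
                // compress() packs selected lanes to the front; the full vector
                // (selected lanes plus zero padding) is stored, and later
                // iterations overwrite the padding past k.
                v.compress(m).intoArray(dst, k);
                k += m.trueCount();
            }
            for (; i < src.length; i++) { // scalar tail
                if (pred[i]) dst[k++] = src[i];
            }
            return k;
        }

        public static void main(String[] args) {
            Random rnd = new Random(42);
            int[] src = new int[SIZE];
            boolean[] pred = new boolean[SIZE];
            int[] dst = new int[SIZE];
            for (int i = 0; i < SIZE; i++) {
                src[i] = i;
                pred[i] = rnd.nextDouble() < DENSITY; // sparse mask
            }
            System.out.println("scalar kept: " + compressScalar(src, pred, dst));
            System.out.println("vector kept: " + compressVector(src, pred, dst));
        }
    }

Timing those two loops (e.g. under JMH) at a few densities, say 1%, 10%, and 50%, would show whether the scalar loop ever wins on AVX2 when the mask is very sparse.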

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/17261#discussion_r1444257535

