RFR: 8322768: Optimize non-subword vector compress and expand APIs for AVX2 target. [v3]

Tue Jan 9 07:42:20 UTC 2024

On Mon, 8 Jan 2024 10:20:33 GMT, Quan Anh Mai <qamai at openjdk.org> wrote:

>>> Thanks for the updates!
>>> 
>>> One more idea: Your AVX2 solution has a lot of cost for converting the mask to a permutation. Might it make sense to split this off into a separate vector-node, so that it can float out of a loop if the mask is invariant?
>> 
>> CompressV / ExpandV only accept two inputs: the vector to be operated on and the mask under which the operation is performed. The permute-table-based implementation is specific to the x86 backend.
>
> @jatin-bhateja I think you can expand them in the matcher into several `MachNode`s that will get scheduled separately.

> Exactly, like @merykitty suggests: you can do a platform-dependent expansion.

Hi @merykitty, @eme64, in principle platform-specific lowering is a good idea wherever it is useful. Our main concern here is identifying a loop-invariant constant mask in the matcher patterns so that we can save the cost of reloading from the permute table on every iteration. Existing loop-invariant analysis already moves invariant masks out of the loop, and GCM should be able to move the expanded load from the permute table out of the loop as well.

But this looks very restrictive and will mainly be useful for a constant one-hot mask pattern. A constant mask may have more than one set bit, and in that case we would need to generate multiple loads from the permute tables and handle multiple expansion scenarios. I think we can defer that complexity for the time being.
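
To make the mask-to-permutation split concrete, here is a scalar Java model of the permute-table idea discussed in this thread. It is only a sketch and not the HotSpot stub code: the class name `PermuteTableCompress`, the table layout (one row per 8-bit mask, with -1 marking a lane that is zeroed) and all helper names are assumptions made for illustration. The point it shows is that `permutationFor(mask)` depends only on the mask, so it can sit outside a loop when the mask is invariant, while `applyPermutation` is the per-iteration work.

```java
import java.util.Arrays;

public class PermuteTableCompress {
    // Assumed lane count: eight 32-bit lanes, as in a 256-bit AVX2 vector.
    static final int LANES = 8;

    // Hypothetical table layout: row m lists the source lanes that a compress
    // under mask m keeps, packed to the front; unused slots hold -1.
    static final int[][] PERMUTE_TABLE = buildTable();

    static int[][] buildTable() {
        int[][] table = new int[1 << LANES][LANES];
        for (int mask = 0; mask < table.length; mask++) {
            Arrays.fill(table[mask], -1);
            int out = 0;
            for (int lane = 0; lane < LANES; lane++) {
                if ((mask & (1 << lane)) != 0) {
                    table[mask][out++] = lane;
                }
            }
        }
        return table;
    }

    // Mask-only step: a single table lookup. With a loop-invariant mask this
    // is the part that could be hoisted out of the loop.
    static int[] permutationFor(int mask) {
        return PERMUTE_TABLE[mask & ((1 << LANES) - 1)];
    }

    // Per-vector step: gather the selected lanes and zero the remainder.
    static int[] applyPermutation(int[] src, int[] perm) {
        int[] dst = new int[LANES];
        for (int i = 0; i < LANES; i++) {
            dst[i] = perm[i] < 0 ? 0 : src[perm[i]];
        }
        return dst;
    }

    // Unsplit form: lookup and permute together, as a single node would do.
    static int[] compress(int[] src, int mask) {
        return applyPermutation(src, permutationFor(mask));
    }

    public static void main(String[] args) {
        int mask = 0b10100110;             // lanes 1, 2, 5 and 7 selected
        int[] perm = permutationFor(mask); // computed once, outside the loop
        int[][] inputs = {
            {10, 11, 12, 13, 14, 15, 16, 17},
            {20, 21, 22, 23, 24, 25, 26, 27},
        };
        for (int[] v : inputs) {
            System.out.println(Arrays.toString(applyPermutation(v, perm)));
        }
    }
}
```

In this model the table row is one load regardless of how many bits the mask has set; the sketch says nothing about how the matcher would recognise or expand constant masks.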

-------------

PR Comment: https://git.openjdk.org/jdk/pull/17261#issuecomment-1882549544