RFR: 8354473: Incorrect results for compress/expand tests with -XX:+EnableX86ECoreOpts

Sat May 3 07:45:45 UTC 2025

On Thu, 1 May 2025 06:17:44 GMT, Jatin Bhateja <jbhateja at openjdk.org> wrote:

>>> @vpaprotsk Can you please give a little more details about what exactly went wrong here, and why your change is correct?
>> 
>> @eme64 Thanks for looking. Point form in attempt to be concise:
>> - Jatin brought this to my attention, we weren't sure whose code was at fault (i.e. I wrote the blend emulation, he wrote the compress_expand) and I got to the investigation first (i.e. see https://bugs.openjdk.org/browse/JDK-8354473)
>> - The mask for vblendvps instruction; actual instruction only cares about the MSB but for emulation we must have the mask to be either `FFF..FF` or `000..00`. In many places blend is used, this is already the case, so no need to recompute the mask. That's why the flag is provided (i.e. optimization). 
>> - (Without fully understanding the entirety of compress_expand), it appears to me that in this function the mask in `permv` _must_ be computed explicitly. That's why the flag is changed.
>
>> > @vpaprotsk Can you please give a little more details about what exactly went wrong here, and why your change is correct?
>> 
>> @eme64 Thanks for looking. Point form in attempt to be concise:
>> 
>> * Jatin brought this to my attention, we weren't sure whose code was at fault (i.e. I wrote the blend emulation, he wrote the compress_expand) and I got to the investigation first (i.e. see https://bugs.openjdk.org/browse/JDK-8354473)
>> * The mask for vblendvps instruction; actual instruction only cares about the MSB but for emulation we must have the mask to be either `FFF..FF` or `000..00`. In many places blend is used, this is already the case, so no need to recompute the mask. That's why the flag is provided (i.e. optimization).
>> * (Without fully understanding the entirety of compress_expand), it appears to me that in this function the mask in `permv` _must_ be computed explicitly. That's why the flag is changed.
> 
> Hi @vpaprotsk , @eme64,
> 
> Just to fill in the missing details about compress/expand handling on AVX2, we maintain an in-memory lookup table of permutation indices corresponding to a mask value. Each row of lookup table either holds a valid permute index, which is a positive index value less than the vector lane count OR a -1 index.
> 
> Since blend emulation always expects to operate over a blend mask vector whose lanes either hold a -1 or a 0 value hence there is a need to re-compose the desired blend mask by signed extending the MSB bits to fill the entire lane. Your fix to recompute the mask looks good to me. 
> 
> 
> Best Regards,
> Jatin

> > @jatin-bhateja It seems the flag `-XX:+EnableX86ECoreOpts` only is enabled on some very specific machines. How important / wide spread are these machines? Will they become more wide spread over time? Or is this rather rare, and not worth investing too many resources? How does their importance compare to AVX and AVX2, or machines with only SSE2 or SSE4.1? Because we put a focus on SSE/AVX in internal testing, but I'm wondering if we should also test `EnableX86ECoreOpts` more. How does this flag interact with AVX features? Do ECore machines always have AVX2 for example? What would be good flag combinations here?
> 
> Testing with EnableX86ECoreOpts would be good, these machines have AVX2.

>>>  How important / wide spread are these machines? Will they become more wide spread over time? Or is this rather rare, and not worth investing too many resources?

E-core Xeons, Sapphire Rapids is widely deployed by all major CSPs,  -XX:+EnableX86ECoreOpts enables certain micro-architectural optimization for these systems and JIT code may be different than using -XX:UseAVX=2 on regular P-core Xeons (AVX512 family) targets.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/24645#issuecomment-2848489508