[vectorIntrinsics+compress] RFR: 8276083: Incremental patch to further optimize new compress/expand APIs over X86
Jatin Bhateja
jbhateja at openjdk.java.net
Thu Oct 28 16:17:32 UTC 2021
On Thu, 28 Oct 2021 15:53:35 GMT, Paul Sandoz <psandoz at openjdk.org> wrote:
>> Summary of changes:
>> 1) Added both scalar and vector variants of JMH performance tests for the Vector.compress/expand and VectorMask.compress APIs (a representative micro is sketched after this list).
>>
>> 2) Improved performance of operations where the mask length is less than 4. Mask loading is a two-stage process: first the boolean array is loaded into a vector, which is then transferred either to a predicate register or to a vector whose size matches that of the underlying SPECIES. A mask shorter than 4 lanes would require a vector load of less than 32 bits, so operations that depend on such small masks are now handled in the Java-side implementation of these (and some other) APIs. Because both the special-handling condition and the fallback path leading to the C2 intrinsic call are guarded by constant expressions, one of the control paths is optimized out. This also avoids the performance penalty of a failed lazy inline expansion, which most often occurs for unsupported vector sizes: if lazy inline expansion fails, C2 emits a direct call instruction to the callee method, and we lose any opportunity for procedure inlining at that point. A separate [issue](https://bugs.openjdk.java.net/browse/JDK-8276085) has been created to address this problem.
>>
>> 3) Improved the performance of VectorMask.compress on legacy non-AVX512 targets by adding the missing checks to the C2Compiler::is_intrinsics_supported routine, so that procedure inlining happens early during parsing if the target does not support direct compress/expand instructions.
>>
>> 4) Inline-expanded the VectorMask.intoArray operation to trigger the boxing/unboxing elimination optimization. This significantly improved the performance of VectorMask.compress in the newly added JMH micros.
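>>
>> As an illustration of the micros mentioned in (1): the following is only a rough sketch, not the code added by this patch (the class and method names are made up, and it assumes the preferred int species), but a minimal JMH benchmark exercising these APIs could look roughly like this:
>>
>>     import jdk.incubator.vector.*;
>>     import org.openjdk.jmh.annotations.*;
>>
>>     @State(Scope.Thread)
>>     public class MaskCompressBench {
>>         static final VectorSpecies<Integer> SPECIES = IntVector.SPECIES_PREFERRED;
>>         boolean[] mask;
>>         int[] src;
>>
>>         @Setup
>>         public void setup() {
>>             mask = new boolean[SPECIES.length()];
>>             src  = new int[SPECIES.length()];
>>             for (int i = 0; i < mask.length; i++) {
>>                 mask[i] = (i & 1) == 0;   // alternating true/false lanes
>>                 src[i]  = i;
>>             }
>>         }
>>
>>         @Benchmark
>>         public VectorMask<Integer> maskCompress() {
>>             // Compress the set lanes of the mask towards lane 0.
>>             return VectorMask.fromArray(SPECIES, mask, 0).compress();
>>         }
>>
>>         @Benchmark
>>         public IntVector vectorCompress() {
>>             // Compress the selected elements of the vector towards lane 0.
>>             VectorMask<Integer> m = VectorMask.fromArray(SPECIES, mask, 0);
>>             return IntVector.fromArray(SPECIES, src, 0).compress(m);
>>         }
>>     }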
>>
>> Following is the performance data for the included JMH micros:
>> System Configuration: Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz (Icelake Server 40C 2S)
>>
>> A) VectorMask.compress: (performance table not preserved in this archive)
>>
>> B) Vector.compress: (performance table not preserved in this archive)
>>
>> C) Vector.expand: (performance table not preserved in this archive)
>>
>> The patch has been regression tested with tier3 tests at various AVX levels (0/1/2/3/KNL).
>>
>> Kindly review and share your feedback.
>>
>> Best Regards,
>> Jatin
>
> src/jdk.incubator.vector/share/classes/jdk/incubator/vector/X-VectorBits.java.template line 936:
>
>> 934: @ForceInline
>> 935: public $masktype$ compress() {
>> 936: if (VLENGTH < 4) {
>
> This impacts the code of *every* vector across *every* architecture. I realize it's dead code in many cases, but we need to find a better way to express and manage this. IMHO this is not a maintainable solution.
>
> The fallback would be an obvious place for this logic.
We can push this into the fallback path, but in that case C2 will not be able to inline the fallback logic when the lazy intrinsification attempt fails. I collected performance data with and without this change and could clearly see a benefit (around 1.5x) from keeping the check outside the fallback. Since the conditions are guarded by constant expressions, only one of the paths will be JIT'ed.
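To make the trade-off concrete, here is a rough sketch of the two shapes being discussed (the helper names are illustrative only, not the actual template code):

    // (a) Check kept in the @ForceInline caller: VLENGTH is a per-species
    //     constant, so C2 folds the branch and compiles only one of the two
    //     paths; the small-mask logic stays inlined even when the intrinsic
    //     is not expanded.
    @ForceInline
    public $masktype$ compress() {
        if (VLENGTH < 4) {
            return compressSmallMaskInJava();                 // hypothetical helper
        }
        return compressIntrinsic(() -> compressFallback());  // hypothetical helpers
    }

    // (b) Check pushed into the fallback: if the lazy intrinsification attempt
    //     fails, C2 emits a direct call to the fallback method, so the constant
    //     check can no longer be folded into the caller.
    @ForceInline
    public $masktype$ compress() {
        return compressIntrinsic(() ->
            (VLENGTH < 4) ? compressSmallMaskInJava() : compressFallback());
    }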
I think we do support 2-byte vector loads on AArch64, but the minimum vector size on X86 is currently 4 bytes.
I agree with you: given that this is an optimization on a slow path, it is okay to give up some of the gain to keep consistency across architectures.
-------------
PR: https://git.openjdk.java.net/panama-vector/pull/157