RFR: 8280510: AArch64: Vectorize operations with loop induction variable

Mon Mar 14 01:45:40 UTC 2022

On Wed, 16 Feb 2022 08:26:14 GMT, Pengfei Li <pli at openjdk.org> wrote:

> AArch64 has an SVE instruction that populates an SVE vector register
> with incrementing indices. With it, we can vectorize some loop
> operations that take the induction variable as an operand, such as the
> one below.
> 
>   for (int i = 0; i < count; i++) {
>     b[i] = a[i] * i;
>   }
> 
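
For illustration, here is the loop above unrolled by hand by a factor of 4. The factor is an arbitrary choice (C2 picks its own), and the class and method names are made up. Each unrolled copy reads the induction variable plus a constant offset (i, i + 1, i + 2, i + 3), which is the run of incrementing indices that a single index vector can provide. This is a source-level sketch of the idea, not the transformation C2 performs internally.

```java
// Hypothetical sketch: the loop b[i] = a[i] * i, plus a hand-unrolled copy.
// Unroll factor 4 is an arbitrary illustration, not C2's choice.
class UnrolledMul {
    static void scalar(int[] a, int[] b, int count) {
        for (int i = 0; i < count; i++) {
            b[i] = a[i] * i;
        }
    }

    static void unrolledBy4(int[] a, int[] b, int count) {
        int i = 0;
        // Main body: four copies per iteration whose iv operands are
        // i + 0, i + 1, i + 2, i + 3, i.e. incrementing indices.
        for (; i + 3 < count; i += 4) {
            b[i]     = a[i]     * i;
            b[i + 1] = a[i + 1] * (i + 1);
            b[i + 2] = a[i + 2] * (i + 2);
            b[i + 3] = a[i + 3] * (i + 3);
        }
        // Post-loop for the remaining 0..3 iterations.
        for (; i < count; i++) {
            b[i] = a[i] * i;
        }
    }

    public static void main(String[] args) {
        int[] a = {3, 1, 4, 1, 5, 9, 2};
        int[] b1 = new int[a.length];
        int[] b2 = new int[a.length];
        scalar(a, b1, a.length);
        unrolledBy4(a, b2, a.length);
        System.out.println(java.util.Arrays.equals(b1, b2)); // prints "true"
    }
}
```

Running `main` prints `true`: both versions compute the same result, and only the unrolled body exposes the four iv-based operands side by side.
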
> This patch enables the vectorization of operations on the loop
> induction variable by extending the current scope of C2 superword
> vectorizable packs. Before this patch, every scalar input node in a
> vectorizable pack had to be an out-of-loop invariant. This patch also
> takes the induction variable input into consideration: it allows the
> input to be the iv phi node or the phi plus its index offset, and
> creates a `PopulateIndexNode` to generate a vector filled with
> incrementing indices. On AArch64 SVE, the final generated code for the
> loop above looks like this.
> 
>   add     x12, x16, x10
>   add     x12, x12, #0x10
>   ld1w    {z16.s}, p7/z, [x12]
>   index   z17.s, w1, #1
>   mul     z17.s, p7/m, z17.s, z16.s
>   add     x10, x17, x10
>   add     x10, x10, #0x10
>   st1w    {z17.s}, p7, [x10]
> 
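
A rough Java model of one vectorized iteration, assuming 512-bit SVE (16 int lanes) and that all lanes of the governing predicate p7 are active. Per the Arm SVE documentation, `index zd.s, wn, #imm` sets lane k of zd to wn + k * imm; `populateIndex` mimics that, and `vectorStep` mimics the ld1w / index / mul / st1w sequence, with w1 assumed to hold the current value of i. All names here are hypothetical and do not correspond to HotSpot code.

```java
// Hypothetical names throughout; LANES = 16 assumes 512-bit SVE with
// 32-bit int lanes (512 / 32 = 16) and an all-true predicate.
import java.util.Arrays;

class SveIndexSketch {
    static final int LANES = 16;

    // Models "index zd.s, wn, #imm": lane k receives base + k * step.
    static int[] populateIndex(int base, int step) {
        int[] v = new int[LANES];
        for (int k = 0; k < LANES; k++) {
            v[k] = base + k * step;
        }
        return v;
    }

    // Models one iteration of the generated body for b[i] = a[i] * i:
    // ld1w loads 16 lanes of a, index builds [i, i+1, ..., i+15],
    // mul multiplies lane-wise, st1w stores 16 lanes into b.
    // Requires i + LANES <= a.length (no partial-vector tail handling).
    static void vectorStep(int[] a, int[] b, int i) {
        int[] va = Arrays.copyOfRange(a, i, i + LANES);
        int[] vi = populateIndex(i, 1);
        int[] vr = new int[LANES];
        for (int k = 0; k < LANES; k++) {
            vr[k] = va[k] * vi[k];
        }
        System.arraycopy(vr, 0, b, i, LANES);
    }

    public static void main(String[] args) {
        int[] a = new int[32];
        for (int k = 0; k < a.length; k++) {
            a[k] = k + 1;
        }
        int[] b = new int[a.length];
        for (int i = 0; i + LANES <= a.length; i += LANES) {
            vectorStep(a, b, i);
        }
        boolean ok = true;
        for (int k = 0; k < a.length; k++) {
            ok &= b[k] == a[k] * k;
        }
        System.out.println(ok); // prints "true"
    }
}
```

The point of the model is that the index vector replaces 16 separate scalar iv operands with a single instruction whose only inputs are the base (i) and the step (1).
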
> Since AArch64 NEON and other platforms such as x86 have no
> index-populating instruction, a function named
> `is_populate_index_supported()` is added to the VectorNode class for
> the backend support check.
> 
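
A hypothetical model of the decision this check gates, written in Java for brevity (the real check is C++ in HotSpot, and every name below is invented except the one being imitated). It assumes a pack qualifies only when lane k's scalar input is exactly iv + k, matching the `#1` step in the SVE code above, and only when the backend reports support.

```java
// Hypothetical model, not HotSpot code: decides whether a pack's scalar
// iv-based inputs can be replaced by one populate-index vector.
class PopulateIndexCheck {
    enum Backend { SVE, NEON, X86 }

    // Stand-in for VectorNode::is_populate_index_supported(); per the
    // patch description, only SVE has an index-populating instruction.
    static boolean isPopulateIndexSupported(Backend backend) {
        return backend == Backend.SVE;
    }

    // laneOffsets[k] is the constant c such that lane k's scalar input is
    // (iv + c); null means that input is not derived from the iv.
    // Assumption: only a step of 1 (lane k reads iv + k) is accepted.
    static boolean canUsePopulateIndex(Integer[] laneOffsets, Backend backend) {
        if (!isPopulateIndexSupported(backend)) {
            return false; // keep the scalar code on this backend
        }
        for (int k = 0; k < laneOffsets.length; k++) {
            // Lane 0 must be the iv phi itself; lane k must be iv + k.
            if (laneOffsets[k] == null || laneOffsets[k] != k) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        Integer[] unrolled = {0, 1, 2, 3}; // lanes read iv, iv+1, iv+2, iv+3
        System.out.println(canUsePopulateIndex(unrolled, Backend.SVE));  // prints "true"
        System.out.println(canUsePopulateIndex(unrolled, Backend.NEON)); // prints "false"
    }
}
```

Keeping the capability query separate from the shape check mirrors the split described above: the mid-end recognizes the pattern, and the backend decides whether it can emit it.
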
> The jtreg suites hotspot::hotspot_all_no_apps, jdk::tier1~3 and
> langtools::tier1 were run and no issues were found. Hotspot jtreg
> already has tests in `compiler/c2/cr7192963/Test*Vect.java` covering
> this kind of use case, so no new jtreg test is added in this patch. A
> new JMH benchmark is added and was run on a 512-bit SVE machine. The
> results below show that performance improves significantly in some
> cases.
> 
>   Benchmark                           Speedup
>   IndexVector.exprWithIndex1            ~7.7x
>   IndexVector.exprWithIndex2           ~13.3x
>   IndexVector.indexArrayFill            ~5.7x
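
A hypothetical self-check, in the spirit of (but not copied from) the `compiler/c2/cr7192963` tests: run the loop enough times that C2 is likely to compile it, then compare each element with a reference computed via long arithmetic. The class name, warm-up count and array size are arbitrary, and whether the loop is actually vectorized depends on the JVM build and CPU.

```java
// Hypothetical self-check; not one of the existing jtreg tests.
class IndexLoopCheck {
    static void kernel(int[] a, int[] b) {
        for (int i = 0; i < a.length; i++) {
            b[i] = a[i] * i;
        }
    }

    // Returns the number of mismatching elements (0 means success).
    static int run(int size, int iterations) {
        int[] a = new int[size];
        int[] b = new int[size];
        for (int i = 0; i < size; i++) {
            a[i] = i * 31 - 7;
        }
        // Repeated calls so the JIT is likely to compile kernel().
        for (int n = 0; n < iterations; n++) {
            kernel(a, b);
        }
        int errors = 0;
        for (int i = 0; i < size; i++) {
            // int multiply keeps the low 32 bits, same as narrowing the long product.
            int expected = (int) ((long) a[i] * i);
            if (b[i] != expected) {
                errors++;
            }
        }
        return errors;
    }

    public static void main(String[] args) {
        System.out.println(run(1000, 20_000)); // prints 0 on a correct JVM
    }
}
```

Computing the reference with a differently shaped expression makes it less likely that a miscompiled kernel and the reference agree by accident.
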

Could anyone help review this? The optimization currently targets AArch64, but the changes are mainly in the C2 mid-end.

-------------

PR: https://git.openjdk.java.net/jdk/pull/7491