RFR: 8285790: AArch64: Merge C2 NEON and SVE matching rules [v7]

Tue Aug 16 09:02:57 UTC 2022

> **MOTIVATION**
> 
> This is a big refactoring patch of merging rules in aarch64_sve.ad and
> aarch64_neon.ad. The motivation can also be found at [1].
> 
> Currently AArch64 has aarch64_sve.ad and aarch64_neon.ad to support SVE
> and NEON codegen respectively. 1) For SVE rules we use vReg operand to
> match VecA for an arbitrary length of vector type, when SVE is enabled;
> 2) For NEON rules we use vecX/vecD operands to match VecX/VecD for
> 128-bit/64-bit vectors, when SVE is not enabled.
> 
> This separation looked clean at the time of introducing SVE support.
> However, there are two main drawbacks now.
> 
> **Drawback-1**: NEON (Advanced SIMD) is the mandatory feature on AArch64 and
> SVE vector registers share the lower 128 bits with NEON registers. For
> some cases, even when SVE is enabled, we still prefer to match NEON
> rules and emit NEON instructions.
> 
> **Drawback-2**: With more and more vector rules added to support VectorAPI,
> there are lots of rules in both two ad files with different predication
> conditions, e.g., different values of UseSVE or vector type/size.
> 
> Examples can be found in [1]. These two drawbacks make the code less
> maintainable and increase the libjvm.so code size.
> 
> **KEY UPDATES**
> 
> In this patch, we mainly do two things, using generic vReg to match all
> NEON/SVE vector registers and merging NEON/SVE matching rules.
> 
> - Update-1: Use generic vReg to match all NEON/SVE vector registers
> 
> Two different approaches were considered, and we prefer to use generic
> vector solution but keep VecA operand for all >128-bit vectors. See the
> last slide in [1]. All the changes lie in the AArch64 backend.
> 
> 1) Some helpers are updated in aarch64.ad to enable generic vector on
> AArch64. See vector_ideal_reg(), pd_specialize_generic_vector_operand(),
> is_reg2reg_move() and is_generic_vector().
> 
> 2) Operand vecA is created to match VecA register, and vReg is updated
> to match VecA/D/X registers dynamically.
> 
> With the introduction of generic vReg, difference in register types
> between NEON rules and SVE rules can be eliminated, which makes it easy
> to merge these rules.
> 
> - Update-2: Try to merge existing rules
> 
> As updated in GensrcAdlc.gmk, new ad file, aarch64_vector.ad, is
> introduced to hold the grouped and merged matching rules.
> 
> 1) Similar rules with difference in vector type/size can be merged into
> new rules, where different types and vector sizes are handled in the
> codegen part, e.g., vadd(). This resolves **Drawback-2**.
> 
> 2) In most cases, we tend to emit NEON instructions for 128-bit vector
> operations on SVE platforms, e.g., vadd(). This resolves **Drawback-1**.
> 
> It's important to note that there are some exceptions.
> 
> Exception-1: For some rules, there are no direct NEON instructions, but
> exists simple SVE implementation due to newly added SVE ISA. Such rules
> include vloadconB, vmulL_neon, vminL_neon, vmaxL_neon,
> reduce_addF_le128b (4F case), reduce_and/or/xor_neon, reduce_minL_neon,
> reduce_maxL_neon, vcvtLtoF_neon, vcvtDtoI_neon, rearrange_HS_neon.
> 
> Exception-2: Vector mask generation and operation rules are different
> because vector mask is stored in different types of registers between
> NEON and SVE, e.g., vmaskcmp_neon and vmask_truecount_neon rules.
> 
> Exception-3: Shift right related rules are different because vector
> shift right instructions differ a bit between NEON and SVE.
> 
> For these exceptions, we emit NEON or SVE code simply based on UseSVE
> options.
> 
> **MINOR UPDATES and CODE REFACTORING**
> 
> Since we've touched all lines of code during merging rules, we further
> do more minor updates and refactoring.
> 
> - Reduce regmask bits
> 
> Stack slot alignment is handled specially for scalable vector, which
> will firstly align to SlotsPerVecA, and then align to the real vector
> length. We should guarantee SlotsPerVecA is no bigger than the real
> vector length. Otherwise, unused stack space would be allocated.
> 
> In AArch64 SVE, the vector length can be 128 to 2048 bits. However,
> SlotsPerVecA is currently set as 8, i.e. 8 * 32 = 256 bits. As a result,
> on a 128-bit SVE platform, the stack slot is aligned to 256 bits,
> leaving the half 128 bits unused. In this patch, we reduce SlotsPerVecA
> from 8 to 4.
> 
> See the updates in register_aarch64.hpp, regmask.hpp and aarch64.ad
> (chunk1 and vectora_reg).
> 
> - Refactor NEON/SVE vector op support check.
> 
> Merge NEON and SVE vector supported check into one single function. To
> be consistent, SVE default size supported check now is relaxed from no
> less than 64 bits to the same condition as NEON's min_vector_size(),
> i.e. 4 bytes and 4/2 booleans are also supported. This should be fine,
> as we assume at least we will emit NEON code for those small vectors,
> with unified rules.
> 
> - Some notes for new rules
> 
> 1) Since new rules are unique and it makes no sense to set different
> "ins_cost", we turn to use the default cost.
> 
> 2) By default we use "pipe_slow" for matching rules in aarch64_vector.ad
> now. Hence, many SIMD pipeline classes at aarch64.ad become unused and
> can be removed.
> 
> 3) Suffixes '_le128b/_gt128b' and '_neon/_sve' are appended in the
> matching rule names if needed.
> a) 'le128b' means the vector length is less than or equal to 128 bits.
> This rule can be matched on both NEON and 128-bit SVE.
> b) 'gt128b' means the vector length is greater than 128 bits. This rule
> can only be matched on SVE.
> c) 'neon' means this rule can only be matched on NEON, i.e. the
> generated instruction is not better than those in 128-bit SVE.
> d) 'sve' means this rule is only matched on SVE for all possible vector
> length, i.e. not limited to gt128b.
> 
> Note-1: m4 file is not introduced because many duplications are highly
> reduced now.
> Note-2: We guess the code review for this big patch would probably take
> some time and we may need to merge latest code from master branch from
> time to time. We prefer to keep aarch64_neon/sve.ad and the
> corresponding m4 files for easy comparison and review. Of course, they
> will be finally removed after some solid reviews before integration.
> Note-3: Several other minor refactorings are done in this patch, but we
> cannot list all of them in the commit message. We have reviewed and
> tested the rules carefully to guarantee the quality.
> 
> **TESTING**
> 
> 1) Cross compilations on arm32/s390/pps/riscv passed.
> 2) tier1~3 jtreg passed on both x64 and aarch64 machines.
> 3) vector tests: all the test cases under the following directories can
> pass on both NEON and SVE systems with max vector length 16/32/64 bytes.
> 
>   "test/hotspot/jtreg/compiler/vectorapi/"
>   "test/jdk/jdk/incubator/vector/"
>   "test/hotspot/jtreg/compiler/vectorization/"
> 
> 4) Performance evaluation: we choose vector micro-benchmarks from
> panama-vector:vectorIntrinsics [2] to evaluate the performance of this
> patch. We've tested *MaxVectorTests.java cases on one 128-bit SVE
> platform and one NEON platform, and didn't see any visiable regression
> with NEON and SVE. We will continue to verify more cases on other
> platforms with NEON and different SVE vector sizes.
> 
> **BENEFITS**
> 
> The number of matching rules is reduced to ~ **42%**.
>   before: 373 (aarch64_neon.ad) + 380 (aarch64_sve.ad) = 753
>   after   : 313 (aarch64_vector.ad)
> 
> Code size for libjvm.so (release build) on aarch64 is reduced to ~ **96%**.
>   before: 25246528 B (commit 7905788e969)
>   after   : 24208776 B (**nearly 1 MB reduction**)
> 
> [1] http://cr.openjdk.java.net/~njian/8285790/JDK-8285790.pdf
> [2] https://github.com/openjdk/panama-vector/tree/vectorIntrinsics/test/micro/org/openjdk/bench/jdk/incubator/vector/operation
> 
> Co-Developed-by: Ningsheng Jian <Ningsheng.Jian at arm.com>
> Co-Developed-by: Eric Liu <eric.c.liu at arm.com>

Hao Sun has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 10 commits:

 - Merge branch 'master' as of 10th-Aug into 8285790-merge-rules

   Resolve the conflict in file c2_MacroAssembler_aarch64.hpp.
 - Merge with jdk/master
 - Remove unused ad and m4 files
 - Fix Jtreg failures on NEON with -XX:MaxVectorSize=8

   The root cause lies in that 128-bit vector MulReduction operations would
   pass the check in match_rule_supported_vector(), i.e. returning true if
   lengh_in_bytes is 16, which should be invalid since MaxVectorSize=8.

   In this patch, we improve the check logic a bit, i.e. failing fast if
   the conditions are not satisfied and falling through to check
   vector_size_supported() eventually.

   Similar update is done to the check for Op_MulAddVS2VI.
 - Improve comments and fix typos

   1. Improve the comment for MulReductionV*.

   2. Fix typos for REDUCE_BITWISE_OP_XX macros.

   3. For the typo in vector mask load/store rules.
 - Merge branch 'master' as of 22nd-Jul into 8285790-merge-rules

   Merge branch "master".
 - Add m4 file

   Add the corresponding M4 file
 - Add VM_Version flag to control NEON instruction generation

   Add VM_Version flag use_neon_for_vector() to control whether to generate
   NEON instructions for 128-bit vector operations.

   Currently only vector length is checked inside and it returns true for
   existing SVE cores. More specific things might be checked in the near
   future, e.g., the basic data type or SVE CPU model.

   Besides, new macro assembler helpers neon_vector_extend/narrow() are
   introduced to make the code clean.

   Note: AddReductionVF/D rules are updated so that SVE instructions are
   generated for 64/128-bit vector operations, because floating point
   reduction add instructions are supported directly in SVE.
 - Merge branch 'master' as of 7th-July into 8285790-merge-rules
 - 8285790: AArch64: Merge C2 NEON and SVE matching rules

   MOTIVATION

   This is a big refactoring patch of merging rules in aarch64_sve.ad and
   aarch64_neon.ad. The motivation can also be found at [1].

   Currently AArch64 has aarch64_sve.ad and aarch64_neon.ad to support SVE
   and NEON codegen respectively. 1) For SVE rules we use vReg operand to
   match VecA for an arbitrary length of vector type, when SVE is enabled;
   2) For NEON rules we use vecX/vecD operands to match VecX/VecD for
   128-bit/64-bit vectors, when SVE is not enabled.

   This separation looked clean at the time of introducing SVE support.
   However, there are two main drawbacks now.

   Drawback-1: NEON (Advanced SIMD) is the mandatory feature on AArch64 and
   SVE vector registers share the lower 128 bits with NEON registers. For
   some cases, even when SVE is enabled, we still prefer to match NEON
   rules and emit NEON instructions.

   Drawback-2: With more and more vector rules added to support VectorAPI,
   there are lots of rules in both two ad files with different predication
   conditions, e.g., different values of UseSVE or vector type/size.

   Examples can be found in [1]. These two drawbacks make the code less
   maintainable and increase the libjvm.so code size.

   KEY UPDATES

   In this patch, we mainly do two things, using generic vReg to match all
   NEON/SVE vector registers and merging NEON/SVE matching rules.

   Update-1: Use generic vReg to match all NEON/SVE vector registers

   Two different approaches were considered, and we prefer to use generic
   vector solution but keep VecA operand for all >128-bit vectors. See the
   last slide in [1]. All the changes lie in the AArch64 backend.

   1) Some helpers are updated in aarch64.ad to enable generic vector on
   AArch64. See vector_ideal_reg(), pd_specialize_generic_vector_operand(),
   is_reg2reg_move() and is_generic_vector().

   2) Operand vecA is created to match VecA register, and vReg is updated
   to match VecA/D/X registers dynamically.

   With the introduction of generic vReg, difference in register types
   between NEON rules and SVE rules can be eliminated, which makes it easy
   to merge these rules.

   Update-2: Try to merge existing rules

   As updated in GensrcAdlc.gmk, new ad file, aarch64_vector.ad, is
   introduced to hold the grouped and merged matching rules.

   1) Similar rules with difference in vector type/size can be merged into
   new rules, where different types and vector sizes are handled in the
   codegen part, e.g., vadd(). This resolves Drawback-2.

   2) In most cases, we tend to emit NEON instructions for 128-bit vector
   operations on SVE platforms, e.g., vadd(). This resolves Drawback-1.

   It's important to note that there are some exceptions.

   Exception-1: For some rules, there are no direct NEON instructions, but
   exists simple SVE implementation due to newly added SVE ISA. Such rules
   include vloadconB, vmulL_neon, vminL_neon, vmaxL_neon,
   reduce_addF_le128b (4F case), reduce_and/or/xor_neon, reduce_minL_neon,
   reduce_maxL_neon, vcvtLtoF_neon, vcvtDtoI_neon, rearrange_HS_neon.

   Exception-2: Vector mask generation and operation rules are different
   because vector mask is stored in different types of registers between
   NEON and SVE, e.g., vmaskcmp_neon and vmask_truecount_neon rules.

   Exception-3: Shift right related rules are different because vector
   shift right instructions differ a bit between NEON and SVE.

   For these exceptions, we emit NEON or SVE code simply based on UseSVE
   options.

   MINOR UPDATES and CODE REFACTORING

   Since we've touched all lines of code during merging rules, we further
   do more minor updates and refactoring.

   1. Reduce regmask bits

   Stack slot alignment is handled specially for scalable vector, which
   will firstly align to SlotsPerVecA, and then align to the real vector
   length. We should guarantee SlotsPerVecA is no bigger than the real
   vector length. Otherwise, unused stack space would be allocated.

   In AArch64 SVE, the vector length can be 128 to 2048 bits. However,
   SlotsPerVecA is currently set as 8, i.e. 8 * 32 = 256 bits. As a result,
   on a 128-bit SVE platform, the stack slot is aligned to 256 bits,
   leaving the half 128 bits unused. In this patch, we reduce SlotsPerVecA
   from 8 to 4.

   See the updates in register_aarch64.hpp, regmask.hpp and aarch64.ad
   (chunk1 and vectora_reg).

   2. Refactor NEON/SVE vector op support check.

   Merge NEON and SVE vector supported check into one single function. To
   be consistent, SVE default size supported check now is relaxed from no
   less than 64 bits to the same condition as NEON's min_vector_size(),
   i.e. 4 bytes and 4/2 booleans are also supported. This should be fine,
   as we assume at least we will emit NEON code for those small vectors,
   with unified rules.

   3. Some notes for new rules

   1) Since new rules are unique and it makes no sense to set different
   "ins_cost", we turn to use the default cost.

   2) By default we use "pipe_slow" for matching rules in aarch64_vector.ad
   now. Hence, many SIMD pipeline classes at aarch64.ad become unused and
   can be removed.

   3) Suffixes '_le128b/_gt128b' and '_neon/_sve' are appended in the
   matching rule names if needed.
   a) 'le128b' means the vector length is less than or equal to 128 bits.
   This rule can be matched on both NEON and 128-bit SVE.
   b) 'gt128b' means the vector length is greater than 128 bits. This rule
   can only be matched on SVE.
   c) 'neon' means this rule can only be matched on NEON, i.e. the
   generated instruction is not better than those in 128-bit SVE.
   d) 'sve' means this rule is only matched on SVE for all possible vector
   length, i.e. not limited to gt128b.

   Note-1: m4 file is not introduced because many duplications are highly
   reduced now.
   Note-2: We guess the code review for this big patch would probably take
   some time and we may need to merge latest code from master branch from
   time to time. We prefer to keep aarch64_neon/sve.ad and the
   corresponding m4 files for easy comparison and review. Of course, they
   will be finally removed after some solid reviews before integration.
   Note-3: Several other minor refactorings are done in this patch, but we
   cannot list all of them in the commit message. We have reviewed and
   tested the rules carefully to guarantee the quality.

   TESTING

   1) Cross compilations on arm32/s390/pps/riscv passed.
   2) tier1~3 jtreg passed on both x64 and aarch64 machines.
   3) vector tests: all the test cases under the following directories can
   pass on both NEON and SVE systems with max vector length 16/32/64 bytes.

     "test/hotspot/jtreg/compiler/vectorapi/"
     "test/jdk/jdk/incubator/vector/"
     "test/hotspot/jtreg/compiler/vectorization/"

   4) Performance evaluation: we choose vector micro-benchmarks from
   panama-vector:vectorIntrinsics [2] to evaluate the performance of this
   patch. We've tested *MaxVectorTests.java cases on one 128-bit SVE
   platform and one NEON platform, and didn't see any visiable regression
   with NEON and SVE. We will continue to verify more cases on other
   platforms with NEON and different SVE vector sizes.

   BENEFITS

   The number of matching rules is reduced to ~ 42%.
     before: 373 (aarch64_neon.ad) + 380 (aarch64_sve.ad) = 753
     after : 313(aarch64_vector.ad)

   Code size for libjvm.so (release build) on aarch64 is reduced to ~ 96%.
     before: 25246528 B (commit 7905788e969)
     after : 24208776 B (nearly 1 MB reduction)

   [1] http://cr.openjdk.java.net/~njian/8285790/JDK-8285790.pdf
   [2] https://github.com/openjdk/panama-vector/tree/vectorIntrinsics/test/micro/org/openjdk/bench/jdk/incubator/vector/operation

   Co-Developed-by: Ningsheng Jian <Ningsheng.Jian at arm.com>
   Co-Developed-by: Eric Liu <eric.c.liu at arm.com>

-------------

Changes: https://git.openjdk.org/jdk/pull/9346/files
 Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=9346&range=06
  Stats: 30563 lines in 19 files changed: 11618 ins; 18822 del; 123 mod
  Patch: https://git.openjdk.org/jdk/pull/9346.diff
  Fetch: git fetch https://git.openjdk.org/jdk pull/9346/head:pull/9346

PR: https://git.openjdk.org/jdk/pull/9346