RFR: 8285790: AArch64: Merge C2 NEON and SVE matching rules [v6]

Fri Aug 5 13:26:56 UTC 2022

On Fri, 5 Aug 2022 10:06:53 GMT, Hao Sun <haosun at openjdk.org> wrote:

>> **MOTIVATION**
>> 
>> This is a big refactoring patch of merging rules in aarch64_sve.ad and
>> aarch64_neon.ad. The motivation can also be found at [1].
>> 
>> Currently AArch64 has aarch64_sve.ad and aarch64_neon.ad to support SVE
>> and NEON codegen respectively. 1) For SVE rules we use vReg operand to
>> match VecA for an arbitrary length of vector type, when SVE is enabled;
>> 2) For NEON rules we use vecX/vecD operands to match VecX/VecD for
>> 128-bit/64-bit vectors, when SVE is not enabled.
>> 
>> This separation looked clean at the time of introducing SVE support.
>> However, there are two main drawbacks now.
>> 
>> **Drawback-1**: NEON (Advanced SIMD) is the mandatory feature on AArch64 and
>> SVE vector registers share the lower 128 bits with NEON registers. For
>> some cases, even when SVE is enabled, we still prefer to match NEON
>> rules and emit NEON instructions.
>> 
>> **Drawback-2**: With more and more vector rules added to support VectorAPI,
>> there are lots of rules in both two ad files with different predication
>> conditions, e.g., different values of UseSVE or vector type/size.
>> 
>> Examples can be found in [1]. These two drawbacks make the code less
>> maintainable and increase the libjvm.so code size.
>> 
>> **KEY UPDATES**
>> 
>> In this patch, we mainly do two things, using generic vReg to match all
>> NEON/SVE vector registers and merging NEON/SVE matching rules.
>> 
>> - Update-1: Use generic vReg to match all NEON/SVE vector registers
>> 
>> Two different approaches were considered, and we prefer to use generic
>> vector solution but keep VecA operand for all >128-bit vectors. See the
>> last slide in [1]. All the changes lie in the AArch64 backend.
>> 
>> 1) Some helpers are updated in aarch64.ad to enable generic vector on
>> AArch64. See vector_ideal_reg(), pd_specialize_generic_vector_operand(),
>> is_reg2reg_move() and is_generic_vector().
>> 
>> 2) Operand vecA is created to match VecA register, and vReg is updated
>> to match VecA/D/X registers dynamically.
>> 
>> With the introduction of generic vReg, difference in register types
>> between NEON rules and SVE rules can be eliminated, which makes it easy
>> to merge these rules.
>> 
>> - Update-2: Try to merge existing rules
>> 
>> As updated in GensrcAdlc.gmk, new ad file, aarch64_vector.ad, is
>> introduced to hold the grouped and merged matching rules.
>> 
>> 1) Similar rules with difference in vector type/size can be merged into
>> new rules, where different types and vector sizes are handled in the
>> codegen part, e.g., vadd(). This resolves **Drawback-2**.
>> 
>> 2) In most cases, we tend to emit NEON instructions for 128-bit vector
>> operations on SVE platforms, e.g., vadd(). This resolves **Drawback-1**.
>> 
>> It's important to note that there are some exceptions.
>> 
>> Exception-1: For some rules, there are no direct NEON instructions, but
>> exists simple SVE implementation due to newly added SVE ISA. Such rules
>> include vloadconB, vmulL_neon, vminL_neon, vmaxL_neon,
>> reduce_addF_le128b (4F case), reduce_and/or/xor_neon, reduce_minL_neon,
>> reduce_maxL_neon, vcvtLtoF_neon, vcvtDtoI_neon, rearrange_HS_neon.
>> 
>> Exception-2: Vector mask generation and operation rules are different
>> because vector mask is stored in different types of registers between
>> NEON and SVE, e.g., vmaskcmp_neon and vmask_truecount_neon rules.
>> 
>> Exception-3: Shift right related rules are different because vector
>> shift right instructions differ a bit between NEON and SVE.
>> 
>> For these exceptions, we emit NEON or SVE code simply based on UseSVE
>> options.
>> 
>> **MINOR UPDATES and CODE REFACTORING**
>> 
>> Since we've touched all lines of code during merging rules, we further
>> do more minor updates and refactoring.
>> 
>> - Reduce regmask bits
>> 
>> Stack slot alignment is handled specially for scalable vector, which
>> will firstly align to SlotsPerVecA, and then align to the real vector
>> length. We should guarantee SlotsPerVecA is no bigger than the real
>> vector length. Otherwise, unused stack space would be allocated.
>> 
>> In AArch64 SVE, the vector length can be 128 to 2048 bits. However,
>> SlotsPerVecA is currently set as 8, i.e. 8 * 32 = 256 bits. As a result,
>> on a 128-bit SVE platform, the stack slot is aligned to 256 bits,
>> leaving the half 128 bits unused. In this patch, we reduce SlotsPerVecA
>> from 8 to 4.
>> 
>> See the updates in register_aarch64.hpp, regmask.hpp and aarch64.ad
>> (chunk1 and vectora_reg).
>> 
>> - Refactor NEON/SVE vector op support check.
>> 
>> Merge NEON and SVE vector supported check into one single function. To
>> be consistent, SVE default size supported check now is relaxed from no
>> less than 64 bits to the same condition as NEON's min_vector_size(),
>> i.e. 4 bytes and 4/2 booleans are also supported. This should be fine,
>> as we assume at least we will emit NEON code for those small vectors,
>> with unified rules.
>> 
>> - Some notes for new rules
>> 
>> 1) Since new rules are unique and it makes no sense to set different
>> "ins_cost", we turn to use the default cost.
>> 
>> 2) By default we use "pipe_slow" for matching rules in aarch64_vector.ad
>> now. Hence, many SIMD pipeline classes at aarch64.ad become unused and
>> can be removed.
>> 
>> 3) Suffixes '_le128b/_gt128b' and '_neon/_sve' are appended in the
>> matching rule names if needed.
>> a) 'le128b' means the vector length is less than or equal to 128 bits.
>> This rule can be matched on both NEON and 128-bit SVE.
>> b) 'gt128b' means the vector length is greater than 128 bits. This rule
>> can only be matched on SVE.
>> c) 'neon' means this rule can only be matched on NEON, i.e. the
>> generated instruction is not better than those in 128-bit SVE.
>> d) 'sve' means this rule is only matched on SVE for all possible vector
>> length, i.e. not limited to gt128b.
>> 
>> Note-1: m4 file is not introduced because many duplications are highly
>> reduced now.
>> Note-2: We guess the code review for this big patch would probably take
>> some time and we may need to merge latest code from master branch from
>> time to time. We prefer to keep aarch64_neon/sve.ad and the
>> corresponding m4 files for easy comparison and review. Of course, they
>> will be finally removed after some solid reviews before integration.
>> Note-3: Several other minor refactorings are done in this patch, but we
>> cannot list all of them in the commit message. We have reviewed and
>> tested the rules carefully to guarantee the quality.
>> 
>> **TESTING**
>> 
>> 1) Cross compilations on arm32/s390/pps/riscv passed.
>> 2) tier1~3 jtreg passed on both x64 and aarch64 machines.
>> 3) vector tests: all the test cases under the following directories can
>> pass on both NEON and SVE systems with max vector length 16/32/64 bytes.
>> 
>>   "test/hotspot/jtreg/compiler/vectorapi/"
>>   "test/jdk/jdk/incubator/vector/"
>>   "test/hotspot/jtreg/compiler/vectorization/"
>> 
>> 4) Performance evaluation: we choose vector micro-benchmarks from
>> panama-vector:vectorIntrinsics [2] to evaluate the performance of this
>> patch. We've tested *MaxVectorTests.java cases on one 128-bit SVE
>> platform and one NEON platform, and didn't see any visiable regression
>> with NEON and SVE. We will continue to verify more cases on other
>> platforms with NEON and different SVE vector sizes.
>> 
>> **BENEFITS**
>> 
>> The number of matching rules is reduced to ~ **42%**.
>>   before: 373 (aarch64_neon.ad) + 380 (aarch64_sve.ad) = 753
>>   after   : 313 (aarch64_vector.ad)
>> 
>> Code size for libjvm.so (release build) on aarch64 is reduced to ~ **96%**.
>>   before: 25246528 B (commit 7905788e969)
>>   after   : 24208776 B (**nearly 1 MB reduction**)
>> 
>> [1] http://cr.openjdk.java.net/~njian/8285790/JDK-8285790.pdf
>> [2] https://github.com/openjdk/panama-vector/tree/vectorIntrinsics/test/micro/org/openjdk/bench/jdk/incubator/vector/operation
>> 
>> Co-Developed-by: Ningsheng Jian <Ningsheng.Jian at arm.com>
>> Co-Developed-by: Eric Liu <eric.c.liu at arm.com>
>
> Hao Sun has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains ten commits:
> 
>  - Merge with jdk/master
>  - Remove unused ad and m4 files
>  - Fix Jtreg failures on NEON with -XX:MaxVectorSize=8
>    
>    [Byte|Short|Int|Long|Float|Double]128VectorTests.java failed on NEON
>    platform with -XX:MaxVectorSize=8.
>    
>    The root cause lies in that 128-bit vector MulReduction operations would
>    pass the check in match_rule_supported_vector(), i.e. returning true if
>    lengh_in_bytes is 16, which should be invalid since MaxVectorSize=8.
>    
>    In this patch, we improve the check logic a bit, i.e. failing fast if
>    the conditions are not satisfied and falling through to check
>    vector_size_supported() eventually.
>    
>    Similar update is done to the check for Op_MulAddVS2VI.
>  - Improve comments and fix typos
>    
>    1. Improve the comment for MulReductionV*.
>    
>    2. Fix typos for REDUCE_BITWISE_OP_XX macros.
>    
>    3. For the typo in vector mask load/store rules.
>  - Merge branch 'master' as of 22nd-Jul into 8285790-merge-rules
>    
>    Merge branch "master".
>  - Add m4 file
>    
>    Add the corresponding M4 file
>  - Add VM_Version flag to control NEON instruction generation
>    
>    Add VM_Version flag use_neon_for_vector() to control whether to generate
>    NEON instructions for 128-bit vector operations.
>    
>    Currently only vector length is checked inside and it returns true for
>    existing SVE cores. More specific things might be checked in the near
>    future, e.g., the basic data type or SVE CPU model.
>    
>    Besides, new macro assembler helpers neon_vector_extend/narrow() are
>    introduced to make the code clean.
>    
>    Note: AddReductionVF/D rules are updated so that SVE instructions are
>    generated for 64/128-bit vector operations, because floating point
>    reduction add instructions are supported directly in SVE.
>  - Merge branch 'master' as of 7th-July into 8285790-merge-rules
>  - 8285790: AArch64: Merge C2 NEON and SVE matching rules
>    
>    MOTIVATION
>    
>    This is a big refactoring patch of merging rules in aarch64_sve.ad and
>    aarch64_neon.ad. The motivation can also be found at [1].
>    
>    Currently AArch64 has aarch64_sve.ad and aarch64_neon.ad to support SVE
>    and NEON codegen respectively. 1) For SVE rules we use vReg operand to
>    match VecA for an arbitrary length of vector type, when SVE is enabled;
>    2) For NEON rules we use vecX/vecD operands to match VecX/VecD for
>    128-bit/64-bit vectors, when SVE is not enabled.
>    
>    This separation looked clean at the time of introducing SVE support.
>    However, there are two main drawbacks now.
>    
>    Drawback-1: NEON (Advanced SIMD) is the mandatory feature on AArch64 and
>    SVE vector registers share the lower 128 bits with NEON registers. For
>    some cases, even when SVE is enabled, we still prefer to match NEON
>    rules and emit NEON instructions.
>    
>    Drawback-2: With more and more vector rules added to support VectorAPI,
>    there are lots of rules in both two ad files with different predication
>    conditions, e.g., different values of UseSVE or vector type/size.
>    
>    Examples can be found in [1]. These two drawbacks make the code less
>    maintainable and increase the libjvm.so code size.
>    
>    KEY UPDATES
>    
>    In this patch, we mainly do two things, using generic vReg to match all
>    NEON/SVE vector registers and merging NEON/SVE matching rules.
>    
>    Update-1: Use generic vReg to match all NEON/SVE vector registers
>    
>    Two different approaches were considered, and we prefer to use generic
>    vector solution but keep VecA operand for all >128-bit vectors. See the
>    last slide in [1]. All the changes lie in the AArch64 backend.
>    
>    1) Some helpers are updated in aarch64.ad to enable generic vector on
>    AArch64. See vector_ideal_reg(), pd_specialize_generic_vector_operand(),
>    is_reg2reg_move() and is_generic_vector().
>    
>    2) Operand vecA is created to match VecA register, and vReg is updated
>    to match VecA/D/X registers dynamically.
>    
>    With the introduction of generic vReg, difference in register types
>    between NEON rules and SVE rules can be eliminated, which makes it easy
>    to merge these rules.
>    
>    Update-2: Try to merge existing rules
>    
>    As updated in GensrcAdlc.gmk, new ad file, aarch64_vector.ad, is
>    introduced to hold the grouped and merged matching rules.
>    
>    1) Similar rules with difference in vector type/size can be merged into
>    new rules, where different types and vector sizes are handled in the
>    codegen part, e.g., vadd(). This resolves Drawback-2.
>    
>    2) In most cases, we tend to emit NEON instructions for 128-bit vector
>    operations on SVE platforms, e.g., vadd(). This resolves Drawback-1.
>    
>    It's important to note that there are some exceptions.
>    
>    Exception-1: For some rules, there are no direct NEON instructions, but
>    exists simple SVE implementation due to newly added SVE ISA. Such rules
>    include vloadconB, vmulL_neon, vminL_neon, vmaxL_neon,
>    reduce_addF_le128b (4F case), reduce_and/or/xor_neon, reduce_minL_neon,
>    reduce_maxL_neon, vcvtLtoF_neon, vcvtDtoI_neon, rearrange_HS_neon.
>    
>    Exception-2: Vector mask generation and operation rules are different
>    because vector mask is stored in different types of registers between
>    NEON and SVE, e.g., vmaskcmp_neon and vmask_truecount_neon rules.
>    
>    Exception-3: Shift right related rules are different because vector
>    shift right instructions differ a bit between NEON and SVE.
>    
>    For these exceptions, we emit NEON or SVE code simply based on UseSVE
>    options.
>    
>    MINOR UPDATES and CODE REFACTORING
>    
>    Since we've touched all lines of code during merging rules, we further
>    do more minor updates and refactoring.
>    
>    1. Reduce regmask bits
>    
>    Stack slot alignment is handled specially for scalable vector, which
>    will firstly align to SlotsPerVecA, and then align to the real vector
>    length. We should guarantee SlotsPerVecA is no bigger than the real
>    vector length. Otherwise, unused stack space would be allocated.
>    
>    In AArch64 SVE, the vector length can be 128 to 2048 bits. However,
>    SlotsPerVecA is currently set as 8, i.e. 8 * 32 = 256 bits. As a result,
>    on a 128-bit SVE platform, the stack slot is aligned to 256 bits,
>    leaving the half 128 bits unused. In this patch, we reduce SlotsPerVecA
>    from 8 to 4.
>    
>    See the updates in register_aarch64.hpp, regmask.hpp and aarch64.ad
>    (chunk1 and vectora_reg).
>    
>    2. Refactor NEON/SVE vector op support check.
>    
>    Merge NEON and SVE vector supported check into one single function. To
>    be consistent, SVE default size supported check now is relaxed from no
>    less than 64 bits to the same condition as NEON's min_vector_size(),
>    i.e. 4 bytes and 4/2 booleans are also supported. This should be fine,
>    as we assume at least we will emit NEON code for those small vectors,
>    with unified rules.
>    
>    3. Some notes for new rules
>    
>    1) Since new rules are unique and it makes no sense to set different
>    "ins_cost", we turn to use the default cost.
>    
>    2) By default we use "pipe_slow" for matching rules in aarch64_vector.ad
>    now. Hence, many SIMD pipeline classes at aarch64.ad become unused and
>    can be removed.
>    
>    3) Suffixes '_le128b/_gt128b' and '_neon/_sve' are appended in the
>    matching rule names if needed.
>    a) 'le128b' means the vector length is less than or equal to 128 bits.
>    This rule can be matched on both NEON and 128-bit SVE.
>    b) 'gt128b' means the vector length is greater than 128 bits. This rule
>    can only be matched on SVE.
>    c) 'neon' means this rule can only be matched on NEON, i.e. the
>    generated instruction is not better than those in 128-bit SVE.
>    d) 'sve' means this rule is only matched on SVE for all possible vector
>    length, i.e. not limited to gt128b.
>    
>    Note-1: m4 file is not introduced because many duplications are highly
>    reduced now.
>    Note-2: We guess the code review for this big patch would probably take
>    some time and we may need to merge latest code from master branch from
>    time to time. We prefer to keep aarch64_neon/sve.ad and the
>    corresponding m4 files for easy comparison and review. Of course, they
>    will be finally removed after some solid reviews before integration.
>    Note-3: Several other minor refactorings are done in this patch, but we
>    cannot list all of them in the commit message. We have reviewed and
>    tested the rules carefully to guarantee the quality.
>    
>    TESTING
>    
>    1) Cross compilations on arm32/s390/pps/riscv passed.
>    2) tier1~3 jtreg passed on both x64 and aarch64 machines.
>    3) vector tests: all the test cases under the following directories can
>    pass on both NEON and SVE systems with max vector length 16/32/64 bytes.
>    
>      "test/hotspot/jtreg/compiler/vectorapi/"
>      "test/jdk/jdk/incubator/vector/"
>      "test/hotspot/jtreg/compiler/vectorization/"
>    
>    4) Performance evaluation: we choose vector micro-benchmarks from
>    panama-vector:vectorIntrinsics [2] to evaluate the performance of this
>    patch. We've tested *MaxVectorTests.java cases on one 128-bit SVE
>    platform and one NEON platform, and didn't see any visiable regression
>    with NEON and SVE. We will continue to verify more cases on other
>    platforms with NEON and different SVE vector sizes.
>    
>    BENEFITS
>    
>    The number of matching rules is reduced to ~ 42%.
>      before: 373 (aarch64_neon.ad) + 380 (aarch64_sve.ad) = 753
>      after : 313(aarch64_vector.ad)
>    
>    Code size for libjvm.so (release build) on aarch64 is reduced to ~ 96%.
>      before: 25246528 B (commit 7905788e969)
>      after : 24208776 B (nearly 1 MB reduction)
>    
>    [1] http://cr.openjdk.java.net/~njian/8285790/JDK-8285790.pdf
>    [2] https://github.com/openjdk/panama-vector/tree/vectorIntrinsics/test/micro/org/openjdk/bench/jdk/incubator/vector/operation
>    
>    Co-Developed-by: Ningsheng Jian <Ningsheng.Jian at arm.com>
>    Co-Developed-by: Eric Liu <eric.c.liu at arm.com>

I submitted testing and will report back once it finished.

-------------

PR: https://git.openjdk.org/jdk/pull/9346