RFR: 8285790: AArch64: Merge C2 NEON and SVE matching rules [v3]

Hao Sun haosun at openjdk.org
Wed Jul 27 04:12:06 UTC 2022


On Tue, 26 Jul 2022 15:49:06 GMT, Andrew Dinn <adinn at openjdk.org> wrote:

>> Hao Sun has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains five commits:
>> 
>>  - Merge branch 'master' as of 22nd-Jul into 8285790-merge-rules
>>    
>>    Merge branch "master".
>>  - Add m4 file
>>    
>>    Add the corresponding M4 file
>>  - Add VM_Version flag to control NEON instruction generation
>>    
>>    Add VM_Version flag use_neon_for_vector() to control whether to generate
>>    NEON instructions for 128-bit vector operations.
>>    
>>    Currently only the vector length is checked inside, and the flag
>>    returns true on existing SVE cores. More specific conditions might be
>>    checked in the near future, e.g., the basic data type or the SVE CPU
>>    model.
>>    
>>    Besides, new macro assembler helpers neon_vector_extend/narrow() are
>>    introduced to keep the code clean.
>>    
>>    Note: AddReductionVF/D rules are updated so that SVE instructions are
>>    generated for 64/128-bit vector operations, because floating point
>>    reduction add instructions are supported directly in SVE.
>>  - Merge branch 'master' as of 7th-July into 8285790-merge-rules
>>  - 8285790: AArch64: Merge C2 NEON and SVE matching rules
>>    
>>    MOTIVATION
>>    
>>    This is a big refactoring patch that merges the matching rules in
>>    aarch64_sve.ad and aarch64_neon.ad. The motivation can also be found
>>    in [1].
>>    
>>    Currently AArch64 has aarch64_sve.ad and aarch64_neon.ad to support SVE
>>    and NEON codegen respectively. 1) For SVE rules we use the vReg operand
>>    to match VecA for vectors of arbitrary length, when SVE is enabled;
>>    2) For NEON rules we use the vecX/vecD operands to match VecX/VecD for
>>    128-bit/64-bit vectors, when SVE is not enabled.
>>    
>>    This separation looked clean when SVE support was introduced. However,
>>    it now has two main drawbacks.
>>    
>>    Drawback-1: NEON (Advanced SIMD) is a mandatory feature on AArch64, and
>>    SVE vector registers share their lower 128 bits with the NEON registers.
>>    In some cases, even when SVE is enabled, we still prefer to match NEON
>>    rules and emit NEON instructions.
>>    
>>    Drawback-2: With more and more vector rules added to support VectorAPI,
>>    there are lots of rules in both ad files, with different predication
>>    conditions, e.g., different values of UseSVE or different vector
>>    types/sizes.
>>    
>>    Examples can be found in [1]. These two drawbacks make the code less
>>    maintainable and increase the libjvm.so code size.
>>    
>>    KEY UPDATES
>>    
>>    In this patch we mainly do two things: use a generic vReg operand to
>>    match all NEON/SVE vector registers, and merge the NEON/SVE matching
>>    rules.
>>    
>>    Update-1: Use generic vReg to match all NEON/SVE vector registers
>>    
>>    Two different approaches were considered, and we prefer the generic
>>    vector solution, keeping the VecA operand for all >128-bit vectors. See
>>    the last slide in [1]. All the changes lie in the AArch64 backend.
>>    
>>    1) Some helpers are updated in aarch64.ad to enable generic vector
>>    operands on AArch64. See vector_ideal_reg(),
>>    pd_specialize_generic_vector_operand(), is_reg2reg_move() and
>>    is_generic_vector().
>>    
>>    2) The vecA operand is created to match VecA registers, and vReg is
>>    updated to match VecA/D/X registers dynamically.
>>    
>>    With the introduction of the generic vReg, the difference in register
>>    types between NEON rules and SVE rules is eliminated, which makes it
>>    easy to merge these rules.
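>> 
>>    For illustration, the shape of such a generic vector operand in ADL is
>>    roughly as follows (a hedged sketch, not the exact operand definition
>>    from the patch):
>> 
>>      operand vReg()
>>      %{
>>        // The register class is chosen per node ("dynamic"), so one operand
>>        // can stand for VecA, VecD or VecX registers.
>>        constraint(ALLOC_IN_RC(dynamic));
>>        match(VecA);
>>        match(VecD);
>>        match(VecX);
>> 
>>        format %{ %}
>>        interface(REG_INTER);
>>      %}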
>>    
>>    Update-2: Try to merge existing rules
>>    
>>    As updated in GensrcAdlc.gmk, a new ad file, aarch64_vector.ad, is
>>    introduced to hold the grouped and merged matching rules.
>>    
>>    1) Similar rules that differ only in vector type/size can be merged into
>>    new rules, where the different types and vector sizes are handled in the
>>    codegen part, e.g., vadd(). This resolves Drawback-2.
>>    
>>    2) In most cases, we tend to emit NEON instructions for 128-bit vector
>>    operations on SVE platforms, e.g., vadd(). This resolves Drawback-1.
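>> 
>>    As a hedged illustration of what a merged rule looks like, the sketch
>>    below handles only the int case for brevity; the parameter of
>>    VM_Version::use_neon_for_vector() is an assumption here, and the real
>>    rules in aarch64_vector.ad cover all basic types and vector sizes:
>> 
>>      instruct vaddI(vReg dst, vReg src1, vReg src2) %{
>>        match(Set dst (AddVI src1 src2));
>>        format %{ "vaddI $dst, $src1, $src2" %}
>>        ins_encode %{
>>          uint length_in_bytes = Matcher::vector_length_in_bytes(this);
>>          if (VM_Version::use_neon_for_vector(length_in_bytes)) {
>>            // Vectors of <= 128 bits: emit NEON code, even when SVE is
>>            // enabled.
>>            Assembler::SIMD_Arrangement arrangement =
>>                (length_in_bytes == 16) ? __ T4S : __ T2S;
>>            __ addv(as_FloatRegister($dst$$reg), arrangement,
>>                    as_FloatRegister($src1$$reg),
>>                    as_FloatRegister($src2$$reg));
>>          } else {
>>            // Vectors of > 128 bits: only reachable with SVE enabled.
>>            assert(UseSVE > 0, "must be sve");
>>            __ sve_add(as_FloatRegister($dst$$reg), __ S,
>>                       as_FloatRegister($src1$$reg),
>>                       as_FloatRegister($src2$$reg));
>>          }
>>        %}
>>        ins_pipe(pipe_slow);
>>      %}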
>>    
>>    It's important to note that there are some exceptions.
>>    
>>    Exception-1: For some rules there is no direct NEON instruction, but a
>>    simple SVE implementation exists thanks to newly added SVE instructions.
>>    Such rules include vloadconB, vmulL_neon, vminL_neon, vmaxL_neon,
>>    reduce_addF_le128b (4F case), reduce_and/or/xor_neon, reduce_minL_neon,
>>    reduce_maxL_neon, vcvtLtoF_neon, vcvtDtoI_neon, rearrange_HS_neon.
>>    
>>    Exception-2: Vector mask generation and operation rules are different
>>    because the vector mask is stored in different kinds of registers on
>>    NEON and SVE; see, e.g., the vmaskcmp_neon and vmask_truecount_neon
>>    rules.
>>    
>>    Exception-3: Shift-right related rules are different because the vector
>>    shift-right instructions differ slightly between NEON and SVE.
>>    
>>    For these exceptions, we emit NEON or SVE code simply based on the
>>    UseSVE option.
>>    
>>    MINOR UPDATES and CODE REFACTORING
>>    
>>    Since merging the rules touches nearly every line of this code, we take
>>    the opportunity to make further minor updates and refactorings.
>>    
>>    1. Reduce regmask bits
>>    
>>    Stack slot alignment is handled specially for scalable vectors: the slot
>>    is first aligned to SlotsPerVecA and then to the real vector length. We
>>    should guarantee that SlotsPerVecA is no bigger than the real vector
>>    length; otherwise, unused stack space would be allocated.
>>    
>>    In AArch64 SVE, the vector length can be 128 to 2048 bits. However,
>>    SlotsPerVecA is currently set to 8, i.e. 8 * 32 = 256 bits. As a result,
>>    on a 128-bit SVE platform the stack slot is aligned to 256 bits, leaving
>>    half of it (128 bits) unused. In this patch, we reduce SlotsPerVecA from
>>    8 to 4, i.e. 4 * 32 = 128 bits.
>>    
>>    See the updates in register_aarch64.hpp, regmask.hpp and aarch64.ad
>>    (chunk1 and vectora_reg).
>>    
>>    2. Refactor NEON/SVE vector op support check.
>>    
>>    Merge the NEON and SVE vector support checks into one single function.
>>    For consistency, the SVE minimum supported size is relaxed from "no less
>>    than 64 bits" to the same condition as NEON's min_vector_size(), i.e.
>>    4-byte vectors and 4/2-element boolean vectors are also supported. This
>>    should be fine, as with the unified rules we will at least emit NEON
>>    code for those small vectors.
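>> 
>>    A rough sketch of the kind of unified minimum-size check described
>>    above (illustrative only; the exact per-type limits come from the
>>    patch, not from this sketch):
>> 
>>      int Matcher::min_vector_size(const BasicType bt) {
>>        int max_size = max_vector_size(bt);
>>        // Common lower bound: a 64-bit (8-byte) vector.
>>        int size = 8 / type2aelembytes(bt);
>>        if (bt == T_BYTE) {
>>          size = 4;  // allow 4-byte byte vectors, e.g. for Vector API shuffles
>>        } else if (bt == T_BOOLEAN) {
>>          size = 2;  // allow small boolean vectors used as vector masks
>>        }
>>        return MIN2(size, max_size);
>>      }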
>>    
>>    3. Some notes for new rules
>>    
>>    1) Since the new rules are unique, it makes no sense to set different
>>    "ins_cost" values, so we use the default cost instead.
>>    
>>    2) By default we now use "pipe_slow" for the matching rules in
>>    aarch64_vector.ad. Hence, many SIMD pipeline classes in aarch64.ad
>>    become unused and can be removed.
>>    
>>    3) Suffixes '_le128b/_gt128b' and '_neon/_sve' are appended to the
>>    matching rule names where needed (see the illustrative predicates after
>>    this list).
>>    a) 'le128b' means the vector length is less than or equal to 128 bits.
>>    Such a rule can be matched on both NEON and 128-bit SVE.
>>    b) 'gt128b' means the vector length is greater than 128 bits. Such a
>>    rule can only be matched on SVE.
>>    c) 'neon' means the rule can only be matched when generating NEON code,
>>    i.e. the generated instructions are no better than the 128-bit SVE ones.
>>    d) 'sve' means the rule is only matched on SVE, for all possible vector
>>    lengths, i.e. not limited to gt128b.
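>> 
>>    In predicate terms, these suffixes correspond roughly to conditions like
>>    the following (illustrative; the actual predicates in the patch may
>>    combine further checks):
>> 
>>      predicate(Matcher::vector_length_in_bytes(n) <= 16);  // '_le128b'
>>      predicate(Matcher::vector_length_in_bytes(n) > 16);   // '_gt128b'
>>      predicate(UseSVE == 0);                               // '_neon'
>>      predicate(UseSVE > 0);                                // '_sve'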
>>    
>>    Note-1: No m4 file is introduced, because the amount of duplication is
>>    now greatly reduced.
>>    Note-2: We expect the code review of this big patch to take some time,
>>    and we may need to merge the latest code from the master branch from
>>    time to time. We prefer to keep aarch64_neon/sve.ad and the
>>    corresponding m4 files for easy comparison and review. Of course, they
>>    will be removed, after solid reviews, before integration.
>>    Note-3: Several other minor refactorings are done in this patch, but we
>>    cannot list all of them in the commit message. We have reviewed and
>>    tested the rules carefully to guarantee their quality.
>>    
>>    TESTING
>>    
>>    1) Cross compilations on arm32/s390/ppc/riscv passed.
>>    2) tier1~3 jtreg passed on both x64 and aarch64 machines.
>>    3) vector tests: all the test cases under the following directories can
>>    pass on both NEON and SVE systems with max vector length 16/32/64 bytes.
>>    
>>      "test/hotspot/jtreg/compiler/vectorapi/"
>>      "test/jdk/jdk/incubator/vector/"
>>      "test/hotspot/jtreg/compiler/vectorization/"
>>    
>>    4) Performance evaluation: we chose vector micro-benchmarks from
>>    panama-vector:vectorIntrinsics [2] to evaluate the performance of this
>>    patch. We've run the *MaxVectorTests.java cases on one 128-bit SVE
>>    platform and one NEON platform, and didn't see any visible regression
>>    with either NEON or SVE. We will continue to verify more cases on other
>>    platforms with NEON and different SVE vector lengths.
>>    
>>    BENEFITS
>>    
>>    The number of matching rules is reduced to ~42% of the original.
>>      before: 373 (aarch64_neon.ad) + 380 (aarch64_sve.ad) = 753
>>      after : 313 (aarch64_vector.ad)
>>    
>>    The code size of libjvm.so (release build) on aarch64 is reduced to ~96%
>>    of the original.
>>      before: 25246528 B (commit 7905788e969)
>>      after : 24208776 B (nearly 1 MB reduction)
>>    
>>    [1] http://cr.openjdk.java.net/~njian/8285790/JDK-8285790.pdf
>>    [2] https://github.com/openjdk/panama-vector/tree/vectorIntrinsics/test/micro/org/openjdk/bench/jdk/incubator/vector/operation
>>    
>>    Co-Developed-by: Ningsheng Jian <Ningsheng.Jian at arm.com>
>>    Co-Developed-by: Eric Liu <eric.c.liu at arm.com>
>
> src/hotspot/cpu/aarch64/aarch64_vector_ad.m4 line 169:
> 
>> 167:       case Op_MulReductionVI:
>> 168:       case Op_MulReductionVL:
>> 169:         // No multiply reduction instructions, and we emit scalar
> 
> This comment is a little unclear. Is this what you actually mean?
> 
> "No vector multiply reduction instructions, but we do emit scalar instructions for 64/128-bit vectors"

Yes. You're right. Updated in the new revision. Thanks.
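
For context, the check around that comment has roughly the following shape
(an illustrative sketch; length_in_bytes here stands for the vector length
being checked, and the exact condition is the one in the patch):

    case Op_MulReductionVI:
    case Op_MulReductionVL:
      // No vector multiply reduction instructions, but we do emit
      // scalar instructions for 64/128-bit vectors.
      if (length_in_bytes != 8 && length_in_bytes != 16) {
        return false;
      }
      break;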

> src/hotspot/cpu/aarch64/aarch64_vector_ad.m4 line 3026:
> 
>> 3024: EXTRACT_FP(D, fmovd, 2, D, 3)
>> 3025: 
>> 3026: // ------------------------------ Vector mask loat/store -----------------------
> 
> Should be "load/store"

Yes, it's a typo. Updated in the new revision.

-------------

PR: https://git.openjdk.org/jdk/pull/9346

