RFR: 8285790: AArch64: Merge C2 NEON and SVE matching rules [v5]
Hao Sun
haosun at openjdk.org
Mon Aug 1 03:41:26 UTC 2022
> **MOTIVATION**
>
> This is a big refactoring patch of merging rules in aarch64_sve.ad and
> aarch64_neon.ad. The motivation can also be found at [1].
>
> Currently AArch64 has aarch64_sve.ad and aarch64_neon.ad to support SVE
> and NEON codegen respectively. 1) For SVE rules we use vReg operand to
> match VecA for an arbitrary length of vector type, when SVE is enabled;
> 2) For NEON rules we use vecX/vecD operands to match VecX/VecD for
> 128-bit/64-bit vectors, when SVE is not enabled.
>
> This separation looked clean at the time of introducing SVE support.
> However, there are two main drawbacks now.
>
> **Drawback-1**: NEON (Advanced SIMD) is the mandatory feature on AArch64 and
> SVE vector registers share the lower 128 bits with NEON registers. For
> some cases, even when SVE is enabled, we still prefer to match NEON
> rules and emit NEON instructions.
>
> **Drawback-2**: With more and more vector rules added to support VectorAPI,
> there are lots of rules in both two ad files with different predication
> conditions, e.g., different values of UseSVE or vector type/size.
>
> Examples can be found in [1]. These two drawbacks make the code less
> maintainable and increase the libjvm.so code size.
>
> **KEY UPDATES**
>
> In this patch, we mainly do two things, using generic vReg to match all
> NEON/SVE vector registers and merging NEON/SVE matching rules.
>
> - Update-1: Use generic vReg to match all NEON/SVE vector registers
>
> Two different approaches were considered, and we prefer to use generic
> vector solution but keep VecA operand for all >128-bit vectors. See the
> last slide in [1]. All the changes lie in the AArch64 backend.
>
> 1) Some helpers are updated in aarch64.ad to enable generic vector on
> AArch64. See vector_ideal_reg(), pd_specialize_generic_vector_operand(),
> is_reg2reg_move() and is_generic_vector().
>
> 2) Operand vecA is created to match VecA register, and vReg is updated
> to match VecA/D/X registers dynamically.
>
> With the introduction of generic vReg, difference in register types
> between NEON rules and SVE rules can be eliminated, which makes it easy
> to merge these rules.
>
> - Update-2: Try to merge existing rules
>
> As updated in GensrcAdlc.gmk, new ad file, aarch64_vector.ad, is
> introduced to hold the grouped and merged matching rules.
>
> 1) Similar rules with difference in vector type/size can be merged into
> new rules, where different types and vector sizes are handled in the
> codegen part, e.g., vadd(). This resolves **Drawback-2**.
>
> 2) In most cases, we tend to emit NEON instructions for 128-bit vector
> operations on SVE platforms, e.g., vadd(). This resolves **Drawback-1**.
>
> It's important to note that there are some exceptions.
>
> Exception-1: For some rules, there are no direct NEON instructions, but
> exists simple SVE implementation due to newly added SVE ISA. Such rules
> include vloadconB, vmulL_neon, vminL_neon, vmaxL_neon,
> reduce_addF_le128b (4F case), reduce_and/or/xor_neon, reduce_minL_neon,
> reduce_maxL_neon, vcvtLtoF_neon, vcvtDtoI_neon, rearrange_HS_neon.
>
> Exception-2: Vector mask generation and operation rules are different
> because vector mask is stored in different types of registers between
> NEON and SVE, e.g., vmaskcmp_neon and vmask_truecount_neon rules.
>
> Exception-3: Shift right related rules are different because vector
> shift right instructions differ a bit between NEON and SVE.
>
> For these exceptions, we emit NEON or SVE code simply based on UseSVE
> options.
>
> **MINOR UPDATES and CODE REFACTORING**
>
> Since we've touched all lines of code during merging rules, we further
> do more minor updates and refactoring.
>
> - Reduce regmask bits
>
> Stack slot alignment is handled specially for scalable vector, which
> will firstly align to SlotsPerVecA, and then align to the real vector
> length. We should guarantee SlotsPerVecA is no bigger than the real
> vector length. Otherwise, unused stack space would be allocated.
>
> In AArch64 SVE, the vector length can be 128 to 2048 bits. However,
> SlotsPerVecA is currently set as 8, i.e. 8 * 32 = 256 bits. As a result,
> on a 128-bit SVE platform, the stack slot is aligned to 256 bits,
> leaving the half 128 bits unused. In this patch, we reduce SlotsPerVecA
> from 8 to 4.
>
> See the updates in register_aarch64.hpp, regmask.hpp and aarch64.ad
> (chunk1 and vectora_reg).
>
> - Refactor NEON/SVE vector op support check.
>
> Merge NEON and SVE vector supported check into one single function. To
> be consistent, SVE default size supported check now is relaxed from no
> less than 64 bits to the same condition as NEON's min_vector_size(),
> i.e. 4 bytes and 4/2 booleans are also supported. This should be fine,
> as we assume at least we will emit NEON code for those small vectors,
> with unified rules.
>
> - Some notes for new rules
>
> 1) Since new rules are unique and it makes no sense to set different
> "ins_cost", we turn to use the default cost.
>
> 2) By default we use "pipe_slow" for matching rules in aarch64_vector.ad
> now. Hence, many SIMD pipeline classes at aarch64.ad become unused and
> can be removed.
>
> 3) Suffixes '_le128b/_gt128b' and '_neon/_sve' are appended in the
> matching rule names if needed.
> a) 'le128b' means the vector length is less than or equal to 128 bits.
> This rule can be matched on both NEON and 128-bit SVE.
> b) 'gt128b' means the vector length is greater than 128 bits. This rule
> can only be matched on SVE.
> c) 'neon' means this rule can only be matched on NEON, i.e. the
> generated instruction is not better than those in 128-bit SVE.
> d) 'sve' means this rule is only matched on SVE for all possible vector
> length, i.e. not limited to gt128b.
>
> Note-1: m4 file is not introduced because many duplications are highly
> reduced now.
> Note-2: We guess the code review for this big patch would probably take
> some time and we may need to merge latest code from master branch from
> time to time. We prefer to keep aarch64_neon/sve.ad and the
> corresponding m4 files for easy comparison and review. Of course, they
> will be finally removed after some solid reviews before integration.
> Note-3: Several other minor refactorings are done in this patch, but we
> cannot list all of them in the commit message. We have reviewed and
> tested the rules carefully to guarantee the quality.
>
> **TESTING**
>
> 1) Cross compilations on arm32/s390/pps/riscv passed.
> 2) tier1~3 jtreg passed on both x64 and aarch64 machines.
> 3) vector tests: all the test cases under the following directories can
> pass on both NEON and SVE systems with max vector length 16/32/64 bytes.
>
> "test/hotspot/jtreg/compiler/vectorapi/"
> "test/jdk/jdk/incubator/vector/"
> "test/hotspot/jtreg/compiler/vectorization/"
>
> 4) Performance evaluation: we choose vector micro-benchmarks from
> panama-vector:vectorIntrinsics [2] to evaluate the performance of this
> patch. We've tested *MaxVectorTests.java cases on one 128-bit SVE
> platform and one NEON platform, and didn't see any visiable regression
> with NEON and SVE. We will continue to verify more cases on other
> platforms with NEON and different SVE vector sizes.
>
> **BENEFITS**
>
> The number of matching rules is reduced to ~ **42%**.
> before: 373 (aarch64_neon.ad) + 380 (aarch64_sve.ad) = 753
> after : 313 (aarch64_vector.ad)
>
> Code size for libjvm.so (release build) on aarch64 is reduced to ~ **96%**.
> before: 25246528 B (commit 7905788e969)
> after : 24208776 B (**nearly 1 MB reduction**)
>
> [1] http://cr.openjdk.java.net/~njian/8285790/JDK-8285790.pdf
> [2] https://github.com/openjdk/panama-vector/tree/vectorIntrinsics/test/micro/org/openjdk/bench/jdk/incubator/vector/operation
>
> Co-Developed-by: Ningsheng Jian <Ningsheng.Jian at arm.com>
> Co-Developed-by: Eric Liu <eric.c.liu at arm.com>
Hao Sun has updated the pull request incrementally with one additional commit since the last revision:
Fix Jtreg failures on NEON with -XX:MaxVectorSize=8
[Byte|Short|Int|Long|Float|Double]128VectorTests.java failed on NEON
platform with -XX:MaxVectorSize=8.
The root cause lies in that 128-bit vector MulReduction operations would
pass the check in match_rule_supported_vector(), i.e. returning true if
lengh_in_bytes is 16, which should be invalid since MaxVectorSize=8.
In this patch, we improve the check logic a bit, i.e. failing fast if
the conditions are not satisfied and falling through to check
vector_size_supported() eventually.
Similar update is done to the check for Op_MulAddVS2VI.
-------------
Changes:
- all: https://git.openjdk.org/jdk/pull/9346/files
- new: https://git.openjdk.org/jdk/pull/9346/files/3e85e2ad..87899214
Webrevs:
- full: https://webrevs.openjdk.org/?repo=jdk&pr=9346&range=04
- incr: https://webrevs.openjdk.org/?repo=jdk&pr=9346&range=03-04
Stats: 20 lines in 2 files changed: 16 ins; 0 del; 4 mod
Patch: https://git.openjdk.org/jdk/pull/9346.diff
Fetch: git fetch https://git.openjdk.org/jdk pull/9346/head:pull/9346
PR: https://git.openjdk.org/jdk/pull/9346
More information about the hotspot-compiler-dev
mailing list