RFR: 8285790: AArch64: Merge C2 NEON and SVE matching rules [v7]
Hao Sun
haosun at openjdk.org
Tue Aug 16 09:02:57 UTC 2022
> **MOTIVATION**
>
> This is a big refactoring patch of merging rules in aarch64_sve.ad and
> aarch64_neon.ad. The motivation can also be found at [1].
>
> Currently AArch64 has aarch64_sve.ad and aarch64_neon.ad to support SVE
> and NEON codegen respectively. 1) For SVE rules we use vReg operand to
> match VecA for an arbitrary length of vector type, when SVE is enabled;
> 2) For NEON rules we use vecX/vecD operands to match VecX/VecD for
> 128-bit/64-bit vectors, when SVE is not enabled.
>
> This separation looked clean at the time of introducing SVE support.
> However, there are two main drawbacks now.
>
> **Drawback-1**: NEON (Advanced SIMD) is the mandatory feature on AArch64 and
> SVE vector registers share the lower 128 bits with NEON registers. For
> some cases, even when SVE is enabled, we still prefer to match NEON
> rules and emit NEON instructions.
>
> **Drawback-2**: With more and more vector rules added to support VectorAPI,
> there are lots of rules in both two ad files with different predication
> conditions, e.g., different values of UseSVE or vector type/size.
>
> Examples can be found in [1]. These two drawbacks make the code less
> maintainable and increase the libjvm.so code size.
>
> **KEY UPDATES**
>
> In this patch, we mainly do two things, using generic vReg to match all
> NEON/SVE vector registers and merging NEON/SVE matching rules.
>
> - Update-1: Use generic vReg to match all NEON/SVE vector registers
>
> Two different approaches were considered, and we prefer to use generic
> vector solution but keep VecA operand for all >128-bit vectors. See the
> last slide in [1]. All the changes lie in the AArch64 backend.
>
> 1) Some helpers are updated in aarch64.ad to enable generic vector on
> AArch64. See vector_ideal_reg(), pd_specialize_generic_vector_operand(),
> is_reg2reg_move() and is_generic_vector().
>
> 2) Operand vecA is created to match VecA register, and vReg is updated
> to match VecA/D/X registers dynamically.
>
> With the introduction of generic vReg, difference in register types
> between NEON rules and SVE rules can be eliminated, which makes it easy
> to merge these rules.
>
> - Update-2: Try to merge existing rules
>
> As updated in GensrcAdlc.gmk, new ad file, aarch64_vector.ad, is
> introduced to hold the grouped and merged matching rules.
>
> 1) Similar rules with difference in vector type/size can be merged into
> new rules, where different types and vector sizes are handled in the
> codegen part, e.g., vadd(). This resolves **Drawback-2**.
>
> 2) In most cases, we tend to emit NEON instructions for 128-bit vector
> operations on SVE platforms, e.g., vadd(). This resolves **Drawback-1**.
>
> It's important to note that there are some exceptions.
>
> Exception-1: For some rules, there are no direct NEON instructions, but
> exists simple SVE implementation due to newly added SVE ISA. Such rules
> include vloadconB, vmulL_neon, vminL_neon, vmaxL_neon,
> reduce_addF_le128b (4F case), reduce_and/or/xor_neon, reduce_minL_neon,
> reduce_maxL_neon, vcvtLtoF_neon, vcvtDtoI_neon, rearrange_HS_neon.
>
> Exception-2: Vector mask generation and operation rules are different
> because vector mask is stored in different types of registers between
> NEON and SVE, e.g., vmaskcmp_neon and vmask_truecount_neon rules.
>
> Exception-3: Shift right related rules are different because vector
> shift right instructions differ a bit between NEON and SVE.
>
> For these exceptions, we emit NEON or SVE code simply based on UseSVE
> options.
>
> **MINOR UPDATES and CODE REFACTORING**
>
> Since we've touched all lines of code during merging rules, we further
> do more minor updates and refactoring.
>
> - Reduce regmask bits
>
> Stack slot alignment is handled specially for scalable vector, which
> will firstly align to SlotsPerVecA, and then align to the real vector
> length. We should guarantee SlotsPerVecA is no bigger than the real
> vector length. Otherwise, unused stack space would be allocated.
>
> In AArch64 SVE, the vector length can be 128 to 2048 bits. However,
> SlotsPerVecA is currently set as 8, i.e. 8 * 32 = 256 bits. As a result,
> on a 128-bit SVE platform, the stack slot is aligned to 256 bits,
> leaving the half 128 bits unused. In this patch, we reduce SlotsPerVecA
> from 8 to 4.
>
> See the updates in register_aarch64.hpp, regmask.hpp and aarch64.ad
> (chunk1 and vectora_reg).
>
> - Refactor NEON/SVE vector op support check.
>
> Merge NEON and SVE vector supported check into one single function. To
> be consistent, SVE default size supported check now is relaxed from no
> less than 64 bits to the same condition as NEON's min_vector_size(),
> i.e. 4 bytes and 4/2 booleans are also supported. This should be fine,
> as we assume at least we will emit NEON code for those small vectors,
> with unified rules.
>
> - Some notes for new rules
>
> 1) Since new rules are unique and it makes no sense to set different
> "ins_cost", we turn to use the default cost.
>
> 2) By default we use "pipe_slow" for matching rules in aarch64_vector.ad
> now. Hence, many SIMD pipeline classes at aarch64.ad become unused and
> can be removed.
>
> 3) Suffixes '_le128b/_gt128b' and '_neon/_sve' are appended in the
> matching rule names if needed.
> a) 'le128b' means the vector length is less than or equal to 128 bits.
> This rule can be matched on both NEON and 128-bit SVE.
> b) 'gt128b' means the vector length is greater than 128 bits. This rule
> can only be matched on SVE.
> c) 'neon' means this rule can only be matched on NEON, i.e. the
> generated instruction is not better than those in 128-bit SVE.
> d) 'sve' means this rule is only matched on SVE for all possible vector
> length, i.e. not limited to gt128b.
>
> Note-1: m4 file is not introduced because many duplications are highly
> reduced now.
> Note-2: We guess the code review for this big patch would probably take
> some time and we may need to merge latest code from master branch from
> time to time. We prefer to keep aarch64_neon/sve.ad and the
> corresponding m4 files for easy comparison and review. Of course, they
> will be finally removed after some solid reviews before integration.
> Note-3: Several other minor refactorings are done in this patch, but we
> cannot list all of them in the commit message. We have reviewed and
> tested the rules carefully to guarantee the quality.
>
> **TESTING**
>
> 1) Cross compilations on arm32/s390/pps/riscv passed.
> 2) tier1~3 jtreg passed on both x64 and aarch64 machines.
> 3) vector tests: all the test cases under the following directories can
> pass on both NEON and SVE systems with max vector length 16/32/64 bytes.
>
> "test/hotspot/jtreg/compiler/vectorapi/"
> "test/jdk/jdk/incubator/vector/"
> "test/hotspot/jtreg/compiler/vectorization/"
>
> 4) Performance evaluation: we choose vector micro-benchmarks from
> panama-vector:vectorIntrinsics [2] to evaluate the performance of this
> patch. We've tested *MaxVectorTests.java cases on one 128-bit SVE
> platform and one NEON platform, and didn't see any visiable regression
> with NEON and SVE. We will continue to verify more cases on other
> platforms with NEON and different SVE vector sizes.
>
> **BENEFITS**
>
> The number of matching rules is reduced to ~ **42%**.
> before: 373 (aarch64_neon.ad) + 380 (aarch64_sve.ad) = 753
> after : 313 (aarch64_vector.ad)
>
> Code size for libjvm.so (release build) on aarch64 is reduced to ~ **96%**.
> before: 25246528 B (commit 7905788e969)
> after : 24208776 B (**nearly 1 MB reduction**)
>
> [1] http://cr.openjdk.java.net/~njian/8285790/JDK-8285790.pdf
> [2] https://github.com/openjdk/panama-vector/tree/vectorIntrinsics/test/micro/org/openjdk/bench/jdk/incubator/vector/operation
>
> Co-Developed-by: Ningsheng Jian <Ningsheng.Jian at arm.com>
> Co-Developed-by: Eric Liu <eric.c.liu at arm.com>
Hao Sun has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 10 commits:
- Merge branch 'master' as of 10th-Aug into 8285790-merge-rules
Resolve the conflict in file c2_MacroAssembler_aarch64.hpp.
- Merge with jdk/master
- Remove unused ad and m4 files
- Fix Jtreg failures on NEON with -XX:MaxVectorSize=8
[Byte|Short|Int|Long|Float|Double]128VectorTests.java failed on NEON
platform with -XX:MaxVectorSize=8.
The root cause lies in that 128-bit vector MulReduction operations would
pass the check in match_rule_supported_vector(), i.e. returning true if
lengh_in_bytes is 16, which should be invalid since MaxVectorSize=8.
In this patch, we improve the check logic a bit, i.e. failing fast if
the conditions are not satisfied and falling through to check
vector_size_supported() eventually.
Similar update is done to the check for Op_MulAddVS2VI.
- Improve comments and fix typos
1. Improve the comment for MulReductionV*.
2. Fix typos for REDUCE_BITWISE_OP_XX macros.
3. For the typo in vector mask load/store rules.
- Merge branch 'master' as of 22nd-Jul into 8285790-merge-rules
Merge branch "master".
- Add m4 file
Add the corresponding M4 file
- Add VM_Version flag to control NEON instruction generation
Add VM_Version flag use_neon_for_vector() to control whether to generate
NEON instructions for 128-bit vector operations.
Currently only vector length is checked inside and it returns true for
existing SVE cores. More specific things might be checked in the near
future, e.g., the basic data type or SVE CPU model.
Besides, new macro assembler helpers neon_vector_extend/narrow() are
introduced to make the code clean.
Note: AddReductionVF/D rules are updated so that SVE instructions are
generated for 64/128-bit vector operations, because floating point
reduction add instructions are supported directly in SVE.
- Merge branch 'master' as of 7th-July into 8285790-merge-rules
- 8285790: AArch64: Merge C2 NEON and SVE matching rules
MOTIVATION
This is a big refactoring patch of merging rules in aarch64_sve.ad and
aarch64_neon.ad. The motivation can also be found at [1].
Currently AArch64 has aarch64_sve.ad and aarch64_neon.ad to support SVE
and NEON codegen respectively. 1) For SVE rules we use vReg operand to
match VecA for an arbitrary length of vector type, when SVE is enabled;
2) For NEON rules we use vecX/vecD operands to match VecX/VecD for
128-bit/64-bit vectors, when SVE is not enabled.
This separation looked clean at the time of introducing SVE support.
However, there are two main drawbacks now.
Drawback-1: NEON (Advanced SIMD) is the mandatory feature on AArch64 and
SVE vector registers share the lower 128 bits with NEON registers. For
some cases, even when SVE is enabled, we still prefer to match NEON
rules and emit NEON instructions.
Drawback-2: With more and more vector rules added to support VectorAPI,
there are lots of rules in both two ad files with different predication
conditions, e.g., different values of UseSVE or vector type/size.
Examples can be found in [1]. These two drawbacks make the code less
maintainable and increase the libjvm.so code size.
KEY UPDATES
In this patch, we mainly do two things, using generic vReg to match all
NEON/SVE vector registers and merging NEON/SVE matching rules.
Update-1: Use generic vReg to match all NEON/SVE vector registers
Two different approaches were considered, and we prefer to use generic
vector solution but keep VecA operand for all >128-bit vectors. See the
last slide in [1]. All the changes lie in the AArch64 backend.
1) Some helpers are updated in aarch64.ad to enable generic vector on
AArch64. See vector_ideal_reg(), pd_specialize_generic_vector_operand(),
is_reg2reg_move() and is_generic_vector().
2) Operand vecA is created to match VecA register, and vReg is updated
to match VecA/D/X registers dynamically.
With the introduction of generic vReg, difference in register types
between NEON rules and SVE rules can be eliminated, which makes it easy
to merge these rules.
Update-2: Try to merge existing rules
As updated in GensrcAdlc.gmk, new ad file, aarch64_vector.ad, is
introduced to hold the grouped and merged matching rules.
1) Similar rules with difference in vector type/size can be merged into
new rules, where different types and vector sizes are handled in the
codegen part, e.g., vadd(). This resolves Drawback-2.
2) In most cases, we tend to emit NEON instructions for 128-bit vector
operations on SVE platforms, e.g., vadd(). This resolves Drawback-1.
It's important to note that there are some exceptions.
Exception-1: For some rules, there are no direct NEON instructions, but
exists simple SVE implementation due to newly added SVE ISA. Such rules
include vloadconB, vmulL_neon, vminL_neon, vmaxL_neon,
reduce_addF_le128b (4F case), reduce_and/or/xor_neon, reduce_minL_neon,
reduce_maxL_neon, vcvtLtoF_neon, vcvtDtoI_neon, rearrange_HS_neon.
Exception-2: Vector mask generation and operation rules are different
because vector mask is stored in different types of registers between
NEON and SVE, e.g., vmaskcmp_neon and vmask_truecount_neon rules.
Exception-3: Shift right related rules are different because vector
shift right instructions differ a bit between NEON and SVE.
For these exceptions, we emit NEON or SVE code simply based on UseSVE
options.
MINOR UPDATES and CODE REFACTORING
Since we've touched all lines of code during merging rules, we further
do more minor updates and refactoring.
1. Reduce regmask bits
Stack slot alignment is handled specially for scalable vector, which
will firstly align to SlotsPerVecA, and then align to the real vector
length. We should guarantee SlotsPerVecA is no bigger than the real
vector length. Otherwise, unused stack space would be allocated.
In AArch64 SVE, the vector length can be 128 to 2048 bits. However,
SlotsPerVecA is currently set as 8, i.e. 8 * 32 = 256 bits. As a result,
on a 128-bit SVE platform, the stack slot is aligned to 256 bits,
leaving the half 128 bits unused. In this patch, we reduce SlotsPerVecA
from 8 to 4.
See the updates in register_aarch64.hpp, regmask.hpp and aarch64.ad
(chunk1 and vectora_reg).
2. Refactor NEON/SVE vector op support check.
Merge NEON and SVE vector supported check into one single function. To
be consistent, SVE default size supported check now is relaxed from no
less than 64 bits to the same condition as NEON's min_vector_size(),
i.e. 4 bytes and 4/2 booleans are also supported. This should be fine,
as we assume at least we will emit NEON code for those small vectors,
with unified rules.
3. Some notes for new rules
1) Since new rules are unique and it makes no sense to set different
"ins_cost", we turn to use the default cost.
2) By default we use "pipe_slow" for matching rules in aarch64_vector.ad
now. Hence, many SIMD pipeline classes at aarch64.ad become unused and
can be removed.
3) Suffixes '_le128b/_gt128b' and '_neon/_sve' are appended in the
matching rule names if needed.
a) 'le128b' means the vector length is less than or equal to 128 bits.
This rule can be matched on both NEON and 128-bit SVE.
b) 'gt128b' means the vector length is greater than 128 bits. This rule
can only be matched on SVE.
c) 'neon' means this rule can only be matched on NEON, i.e. the
generated instruction is not better than those in 128-bit SVE.
d) 'sve' means this rule is only matched on SVE for all possible vector
length, i.e. not limited to gt128b.
Note-1: m4 file is not introduced because many duplications are highly
reduced now.
Note-2: We guess the code review for this big patch would probably take
some time and we may need to merge latest code from master branch from
time to time. We prefer to keep aarch64_neon/sve.ad and the
corresponding m4 files for easy comparison and review. Of course, they
will be finally removed after some solid reviews before integration.
Note-3: Several other minor refactorings are done in this patch, but we
cannot list all of them in the commit message. We have reviewed and
tested the rules carefully to guarantee the quality.
TESTING
1) Cross compilations on arm32/s390/pps/riscv passed.
2) tier1~3 jtreg passed on both x64 and aarch64 machines.
3) vector tests: all the test cases under the following directories can
pass on both NEON and SVE systems with max vector length 16/32/64 bytes.
"test/hotspot/jtreg/compiler/vectorapi/"
"test/jdk/jdk/incubator/vector/"
"test/hotspot/jtreg/compiler/vectorization/"
4) Performance evaluation: we choose vector micro-benchmarks from
panama-vector:vectorIntrinsics [2] to evaluate the performance of this
patch. We've tested *MaxVectorTests.java cases on one 128-bit SVE
platform and one NEON platform, and didn't see any visiable regression
with NEON and SVE. We will continue to verify more cases on other
platforms with NEON and different SVE vector sizes.
BENEFITS
The number of matching rules is reduced to ~ 42%.
before: 373 (aarch64_neon.ad) + 380 (aarch64_sve.ad) = 753
after : 313(aarch64_vector.ad)
Code size for libjvm.so (release build) on aarch64 is reduced to ~ 96%.
before: 25246528 B (commit 7905788e969)
after : 24208776 B (nearly 1 MB reduction)
[1] http://cr.openjdk.java.net/~njian/8285790/JDK-8285790.pdf
[2] https://github.com/openjdk/panama-vector/tree/vectorIntrinsics/test/micro/org/openjdk/bench/jdk/incubator/vector/operation
Co-Developed-by: Ningsheng Jian <Ningsheng.Jian at arm.com>
Co-Developed-by: Eric Liu <eric.c.liu at arm.com>
-------------
Changes: https://git.openjdk.org/jdk/pull/9346/files
Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=9346&range=06
Stats: 30563 lines in 19 files changed: 11618 ins; 18822 del; 123 mod
Patch: https://git.openjdk.org/jdk/pull/9346.diff
Fetch: git fetch https://git.openjdk.org/jdk pull/9346/head:pull/9346
PR: https://git.openjdk.org/jdk/pull/9346
More information about the hotspot-compiler-dev
mailing list