From xgong at openjdk.org Fri Jul 1 01:23:40 2022
From: xgong at openjdk.org (Xiaohong Gong)
Date: Fri, 1 Jul 2022 01:23:40 GMT
Subject: Integrated: 8288294: [vector] Add Identity/Ideal transformations for vector logic operations
In-Reply-To:
References:
Message-ID:

On Mon, 20 Jun 2022 07:50:09 GMT, Xiaohong Gong wrote:

> This patch adds the following transformations for vector logic operations such as "`AndV, OrV, XorV`", including:
>
> (AndV v (Replicate m1)) => v
> (AndV v (Replicate zero)) => Replicate zero
> (AndV v v) => v
>
> (OrV v (Replicate m1)) => Replicate m1
> (OrV v (Replicate zero)) => v
> (OrV v v) => v
>
> (XorV v v) => Replicate zero
>
> where "`m1`" is the integer constant -1, together with the same optimizations for vector mask operations like "`AndVMask, OrVMask, XorVMask`".

This pull request has now been integrated.

Changeset: 124c63c1
Author: Xiaohong Gong
URL: https://git.openjdk.org/jdk/commit/124c63c17c897404e3c5c3615d6727303e4f3d06
Stats: 639 lines in 4 files changed: 629 ins; 0 del; 10 mod

8288294: [vector] Add Identity/Ideal transformations for vector logic operations

Reviewed-by: kvn, jbhateja

-------------

PR: https://git.openjdk.org/jdk/pull/9211

From xgong at openjdk.org Fri Jul 1 02:42:46 2022
From: xgong at openjdk.org (Xiaohong Gong)
Date: Fri, 1 Jul 2022 02:42:46 GMT
Subject: RFR: 8286941: Add mask IR for partial vector operations for ARM SVE [v8]
In-Reply-To:
References:
Message-ID:

On Thu, 30 Jun 2022 17:51:34 GMT, Vladimir Kozlov wrote:

> Another test failed in tier2: compiler/loopopts/superword/TestPickFirstMemoryState.java
> Details are in RFE.

Thanks for the tests again! It seems this is the same issue as https://github.com/openjdk/jdk/pull/2867, where the type of `in(2)` is `Type::TOP`. We need to save the vector type for `ReductionNode`. I will fix it soon!
------------- PR: https://git.openjdk.org/jdk/pull/9037 From thartmann at openjdk.org Fri Jul 1 05:27:31 2022 From: thartmann at openjdk.org (Tobias Hartmann) Date: Fri, 1 Jul 2022 05:27:31 GMT Subject: [jdk19] Integrated: 8284358: Unreachable loop is not removed from C2 IR, leading to a broken graph In-Reply-To: References: Message-ID: On Wed, 29 Jun 2022 15:16:03 GMT, Tobias Hartmann wrote: > Similar to https://github.com/openjdk/jdk/pull/425 and https://github.com/openjdk/jdk/pull/649, entry control to a loop `RegionNode` dies right after parsing (during first IGVN) but the dead loop is not detected/removed. This dead loop then keeps a subgraph alive, which leads to two different failures in later optimization phases that are described below. > > I assumed that such dead loops should always be detected, but to avoid a full reachability analysis (graph walk to root), C2 only detects and removes "unsafe" dead loops, i.e., dead loops that might cause issues for later optimization phases and should therefore be aggressively removed. See `RegionNode::Ideal` -> `RegionNode::is_unreachable_region` -> `RegionNode::is_possible_unsafe_loop`: > > https://github.com/openjdk/jdk19/blob/dbc6e110100aa6aaa8493158312030b84152b33a/src/hotspot/share/opto/cfgnode.cpp#L541-L549 > > https://github.com/openjdk/jdk19/blob/dbc6e110100aa6aaa8493158312030b84152b33a/src/hotspot/share/opto/cfgnode.cpp#L327-L331 > > Here is a detailed description of the two failures and the corresponding fixes: > > 1) `No reachable node should have no use` assert at the end of optimizations (introduced by [JDK-8263577](https://bugs.openjdk.org/browse/JDK-8263577)): > > At the beginning of CCP, the types of all nodes are initialized to `top`. 
Since the following subgraph is not reachable from root due to a dead loop above in the CFG, the types of all unreachable nodes remain top: > ![1_BeforeCCP](https://user-images.githubusercontent.com/5312595/176446327-e6fdee4d-49ea-4406-9b15-b29366cd9f55.png) > > The `Rethrow`, `Phis` and `Region` are removed during IGVN because they are `top` but the `292 CatchProj` remains: > > ![3_BarrierExpand](https://user-images.githubusercontent.com/5312595/176446385-0374b6ba-7c0b-447d-90f9-c73e3aee4918.png) > > We then hit the assert because the `CatchProj` has no user. Similar to how https://github.com/openjdk/jdk/pull/3012 was fixed, we need to make sure that when `RegionNode` inputs are cut off because their types are `top`, they are added to the IGVN worklist (see change in `cfgnode.cpp:504`). With that, the entire dead subgraph is removed. > > 2) `Unknown node on this path` assert while walking the memory graph during scalar replacement: > > After parsing, the `167 Region` that belongs to a loop loses entry control (marked in red): > ![2_Diff_Parsing_IGVN](https://user-images.githubusercontent.com/5312595/176453465-95f48c16-6cb7-4373-baa8-edf5e4fbcde2.png) > > The dead loop is not detected/removed because it's not considered "unsafe" since the Phis of the dying Region only have a Call user which is considered safe: > > https://github.com/openjdk/jdk19/blob/dbc6e110100aa6aaa8493158312030b84152b33a/src/hotspot/share/opto/cfgnode.cpp#L352-L355 > > ![DyingRegion](https://user-images.githubusercontent.com/5312595/176469880-f81a7d7e-b769-444a-bf5b-14f8cca1f9af.png) > > The same can happen with other CFG users (for example, MemBars or Allocates). These scenarios are also covered by the regression test. 
Later during IGVN, `309 Region` which is part of the now dead subgraph is processed and found to be potentially "unsafe" and unreachable from root: > > ![1_AfterParsing](https://user-images.githubusercontent.com/5312595/176453110-8a4a587f-f1ef-45bf-8a68-e476f142aa7e.png) > > It's then removed together with its Phi users, leaving `505 MergeMem` with a top memory input: > > ![3_MacroExpansion](https://user-images.githubusercontent.com/5312595/176461343-ab446fe0-04a8-48a5-95c2-c8ead6c872cf.png) > > We then hit the assert when encountering a top memory input while walking the memory graph during scalar replacement. > > The root cause of the failure is an only partially removed dead subgraph. A similar issue has been fixed long ago by [JDK-8075922](https://bugs.openjdk.org/browse/JDK-8075922), but the fix is incomplete. I propose to aggressively remove such dead subgraphs by walking up the CFG when detecting an unreachable Region belonging to an "unsafe" loop and replacing all nodes by `top`. > > Special thanks to Christian Hagedorn for helping me with finding a regression test. > > Thanks, > Tobias This pull request has now been integrated. 

Changeset: 95497772
Author: Tobias Hartmann
URL: https://git.openjdk.org/jdk19/commit/95497772e7207b5752e6ecace4a6686df2b45227
Stats: 289 lines in 2 files changed: 250 ins; 3 del; 36 mod

8284358: Unreachable loop is not removed from C2 IR, leading to a broken graph

Co-authored-by: Christian Hagedorn
Reviewed-by: kvn, chagedorn

-------------

PR: https://git.openjdk.org/jdk19/pull/92

From rrich at openjdk.org Fri Jul 1 06:15:37 2022
From: rrich at openjdk.org (Richard Reingruber)
Date: Fri, 1 Jul 2022 06:15:37 GMT
Subject: RFR: 8289434: x86_64: Improve comment on gen_continuation_enter()
In-Reply-To: <6cp6qV9w9mWrUZKkucomdNg9eHYUVj06unri-TFLlbY=.a793a9e6-9fd9-4451-8fb2-0c8df241e893@github.com>
References: <6cp6qV9w9mWrUZKkucomdNg9eHYUVj06unri-TFLlbY=.a793a9e6-9fd9-4451-8fb2-0c8df241e893@github.com>
Message-ID:

On Wed, 29 Jun 2022 08:23:36 GMT, Richard Reingruber wrote:

> Change code comments for `gen_continuation_enter()` explaining that the generated code will call `Continuation.enter(Continuation c, boolean isContinue)` if the continuation given as first parameter is run for the first time.
>
> Also mention the special case for resolving this call.

Thanks for reviewing!

-------------

PR: https://git.openjdk.org/jdk/pull/9320

From rrich at openjdk.org Fri Jul 1 06:15:38 2022
From: rrich at openjdk.org (Richard Reingruber)
Date: Fri, 1 Jul 2022 06:15:38 GMT
Subject: Integrated: 8289434: x86_64: Improve comment on gen_continuation_enter()
In-Reply-To: <6cp6qV9w9mWrUZKkucomdNg9eHYUVj06unri-TFLlbY=.a793a9e6-9fd9-4451-8fb2-0c8df241e893@github.com>
References: <6cp6qV9w9mWrUZKkucomdNg9eHYUVj06unri-TFLlbY=.a793a9e6-9fd9-4451-8fb2-0c8df241e893@github.com>
Message-ID:

On Wed, 29 Jun 2022 08:23:36 GMT, Richard Reingruber wrote:

> Change code comments for `gen_continuation_enter()` explaining that the generated code will call `Continuation.enter(Continuation c, boolean isContinue)` if the continuation given as first parameter is run for the first time.
> > Also mention the special case for resolving this call. This pull request has now been integrated. Changeset: d260a4e7 Author: Richard Reingruber URL: https://git.openjdk.org/jdk/commit/d260a4e794681c6f4be4767350702754cfc2035c Stats: 6 lines in 1 file changed: 3 ins; 0 del; 3 mod 8289434: x86_64: Improve comment on gen_continuation_enter() Reviewed-by: kvn ------------- PR: https://git.openjdk.org/jdk/pull/9320 From haosun at openjdk.org Fri Jul 1 10:46:17 2022 From: haosun at openjdk.org (Hao Sun) Date: Fri, 1 Jul 2022 10:46:17 GMT Subject: RFR: 8285790: AArch64: Merge C2 NEON and SVE matching rules Message-ID: **MOTIVATION** This is a big refactoring patch of merging rules in aarch64_sve.ad and aarch64_neon.ad. The motivation can also be found at [1]. Currently AArch64 has aarch64_sve.ad and aarch64_neon.ad to support SVE and NEON codegen respectively. 1) For SVE rules we use vReg operand to match VecA for an arbitrary length of vector type, when SVE is enabled; 2) For NEON rules we use vecX/vecD operands to match VecX/VecD for 128-bit/64-bit vectors, when SVE is not enabled. This separation looked clean at the time of introducing SVE support. However, there are two main drawbacks now. **Drawback-1**: NEON (Advanced SIMD) is the mandatory feature on AArch64 and SVE vector registers share the lower 128 bits with NEON registers. For some cases, even when SVE is enabled, we still prefer to match NEON rules and emit NEON instructions. **Drawback-2**: With more and more vector rules added to support VectorAPI, there are lots of rules in both two ad files with different predication conditions, e.g., different values of UseSVE or vector type/size. Examples can be found in [1]. These two drawbacks make the code less maintainable and increase the libjvm.so code size. **KEY UPDATES** In this patch, we mainly do two things, using generic vReg to match all NEON/SVE vector registers and merging NEON/SVE matching rules. 
- Update-1: Use generic vReg to match all NEON/SVE vector registers Two different approaches were considered, and we prefer to use generic vector solution but keep VecA operand for all >128-bit vectors. See the last slide in [1]. All the changes lie in the AArch64 backend. 1) Some helpers are updated in aarch64.ad to enable generic vector on AArch64. See vector_ideal_reg(), pd_specialize_generic_vector_operand(), is_reg2reg_move() and is_generic_vector(). 2) Operand vecA is created to match VecA register, and vReg is updated to match VecA/D/X registers dynamically. With the introduction of generic vReg, difference in register types between NEON rules and SVE rules can be eliminated, which makes it easy to merge these rules. - Update-2: Try to merge existing rules As updated in GensrcAdlc.gmk, new ad file, aarch64_vector.ad, is introduced to hold the grouped and merged matching rules. 1) Similar rules with difference in vector type/size can be merged into new rules, where different types and vector sizes are handled in the codegen part, e.g., vadd(). This resolves **Drawback-2**. 2) In most cases, we tend to emit NEON instructions for 128-bit vector operations on SVE platforms, e.g., vadd(). This resolves **Drawback-1**. It's important to note that there are some exceptions. Exception-1: For some rules, there are no direct NEON instructions, but exists simple SVE implementation due to newly added SVE ISA. Such rules include vloadconB, vmulL_neon, vminL_neon, vmaxL_neon, reduce_addF_le128b (4F case), reduce_and/or/xor_neon, reduce_minL_neon, reduce_maxL_neon, vcvtLtoF_neon, vcvtDtoI_neon, rearrange_HS_neon. Exception-2: Vector mask generation and operation rules are different because vector mask is stored in different types of registers between NEON and SVE, e.g., vmaskcmp_neon and vmask_truecount_neon rules. Exception-3: Shift right related rules are different because vector shift right instructions differ a bit between NEON and SVE. 
For these exceptions, we emit NEON or SVE code simply based on UseSVE options. **MINOR UPDATES and CODE REFACTORING** Since we've touched all lines of code during merging rules, we further do more minor updates and refactoring. - Reduce regmask bits Stack slot alignment is handled specially for scalable vector, which will firstly align to SlotsPerVecA, and then align to the real vector length. We should guarantee SlotsPerVecA is no bigger than the real vector length. Otherwise, unused stack space would be allocated. In AArch64 SVE, the vector length can be 128 to 2048 bits. However, SlotsPerVecA is currently set as 8, i.e. 8 * 32 = 256 bits. As a result, on a 128-bit SVE platform, the stack slot is aligned to 256 bits, leaving the half 128 bits unused. In this patch, we reduce SlotsPerVecA from 8 to 4. See the updates in register_aarch64.hpp, regmask.hpp and aarch64.ad (chunk1 and vectora_reg). - Refactor NEON/SVE vector op support check. Merge NEON and SVE vector supported check into one single function. To be consistent, SVE default size supported check now is relaxed from no less than 64 bits to the same condition as NEON's min_vector_size(), i.e. 4 bytes and 4/2 booleans are also supported. This should be fine, as we assume at least we will emit NEON code for those small vectors, with unified rules. - Some notes for new rules 1) Since new rules are unique and it makes no sense to set different "ins_cost", we turn to use the default cost. 2) By default we use "pipe_slow" for matching rules in aarch64_vector.ad now. Hence, many SIMD pipeline classes at aarch64.ad become unused and can be removed. 3) Suffixes '_le128b/_gt128b' and '_neon/_sve' are appended in the matching rule names if needed. a) 'le128b' means the vector length is less than or equal to 128 bits. This rule can be matched on both NEON and 128-bit SVE. b) 'gt128b' means the vector length is greater than 128 bits. This rule can only be matched on SVE. 
c) 'neon' means this rule can only be matched on NEON, i.e. the
generated instruction is not better than those in 128-bit SVE.
d) 'sve' means this rule is only matched on SVE for all possible vector
length, i.e. not limited to gt128b.

Note-1: m4 file is not introduced because many duplications are highly
reduced now.
Note-2: We guess the code review for this big patch would probably take
some time and we may need to merge latest code from master branch from
time to time. We prefer to keep aarch64_neon/sve.ad and the
corresponding m4 files for easy comparison and review. Of course, they
will be finally removed after some solid reviews before integration.
Note-3: Several other minor refactorings are done in this patch, but we
cannot list all of them in the commit message. We have reviewed and
tested the rules carefully to guarantee the quality.

**TESTING**

1) Cross compilations on arm32/s390/pps/riscv passed.
2) tier1~3 jtreg passed on both x64 and aarch64 machines.
3) vector tests: all the test cases under the following directories can
pass on both NEON and SVE systems with max vector length 16/32/64 bytes.

"test/hotspot/jtreg/compiler/vectorapi/"
"test/jdk/jdk/incubator/vector/"
"test/hotspot/jtreg/compiler/vectorization/"

4) Performance evaluation: we choose vector micro-benchmarks from
panama-vector:vectorIntrinsics [2] to evaluate the performance of this
patch. We've tested *MaxVectorTests.java cases on one 128-bit SVE
platform and one NEON platform, and didn't see any visible regression
with NEON and SVE. We will continue to verify more cases on other
platforms with NEON and different SVE vector sizes.

**BENEFITS**

The number of matching rules is reduced to ~ **42%**.
before: 373 (aarch64_neon.ad) + 380 (aarch64_sve.ad) = 753
after : 313 (aarch64_vector.ad)

Code size for libjvm.so (release build) on aarch64 is reduced to ~ **96%**.
before: 25246528 B (commit 7905788e969) after : 24208776 B (**nearly 1 MB reduction**) [1] http://cr.openjdk.java.net/~njian/8285790/JDK-8285790.pdf [2] https://github.com/openjdk/panama-vector/tree/vectorIntrinsics/test/micro/org/openjdk/bench/jdk/incubator/vector/operation Co-Developed-by: Ningsheng Jian Co-Developed-by: Eric Liu ------------- Commit messages: - 8285790: AArch64: Merge C2 NEON and SVE matching rules Changes: https://git.openjdk.org/jdk/pull/9346/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=9346&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8285790 Stats: 7151 lines in 12 files changed: 6454 ins; 576 del; 121 mod Patch: https://git.openjdk.org/jdk/pull/9346.diff Fetch: git fetch https://git.openjdk.org/jdk pull/9346/head:pull/9346 PR: https://git.openjdk.org/jdk/pull/9346 From aph at openjdk.org Fri Jul 1 11:29:42 2022 From: aph at openjdk.org (Andrew Haley) Date: Fri, 1 Jul 2022 11:29:42 GMT Subject: RFR: 8285790: AArch64: Merge C2 NEON and SVE matching rules In-Reply-To: References: Message-ID: <_l-poEFY80LRmKZyRhpxvSohm6nLv_ruaO1_WzKmTlQ=.9faff21d-d67c-42e0-8de7-be2ca9397b88@github.com> On Fri, 1 Jul 2022 10:36:36 GMT, Hao Sun wrote: > **MOTIVATION** > > This is a big refactoring patch of merging rules in aarch64_sve.ad and > aarch64_neon.ad. The motivation can also be found at [1]. > > Currently AArch64 has aarch64_sve.ad and aarch64_neon.ad to support SVE > and NEON codegen respectively. 1) For SVE rules we use vReg operand to > match VecA for an arbitrary length of vector type, when SVE is enabled; > 2) For NEON rules we use vecX/vecD operands to match VecX/VecD for > 128-bit/64-bit vectors, when SVE is not enabled. > > This separation looked clean at the time of introducing SVE support. > However, there are two main drawbacks now. > > **Drawback-1**: NEON (Advanced SIMD) is the mandatory feature on AArch64 and > SVE vector registers share the lower 128 bits with NEON registers. 
For > some cases, even when SVE is enabled, we still prefer to match NEON > rules and emit NEON instructions. > > **Drawback-2**: With more and more vector rules added to support VectorAPI, > there are lots of rules in both two ad files with different predication > conditions, e.g., different values of UseSVE or vector type/size. > > Examples can be found in [1]. These two drawbacks make the code less > maintainable and increase the libjvm.so code size. > > **KEY UPDATES** > > In this patch, we mainly do two things, using generic vReg to match all > NEON/SVE vector registers and merging NEON/SVE matching rules. > > - Update-1: Use generic vReg to match all NEON/SVE vector registers > > Two different approaches were considered, and we prefer to use generic > vector solution but keep VecA operand for all >128-bit vectors. See the > last slide in [1]. All the changes lie in the AArch64 backend. > > 1) Some helpers are updated in aarch64.ad to enable generic vector on > AArch64. See vector_ideal_reg(), pd_specialize_generic_vector_operand(), > is_reg2reg_move() and is_generic_vector(). > > 2) Operand vecA is created to match VecA register, and vReg is updated > to match VecA/D/X registers dynamically. > > With the introduction of generic vReg, difference in register types > between NEON rules and SVE rules can be eliminated, which makes it easy > to merge these rules. > > - Update-2: Try to merge existing rules > > As updated in GensrcAdlc.gmk, new ad file, aarch64_vector.ad, is > introduced to hold the grouped and merged matching rules. > > 1) Similar rules with difference in vector type/size can be merged into > new rules, where different types and vector sizes are handled in the > codegen part, e.g., vadd(). This resolves **Drawback-2**. > > 2) In most cases, we tend to emit NEON instructions for 128-bit vector > operations on SVE platforms, e.g., vadd(). This resolves **Drawback-1**. > > It's important to note that there are some exceptions. 
> > Exception-1: For some rules, there are no direct NEON instructions, but > exists simple SVE implementation due to newly added SVE ISA. Such rules > include vloadconB, vmulL_neon, vminL_neon, vmaxL_neon, > reduce_addF_le128b (4F case), reduce_and/or/xor_neon, reduce_minL_neon, > reduce_maxL_neon, vcvtLtoF_neon, vcvtDtoI_neon, rearrange_HS_neon. > > Exception-2: Vector mask generation and operation rules are different > because vector mask is stored in different types of registers between > NEON and SVE, e.g., vmaskcmp_neon and vmask_truecount_neon rules. > > Exception-3: Shift right related rules are different because vector > shift right instructions differ a bit between NEON and SVE. > > For these exceptions, we emit NEON or SVE code simply based on UseSVE > options. > > **MINOR UPDATES and CODE REFACTORING** > > Since we've touched all lines of code during merging rules, we further > do more minor updates and refactoring. > > - Reduce regmask bits > > Stack slot alignment is handled specially for scalable vector, which > will firstly align to SlotsPerVecA, and then align to the real vector > length. We should guarantee SlotsPerVecA is no bigger than the real > vector length. Otherwise, unused stack space would be allocated. > > In AArch64 SVE, the vector length can be 128 to 2048 bits. However, > SlotsPerVecA is currently set as 8, i.e. 8 * 32 = 256 bits. As a result, > on a 128-bit SVE platform, the stack slot is aligned to 256 bits, > leaving the half 128 bits unused. In this patch, we reduce SlotsPerVecA > from 8 to 4. > > See the updates in register_aarch64.hpp, regmask.hpp and aarch64.ad > (chunk1 and vectora_reg). > > - Refactor NEON/SVE vector op support check. > > Merge NEON and SVE vector supported check into one single function. To > be consistent, SVE default size supported check now is relaxed from no > less than 64 bits to the same condition as NEON's min_vector_size(), > i.e. 4 bytes and 4/2 booleans are also supported. 
This should be fine,
> as we assume at least we will emit NEON code for those small vectors,
> with unified rules.
>
> - Some notes for new rules
>
> 1) Since new rules are unique and it makes no sense to set different
> "ins_cost", we turn to use the default cost.
>
> 2) By default we use "pipe_slow" for matching rules in aarch64_vector.ad
> now. Hence, many SIMD pipeline classes at aarch64.ad become unused and
> can be removed.
>
> 3) Suffixes '_le128b/_gt128b' and '_neon/_sve' are appended in the
> matching rule names if needed.
> a) 'le128b' means the vector length is less than or equal to 128 bits.
> This rule can be matched on both NEON and 128-bit SVE.
> b) 'gt128b' means the vector length is greater than 128 bits. This rule
> can only be matched on SVE.
> c) 'neon' means this rule can only be matched on NEON, i.e. the
> generated instruction is not better than those in 128-bit SVE.
> d) 'sve' means this rule is only matched on SVE for all possible vector
> length, i.e. not limited to gt128b.
>
> Note-1: m4 file is not introduced because many duplications are highly
> reduced now.
> Note-2: We guess the code review for this big patch would probably take
> some time and we may need to merge latest code from master branch from
> time to time. We prefer to keep aarch64_neon/sve.ad and the
> corresponding m4 files for easy comparison and review. Of course, they
> will be finally removed after some solid reviews before integration.
> Note-3: Several other minor refactorings are done in this patch, but we
> cannot list all of them in the commit message. We have reviewed and
> tested the rules carefully to guarantee the quality.
>
> **TESTING**
>
> 1) Cross compilations on arm32/s390/pps/riscv passed.
> 2) tier1~3 jtreg passed on both x64 and aarch64 machines.
> 3) vector tests: all the test cases under the following directories can
> pass on both NEON and SVE systems with max vector length 16/32/64 bytes.
>
> "test/hotspot/jtreg/compiler/vectorapi/"
> "test/jdk/jdk/incubator/vector/"
> "test/hotspot/jtreg/compiler/vectorization/"
>
> 4) Performance evaluation: we choose vector micro-benchmarks from
> panama-vector:vectorIntrinsics [2] to evaluate the performance of this
> patch. We've tested *MaxVectorTests.java cases on one 128-bit SVE
> platform and one NEON platform, and didn't see any visible regression
> with NEON and SVE. We will continue to verify more cases on other
> platforms with NEON and different SVE vector sizes.
>
> **BENEFITS**
>
> The number of matching rules is reduced to ~ **42%**.
> before: 373 (aarch64_neon.ad) + 380 (aarch64_sve.ad) = 753
> after : 313 (aarch64_vector.ad)
>
> Code size for libjvm.so (release build) on aarch64 is reduced to ~ **96%**.
> before: 25246528 B (commit 7905788e969)
> after : 24208776 B (**nearly 1 MB reduction**)
>
> [1] http://cr.openjdk.java.net/~njian/8285790/JDK-8285790.pdf
> [2] https://github.com/openjdk/panama-vector/tree/vectorIntrinsics/test/micro/org/openjdk/bench/jdk/incubator/vector/operation
>
> Co-Developed-by: Ningsheng Jian
> Co-Developed-by: Eric Liu

Aha! I was looking forward to this.

On 7/1/22 11:46, Hao Sun wrote:

> Note-1: m4 file is not introduced because many duplications are highly
> reduced now.

Yes, but there's still a lot of duplication. I'll make a few examples of where you should make simple changes that will usefully increase the level of abstraction. That will be a start.

-------------

PR: https://git.openjdk.org/jdk/pull/9346

From lucy at openjdk.org Fri Jul 1 13:34:40 2022
From: lucy at openjdk.org (Lutz Schmidt)
Date: Fri, 1 Jul 2022 13:34:40 GMT
Subject: RFR: JDK-8289512: Fix GCC 12 warnings for adlc output_c.cpp [v2]
In-Reply-To:
References:
Message-ID:

On Thu, 30 Jun 2022 15:21:36 GMT, Thomas Stuefe wrote:

>> This fixes three warnings in my gcc 12 build on Ubuntu 22.04.
> > Thomas Stuefe has updated the pull request incrementally with one additional commit since the last revision: > > assert sprint did not overflow Changes look good to me. Thank you for investing your time. ------------- Marked as reviewed by lucy (Reviewer). PR: https://git.openjdk.org/jdk/pull/9335 From stuefe at openjdk.org Fri Jul 1 13:47:13 2022 From: stuefe at openjdk.org (Thomas Stuefe) Date: Fri, 1 Jul 2022 13:47:13 GMT Subject: Integrated: JDK-8289512: Fix GCC 12 warnings for adlc output_c.cpp In-Reply-To: References: Message-ID: On Thu, 30 Jun 2022 13:51:00 GMT, Thomas Stuefe wrote: > This fixes three warnings in my gcc 12 build on Ubuntu 22.04. This pull request has now been integrated. Changeset: a8fe2d97 Author: Thomas Stuefe URL: https://git.openjdk.org/jdk/commit/a8fe2d97a2ea1d3ce70d6095740c4ac7ec113761 Stats: 13 lines in 1 file changed: 1 ins; 3 del; 9 mod 8289512: Fix GCC 12 warnings for adlc output_c.cpp Reviewed-by: kvn, lucy ------------- PR: https://git.openjdk.org/jdk/pull/9335 From stuefe at openjdk.org Fri Jul 1 13:47:12 2022 From: stuefe at openjdk.org (Thomas Stuefe) Date: Fri, 1 Jul 2022 13:47:12 GMT Subject: RFR: JDK-8289512: Fix GCC 12 warnings for adlc output_c.cpp [v2] In-Reply-To: <8Ldc39Cx8nywg-ioyszJd6avqnpMne3GduVVW7rhsOA=.c3d6fa42-3a30-44cd-812f-732f4a34b479@github.com> References: <8Ldc39Cx8nywg-ioyszJd6avqnpMne3GduVVW7rhsOA=.c3d6fa42-3a30-44cd-812f-732f4a34b479@github.com> Message-ID: On Thu, 30 Jun 2022 16:04:43 GMT, Vladimir Kozlov wrote: >> Thomas Stuefe has updated the pull request incrementally with one additional commit since the last revision: >> >> assert sprint did not overflow > > Good. Thanks @vnkozlov and @RealLucy ! 
------------- PR: https://git.openjdk.org/jdk/pull/9335 From coleenp at openjdk.org Fri Jul 1 14:12:18 2022 From: coleenp at openjdk.org (Coleen Phillimore) Date: Fri, 1 Jul 2022 14:12:18 GMT Subject: RFR: 8278479: RunThese test failure with +UseHeavyMonitors and +VerifyHeavyMonitors Message-ID: This change adds a null check before calling into Runtime1::monitorenter when -XX:+UseHeavyMonitors is set. There's a null check in the C2 and interpreter code before calling the runtime function but not C1. Tested with tier1-7 (a little of 8) and built on most non-oracle platforms as well. ------------- Commit messages: - Fix aarch64 overloading to get the right null check. - Revert UseHeavyMonitors setting - 8278479: RunThese test failure with +UseHeavyMonitors and +VerifyHeavyMonitors Changes: https://git.openjdk.org/jdk/pull/9339/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=9339&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8278479 Stats: 30 lines in 6 files changed: 30 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/9339.diff Fetch: git fetch https://git.openjdk.org/jdk pull/9339/head:pull/9339 PR: https://git.openjdk.org/jdk/pull/9339 From kvn at openjdk.org Fri Jul 1 14:27:41 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Fri, 1 Jul 2022 14:27:41 GMT Subject: RFR: 8278479: RunThese test failure with +UseHeavyMonitors and +VerifyHeavyMonitors In-Reply-To: References: Message-ID: <_2SqD62pVJuIWf15noVFV6R7fbvYiDIJCYynMk7HQjY=.eb590ee8-5c79-45a9-9a88-5fc369e7e6eb@github.com> On Thu, 30 Jun 2022 22:05:14 GMT, Coleen Phillimore wrote: > This change adds a null check before calling into Runtime1::monitorenter when -XX:+UseHeavyMonitors is set. There's a null check in the C2 and interpreter code before calling the runtime function but not C1. > > Tested with tier1-7 (a little of 8) and built on most non-oracle platforms as well. Looks good. ------------- Marked as reviewed by kvn (Reviewer). 
PR: https://git.openjdk.org/jdk/pull/9339 From dcubed at openjdk.org Fri Jul 1 14:40:27 2022 From: dcubed at openjdk.org (Daniel D. Daugherty) Date: Fri, 1 Jul 2022 14:40:27 GMT Subject: RFR: 8278479: RunThese test failure with +UseHeavyMonitors and +VerifyHeavyMonitors In-Reply-To: References: Message-ID: <7-KXGbS6HnVh8o-0poiuIuqFjokBmquhkSS7EhpA2ns=.ca2a1ff9-e62e-466e-8626-63a8e6bac351@github.com> On Thu, 30 Jun 2022 22:05:14 GMT, Coleen Phillimore wrote: > This change adds a null check before calling into Runtime1::monitorenter when -XX:+UseHeavyMonitors is set. There's a null check in the C2 and interpreter code before calling the runtime function but not C1. > > Tested with tier1-7 (a little of 8) and built on most non-oracle platforms as well. The changes look good, but it has been a long time since I've looked at C1 code so my opinion is rusty... Thanks for testing with Mach5 Tier[1-7]. Did you do any targeted testing with -XX:+UseHeavyMonitors and -XX:+VerifyHeavyMonitors with RunThese? ------------- Marked as reviewed by dcubed (Reviewer). PR: https://git.openjdk.org/jdk/pull/9339 From coleenp at openjdk.org Fri Jul 1 14:44:32 2022 From: coleenp at openjdk.org (Coleen Phillimore) Date: Fri, 1 Jul 2022 14:44:32 GMT Subject: RFR: 8278479: RunThese test failure with +UseHeavyMonitors and +VerifyHeavyMonitors In-Reply-To: References: Message-ID: <2mZ2Pao5Q4MCmv4L4zjtHrm7cHkz6zy-81dsGpFYxnE=.c3e38380-282a-48a0-ab68-c4b4f5f71e19@github.com> On Thu, 30 Jun 2022 22:05:14 GMT, Coleen Phillimore wrote: > This change adds a null check before calling into Runtime1::monitorenter when -XX:+UseHeavyMonitors is set. There's a null check in the C2 and interpreter code before calling the runtime function but not C1. > > Tested with tier1-7 (a little of 8) and built on most non-oracle platforms as well. Thanks Vladimir and Dan for the quick reviews. 
I ran all of the tiers with UseHeavyMonitors on, except I stopped the long running tests in tier8 because they would have failed right away if it was wrong. The test for NullPointerExceptionMessage directly exercises this code with the options. RunThese didn't reproduce this bug locally or with several non-local runs but the stack was the same as the NPE test.

-------------

PR: https://git.openjdk.org/jdk/pull/9339

From dlong at openjdk.org Fri Jul 1 20:22:47 2022
From: dlong at openjdk.org (Dean Long)
Date: Fri, 1 Jul 2022 20:22:47 GMT
Subject: RFR: 8278479: RunThese test failure with +UseHeavyMonitors and +VerifyHeavyMonitors
In-Reply-To:
References:
Message-ID: <2smVdrxSBqWvHWeHLSPI4T1MiXN5p-3WmeC6c5-4ERc=.f93f73ef-f5f5-4223-947f-503b924c1568@github.com>

On Thu, 30 Jun 2022 22:05:14 GMT, Coleen Phillimore wrote:

> This change adds a null check before calling into Runtime1::monitorenter when -XX:+UseHeavyMonitors is set. There's a null check in the C2 and interpreter code before calling the runtime function but not C1.
>
> Tested with tier1-7 (a little of 8) and built on most non-oracle platforms as well.

Marked as reviewed by dlong (Reviewer).

src/hotspot/cpu/aarch64/c1_LIRAssembler_aarch64.cpp line 2563:

> 2561: if (op->info() != NULL) {
> 2562: int null_check_offset = __ offset();
> 2563: __ null_check(obj, -1);

Suggestion:

    __ null_check(obj);

src/hotspot/cpu/arm/c1_LIRAssembler_arm.cpp line 2438:

> 2436: int null_check_offset = __ offset();
> 2437: __ null_check(obj);
> 2438: add_debug_info_for_null_check(null_check_offset, op->info());

Is this equivalent to the following?

    add_debug_info_for_null_check_here(op->info());
    __ null_check(obj);

-------------

PR: https://git.openjdk.org/jdk/pull/9339

From vlivanov at openjdk.org Fri Jul 1 23:01:44 2022
From: vlivanov at openjdk.org (Vladimir Ivanov)
Date: Fri, 1 Jul 2022 23:01:44 GMT
Subject: [jdk19] RFR: 8280320: C2: Loop opts are missing during OSR compilation [v2]
In-Reply-To:
References:
Message-ID:

> After [JDK-8272330](https://bugs.openjdk.org/browse/JDK-8272330), OSR compilations may completely miss loop optimizations pass due to misleading profiling data. The cleanup changed how profile counts are scaled and it had surprising effect on OSR compilations.
>
> For a long-running loop it's common to have an MDO allocated during the first invocation while running in the loop. Also, OSR compilation may be scheduled while running the very first method invocation. In such case, `MethodData::invocation_counter() == 0` while `MethodData::backedge_counter() > 0`. Before JDK-8272330 went in, `ciMethod::scale_count()` took into account both `invocation_counter()` and `backedge_counter()`. Now `MethodData::invocation_counter()` is taken by `ciMethod::scale_count()` as is and it forces all counts to be unconditionally scaled to `1`.
>
> It misleads `IdealLoopTree::beautify_loops()` to believe there are no hot
> backedges in the loop being compiled and `IdealLoopTree::split_outer_loop()`
> doesn't kick in thus effectively blocking any further loop optimizations.
>
> Proposed fix bumps `MethodData::invocation_counter()` from `0` to `1` and
> enables `ciMethod::scale_count()` to report sane numbers.
> > Testing: > - hs-tier1 - hs-tier4 Vladimir Ivanov has updated the pull request incrementally with one additional commit since the last revision: Improve comment ------------- Changes: - all: https://git.openjdk.org/jdk19/pull/38/files - new: https://git.openjdk.org/jdk19/pull/38/files/ce36c789..ffd6d78b Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk19&pr=38&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk19&pr=38&range=00-01 Stats: 6 lines in 1 file changed: 4 ins; 0 del; 2 mod Patch: https://git.openjdk.org/jdk19/pull/38.diff Fetch: git fetch https://git.openjdk.org/jdk19 pull/38/head:pull/38 PR: https://git.openjdk.org/jdk19/pull/38 From vlivanov at openjdk.org Fri Jul 1 23:01:45 2022 From: vlivanov at openjdk.org (Vladimir Ivanov) Date: Fri, 1 Jul 2022 23:01:45 GMT Subject: [jdk19] RFR: 8280320: C2: Loop opts are missing during OSR compilation In-Reply-To: References: Message-ID: On Fri, 17 Jun 2022 21:29:41 GMT, Vladimir Ivanov wrote: > After [JDK-8272330](https://bugs.openjdk.org/browse/JDK-8272330), OSR compilations may completely miss loop optimizations pass due to misleading profiling data. The cleanup changed how profile counts are scaled and it had surprising effect on OSR compilations. > > For a long-running loop it's common to have an MDO allocated during the first invocation while running in the loop. Also, OSR compilation may be scheduled while running the very first method invocation. In such case, `MethodData::invocation_counter() == 0` while `MethodData::backedge_counter() > 0`. Before JDK-8272330 went in, `ciMethod::scale_count()` took into account both `invocation_counter()` and `backedge_counter()`. Now `MethodData::invocation_counter()` is taken by `ciMethod::scale_count()` as is and it forces all counts to be unconditionally scaled to `1`. 
> > It misleads `IdealLoopTree::beautify_loops()` to believe there are no hot > backedges in the loop being compiled and `IdealLoopTree::split_outer_loop()` > doesn't kick in thus effectively blocking any further loop optimizations. > > Proposed fix bumps `MethodData::invocation_counter()` from `0` to `1` and > enables `ciMethod::scale_count()` to report sane numbers. > > Testing: > - hs-tier1 - hs-tier4 Thanks for the reviews, Tobias & Vladimir. I addressed your comments. ------------- PR: https://git.openjdk.org/jdk19/pull/38 From vlivanov at openjdk.org Fri Jul 1 23:01:46 2022 From: vlivanov at openjdk.org (Vladimir Ivanov) Date: Fri, 1 Jul 2022 23:01:46 GMT Subject: [jdk19] Integrated: 8280320: C2: Loop opts are missing during OSR compilation In-Reply-To: References: Message-ID: <3SH6Qad56H1Xg1AfYr9oJbs7lI6S1YPzy2WDs5Zb-BQ=.4c412e07-f734-4e73-b3cc-aaeb668fc2dd@github.com> On Fri, 17 Jun 2022 21:29:41 GMT, Vladimir Ivanov wrote: > After [JDK-8272330](https://bugs.openjdk.org/browse/JDK-8272330), OSR compilations may completely miss loop optimizations pass due to misleading profiling data. The cleanup changed how profile counts are scaled and it had surprising effect on OSR compilations. > > For a long-running loop it's common to have an MDO allocated during the first invocation while running in the loop. Also, OSR compilation may be scheduled while running the very first method invocation. In such case, `MethodData::invocation_counter() == 0` while `MethodData::backedge_counter() > 0`. Before JDK-8272330 went in, `ciMethod::scale_count()` took into account both `invocation_counter()` and `backedge_counter()`. Now `MethodData::invocation_counter()` is taken by `ciMethod::scale_count()` as is and it forces all counts to be unconditionally scaled to `1`. 
> > It misleads `IdealLoopTree::beautify_loops()` to believe there are no hot > backedges in the loop being compiled and `IdealLoopTree::split_outer_loop()` > doesn't kick in thus effectively blocking any further loop optimizations. > > Proposed fix bumps `MethodData::invocation_counter()` from `0` to `1` and > enables `ciMethod::scale_count()` to report sane numbers. > > Testing: > - hs-tier1 - hs-tier4 This pull request has now been integrated. Changeset: 99250140 Author: Vladimir Ivanov URL: https://git.openjdk.org/jdk19/commit/9925014035ed203ba42cce80a23730328bbe8a50 Stats: 8 lines in 1 file changed: 7 ins; 0 del; 1 mod 8280320: C2: Loop opts are missing during OSR compilation Reviewed-by: thartmann, iveresov ------------- PR: https://git.openjdk.org/jdk19/pull/38 From iveresov at openjdk.org Sat Jul 2 01:13:01 2022 From: iveresov at openjdk.org (Igor Veresov) Date: Sat, 2 Jul 2022 01:13:01 GMT Subject: [jdk19] RFR: 8245268: -Xcomp is missing from java launcher documentation Message-ID: Updated man pages from markdown sources. ------------- Commit messages: - Update man pages Changes: https://git.openjdk.org/jdk19/pull/103/files Webrev: https://webrevs.openjdk.org/?repo=jdk19&pr=103&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8245268 Stats: 495 lines in 8 files changed: 437 ins; 16 del; 42 mod Patch: https://git.openjdk.org/jdk19/pull/103.diff Fetch: git fetch https://git.openjdk.org/jdk19 pull/103/head:pull/103 PR: https://git.openjdk.org/jdk19/pull/103 From kvn at openjdk.org Sat Jul 2 02:15:46 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Sat, 2 Jul 2022 02:15:46 GMT Subject: [jdk19] RFR: 8245268: -Xcomp is missing from java launcher documentation In-Reply-To: References: Message-ID: On Sat, 2 Jul 2022 01:04:19 GMT, Igor Veresov wrote: > Updated man pages from markdown sources. I am not sure some words changes are correct? May be we should cleanup and update all these man pages as separate issue. And apply only `-Xcomp` changes. 
@iklam you recently pushed changes https://github.com/openjdk/jdk/pull/9024 which included only a CDS-related update. Did you discuss this issue during the review of your changes?

src/java.base/share/man/java.1 line 3682:

> 3680: .RS
> 3681: .PP
> 3682: The following examples show how to set the mimimum size of allocated

`mimimum`?

src/java.base/share/man/java.1 line 5367:

> 5365: \f[CB];\f[R]
> 5366: .PP
> 5367: (The names "static" and "dyanmic" are used for historical reasons.

`dyanmic` ?

-------------

PR: https://git.openjdk.org/jdk19/pull/103

From iveresov at openjdk.org Sat Jul 2 02:48:47 2022
From: iveresov at openjdk.org (Igor Veresov)
Date: Sat, 2 Jul 2022 02:48:47 GMT
Subject: [jdk19] RFR: 8245268: -Xcomp is missing from java launcher documentation
In-Reply-To:
References:
Message-ID:

On Sat, 2 Jul 2022 02:12:43 GMT, Vladimir Kozlov wrote:

> I am not sure some words changes are correct?
>
> May be we should cleanup and update all these man pages as separate issue. And apply only `-Xcomp` changes.

Then I'd have to manually craft a patch instead of actually generating them. It seems like it's all been out of sync for a long time.

-------------

PR: https://git.openjdk.org/jdk19/pull/103

From iveresov at openjdk.org Sat Jul 2 02:48:48 2022
From: iveresov at openjdk.org (Igor Veresov)
Date: Sat, 2 Jul 2022 02:48:48 GMT
Subject: [jdk19] RFR: 8245268: -Xcomp is missing from java launcher documentation
In-Reply-To:
References:
Message-ID: <3n97bQQ4ib9ogDH_nIP-ivvywhufhTnQR49_SjlHmF4=.f0e23271-724e-404f-80fa-304ba27258a9@github.com>

On Sat, 2 Jul 2022 01:04:19 GMT, Igor Veresov wrote:

> Updated man pages from markdown sources.

There is also a question of which version is actually correct. Yeah, maybe I should manually craft a patch just for the Xcomp part.

-------------

PR: https://git.openjdk.org/jdk19/pull/103

From iklam at openjdk.org Sat Jul 2 03:32:26 2022
From: iklam at openjdk.org (Ioi Lam)
Date: Sat, 2 Jul 2022 03:32:26 GMT
Subject: [jdk19] RFR: 8245268: -Xcomp is missing from java launcher documentation
In-Reply-To: <3n97bQQ4ib9ogDH_nIP-ivvywhufhTnQR49_SjlHmF4=.f0e23271-724e-404f-80fa-304ba27258a9@github.com>
References: <3n97bQQ4ib9ogDH_nIP-ivvywhufhTnQR49_SjlHmF4=.f0e23271-724e-404f-80fa-304ba27258a9@github.com>
Message-ID:

On Sat, 2 Jul 2022 02:45:43 GMT, Igor Veresov wrote:

> There is also a question of which version is actually correct. Yeah, maybe I should manually craft a patch just for the Xcomp part.

@veresov please see the closed issue https://bugs.openjdk.org/browse/JDK-8287821

@vnkozlov when I did https://github.com/openjdk/jdk/pull/9024, I generated `java.1` from the `java.md` file. I then used the `meld` program on Linux to revert all changes in the `java.1` file that were unrelated to my changes in `java.md`. Yes, it was a pain.

-------------

PR: https://git.openjdk.org/jdk19/pull/103

From iveresov at openjdk.org Sat Jul 2 03:48:27 2022
From: iveresov at openjdk.org (Igor Veresov)
Date: Sat, 2 Jul 2022 03:48:27 GMT
Subject: [jdk19] RFR: 8245268: -Xcomp is missing from java launcher documentation [v2]
In-Reply-To:
References:
Message-ID:

> Updated man pages from markdown sources.

Igor Veresov has refreshed the contents of this pull request, and previous commits have been removed. The incremental views will show differences compared to the previous content of the PR.
The pull request contains one new commit since the last revision:

  Update java manpage

-------------

Changes:
  - all: https://git.openjdk.org/jdk19/pull/103/files
  - new: https://git.openjdk.org/jdk19/pull/103/files/37da94df..f3526e7b

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=jdk19&pr=103&range=01
 - incr: https://webrevs.openjdk.org/?repo=jdk19&pr=103&range=00-01

  Stats: 485 lines in 8 files changed: 12 ins; 427 del; 46 mod
  Patch: https://git.openjdk.org/jdk19/pull/103.diff
  Fetch: git fetch https://git.openjdk.org/jdk19 pull/103/head:pull/103

PR: https://git.openjdk.org/jdk19/pull/103

From iveresov at openjdk.org Sat Jul 2 03:48:27 2022
From: iveresov at openjdk.org (Igor Veresov)
Date: Sat, 2 Jul 2022 03:48:27 GMT
Subject: [jdk19] RFR: 8245268: -Xcomp is missing from java launcher documentation
In-Reply-To:
References:
Message-ID:

On Sat, 2 Jul 2022 01:04:19 GMT, Igor Veresov wrote:

> Updated man pages from markdown sources.

Ok, I just hacked a patch file.

-------------

PR: https://git.openjdk.org/jdk19/pull/103

From kvn at openjdk.org Sat Jul 2 05:01:39 2022
From: kvn at openjdk.org (Vladimir Kozlov)
Date: Sat, 2 Jul 2022 05:01:39 GMT
Subject: [jdk19] RFR: 8245268: -Xcomp is missing from java launcher documentation [v2]
In-Reply-To:
References:
Message-ID: <7C_JI0a24kKkiiCZ7NOuTH3xhAqVtA25rtD8Br7U6CU=.7fdc28d1-8f20-4aa0-9c6e-342dec4a3cd8@github.com>

On Sat, 2 Jul 2022 03:48:27 GMT, Igor Veresov wrote:

>> Updated man pages from markdown sources.
>
> Igor Veresov has refreshed the contents of this pull request, and previous commits have been removed. The incremental views will show differences compared to the previous content of the PR. The pull request contains one new commit since the last revision:
>
>   Update java manpage

Good

-------------

Marked as reviewed by kvn (Reviewer).
PR: https://git.openjdk.org/jdk19/pull/103

From iveresov at openjdk.org Sat Jul 2 05:57:44 2022
From: iveresov at openjdk.org (Igor Veresov)
Date: Sat, 2 Jul 2022 05:57:44 GMT
Subject: [jdk19] RFR: 8245268: -Xcomp is missing from java launcher documentation [v2]
In-Reply-To:
References:
Message-ID:

On Sat, 2 Jul 2022 03:48:27 GMT, Igor Veresov wrote:

>> Updated man pages from markdown sources.
>
> Igor Veresov has refreshed the contents of this pull request, and previous commits have been removed. The incremental views will show differences compared to the previous content of the PR. The pull request contains one new commit since the last revision:
>
>   Update java manpage

Thanks, Vladimir!

-------------

PR: https://git.openjdk.org/jdk19/pull/103

From iveresov at openjdk.org Sat Jul 2 05:57:46 2022
From: iveresov at openjdk.org (Igor Veresov)
Date: Sat, 2 Jul 2022 05:57:46 GMT
Subject: [jdk19] Integrated: 8245268: -Xcomp is missing from java launcher documentation
In-Reply-To:
References:
Message-ID:

On Sat, 2 Jul 2022 01:04:19 GMT, Igor Veresov wrote:

> Updated man pages from markdown sources.

This pull request has now been integrated.
Changeset: f5cdabad Author: Igor Veresov URL: https://git.openjdk.org/jdk19/commit/f5cdabad06b1658d9a3ac01f94cbd29080ffcdb1 Stats: 6 lines in 1 file changed: 6 ins; 0 del; 0 mod 8245268: -Xcomp is missing from java launcher documentation Reviewed-by: kvn ------------- PR: https://git.openjdk.org/jdk19/pull/103 From sspitsyn at openjdk.org Sat Jul 2 07:09:46 2022 From: sspitsyn at openjdk.org (Serguei Spitsyn) Date: Sat, 2 Jul 2022 07:09:46 GMT Subject: [jdk19] RFR: 8288949: serviceability/jvmti/vthread/ContStackDepthTest/ContStackDepthTest.java failing [v2] In-Reply-To: References: Message-ID: On Sat, 25 Jun 2022 01:23:47 GMT, Ron Pressler wrote: >> Please review the following bug fix: >> >> `Continuation.enterSpecial` is a generated special nmethod (albeit not a Java method), with a well-known frame layout that calls `Continuation.enter`. >> >> Because it is compiled, it resolves the call to `Continuation.enter` to its compiled version, if available. But this results in the compiled `Continuation.enter` being called even when the thread is in interp_only_mode. >> >> This change does three things: >> >> 1. When entering interp_only_mode, `Continuation::set_cont_fastpath_thread_state` will clear enterSpecial's resolved callsite to Continuation.enter. >> 2. In interp_only_mode, `SharedRuntime::resolve_static_call_C` will return `Continuation.enter`'s c2i entry rather than `verified_code_entry`. >> 3. In interp_only_mode, the c2i stub will not patch the callsite. >> >> This fix isn't perfect, because a different thread, not in interp_only_mode, might patch the call. A longer-term solution is to create an "interpreted" version of `enterSpecial` and supporting an ad-hoc deoptimization. See https://bugs.openjdk.org/browse/JDK-8289128 >> >> >> Passes tiers 1-4 and Loom tiers 1-5. 
> > Ron Pressler has updated the pull request incrementally with one additional commit since the last revision: > > Revert "Remove outdated comment" > > This reverts commit 8f571d76e34bc64ceb31894184fba4b909e8fbfe. src/hotspot/share/runtime/sharedRuntime.cpp line 1563: > 1561: JRT_BLOCK_ENTRY(address, SharedRuntime::resolve_static_call_C(JavaThread* current )) > 1562: methodHandle callee_method; > 1563: bool enter_special = false; One micro suggestion is to rename: `enter_special => is_enter_special`. ------------- PR: https://git.openjdk.org/jdk19/pull/66 From sspitsyn at openjdk.org Sat Jul 2 07:20:44 2022 From: sspitsyn at openjdk.org (Serguei Spitsyn) Date: Sat, 2 Jul 2022 07:20:44 GMT Subject: [jdk19] RFR: 8288949: serviceability/jvmti/vthread/ContStackDepthTest/ContStackDepthTest.java failing [v2] In-Reply-To: References: Message-ID: <_F4Jxh8T1Xb-Td4mGBGmgvtVr2NVCG_oWp7nhvk_Eqw=.24bb6b8a-8880-4b9d-b34d-a2c70691f0f4@github.com> On Sat, 25 Jun 2022 01:23:47 GMT, Ron Pressler wrote: >> Please review the following bug fix: >> >> `Continuation.enterSpecial` is a generated special nmethod (albeit not a Java method), with a well-known frame layout that calls `Continuation.enter`. >> >> Because it is compiled, it resolves the call to `Continuation.enter` to its compiled version, if available. But this results in the compiled `Continuation.enter` being called even when the thread is in interp_only_mode. >> >> This change does three things: >> >> 1. When entering interp_only_mode, `Continuation::set_cont_fastpath_thread_state` will clear enterSpecial's resolved callsite to Continuation.enter. >> 2. In interp_only_mode, `SharedRuntime::resolve_static_call_C` will return `Continuation.enter`'s c2i entry rather than `verified_code_entry`. >> 3. In interp_only_mode, the c2i stub will not patch the callsite. >> >> This fix isn't perfect, because a different thread, not in interp_only_mode, might patch the call. 
A longer-term solution is to create an "interpreted" version of `enterSpecial` and supporting an ad-hoc deoptimization. See https://bugs.openjdk.org/browse/JDK-8289128 >> >> >> Passes tiers 1-4 and Loom tiers 1-5. > > Ron Pressler has updated the pull request incrementally with one additional commit since the last revision: > > Revert "Remove outdated comment" > > This reverts commit 8f571d76e34bc64ceb31894184fba4b909e8fbfe. src/hotspot/share/runtime/sharedRuntime.cpp line 1582: > 1580: // but in interp_only_mode we need to go to the interpreted entry > 1581: // The c2i won't patch in this mode -- see fixup_callers_callsite > 1582: return callee_method->get_c2i_entry(); Nit: Dots at the end of lines 1580-1581 would be nice to follow comments style in this file. src/hotspot/share/runtime/sharedRuntime.cpp line 2018: > 2016: if (JavaThread::current()->is_interp_only_mode()) > 2017: return; > 2018: } Nit - micro simplification: if (nm->method()->is_continuation_enter_intrinsic() && JavaThread::current()->is_interp_only_mode()) { return; } ------------- PR: https://git.openjdk.org/jdk19/pull/66 From sspitsyn at openjdk.org Sat Jul 2 07:40:49 2022 From: sspitsyn at openjdk.org (Serguei Spitsyn) Date: Sat, 2 Jul 2022 07:40:49 GMT Subject: [jdk19] RFR: 8288949: serviceability/jvmti/vthread/ContStackDepthTest/ContStackDepthTest.java failing [v2] In-Reply-To: References: Message-ID: On Sat, 25 Jun 2022 01:23:47 GMT, Ron Pressler wrote: >> Please review the following bug fix: >> >> `Continuation.enterSpecial` is a generated special nmethod (albeit not a Java method), with a well-known frame layout that calls `Continuation.enter`. >> >> Because it is compiled, it resolves the call to `Continuation.enter` to its compiled version, if available. But this results in the compiled `Continuation.enter` being called even when the thread is in interp_only_mode. >> >> This change does three things: >> >> 1. 
When entering interp_only_mode, `Continuation::set_cont_fastpath_thread_state` will clear enterSpecial's resolved callsite to Continuation.enter. >> 2. In interp_only_mode, `SharedRuntime::resolve_static_call_C` will return `Continuation.enter`'s c2i entry rather than `verified_code_entry`. >> 3. In interp_only_mode, the c2i stub will not patch the callsite. >> >> This fix isn't perfect, because a different thread, not in interp_only_mode, might patch the call. A longer-term solution is to create an "interpreted" version of `enterSpecial` and supporting an ad-hoc deoptimization. See https://bugs.openjdk.org/browse/JDK-8289128 >> >> >> Passes tiers 1-4 and Loom tiers 1-5. > > Ron Pressler has updated the pull request incrementally with one additional commit since the last revision: > > Revert "Remove outdated comment" > > This reverts commit 8f571d76e34bc64ceb31894184fba4b909e8fbfe. How was this change tested? In fact, it is not easy to estimate the total impact of this change. I hope, it impacts continuations only, but not very sure yet. src/hotspot/share/runtime/continuation.cpp line 315: > 313: thread->set_cont_fastpath_thread_state(fast); > 314: if (thread->is_interp_only_mode() && ContinuationEntry::enter_special() != nullptr) { > 315: ContinuationEntry::enter_special()->clear_continuation_enter_special_inline_caches(); Will this call impact all JavaThread's, not only the one passed in the argument? Just want to understand this better. Would it be worth to add a comment explaining this aspect (if applicable)? 

-------------

PR: https://git.openjdk.org/jdk19/pull/66

From jbhateja at openjdk.org Sat Jul 2 18:58:13 2022
From: jbhateja at openjdk.org (Jatin Bhateja)
Date: Sat, 2 Jul 2022 18:58:13 GMT
Subject: [jdk19] RFR: 8287851: C2 crash: assert(t->meet(t0) == t) failed: Not monotonic
Message-ID:

Hi All,

This patch fixes an assertion failure seen during the conditional constant propagation optimization, caused by non-convergence. This happens when the type values (lattice) associated with an IR node during iterative data flow analysis are not monotonic.

The problem was caused by incorrect result value range estimation in the Value routines of the Compress/ExpandBits IR nodes. A non-constant mask lattice can take any value between its _lo and _hi values. The special handling for a positive mask value range uses count_leading_zeros to estimate the maximum bit width needed to accommodate the result. Since count_leading_zeros accepts a long argument, an integer argument is sign-extended, so for the integer case we need to subtract 32 from the result to get the correct value.

The patch also fixes a typo that resulted in dead code, reported by [JDK-8287855](https://bugs.openjdk.org/browse/JDK-8287855): Problem in compress_expand_identity.

The failing unit test java/lang/CompressExpandTest.java has been removed from ProblemList.txt.

Kindly review and share your feedback.
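
As an aside on the count_leading_zeros point in the message above, here is a minimal sketch of why the integer case needs the subtraction of 32. It is not the HotSpot code: `clz64` and `clz32_via_64` are hypothetical names, and `clz64` is only assumed to behave like a leading-zero count that takes a 64-bit argument. The sketch covers non-negative values only, matching the positive mask range discussed above.

```cpp
#include <cstdint>

// Portable leading-zero count on a 64-bit value; returns 64 for zero.
// Hypothetical stand-in for a helper that only accepts a 64-bit argument.
constexpr int clz64(uint64_t x) {
  int n = 0;
  for (uint64_t bit = uint64_t{1} << 63; bit != 0 && (x & bit) == 0; bit >>= 1) {
    ++n;
  }
  return n;
}

// Leading zeros of a non-negative 32-bit value, computed through the
// 64-bit helper. Widening a non-negative int to 64 bits puts 32 extra
// zero bits on top, so they have to be subtracted again.
// Assumption: x >= 0; for negative x the sign extension fills the upper
// bits with ones and this formula does not apply.
constexpr int clz32_via_64(int32_t x) {
  return clz64(static_cast<uint64_t>(static_cast<int64_t>(x))) - 32;
}

// Compile-time checks: 0xFF has 56 leading zeros as a 64-bit value,
// but only 24 as a 32-bit value.
static_assert(clz64(0xFF) == 56, "64-bit count includes the widened bits");
static_assert(clz32_via_64(0xFF) == 24, "32-bit count after subtracting 32");
```

Without the subtraction, a bit-width estimate derived from the 64-bit count would be off by 32 for integer inputs, which is the kind of range error the message describes.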
Best Regards, Jatin ------------- Commit messages: - 8287851: C2 crash: assert(t->meet(t0) == t) failed: Not monotonic Changes: https://git.openjdk.org/jdk19/pull/104/files Webrev: https://webrevs.openjdk.org/?repo=jdk19&pr=104&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8287851 Stats: 7 lines in 2 files changed: 4 ins; 2 del; 1 mod Patch: https://git.openjdk.org/jdk19/pull/104.diff Fetch: git fetch https://git.openjdk.org/jdk19 pull/104/head:pull/104 PR: https://git.openjdk.org/jdk19/pull/104 From haosun at openjdk.org Mon Jul 4 02:21:37 2022 From: haosun at openjdk.org (Hao Sun) Date: Mon, 4 Jul 2022 02:21:37 GMT Subject: RFR: 8285790: AArch64: Merge C2 NEON and SVE matching rules In-Reply-To: References: Message-ID: <4X_4L9tRljWUlHEpiHyrwX3GT59XrU6RpNzX2wF7nYs=.a65e9076-b0ae-407f-9961-b945808becd9@github.com> On Fri, 1 Jul 2022 10:36:36 GMT, Hao Sun wrote: > **MOTIVATION** > > This is a big refactoring patch of merging rules in aarch64_sve.ad and > aarch64_neon.ad. The motivation can also be found at [1]. > > Currently AArch64 has aarch64_sve.ad and aarch64_neon.ad to support SVE > and NEON codegen respectively. 1) For SVE rules we use vReg operand to > match VecA for an arbitrary length of vector type, when SVE is enabled; > 2) For NEON rules we use vecX/vecD operands to match VecX/VecD for > 128-bit/64-bit vectors, when SVE is not enabled. > > This separation looked clean at the time of introducing SVE support. > However, there are two main drawbacks now. > > **Drawback-1**: NEON (Advanced SIMD) is the mandatory feature on AArch64 and > SVE vector registers share the lower 128 bits with NEON registers. For > some cases, even when SVE is enabled, we still prefer to match NEON > rules and emit NEON instructions. > > **Drawback-2**: With more and more vector rules added to support VectorAPI, > there are lots of rules in both two ad files with different predication > conditions, e.g., different values of UseSVE or vector type/size. 
> > Examples can be found in [1]. These two drawbacks make the code less > maintainable and increase the libjvm.so code size. > > **KEY UPDATES** > > In this patch, we mainly do two things, using generic vReg to match all > NEON/SVE vector registers and merging NEON/SVE matching rules. > > - Update-1: Use generic vReg to match all NEON/SVE vector registers > > Two different approaches were considered, and we prefer to use generic > vector solution but keep VecA operand for all >128-bit vectors. See the > last slide in [1]. All the changes lie in the AArch64 backend. > > 1) Some helpers are updated in aarch64.ad to enable generic vector on > AArch64. See vector_ideal_reg(), pd_specialize_generic_vector_operand(), > is_reg2reg_move() and is_generic_vector(). > > 2) Operand vecA is created to match VecA register, and vReg is updated > to match VecA/D/X registers dynamically. > > With the introduction of generic vReg, difference in register types > between NEON rules and SVE rules can be eliminated, which makes it easy > to merge these rules. > > - Update-2: Try to merge existing rules > > As updated in GensrcAdlc.gmk, new ad file, aarch64_vector.ad, is > introduced to hold the grouped and merged matching rules. > > 1) Similar rules with difference in vector type/size can be merged into > new rules, where different types and vector sizes are handled in the > codegen part, e.g., vadd(). This resolves **Drawback-2**. > > 2) In most cases, we tend to emit NEON instructions for 128-bit vector > operations on SVE platforms, e.g., vadd(). This resolves **Drawback-1**. > > It's important to note that there are some exceptions. > > Exception-1: For some rules, there are no direct NEON instructions, but > exists simple SVE implementation due to newly added SVE ISA. Such rules > include vloadconB, vmulL_neon, vminL_neon, vmaxL_neon, > reduce_addF_le128b (4F case), reduce_and/or/xor_neon, reduce_minL_neon, > reduce_maxL_neon, vcvtLtoF_neon, vcvtDtoI_neon, rearrange_HS_neon. 
> > Exception-2: Vector mask generation and operation rules are different > because vector mask is stored in different types of registers between > NEON and SVE, e.g., vmaskcmp_neon and vmask_truecount_neon rules. > > Exception-3: Shift right related rules are different because vector > shift right instructions differ a bit between NEON and SVE. > > For these exceptions, we emit NEON or SVE code simply based on UseSVE > options. > > **MINOR UPDATES and CODE REFACTORING** > > Since we've touched all lines of code during merging rules, we further > do more minor updates and refactoring. > > - Reduce regmask bits > > Stack slot alignment is handled specially for scalable vector, which > will firstly align to SlotsPerVecA, and then align to the real vector > length. We should guarantee SlotsPerVecA is no bigger than the real > vector length. Otherwise, unused stack space would be allocated. > > In AArch64 SVE, the vector length can be 128 to 2048 bits. However, > SlotsPerVecA is currently set as 8, i.e. 8 * 32 = 256 bits. As a result, > on a 128-bit SVE platform, the stack slot is aligned to 256 bits, > leaving the half 128 bits unused. In this patch, we reduce SlotsPerVecA > from 8 to 4. > > See the updates in register_aarch64.hpp, regmask.hpp and aarch64.ad > (chunk1 and vectora_reg). > > - Refactor NEON/SVE vector op support check. > > Merge NEON and SVE vector supported check into one single function. To > be consistent, SVE default size supported check now is relaxed from no > less than 64 bits to the same condition as NEON's min_vector_size(), > i.e. 4 bytes and 4/2 booleans are also supported. This should be fine, > as we assume at least we will emit NEON code for those small vectors, > with unified rules. > > - Some notes for new rules > > 1) Since new rules are unique and it makes no sense to set different > "ins_cost", we turn to use the default cost. > > 2) By default we use "pipe_slow" for matching rules in aarch64_vector.ad > now. 
Hence, many SIMD pipeline classes at aarch64.ad become unused and
> can be removed.
>
> 3) Suffixes '_le128b/_gt128b' and '_neon/_sve' are appended in the
> matching rule names if needed.
> a) 'le128b' means the vector length is less than or equal to 128 bits.
> This rule can be matched on both NEON and 128-bit SVE.
> b) 'gt128b' means the vector length is greater than 128 bits. This rule
> can only be matched on SVE.
> c) 'neon' means this rule can only be matched on NEON, i.e. the
> generated instruction is not better than those in 128-bit SVE.
> d) 'sve' means this rule is only matched on SVE for all possible vector
> length, i.e. not limited to gt128b.
>
> Note-1: m4 file is not introduced because many duplications are highly
> reduced now.
> Note-2: We guess the code review for this big patch would probably take
> some time and we may need to merge latest code from master branch from
> time to time. We prefer to keep aarch64_neon/sve.ad and the
> corresponding m4 files for easy comparison and review. Of course, they
> will be finally removed after some solid reviews before integration.
> Note-3: Several other minor refactorings are done in this patch, but we
> cannot list all of them in the commit message. We have reviewed and
> tested the rules carefully to guarantee the quality.
>
> **TESTING**
>
> 1) Cross compilations on arm32/s390/ppc/riscv passed.
> 2) tier1~3 jtreg passed on both x64 and aarch64 machines.
> 3) vector tests: all the test cases under the following directories can
> pass on both NEON and SVE systems with max vector length 16/32/64 bytes.
>
> "test/hotspot/jtreg/compiler/vectorapi/"
> "test/jdk/jdk/incubator/vector/"
> "test/hotspot/jtreg/compiler/vectorization/"
>
> 4) Performance evaluation: we choose vector micro-benchmarks from
> panama-vector:vectorIntrinsics [2] to evaluate the performance of this
> patch.
We've tested *MaxVectorTests.java cases on one 128-bit SVE
> platform and one NEON platform, and didn't see any visible regression
> with NEON and SVE. We will continue to verify more cases on other
> platforms with NEON and different SVE vector sizes.
>
> **BENEFITS**
>
> The number of matching rules is reduced to ~ **42%**.
> before: 373 (aarch64_neon.ad) + 380 (aarch64_sve.ad) = 753
> after : 313 (aarch64_vector.ad)
>
> Code size for libjvm.so (release build) on aarch64 is reduced to ~ **96%**.
> before: 25246528 B (commit 7905788e969)
> after : 24208776 B (**nearly 1 MB reduction**)
>
> [1] http://cr.openjdk.java.net/~njian/8285790/JDK-8285790.pdf
> [2] https://github.com/openjdk/panama-vector/tree/vectorIntrinsics/test/micro/org/openjdk/bench/jdk/incubator/vector/operation
>
> Co-Developed-by: Ningsheng Jian
> Co-Developed-by: Eric Liu

We got one GHA test failure on linux-x86/tier1, with the following error log. See the [link](https://github.com/shqking/jdk/runs/7173063926?check_suite_focus=true).

`Error: Unable to find an artifact with the name: bundles-linux-x86`

I suppose it's **not** related to our patch. Besides, we have tested tier1~3 on linux x64/aarch64 locally.

-------------

PR: https://git.openjdk.org/jdk/pull/9346

From haosun at openjdk.org Mon Jul 4 02:46:39 2022
From: haosun at openjdk.org (Hao Sun)
Date: Mon, 4 Jul 2022 02:46:39 GMT
Subject: RFR: 8285790: AArch64: Merge C2 NEON and SVE matching rules
In-Reply-To: <_l-poEFY80LRmKZyRhpxvSohm6nLv_ruaO1_WzKmTlQ=.9faff21d-d67c-42e0-8de7-be2ca9397b88@github.com>
References: <_l-poEFY80LRmKZyRhpxvSohm6nLv_ruaO1_WzKmTlQ=.9faff21d-d67c-42e0-8de7-be2ca9397b88@github.com>
Message-ID:

On Fri, 1 Jul 2022 11:25:36 GMT, Andrew Haley wrote:

>> **MOTIVATION**
>>
>> This is a big refactoring patch of merging rules in aarch64_sve.ad and
>> aarch64_neon.ad. The motivation can also be found at [1].
>>
>> Currently AArch64 has aarch64_sve.ad and aarch64_neon.ad to support SVE
>> and NEON codegen respectively.
1) For SVE rules we use vReg operand to >> match VecA for an arbitrary length of vector type, when SVE is enabled; >> 2) For NEON rules we use vecX/vecD operands to match VecX/VecD for >> 128-bit/64-bit vectors, when SVE is not enabled. >> >> This separation looked clean at the time of introducing SVE support. >> However, there are two main drawbacks now. >> >> **Drawback-1**: NEON (Advanced SIMD) is the mandatory feature on AArch64 and >> SVE vector registers share the lower 128 bits with NEON registers. For >> some cases, even when SVE is enabled, we still prefer to match NEON >> rules and emit NEON instructions. >> >> **Drawback-2**: With more and more vector rules added to support VectorAPI, >> there are lots of rules in both two ad files with different predication >> conditions, e.g., different values of UseSVE or vector type/size. >> >> Examples can be found in [1]. These two drawbacks make the code less >> maintainable and increase the libjvm.so code size. >> >> **KEY UPDATES** >> >> In this patch, we mainly do two things, using generic vReg to match all >> NEON/SVE vector registers and merging NEON/SVE matching rules. >> >> - Update-1: Use generic vReg to match all NEON/SVE vector registers >> >> Two different approaches were considered, and we prefer to use generic >> vector solution but keep VecA operand for all >128-bit vectors. See the >> last slide in [1]. All the changes lie in the AArch64 backend. >> >> 1) Some helpers are updated in aarch64.ad to enable generic vector on >> AArch64. See vector_ideal_reg(), pd_specialize_generic_vector_operand(), >> is_reg2reg_move() and is_generic_vector(). >> >> 2) Operand vecA is created to match VecA register, and vReg is updated >> to match VecA/D/X registers dynamically. >> >> With the introduction of generic vReg, difference in register types >> between NEON rules and SVE rules can be eliminated, which makes it easy >> to merge these rules. 
>> >> - Update-2: Try to merge existing rules >> >> As updated in GensrcAdlc.gmk, new ad file, aarch64_vector.ad, is >> introduced to hold the grouped and merged matching rules. >> >> 1) Similar rules with difference in vector type/size can be merged into >> new rules, where different types and vector sizes are handled in the >> codegen part, e.g., vadd(). This resolves **Drawback-2**. >> >> 2) In most cases, we tend to emit NEON instructions for 128-bit vector >> operations on SVE platforms, e.g., vadd(). This resolves **Drawback-1**. >> >> It's important to note that there are some exceptions. >> >> Exception-1: For some rules, there are no direct NEON instructions, but >> exists simple SVE implementation due to newly added SVE ISA. Such rules >> include vloadconB, vmulL_neon, vminL_neon, vmaxL_neon, >> reduce_addF_le128b (4F case), reduce_and/or/xor_neon, reduce_minL_neon, >> reduce_maxL_neon, vcvtLtoF_neon, vcvtDtoI_neon, rearrange_HS_neon. >> >> Exception-2: Vector mask generation and operation rules are different >> because vector mask is stored in different types of registers between >> NEON and SVE, e.g., vmaskcmp_neon and vmask_truecount_neon rules. >> >> Exception-3: Shift right related rules are different because vector >> shift right instructions differ a bit between NEON and SVE. >> >> For these exceptions, we emit NEON or SVE code simply based on UseSVE >> options. >> >> **MINOR UPDATES and CODE REFACTORING** >> >> Since we've touched all lines of code during merging rules, we further >> do more minor updates and refactoring. >> >> - Reduce regmask bits >> >> Stack slot alignment is handled specially for scalable vector, which >> will firstly align to SlotsPerVecA, and then align to the real vector >> length. We should guarantee SlotsPerVecA is no bigger than the real >> vector length. Otherwise, unused stack space would be allocated. >> >> In AArch64 SVE, the vector length can be 128 to 2048 bits. However, >> SlotsPerVecA is currently set as 8, i.e. 
8 * 32 = 256 bits. As a result, >> on a 128-bit SVE platform, the stack slot is aligned to 256 bits, >> leaving the half 128 bits unused. In this patch, we reduce SlotsPerVecA >> from 8 to 4. >> >> See the updates in register_aarch64.hpp, regmask.hpp and aarch64.ad >> (chunk1 and vectora_reg). >> >> - Refactor NEON/SVE vector op support check. >> >> Merge NEON and SVE vector supported check into one single function. To >> be consistent, SVE default size supported check now is relaxed from no >> less than 64 bits to the same condition as NEON's min_vector_size(), >> i.e. 4 bytes and 4/2 booleans are also supported. This should be fine, >> as we assume at least we will emit NEON code for those small vectors, >> with unified rules. >> >> - Some notes for new rules >> >> 1) Since new rules are unique and it makes no sense to set different >> "ins_cost", we turn to use the default cost. >> >> 2) By default we use "pipe_slow" for matching rules in aarch64_vector.ad >> now. Hence, many SIMD pipeline classes at aarch64.ad become unused and >> can be removed. >> >> 3) Suffixes '_le128b/_gt128b' and '_neon/_sve' are appended in the >> matching rule names if needed. >> a) 'le128b' means the vector length is less than or equal to 128 bits. >> This rule can be matched on both NEON and 128-bit SVE. >> b) 'gt128b' means the vector length is greater than 128 bits. This rule >> can only be matched on SVE. >> c) 'neon' means this rule can only be matched on NEON, i.e. the >> generated instruction is not better than those in 128-bit SVE. >> d) 'sve' means this rule is only matched on SVE for all possible vector >> length, i.e. not limited to gt128b. >> >> Note-1: m4 file is not introduced because many duplications are highly >> reduced now. >> Note-2: We guess the code review for this big patch would probably take >> some time and we may need to merge latest code from master branch from >> time to time. 
We prefer to keep aarch64_neon/sve.ad and the >> corresponding m4 files for easy comparison and review. Of course, they >> will be finally removed after some solid reviews before integration. >> Note-3: Several other minor refactorings are done in this patch, but we >> cannot list all of them in the commit message. We have reviewed and >> tested the rules carefully to guarantee the quality. >> >> **TESTING** >> >> 1) Cross compilations on arm32/s390/pps/riscv passed. >> 2) tier1~3 jtreg passed on both x64 and aarch64 machines. >> 3) vector tests: all the test cases under the following directories can >> pass on both NEON and SVE systems with max vector length 16/32/64 bytes. >> >> "test/hotspot/jtreg/compiler/vectorapi/" >> "test/jdk/jdk/incubator/vector/" >> "test/hotspot/jtreg/compiler/vectorization/" >> >> 4) Performance evaluation: we choose vector micro-benchmarks from >> panama-vector:vectorIntrinsics [2] to evaluate the performance of this >> patch. We've tested *MaxVectorTests.java cases on one 128-bit SVE >> platform and one NEON platform, and didn't see any visiable regression >> with NEON and SVE. We will continue to verify more cases on other >> platforms with NEON and different SVE vector sizes. >> >> **BENEFITS** >> >> The number of matching rules is reduced to ~ **42%**. >> before: 373 (aarch64_neon.ad) + 380 (aarch64_sve.ad) = 753 >> after : 313 (aarch64_vector.ad) >> >> Code size for libjvm.so (release build) on aarch64 is reduced to ~ **96%**. >> before: 25246528 B (commit 7905788e969) >> after : 24208776 B (**nearly 1 MB reduction**) >> >> [1] http://cr.openjdk.java.net/~njian/8285790/JDK-8285790.pdf >> [2] https://github.com/openjdk/panama-vector/tree/vectorIntrinsics/test/micro/org/openjdk/bench/jdk/incubator/vector/operation >> >> Co-Developed-by: Ningsheng Jian >> Co-Developed-by: Eric Liu > > Aha! I was looking forward to this. 
> > On 7/1/22 11:46, Hao Sun wrote: > > Note-1: m4 file is not introduced because many duplications are highly > > reduced now. > > Yes, but there's still a lot of duplications. I'll make a few examples > of where you should make simple changes that will usefully increase the > level of abstraction. That will be a start. @theRealAph Thanks for your comment. Yes. There are still duplicate code. I can easily list several ones, such as the reduce-and/or/xor, vector shift ops and several reg with imm rules. We're open to keep m4 file. But I would suggest that we may put our attention firstly on 1) our implementation on generic vector registers and 2) the merged rules (in particular those we share the codegen for NEON only platform and 128-bit vector ops on SVE platform). After that we may discuss whether to use m4 file and how to implement it if needed. ------------- PR: https://git.openjdk.org/jdk/pull/9346 From thartmann at openjdk.org Mon Jul 4 06:06:50 2022 From: thartmann at openjdk.org (Tobias Hartmann) Date: Mon, 4 Jul 2022 06:06:50 GMT Subject: [jdk19] RFR: 8287851: C2 crash: assert(t->meet(t0) == t) failed: Not monotonic In-Reply-To: References: Message-ID: <1SyJA3Tn5Bq9W1ivkJfowApk35zHKt0MZ_gQlw_RvJI=.216158b4-54c5-4a19-baf2-c130780fb4c3@github.com> On Sat, 2 Jul 2022 18:51:13 GMT, Jatin Bhateja wrote: > Hi All, > > Patch fixes the assertion failure seen during conditional constant propagation optimization on account of > non-convergence, this happens when type values (lattice) associated with IR node seen during iterative data flow analysis are not-monotonic. > > Problem was occurring due to incorrect result value range estimation by Value routines associated with Compress/ExpandBits IR nodes, non-constant mask lattice can take any value between _lo and _hi values, special handling for +ve mask value range is using count_leading_zeros to estimate the maximum bit width needed to accommodate the result. 
Since count_leading_zeros > accepts a long argument there by sign-extending integer argument, hence for integer case we need to subtract 32 from the results to get correct value. > > Patch also fixes a typo resulting into a dead code reported by [JDK-8287855](https://bugs.openjdk.org/browse/JDK-8287855): Problem in compress_expand_identity. > > Failing unit test java/lang/CompressExpandTest.java has been removed from ProblemList.txt. > > Kindly review and share your feedback. > > Best Regards, > Jatin Looks good to me. Can we close [JDK-8287855](https://bugs.openjdk.org/browse/JDK-8287855) as duplicate then? I'll run testing and report back once it passed. ------------- Marked as reviewed by thartmann (Reviewer). PR: https://git.openjdk.org/jdk19/pull/104 From chagedorn at openjdk.org Mon Jul 4 06:45:22 2022 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Mon, 4 Jul 2022 06:45:22 GMT Subject: [jdk19] RFR: 8287851: C2 crash: assert(t->meet(t0) == t) failed: Not monotonic In-Reply-To: References: Message-ID: On Sat, 2 Jul 2022 18:51:13 GMT, Jatin Bhateja wrote: > Hi All, > > Patch fixes the assertion failure seen during conditional constant propagation optimization on account of > non-convergence, this happens when type values (lattice) associated with IR node seen during iterative data flow analysis are not-monotonic. > > Problem was occurring due to incorrect result value range estimation by Value routines associated with Compress/ExpandBits IR nodes, non-constant mask lattice can take any value between _lo and _hi values, special handling for +ve mask value range is using count_leading_zeros to estimate the maximum bit width needed to accommodate the result. Since count_leading_zeros > accepts a long argument there by sign-extending integer argument, hence for integer case we need to subtract 32 from the results to get correct value. 
> > Patch also fixes a typo resulting into a dead code reported by [JDK-8287855](https://bugs.openjdk.org/browse/JDK-8287855): Problem in compress_expand_identity. > > Failing unit test java/lang/CompressExpandTest.java has been removed from ProblemList.txt. > > Kindly review and share your feedback. > > Best Regards, > Jatin Looks good! ------------- Marked as reviewed by chagedorn (Reviewer). PR: https://git.openjdk.org/jdk19/pull/104 From duke at openjdk.org Mon Jul 4 06:48:56 2022 From: duke at openjdk.org (KIRIYAMA Takuya) Date: Mon, 4 Jul 2022 06:48:56 GMT Subject: RFR: 8289427: compiler/compilercontrol/jcmd/ClearDirectivesFileStackTest.java failed with null setting [v2] In-Reply-To: References: Message-ID: > The problem of JDK-8289427 is caused by using incorrect compiler settings when the auto generated INTRINSIC parameter is null. > I fixed it to use the appropriate value if the argument of cmd was null. > Please review this change. KIRIYAMA Takuya has updated the pull request incrementally with one additional commit since the last revision: 8289427: compiler/compilercontrol/jcmd/ClearDirectivesFileStackTest.java failed with null setting ------------- Changes: - all: https://git.openjdk.org/jdk/pull/9318/files - new: https://git.openjdk.org/jdk/pull/9318/files/ca8e51c6..505529dd Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=9318&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=9318&range=00-01 Stats: 1 line in 1 file changed: 1 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/9318.diff Fetch: git fetch https://git.openjdk.org/jdk pull/9318/head:pull/9318 PR: https://git.openjdk.org/jdk/pull/9318 From duke at openjdk.org Mon Jul 4 06:48:57 2022 From: duke at openjdk.org (KIRIYAMA Takuya) Date: Mon, 4 Jul 2022 06:48:57 GMT Subject: RFR: 8289427: compiler/compilercontrol/jcmd/ClearDirectivesFileStackTest.java failed with null setting In-Reply-To: References: Message-ID: On Wed, 29 Jun 2022 06:47:07 GMT, KIRIYAMA Takuya wrote: > 
The problem of JDK-8289427 is caused by using incorrect compiler settings when the auto generated INTRINSIC parameter is null. > I fixed it to use the appropriate value if the argument of cmd was null. > Please review this change. I see. The error reported in JDK-8225370 didn't occur in my Win10 environment, but I don't know why it didn't. I restored ProblemList.txt. ------------- PR: https://git.openjdk.org/jdk/pull/9318 From thartmann at openjdk.org Mon Jul 4 06:57:53 2022 From: thartmann at openjdk.org (Tobias Hartmann) Date: Mon, 4 Jul 2022 06:57:53 GMT Subject: RFR: 8289427: compiler/compilercontrol/jcmd/ClearDirectivesFileStackTest.java failed with null setting [v2] In-Reply-To: References: Message-ID: On Mon, 4 Jul 2022 06:48:56 GMT, KIRIYAMA Takuya wrote: >> The problem of JDK-8289427 is caused by using incorrect compiler settings when the auto generated INTRINSIC parameter is null. >> I fixed it to use the appropriate value if the argument of cmd was null. >> Please review this change. > > KIRIYAMA Takuya has updated the pull request incrementally with one additional commit since the last revision: > > 8289427: compiler/compilercontrol/jcmd/ClearDirectivesFileStackTest.java failed with null setting Looks good to me. ------------- Marked as reviewed by thartmann (Reviewer). PR: https://git.openjdk.org/jdk/pull/9318 From jbhateja at openjdk.org Mon Jul 4 08:42:42 2022 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Mon, 4 Jul 2022 08:42:42 GMT Subject: [jdk19] RFR: 8287851: C2 crash: assert(t->meet(t0) == t) failed: Not monotonic In-Reply-To: <1SyJA3Tn5Bq9W1ivkJfowApk35zHKt0MZ_gQlw_RvJI=.216158b4-54c5-4a19-baf2-c130780fb4c3@github.com> References: <1SyJA3Tn5Bq9W1ivkJfowApk35zHKt0MZ_gQlw_RvJI=.216158b4-54c5-4a19-baf2-c130780fb4c3@github.com> Message-ID: On Mon, 4 Jul 2022 06:03:40 GMT, Tobias Hartmann wrote: > Looks good to me. Can we close [JDK-8287855](https://bugs.openjdk.org/browse/JDK-8287855) as duplicate then?
> > I'll run testing and report back once it passed. Hi @TobiHartmann, yes, I will resolve 8287855 manually with comments, and will check in once you share the test results. ------------- PR: https://git.openjdk.org/jdk19/pull/104 From thartmann at openjdk.org Mon Jul 4 09:10:42 2022 From: thartmann at openjdk.org (Tobias Hartmann) Date: Mon, 4 Jul 2022 09:10:42 GMT Subject: [jdk19] RFR: 8287851: C2 crash: assert(t->meet(t0) == t) failed: Not monotonic In-Reply-To: References: Message-ID: <2aHaRy9DafC8EFgTZDZLAF5NqaF6ySZQJUgDHiauxT8=.21491c30-522e-47c7-8016-ca21ae112adb@github.com> On Sat, 2 Jul 2022 18:51:13 GMT, Jatin Bhateja wrote: > Hi All, > > Patch fixes the assertion failure seen during conditional constant propagation optimization on account of > non-convergence, this happens when type values (lattice) associated with IR node seen during iterative data flow analysis are not-monotonic. > > Problem was occurring due to incorrect result value range estimation by Value routines associated with Compress/ExpandBits IR nodes, non-constant mask lattice can take any value between _lo and _hi values, special handling for +ve mask value range is using count_leading_zeros to estimate the maximum bit width needed to accommodate the result. Since count_leading_zeros > accepts a long argument there by sign-extending integer argument, hence for integer case we need to subtract 32 from the results to get correct value. > > Patch also fixes a typo resulting into a dead code reported by [JDK-8287855](https://bugs.openjdk.org/browse/JDK-8287855): Problem in compress_expand_identity. > > Failing unit test java/lang/CompressExpandTest.java has been removed from ProblemList.txt. > > Kindly review and share your feedback. > > Best Regards, > Jatin Thanks. All tests passed.
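
For readers following the quoted description, the "subtract 32" adjustment can be sketched as below. This is only an illustration under stated assumptions, not the HotSpot change itself: `clz64`, `clz32_via_64` and `max_result_bits` are hypothetical names, and `clz64` is a plain loop standing in for a 64-bit-only leading-zero counter.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for a leading-zero counter that only accepts
// 64-bit input (illustrative; not HotSpot's count_leading_zeros).
inline int clz64(uint64_t x) {
  if (x == 0) return 64;
  int n = 0;
  while ((x & (uint64_t{1} << 63)) == 0) {
    x <<= 1;
    ++n;
  }
  return n;
}

// Leading zeros of a non-negative 32-bit value, computed through the
// 64-bit routine. Widening a non-negative int to 64 bits adds exactly
// 32 zero bits on top, so they have to be subtracted again.
inline int clz32_via_64(int32_t v) {
  assert(v >= 0);  // the special case in the thread covers positive ranges
  return clz64(static_cast<uint64_t>(static_cast<int64_t>(v))) - 32;
}

// Maximum number of result bits for a non-negative 32-bit upper bound.
inline int max_result_bits(int32_t hi) {
  return 32 - clz32_via_64(hi);
}
```

Without the subtraction, an upper bound of `0xFF` would appear to have 56 leading zeros instead of 24, and the 32-bit width estimate would come out as a negative number of bits.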
------------- PR: https://git.openjdk.org/jdk19/pull/104 From xgong at openjdk.org Mon Jul 4 10:19:36 2022 From: xgong at openjdk.org (Xiaohong Gong) Date: Mon, 4 Jul 2022 10:19:36 GMT Subject: RFR: 8286941: Add mask IR for partial vector operations for ARM SVE [v9] In-Reply-To: References: Message-ID: > VectorAPI SVE backend supports vector operations whose vector length is smaller than the max vector length that the current hardware can support. We call them partial vector operations. For some partial operations like vector load/store and the reductions, we need to generate a mask based on the real vector length and use it to control the operations to make sure the results are correct. > > For example, if the user defines an IntVector with 256-bit species, and runs it on a SVE hardware that supports 512-bit as the max vector size, all the 256-bit int vector operations are partial. And a mask that all the higher lanes than the real vector length are set to 0 is generated for some ops. > > Currently the mask is generated in the backend that is together with the code generation for each op in the match rule. This will generate many duplicate instructions for operations that have the same vector type. Besides, the mask generation is loop invariant which could be hoisted outside of the loop. > > Here is an example for vector load and add reduction inside a loop: > > ptrue p0.s, vl8 ; mask generation > ld1w {z16.s}, p0/z, [x14] ; load vector > > ptrue p0.s, vl8 ; mask generation > uaddv d17, p0, z16.s ; add reduction > smov x14, v17.s[0] > > As we can see the mask generation code "`ptrue`" is duplicated. To improve it, this patch generates the mask IR and adds it to the partial vector ops before code generation. The duplicate mask generation instructions can be optimized out by gvn and hoisted outside of the loop. 
> > Note that for masked vector operations, there is no need to generate additional mask even though the vector length is smaller than the max vector register size, as the original higher input mask bits have been cleared out. > > Here is the performance gain for the 256-bit vector reductions work on an SVE 512-bit system: > > Benchmark size Gain > Byte256Vector.ADDLanes 1024 0.999 > Byte256Vector.ANDLanes 1024 1.065 > Byte256Vector.MAXLanes 1024 1.064 > Byte256Vector.MINLanes 1024 1.062 > Byte256Vector.ORLanes 1024 1.072 > Byte256Vector.XORLanes 1024 1.041 > Short256Vector.ADDLanes 1024 1.017 > Short256Vector.ANDLanes 1024 1.044 > Short256Vector.MAXLanes 1024 1.049 > Short256Vector.MINLanes 1024 1.049 > Short256Vector.ORLanes 1024 1.089 > Short256Vector.XORLanes 1024 1.047 > Int256Vector.ADDLanes 1024 1.045 > Int256Vector.ANDLanes 1024 1.078 > Int256Vector.MAXLanes 1024 1.123 > Int256Vector.MINLanes 1024 1.129 > Int256Vector.ORLanes 1024 1.078 > Int256Vector.XORLanes 1024 1.072 > Long256Vector.ADDLanes 1024 1.059 > Long256Vector.ANDLanes 1024 1.101 > Long256Vector.MAXLanes 1024 1.079 > Long256Vector.MINLanes 1024 1.099 > Long256Vector.ORLanes 1024 1.098 > Long256Vector.XORLanes 1024 1.110 > Float256Vector.ADDLanes 1024 1.033 > Float256Vector.MAXLanes 1024 1.156 > Float256Vector.MINLanes 1024 1.151 > Double256Vector.ADDLanes 1024 1.062 > Double256Vector.MAXLanes 1024 1.145 > Double256Vector.MINLanes 1024 1.140 > > This patch also adds 32-bit variants of SVE whileXX instruction with one more matching rule of `VectorMaskGen (ConvI2L src)`. 
So after this patch, we save one `sxtw` instruction for most VectorMaskGen cases, like below: > > sxtw x14, w14 > whilelo p0.s, xzr, x14 => whilelo p0.s, wzr, w14 Xiaohong Gong has updated the pull request incrementally with one additional commit since the last revision: Save vect_type to ReductionNode and VectorMaskOpNode ------------- Changes: - all: https://git.openjdk.org/jdk/pull/9037/files - new: https://git.openjdk.org/jdk/pull/9037/files/8bda7813..cafca904 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=9037&range=08 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=9037&range=07-08 Stats: 18 lines in 2 files changed: 10 ins; 3 del; 5 mod Patch: https://git.openjdk.org/jdk/pull/9037.diff Fetch: git fetch https://git.openjdk.org/jdk pull/9037/head:pull/9037 PR: https://git.openjdk.org/jdk/pull/9037 From xgong at openjdk.org Mon Jul 4 10:19:39 2022 From: xgong at openjdk.org (Xiaohong Gong) Date: Mon, 4 Jul 2022 10:19:39 GMT Subject: RFR: 8286941: Add mask IR for partial vector operations for ARM SVE [v8] In-Reply-To: References: Message-ID: <9ywx_qmwpCXRakLd0KQD2aQC-Kg11LHgOBGXu4VK_Ag=.75d36e21-ca63-4a79-b876-98560c4a1c4b@github.com> On Thu, 30 Jun 2022 17:51:34 GMT, Vladimir Kozlov wrote: >> Xiaohong Gong has updated the pull request incrementally with one additional commit since the last revision: >> >> Address review comments > > An other test failed in tier2: compiler/loopopts/superword/TestPickFirstMemoryState.java > Details are in RFE. Hi @vnkozlov, the fix has been pushed to the PR. I tested the failing case on an AVX-512 machine, and the failure is gone. Could you please run the tests again? Thanks a lot!
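
To make the "partial vector operation" idea from the PR description in this thread concrete, here is a small scalar model. It is a sketch under stated assumptions, not C2 or SVE code: `partial_mask` and `masked_add_reduce` are hypothetical helpers that only mimic what a `ptrue ... vl8` predicate and a predicated add reduction do.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Build a predicate for a register with hw_lanes lanes in which only the
// first active_lanes lanes are enabled (scalar analogue of `ptrue p0.s, vl8`).
inline std::vector<bool> partial_mask(std::size_t hw_lanes,
                                      std::size_t active_lanes) {
  assert(active_lanes <= hw_lanes);
  std::vector<bool> mask(hw_lanes, false);
  for (std::size_t i = 0; i < active_lanes; ++i) {
    mask[i] = true;
  }
  return mask;
}

// Add reduction that ignores lanes disabled by the predicate, so stale data
// in the upper part of the register cannot leak into the result.
inline int64_t masked_add_reduce(const std::vector<int32_t>& reg,
                                 const std::vector<bool>& mask) {
  assert(reg.size() == mask.size());
  int64_t sum = 0;
  for (std::size_t i = 0; i < reg.size(); ++i) {
    if (mask[i]) {
      sum += reg[i];
    }
  }
  return sum;
}
```

In this model the mask depends only on the two lane counts, which is why it is loop invariant: a caller can build it once before a loop and reuse it for every load and reduction of the same vector type.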
-------------

PR: https://git.openjdk.org/jdk/pull/9037

From shade at openjdk.org Mon Jul 4 10:37:42 2022
From: shade at openjdk.org (Aleksey Shipilev)
Date: Mon, 4 Jul 2022 10:37:42 GMT
Subject: RFR: 8289046: Undefined Behaviour in x86 class Assembler [v4]
In-Reply-To:
References:
Message-ID:

On Mon, 27 Jun 2022 12:35:44 GMT, Andrew Haley wrote:

>> All instances of type Register exhibit UB in the form of wild pointer (including null pointer) dereferences. This isn't very hard to fix: we should make Registers pointers to something rather than aliases of small integers.
>>
>> Here's an example of what was happening:
>>
>> ` rax->encoding();`
>>
>> Where rax is defined as `(Register *)0`.
>>
>> This patch changes things so that rax is now defined as a pointer to the start of a static array of RegisterImpl.
>>
>>
>> typedef const RegisterImpl* Register;
>> extern RegisterImpl all_Registers[RegisterImpl::number_of_declared_registers + 1] ;
>> inline constexpr Register RegisterImpl::first() { return all_Registers + 1; };
>> inline constexpr Register as_Register(int encoding) { return RegisterImpl::first() + encoding; }
>> constexpr Register rax = as_Register(0);
>
> Andrew Haley has updated the pull request incrementally with one additional commit since the last revision:
>
> More

It looks to me like this now subsumes JDK-8289060, but not exactly? For example, the PR for JDK-8289060 has a richer comment around `VMRegImpl::stack0`.

src/hotspot/share/code/vmreg.hpp line 90:

> 88: }
> 89: intptr_t value() const { return this - first(); }
> 90: static VMReg Bad() { return BAD_REG+first(); }

I was confused as to why it is `+first()`. We can probably do: `return as_VMReg(BAD_REG, true);`?
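
As an illustration of the pointer-into-a-static-array scheme quoted in this message, here is a minimal self-contained sketch. It is not the HotSpot code: `RegImpl`, `Reg`, `all_regs`, `as_Reg`, `noreg_demo` and `rax_demo` are hypothetical names, and the register count of 16 is an arbitrary assumption.

```cpp
#include <cassert>

class RegImpl;
typedef const RegImpl* Reg;

class RegImpl {
 public:
  enum { number_of_registers = 16 };  // assumed count, for illustration only
  static Reg first();
  int encoding() const;
};

// One extra element at index 0, so that the "no register" value with
// encoding -1 also points at a real object instead of being a wild pointer.
static const RegImpl all_regs[RegImpl::number_of_registers + 1] = {};

inline Reg RegImpl::first() { return all_regs + 1; }

// The encoding is recovered by pointer subtraction within the same array,
// which is well defined, unlike calling a member through (RegImpl*)0.
inline int RegImpl::encoding() const {
  return static_cast<int>(this - first());
}

inline Reg as_Reg(int encoding) { return RegImpl::first() + encoding; }

static const Reg noreg_demo = as_Reg(-1);  // points at all_regs[0]
static const Reg rax_demo = as_Reg(0);     // points at all_regs[1]
```

The same offset trick is presumably what the `BAD_REG+first()` expression discussed in the review relies on: a negative encoding is still a valid pointer as long as the array reserves a slot for it.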
------------- PR: https://git.openjdk.org/jdk/pull/9261 From jbhateja at openjdk.org Mon Jul 4 11:33:02 2022 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Mon, 4 Jul 2022 11:33:02 GMT Subject: [jdk19] RFR: 8287851: C2 crash: assert(t->meet(t0) == t) failed: Not monotonic In-Reply-To: <2aHaRy9DafC8EFgTZDZLAF5NqaF6ySZQJUgDHiauxT8=.21491c30-522e-47c7-8016-ca21ae112adb@github.com> References: <2aHaRy9DafC8EFgTZDZLAF5NqaF6ySZQJUgDHiauxT8=.21491c30-522e-47c7-8016-ca21ae112adb@github.com> Message-ID: On Mon, 4 Jul 2022 09:07:01 GMT, Tobias Hartmann wrote: >> Hi All, >> >> Patch fixes the assertion failure seen during conditional constant propagation optimization on account of >> non-convergence, this happens when type values (lattice) associated with IR node seen during iterative data flow analysis are not-monotonic. >> >> Problem was occurring due to incorrect result value range estimation by Value routines associated with Compress/ExpandBits IR nodes, non-constant mask lattice can take any value between _lo and _hi values, special handling for +ve mask value range is using count_leading_zeros to estimate the maximum bit width needed to accommodate the result. Since count_leading_zeros >> accepts a long argument there by sign-extending integer argument, hence for integer case we need to subtract 32 from the results to get correct value. >> >> Patch also fixes a typo resulting into a dead code reported by [JDK-8287855](https://bugs.openjdk.org/browse/JDK-8287855): Problem in compress_expand_identity. >> >> Failing unit test java/lang/CompressExpandTest.java has been removed from ProblemList.txt. >> >> Kindly review and share your feedback. >> >> Best Regards, >> Jatin > > Thanks. All tests passed. Thanks @TobiHartmann , @chhagedorn for reviews. 
------------- PR: https://git.openjdk.org/jdk19/pull/104 From jbhateja at openjdk.org Mon Jul 4 11:34:16 2022 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Mon, 4 Jul 2022 11:34:16 GMT Subject: [jdk19] Integrated: 8287851: C2 crash: assert(t->meet(t0) == t) failed: Not monotonic In-Reply-To: References: Message-ID: <3X59vBqcVgAqmc_KfIeJN9XXoFQhu0LqXLkDFQ-1vH4=.73c682fb-54ff-481d-9aa3-3f2715f72649@github.com> On Sat, 2 Jul 2022 18:51:13 GMT, Jatin Bhateja wrote: > Hi All, > > Patch fixes the assertion failure seen during conditional constant propagation optimization on account of > non-convergence, this happens when type values (lattice) associated with IR node seen during iterative data flow analysis are not-monotonic. > > Problem was occurring due to incorrect result value range estimation by Value routines associated with Compress/ExpandBits IR nodes, non-constant mask lattice can take any value between _lo and _hi values, special handling for +ve mask value range is using count_leading_zeros to estimate the maximum bit width needed to accommodate the result. Since count_leading_zeros > accepts a long argument there by sign-extending integer argument, hence for integer case we need to subtract 32 from the results to get correct value. > > Patch also fixes a typo resulting into a dead code reported by [JDK-8287855](https://bugs.openjdk.org/browse/JDK-8287855): Problem in compress_expand_identity. > > Failing unit test java/lang/CompressExpandTest.java has been removed from ProblemList.txt. > > Kindly review and share your feedback. > > Best Regards, > Jatin This pull request has now been integrated. 
Changeset: 1a271645 Author: Jatin Bhateja URL: https://git.openjdk.org/jdk19/commit/1a271645a84ac4d7d6570e739d42c05cc328891d Stats: 7 lines in 2 files changed: 4 ins; 2 del; 1 mod 8287851: C2 crash: assert(t->meet(t0) == t) failed: Not monotonic Reviewed-by: thartmann, chagedorn ------------- PR: https://git.openjdk.org/jdk19/pull/104 From aph at openjdk.org Mon Jul 4 12:54:43 2022 From: aph at openjdk.org (Andrew Haley) Date: Mon, 4 Jul 2022 12:54:43 GMT Subject: RFR: 8285790: AArch64: Merge C2 NEON and SVE matching rules In-Reply-To: <_l-poEFY80LRmKZyRhpxvSohm6nLv_ruaO1_WzKmTlQ=.9faff21d-d67c-42e0-8de7-be2ca9397b88@github.com> References: <_l-poEFY80LRmKZyRhpxvSohm6nLv_ruaO1_WzKmTlQ=.9faff21d-d67c-42e0-8de7-be2ca9397b88@github.com> Message-ID: <_ZcvC_kGQPJuD-WAwAocrGLlbHEF66ww-FgDhUl6av4=.06b039a0-ed2c-42cb-9e1e-c8f32c712789@github.com> On Fri, 1 Jul 2022 11:25:36 GMT, Andrew Haley wrote: >> **MOTIVATION** >> >> This is a big refactoring patch of merging rules in aarch64_sve.ad and >> aarch64_neon.ad. The motivation can also be found at [1]. >> >> Currently AArch64 has aarch64_sve.ad and aarch64_neon.ad to support SVE >> and NEON codegen respectively. 1) For SVE rules we use vReg operand to >> match VecA for an arbitrary length of vector type, when SVE is enabled; >> 2) For NEON rules we use vecX/vecD operands to match VecX/VecD for >> 128-bit/64-bit vectors, when SVE is not enabled. >> >> This separation looked clean at the time of introducing SVE support. >> However, there are two main drawbacks now. >> >> **Drawback-1**: NEON (Advanced SIMD) is the mandatory feature on AArch64 and >> SVE vector registers share the lower 128 bits with NEON registers. For >> some cases, even when SVE is enabled, we still prefer to match NEON >> rules and emit NEON instructions. 
>> >> **Drawback-2**: With more and more vector rules added to support VectorAPI, >> there are lots of rules in both two ad files with different predication >> conditions, e.g., different values of UseSVE or vector type/size. >> >> Examples can be found in [1]. These two drawbacks make the code less >> maintainable and increase the libjvm.so code size. >> >> **KEY UPDATES** >> >> In this patch, we mainly do two things, using generic vReg to match all >> NEON/SVE vector registers and merging NEON/SVE matching rules. >> >> - Update-1: Use generic vReg to match all NEON/SVE vector registers >> >> Two different approaches were considered, and we prefer to use generic >> vector solution but keep VecA operand for all >128-bit vectors. See the >> last slide in [1]. All the changes lie in the AArch64 backend. >> >> 1) Some helpers are updated in aarch64.ad to enable generic vector on >> AArch64. See vector_ideal_reg(), pd_specialize_generic_vector_operand(), >> is_reg2reg_move() and is_generic_vector(). >> >> 2) Operand vecA is created to match VecA register, and vReg is updated >> to match VecA/D/X registers dynamically. >> >> With the introduction of generic vReg, difference in register types >> between NEON rules and SVE rules can be eliminated, which makes it easy >> to merge these rules. >> >> - Update-2: Try to merge existing rules >> >> As updated in GensrcAdlc.gmk, new ad file, aarch64_vector.ad, is >> introduced to hold the grouped and merged matching rules. >> >> 1) Similar rules with difference in vector type/size can be merged into >> new rules, where different types and vector sizes are handled in the >> codegen part, e.g., vadd(). This resolves **Drawback-2**. >> >> 2) In most cases, we tend to emit NEON instructions for 128-bit vector >> operations on SVE platforms, e.g., vadd(). This resolves **Drawback-1**. >> >> It's important to note that there are some exceptions. 
>> >> Exception-1: For some rules, there are no direct NEON instructions, but >> exists simple SVE implementation due to newly added SVE ISA. Such rules >> include vloadconB, vmulL_neon, vminL_neon, vmaxL_neon, >> reduce_addF_le128b (4F case), reduce_and/or/xor_neon, reduce_minL_neon, >> reduce_maxL_neon, vcvtLtoF_neon, vcvtDtoI_neon, rearrange_HS_neon. >> >> Exception-2: Vector mask generation and operation rules are different >> because vector mask is stored in different types of registers between >> NEON and SVE, e.g., vmaskcmp_neon and vmask_truecount_neon rules. >> >> Exception-3: Shift right related rules are different because vector >> shift right instructions differ a bit between NEON and SVE. >> >> For these exceptions, we emit NEON or SVE code simply based on UseSVE >> options. >> >> **MINOR UPDATES and CODE REFACTORING** >> >> Since we've touched all lines of code during merging rules, we further >> do more minor updates and refactoring. >> >> - Reduce regmask bits >> >> Stack slot alignment is handled specially for scalable vector, which >> will firstly align to SlotsPerVecA, and then align to the real vector >> length. We should guarantee SlotsPerVecA is no bigger than the real >> vector length. Otherwise, unused stack space would be allocated. >> >> In AArch64 SVE, the vector length can be 128 to 2048 bits. However, >> SlotsPerVecA is currently set as 8, i.e. 8 * 32 = 256 bits. As a result, >> on a 128-bit SVE platform, the stack slot is aligned to 256 bits, >> leaving the half 128 bits unused. In this patch, we reduce SlotsPerVecA >> from 8 to 4. >> >> See the updates in register_aarch64.hpp, regmask.hpp and aarch64.ad >> (chunk1 and vectora_reg). >> >> - Refactor NEON/SVE vector op support check. >> >> Merge NEON and SVE vector supported check into one single function. To >> be consistent, SVE default size supported check now is relaxed from no >> less than 64 bits to the same condition as NEON's min_vector_size(), >> i.e. 
4 bytes and 4/2 booleans are also supported. This should be fine, >> as we assume at least we will emit NEON code for those small vectors, >> with unified rules. >> >> - Some notes for new rules >> >> 1) Since new rules are unique and it makes no sense to set different >> "ins_cost", we turn to use the default cost. >> >> 2) By default we use "pipe_slow" for matching rules in aarch64_vector.ad >> now. Hence, many SIMD pipeline classes at aarch64.ad become unused and >> can be removed. >> >> 3) Suffixes '_le128b/_gt128b' and '_neon/_sve' are appended in the >> matching rule names if needed. >> a) 'le128b' means the vector length is less than or equal to 128 bits. >> This rule can be matched on both NEON and 128-bit SVE. >> b) 'gt128b' means the vector length is greater than 128 bits. This rule >> can only be matched on SVE. >> c) 'neon' means this rule can only be matched on NEON, i.e. the >> generated instruction is not better than those in 128-bit SVE. >> d) 'sve' means this rule is only matched on SVE for all possible vector >> length, i.e. not limited to gt128b. >> >> Note-1: m4 file is not introduced because many duplications are highly >> reduced now. >> Note-2: We guess the code review for this big patch would probably take >> some time and we may need to merge latest code from master branch from >> time to time. We prefer to keep aarch64_neon/sve.ad and the >> corresponding m4 files for easy comparison and review. Of course, they >> will be finally removed after some solid reviews before integration. >> Note-3: Several other minor refactorings are done in this patch, but we >> cannot list all of them in the commit message. We have reviewed and >> tested the rules carefully to guarantee the quality. >> >> **TESTING** >> >> 1) Cross compilations on arm32/s390/pps/riscv passed. >> 2) tier1~3 jtreg passed on both x64 and aarch64 machines. 
>> 3) vector tests: all the test cases under the following directories can
>> pass on both NEON and SVE systems with max vector length 16/32/64 bytes.
>>
>> "test/hotspot/jtreg/compiler/vectorapi/"
>> "test/jdk/jdk/incubator/vector/"
>> "test/hotspot/jtreg/compiler/vectorization/"
>>
>> 4) Performance evaluation: we choose vector micro-benchmarks from
>> panama-vector:vectorIntrinsics [2] to evaluate the performance of this
>> patch. We've tested *MaxVectorTests.java cases on one 128-bit SVE
>> platform and one NEON platform, and didn't see any visible regression
>> with NEON and SVE. We will continue to verify more cases on other
>> platforms with NEON and different SVE vector sizes.
>>
>> **BENEFITS**
>>
>> The number of matching rules is reduced to ~ **42%**.
>> before: 373 (aarch64_neon.ad) + 380 (aarch64_sve.ad) = 753
>> after : 313 (aarch64_vector.ad)
>>
>> Code size for libjvm.so (release build) on aarch64 is reduced to ~ **96%**.
>> before: 25246528 B (commit 7905788e969)
>> after : 24208776 B (**nearly 1 MB reduction**)
>>
>> [1] http://cr.openjdk.java.net/~njian/8285790/JDK-8285790.pdf
>> [2] https://github.com/openjdk/panama-vector/tree/vectorIntrinsics/test/micro/org/openjdk/bench/jdk/incubator/vector/operation
>>
>> Co-Developed-by: Ningsheng Jian
>> Co-Developed-by: Eric Liu
>
> Aha! I was looking forward to this.
>
> On 7/1/22 11:46, Hao Sun wrote:
> > Note-1: m4 file is not introduced because many duplications are highly
> > reduced now.
>
> Yes, but there's still a lot of duplication. I'll make a few examples
> of where you should make simple changes that will usefully increase the
> level of abstraction. That will be a start.
> @theRealAph Thanks for your comment. Yes. There is still duplicate code. I can easily list several examples, such as the reduce-and/or/xor, vector shift ops and several reg with imm rules. We're open to keeping the m4 file.
>
> But I would suggest that we may put our attention firstly on 1) our implementation on generic vector registers and 2) the merged rules (in particular those we share the codegen for NEON only platform and 128-bit vector ops on SVE platform). After that we may discuss whether to use m4 file and how to implement it if needed.

We can do both: there's no sense in which one excludes the other, and we have time.

However, just putting aside for a moment the lack of useful abstraction mechanisms, I note that there's a lot of code like this:

    if (length_in_bytes <= 16) {
      // ... Neon
    } else {
      assert(UseSVE > 0, "must be sve");
      // ... SVE
    }

which is to say, there's an implicit assumption that if an operation can be done with Neon it will be, and SVE will only be used if not. What is the justification for that assumption?

-------------

PR: https://git.openjdk.org/jdk/pull/9346

From dnsimon at openjdk.org Mon Jul 4 13:24:57 2022
From: dnsimon at openjdk.org (Doug Simon)
Date: Mon, 4 Jul 2022 13:24:57 GMT
Subject: RFR: 8289687: [JVMCI] bug in HotSpotResolvedJavaMethodImpl.equals
Message-ID: <4U22P6UQk9wez--rPvoA7ql-_4gI3AMSlYkdQ7xppcw=.28bdbe28-09d0-4a9d-a2f4-67ad99da7186@github.com>

A bug[1] slipped in with [JDK-8289094](https://bugs.openjdk.org/browse/JDK-8289094) that broke `HotSpotResolvedJavaMethodImpl.equals`. This PR fixes and adds a test for it. The test was added to `TestResolvedJavaMethod` which was disabled (see [JDK-8249621](https://bugs.openjdk.org/browse/JDK-8249621)). This test class has been re-enabled and the 2 other failing tests in it (`canBeStaticallyBoundTest` and `asStackTraceElementTest`) have been fixed.
------------- Commit messages: - fixed jdk.vm.ci.hotspot.HotSpotResolvedJavaMethodImpl.equals(Object) Changes: https://git.openjdk.org/jdk/pull/9364/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=9364&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8289687 Stats: 67 lines in 5 files changed: 40 ins; 1 del; 26 mod Patch: https://git.openjdk.org/jdk/pull/9364.diff Fetch: git fetch https://git.openjdk.org/jdk pull/9364/head:pull/9364 PR: https://git.openjdk.org/jdk/pull/9364 From kvn at openjdk.org Mon Jul 4 14:47:26 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Mon, 4 Jul 2022 14:47:26 GMT Subject: RFR: 8289427: compiler/compilercontrol/jcmd/ClearDirectivesFileStackTest.java failed with null setting [v2] In-Reply-To: References: Message-ID: On Mon, 4 Jul 2022 06:48:56 GMT, KIRIYAMA Takuya wrote: >> The problem of JDK-8289427 is caused by using incorrect compiler settings when the auto generated INTRINSIC parameter is null. >> I fixed it to use the appropriate value if the argument of cmd was null. >> Please review this change. > > KIRIYAMA Takuya has updated the pull request incrementally with one additional commit since the last revision: > > 8289427: compiler/compilercontrol/jcmd/ClearDirectivesFileStackTest.java failed with null setting Marked as reviewed by kvn (Reviewer). ------------- PR: https://git.openjdk.org/jdk/pull/9318 From jbhateja at openjdk.org Mon Jul 4 15:46:55 2022 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Mon, 4 Jul 2022 15:46:55 GMT Subject: RFR: 8289186: Support predicated vector load/store operations over X86 AVX2 targets. In-Reply-To: References: Message-ID: On Thu, 30 Jun 2022 02:03:47 GMT, Vladimir Kozlov wrote: >> Hi All, >> >> [JDK-8283667](https://bugs.openjdk.org/browse/JDK-8283667) added the support to handle masked loads on non-predicated targets by blending the loaded contents with zero vector iff unmasked portion of load does not span beyond array bounds. 
>> >> X86 AVX2 offers direct predicated vector loads/store instruction for non-sub word type. >> >> This patch adds the efficient backend implementation for predicated memory operations over int/long/float/double vectors. >> >> Please find below the JMH micro stats with and without patch. >> >> >> >> System : Intel(R) Xeon(R) Platinum 8280 CPU @ 2.70GHz [28C 2S Cascadelake Server] >> >> Baseline: >> Benchmark (inSize) (outSize) Mode Cnt Score Error Units >> LoadMaskedIOOBEBenchmark.byteLoadArrayMaskIOOBE 1026 1152 thrpt 2 712.218 ops/ms >> LoadMaskedIOOBEBenchmark.doubleLoadArrayMaskIOOBE 1026 1152 thrpt 2 156.912 ops/ms >> LoadMaskedIOOBEBenchmark.floatLoadArrayMaskIOOBE 1026 1152 thrpt 2 255.814 ops/ms >> LoadMaskedIOOBEBenchmark.intLoadArrayMaskIOOBE 1026 1152 thrpt 2 267.688 ops/ms >> LoadMaskedIOOBEBenchmark.longLoadArrayMaskIOOBE 1026 1152 thrpt 2 140.957 ops/ms >> LoadMaskedIOOBEBenchmark.shortLoadArrayMaskIOOBE 1026 1152 thrpt 2 474.009 ops/ms >> >> >> With Opt: >> Benchmark (inSize) (outSize) Mode Cnt Score Error Units >> LoadMaskedIOOBEBenchmark.byteLoadArrayMaskIOOBE 1026 1152 thrpt 2 742.781 ops/ms >> LoadMaskedIOOBEBenchmark.doubleLoadArrayMaskIOOBE 1026 1152 thrpt 2 1241.021 ops/ms >> LoadMaskedIOOBEBenchmark.floatLoadArrayMaskIOOBE 1026 1152 thrpt 2 2333.311 ops/ms >> LoadMaskedIOOBEBenchmark.intLoadArrayMaskIOOBE 1026 1152 thrpt 2 3258.754 ops/ms >> LoadMaskedIOOBEBenchmark.longLoadArrayMaskIOOBE 1026 1152 thrpt 2 1757.192 ops/ms >> LoadMaskedIOOBEBenchmark.shortLoadArrayMaskIOOBE 1026 1152 thrpt 2 472.590 ops/ms >> >> >> Predicated memory operation over sub-word type will be handled in a subsequent patch. >> >> Kindly review and share your feedback. >> >> Best Regards, >> Jatin > > src/hotspot/cpu/x86/x86.ad line 1762: > >> 1760: break; >> 1761: case Op_LoadVectorMasked: >> 1762: if (!VM_Version::supports_avx512bw() && (is_subword_type(bt) || UseAVX < 1)) { > > With `UseAVX=0` we clear `supports_avx512bw`. 
So the test should be > > if (!VM_Version::supports_avx512bw() && is_subword_type(bt) || UseAVX < 1) > > > And may be naive question. Is VectorMaskGen is used for `mask` node creation? If so, why to have separate support checks for `LoadVectorMasked/StoreVectorMasked`? Hi Vladimir, Existing expression gets benefit of short-circuiting else we will need to evaluate two expressions for truly supported case. As of now VectorMaskGen is used for AVX3 targets, specially for partial in-lining for copy and vectorize compare and post vector loop processing. Partial inlining of copy is only enabled for sub-word types and for AVX2 we do not have sub-word handling yet, for vectorized compare partial in-lining we use an explicit threshold i.e. array length is >= 16, thus only sub-word types and 512 bit integer species qualify this threshold. I will be posting a subsequent patch with sub-word handling for masked load/stores over AVX2 after some performance analysis along with maskgen patterns for AVX2. ------------- PR: https://git.openjdk.org/jdk/pull/9324 From aph at openjdk.org Mon Jul 4 15:47:20 2022 From: aph at openjdk.org (Andrew Haley) Date: Mon, 4 Jul 2022 15:47:20 GMT Subject: RFR: 8289698: AArch64: Need to relativize extended_sp in frame Message-ID: <-LmNaWVbEO6N1UwcQ5Vm34wCy2OnR7oH6synUbwNh3o=.5cd64b0e-5333-4e1b-990c-969b10593848@github.com> With the addition of the extended_sp field in interpreter frames, we need to make sure it is de-relativized and re-relativized when freezing and thawing a vthread. 
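
A minimal sketch of what relativizing a frame slot means, assuming a frame is a run of machine words addressed from a frame pointer and that freezing copies those words to a different address. The slot index `kExtendedSpSlot`, the helper names, and the direction (offset while frozen, absolute pointer while running) are assumptions made for illustration; the actual field offsets and freeze/thaw code in HotSpot differ.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical slot index, relative to the frame pointer. Illustration only.
constexpr int kExtendedSpSlot = -1;

// Replace the absolute pointer stored in the slot with a word offset from fp,
// so the value stays meaningful after the frame is copied elsewhere.
inline void relativize_slot(intptr_t* fp, int slot) {
  intptr_t* absolute = reinterpret_cast<intptr_t*>(fp[slot]);
  fp[slot] = absolute - fp;
}

// Turn the stored word offset back into an absolute pointer, based on the
// frame pointer of the frame's current location.
inline void derelativize_slot(intptr_t* fp, int slot) {
  fp[slot] = reinterpret_cast<intptr_t>(fp + fp[slot]);
}

// Simulates "freeze" (relativize, then copy) followed by "thaw"
// (derelativize at the new location) and checks that the slot points at the
// same relative position in the copied frame.
inline bool round_trip_preserves_target() {
  intptr_t stack_a[8] = {0};
  intptr_t stack_b[8] = {0};

  intptr_t* fp_a = stack_a + 6;
  fp_a[kExtendedSpSlot] = reinterpret_cast<intptr_t>(stack_a + 2);

  relativize_slot(fp_a, kExtendedSpSlot);
  if (fp_a[kExtendedSpSlot] != -4) return false;  // 4 words below fp

  std::memcpy(stack_b, stack_a, sizeof stack_a);

  intptr_t* fp_b = stack_b + 6;
  derelativize_slot(fp_b, kExtendedSpSlot);
  return fp_b[kExtendedSpSlot] == reinterpret_cast<intptr_t>(stack_b + 2);
}
```

If the slot were copied as an absolute pointer, the copy would still point into the old stack; storing an offset avoids that, which is presumably why a newly added pointer field such as extended_sp needs the same treatment as the existing ones.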
------------- Commit messages: - 8289698: AArch64: Need to relativize extended_sp in frame Changes: https://git.openjdk.org/jdk/pull/9367/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=9367&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8289698 Stats: 2 lines in 1 file changed: 2 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/9367.diff Fetch: git fetch https://git.openjdk.org/jdk pull/9367/head:pull/9367 PR: https://git.openjdk.org/jdk/pull/9367 From jbhateja at openjdk.org Mon Jul 4 16:39:29 2022 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Mon, 4 Jul 2022 16:39:29 GMT Subject: RFR: 8289186: Support predicated vector load/store operations over X86 AVX2 targets. [v2] In-Reply-To: References: Message-ID: > Hi All, > > [JDK-8283667](https://bugs.openjdk.org/browse/JDK-8283667) added the support to handle masked loads on non-predicated targets by blending the loaded contents with zero vector iff unmasked portion of load does not span beyond array bounds. > > X86 AVX2 offers direct predicated vector loads/store instruction for non-sub word type. > > This patch adds the efficient backend implementation for predicated memory operations over int/long/float/double vectors. > > Please find below the JMH micro stats with and without patch. 
> > > > System : Intel(R) Xeon(R) Platinum 8280 CPU @ 2.70GHz [28C 2S Cascadelake Server] > > Baseline: > Benchmark (inSize) (outSize) Mode Cnt Score Error Units > LoadMaskedIOOBEBenchmark.byteLoadArrayMaskIOOBE 1026 1152 thrpt 2 712.218 ops/ms > LoadMaskedIOOBEBenchmark.doubleLoadArrayMaskIOOBE 1026 1152 thrpt 2 156.912 ops/ms > LoadMaskedIOOBEBenchmark.floatLoadArrayMaskIOOBE 1026 1152 thrpt 2 255.814 ops/ms > LoadMaskedIOOBEBenchmark.intLoadArrayMaskIOOBE 1026 1152 thrpt 2 267.688 ops/ms > LoadMaskedIOOBEBenchmark.longLoadArrayMaskIOOBE 1026 1152 thrpt 2 140.957 ops/ms > LoadMaskedIOOBEBenchmark.shortLoadArrayMaskIOOBE 1026 1152 thrpt 2 474.009 ops/ms > > > With Opt: > Benchmark (inSize) (outSize) Mode Cnt Score Error Units > LoadMaskedIOOBEBenchmark.byteLoadArrayMaskIOOBE 1026 1152 thrpt 2 742.781 ops/ms > LoadMaskedIOOBEBenchmark.doubleLoadArrayMaskIOOBE 1026 1152 thrpt 2 1241.021 ops/ms > LoadMaskedIOOBEBenchmark.floatLoadArrayMaskIOOBE 1026 1152 thrpt 2 2333.311 ops/ms > LoadMaskedIOOBEBenchmark.intLoadArrayMaskIOOBE 1026 1152 thrpt 2 3258.754 ops/ms > LoadMaskedIOOBEBenchmark.longLoadArrayMaskIOOBE 1026 1152 thrpt 2 1757.192 ops/ms > LoadMaskedIOOBEBenchmark.shortLoadArrayMaskIOOBE 1026 1152 thrpt 2 472.590 ops/ms > > > Predicated memory operation over sub-word type will be handled in a subsequent patch. > > Kindly review and share your feedback. > > Best Regards, > Jatin Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: 8289186: Review comments resolved. 
------------- Changes: - all: https://git.openjdk.org/jdk/pull/9324/files - new: https://git.openjdk.org/jdk/pull/9324/files/c5118c58..b3c193f4 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=9324&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=9324&range=00-01 Stats: 4 lines in 1 file changed: 1 ins; 0 del; 3 mod Patch: https://git.openjdk.org/jdk/pull/9324.diff Fetch: git fetch https://git.openjdk.org/jdk pull/9324/head:pull/9324 PR: https://git.openjdk.org/jdk/pull/9324 From jbhateja at openjdk.org Mon Jul 4 16:43:41 2022 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Mon, 4 Jul 2022 16:43:41 GMT Subject: RFR: 8289186: Support predicated vector load/store operations over X86 AVX2 targets. [v2] In-Reply-To: References: Message-ID: On Thu, 30 Jun 2022 01:46:43 GMT, Vladimir Kozlov wrote: >> Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: >> >> 8289186: Review comments resolved. > > src/hotspot/share/opto/vectorIntrinsics.cpp line 313: > >> 311: return true; >> 312: } >> 313: > > Why it is placed here without `is_supported` check? Comment does not explain it. Refined the check and updated the comment. 
-------------

PR: https://git.openjdk.org/jdk/pull/9324

From alanb at openjdk.org Mon Jul 4 19:48:42 2022
From: alanb at openjdk.org (Alan Bateman)
Date: Mon, 4 Jul 2022 19:48:42 GMT
Subject: RFR: 8289698: AArch64: Need to relativize extended_sp in frame
In-Reply-To: <-LmNaWVbEO6N1UwcQ5Vm34wCy2OnR7oH6synUbwNh3o=.5cd64b0e-5333-4e1b-990c-969b10593848@github.com>
References: <-LmNaWVbEO6N1UwcQ5Vm34wCy2OnR7oH6synUbwNh3o=.5cd64b0e-5333-4e1b-990c-969b10593848@github.com>
Message-ID: <8hToigki1SoIPDg3IJr-xwpDmxKyFldut7fJGnHUJNk=.012100b8-af12-4669-85e0-a913a374231b@github.com>

On Mon, 4 Jul 2022 15:39:40 GMT, Andrew Haley wrote:

> With the addition of the extended_sp field in interpreter frames, we need to make sure it is de-relativized and re-relativized when freezing and thawing a vthread.

I've skimmed over the changes in JDK-8288971 to understand where the extended SP is coming from, so I think this is okay. I've run tier1 & tier2 with this patch and the tests are passing again.

-------------

Marked as reviewed by alanb (Reviewer).

PR: https://git.openjdk.org/jdk/pull/9367

From dholmes at openjdk.org Mon Jul 4 22:00:27 2022
From: dholmes at openjdk.org (David Holmes)
Date: Mon, 4 Jul 2022 22:00:27 GMT
Subject: RFR: 8289698: AArch64: Need to relativize extended_sp in frame
In-Reply-To: <-LmNaWVbEO6N1UwcQ5Vm34wCy2OnR7oH6synUbwNh3o=.5cd64b0e-5333-4e1b-990c-969b10593848@github.com>
References: <-LmNaWVbEO6N1UwcQ5Vm34wCy2OnR7oH6synUbwNh3o=.5cd64b0e-5333-4e1b-990c-969b10593848@github.com>
Message-ID:

On Mon, 4 Jul 2022 15:39:40 GMT, Andrew Haley wrote:

> With the addition of the extended_sp field in interpreter frames, we need to make sure it is de-relativized and re-relativized when freezing and thawing a vthread.

I concur with Alan's analysis. We need this fix in pronto, or else a backout of the other changes.

Thanks.

-------------

Marked as reviewed by dholmes (Reviewer).
PR: https://git.openjdk.org/jdk/pull/9367 From haosun at openjdk.org Tue Jul 5 04:25:25 2022 From: haosun at openjdk.org (Hao Sun) Date: Tue, 5 Jul 2022 04:25:25 GMT Subject: RFR: 8285790: AArch64: Merge C2 NEON and SVE matching rules In-Reply-To: <_ZcvC_kGQPJuD-WAwAocrGLlbHEF66ww-FgDhUl6av4=.06b039a0-ed2c-42cb-9e1e-c8f32c712789@github.com> References: <_l-poEFY80LRmKZyRhpxvSohm6nLv_ruaO1_WzKmTlQ=.9faff21d-d67c-42e0-8de7-be2ca9397b88@github.com> <_ZcvC_kGQPJuD-WAwAocrGLlbHEF66ww-FgDhUl6av4=.06b039a0-ed2c-42cb-9e1e-c8f32c712789@github.com> Message-ID: On Mon, 4 Jul 2022 12:51:22 GMT, Andrew Haley wrote: > However, just putting aside for a moment the lack of useful abstraction mechanisms, I note that there's a lot of code like this: > > ``` > if (length_in_bytes <= 16) { > // ... Neon > } else { > assert(UseSVE > 0, "must be sve"); > // ... SVE > } > ``` > > which is to say, there's an implicit assumption that if an operation can be done with Neon it will be, and SVE will only be used if not. What is the justification for that assumption? Not exactly. It's only for common **64/128-bit unpredicated** vector operations, when NEON have equivalent instructions as SVE. Recall the **Drawback-1** and **Update-2 (part 2)** in the commit message. Besides the code pattern you mentioned, there are many pairs of rules with "**_le128b**" and "**_gt128b**" suffixes, e.g., vmulI_le128b() and vmulI_gt128b(). We use two rules mainly because different numbers of arguments are used. Otherwise, we tend to put them into one rule, which is your mentioned pattern, e.g., vadd(). The main reason we conduct this change lies in that from Neoverse V1 and N2 optimization guides, if the size fit, common NEON instructions are no slower than equivalent SVE instructions in latency and throughput. Note-1: In current aarch64_sve.ad file, there are already several rules under this rule, e.g., loadV16_vreg(), vroundFtoI(), insertI_le128bits(). 
There is an ongoing patch as well in [link](https://github.com/openjdk/jdk/pull/7999). This patch makes them more clear. Note-2: As we mentioned in the part 4 in **TESTING** section, we ran JMH testing on one SVE machine and didn't observe regression and we will do more measurement on different systems. ------------- PR: https://git.openjdk.org/jdk/pull/9346 From duke at openjdk.org Tue Jul 5 06:21:40 2022 From: duke at openjdk.org (KIRIYAMA Takuya) Date: Tue, 5 Jul 2022 06:21:40 GMT Subject: RFR: 8289427: compiler/compilercontrol/jcmd/ClearDirectivesFileStackTest.java failed with null setting [v2] In-Reply-To: References: Message-ID: On Mon, 4 Jul 2022 06:48:56 GMT, KIRIYAMA Takuya wrote: >> The problem of JDK-8289427 is caused by using incorrect compiler settings when the auto generated INTRINSIC parameter is null. >> I fixed it to use the appropriate value if the argument of cmd was null. >> Please review this change. > > KIRIYAMA Takuya has updated the pull request incrementally with one additional commit since the last revision: > > 8289427: compiler/compilercontrol/jcmd/ClearDirectivesFileStackTest.java failed with null setting Thank you for your reviews. I hope this fix integrated. ------------- PR: https://git.openjdk.org/jdk/pull/9318 From duke at openjdk.org Tue Jul 5 06:42:22 2022 From: duke at openjdk.org (KIRIYAMA Takuya) Date: Tue, 5 Jul 2022 06:42:22 GMT Subject: Integrated: 8289427: compiler/compilercontrol/jcmd/ClearDirectivesFileStackTest.java failed with null setting In-Reply-To: References: Message-ID: On Wed, 29 Jun 2022 06:47:07 GMT, KIRIYAMA Takuya wrote: > The problem of JDK-8289427 is caused by using incorrect compiler settings when the auto generated INTRINSIC parameter is null. > I fixed it to use the appropriate value if the argument of cmd was null. > Please review this change. This pull request has now been integrated. 
Changeset: 1b997db7 Author: KIRIYAMA Takuya Committer: Tobias Hartmann URL: https://git.openjdk.org/jdk/commit/1b997db734315f6cd08af94149e6622a8afbe88c Stats: 2 lines in 1 file changed: 0 ins; 0 del; 2 mod 8289427: compiler/compilercontrol/jcmd/ClearDirectivesFileStackTest.java failed with null setting Reviewed-by: kvn, thartmann ------------- PR: https://git.openjdk.org/jdk/pull/9318 From dnsimon at openjdk.org Tue Jul 5 07:18:33 2022 From: dnsimon at openjdk.org (Doug Simon) Date: Tue, 5 Jul 2022 07:18:33 GMT Subject: RFR: 8289687: [JVMCI] bug in HotSpotResolvedJavaMethodImpl.equals In-Reply-To: <4U22P6UQk9wez--rPvoA7ql-_4gI3AMSlYkdQ7xppcw=.28bdbe28-09d0-4a9d-a2f4-67ad99da7186@github.com> References: <4U22P6UQk9wez--rPvoA7ql-_4gI3AMSlYkdQ7xppcw=.28bdbe28-09d0-4a9d-a2f4-67ad99da7186@github.com> Message-ID: On Mon, 4 Jul 2022 12:30:57 GMT, Doug Simon wrote: > A bug[1] slipped in with [JDK-8289094](https://bugs.openjdk.org/browse/JDK-8289094) that broke `HotSpotResolvedJavaMethodImpl.equals`.This PR fixes and adds a test for it. The test was added to `TestResolvedJavaMethod` which was disabled (see [JDK-8249621](https://bugs.openjdk.org/browse/JDK-8249621)). This test class has been re-enabled and the 2 other failing tests in it (`canBeStaticallyBoundTest` and `asStackTraceElementTest`) have been fixed. src/jdk.internal.vm.ci/share/classes/jdk.vm.ci.hotspot/src/jdk/vm/ci/hotspot/HotSpotResolvedJavaMethodImpl.java line 169: > 167: if (obj instanceof HotSpotResolvedJavaMethodImpl) { > 168: HotSpotResolvedJavaMethodImpl that = (HotSpotResolvedJavaMethodImpl) obj; > 169: return that.getMethodPointer() == getMethodPointer(); This is the actual bug fix. 
-------------

PR: https://git.openjdk.org/jdk/pull/9364

From xgong at openjdk.org Tue Jul 5 07:56:07 2022
From: xgong at openjdk.org (Xiaohong Gong)
Date: Tue, 5 Jul 2022 07:56:07 GMT
Subject: RFR: 8289604: compiler/vectorapi/VectorLogicalOpIdentityTest.java failed on x86 AVX1 system
Message-ID: <2XOrrHylAft7P5DfwnNMdYbILAfTRaTrIpdWAK2PIJI=.c30dd7e1-d7b5-409f-94f6-db676b9841f9@github.com>

The sub-test "`testMaskAndZero()`" failed on x86 systems when
`UseAVX=1` with the IR check failure:

- counts: Graph contains wrong number of nodes:
* Regex 1: (\\d+(\\s){2}(StoreVector.*)+(\\s){2}===.*)
- Failed comparison: [found] 0 >= 1 [given]
- No nodes matched!

The root cause is that the `VectorMask.fromArray/intoArray` APIs
are not intrinsified when "`UseAVX=1`" for long type vectors,
for the following reasons:
1) The maximum vector size supported by the system is 128 bits for
integral vector operations when "`UseAVX=1`".
2) The match rules for `VectorLoadMaskNode/VectorStoreMaskNode`
are not supported for vectors with 2 elements (see [1]).

Note that `VectorMask.fromArray()` needs to be intrinsified
with "`LoadVector+VectorLoadMask`". And `VectorMask.intoArray()`
needs to be intrinsified with "`VectorStoreMask+StoreVector`".
If either "`VectorStoreMask`" or "`StoreVector`" is not supported by the
compiler backend, the corresponding API is not intrinsified.

Replacing the Long vector type with another integral type
in the test case fixes the issue.
[1] https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/x86/x86.ad#L1861 ------------- Commit messages: - 8289604: compiler/vectorapi/VectorLogicalOpIdentityTest.java failed on x86 AVX1 system Changes: https://git.openjdk.org/jdk/pull/9373/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=9373&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8289604 Stats: 3 lines in 1 file changed: 0 ins; 0 del; 3 mod Patch: https://git.openjdk.org/jdk/pull/9373.diff Fetch: git fetch https://git.openjdk.org/jdk pull/9373/head:pull/9373 PR: https://git.openjdk.org/jdk/pull/9373 From aph at openjdk.org Tue Jul 5 07:58:27 2022 From: aph at openjdk.org (Andrew Haley) Date: Tue, 5 Jul 2022 07:58:27 GMT Subject: Integrated: 8289698: AArch64: Need to relativize extended_sp in frame In-Reply-To: <-LmNaWVbEO6N1UwcQ5Vm34wCy2OnR7oH6synUbwNh3o=.5cd64b0e-5333-4e1b-990c-969b10593848@github.com> References: <-LmNaWVbEO6N1UwcQ5Vm34wCy2OnR7oH6synUbwNh3o=.5cd64b0e-5333-4e1b-990c-969b10593848@github.com> Message-ID: On Mon, 4 Jul 2022 15:39:40 GMT, Andrew Haley wrote: > With the addition of the extended_sp field in interpreter frames, we need to make sure it is de-relativized and re-relativized when freezing and thawing a vthread. This pull request has now been integrated. Changeset: a5934cdd Author: Andrew Haley URL: https://git.openjdk.org/jdk/commit/a5934cddca9b962d8e1b709de23c169904b95525 Stats: 2 lines in 1 file changed: 2 ins; 0 del; 0 mod 8289698: AArch64: Need to relativize extended_sp in frame Reviewed-by: alanb, dholmes ------------- PR: https://git.openjdk.org/jdk/pull/9367 From aph at openjdk.org Tue Jul 5 07:59:40 2022 From: aph at openjdk.org (Andrew Haley) Date: Tue, 5 Jul 2022 07:59:40 GMT Subject: RFR: 8289046: Undefined Behaviour in x86 class Assembler [v4] In-Reply-To: References: Message-ID: On Mon, 4 Jul 2022 10:34:01 GMT, Aleksey Shipilev wrote: > Looks to me this now subsumes JDK-8289060, but not exactly? 
For example, the PR for JDK-8289060 has a richer comment around `VMRegImpl::stack0`.

That's just a mistake: I'm not sure how it crept in.

-------------

PR: https://git.openjdk.org/jdk/pull/9261

From aph at openjdk.org Tue Jul 5 08:12:23 2022
From: aph at openjdk.org (Andrew Haley)
Date: Tue, 5 Jul 2022 08:12:23 GMT
Subject: RFR: 8289046: Undefined Behaviour in x86 class Assembler [v5]
In-Reply-To:
References:
Message-ID:

> All instances of type Register exhibit UB in the form of wild pointer (including null pointer) dereferences. This isn't very hard to fix: we should make Registers pointers to something rather than aliases of small integers.
>
> Here's an example of what was happening:
>
> ` rax->encoding();`
>
> Where rax is defined as `(Register *)0`.
>
> This patch changes things so that rax is now defined as a pointer to the start of a static array of RegisterImpl.
>
>
> typedef const RegisterImpl* Register;
> extern RegisterImpl all_Registers[RegisterImpl::number_of_declared_registers + 1] ;
> inline constexpr Register RegisterImpl::first() { return all_Registers + 1; };
> inline constexpr Register as_Register(int encoding) { return RegisterImpl::first() + encoding; }
> constexpr Register rax = as_Register(0);

Andrew Haley has updated the pull request incrementally with one additional commit since the last revision:

  Delete changes to hotspot/shared.
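
The pointer-into-a-static-array scheme described in the PR text above can be sketched in isolation as follows. This is a simplified stand-in, not the patch itself: the class body, the register count of 16, the `is_valid()` helper, and the use of the extra leading array slot for `noreg` (encoding -1) are assumptions made for illustration, and `constexpr` is dropped to keep the example short.

```cpp
#include <cassert>

// Simplified, hypothetical stand-in for HotSpot's RegisterImpl.
class RegisterImpl {
 public:
  static constexpr int number_of_declared_registers = 16;  // assumed count

  static const RegisterImpl* first();
  int encoding() const;   // index of this register; -1 for noreg
  bool is_valid() const;
};

typedef const RegisterImpl* Register;

// One extra leading element, so that encoding -1 still maps to a real object
// and no register is ever a null or otherwise invalid pointer.
RegisterImpl all_registers[RegisterImpl::number_of_declared_registers + 1];

inline const RegisterImpl* RegisterImpl::first() { return all_registers + 1; }

// The encoding is recovered by pointer subtraction within the array,
// which is well-defined, instead of by casting `this` to an integer.
inline int RegisterImpl::encoding() const {
  return static_cast<int>(this - first());
}

inline bool RegisterImpl::is_valid() const {
  int e = encoding();
  return 0 <= e && e < number_of_declared_registers;
}

inline Register as_Register(int encoding) {
  assert(-1 <= encoding &&
         encoding < RegisterImpl::number_of_declared_registers);
  return RegisterImpl::first() + encoding;
}

const Register noreg = as_Register(-1);
const Register rax   = as_Register(0);
const Register rcx   = as_Register(1);
```

With this layout, `rax->encoding()` calls a member function on a pointer to a live array element, so the call no longer depends on dereferencing a small integer cast to a pointer.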
------------- Changes: - all: https://git.openjdk.org/jdk/pull/9261/files - new: https://git.openjdk.org/jdk/pull/9261/files/8f965c9f..df457ba6 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=9261&range=04 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=9261&range=03-04 Stats: 29 lines in 2 files changed: 3 ins; 15 del; 11 mod Patch: https://git.openjdk.org/jdk/pull/9261.diff Fetch: git fetch https://git.openjdk.org/jdk pull/9261/head:pull/9261 PR: https://git.openjdk.org/jdk/pull/9261 From aph at openjdk.org Tue Jul 5 08:12:24 2022 From: aph at openjdk.org (Andrew Haley) Date: Tue, 5 Jul 2022 08:12:24 GMT Subject: RFR: 8289046: Undefined Behaviour in x86 class Assembler [v4] In-Reply-To: References: Message-ID: On Mon, 4 Jul 2022 10:32:51 GMT, Aleksey Shipilev wrote: >> Andrew Haley has updated the pull request incrementally with one additional commit since the last revision: >> >> More > > src/hotspot/share/code/vmreg.hpp line 90: > >> 88: } >> 89: intptr_t value() const { return this - first(); } >> 90: static VMReg Bad() { return BAD_REG+first(); } > > I was confused as to why is it `+first()`. We can probably do: `return as_VMReg(BAD_REG, true);`? > I believe there are some compiler directives somewhere to silent the compiler of `nullptr` dereference, should we delete those also? It depends on exactly what they are. I'll have a look. ------------- PR: https://git.openjdk.org/jdk/pull/9261 From rpressler at openjdk.org Tue Jul 5 08:31:31 2022 From: rpressler at openjdk.org (Ron Pressler) Date: Tue, 5 Jul 2022 08:31:31 GMT Subject: [jdk19] RFR: 8288949: serviceability/jvmti/vthread/ContStackDepthTest/ContStackDepthTest.java failing [v3] In-Reply-To: References: Message-ID: > Please review the following bug fix: > > `Continuation.enterSpecial` is a generated special nmethod (albeit not a Java method), with a well-known frame layout that calls `Continuation.enter`. 
> > Because it is compiled, it resolves the call to `Continuation.enter` to its compiled version, if available. But this results in the compiled `Continuation.enter` being called even when the thread is in interp_only_mode. > > This change does three things: > > 1. When entering interp_only_mode, `Continuation::set_cont_fastpath_thread_state` will clear enterSpecial's resolved callsite to Continuation.enter. > 2. In interp_only_mode, `SharedRuntime::resolve_static_call_C` will return `Continuation.enter`'s c2i entry rather than `verified_code_entry`. > 3. In interp_only_mode, the c2i stub will not patch the callsite. > > This fix isn't perfect, because a different thread, not in interp_only_mode, might patch the call. A longer-term solution is to create an "interpreted" version of `enterSpecial` and supporting an ad-hoc deoptimization. See https://bugs.openjdk.org/browse/JDK-8289128 > > > Passes tiers 1-4 and Loom tiers 1-5. Ron Pressler has updated the pull request incrementally with two additional commits since the last revision: - Add an "i2i" entry to enterSpecial - Fix comment ------------- Changes: - all: https://git.openjdk.org/jdk19/pull/66/files - new: https://git.openjdk.org/jdk19/pull/66/files/4680aed2..7323f635 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk19&pr=66&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk19&pr=66&range=01-02 Stats: 220 lines in 13 files changed: 166 ins; 33 del; 21 mod Patch: https://git.openjdk.org/jdk19/pull/66.diff Fetch: git fetch https://git.openjdk.org/jdk19 pull/66/head:pull/66 PR: https://git.openjdk.org/jdk19/pull/66 From rpressler at openjdk.org Tue Jul 5 08:31:32 2022 From: rpressler at openjdk.org (Ron Pressler) Date: Tue, 5 Jul 2022 08:31:32 GMT Subject: [jdk19] RFR: 8288949: serviceability/jvmti/vthread/ContStackDepthTest/ContStackDepthTest.java failing [v2] In-Reply-To: References: Message-ID: On Sat, 25 Jun 2022 01:23:47 GMT, Ron Pressler wrote: >> Please review the following bug fix: >> >> 
`Continuation.enterSpecial` is a generated special nmethod (albeit not a Java method), with a well-known frame layout that calls `Continuation.enter`. >> >> Because it is compiled, it resolves the call to `Continuation.enter` to its compiled version, if available. But this results in the compiled `Continuation.enter` being called even when the thread is in interp_only_mode. >> >> This change does three things: >> >> 1. When entering interp_only_mode, `Continuation::set_cont_fastpath_thread_state` will clear enterSpecial's resolved callsite to Continuation.enter. >> 2. In interp_only_mode, `SharedRuntime::resolve_static_call_C` will return `Continuation.enter`'s c2i entry rather than `verified_code_entry`. >> 3. In interp_only_mode, the c2i stub will not patch the callsite. >> >> This fix isn't perfect, because a different thread, not in interp_only_mode, might patch the call. A longer-term solution is to create an "interpreted" version of `enterSpecial` and supporting an ad-hoc deoptimization. See https://bugs.openjdk.org/browse/JDK-8289128 >> >> >> Passes tiers 1-4 and Loom tiers 1-5. > > Ron Pressler has updated the pull request incrementally with one additional commit since the last revision: > > Revert "Remove outdated comment" > > This reverts commit 8f571d76e34bc64ceb31894184fba4b909e8fbfe. Trying a new approach of having another entry into `enterSpecial`, used only when in interp-only-mode, and where the call to `Continuation.enter` always resolves to its interpreted version. This requires more platform-specific code, and also makes the frame appear not `frame::safe_for_sender` when at that callsite, but losing an async poll when in interp_only_mode doesn't seem to be a big issue, and the problem can be easily fixed as JFR is too eager to call `frame::safe_for_sender`. Passes tiers 1-4 as well as Loom tiers 1-5. 
-------------

PR: https://git.openjdk.org/jdk19/pull/66

From jiefu at openjdk.org Tue Jul 5 09:21:42 2022
From: jiefu at openjdk.org (Jie Fu)
Date: Tue, 5 Jul 2022 09:21:42 GMT
Subject: RFR: 8289604: compiler/vectorapi/VectorLogicalOpIdentityTest.java failed on x86 AVX1 system
In-Reply-To: <2XOrrHylAft7P5DfwnNMdYbILAfTRaTrIpdWAK2PIJI=.c30dd7e1-d7b5-409f-94f6-db676b9841f9@github.com>
References: <2XOrrHylAft7P5DfwnNMdYbILAfTRaTrIpdWAK2PIJI=.c30dd7e1-d7b5-409f-94f6-db676b9841f9@github.com>
Message-ID:

On Tue, 5 Jul 2022 07:47:58 GMT, Xiaohong Gong wrote:

> The sub-test "`testMaskAndZero()`" failed on x86 systems when
> `UseAVX=1` with the IR check failure:
>
> - counts: Graph contains wrong number of nodes:
> * Regex 1: (\\d+(\\s){2}(StoreVector.*)+(\\s){2}===.*)
> - Failed comparison: [found] 0 >= 1 [given]
> - No nodes matched!
>
> The root cause is that the `VectorMask.fromArray/intoArray` APIs
> are not intrinsified when "`UseAVX=1`" for long type vectors,
> for the following reasons:
> 1) The maximum vector size supported by the system is 128 bits for
> integral vector operations when "`UseAVX=1`".
> 2) The match rules for `VectorLoadMaskNode/VectorStoreMaskNode`
> are not supported for vectors with 2 elements (see [1]).
>
> Note that `VectorMask.fromArray()` needs to be intrinsified
> with "`LoadVector+VectorLoadMask`". And `VectorMask.intoArray()`
> needs to be intrinsified with "`VectorStoreMask+StoreVector`".
> If either "`VectorStoreMask`" or "`StoreVector`" is not supported by the
> compiler backend, the corresponding API is not intrinsified.
>
> Replacing the Long vector type with another integral type
> in the test case fixes the issue.
>
> [1] https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/x86/x86.ad#L1861

Looks good to me.

-------------

Marked as reviewed by jiefu (Reviewer).
PR: https://git.openjdk.org/jdk/pull/9373

From duke at openjdk.org Tue Jul 5 09:41:26 2022
From: duke at openjdk.org (kristylee88)
Date: Tue, 5 Jul 2022 09:41:26 GMT
Subject: RFR: 8289604: compiler/vectorapi/VectorLogicalOpIdentityTest.java failed on x86 AVX1 system
In-Reply-To: <2XOrrHylAft7P5DfwnNMdYbILAfTRaTrIpdWAK2PIJI=.c30dd7e1-d7b5-409f-94f6-db676b9841f9@github.com>
References: <2XOrrHylAft7P5DfwnNMdYbILAfTRaTrIpdWAK2PIJI=.c30dd7e1-d7b5-409f-94f6-db676b9841f9@github.com>
Message-ID:

On Tue, 5 Jul 2022 07:47:58 GMT, Xiaohong Gong wrote:

> The sub-test "`testMaskAndZero()`" failed on x86 systems when
> `UseAVX=1` with the IR check failure:
>
>  - counts: Graph contains wrong number of nodes:
>    * Regex 1: (\\d+(\\s){2}(StoreVector.*)+(\\s){2}===.*)
>      - Failed comparison: [found] 0 >= 1 [given]
>      - No nodes matched!
>
> The root cause is that the `VectorMask.fromArray/intoArray` APIs
> are not intrinsified when "`UseAVX=1`" for long type vectors,
> for the following reasons:
> 1) The maximum vector size supported by the system is 128 bits for
> integral vector operations when "`UseAVX=1`".
> 2) The match rules of `VectorLoadMaskNode/VectorStoreMaskNode`
> are not supported for vectors with 2 elements (see [1]).
>
> Note that `VectorMask.fromArray()` needs to be intrinsified
> with "`LoadVector+VectorLoadMask`", and `VectorMask.intoArray()`
> needs to be intrinsified with "`VectorStoreMask+StoreVector`".
> If either "`VectorStoreMask`" or "`StoreVector`" is not supported by the
> compiler backend, the corresponding API is not intrinsified.
>
> Replacing the vector type from Long to other integral types
> in the test case can fix the issue.
>
> [1] https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/x86/x86.ad#L1861

Marked as reviewed by kristylee88 at github.com (no known OpenJDK username).
------------- PR: https://git.openjdk.org/jdk/pull/9373 From rehn at openjdk.org Tue Jul 5 12:44:45 2022 From: rehn at openjdk.org (Robbin Ehn) Date: Tue, 5 Jul 2022 12:44:45 GMT Subject: [jdk19] RFR: 8288949: serviceability/jvmti/vthread/ContStackDepthTest/ContStackDepthTest.java failing [v3] In-Reply-To: References: Message-ID: On Tue, 5 Jul 2022 08:31:31 GMT, Ron Pressler wrote: >> Please review the following bug fix: >> >> `Continuation.enterSpecial` is a generated special nmethod (albeit not a Java method), with a well-known frame layout that calls `Continuation.enter`. >> >> Because it is compiled, it resolves the call to `Continuation.enter` to its compiled version, if available. But this results in the compiled `Continuation.enter` being called even when the thread is in interp_only_mode. >> >> This change does three things: >> >> 1. When entering interp_only_mode, `Continuation::set_cont_fastpath_thread_state` will clear enterSpecial's resolved callsite to Continuation.enter. >> 2. In interp_only_mode, `SharedRuntime::resolve_static_call_C` will return `Continuation.enter`'s c2i entry rather than `verified_code_entry`. >> 3. In interp_only_mode, the c2i stub will not patch the callsite. >> >> This fix isn't perfect, because a different thread, not in interp_only_mode, might patch the call. A longer-term solution is to create an "interpreted" version of `enterSpecial` and supporting an ad-hoc deoptimization. See https://bugs.openjdk.org/browse/JDK-8289128 >> >> >> Passes tiers 1-4 and Loom tiers 1-5. > > Ron Pressler has updated the pull request incrementally with two additional commits since the last revision: > > - Add an "i2i" entry to enterSpecial > - Fix comment I think this is much better. I'll give it another round tomorrow. 
------------- PR: https://git.openjdk.org/jdk19/pull/66 From coleenp at openjdk.org Tue Jul 5 16:39:24 2022 From: coleenp at openjdk.org (Coleen Phillimore) Date: Tue, 5 Jul 2022 16:39:24 GMT Subject: RFR: 8278479: RunThese test failure with +UseHeavyMonitors and +VerifyHeavyMonitors In-Reply-To: <2smVdrxSBqWvHWeHLSPI4T1MiXN5p-3WmeC6c5-4ERc=.f93f73ef-f5f5-4223-947f-503b924c1568@github.com> References: <2smVdrxSBqWvHWeHLSPI4T1MiXN5p-3WmeC6c5-4ERc=.f93f73ef-f5f5-4223-947f-503b924c1568@github.com> Message-ID: On Fri, 1 Jul 2022 20:16:06 GMT, Dean Long wrote: >> This change adds a null check before calling into Runtime1::monitorenter when -XX:+UseHeavyMonitors is set. There's a null check in the C2 and interpreter code before calling the runtime function but not C1. >> >> Tested with tier1-7 (a little of 8) and built on most non-oracle platforms as well. > > src/hotspot/cpu/aarch64/c1_LIRAssembler_aarch64.cpp line 2563: > >> 2561: if (op->info() != NULL) { >> 2562: int null_check_offset = __ offset(); >> 2563: __ null_check(obj, -1); > > Suggestion: > > __ null_check(obj); Yes, that would be better but it leads to a compilation error on macosx-aarch64 that null_check is ambiguous: src/hotspot/cpu/aarch64/c1_LIRAssembler_aarch64.cpp:2563:10: error: call to member function 'null_check' is ambiguous [2022-06-30T00:09:19,094Z] __ null_check(obj); [2022-06-30T00:09:19,094Z] ~~~^~~~~~~~~~ src/hotspot/cpu/aarch64/macroAssembler_aarch64.hpp:602:16: note: candidate function [2022-06-30T00:09:19,094Z] virtual void null_check(Register reg, int offset = -1); [2022-06-30T00:09:19,094Z] ^ src/hotspot/cpu/aarch64/c1_MacroAssembler_aarch64.hpp:109:8: note: candidate function [2022-06-30T00:09:19,094Z] void null_check(Register r, Label *Lnull = NULL) { MacroAssembler::null_check(r); } [2022-06-30T00:09:19,094Z] ^ > src/hotspot/cpu/arm/c1_LIRAssembler_arm.cpp line 2438: > >> 2436: int null_check_offset = __ offset(); >> 2437: __ null_check(obj); >> 2438: 
add_debug_info_for_null_check(null_check_offset, op->info()); > > Is this equivalent to the following? > > add_debug_info_for_null_check_here(op->info()); > __ null_check(obj); I don't know. Is it better and preferable? ------------- PR: https://git.openjdk.org/jdk/pull/9339 From coleenp at openjdk.org Tue Jul 5 17:07:28 2022 From: coleenp at openjdk.org (Coleen Phillimore) Date: Tue, 5 Jul 2022 17:07:28 GMT Subject: RFR: 8278479: RunThese test failure with +UseHeavyMonitors and +VerifyHeavyMonitors [v2] In-Reply-To: References: Message-ID: > This change adds a null check before calling into Runtime1::monitorenter when -XX:+UseHeavyMonitors is set. There's a null check in the C2 and interpreter code before calling the runtime function but not C1. > > Tested with tier1-7 (a little of 8) and built on most non-oracle platforms as well. Coleen Phillimore has updated the pull request incrementally with one additional commit since the last revision: Improve c1 code. ------------- Changes: - all: https://git.openjdk.org/jdk/pull/9339/files - new: https://git.openjdk.org/jdk/pull/9339/files/912dc5cd..116e2b5a Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=9339&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=9339&range=00-01 Stats: 12 lines in 6 files changed: 0 ins; 6 del; 6 mod Patch: https://git.openjdk.org/jdk/pull/9339.diff Fetch: git fetch https://git.openjdk.org/jdk pull/9339/head:pull/9339 PR: https://git.openjdk.org/jdk/pull/9339 From duke at openjdk.org Tue Jul 5 17:54:33 2022 From: duke at openjdk.org (Evgeny Astigeevich) Date: Tue, 5 Jul 2022 17:54:33 GMT Subject: RFR: 8280481: Duplicated stubs to interpreter for static calls In-Reply-To: References: <9N1GcHDRvyX1bnPrRcyw96zWIgrrAm4mfrzp8dQ-BBk=.6d55c5fd-7d05-4058-99b6-7d40a92450bf@github.com> Message-ID: On Fri, 17 Jun 2022 09:25:18 GMT, Andrew Haley wrote: >>> Based on your numbers (bytes saved / number of methods) I believe we're saving 16 bytes per method. >> >> How did you get 16? 
>> dotty arm64: 820544 / 4592 ≈ 179
>>
>>> How much more is there? What can we do with stubs besides duplicated static stubs removal?
>>
>> For arm64 we have: moving a pointer to metadata to a register and moving the address of the interpreter to a register.
>>
>>     0x0000ffff79bd2560: isb ; {static_stub}
>>     0x0000ffff79bd2564: mov x12, #0x388 // #904
>>         ; {metadata({method} {0x0000ffff18400388} 'error' '(ILjava/lang/String;)V' in 'Test')}
>>     0x0000ffff79bd2568: movk x12, #0x1840, lsl #16
>>     0x0000ffff79bd256c: movk x12, #0xffff, lsl #32
>>     0x0000ffff79bd2570: mov x8, #0xe58c // #58764
>>     0x0000ffff79bd2574: movk x8, #0x793b, lsl #16
>>     0x0000ffff79bd2578: movk x8, #0xffff, lsl #32
>>     0x0000ffff79bd257c: br x8
>>
>> If we never patch the branch to the interpreter, we can optimize it at link time either to a direct branch or an adrp based far jump. I also created https://bugs.openjdk.org/browse/JDK-8286142 to reduce metadata mov instructions.
>>
>>> Is it possible (theoretically) to move the stub out of the calling method to share it between methods?
>>
>> It is possible but it complicates CodeCache maintenance. Stubs use a pointer to metadata. When a class and methods are unloaded, we will need to invalidate all corresponding stubs.
>>
>> I can check with benchmarks how many stubs can be shared among methods.
>
>> If we never patch the branch to the interpreter, we can optimize it at link time either to a direct branch or an adrp based far jump. I also created https://bugs.openjdk.org/browse/JDK-8286142 to reduce metadata mov instructions.
>
> If we emit the address of the interpreter once, at the start of the stub section, we can replace the branch to the interpreter with
> `ldr rscratch1, adr; br rscratch1`.

Hi Andrew (@theRealAph),

Your comments are usually highly useful and help to identify missed issues. Do you have any further comments?
Thanks, Evgeny ------------- PR: https://git.openjdk.org/jdk/pull/8816 From kvn at openjdk.org Tue Jul 5 18:25:39 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Tue, 5 Jul 2022 18:25:39 GMT Subject: RFR: 8286941: Add mask IR for partial vector operations for ARM SVE [v9] In-Reply-To: References: Message-ID: On Mon, 4 Jul 2022 10:19:36 GMT, Xiaohong Gong wrote: >> VectorAPI SVE backend supports vector operations whose vector length is smaller than the max vector length that the current hardware can support. We call them partial vector operations. For some partial operations like vector load/store and the reductions, we need to generate a mask based on the real vector length and use it to control the operations to make sure the results are correct. >> >> For example, if the user defines an IntVector with 256-bit species, and runs it on a SVE hardware that supports 512-bit as the max vector size, all the 256-bit int vector operations are partial. And a mask that all the higher lanes than the real vector length are set to 0 is generated for some ops. >> >> Currently the mask is generated in the backend that is together with the code generation for each op in the match rule. This will generate many duplicate instructions for operations that have the same vector type. Besides, the mask generation is loop invariant which could be hoisted outside of the loop. >> >> Here is an example for vector load and add reduction inside a loop: >> >> ptrue p0.s, vl8 ; mask generation >> ld1w {z16.s}, p0/z, [x14] ; load vector >> >> ptrue p0.s, vl8 ; mask generation >> uaddv d17, p0, z16.s ; add reduction >> smov x14, v17.s[0] >> >> As we can see the mask generation code "`ptrue`" is duplicated. To improve it, this patch generates the mask IR and adds it to the partial vector ops before code generation. The duplicate mask generation instructions can be optimized out by gvn and hoisted outside of the loop. 
>> >> Note that for masked vector operations, there is no need to generate additional mask even though the vector length is smaller than the max vector register size, as the original higher input mask bits have been cleared out. >> >> Here is the performance gain for the 256-bit vector reductions work on an SVE 512-bit system: >> >> Benchmark size Gain >> Byte256Vector.ADDLanes 1024 0.999 >> Byte256Vector.ANDLanes 1024 1.065 >> Byte256Vector.MAXLanes 1024 1.064 >> Byte256Vector.MINLanes 1024 1.062 >> Byte256Vector.ORLanes 1024 1.072 >> Byte256Vector.XORLanes 1024 1.041 >> Short256Vector.ADDLanes 1024 1.017 >> Short256Vector.ANDLanes 1024 1.044 >> Short256Vector.MAXLanes 1024 1.049 >> Short256Vector.MINLanes 1024 1.049 >> Short256Vector.ORLanes 1024 1.089 >> Short256Vector.XORLanes 1024 1.047 >> Int256Vector.ADDLanes 1024 1.045 >> Int256Vector.ANDLanes 1024 1.078 >> Int256Vector.MAXLanes 1024 1.123 >> Int256Vector.MINLanes 1024 1.129 >> Int256Vector.ORLanes 1024 1.078 >> Int256Vector.XORLanes 1024 1.072 >> Long256Vector.ADDLanes 1024 1.059 >> Long256Vector.ANDLanes 1024 1.101 >> Long256Vector.MAXLanes 1024 1.079 >> Long256Vector.MINLanes 1024 1.099 >> Long256Vector.ORLanes 1024 1.098 >> Long256Vector.XORLanes 1024 1.110 >> Float256Vector.ADDLanes 1024 1.033 >> Float256Vector.MAXLanes 1024 1.156 >> Float256Vector.MINLanes 1024 1.151 >> Double256Vector.ADDLanes 1024 1.062 >> Double256Vector.MAXLanes 1024 1.145 >> Double256Vector.MINLanes 1024 1.140 >> >> This patch also adds 32-bit variants of SVE whileXX instruction with one more matching rule of `VectorMaskGen (ConvI2L src)`. So after this patch, we save one `sxtw` instruction for most VectorMaskGen cases, like below: >> >> sxtw x14, w14 >> whilelo p0.s, xzr, x14 => whilelo p0.s, wzr, w14 > > Xiaohong Gong has updated the pull request incrementally with one additional commit since the last revision: > > Save vect_type to ReductionNode and VectorMaskOpNode I started new testing. 
-------------

PR: https://git.openjdk.org/jdk/pull/9037

From kvn at openjdk.org Tue Jul 5 19:52:43 2022
From: kvn at openjdk.org (Vladimir Kozlov)
Date: Tue, 5 Jul 2022 19:52:43 GMT
Subject: RFR: 8289604: compiler/vectorapi/VectorLogicalOpIdentityTest.java failed on x86 AVX1 system
In-Reply-To: <2XOrrHylAft7P5DfwnNMdYbILAfTRaTrIpdWAK2PIJI=.c30dd7e1-d7b5-409f-94f6-db676b9841f9@github.com>
References: <2XOrrHylAft7P5DfwnNMdYbILAfTRaTrIpdWAK2PIJI=.c30dd7e1-d7b5-409f-94f6-db676b9841f9@github.com>
Message-ID:

On Tue, 5 Jul 2022 07:47:58 GMT, Xiaohong Gong wrote:

> The sub-test "`testMaskAndZero()`" failed on x86 systems when
> `UseAVX=1` with the IR check failure:
>
>  - counts: Graph contains wrong number of nodes:
>    * Regex 1: (\\d+(\\s){2}(StoreVector.*)+(\\s){2}===.*)
>      - Failed comparison: [found] 0 >= 1 [given]
>      - No nodes matched!
>
> The root cause is that the `VectorMask.fromArray/intoArray` APIs
> are not intrinsified when "`UseAVX=1`" for long type vectors,
> for the following reasons:
> 1) The maximum vector size supported by the system is 128 bits for
> integral vector operations when "`UseAVX=1`".
> 2) The match rules of `VectorLoadMaskNode/VectorStoreMaskNode`
> are not supported for vectors with 2 elements (see [1]).
>
> Note that `VectorMask.fromArray()` needs to be intrinsified
> with "`LoadVector+VectorLoadMask`", and `VectorMask.intoArray()`
> needs to be intrinsified with "`VectorStoreMask+StoreVector`".
> If either "`VectorStoreMask`" or "`StoreVector`" is not supported by the
> compiler backend, the corresponding API is not intrinsified.
>
> Replacing the vector type from Long to other integral types
> in the test case can fix the issue.
>
> [1] https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/x86/x86.ad#L1861

Including @chhagedorn for discussion.

I was wondering why our testing with `UseAVX=1` did not catch this issue (the test passed). But then I remembered that the IR framework skips testing if such a flag is used by testing. It is not on the whitelist: https://github.com/openjdk/jdk/blob/master/test/hotspot/jtreg/compiler/lib/ir_framework/TestFramework.java#L104

We are adding more IR framework testing for vectors. I think it is important to test them with different CPU features **if needed**.

https://github.com/openjdk/jdk/pull/8999 added a filter to run some sub-tests if a CPU feature is present/absent. But it relies on the testing infrastructure executing the test on a corresponding machine. That is not reliable (how many machines are left with only AVX1?).

I suggest allowing a test to add flags to the whitelist with which it is allowed to run. This (together with #8999 and `@requires`) will allow a test's author to specify the range of CPU features which can be used for the test and make sure they will be run.

-------------

PR: https://git.openjdk.org/jdk/pull/9373

From dnsimon at openjdk.org Tue Jul 5 18:27:39 2022
From: dnsimon at openjdk.org (Doug Simon)
Date: Tue, 5 Jul 2022 18:27:39 GMT
Subject: Integrated: 8289687: [JVMCI] bug in HotSpotResolvedJavaMethodImpl.equals
In-Reply-To: <4U22P6UQk9wez--rPvoA7ql-_4gI3AMSlYkdQ7xppcw=.28bdbe28-09d0-4a9d-a2f4-67ad99da7186@github.com>
References: <4U22P6UQk9wez--rPvoA7ql-_4gI3AMSlYkdQ7xppcw=.28bdbe28-09d0-4a9d-a2f4-67ad99da7186@github.com>
Message-ID: <_ocooF118-k04Q0d3aW4yrSbLKMsVTf3-3opIYtsJ9s=.bd9aa04f-75d4-4d09-abad-3dd77254867d@github.com>

On Mon, 4 Jul 2022 12:30:57 GMT, Doug Simon wrote:

> A bug[1] slipped in with [JDK-8289094](https://bugs.openjdk.org/browse/JDK-8289094) that broke `HotSpotResolvedJavaMethodImpl.equals`. This PR fixes and adds a test for it. The test was added to `TestResolvedJavaMethod` which was disabled (see [JDK-8249621](https://bugs.openjdk.org/browse/JDK-8249621)). This test class has been re-enabled and the 2 other failing tests in it (`canBeStaticallyBoundTest` and `asStackTraceElementTest`) have been fixed.

This pull request has now been integrated.
Changeset: c45d613f
Author:    Doug Simon
URL:       https://git.openjdk.org/jdk/commit/c45d613faa8b8658c714513da89852f1f9ff0a4a
Stats:     67 lines in 5 files changed: 40 ins; 1 del; 26 mod

8289687: [JVMCI] bug in HotSpotResolvedJavaMethodImpl.equals

Reviewed-by: kvn

-------------

PR: https://git.openjdk.org/jdk/pull/9364

From kvn at openjdk.org Tue Jul 5 18:22:24 2022
From: kvn at openjdk.org (Vladimir Kozlov)
Date: Tue, 5 Jul 2022 18:22:24 GMT
Subject: RFR: 8289687: [JVMCI] bug in HotSpotResolvedJavaMethodImpl.equals
In-Reply-To: <4U22P6UQk9wez--rPvoA7ql-_4gI3AMSlYkdQ7xppcw=.28bdbe28-09d0-4a9d-a2f4-67ad99da7186@github.com>
References: <4U22P6UQk9wez--rPvoA7ql-_4gI3AMSlYkdQ7xppcw=.28bdbe28-09d0-4a9d-a2f4-67ad99da7186@github.com>
Message-ID:

On Mon, 4 Jul 2022 12:30:57 GMT, Doug Simon wrote:

> A bug[1] slipped in with [JDK-8289094](https://bugs.openjdk.org/browse/JDK-8289094) that broke `HotSpotResolvedJavaMethodImpl.equals`. This PR fixes and adds a test for it. The test was added to `TestResolvedJavaMethod` which was disabled (see [JDK-8249621](https://bugs.openjdk.org/browse/JDK-8249621)). This test class has been re-enabled and the 2 other failing tests in it (`canBeStaticallyBoundTest` and `asStackTraceElementTest`) have been fixed.

Looks good.

-------------

Marked as reviewed by kvn (Reviewer).

PR: https://git.openjdk.org/jdk/pull/9364

From kvn at openjdk.org Tue Jul 5 18:50:51 2022
From: kvn at openjdk.org (Vladimir Kozlov)
Date: Tue, 5 Jul 2022 18:50:51 GMT
Subject: RFR: 8289186: Support predicated vector load/store operations over X86 AVX2 targets. [v2]
In-Reply-To:
References:
Message-ID:

On Mon, 4 Jul 2022 16:39:29 GMT, Jatin Bhateja wrote:

>> Hi All,
>>
>> [JDK-8283667](https://bugs.openjdk.org/browse/JDK-8283667) added support for handling masked loads on non-predicated targets by blending the loaded contents with a zero vector iff the unmasked portion of the load does not span beyond the array bounds.
>> >> X86 AVX2 offers direct predicated vector loads/store instruction for non-sub word type. >> >> This patch adds the efficient backend implementation for predicated memory operations over int/long/float/double vectors. >> >> Please find below the JMH micro stats with and without patch. >> >> >> >> System : Intel(R) Xeon(R) Platinum 8280 CPU @ 2.70GHz [28C 2S Cascadelake Server] >> >> Baseline: >> Benchmark (inSize) (outSize) Mode Cnt Score Error Units >> LoadMaskedIOOBEBenchmark.byteLoadArrayMaskIOOBE 1026 1152 thrpt 2 712.218 ops/ms >> LoadMaskedIOOBEBenchmark.doubleLoadArrayMaskIOOBE 1026 1152 thrpt 2 156.912 ops/ms >> LoadMaskedIOOBEBenchmark.floatLoadArrayMaskIOOBE 1026 1152 thrpt 2 255.814 ops/ms >> LoadMaskedIOOBEBenchmark.intLoadArrayMaskIOOBE 1026 1152 thrpt 2 267.688 ops/ms >> LoadMaskedIOOBEBenchmark.longLoadArrayMaskIOOBE 1026 1152 thrpt 2 140.957 ops/ms >> LoadMaskedIOOBEBenchmark.shortLoadArrayMaskIOOBE 1026 1152 thrpt 2 474.009 ops/ms >> >> >> With Opt: >> Benchmark (inSize) (outSize) Mode Cnt Score Error Units >> LoadMaskedIOOBEBenchmark.byteLoadArrayMaskIOOBE 1026 1152 thrpt 2 742.781 ops/ms >> LoadMaskedIOOBEBenchmark.doubleLoadArrayMaskIOOBE 1026 1152 thrpt 2 1241.021 ops/ms >> LoadMaskedIOOBEBenchmark.floatLoadArrayMaskIOOBE 1026 1152 thrpt 2 2333.311 ops/ms >> LoadMaskedIOOBEBenchmark.intLoadArrayMaskIOOBE 1026 1152 thrpt 2 3258.754 ops/ms >> LoadMaskedIOOBEBenchmark.longLoadArrayMaskIOOBE 1026 1152 thrpt 2 1757.192 ops/ms >> LoadMaskedIOOBEBenchmark.shortLoadArrayMaskIOOBE 1026 1152 thrpt 2 472.590 ops/ms >> >> >> Predicated memory operation over sub-word type will be handled in a subsequent patch. >> >> Kindly review and share your feedback. >> >> Best Regards, >> Jatin > > Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: > > 8289186: Review comments resolved. 
src/hotspot/cpu/x86/x86.ad line 1767:

> 1765:       break;
> 1766:     case Op_StoreVectorMasked:
> 1767:       if (!VM_Version::supports_avx512bw() && (is_subword_type(bt) || UseAVX < 1)) {

Why duplicate the same check? And you did not answer my suggestion about modifying the check.

src/hotspot/share/opto/vectorIntrinsics.cpp line 313:

> 311:   if (!is_supported && (sopc == Op_StoreVectorMasked || sopc == Op_LoadVectorMasked)) {
> 312:     return true;
> 313:   }

This is still unclear to me. As I understand it, `the upfront checks` you mention are the checks at lines 270 and 286, so we know that `StoreVectorMasked` and `LoadVectorMasked` are supported. The second part of your comment talks about `non-predicated targets`. Does that mean the real check should be the following?

    if (Matcher::has_predicated_vectors()) {
      ....
    } else if (sopc == Op_StoreVectorMasked || sopc == Op_LoadVectorMasked) {
      // your comment
      return true;
    }

Or am I still missing what you are trying to do here?

src/hotspot/share/opto/vectornode.hpp line 907:

> 905:   StoreVectorMaskedNode(Node* c, Node* mem, Node* dst, Node* src, const TypePtr* at, Node* mask)
> 906:    : StoreVectorNode(c, mem, dst, at, src) {
> 907:     assert(mask->bottom_type()->isa_vectmask(), "sanity");

Why was the assert added before? And why can you remove it now?

src/jdk.incubator.vector/share/classes/jdk/incubator/vector/ByteVector.java line 3923:

> 3921:     // End of low-level memory operations.
> 3922:
> 3923:     @ForceInline

Why this change? Was it missing before, or did you find it based on testing? What are the criteria for adding `@ForceInline`?
-------------

PR: https://git.openjdk.org/jdk/pull/9324

From dnsimon at openjdk.org Tue Jul 5 18:27:38 2022
From: dnsimon at openjdk.org (Doug Simon)
Date: Tue, 5 Jul 2022 18:27:38 GMT
Subject: RFR: 8289687: [JVMCI] bug in HotSpotResolvedJavaMethodImpl.equals
In-Reply-To:
References: <4U22P6UQk9wez--rPvoA7ql-_4gI3AMSlYkdQ7xppcw=.28bdbe28-09d0-4a9d-a2f4-67ad99da7186@github.com>
Message-ID: <43Rqb3hvM8c-UPEyZJAVtGz91lg-jyM-uvbqdtOolGw=.0c4168cd-fe8a-4a36-898c-376be8489e30@github.com>

On Tue, 5 Jul 2022 18:19:12 GMT, Vladimir Kozlov wrote:

>> A bug[1] slipped in with [JDK-8289094](https://bugs.openjdk.org/browse/JDK-8289094) that broke `HotSpotResolvedJavaMethodImpl.equals`. This PR fixes and adds a test for it. The test was added to `TestResolvedJavaMethod` which was disabled (see [JDK-8249621](https://bugs.openjdk.org/browse/JDK-8249621)). This test class has been re-enabled and the 2 other failing tests in it (`canBeStaticallyBoundTest` and `asStackTraceElementTest`) have been fixed.
>
> Looks good.

Thanks for the review @vnkozlov .

-------------

PR: https://git.openjdk.org/jdk/pull/9364

From kvn at openjdk.org Tue Jul 5 19:11:31 2022
From: kvn at openjdk.org (Vladimir Kozlov)
Date: Tue, 5 Jul 2022 19:11:31 GMT
Subject: RFR: 8289604: compiler/vectorapi/VectorLogicalOpIdentityTest.java failed on x86 AVX1 system
In-Reply-To: <2XOrrHylAft7P5DfwnNMdYbILAfTRaTrIpdWAK2PIJI=.c30dd7e1-d7b5-409f-94f6-db676b9841f9@github.com>
References: <2XOrrHylAft7P5DfwnNMdYbILAfTRaTrIpdWAK2PIJI=.c30dd7e1-d7b5-409f-94f6-db676b9841f9@github.com>
Message-ID:

On Tue, 5 Jul 2022 07:47:58 GMT, Xiaohong Gong wrote:

> The sub-test "`testMaskAndZero()`" failed on x86 systems when
> `UseAVX=1` with the IR check failure:
>
>  - counts: Graph contains wrong number of nodes:
>    * Regex 1: (\\d+(\\s){2}(StoreVector.*)+(\\s){2}===.*)
>      - Failed comparison: [found] 0 >= 1 [given]
>      - No nodes matched!
>
> The root cause is that the `VectorMask.fromArray/intoArray` APIs
> are not intrinsified when "`UseAVX=1`" for long type vectors,
> for the following reasons:
> 1) The maximum vector size supported by the system is 128 bits for
> integral vector operations when "`UseAVX=1`".
> 2) The match rules of `VectorLoadMaskNode/VectorStoreMaskNode`
> are not supported for vectors with 2 elements (see [1]).
>
> Note that `VectorMask.fromArray()` needs to be intrinsified
> with "`LoadVector+VectorLoadMask`", and `VectorMask.intoArray()`
> needs to be intrinsified with "`VectorStoreMask+StoreVector`".
> If either "`VectorStoreMask`" or "`StoreVector`" is not supported by the
> compiler backend, the corresponding API is not intrinsified.
>
> Replacing the vector type from Long to other integral types
> in the test case can fix the issue.
>
> [1] https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/x86/x86.ad#L1861

I submitted testing.

-------------

PR: https://git.openjdk.org/jdk/pull/9373

From phh at openjdk.org Tue Jul 5 20:47:58 2022
From: phh at openjdk.org (Paul Hohensee)
Date: Tue, 5 Jul 2022 20:47:58 GMT
Subject: RFR: 8280481: Duplicated stubs to interpreter for static calls [v2]
In-Reply-To:
References: <9N1GcHDRvyX1bnPrRcyw96zWIgrrAm4mfrzp8dQ-BBk=.6d55c5fd-7d05-4058-99b6-7d40a92450bf@github.com>
Message-ID: <6-SqG67oF4oG2WCqTq-udI2aeXEyDPyw31po646cjt4=.6d0a11fd-f6fe-4906-95d4-b82ac14f5f66@github.com>

On Wed, 29 Jun 2022 14:50:59 GMT, Evgeny Astigeevich wrote:

>> ## Problem
>> Calls of Java methods have stubs to the interpreter for the cases when an invoked Java method is not compiled. Calls of static Java methods and final Java methods have statically bound information about a callee during compilation. Such calls can share stubs to the interpreter.
>>
>> Each stub to the interpreter has a relocation record (accessed via `relocInfo`) which provides the address of the stub and the address of its owner. `relocInfo` has an offset which is an offset from the previously known relocatable address. The address of a stub is calculated as the address provided by the previous `relocInfo` plus the offset.
>>
>> Each Java call has:
>> - A relocation for a call site.
>> - A relocation for a stub to the interpreter.
>> - A stub to the interpreter.
>> - If far jumps are used (arm64 case):
>>   - A trampoline relocation.
>>   - A trampoline.
>>
>> We cannot avoid creating relocations. They are needed to support patching call sites.
>> With shared stubs there will be multiple relocations having the same stub address but different owners' addresses.
>> If we try to generate relocations as we go, there will be a case which requires negative offsets:
>>
>>     reloc1 ---> 0x0: stub1
>>     reloc2 ---> 0x4: stub2 (reloc2.addr = reloc1.addr + reloc2.offset = 0x0 + 4)
>>     reloc3 ---> 0x0: stub1 (reloc3.addr = reloc2.addr + reloc3.offset = 0x4 - 4)
>>
>>
>> `CodeSection` does not support negative offsets. It [assumes](https://github.com/openjdk/jdk/blob/master/src/hotspot/share/asm/codeBuffer.hpp#L195) that the addresses relocations point at grow upward.
>> Negative offsets reduce the offset range by half. This can increase the number of filler records, the empty `relocInfo` records used to reduce offset values. Also, negative offsets are only needed for `static_stub_type`; the other 13 types don't need them.
>>
>> ## Solution
>> In this PR creation of stubs is done in two stages. First we collect requests for creating shared stubs: a callee `ciMethod*` and an offset of a call in `CodeBuffer` (see [src/hotspot/share/asm/codeBuffer.hpp](https://github.com/openjdk/jdk/pull/8816/files#diff-deb8ab083311ba60c0016dc34d6518579bbee4683c81e8d348982bac897fe8ae)).
Then we have the finalisation phase (see [src/hotspot/share/ci/ciEnv.cpp](https://github.com/openjdk/jdk/pull/8816/files#diff-7c032de54e85754d39e080fd24d49b7469543b163f54229eb0631c6b1bf26450)), where `CodeBuffer::finalize_stubs()` creates shared stubs in `CodeBuffer`: a stub and multiple relocations sharing it. The first relocation will have positive offset. The rest will have zero offsets. This approach does not need negative offsets. As creation of relocations and stubs is platform dependent, `CodeBuffer::finalize_stubs()` calls `CodeBuffer::pd_finalize_stubs()` where platforms should put their code. >> >> This PR provides implementations for x86, x86_64 and aarch64. [src/hotspot/share/asm/codeBuffer.inline.hpp](https://github.com/openjdk/jdk/pull/8816/files#diff-c268e3719578f2980edaa27c0eacbe9f620124310108eb65d0f765212c7042eb) provides the `emit_shared_stubs_to_interp` template which x86, x86_64 and aarch64 platforms use. Other platforms can use it too. Platforms supporting shared stubs to the interpreter must have `CodeBuffer::supports_shared_stubs()` returning `true`. >> >> ## Results >> **Results from [Renaissance 0.14.0](https://github.com/renaissance-benchmarks/renaissance/releases/tag/v0.14.0)** >> Note: 'Nmethods with shared stubs' is the total number of nmethods counted during benchmark's run. 'Final # of nmethods' is a number of nmethods in CodeCache when JVM exited. 
>> - AArch64 >> >> +------------------+-------------+----------------------------+---------------------+ >> | Benchmark | Saved bytes | Nmethods with shared stubs | Final # of nmethods | >> +------------------+-------------+----------------------------+---------------------+ >> | dotty | 820544 | 4592 | 18872 | >> | dec-tree | 405280 | 2580 | 22335 | >> | naive-bayes | 392384 | 2586 | 21184 | >> | log-regression | 362208 | 2450 | 20325 | >> | als | 306048 | 2226 | 18161 | >> | finagle-chirper | 262304 | 2087 | 12675 | >> | movie-lens | 250112 | 1937 | 13617 | >> | gauss-mix | 173792 | 1262 | 10304 | >> | finagle-http | 164320 | 1392 | 11269 | >> | page-rank | 155424 | 1175 | 10330 | >> | chi-square | 140384 | 1028 | 9480 | >> | akka-uct | 115136 | 541 | 3941 | >> | reactors | 43264 | 335 | 2503 | >> | scala-stm-bench7 | 42656 | 326 | 3310 | >> | philosophers | 36576 | 256 | 2902 | >> | scala-doku | 35008 | 231 | 2695 | >> | rx-scrabble | 32416 | 273 | 2789 | >> | future-genetic | 29408 | 260 | 2339 | >> | scrabble | 27968 | 225 | 2477 | >> | par-mnemonics | 19584 | 168 | 1689 | >> | fj-kmeans | 19296 | 156 | 1647 | >> | scala-kmeans | 18080 | 140 | 1629 | >> | mnemonics | 17408 | 143 | 1512 | >> +------------------+-------------+----------------------------+---------------------+ >> >> - X86_64 >> >> +------------------+-------------+----------------------------+---------------------+ >> | Benchmark | Saved bytes | Nmethods with shared stubs | Final # of nmethods | >> +------------------+-------------+----------------------------+---------------------+ >> | dotty | 337065 | 4403 | 19135 | >> | dec-tree | 183045 | 2559 | 22071 | >> | naive-bayes | 176460 | 2450 | 19782 | >> | log-regression | 162555 | 2410 | 20648 | >> | als | 121275 | 1980 | 17179 | >> | movie-lens | 111915 | 1842 | 13020 | >> | finagle-chirper | 106350 | 1947 | 12726 | >> | gauss-mix | 81975 | 1251 | 10474 | >> | finagle-http | 80895 | 1523 | 12294 | >> | page-rank | 68940 | 1146 | 10124 | >> | 
chi-square | 62130 | 974 | 9315 | >> | akka-uct | 50220 | 555 | 4263 | >> | reactors | 23385 | 371 | 2544 | >> | philosophers | 17625 | 259 | 2865 | >> | scala-stm-bench7 | 17235 | 295 | 3230 | >> | scala-doku | 15600 | 214 | 2698 | >> | rx-scrabble | 14190 | 262 | 2770 | >> | future-genetic | 13155 | 253 | 2318 | >> | scrabble | 12300 | 217 | 2352 | >> | fj-kmeans | 8985 | 157 | 1616 | >> | par-mnemonics | 8535 | 155 | 1684 | >> | scala-kmeans | 8250 | 138 | 1624 | >> | mnemonics | 7485 | 134 | 1522 | >> +------------------+-------------+----------------------------+---------------------+ >> >> >> **Testing: fastdebug and release builds for x86, x86_64 and aarch64** >> - `tier1`...`tier4`: Passed >> - `hotspot/jtreg/compiler/sharedstubs`: Passed > > Evgeny Astigeevich has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains 20 additional commits since the last revision: > > - Merge branch 'master' into JDK-8280481C > - Use call offset instead of caller pc > - Simplify test > - Fix x86 build failures > - Remove UseSharedStubs and clarify shared stub use cases > - Make SharedStubToInterpRequest ResourceObj and set initial size of SharedStubToInterpRequests to 8 > - Update copyright year and add Unimplemented guards > - Set UseSharedStubs to true for X86 > - Set UseSharedStubs to true for AArch64 > - Fix x86 build failure > - ... and 10 more: https://git.openjdk.org/jdk/compare/eee4bf15...da3bfb5b Lgtm. ------------- Marked as reviewed by phh (Reviewer). 
PR: https://git.openjdk.org/jdk/pull/8816 From duke at openjdk.org Tue Jul 5 20:53:39 2022 From: duke at openjdk.org (Evgeny Astigeevich) Date: Tue, 5 Jul 2022 20:53:39 GMT Subject: Integrated: 8280481: Duplicated stubs to interpreter for static calls In-Reply-To: <9N1GcHDRvyX1bnPrRcyw96zWIgrrAm4mfrzp8dQ-BBk=.6d55c5fd-7d05-4058-99b6-7d40a92450bf@github.com> References: <9N1GcHDRvyX1bnPrRcyw96zWIgrrAm4mfrzp8dQ-BBk=.6d55c5fd-7d05-4058-99b6-7d40a92450bf@github.com> Message-ID: On Fri, 20 May 2022 16:27:51 GMT, Evgeny Astigeevich wrote: > ## Problem > Calls of Java methods have stubs to the interpreter for the cases when an invoked Java method is not compiled. Calls of static Java methods and final Java methods have statically bound information about a callee during compilation. Such calls can share stubs to the interpreter. > > Each stub to the interpreter has a relocation record (accessed via `relocInfo`) which provides the address of the stub and the address of its owner. `relocInfo` has an offset which is an offset from the previously known relocatable address. The address of a stub is calculated as the address provided by the previous `relocInfo` plus the offset. > > Each Java call has: > - A relocation for a call site. > - A relocation for a stub to the interpreter. > - A stub to the interpreter. > - If far jumps are used (arm64 case): > - A trampoline relocation. > - A trampoline. > > We cannot avoid creating relocations. They are needed to support patching call sites. > With shared stubs there will be multiple relocations having the same stub address but different owners' addresses. > If we try to generate relocations as we go there will be a case which requires negative offsets: > > reloc1 ---> 0x0: stub1 > reloc2 ---> 0x4: stub2 (reloc2.addr = reloc1.addr + reloc2.offset = 0x0 + 4) > reloc3 ---> 0x0: stub1 (reloc3.addr = reloc2.addr + reloc3.offset = 0x4 - 4) > > > `CodeSection` does not support negative offsets. 
It [assumes](https://github.com/openjdk/jdk/blob/master/src/hotspot/share/asm/codeBuffer.hpp#L195) that the addresses relocations point at grow upward. > Negative offsets reduce the offset range by half. This can increase the number of filler records, the empty `relocInfo` records used to reduce offset values. Also, negative offsets are only needed for `static_stub_type`; the other 13 types don't need them. > > ## Solution > In this PR, creation of stubs is done in two stages. First we collect requests for creating shared stubs: a callee `ciMethod*` and an offset of a call in `CodeBuffer` (see [src/hotspot/share/asm/codeBuffer.hpp](https://github.com/openjdk/jdk/pull/8816/files#diff-deb8ab083311ba60c0016dc34d6518579bbee4683c81e8d348982bac897fe8ae)). Then we have the finalisation phase (see [src/hotspot/share/ci/ciEnv.cpp](https://github.com/openjdk/jdk/pull/8816/files#diff-7c032de54e85754d39e080fd24d49b7469543b163f54229eb0631c6b1bf26450)), where `CodeBuffer::finalize_stubs()` creates shared stubs in `CodeBuffer`: a stub and multiple relocations sharing it. The first relocation will have a positive offset. The rest will have zero offsets. This approach does not need negative offsets. As creation of relocations and stubs is platform dependent, `CodeBuffer::finalize_stubs()` calls `CodeBuffer::pd_finalize_stubs()` where platforms should put their code. > > This PR provides implementations for x86, x86_64 and aarch64. [src/hotspot/share/asm/codeBuffer.inline.hpp](https://github.com/openjdk/jdk/pull/8816/files#diff-c268e3719578f2980edaa27c0eacbe9f620124310108eb65d0f765212c7042eb) provides the `emit_shared_stubs_to_interp` template which x86, x86_64 and aarch64 platforms use. Other platforms can use it too. Platforms supporting shared stubs to the interpreter must have `CodeBuffer::supports_shared_stubs()` returning `true`. 
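The two-stage scheme described in the Solution section can be sketched with a toy model. This is not HotSpot code: `StubRequest`, `Reloc`, and this free-standing `finalize_stubs` are hypothetical names chosen for illustration, plain integer offsets stand in for code addresses, and call-site and trampoline relocations are ignored. It only aims to show why grouping relocations by stub keeps every relocation offset non-negative.

```cpp
#include <cstdint>
#include <map>
#include <vector>

// Hypothetical toy model; none of these names are HotSpot's.
struct StubRequest {
    int callee_id;         // stands in for the callee ciMethod*
    uint32_t call_offset;  // offset of the call site in the code buffer
};

struct Reloc {
    uint32_t call_offset;  // owner: the call site this record belongs to
    uint32_t stub_addr;    // address of the stub the record points at
    uint32_t delta;        // offset from the previous record's stub address
};

// Stage 2: emit one stub per distinct callee and place all relocations
// that share a stub next to each other, so deltas never go negative.
std::vector<Reloc> finalize_stubs(const std::vector<StubRequest>& requests,
                                  uint32_t stubs_begin, uint32_t stub_size) {
    // Group call offsets by callee, remembering first-seen order.
    std::map<int, std::vector<uint32_t>> by_callee;
    std::vector<int> order;
    for (const StubRequest& r : requests) {
        if (by_callee.find(r.callee_id) == by_callee.end()) {
            order.push_back(r.callee_id);
        }
        by_callee[r.callee_id].push_back(r.call_offset);
    }

    std::vector<Reloc> relocs;
    uint32_t next_stub = stubs_begin;  // where the next stub is emitted
    uint32_t prev_addr = 0;            // previously known relocatable address
    for (int callee : order) {
        uint32_t stub_addr = next_stub;
        next_stub += stub_size;
        for (uint32_t call : by_callee.at(callee)) {
            // stub_addr >= prev_addr holds because stubs are emitted upward.
            relocs.push_back({call, stub_addr, stub_addr - prev_addr});
            prev_addr = stub_addr;
        }
    }
    return relocs;
}
```

For the three calls in the example from the Problem section (stub1, stub2, stub1), with assumed values `stubs_begin = 0x40` and `stub_size = 4`, the records come out grouped as stub1, stub1, stub2 with offsets 0x40, 0 and 4, so no negative offset is needed.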
> > ## Results > **Results from [Renaissance 0.14.0](https://github.com/renaissance-benchmarks/renaissance/releases/tag/v0.14.0)** > Note: 'Nmethods with shared stubs' is the total number of nmethods counted during benchmark's run. 'Final # of nmethods' is a number of nmethods in CodeCache when JVM exited. > - AArch64 > > +------------------+-------------+----------------------------+---------------------+ > | Benchmark | Saved bytes | Nmethods with shared stubs | Final # of nmethods | > +------------------+-------------+----------------------------+---------------------+ > | dotty | 820544 | 4592 | 18872 | > | dec-tree | 405280 | 2580 | 22335 | > | naive-bayes | 392384 | 2586 | 21184 | > | log-regression | 362208 | 2450 | 20325 | > | als | 306048 | 2226 | 18161 | > | finagle-chirper | 262304 | 2087 | 12675 | > | movie-lens | 250112 | 1937 | 13617 | > | gauss-mix | 173792 | 1262 | 10304 | > | finagle-http | 164320 | 1392 | 11269 | > | page-rank | 155424 | 1175 | 10330 | > | chi-square | 140384 | 1028 | 9480 | > | akka-uct | 115136 | 541 | 3941 | > | reactors | 43264 | 335 | 2503 | > | scala-stm-bench7 | 42656 | 326 | 3310 | > | philosophers | 36576 | 256 | 2902 | > | scala-doku | 35008 | 231 | 2695 | > | rx-scrabble | 32416 | 273 | 2789 | > | future-genetic | 29408 | 260 | 2339 | > | scrabble | 27968 | 225 | 2477 | > | par-mnemonics | 19584 | 168 | 1689 | > | fj-kmeans | 19296 | 156 | 1647 | > | scala-kmeans | 18080 | 140 | 1629 | > | mnemonics | 17408 | 143 | 1512 | > +------------------+-------------+----------------------------+---------------------+ > > - X86_64 > > +------------------+-------------+----------------------------+---------------------+ > | Benchmark | Saved bytes | Nmethods with shared stubs | Final # of nmethods | > +------------------+-------------+----------------------------+---------------------+ > | dotty | 337065 | 4403 | 19135 | > | dec-tree | 183045 | 2559 | 22071 | > | naive-bayes | 176460 | 2450 | 19782 | > | log-regression | 162555 | 
2410 | 20648 | > | als | 121275 | 1980 | 17179 | > | movie-lens | 111915 | 1842 | 13020 | > | finagle-chirper | 106350 | 1947 | 12726 | > | gauss-mix | 81975 | 1251 | 10474 | > | finagle-http | 80895 | 1523 | 12294 | > | page-rank | 68940 | 1146 | 10124 | > | chi-square | 62130 | 974 | 9315 | > | akka-uct | 50220 | 555 | 4263 | > | reactors | 23385 | 371 | 2544 | > | philosophers | 17625 | 259 | 2865 | > | scala-stm-bench7 | 17235 | 295 | 3230 | > | scala-doku | 15600 | 214 | 2698 | > | rx-scrabble | 14190 | 262 | 2770 | > | future-genetic | 13155 | 253 | 2318 | > | scrabble | 12300 | 217 | 2352 | > | fj-kmeans | 8985 | 157 | 1616 | > | par-mnemonics | 8535 | 155 | 1684 | > | scala-kmeans | 8250 | 138 | 1624 | > | mnemonics | 7485 | 134 | 1522 | > +------------------+-------------+----------------------------+---------------------+ > > > **Testing: fastdebug and release builds for x86, x86_64 and aarch64** > - `tier1`...`tier4`: Passed > - `hotspot/jtreg/compiler/sharedstubs`: Passed This pull request has now been integrated. 
Changeset: 35156041 Author: Evgeny Astigeevich Committer: Paul Hohensee URL: https://git.openjdk.org/jdk/commit/351560414d7ddc0694126ab184bdb78be604e51f Stats: 491 lines in 22 files changed: 458 ins; 5 del; 28 mod 8280481: Duplicated stubs to interpreter for static calls Reviewed-by: kvn, phh ------------- PR: https://git.openjdk.org/jdk/pull/8816 From dlong at openjdk.org Tue Jul 5 23:40:49 2022 From: dlong at openjdk.org (Dean Long) Date: Tue, 5 Jul 2022 23:40:49 GMT Subject: RFR: 8278479: RunThese test failure with +UseHeavyMonitors and +VerifyHeavyMonitors [v2] In-Reply-To: References: <2smVdrxSBqWvHWeHLSPI4T1MiXN5p-3WmeC6c5-4ERc=.f93f73ef-f5f5-4223-947f-503b924c1568@github.com> Message-ID: On Tue, 5 Jul 2022 16:36:03 GMT, Coleen Phillimore wrote: >> src/hotspot/cpu/arm/c1_LIRAssembler_arm.cpp line 2438: >> >>> 2436: int null_check_offset = __ offset(); >>> 2437: __ null_check(obj); >>> 2438: add_debug_info_for_null_check(null_check_offset, op->info()); >> >> Is this equivalent to the following? >> >> add_debug_info_for_null_check_here(op->info()); >> __ null_check(obj); > > I don't know. Is it better and preferable? It looks better. I like it better. Thanks for changing it. ------------- PR: https://git.openjdk.org/jdk/pull/9339 From jbhateja at openjdk.org Wed Jul 6 13:18:05 2022 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Wed, 6 Jul 2022 13:18:05 GMT Subject: RFR: 8289186: Support predicated vector load/store operations over X86 AVX2 targets. [v2] In-Reply-To: References: Message-ID: On Tue, 5 Jul 2022 18:29:42 GMT, Vladimir Kozlov wrote: >> Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: >> >> 8289186: Review comments resolved. > > src/jdk.incubator.vector/share/classes/jdk/incubator/vector/ByteVector.java line 3923: > >> 3921: // End of low-level memory operations. >> 3922: >> 3923: @ForceInline > > Why this change? Was it missing before or you found that based on testing? 
> What is criteria to add `@ForceInline`? Thanks for highlighting this. `checkMaskFromIndexSize` is used to test illegal memory access cases with out-of-range offsets, i.e. tail scenarios, so the profile-based invocation count will always be low for these calls. I saw improved performance on targeted micros, but I get your point that aggressive forced inlining on infrequently taken paths can have adverse performance side effects. It may then overshadow some of the performance gains due to masked load/store support on tail paths, but we still see a modest gain on the order of 2-3x, vs. the original 10x gain, for non-subword types over baseline. Benchmark (inSize) (outSize) Mode Cnt Score Error Units LoadMaskedIOOBEBenchmark.byteLoadArrayMaskIOOBE 1026 1152 thrpt 2 748.793 ops/ms LoadMaskedIOOBEBenchmark.doubleLoadArrayMaskIOOBE 1026 1152 thrpt 2 381.655 ops/ms LoadMaskedIOOBEBenchmark.floatLoadArrayMaskIOOBE 1026 1152 thrpt 2 741.809 ops/ms LoadMaskedIOOBEBenchmark.intLoadArrayMaskIOOBE 1026 1152 thrpt 2 757.433 ops/ms LoadMaskedIOOBEBenchmark.longLoadArrayMaskIOOBE 1026 1152 thrpt 2 386.450 ops/ms LoadMaskedIOOBEBenchmark.shortLoadArrayMaskIOOBE 1026 1152 thrpt 2 471.260 ops/ms ------------- PR: https://git.openjdk.org/jdk/pull/9324 From xgong at openjdk.org Wed Jul 6 07:52:31 2022 From: xgong at openjdk.org (Xiaohong Gong) Date: Wed, 6 Jul 2022 07:52:31 GMT Subject: RFR: 8289186: Support predicated vector load/store operations over X86 AVX2 targets. [v2] In-Reply-To: References: Message-ID: On Tue, 5 Jul 2022 18:31:31 GMT, Vladimir Kozlov wrote: >> Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: >> >> 8289186: Review comments resolved. 
> > src/hotspot/share/opto/vectornode.hpp line 907: > >> 905: StoreVectorMaskedNode(Node* c, Node* mem, Node* dst, Node* src, const TypePtr* at, Node* mask) >> 906: : StoreVectorNode(c, mem, dst, at, src) { >> 907: assert(mask->bottom_type()->isa_vectmask(), "sanity"); > > Why the assert was added before? And why you can remove it now? Does this mean the `mask` input can be a normal `vect_type ` like the mask input of `VectorBlend` for `LoadVectorMasked/StoreVectorMasked` over X86 AVX2 systems? They do not depend on the predicated feature for the AVX2 systems, right? ------------- PR: https://git.openjdk.org/jdk/pull/9324 From jbhateja at openjdk.org Wed Jul 6 13:18:02 2022 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Wed, 6 Jul 2022 13:18:02 GMT Subject: RFR: 8289186: Support predicated vector load/store operations over X86 AVX2 targets. [v3] In-Reply-To: References: Message-ID: On Wed, 6 Jul 2022 07:48:50 GMT, Xiaohong Gong wrote: >> src/hotspot/share/opto/vectornode.hpp line 907: >> >>> 905: StoreVectorMaskedNode(Node* c, Node* mem, Node* dst, Node* src, const TypePtr* at, Node* mask) >>> 906: : StoreVectorNode(c, mem, dst, at, src) { >>> 907: assert(mask->bottom_type()->isa_vectmask(), "sanity"); >> >> Why the assert was added before? And why you can remove it now? > > Does this mean the `mask` input can be a normal `vect_type ` like the mask input of `VectorBlend` for `LoadVectorMasked/StoreVectorMasked` over X86 AVX2 systems? They do not depend on the predicated feature for the AVX2 systems, right? IR was specifically added for predicated targets (AVX512 and ARM's SVE), this patch is re-using this IR for non-predicated AVX2 target where mask could be a vector type. ------------- PR: https://git.openjdk.org/jdk/pull/9324 From jbhateja at openjdk.org Wed Jul 6 13:24:27 2022 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Wed, 6 Jul 2022 13:24:27 GMT Subject: RFR: 8289186: Support predicated vector load/store operations over X86 AVX2 targets. 
[v4] In-Reply-To: References: Message-ID: On Wed, 6 Jul 2022 13:14:53 GMT, Jatin Bhateja wrote: >> test/micro/org/openjdk/bench/jdk/incubator/vector/LoadMaskedIOOBEBenchmark.java line 98: >> >>> 96: for (int i = 0; i < inSize; i += bspecies.length()) { >>> 97: VectorMask mask = VectorMask.fromArray(bspecies, m, i); >>> 98: ByteVector.fromArray(bspecies, byteIn, i, mask).intoArray(byteOut, i, mask); >> >> Could you please add new benchmarks for masked `store` ? > > Done. Here are results of new benchmark. BaseLine: Benchmark (inSize) (outSize) Mode Cnt Score Error Units StoreMaskedIOOBEBenchmark.byteStoreArrayMaskIOOBE 1024 1022 thrpt 2 772.555 ops/ms StoreMaskedIOOBEBenchmark.doubleStoreArrayMaskIOOBE 1024 1022 thrpt 2 180.548 ops/ms StoreMaskedIOOBEBenchmark.floatStoreArrayMaskIOOBE 1024 1022 thrpt 2 311.500 ops/ms StoreMaskedIOOBEBenchmark.intStoreArrayMaskIOOBE 1024 1022 thrpt 2 312.457 ops/ms StoreMaskedIOOBEBenchmark.longStoreArrayMaskIOOBE 1024 1022 thrpt 2 181.013 ops/ms StoreMaskedIOOBEBenchmark.shortStoreArrayMaskIOOBE 1024 1022 thrpt 2 538.537 ops/ms WithOpt: Benchmark (inSize) (outSize) Mode Cnt Score Error Units StoreMaskedIOOBEBenchmark.byteStoreArrayMaskIOOBE 1024 1022 thrpt 2 757.079 ops/ms StoreMaskedIOOBEBenchmark.doubleStoreArrayMaskIOOBE 1024 1022 thrpt 2 1553.923 ops/ms StoreMaskedIOOBEBenchmark.floatStoreArrayMaskIOOBE 1024 1022 thrpt 2 3060.020 ops/ms StoreMaskedIOOBEBenchmark.intStoreArrayMaskIOOBE 1024 1022 thrpt 2 3025.225 ops/ms StoreMaskedIOOBEBenchmark.longStoreArrayMaskIOOBE 1024 1022 thrpt 2 1562.263 ops/ms StoreMaskedIOOBEBenchmark.shortStoreArrayMaskIOOBE 1024 1022 thrpt 2 538.931 ops/ms ------------- PR: https://git.openjdk.org/jdk/pull/9324 From chagedorn at openjdk.org Wed Jul 6 07:11:48 2022 From: chagedorn at openjdk.org (Christian Hagedorn) Date: Wed, 6 Jul 2022 07:11:48 GMT Subject: RFR: 8289604: compiler/vectorapi/VectorLogicalOpIdentityTest.java failed on x86 AVX1 system In-Reply-To: 
<2XOrrHylAft7P5DfwnNMdYbILAfTRaTrIpdWAK2PIJI=.c30dd7e1-d7b5-409f-94f6-db676b9841f9@github.com> References: <2XOrrHylAft7P5DfwnNMdYbILAfTRaTrIpdWAK2PIJI=.c30dd7e1-d7b5-409f-94f6-db676b9841f9@github.com> Message-ID: On Tue, 5 Jul 2022 07:47:58 GMT, Xiaohong Gong wrote: > The sub-test "`testMaskAndZero()`" failed on x86 systems when > `UseAVX=1` with the IR check failure: > > - counts: Graph contains wrong number of nodes: > * Regex 1: (\\d+(\\s){2}(StoreVector.*)+(\\s){2}===.*) > - Failed comparison: [found] 0 >= 1 [given] > - No nodes matched! > > The root cause is that the `VectorMask.fromArray/intoArray` APIs > are not intrinsified when "`UseAVX=1`" for long type vectors, > for the following reasons: > 1) The maximum vector size the system supports is 128 bits for > integral vector operations when "`UseAVX=1`". > 2) The match rules for `VectorLoadMaskNode/VectorStoreMaskNode` > are not supported for vectors with 2 elements (see [1]). > > Note that `VectorMask.fromArray()` needs to be intrinsified > with "`LoadVector+VectorLoadMask`". And `VectorMask.intoArray()` > needs to be intrinsified with "`VectorStoreMask+StoreVector`". > If either "`VectorStoreMask`" or "`StoreVector`" is not supported by the > compiler backend, the corresponding API will not be intrinsified. > > Replacing the vector type from Long to other integral types > in the test case can fix the issue. > > [1] https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/x86/x86.ad#L1861 > Including @chhagedorn for discussion. > > I was wondering why our testing with `UseAVX=1` did not catch this issue (test passed). But then I remembered that the IR framework skips testing if such a flag is used by testing. It is not on the whitelist: https://github.com/openjdk/jdk/blob/master/test/hotspot/jtreg/compiler/lib/ir_framework/TestFramework.java#L104 > > We are adding more IR framework testing for vectors. I think it is important to test them with different CPU features **if needed**. 
> > #8999 added filter to run some sub-tests if CPU's feature is present/absent. But it relies on testing infrastructure to be executed on corresponding machine. It is not reliable (how many machines left with only AVX1). > > I suggest to allow a test add flags to whitelist with which it allows to run. It (together with #8999 and `@requires`) will allow test's author to specify range of CPU features which can be used for test and make sure they will be run. Right, `UseAVX` is not a whitelisted flag and thus IR matching is disabled. I think it would be a good idea to consider whitelisting this flag in general to allow (pre-integration) testing of such different default machine setups (we could be executing any test on such a machine at some point). Same for other flags such as `UseSVE`, `UseSSE` etc. I've filed [JDK-8289801](https://bugs.openjdk.org/browse/JDK-8289801) to follow up on that. I will be on vacation starting from tomorrow. If someone wants to take this over, please go ahead. Otherwise, I'll come back to this when I'm back at the beginning of August. Thanks, Christian ------------- PR: https://git.openjdk.org/jdk/pull/9373 From jbhateja at openjdk.org Wed Jul 6 13:18:07 2022 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Wed, 6 Jul 2022 13:18:07 GMT Subject: RFR: 8289186: Support predicated vector load/store operations over X86 AVX2 targets. [v3] In-Reply-To: References: Message-ID: On Wed, 6 Jul 2022 02:19:14 GMT, Xiaohong Gong wrote: >> Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: >> >> 8289186: Review comments resolved. > > test/micro/org/openjdk/bench/jdk/incubator/vector/LoadMaskedIOOBEBenchmark.java line 98: > >> 96: for (int i = 0; i < inSize; i += bspecies.length()) { >> 97: VectorMask mask = VectorMask.fromArray(bspecies, m, i); >> 98: ByteVector.fromArray(bspecies, byteIn, i, mask).intoArray(byteOut, i, mask); > > Could you please add new benchmarks for masked `store` ? Done. 
------------- PR: https://git.openjdk.org/jdk/pull/9324 From xgong at openjdk.org Wed Jul 6 07:43:32 2022 From: xgong at openjdk.org (Xiaohong Gong) Date: Wed, 6 Jul 2022 07:43:32 GMT Subject: RFR: 8289186: Support predicated vector load/store operations over X86 AVX2 targets. [v2] In-Reply-To: References: Message-ID: On Tue, 5 Jul 2022 18:46:30 GMT, Vladimir Kozlov wrote: >> Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: >> >> 8289186: Review comments resolved. > > src/hotspot/cpu/x86/x86.ad line 1767: > >> 1765: break; >> 1766: case Op_StoreVectorMasked: >> 1767: if (!VM_Version::supports_avx512bw() && (is_subword_type(bt) || UseAVX < 1)) { > > Why duplicate the same check? > And you did not answer my suggestion about modifying the check. Yes, a fall-through can also work here. ------------- PR: https://git.openjdk.org/jdk/pull/9324 From rehn at openjdk.org Wed Jul 6 12:50:39 2022 From: rehn at openjdk.org (Robbin Ehn) Date: Wed, 6 Jul 2022 12:50:39 GMT Subject: [jdk19] RFR: 8288949: serviceability/jvmti/vthread/ContStackDepthTest/ContStackDepthTest.java failing [v4] In-Reply-To: References: Message-ID: On Wed, 6 Jul 2022 09:44:23 GMT, Ron Pressler wrote: >> Please review the following bug fix: >> >> `Continuation.enterSpecial` is a generated special nmethod (albeit not a Java method), with a well-known frame layout that calls `Continuation.enter`. >> >> Because it is compiled, it resolves the call to `Continuation.enter` to its compiled version, if available. But this results in the compiled `Continuation.enter` being called even when the thread is in interp_only_mode. >> >> This change does three things: >> >> 1. When entering interp_only_mode, `Continuation::set_cont_fastpath_thread_state` will clear enterSpecial's resolved callsite to Continuation.enter. >> 2. In interp_only_mode, `SharedRuntime::resolve_static_call_C` will return `Continuation.enter`'s c2i entry rather than `verified_code_entry`. 
>> 3. In interp_only_mode, the c2i stub will not patch the callsite. >> >> This fix isn't perfect, because a different thread, not in interp_only_mode, might patch the call. A longer-term solution is to create an "interpreted" version of `enterSpecial` and supporting an ad-hoc deoptimization. See https://bugs.openjdk.org/browse/JDK-8289128 >> >> >> Passes tiers 1-4 and Loom tiers 1-5. > > Ron Pressler has updated the pull request incrementally with one additional commit since the last revision: > > Changes following review comments Marked as reviewed by rehn (Reviewer). ------------- PR: https://git.openjdk.org/jdk19/pull/66 From coleenp at openjdk.org Wed Jul 6 12:08:43 2022 From: coleenp at openjdk.org (Coleen Phillimore) Date: Wed, 6 Jul 2022 12:08:43 GMT Subject: RFR: 8278479: RunThese test failure with +UseHeavyMonitors and +VerifyHeavyMonitors [v2] In-Reply-To: References: Message-ID: On Tue, 5 Jul 2022 17:07:28 GMT, Coleen Phillimore wrote: >> This change adds a null check before calling into Runtime1::monitorenter when -XX:+UseHeavyMonitors is set. There's a null check in the C2 and interpreter code before calling the runtime function but not C1. >> >> Tested with tier1-7 (a little of 8) and built on most non-oracle platforms as well. > > Coleen Phillimore has updated the pull request incrementally with one additional commit since the last revision: > > Improve c1 code. Thanks Dean for the review and improvement. ------------- PR: https://git.openjdk.org/jdk/pull/9339 From roland at openjdk.org Wed Jul 6 07:21:38 2022 From: roland at openjdk.org (Roland Westrelin) Date: Wed, 6 Jul 2022 07:21:38 GMT Subject: RFR: 8288022: c2: Transform (CastLL (AddL into (AddL (CastLL when possible [v2] In-Reply-To: References: Message-ID: <4iKcC53CAHtz4j3OsqyPs-aJZ6hDd2QIdBLbXVBtytI=.4059b58b-2656-4fae-82bb-6abdbb26e855@github.com> > This implements a transformation that already exists for CastII and > ConvI2L and helps code generation. 
The tricky part is that: > > (CastII (AddI into (AddI (CastII > > is performed by first computing the bounds of the type of the AddI. To > protect against overflow, jlong variables are used. With CastLL/AddL > nodes there's no larger integer type to promote the bounds to. As a > consequence the logic in the patch explicitly tests for overflow. That > logic is shared by the int and long cases. The previous logic for the > int cases that promotes values to long is used as verification. > > This patch also widens the type of CastLL nodes after loop opts the > way it's done for CastII/ConvI2L to allow commoning of nodes. > > This was observed to help with Memory Segment micro benchmarks. Roland Westrelin has updated the pull request incrementally with one additional commit since the last revision: Update test/hotspot/jtreg/compiler/c2/irTests/TestPushAddThruCast.java Co-authored-by: Tobias Hartmann ------------- Changes: - all: https://git.openjdk.org/jdk/pull/9139/files - new: https://git.openjdk.org/jdk/pull/9139/files/5579e664..80577689 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=9139&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=9139&range=00-01 Stats: 1 line in 1 file changed: 0 ins; 1 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/9139.diff Fetch: git fetch https://git.openjdk.org/jdk pull/9139/head:pull/9139 PR: https://git.openjdk.org/jdk/pull/9139 From eosterlund at openjdk.org Wed Jul 6 11:23:35 2022 From: eosterlund at openjdk.org (Erik =?UTF-8?B?w5ZzdGVybHVuZA==?=) Date: Wed, 6 Jul 2022 11:23:35 GMT Subject: [jdk19] RFR: 8288949: serviceability/jvmti/vthread/ContStackDepthTest/ContStackDepthTest.java failing [v3] In-Reply-To: References: Message-ID: On Wed, 6 Jul 2022 08:55:10 GMT, Ron Pressler wrote: >> src/hotspot/cpu/aarch64/sharedRuntime_aarch64.cpp line 1058: >> >>> 1056: >>> 1057: address mark = __ pc(); >>> 1058: __ trampoline_call1(resolve, NULL, false); >> >> I don't think it's necessary to call the resolve stub when 
in interpreted mode. Can't we just call the Method's c2i adapter just like the interpreter would? I guess there might be a startup issue if the adapter hasn't been generated yet. > > I couldn't find code that does that and could be easily reused. The callee belongs to the same class as the caller, so they should both be linked in this context. ------------- PR: https://git.openjdk.org/jdk19/pull/66 From dlong at openjdk.org Wed Jul 6 07:19:42 2022 From: dlong at openjdk.org (Dean Long) Date: Wed, 6 Jul 2022 07:19:42 GMT Subject: RFR: 8278479: RunThese test failure with +UseHeavyMonitors and +VerifyHeavyMonitors [v2] In-Reply-To: References: <2smVdrxSBqWvHWeHLSPI4T1MiXN5p-3WmeC6c5-4ERc=.f93f73ef-f5f5-4223-947f-503b924c1568@github.com> Message-ID: On Tue, 5 Jul 2022 16:34:53 GMT, Coleen Phillimore wrote: >> src/hotspot/cpu/aarch64/c1_LIRAssembler_aarch64.cpp line 2563: >> >>> 2561: if (op->info() != NULL) { >>> 2562: int null_check_offset = __ offset(); >>> 2563: __ null_check(obj, -1); >> >> Suggestion: >> >> __ null_check(obj); > > Yes, that would be better but it leads to a compilation error on macosx-aarch64 that null_check is ambiguous: > > src/hotspot/cpu/aarch64/c1_LIRAssembler_aarch64.cpp:2563:10: error: call to member function 'null_check' is ambiguous > [2022-06-30T00:09:19,094Z] __ null_check(obj); > [2022-06-30T00:09:19,094Z] ~~~^~~~~~~~~~ > src/hotspot/cpu/aarch64/macroAssembler_aarch64.hpp:602:16: note: candidate function > [2022-06-30T00:09:19,094Z] virtual void null_check(Register reg, int offset = -1); > [2022-06-30T00:09:19,094Z] ^ > src/hotspot/cpu/aarch64/c1_MacroAssembler_aarch64.hpp:109:8: note: candidate function > [2022-06-30T00:09:19,094Z] void null_check(Register r, Label *Lnull = NULL) { MacroAssembler::null_check(r); } > [2022-06-30T00:09:19,094Z] ^ OK, nevermind. 
------------- PR: https://git.openjdk.org/jdk/pull/9339 From roland at openjdk.org Wed Jul 6 11:39:01 2022 From: roland at openjdk.org (Roland Westrelin) Date: Wed, 6 Jul 2022 11:39:01 GMT Subject: RFR: 8288022: c2: Transform (CastLL (AddL into (AddL (CastLL when possible [v2] In-Reply-To: References: Message-ID: <80GedXG1YxPyiBCFo9CrTOMaJu0YyNPZ92qpp2dSudM=.90a10054-e15a-4771-8812-a67266d6f88c@github.com> On Mon, 20 Jun 2022 07:46:05 GMT, Tobias Hartmann wrote: >> Roland Westrelin has updated the pull request incrementally with one additional commit since the last revision: >> >> Update test/hotspot/jtreg/compiler/c2/irTests/TestPushAddThruCast.java >> >> Co-authored-by: Tobias Hartmann > > Looks correct. A second review would be good. @TobiHartmann @vnkozlov thanks for the reviews ------------- PR: https://git.openjdk.org/jdk/pull/9139 From coleenp at openjdk.org Wed Jul 6 12:12:42 2022 From: coleenp at openjdk.org (Coleen Phillimore) Date: Wed, 6 Jul 2022 12:12:42 GMT Subject: Integrated: 8278479: RunThese test failure with +UseHeavyMonitors and +VerifyHeavyMonitors In-Reply-To: References: Message-ID: On Thu, 30 Jun 2022 22:05:14 GMT, Coleen Phillimore wrote: > This change adds a null check before calling into Runtime1::monitorenter when -XX:+UseHeavyMonitors is set. There's a null check in the C2 and interpreter code before calling the runtime function but not C1. > > Tested with tier1-7 (a little of 8) and built on most non-oracle platforms as well. This pull request has now been integrated. 
Changeset: 83a5d599 Author: Coleen Phillimore URL: https://git.openjdk.org/jdk/commit/83a5d5996bca26b5f2e97b67f9bfd0a6ad110327 Stats: 24 lines in 6 files changed: 24 ins; 0 del; 0 mod 8278479: RunThese test failure with +UseHeavyMonitors and +VerifyHeavyMonitors Reviewed-by: kvn, dcubed, dlong ------------- PR: https://git.openjdk.org/jdk/pull/9339 
From aph at openjdk.org Wed Jul 6 15:25:45 2022 From: aph at openjdk.org (Andrew Haley) Date: Wed, 6 Jul 2022 15:25:45 GMT Subject: Integrated: 8289060: Undefined Behaviour in class VMReg In-Reply-To: <3TzV1cxfovNTIdvELrSKb1-897YpS4Th5Gc7YwjsYT8=.5ecc70e2-fc67-4851-a18f-c721c8397186@github.com> References: <3TzV1cxfovNTIdvELrSKb1-897YpS4Th5Gc7YwjsYT8=.5ecc70e2-fc67-4851-a18f-c721c8397186@github.com> Message-ID: <9fXweMztxKzl4RrAXSCl-c7r1fXJdoSJlOUthsNw99A=.d0e93f80-2a95-4ec5-a65f-333ee4cb9362@github.com> On Fri, 24 Jun 2022 13:58:29 GMT, Andrew Haley wrote: > Like class `Register`, class `VMReg` exhibits undefined behaviour, in particular null pointer dereferences. > > The right way to fix this is simple: make instances of `VMReg` point to reified instances of `VMRegImpl`. 
We do this by creating a static array of `VMRegImpl`, and making all `VMReg` instances point into it, making the code well defined. > > However, while `VMReg` instances are no longer null, and so do not generate compile warnings or errors, there is still a problem in that higher-numbered `VMReg` instances point outside the static array of `VMRegImpl`. This is hard to avoid, given that (as far as I can tell) there is no upper limit on the number of stack slots that can be allocated as `VMReg` instances. While this is in theory UB, it's not likely to cause problems. We could fix this by creating a much larger static array of `VMRegImpl`, up to the largest plausible size of stack offsets. > > We could instead make `VMReg` instances objects with a single numeric field rather than pointers, but some C++ compilers pass all such objects by reference, so I don't think we should. This pull request has now been integrated. Changeset: dfb24ae4 Author: Andrew Haley URL: https://git.openjdk.org/jdk/commit/dfb24ae4b7d32c0c625a9396429d167d9dcca183 Stats: 40 lines in 3 files changed: 20 ins; 3 del; 17 mod 8289060: Undefined Behaviour in class VMReg Reviewed-by: jvernee, kvn ------------- PR: https://git.openjdk.org/jdk/pull/9276 From roland at openjdk.org Wed Jul 6 11:39:03 2022 From: roland at openjdk.org (Roland Westrelin) Date: Wed, 6 Jul 2022 11:39:03 GMT Subject: Integrated: 8288022: c2: Transform (CastLL (AddL into (AddL (CastLL when possible In-Reply-To: References: Message-ID: On Mon, 13 Jun 2022 08:26:47 GMT, Roland Westrelin wrote: > This implements a transformation that already exists for CastII and > ConvI2L and helps code generation. The tricky part is that: > > (CastII (AddI into (AddI (CastII > > is performed by first computing the bounds of the type of the AddI. To > protect against overflow, jlong variables are used. With CastLL/AddL > nodes there's no larger integer type to promote the bounds to. 
As a > consequence the logic in the patch explicitly tests for overflow. That > logic is shared by the int and long cases. The previous logic for the > int cases that promotes values to long is used as verification. > > This patch also widens the type of CastLL nodes after loop opts the > way it's done for CastII/ConvI2L to allow commoning of nodes. > > This was observed to help with Memory Segment micro benchmarks. This pull request has now been integrated. Changeset: cbaf6e80 Author: Roland Westrelin URL: https://git.openjdk.org/jdk/commit/cbaf6e807e2b959a0264c87035916850798a2dc6 Stats: 519 lines in 8 files changed: 400 ins; 93 del; 26 mod 8288022: c2: Transform (CastLL (AddL into (AddL (CastLL when possible Reviewed-by: thartmann, kvn ------------- PR: https://git.openjdk.org/jdk/pull/9139 From dlong at openjdk.org Wed Jul 6 02:22:48 2022 From: dlong at openjdk.org (Dean Long) Date: Wed, 6 Jul 2022 02:22:48 GMT Subject: [jdk19] RFR: 8288949: serviceability/jvmti/vthread/ContStackDepthTest/ContStackDepthTest.java failing [v3] In-Reply-To: References: Message-ID: On Tue, 5 Jul 2022 08:31:31 GMT, Ron Pressler wrote: >> Please review the following bug fix: >> >> `Continuation.enterSpecial` is a generated special nmethod (albeit not a Java method), with a well-known frame layout that calls `Continuation.enter`. >> >> Because it is compiled, it resolves the call to `Continuation.enter` to its compiled version, if available. But this results in the compiled `Continuation.enter` being called even when the thread is in interp_only_mode. >> >> This change does three things: >> >> 1. When entering interp_only_mode, `Continuation::set_cont_fastpath_thread_state` will clear enterSpecial's resolved callsite to Continuation.enter. >> 2. In interp_only_mode, `SharedRuntime::resolve_static_call_C` will return `Continuation.enter`'s c2i entry rather than `verified_code_entry`. >> 3. In interp_only_mode, the c2i stub will not patch the callsite. 
>> >> This fix isn't perfect, because a different thread, not in interp_only_mode, might patch the call. A longer-term solution is to create an "interpreted" version of `enterSpecial` and supporting an ad-hoc deoptimization. See https://bugs.openjdk.org/browse/JDK-8289128 >> >> >> Passes tiers 1-4 and Loom tiers 1-5. > > Ron Pressler has updated the pull request incrementally with two additional commits since the last revision: > > - Add an "i2i" entry to enterSpecial > - Fix comment src/hotspot/cpu/aarch64/sharedRuntime_aarch64.cpp line 1058: > 1056: > 1057: address mark = __ pc(); > 1058: __ trampoline_call1(resolve, NULL, false); I don't think it's necessary to call the resolve stub when in interpreted mode. Can't we just call the Method's c2i adapter just like the interpreter would? I guess there might be a startup issue if the adapter hasn't been generated yet. ------------- PR: https://git.openjdk.org/jdk19/pull/66 From kvn at openjdk.org Wed Jul 6 06:01:27 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Wed, 6 Jul 2022 06:01:27 GMT Subject: RFR: 8289604: compiler/vectorapi/VectorLogicalOpIdentityTest.java failed on x86 AVX1 system In-Reply-To: <2XOrrHylAft7P5DfwnNMdYbILAfTRaTrIpdWAK2PIJI=.c30dd7e1-d7b5-409f-94f6-db676b9841f9@github.com> References: <2XOrrHylAft7P5DfwnNMdYbILAfTRaTrIpdWAK2PIJI=.c30dd7e1-d7b5-409f-94f6-db676b9841f9@github.com> Message-ID: On Tue, 5 Jul 2022 07:47:58 GMT, Xiaohong Gong wrote: > The sub-test "`testMaskAndZero()`" failed on x86 systems when > `UseAVX=1` with the IR check failure: > > - counts: Graph contains wrong number of nodes: > * Regex 1: (\\d+(\\s){2}(StoreVector.*)+(\\s){2}===.*) > - Failed comparison: [found] 0 >= 1 [given] > - No nodes matched! > > The root cause is the `VectorMask.fromArray/intoArray` APIs > are not intrinsified when "`UseAVX=1`" for long type vectors > with following reasons: > 1) The system supported max vector size is 128 bits for > integral vector operations when "`UseAVX=1`". 
> 2) The match rule of `VectorLoadMaskNode/VectorStoreMaskNode`
> are not supported for vectors with 2 elements (see [1]).
>
> Note that `VectorMask.fromArray()` needs to be intrinsified
> with "`LoadVector+VectorLoadMask`". And `VectorMask.intoArray()`
> needs to be intrinsified with "`VectorStoreMask+StoreVector`".
> Either "`VectorStoreMask`" or "`StoreVector`" not supported by the
> compiler backend will forbid the related API intrinsification.
>
> Replacing the vector type from Long to other integral types
> in the test case can fix the issue.
>
> [1] https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/x86/x86.ad#L1861

tier1-tier2 testing passed. But I am still waiting for the result of a run with the failed test on a machine with AVX1 only (it is in scheduled state).

-------------

PR: https://git.openjdk.org/jdk/pull/9373

From jbhateja at openjdk.org Wed Jul 6 13:17:59 2022
From: jbhateja at openjdk.org (Jatin Bhateja)
Date: Wed, 6 Jul 2022 13:17:59 GMT
Subject: RFR: 8289186: Support predicated vector load/store operations over X86 AVX2 targets. [v2]
In-Reply-To: 
References: 
Message-ID: 

On Wed, 6 Jul 2022 02:46:42 GMT, Xiaohong Gong wrote:

>> src/hotspot/share/opto/vectorIntrinsics.cpp line 313:
>>
>>> 311:   if (!is_supported && (sopc == Op_StoreVectorMasked || sopc == Op_LoadVectorMasked)) {
>>> 312:     return true;
>>> 313:   }
>>
>> Still unclear for me. As I understand `the upfront checks`, you mention, are checks at lines 270 and 286. So we know that `StoreVectorMasked` and `LoadVectorMasked` are supported.
>> The second part of your comment is talking about `non-predicated targets`. Does it mean that the real check should be next?:
>>
>>     if (Matcher::has_predicated_vectors()) {
>>       ....
>>     } else if (sopc == Op_StoreVectorMasked || sopc == Op_LoadVectorMasked) {
>>       // your comment
>>       return true;
>>     }
>> ```
>> Or I am still missing what are you trying to do here?
> > Could we change the `VectorMaskUseType` here https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/vectorIntrinsics.cpp#L1221 instead of doing the modification here? Originally we assume `LoadVectorMasked/StoreVectorMasked` is implemented with predicated feature, we added the `VectorMaskUsePred`. If the assumption is changed now, we can just remove `VectorMaskUsePred` and so only `match_rule_supported_vector` is checked for such ops. > Hi Vladimir, L220 is checking for backend support for these IR using match_rule_supported_vector, if targets do not support them then we exit early, LoadVectorMasked and StoreVectorMasked are special IR node which were added only for predicated targets. I agree with @XiaohongGong that we can remove this constraint for masked memory operations from the caller side i.e. LibraryCallKit::inline_vector_mem_masked_operation, such a constraint is only useful for shared IR nodes which can carry additional mask edge on common IR for predicated targets, since there we need to check for existence of blend + vector op as a fall back case. ------------- PR: https://git.openjdk.org/jdk/pull/9324 From dlong at openjdk.org Wed Jul 6 02:51:56 2022 From: dlong at openjdk.org (Dean Long) Date: Wed, 6 Jul 2022 02:51:56 GMT Subject: [jdk19] RFR: 8288949: serviceability/jvmti/vthread/ContStackDepthTest/ContStackDepthTest.java failing [v3] In-Reply-To: References: Message-ID: On Tue, 5 Jul 2022 08:31:31 GMT, Ron Pressler wrote: >> Please review the following bug fix: >> >> `Continuation.enterSpecial` is a generated special nmethod (albeit not a Java method), with a well-known frame layout that calls `Continuation.enter`. >> >> Because it is compiled, it resolves the call to `Continuation.enter` to its compiled version, if available. But this results in the compiled `Continuation.enter` being called even when the thread is in interp_only_mode. >> >> This change does three things: >> >> 1. 
When entering interp_only_mode, `Continuation::set_cont_fastpath_thread_state` will clear enterSpecial's resolved callsite to Continuation.enter. >> 2. In interp_only_mode, `SharedRuntime::resolve_static_call_C` will return `Continuation.enter`'s c2i entry rather than `verified_code_entry`. >> 3. In interp_only_mode, the c2i stub will not patch the callsite. >> >> This fix isn't perfect, because a different thread, not in interp_only_mode, might patch the call. A longer-term solution is to create an "interpreted" version of `enterSpecial` and supporting an ad-hoc deoptimization. See https://bugs.openjdk.org/browse/JDK-8289128 >> >> >> Passes tiers 1-4 and Loom tiers 1-5. > > Ron Pressler has updated the pull request incrementally with two additional commits since the last revision: > > - Add an "i2i" entry to enterSpecial > - Fix comment I like the new version. ------------- PR: https://git.openjdk.org/jdk19/pull/66 From eosterlund at openjdk.org Wed Jul 6 11:23:33 2022 From: eosterlund at openjdk.org (Erik =?UTF-8?B?w5ZzdGVybHVuZA==?=) Date: Wed, 6 Jul 2022 11:23:33 GMT Subject: [jdk19] RFR: 8288949: serviceability/jvmti/vthread/ContStackDepthTest/ContStackDepthTest.java failing [v4] In-Reply-To: References: Message-ID: On Wed, 6 Jul 2022 09:44:23 GMT, Ron Pressler wrote: >> Please review the following bug fix: >> >> `Continuation.enterSpecial` is a generated special nmethod (albeit not a Java method), with a well-known frame layout that calls `Continuation.enter`. >> >> Because it is compiled, it resolves the call to `Continuation.enter` to its compiled version, if available. But this results in the compiled `Continuation.enter` being called even when the thread is in interp_only_mode. >> >> This change does three things: >> >> 1. When entering interp_only_mode, `Continuation::set_cont_fastpath_thread_state` will clear enterSpecial's resolved callsite to Continuation.enter. >> 2. 
In interp_only_mode, `SharedRuntime::resolve_static_call_C` will return `Continuation.enter`'s c2i entry rather than `verified_code_entry`. >> 3. In interp_only_mode, the c2i stub will not patch the callsite. >> >> This fix isn't perfect, because a different thread, not in interp_only_mode, might patch the call. A longer-term solution is to create an "interpreted" version of `enterSpecial` and supporting an ad-hoc deoptimization. See https://bugs.openjdk.org/browse/JDK-8289128 >> >> >> Passes tiers 1-4 and Loom tiers 1-5. > > Ron Pressler has updated the pull request incrementally with one additional commit since the last revision: > > Changes following review comments Looks good. ------------- Marked as reviewed by eosterlund (Reviewer). PR: https://git.openjdk.org/jdk19/pull/66 From xgong at openjdk.org Wed Jul 6 06:18:43 2022 From: xgong at openjdk.org (Xiaohong Gong) Date: Wed, 6 Jul 2022 06:18:43 GMT Subject: Integrated: 8289604: compiler/vectorapi/VectorLogicalOpIdentityTest.java failed on x86 AVX1 system In-Reply-To: <2XOrrHylAft7P5DfwnNMdYbILAfTRaTrIpdWAK2PIJI=.c30dd7e1-d7b5-409f-94f6-db676b9841f9@github.com> References: <2XOrrHylAft7P5DfwnNMdYbILAfTRaTrIpdWAK2PIJI=.c30dd7e1-d7b5-409f-94f6-db676b9841f9@github.com> Message-ID: On Tue, 5 Jul 2022 07:47:58 GMT, Xiaohong Gong wrote: > The sub-test "`testMaskAndZero()`" failed on x86 systems when > `UseAVX=1` with the IR check failure: > > - counts: Graph contains wrong number of nodes: > * Regex 1: (\\d+(\\s){2}(StoreVector.*)+(\\s){2}===.*) > - Failed comparison: [found] 0 >= 1 [given] > - No nodes matched! > > The root cause is the `VectorMask.fromArray/intoArray` APIs > are not intrinsified when "`UseAVX=1`" for long type vectors > with following reasons: > 1) The system supported max vector size is 128 bits for > integral vector operations when "`UseAVX=1`". > 2) The match rule of `VectorLoadMaskNode/VectorStoreMaskNode` > are not supported for vectors with 2 elements (see [1]). 
>
> Note that `VectorMask.fromArray()` needs to be intrinsified
> with "`LoadVector+VectorLoadMask`". And `VectorMask.intoArray()`
> needs to be intrinsified with "`VectorStoreMask+StoreVector`".
> Either "`VectorStoreMask`" or "`StoreVector`" not supported by the
> compiler backend will forbid the related API intrinsification.
>
> Replacing the vector type from Long to other integral types
> in the test case can fix the issue.
>
> [1] https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/x86/x86.ad#L1861

This pull request has now been integrated.

Changeset: fafe8b3f
Author:    Xiaohong Gong
URL:       https://git.openjdk.org/jdk/commit/fafe8b3f8dc1bdb7216f2b02416487a2c5fd9a26
Stats:     3 lines in 1 file changed: 0 ins; 0 del; 3 mod

8289604: compiler/vectorapi/VectorLogicalOpIdentityTest.java failed on x86 AVX1 system

Reviewed-by: jiefu, kvn

-------------

PR: https://git.openjdk.org/jdk/pull/9373

From xgong at openjdk.org Wed Jul 6 02:35:43 2022
From: xgong at openjdk.org (Xiaohong Gong)
Date: Wed, 6 Jul 2022 02:35:43 GMT
Subject: RFR: 8289186: Support predicated vector load/store operations over X86 AVX2 targets. [v2]
In-Reply-To: 
References: 
Message-ID: 

On Mon, 4 Jul 2022 16:39:29 GMT, Jatin Bhateja wrote:

>> Hi All,
>>
>> [JDK-8283667](https://bugs.openjdk.org/browse/JDK-8283667) added the support to handle masked loads on non-predicated targets by blending the loaded contents with zero vector iff unmasked portion of load does not span beyond array bounds.
>>
>> X86 AVX2 offers direct predicated vector load/store instructions for non-sub-word types.
>>
>> This patch adds the efficient backend implementation for predicated memory operations over int/long/float/double vectors.
>>
>> Please find below the JMH micro stats with and without patch.
>>
>>
>> System : Intel(R) Xeon(R) Platinum 8280 CPU @ 2.70GHz [28C 2S Cascadelake Server]
>>
>> Baseline:
>> Benchmark                                           (inSize)  (outSize)   Mode  Cnt     Score   Error   Units
>> LoadMaskedIOOBEBenchmark.byteLoadArrayMaskIOOBE         1026       1152  thrpt    2   712.218          ops/ms
>> LoadMaskedIOOBEBenchmark.doubleLoadArrayMaskIOOBE       1026       1152  thrpt    2   156.912          ops/ms
>> LoadMaskedIOOBEBenchmark.floatLoadArrayMaskIOOBE        1026       1152  thrpt    2   255.814          ops/ms
>> LoadMaskedIOOBEBenchmark.intLoadArrayMaskIOOBE          1026       1152  thrpt    2   267.688          ops/ms
>> LoadMaskedIOOBEBenchmark.longLoadArrayMaskIOOBE         1026       1152  thrpt    2   140.957          ops/ms
>> LoadMaskedIOOBEBenchmark.shortLoadArrayMaskIOOBE        1026       1152  thrpt    2   474.009          ops/ms
>>
>>
>> With Opt:
>> Benchmark                                           (inSize)  (outSize)   Mode  Cnt     Score   Error   Units
>> LoadMaskedIOOBEBenchmark.byteLoadArrayMaskIOOBE         1026       1152  thrpt    2   742.781          ops/ms
>> LoadMaskedIOOBEBenchmark.doubleLoadArrayMaskIOOBE       1026       1152  thrpt    2  1241.021          ops/ms
>> LoadMaskedIOOBEBenchmark.floatLoadArrayMaskIOOBE        1026       1152  thrpt    2  2333.311          ops/ms
>> LoadMaskedIOOBEBenchmark.intLoadArrayMaskIOOBE          1026       1152  thrpt    2  3258.754          ops/ms
>> LoadMaskedIOOBEBenchmark.longLoadArrayMaskIOOBE         1026       1152  thrpt    2  1757.192          ops/ms
>> LoadMaskedIOOBEBenchmark.shortLoadArrayMaskIOOBE        1026       1152  thrpt    2   472.590          ops/ms
>>
>>
>> Predicated memory operation over sub-word type will be handled in a subsequent patch.
>>
>> Kindly review and share your feedback.
>>
>> Best Regards,
>> Jatin
>
> Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision:
>
>   8289186: Review comments resolved.

test/micro/org/openjdk/bench/jdk/incubator/vector/LoadMaskedIOOBEBenchmark.java line 98:

> 96:         for (int i = 0; i < inSize; i += bspecies.length()) {
> 97:             VectorMask mask = VectorMask.fromArray(bspecies, m, i);
> 98:             ByteVector.fromArray(bspecies, byteIn, i, mask).intoArray(byteOut, i, mask);

Could you please add new benchmarks for masked `store` ?
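
For readers following the masked load/store discussion, the lane-wise behaviour that these IOOBE benchmarks exercise can be sketched in plain Java. This is only a scalar model under stated assumptions: the class and method names (`MaskedStoreModel`, `maskedStore`, `inRangeMask`) are hypothetical, and it is neither the Vector API implementation nor the benchmark code from the PR. It illustrates that lanes whose mask bit is unset never touch memory, so a vector that hangs over the end of the array is fine as long as the overhanging lanes are masked off.

```java
import java.util.Arrays;

// Scalar model of a masked vector store. All names here are hypothetical
// and chosen for illustration only.
final class MaskedStoreModel {

    /**
     * Stores lanes[i] into dst[offset + i] for every lane whose mask bit is set.
     * Unset lanes never touch memory, so they may lie past the end of dst.
     */
    static void maskedStore(int[] lanes, boolean[] mask, int[] dst, int offset) {
        // Validate all set lanes first so that a failing store leaves dst unmodified.
        for (int i = 0; i < lanes.length; i++) {
            int index = offset + i;
            if (mask[i] && (index < 0 || index >= dst.length)) {
                throw new IndexOutOfBoundsException("lane " + i + " maps to index " + index);
            }
        }
        for (int i = 0; i < lanes.length; i++) {
            if (mask[i]) {
                dst[offset + i] = lanes[i];
            }
        }
    }

    /** Builds a mask whose set lanes are exactly those that fall inside an array of the given length. */
    static boolean[] inRangeMask(int laneCount, int offset, int length) {
        boolean[] mask = new boolean[laneCount];
        for (int i = 0; i < laneCount; i++) {
            int index = offset + i;
            mask[i] = index >= 0 && index < length;
        }
        return mask;
    }

    public static void main(String[] args) {
        int[] dst = new int[6];
        int[] lanes = {1, 2, 3, 4};
        // The 4-lane "vector" starts at index 4 of a 6-element array:
        // only the first two lanes are in range, the rest are masked off.
        maskedStore(lanes, inRangeMask(lanes.length, 4, dst.length), dst, 4);
        System.out.println(Arrays.toString(dst));
    }
}
```

With an all-true mask the same call would throw `IndexOutOfBoundsException`, which is the tail case the `outSize < inSize` store benchmark parameters are meant to hit.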
------------- PR: https://git.openjdk.org/jdk/pull/9324 From xgong at openjdk.org Wed Jul 6 06:04:45 2022 From: xgong at openjdk.org (Xiaohong Gong) Date: Wed, 6 Jul 2022 06:04:45 GMT Subject: RFR: 8286941: Add mask IR for partial vector operations for ARM SVE [v9] In-Reply-To: References: Message-ID: On Wed, 6 Jul 2022 05:54:24 GMT, Vladimir Kozlov wrote: > Results are good. Thanks a lot for your time again! ------------- PR: https://git.openjdk.org/jdk/pull/9037 From xgong at openjdk.org Wed Jul 6 06:15:26 2022 From: xgong at openjdk.org (Xiaohong Gong) Date: Wed, 6 Jul 2022 06:15:26 GMT Subject: RFR: 8289604: compiler/vectorapi/VectorLogicalOpIdentityTest.java failed on x86 AVX1 system In-Reply-To: References: <2XOrrHylAft7P5DfwnNMdYbILAfTRaTrIpdWAK2PIJI=.c30dd7e1-d7b5-409f-94f6-db676b9841f9@github.com> Message-ID: On Wed, 6 Jul 2022 06:08:37 GMT, Vladimir Kozlov wrote: > Finally few runs with failed tests were executed on AVX1 machine and they passed. Thanks for the testing! ------------- PR: https://git.openjdk.org/jdk/pull/9373 From xgong at openjdk.org Wed Jul 6 02:07:38 2022 From: xgong at openjdk.org (Xiaohong Gong) Date: Wed, 6 Jul 2022 02:07:38 GMT Subject: RFR: 8289604: compiler/vectorapi/VectorLogicalOpIdentityTest.java failed on x86 AVX1 system In-Reply-To: References: <2XOrrHylAft7P5DfwnNMdYbILAfTRaTrIpdWAK2PIJI=.c30dd7e1-d7b5-409f-94f6-db676b9841f9@github.com> Message-ID: On Tue, 5 Jul 2022 19:48:41 GMT, Vladimir Kozlov wrote: >> The sub-test "`testMaskAndZero()`" failed on x86 systems when >> `UseAVX=1` with the IR check failure: >> >> - counts: Graph contains wrong number of nodes: >> * Regex 1: (\\d+(\\s){2}(StoreVector.*)+(\\s){2}===.*) >> - Failed comparison: [found] 0 >= 1 [given] >> - No nodes matched! 
>> >> The root cause is the `VectorMask.fromArray/intoArray` APIs >> are not intrinsified when "`UseAVX=1`" for long type vectors >> with following reasons: >> 1) The system supported max vector size is 128 bits for >> integral vector operations when "`UseAVX=1`". >> 2) The match rule of `VectorLoadMaskNode/VectorStoreMaskNode` >> are not supported for vectors with 2 elements (see [1]). >> >> Note that `VectorMask.fromArray()` needs to be intrinsified >> with "`LoadVector+VectorLoadMask`". And `VectorMask.intoArray()` >> needs to be intrinsified with "`VectorStoreMask+StoreVector`". >> Either "`VectorStoreMask`" or "`StoreVector`" not supported by the >> compiler backend will forbit the relative API intrinsification. >> >> Replacing the vector type from Long to other integral types >> in the test case can fix the issue. >> >> [1] https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/x86/x86.ad#L1861 > > Including @chhagedorn for discussion. > > I was wondering why our testing with `UseAVX=1` did not catch this issue (test passed). > But then I remember that IR framework skip testing if such flag is used by testing. It is not on whitelist: > https://github.com/openjdk/jdk/blob/master/test/hotspot/jtreg/compiler/lib/ir_framework/TestFramework.java#L104 > > We are adding more IR framework testing for vectors. I think it is important to test them with different CPU's features **if needed**. > > https://github.com/openjdk/jdk/pull/8999 added filter to run some sub-tests if CPU's feature is present/absent. But it relies on testing infrastructure to be executed on corresponding machine. It is not reliable (how many machines left with only AVX1). > > I suggest to allow a test add flags to whitelist with which it allows to run. It (together with #8999 and `@requires`) will allow test's author to specify range of CPU features which can be used for test and make sure they will be run. Thanks for looking at this fix @vnkozlov ! > Including @chhagedorn for discussion. 
> > I was wondering why our testing with `UseAVX=1` did not catch this issue (test passed). But then I remember that IR framework skip testing if such flag is used by testing. It is not on whitelist: https://github.com/openjdk/jdk/blob/master/test/hotspot/jtreg/compiler/lib/ir_framework/TestFramework.java#L104 Right, testing with such vm options won't trigger the IR testing! > > We are adding more IR framework testing for vectors. I think it is important to test them with different CPU's features **if needed**. > > #8999 added filter to run some sub-tests if CPU's feature is present/absent. But it relies on testing infrastructure to be executed on corresponding machine. It is not reliable (how many machines left with only AVX1). > > I suggest to allow a test add flags to whitelist with which it allows to run. It (together with #8999 and `@requires`) will allow test's author to specify range of CPU features which can be used for test and make sure they will be run. Totally agree! #8999 adds the possibility to limit the test for specific CPU feature, which is convenient for the IR tests that relies on the CPU feature. Please also see one of the tests in the `VectorLogicalOpIdentityTest.java` here: https://github.com/openjdk/jdk/blob/master/test/hotspot/jtreg/compiler/vectorapi/VectorLogicalOpIdentityTest.java#L160 But for the check of common IRs which is not the CPU specific ones, testing with different vm options are needed. And we cannot use `ApplyIfxxx` for the architecture specific options such as `UseAVX` or `UseSVE` in common tests (i.e. using "UseSVE" on x64 systems makes issue). So I agree that adding the needed flags to whitelist is better. 
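
The whitelist idea discussed above can be sketched as follows. This is a hypothetical illustration only: `FlagWhitelistSketch`, `allow`, and `shouldVerifyIR` are made-up names and do not reflect the actual `TestFramework` API or its real matching rules. The sketch assumes IR verification is skipped whenever any VM flag is not on the whitelist, and that a test may extend the whitelist with the flags it is written to tolerate (for example `UseAVX`).

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of a per-test flag whitelist; not the real IR framework code.
final class FlagWhitelistSketch {
    private final Set<String> allowed = new HashSet<>();

    FlagWhitelistSketch(String... defaultFlags) {
        allowed.addAll(List.of(defaultFlags));
    }

    /** Lets a test declare that IR verification is still meaningful with this flag. */
    void allow(String flagName) {
        allowed.add(flagName);
    }

    /** Extracts the flag name from forms such as -XX:UseAVX=1, -XX:+UseFoo, or -XX:-UseFoo. */
    static String flagName(String vmFlag) {
        if (!vmFlag.startsWith("-XX:")) {
            return vmFlag; // not an -XX flag; compare it verbatim
        }
        String s = vmFlag.substring(4);
        if (s.startsWith("+") || s.startsWith("-")) {
            s = s.substring(1);
        }
        int eq = s.indexOf('=');
        return eq >= 0 ? s.substring(0, eq) : s;
    }

    /** IR verification runs only if every supplied VM flag is whitelisted. */
    boolean shouldVerifyIR(List<String> vmFlags) {
        for (String flag : vmFlags) {
            if (!allowed.contains(flagName(flag))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        FlagWhitelistSketch whitelist = new FlagWhitelistSketch("UseSuperWord");
        List<String> flags = List.of("-XX:UseAVX=1");
        System.out.println(whitelist.shouldVerifyIR(flags)); // UseAVX not yet allowed
        whitelist.allow("UseAVX");
        System.out.println(whitelist.shouldVerifyIR(flags)); // now allowed by the test
    }
}
```

Under this model, a run with `-XX:UseAVX=1` would no longer silently skip the IR checks for a test that opted in, which is the gap that let the AVX1 failure go unnoticed.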
------------- PR: https://git.openjdk.org/jdk/pull/9373 From kvn at openjdk.org Wed Jul 6 05:57:32 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Wed, 6 Jul 2022 05:57:32 GMT Subject: RFR: 8286941: Add mask IR for partial vector operations for ARM SVE [v9] In-Reply-To: References: Message-ID: On Mon, 4 Jul 2022 10:19:36 GMT, Xiaohong Gong wrote: >> VectorAPI SVE backend supports vector operations whose vector length is smaller than the max vector length that the current hardware can support. We call them partial vector operations. For some partial operations like vector load/store and the reductions, we need to generate a mask based on the real vector length and use it to control the operations to make sure the results are correct. >> >> For example, if the user defines an IntVector with 256-bit species, and runs it on a SVE hardware that supports 512-bit as the max vector size, all the 256-bit int vector operations are partial. And a mask that all the higher lanes than the real vector length are set to 0 is generated for some ops. >> >> Currently the mask is generated in the backend that is together with the code generation for each op in the match rule. This will generate many duplicate instructions for operations that have the same vector type. Besides, the mask generation is loop invariant which could be hoisted outside of the loop. >> >> Here is an example for vector load and add reduction inside a loop: >> >> ptrue p0.s, vl8 ; mask generation >> ld1w {z16.s}, p0/z, [x14] ; load vector >> >> ptrue p0.s, vl8 ; mask generation >> uaddv d17, p0, z16.s ; add reduction >> smov x14, v17.s[0] >> >> As we can see the mask generation code "`ptrue`" is duplicated. To improve it, this patch generates the mask IR and adds it to the partial vector ops before code generation. The duplicate mask generation instructions can be optimized out by gvn and hoisted outside of the loop. 
>> >> Note that for masked vector operations, there is no need to generate additional mask even though the vector length is smaller than the max vector register size, as the original higher input mask bits have been cleared out. >> >> Here is the performance gain for the 256-bit vector reductions work on an SVE 512-bit system: >> >> Benchmark size Gain >> Byte256Vector.ADDLanes 1024 0.999 >> Byte256Vector.ANDLanes 1024 1.065 >> Byte256Vector.MAXLanes 1024 1.064 >> Byte256Vector.MINLanes 1024 1.062 >> Byte256Vector.ORLanes 1024 1.072 >> Byte256Vector.XORLanes 1024 1.041 >> Short256Vector.ADDLanes 1024 1.017 >> Short256Vector.ANDLanes 1024 1.044 >> Short256Vector.MAXLanes 1024 1.049 >> Short256Vector.MINLanes 1024 1.049 >> Short256Vector.ORLanes 1024 1.089 >> Short256Vector.XORLanes 1024 1.047 >> Int256Vector.ADDLanes 1024 1.045 >> Int256Vector.ANDLanes 1024 1.078 >> Int256Vector.MAXLanes 1024 1.123 >> Int256Vector.MINLanes 1024 1.129 >> Int256Vector.ORLanes 1024 1.078 >> Int256Vector.XORLanes 1024 1.072 >> Long256Vector.ADDLanes 1024 1.059 >> Long256Vector.ANDLanes 1024 1.101 >> Long256Vector.MAXLanes 1024 1.079 >> Long256Vector.MINLanes 1024 1.099 >> Long256Vector.ORLanes 1024 1.098 >> Long256Vector.XORLanes 1024 1.110 >> Float256Vector.ADDLanes 1024 1.033 >> Float256Vector.MAXLanes 1024 1.156 >> Float256Vector.MINLanes 1024 1.151 >> Double256Vector.ADDLanes 1024 1.062 >> Double256Vector.MAXLanes 1024 1.145 >> Double256Vector.MINLanes 1024 1.140 >> >> This patch also adds 32-bit variants of SVE whileXX instruction with one more matching rule of `VectorMaskGen (ConvI2L src)`. So after this patch, we save one `sxtw` instruction for most VectorMaskGen cases, like below: >> >> sxtw x14, w14 >> whilelo p0.s, xzr, x14 => whilelo p0.s, wzr, w14 > > Xiaohong Gong has updated the pull request incrementally with one additional commit since the last revision: > > Save vect_type to ReductionNode and VectorMaskOpNode Results are good. 
-------------

Marked as reviewed by kvn (Reviewer).

PR: https://git.openjdk.org/jdk/pull/9037

From kvn at openjdk.org Wed Jul 6 06:10:42 2022
From: kvn at openjdk.org (Vladimir Kozlov)
Date: Wed, 6 Jul 2022 06:10:42 GMT
Subject: RFR: 8289604: compiler/vectorapi/VectorLogicalOpIdentityTest.java failed on x86 AVX1 system
In-Reply-To: <2XOrrHylAft7P5DfwnNMdYbILAfTRaTrIpdWAK2PIJI=.c30dd7e1-d7b5-409f-94f6-db676b9841f9@github.com>
References: <2XOrrHylAft7P5DfwnNMdYbILAfTRaTrIpdWAK2PIJI=.c30dd7e1-d7b5-409f-94f6-db676b9841f9@github.com>
Message-ID: 

On Tue, 5 Jul 2022 07:47:58 GMT, Xiaohong Gong wrote:

> The sub-test "`testMaskAndZero()`" failed on x86 systems when
> `UseAVX=1` with the IR check failure:
>
>   - counts: Graph contains wrong number of nodes:
>     * Regex 1: (\\d+(\\s){2}(StoreVector.*)+(\\s){2}===.*)
>     - Failed comparison: [found] 0 >= 1 [given]
>     - No nodes matched!
>
> The root cause is the `VectorMask.fromArray/intoArray` APIs
> are not intrinsified when "`UseAVX=1`" for long type vectors
> with following reasons:
> 1) The system supported max vector size is 128 bits for
> integral vector operations when "`UseAVX=1`".
> 2) The match rule of `VectorLoadMaskNode/VectorStoreMaskNode`
> are not supported for vectors with 2 elements (see [1]).
>
> Note that `VectorMask.fromArray()` needs to be intrinsified
> with "`LoadVector+VectorLoadMask`". And `VectorMask.intoArray()`
> needs to be intrinsified with "`VectorStoreMask+StoreVector`".
> Either "`VectorStoreMask`" or "`StoreVector`" not supported by the
> compiler backend will forbid the related API intrinsification.
>
> Replacing the vector type from Long to other integral types
> in the test case can fix the issue.
>
> [1] https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/x86/x86.ad#L1861

Finally, a few runs of the failed tests were executed on an AVX1 machine and they passed.

-------------

Marked as reviewed by kvn (Reviewer).
PR: https://git.openjdk.org/jdk/pull/9373 From rpressler at openjdk.org Wed Jul 6 09:44:23 2022 From: rpressler at openjdk.org (Ron Pressler) Date: Wed, 6 Jul 2022 09:44:23 GMT Subject: [jdk19] RFR: 8288949: serviceability/jvmti/vthread/ContStackDepthTest/ContStackDepthTest.java failing [v4] In-Reply-To: References: Message-ID: > Please review the following bug fix: > > `Continuation.enterSpecial` is a generated special nmethod (albeit not a Java method), with a well-known frame layout that calls `Continuation.enter`. > > Because it is compiled, it resolves the call to `Continuation.enter` to its compiled version, if available. But this results in the compiled `Continuation.enter` being called even when the thread is in interp_only_mode. > > This change does three things: > > 1. When entering interp_only_mode, `Continuation::set_cont_fastpath_thread_state` will clear enterSpecial's resolved callsite to Continuation.enter. > 2. In interp_only_mode, `SharedRuntime::resolve_static_call_C` will return `Continuation.enter`'s c2i entry rather than `verified_code_entry`. > 3. In interp_only_mode, the c2i stub will not patch the callsite. > > This fix isn't perfect, because a different thread, not in interp_only_mode, might patch the call. A longer-term solution is to create an "interpreted" version of `enterSpecial` and supporting an ad-hoc deoptimization. See https://bugs.openjdk.org/browse/JDK-8289128 > > > Passes tiers 1-4 and Loom tiers 1-5. 
Ron Pressler has updated the pull request incrementally with one additional commit since the last revision: Changes following review comments ------------- Changes: - all: https://git.openjdk.org/jdk19/pull/66/files - new: https://git.openjdk.org/jdk19/pull/66/files/7323f635..43f18e73 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk19&pr=66&range=03 - incr: https://webrevs.openjdk.org/?repo=jdk19&pr=66&range=02-03 Stats: 5 lines in 2 files changed: 2 ins; 2 del; 1 mod Patch: https://git.openjdk.org/jdk19/pull/66.diff Fetch: git fetch https://git.openjdk.org/jdk19 pull/66/head:pull/66 PR: https://git.openjdk.org/jdk19/pull/66 From jbhateja at openjdk.org Wed Jul 6 13:24:27 2022 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Wed, 6 Jul 2022 13:24:27 GMT Subject: RFR: 8289186: Support predicated vector load/store operations over X86 AVX2 targets. [v4] In-Reply-To: References: Message-ID: > Hi All, > > [JDK-8283667](https://bugs.openjdk.org/browse/JDK-8283667) added the support to handle masked loads on non-predicated targets by blending the loaded contents with zero vector iff unmasked portion of load does not span beyond array bounds. > > X86 AVX2 offers direct predicated vector loads/store instruction for non-sub word type. > > This patch adds the efficient backend implementation for predicated memory operations over int/long/float/double vectors. > > Please find below the JMH micro stats with and without patch. 
> > > > System : Intel(R) Xeon(R) Platinum 8280 CPU @ 2.70GHz [28C 2S Cascadelake Server] > > Baseline: > Benchmark (inSize) (outSize) Mode Cnt Score Error Units > LoadMaskedIOOBEBenchmark.byteLoadArrayMaskIOOBE 1026 1152 thrpt 2 712.218 ops/ms > LoadMaskedIOOBEBenchmark.doubleLoadArrayMaskIOOBE 1026 1152 thrpt 2 156.912 ops/ms > LoadMaskedIOOBEBenchmark.floatLoadArrayMaskIOOBE 1026 1152 thrpt 2 255.814 ops/ms > LoadMaskedIOOBEBenchmark.intLoadArrayMaskIOOBE 1026 1152 thrpt 2 267.688 ops/ms > LoadMaskedIOOBEBenchmark.longLoadArrayMaskIOOBE 1026 1152 thrpt 2 140.957 ops/ms > LoadMaskedIOOBEBenchmark.shortLoadArrayMaskIOOBE 1026 1152 thrpt 2 474.009 ops/ms > > > With Opt: > Benchmark (inSize) (outSize) Mode Cnt Score Error Units > LoadMaskedIOOBEBenchmark.byteLoadArrayMaskIOOBE 1026 1152 thrpt 2 742.781 ops/ms > LoadMaskedIOOBEBenchmark.doubleLoadArrayMaskIOOBE 1026 1152 thrpt 2 1241.021 ops/ms > LoadMaskedIOOBEBenchmark.floatLoadArrayMaskIOOBE 1026 1152 thrpt 2 2333.311 ops/ms > LoadMaskedIOOBEBenchmark.intLoadArrayMaskIOOBE 1026 1152 thrpt 2 3258.754 ops/ms > LoadMaskedIOOBEBenchmark.longLoadArrayMaskIOOBE 1026 1152 thrpt 2 1757.192 ops/ms > LoadMaskedIOOBEBenchmark.shortLoadArrayMaskIOOBE 1026 1152 thrpt 2 472.590 ops/ms > > > Predicated memory operation over sub-word type will be handled in a subsequent patch. > > Kindly review and share your feedback. 
> > Best Regards, > Jatin Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: 8289186: jcheck failure ------------- Changes: - all: https://git.openjdk.org/jdk/pull/9324/files - new: https://git.openjdk.org/jdk/pull/9324/files/60a777ca..2551d741 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=9324&range=03 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=9324&range=02-03 Stats: 0 lines in 1 file changed: 0 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/9324.diff Fetch: git fetch https://git.openjdk.org/jdk pull/9324/head:pull/9324 PR: https://git.openjdk.org/jdk/pull/9324 From rpressler at openjdk.org Wed Jul 6 08:59:46 2022 From: rpressler at openjdk.org (Ron Pressler) Date: Wed, 6 Jul 2022 08:59:46 GMT Subject: [jdk19] RFR: 8288949: serviceability/jvmti/vthread/ContStackDepthTest/ContStackDepthTest.java failing [v3] In-Reply-To: References: Message-ID: On Wed, 6 Jul 2022 02:19:28 GMT, Dean Long wrote: >> Ron Pressler has updated the pull request incrementally with two additional commits since the last revision: >> >> - Add an "i2i" entry to enterSpecial >> - Fix comment > > src/hotspot/cpu/aarch64/sharedRuntime_aarch64.cpp line 1058: > >> 1056: >> 1057: address mark = __ pc(); >> 1058: __ trampoline_call1(resolve, NULL, false); > > I don't think it's necessary to call the resolve stub when in interpreted mode. Can't we just call the Method's c2i adapter just like the interpreter would? I guess there might be a startup issue if the adapter hasn't been generated yet. I couldn't find code that does that and could be easily reused. > src/hotspot/cpu/x86/sharedRuntime_x86_64.cpp line 1322: > >> 1320: >> 1321: __ pop(rax); // return address >> 1322: // Read interpreter arguments into registers (this is an ad-hoc i2c adapter) > > If I understand this correctly, this allows you to avoid creating an interpreted frame. Pretty clever! Yeah, it's a hand-rolled i2c adapter. 
I thought of just calling `gen_i2c_adapter` in place, but that would have required changing it. If we had two separate nmethods, we could rely on the standard i2c, but having two nmethods for a single Method didn't seem safe at this time. We can revisit later. ------------- PR: https://git.openjdk.org/jdk19/pull/66 From dlong at openjdk.org Wed Jul 6 02:01:46 2022 From: dlong at openjdk.org (Dean Long) Date: Wed, 6 Jul 2022 02:01:46 GMT Subject: [jdk19] RFR: 8288949: serviceability/jvmti/vthread/ContStackDepthTest/ContStackDepthTest.java failing [v3] In-Reply-To: References: Message-ID: <3VeghgUTahQwYWfJSEZMhXoX7rg_E7gES203lyu_HR0=.be938670-d295-463a-870f-9af8b5ff623c@github.com> On Tue, 5 Jul 2022 08:31:31 GMT, Ron Pressler wrote: >> Please review the following bug fix: >> >> `Continuation.enterSpecial` is a generated special nmethod (albeit not a Java method), with a well-known frame layout that calls `Continuation.enter`. >> >> Because it is compiled, it resolves the call to `Continuation.enter` to its compiled version, if available. But this results in the compiled `Continuation.enter` being called even when the thread is in interp_only_mode. >> >> This change does three things: >> >> 1. When entering interp_only_mode, `Continuation::set_cont_fastpath_thread_state` will clear enterSpecial's resolved callsite to Continuation.enter. >> 2. In interp_only_mode, `SharedRuntime::resolve_static_call_C` will return `Continuation.enter`'s c2i entry rather than `verified_code_entry`. >> 3. In interp_only_mode, the c2i stub will not patch the callsite. >> >> This fix isn't perfect, because a different thread, not in interp_only_mode, might patch the call. A longer-term solution is to create an "interpreted" version of `enterSpecial` and supporting an ad-hoc deoptimization. See https://bugs.openjdk.org/browse/JDK-8289128 >> >> >> Passes tiers 1-4 and Loom tiers 1-5. 
> > Ron Pressler has updated the pull request incrementally with two additional commits since the last revision: > > - Add an "i2i" entry to enterSpecial > - Fix comment src/hotspot/cpu/aarch64/sharedRuntime_aarch64.cpp line 1050: > 1048: OopMap* map = continuation_enter_setup(masm, stack_slots); > 1049: // The frame is complete here, but we only record it for the compiled entry, so the frame would appear unsafe, > 1050: // but that's okay because at the very worst we'll miss an async sample, but we're in interp_only_mode anyeay. Suggestion: // but that's okay because at the very worst we'll miss an async sample, but we're in interp_only_mode anyway. ------------- PR: https://git.openjdk.org/jdk19/pull/66 From dlong at openjdk.org Wed Jul 6 02:45:43 2022 From: dlong at openjdk.org (Dean Long) Date: Wed, 6 Jul 2022 02:45:43 GMT Subject: [jdk19] RFR: 8288949: serviceability/jvmti/vthread/ContStackDepthTest/ContStackDepthTest.java failing [v3] In-Reply-To: References: Message-ID: On Tue, 5 Jul 2022 08:31:31 GMT, Ron Pressler wrote: >> Please review the following bug fix: >> >> `Continuation.enterSpecial` is a generated special nmethod (albeit not a Java method), with a well-known frame layout that calls `Continuation.enter`. >> >> Because it is compiled, it resolves the call to `Continuation.enter` to its compiled version, if available. But this results in the compiled `Continuation.enter` being called even when the thread is in interp_only_mode. >> >> This change does three things: >> >> 1. When entering interp_only_mode, `Continuation::set_cont_fastpath_thread_state` will clear enterSpecial's resolved callsite to Continuation.enter. >> 2. In interp_only_mode, `SharedRuntime::resolve_static_call_C` will return `Continuation.enter`'s c2i entry rather than `verified_code_entry`. >> 3. In interp_only_mode, the c2i stub will not patch the callsite. >> >> This fix isn't perfect, because a different thread, not in interp_only_mode, might patch the call. 
A longer-term solution is to create an "interpreted" version of `enterSpecial` and supporting an ad-hoc deoptimization. See https://bugs.openjdk.org/browse/JDK-8289128 >> >> >> Passes tiers 1-4 and Loom tiers 1-5. > > Ron Pressler has updated the pull request incrementally with two additional commits since the last revision: > > - Add an "i2i" entry to enterSpecial > - Fix comment src/hotspot/cpu/x86/sharedRuntime_x86_64.cpp line 1322: > 1320: > 1321: __ pop(rax); // return address > 1322: // Read interpreter arguments into registers (this is an ad-hoc i2c adapter) If I understand this correctly, this allows you to avoid creating an interpreted frame. Pretty clever! ------------- PR: https://git.openjdk.org/jdk19/pull/66 From rpressler at openjdk.org Wed Jul 6 20:56:00 2022 From: rpressler at openjdk.org (Ron Pressler) Date: Wed, 6 Jul 2022 20:56:00 GMT Subject: [jdk19] Integrated: 8288949: serviceability/jvmti/vthread/ContStackDepthTest/ContStackDepthTest.java failing In-Reply-To: References: Message-ID: On Fri, 24 Jun 2022 09:23:26 GMT, Ron Pressler wrote: > Please review the following bug fix: > > `Continuation.enterSpecial` is a generated special nmethod (albeit not a Java method), with a well-known frame layout that calls `Continuation.enter`. > > Because it is compiled, it resolves the call to `Continuation.enter` to its compiled version, if available. But this results in the compiled `Continuation.enter` being called even when the thread is in interp_only_mode. > > This change does three things: > > 1. When entering interp_only_mode, `Continuation::set_cont_fastpath_thread_state` will clear enterSpecial's resolved callsite to Continuation.enter. > 2. In interp_only_mode, `SharedRuntime::resolve_static_call_C` will return `Continuation.enter`'s c2i entry rather than `verified_code_entry`. > 3. In interp_only_mode, the c2i stub will not patch the callsite. > > This fix isn't perfect, because a different thread, not in interp_only_mode, might patch the call. 
A longer-term solution is to create an "interpreted" version of `enterSpecial` and supporting an ad-hoc deoptimization. See https://bugs.openjdk.org/browse/JDK-8289128 > > > Passes tiers 1-4 and Loom tiers 1-5. This pull request has now been integrated. Changeset: 9a0fa824 Author: Ron Pressler URL: https://git.openjdk.org/jdk19/commit/9a0fa8242461afe9ee4bcf80523af13500c9c1f2 Stats: 218 lines in 10 files changed: 189 ins; 10 del; 19 mod 8288949: serviceability/jvmti/vthread/ContStackDepthTest/ContStackDepthTest.java failing Reviewed-by: dlong, eosterlund, rehn ------------- PR: https://git.openjdk.org/jdk19/pull/66 From dlong at openjdk.org Wed Jul 6 19:42:31 2022 From: dlong at openjdk.org (Dean Long) Date: Wed, 6 Jul 2022 19:42:31 GMT Subject: [jdk19] RFR: 8288949: serviceability/jvmti/vthread/ContStackDepthTest/ContStackDepthTest.java failing [v3] In-Reply-To: References: Message-ID: <2J4lDx3_qn257A-CyHpI-VZObigqY3rHDhAO3TLCL0Y=.e383cb33-2b1e-48ed-9e02-e43b9d3d5ed1@github.com> On Wed, 6 Jul 2022 08:56:49 GMT, Ron Pressler wrote: >> src/hotspot/cpu/x86/sharedRuntime_x86_64.cpp line 1322: >> >>> 1320: >>> 1321: __ pop(rax); // return address >>> 1322: // Read interpreter arguments into registers (this is an ad-hoc i2c adapter) >> >> If I understand this correctly, this allows you to avoid creating an interpreted frame. Pretty clever! > > Yeah, it's a hand-rolled i2c adapter. I thought of just calling `gen_i2c_adapter` in place, but that would have required changing it. If we had two separate nmethods, we could rely on the standard i2c, but having two nmethods for a single Method didn't seem safe at this time. We can revisit later. I don't think the interpreter version needs to be an nmethod. It could be more like the intrinsics generated by TemplateInterpreterGenerator::generate_method_entry, but let's revisit after jdk19. 
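The three steps in the bug fix description above, and the caveat that another thread may still patch the call, can be sketched roughly as follows. This is only a toy model under stated assumptions: `Entry`, `Thread`, `CallSite`, `enter_interp_only_mode`, `resolve_static_call` and `call` are hypothetical names chosen for illustration and do not correspond to HotSpot's actual data structures or signatures.

```cpp
#include <cassert>

// Hypothetical model: which entry point of Continuation.enter a call site targets.
enum class Entry { Unresolved, VerifiedCode, C2I };

struct Thread { bool interp_only_mode = false; };
struct CallSite { Entry target = Entry::Unresolved; };

// Step 1: entering interp_only_mode clears the previously resolved call site.
void enter_interp_only_mode(Thread& t, CallSite& site) {
  t.interp_only_mode = true;
  site.target = Entry::Unresolved;
}

// Step 2: resolution yields the c2i entry in interp_only_mode,
// and the compiled (verified) entry otherwise.
Entry resolve_static_call(const Thread& t) {
  return t.interp_only_mode ? Entry::C2I : Entry::VerifiedCode;
}

// Step 3: a thread in interp_only_mode does not patch the shared call site.
Entry call(const Thread& t, CallSite& site) {
  if (site.target != Entry::Unresolved) {
    return site.target;  // already patched, possibly by another thread
  }
  Entry e = resolve_static_call(t);
  if (!t.interp_only_mode) {
    site.target = e;     // only threads outside interp_only_mode patch
  }
  return e;
}
```

In this model, once a second thread that is not in interp_only_mode has patched the shared call site, the first thread reaches the compiled entry again, which is the imperfection the description mentions.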
------------- PR: https://git.openjdk.org/jdk19/pull/66 From mdoerr at openjdk.org Thu Jul 7 08:13:53 2022 From: mdoerr at openjdk.org (Martin Doerr) Date: Thu, 7 Jul 2022 08:13:53 GMT Subject: RFR: 8289856: [PPC64] SIGSEGV in C2Compiler::init_c2_runtime() after JDK-8289060 [v3] In-Reply-To: References: Message-ID: <4npz2hop627nZ4L41HlihXbwbxAOFwr9R4837X6nyYM=.bc6f8b7c-0bdc-4940-ae5d-9efd0fbaa427@github.com> > We're currently calling `nullptr->is_valid()` and `nullptr->value()` which causes SIGSEGV on PPC64 (and is undefined behavior). See JBS for details. Martin Doerr has updated the pull request incrementally with one additional commit since the last revision: Revert whitespace change. ------------- Changes: - all: https://git.openjdk.org/jdk/pull/9403/files - new: https://git.openjdk.org/jdk/pull/9403/files/58cefd26..a11a20c9 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=9403&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=9403&range=01-02 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/9403.diff Fetch: git fetch https://git.openjdk.org/jdk pull/9403/head:pull/9403 PR: https://git.openjdk.org/jdk/pull/9403 From mdoerr at openjdk.org Thu Jul 7 07:59:16 2022 From: mdoerr at openjdk.org (Martin Doerr) Date: Thu, 7 Jul 2022 07:59:16 GMT Subject: RFR: 8289856: [PPC64] SIGSEGV in C2Compiler::init_c2_runtime() after JDK-8289060 [v2] In-Reply-To: References: Message-ID: > We're currently calling `nullptr->is_valid()` and `nullptr->value()` which causes SIGSEGV on PPC64 (and is undefined behavior). See JBS for details. Martin Doerr has updated the pull request incrementally with one additional commit since the last revision: Use VMRegImpl::Bad() for vector registers and remove null check again. 
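A minimal sketch of why a sentinel removes the need for the null check, assuming a much simplified register type. `Reg`, `kBadReg`, `kTable` and `count_valid` are hypothetical names for illustration; `kBadReg` only plays the role that `VMRegImpl::Bad()` plays in this thread and is not HotSpot's implementation.

```cpp
#include <cassert>

// Hypothetical, simplified register descriptor (not HotSpot's VMRegImpl).
struct Reg {
  int id;
  bool is_valid() const { return id >= 0; }
  int value() const { return id; }
};

// Sentinel object standing in for "no register".
static const Reg kBadReg{-1};
static const Reg kR0{0};
static const Reg kR1{1};

// Slots without a real register point at the sentinel, never at nullptr,
// so calling a member function on any entry is well defined.
static const Reg* const kTable[] = { &kR0, &kBadReg, &kR1 };

// Counts the valid registers; no null check is needed inside the loop.
int count_valid() {
  int n = 0;
  for (const Reg* r : kTable) {
    if (r->is_valid()) {
      n++;
    }
  }
  return n;
}
```

With null entries in the table instead, `r->is_valid()` would be a member call through a null pointer, which is undefined behavior in C++ regardless of whether it happens to crash on a given platform.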
------------- Changes: - all: https://git.openjdk.org/jdk/pull/9403/files - new: https://git.openjdk.org/jdk/pull/9403/files/15ca35ae..58cefd26 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=9403&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=9403&range=00-01 Stats: 66 lines in 2 files changed: 0 ins; 0 del; 66 mod Patch: https://git.openjdk.org/jdk/pull/9403.diff Fetch: git fetch https://git.openjdk.org/jdk pull/9403/head:pull/9403 PR: https://git.openjdk.org/jdk/pull/9403 From ngasson at openjdk.org Thu Jul 7 08:09:33 2022 From: ngasson at openjdk.org (Nick Gasson) Date: Thu, 7 Jul 2022 08:09:33 GMT Subject: RFR: 8286941: Add mask IR for partial vector operations for ARM SVE [v9] In-Reply-To: References: Message-ID: On Mon, 4 Jul 2022 10:19:36 GMT, Xiaohong Gong wrote: >> VectorAPI SVE backend supports vector operations whose vector length is smaller than the max vector length that the current hardware can support. We call them partial vector operations. For some partial operations like vector load/store and the reductions, we need to generate a mask based on the real vector length and use it to control the operations to make sure the results are correct. >> >> For example, if the user defines an IntVector with 256-bit species, and runs it on a SVE hardware that supports 512-bit as the max vector size, all the 256-bit int vector operations are partial. And a mask that all the higher lanes than the real vector length are set to 0 is generated for some ops. >> >> Currently the mask is generated in the backend that is together with the code generation for each op in the match rule. This will generate many duplicate instructions for operations that have the same vector type. Besides, the mask generation is loop invariant which could be hoisted outside of the loop. 
>> >> Here is an example for vector load and add reduction inside a loop: >> >> ptrue p0.s, vl8 ; mask generation >> ld1w {z16.s}, p0/z, [x14] ; load vector >> >> ptrue p0.s, vl8 ; mask generation >> uaddv d17, p0, z16.s ; add reduction >> smov x14, v17.s[0] >> >> As we can see the mask generation code "`ptrue`" is duplicated. To improve it, this patch generates the mask IR and adds it to the partial vector ops before code generation. The duplicate mask generation instructions can be optimized out by gvn and hoisted outside of the loop. >> >> Note that for masked vector operations, there is no need to generate additional mask even though the vector length is smaller than the max vector register size, as the original higher input mask bits have been cleared out. >> >> Here is the performance gain for the 256-bit vector reductions work on an SVE 512-bit system: >> >> Benchmark size Gain >> Byte256Vector.ADDLanes 1024 0.999 >> Byte256Vector.ANDLanes 1024 1.065 >> Byte256Vector.MAXLanes 1024 1.064 >> Byte256Vector.MINLanes 1024 1.062 >> Byte256Vector.ORLanes 1024 1.072 >> Byte256Vector.XORLanes 1024 1.041 >> Short256Vector.ADDLanes 1024 1.017 >> Short256Vector.ANDLanes 1024 1.044 >> Short256Vector.MAXLanes 1024 1.049 >> Short256Vector.MINLanes 1024 1.049 >> Short256Vector.ORLanes 1024 1.089 >> Short256Vector.XORLanes 1024 1.047 >> Int256Vector.ADDLanes 1024 1.045 >> Int256Vector.ANDLanes 1024 1.078 >> Int256Vector.MAXLanes 1024 1.123 >> Int256Vector.MINLanes 1024 1.129 >> Int256Vector.ORLanes 1024 1.078 >> Int256Vector.XORLanes 1024 1.072 >> Long256Vector.ADDLanes 1024 1.059 >> Long256Vector.ANDLanes 1024 1.101 >> Long256Vector.MAXLanes 1024 1.079 >> Long256Vector.MINLanes 1024 1.099 >> Long256Vector.ORLanes 1024 1.098 >> Long256Vector.XORLanes 1024 1.110 >> Float256Vector.ADDLanes 1024 1.033 >> Float256Vector.MAXLanes 1024 1.156 >> Float256Vector.MINLanes 1024 1.151 >> Double256Vector.ADDLanes 1024 1.062 >> Double256Vector.MAXLanes 1024 1.145 >> 
Double256Vector.MINLanes 1024 1.140 >> >> This patch also adds 32-bit variants of SVE whileXX instruction with one more matching rule of `VectorMaskGen (ConvI2L src)`. So after this patch, we save one `sxtw` instruction for most VectorMaskGen cases, like below: >> >> sxtw x14, w14 >> whilelo p0.s, xzr, x14 => whilelo p0.s, wzr, w14 > > Xiaohong Gong has updated the pull request incrementally with one additional commit since the last revision: > > Save vect_type to ReductionNode and VectorMaskOpNode Marked as reviewed by ngasson (Reviewer). ------------- PR: https://git.openjdk.org/jdk/pull/9037 From mdoerr at openjdk.org Wed Jul 6 20:56:04 2022 From: mdoerr at openjdk.org (Martin Doerr) Date: Wed, 6 Jul 2022 20:56:04 GMT Subject: RFR: 8289856: [PPC64] SIGSEGV in C2Compiler::init_c2_runtime() after JDK-8289060 Message-ID: We're currently calling `nullptr->is_valid()` and `nullptr->value()` which causes SIGSEGV on PPC64 (and is undefined behavior). See JBS for details. ------------- Commit messages: - 8289856: [PPC64] SIGSEGV in C2Compiler::init_c2_runtime() after JDK-8289060 Changes: https://git.openjdk.org/jdk/pull/9403/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=9403&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8289856 Stats: 2 lines in 1 file changed: 0 ins; 0 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/9403.diff Fetch: git fetch https://git.openjdk.org/jdk pull/9403/head:pull/9403 PR: https://git.openjdk.org/jdk/pull/9403 From xgong at openjdk.org Thu Jul 7 08:19:53 2022 From: xgong at openjdk.org (Xiaohong Gong) Date: Thu, 7 Jul 2022 08:19:53 GMT Subject: Integrated: 8286941: Add mask IR for partial vector operations for ARM SVE In-Reply-To: References: Message-ID: On Mon, 6 Jun 2022 09:42:02 GMT, Xiaohong Gong wrote: > VectorAPI SVE backend supports vector operations whose vector length is smaller than the max vector length that the current hardware can support. We call them partial vector operations. 
For some partial operations like vector load/store and the reductions, we need to generate a mask based on the real vector length and use it to control the operations to make sure the results are correct. > > For example, if the user defines an IntVector with 256-bit species, and runs it on a SVE hardware that supports 512-bit as the max vector size, all the 256-bit int vector operations are partial. And a mask that all the higher lanes than the real vector length are set to 0 is generated for some ops. > > Currently the mask is generated in the backend that is together with the code generation for each op in the match rule. This will generate many duplicate instructions for operations that have the same vector type. Besides, the mask generation is loop invariant which could be hoisted outside of the loop. > > Here is an example for vector load and add reduction inside a loop: > > ptrue p0.s, vl8 ; mask generation > ld1w {z16.s}, p0/z, [x14] ; load vector > > ptrue p0.s, vl8 ; mask generation > uaddv d17, p0, z16.s ; add reduction > smov x14, v17.s[0] > > As we can see the mask generation code "`ptrue`" is duplicated. To improve it, this patch generates the mask IR and adds it to the partial vector ops before code generation. The duplicate mask generation instructions can be optimized out by gvn and hoisted outside of the loop. > > Note that for masked vector operations, there is no need to generate additional mask even though the vector length is smaller than the max vector register size, as the original higher input mask bits have been cleared out. 
> > Here is the performance gain for the 256-bit vector reductions work on an SVE 512-bit system: > > Benchmark size Gain > Byte256Vector.ADDLanes 1024 0.999 > Byte256Vector.ANDLanes 1024 1.065 > Byte256Vector.MAXLanes 1024 1.064 > Byte256Vector.MINLanes 1024 1.062 > Byte256Vector.ORLanes 1024 1.072 > Byte256Vector.XORLanes 1024 1.041 > Short256Vector.ADDLanes 1024 1.017 > Short256Vector.ANDLanes 1024 1.044 > Short256Vector.MAXLanes 1024 1.049 > Short256Vector.MINLanes 1024 1.049 > Short256Vector.ORLanes 1024 1.089 > Short256Vector.XORLanes 1024 1.047 > Int256Vector.ADDLanes 1024 1.045 > Int256Vector.ANDLanes 1024 1.078 > Int256Vector.MAXLanes 1024 1.123 > Int256Vector.MINLanes 1024 1.129 > Int256Vector.ORLanes 1024 1.078 > Int256Vector.XORLanes 1024 1.072 > Long256Vector.ADDLanes 1024 1.059 > Long256Vector.ANDLanes 1024 1.101 > Long256Vector.MAXLanes 1024 1.079 > Long256Vector.MINLanes 1024 1.099 > Long256Vector.ORLanes 1024 1.098 > Long256Vector.XORLanes 1024 1.110 > Float256Vector.ADDLanes 1024 1.033 > Float256Vector.MAXLanes 1024 1.156 > Float256Vector.MINLanes 1024 1.151 > Double256Vector.ADDLanes 1024 1.062 > Double256Vector.MAXLanes 1024 1.145 > Double256Vector.MINLanes 1024 1.140 > > This patch also adds 32-bit variants of SVE whileXX instruction with one more matching rule of `VectorMaskGen (ConvI2L src)`. So after this patch, we save one `sxtw` instruction for most VectorMaskGen cases, like below: > > sxtw x14, w14 > whilelo p0.s, xzr, x14 => whilelo p0.s, wzr, w14 This pull request has now been integrated. 
Changeset: a79ce4e7 Author: Xiaohong Gong URL: https://git.openjdk.org/jdk/commit/a79ce4e74858e78acc83c12d500303f667dc3f6b Stats: 2042 lines in 19 files changed: 791 ins; 829 del; 422 mod 8286941: Add mask IR for partial vector operations for ARM SVE Reviewed-by: kvn, jbhateja, njian, ngasson ------------- PR: https://git.openjdk.org/jdk/pull/9037 From xgong at openjdk.org Thu Jul 7 06:12:29 2022 From: xgong at openjdk.org (Xiaohong Gong) Date: Thu, 7 Jul 2022 06:12:29 GMT Subject: RFR: 8286941: Add mask IR for partial vector operations for ARM SVE [v9] In-Reply-To: References: Message-ID: <9ChNCqvi-YNcx6n5OcEQl7MzvOLrARNrqCH1OY73mZE=.9d6de2cd-5c98-4d27-aa0b-20922f54b53b@github.com> On Mon, 4 Jul 2022 10:19:36 GMT, Xiaohong Gong wrote: >> VectorAPI SVE backend supports vector operations whose vector length is smaller than the max vector length that the current hardware can support. We call them partial vector operations. For some partial operations like vector load/store and the reductions, we need to generate a mask based on the real vector length and use it to control the operations to make sure the results are correct. >> >> For example, if the user defines an IntVector with 256-bit species, and runs it on a SVE hardware that supports 512-bit as the max vector size, all the 256-bit int vector operations are partial. And a mask that all the higher lanes than the real vector length are set to 0 is generated for some ops. >> >> Currently the mask is generated in the backend that is together with the code generation for each op in the match rule. This will generate many duplicate instructions for operations that have the same vector type. Besides, the mask generation is loop invariant which could be hoisted outside of the loop. 
>> >> Here is an example for vector load and add reduction inside a loop: >> >> ptrue p0.s, vl8 ; mask generation >> ld1w {z16.s}, p0/z, [x14] ; load vector >> >> ptrue p0.s, vl8 ; mask generation >> uaddv d17, p0, z16.s ; add reduction >> smov x14, v17.s[0] >> >> As we can see the mask generation code "`ptrue`" is duplicated. To improve it, this patch generates the mask IR and adds it to the partial vector ops before code generation. The duplicate mask generation instructions can be optimized out by gvn and hoisted outside of the loop. >> >> Note that for masked vector operations, there is no need to generate additional mask even though the vector length is smaller than the max vector register size, as the original higher input mask bits have been cleared out. >> >> Here is the performance gain for the 256-bit vector reductions work on an SVE 512-bit system: >> >> Benchmark size Gain >> Byte256Vector.ADDLanes 1024 0.999 >> Byte256Vector.ANDLanes 1024 1.065 >> Byte256Vector.MAXLanes 1024 1.064 >> Byte256Vector.MINLanes 1024 1.062 >> Byte256Vector.ORLanes 1024 1.072 >> Byte256Vector.XORLanes 1024 1.041 >> Short256Vector.ADDLanes 1024 1.017 >> Short256Vector.ANDLanes 1024 1.044 >> Short256Vector.MAXLanes 1024 1.049 >> Short256Vector.MINLanes 1024 1.049 >> Short256Vector.ORLanes 1024 1.089 >> Short256Vector.XORLanes 1024 1.047 >> Int256Vector.ADDLanes 1024 1.045 >> Int256Vector.ANDLanes 1024 1.078 >> Int256Vector.MAXLanes 1024 1.123 >> Int256Vector.MINLanes 1024 1.129 >> Int256Vector.ORLanes 1024 1.078 >> Int256Vector.XORLanes 1024 1.072 >> Long256Vector.ADDLanes 1024 1.059 >> Long256Vector.ANDLanes 1024 1.101 >> Long256Vector.MAXLanes 1024 1.079 >> Long256Vector.MINLanes 1024 1.099 >> Long256Vector.ORLanes 1024 1.098 >> Long256Vector.XORLanes 1024 1.110 >> Float256Vector.ADDLanes 1024 1.033 >> Float256Vector.MAXLanes 1024 1.156 >> Float256Vector.MINLanes 1024 1.151 >> Double256Vector.ADDLanes 1024 1.062 >> Double256Vector.MAXLanes 1024 1.145 >> 
Double256Vector.MINLanes 1024 1.140 >> >> This patch also adds 32-bit variants of SVE whileXX instruction with one more matching rule of `VectorMaskGen (ConvI2L src)`. So after this patch, we save one `sxtw` instruction for most VectorMaskGen cases, like below: >> >> sxtw x14, w14 >> whilelo p0.s, xzr, x14 => whilelo p0.s, wzr, w14 > > Xiaohong Gong has updated the pull request incrementally with one additional commit since the last revision: > > Save vect_type to ReductionNode and VectorMaskOpNode @nick-arm , could you please take a look at this change again? Thanks so much! ------------- PR: https://git.openjdk.org/jdk/pull/9037 From mdoerr at openjdk.org Thu Jul 7 07:59:17 2022 From: mdoerr at openjdk.org (Martin Doerr) Date: Thu, 7 Jul 2022 07:59:17 GMT Subject: RFR: 8289856: [PPC64] SIGSEGV in C2Compiler::init_c2_runtime() after JDK-8289060 [v2] In-Reply-To: References: Message-ID: On Thu, 7 Jul 2022 00:16:18 GMT, Dean Long wrote: >> Martin Doerr has updated the pull request incrementally with one additional commit since the last revision: >> >> Use VMRegImpl::Bad() for vector registers and remove null check again. > > src/hotspot/share/opto/c2compiler.cpp line 69: > >> 67: for (OptoReg::Name i=OptoReg::Name(0); i> 68: VMReg r = OptoReg::as_VMReg(i); >> 69: if (r != nullptr && r->is_valid()) { > > Instead of changing shared code, how about changing ppc.ad to use VMRegImpl::Bad() instead of NULL? That makes more sense. I've changed it. Thanks! 
------------- PR: https://git.openjdk.org/jdk/pull/9403 From mdoerr at openjdk.org Thu Jul 7 08:13:55 2022 From: mdoerr at openjdk.org (Martin Doerr) Date: Thu, 7 Jul 2022 08:13:55 GMT Subject: RFR: 8289856: [PPC64] SIGSEGV in C2Compiler::init_c2_runtime() after JDK-8289060 [v2] In-Reply-To: References: Message-ID: On Thu, 7 Jul 2022 08:00:45 GMT, Dean Long wrote: >> Martin Doerr has updated the pull request incrementally with one additional commit since the last revision: >> >> Use VMRegImpl::Bad() for vector registers and remove null check again. > > src/hotspot/share/opto/c2compiler.cpp line 67: > >> 65: } >> 66: >> 67: for (OptoReg::Name i=OptoReg::Name(0); i > I don't think it's worth it to change this line just because of white space. Removed. Thanks for the prompt review! ------------- PR: https://git.openjdk.org/jdk/pull/9403 From xgong at openjdk.org Thu Jul 7 02:48:33 2022 From: xgong at openjdk.org (Xiaohong Gong) Date: Thu, 7 Jul 2022 02:48:33 GMT Subject: RFR: 8289186: Support predicated vector load/store operations over X86 AVX2 targets. [v4] In-Reply-To: References: Message-ID: On Wed, 6 Jul 2022 13:24:27 GMT, Jatin Bhateja wrote: >> Hi All, >> >> [JDK-8283667](https://bugs.openjdk.org/browse/JDK-8283667) added the support to handle masked loads on non-predicated targets by blending the loaded contents with zero vector iff unmasked portion of load does not span beyond array bounds. >> >> X86 AVX2 offers direct predicated vector loads/store instruction for non-sub word type. >> >> This patch adds the efficient backend implementation for predicated memory operations over int/long/float/double vectors. >> >> Please find below the JMH micro stats with and without patch. 
>> >> >> >> System : Intel(R) Xeon(R) Platinum 8280 CPU @ 2.70GHz [28C 2S Cascadelake Server] >> >> Baseline: >> Benchmark (inSize) (outSize) Mode Cnt Score Error Units >> LoadMaskedIOOBEBenchmark.byteLoadArrayMaskIOOBE 1026 1152 thrpt 2 712.218 ops/ms >> LoadMaskedIOOBEBenchmark.doubleLoadArrayMaskIOOBE 1026 1152 thrpt 2 156.912 ops/ms >> LoadMaskedIOOBEBenchmark.floatLoadArrayMaskIOOBE 1026 1152 thrpt 2 255.814 ops/ms >> LoadMaskedIOOBEBenchmark.intLoadArrayMaskIOOBE 1026 1152 thrpt 2 267.688 ops/ms >> LoadMaskedIOOBEBenchmark.longLoadArrayMaskIOOBE 1026 1152 thrpt 2 140.957 ops/ms >> LoadMaskedIOOBEBenchmark.shortLoadArrayMaskIOOBE 1026 1152 thrpt 2 474.009 ops/ms >> >> >> With Opt: >> Benchmark (inSize) (outSize) Mode Cnt Score Error Units >> LoadMaskedIOOBEBenchmark.byteLoadArrayMaskIOOBE 1026 1152 thrpt 2 742.781 ops/ms >> LoadMaskedIOOBEBenchmark.doubleLoadArrayMaskIOOBE 1026 1152 thrpt 2 1241.021 ops/ms >> LoadMaskedIOOBEBenchmark.floatLoadArrayMaskIOOBE 1026 1152 thrpt 2 2333.311 ops/ms >> LoadMaskedIOOBEBenchmark.intLoadArrayMaskIOOBE 1026 1152 thrpt 2 3258.754 ops/ms >> LoadMaskedIOOBEBenchmark.longLoadArrayMaskIOOBE 1026 1152 thrpt 2 1757.192 ops/ms >> LoadMaskedIOOBEBenchmark.shortLoadArrayMaskIOOBE 1026 1152 thrpt 2 472.590 ops/ms >> >> >> Predicated memory operation over sub-word type will be handled in a subsequent patch. >> >> Kindly review and share your feedback. >> >> Best Regards, >> Jatin > > Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: > > 8289186: jcheck failure Common IR changes look good to me! ------------- Marked as reviewed by xgong (Committer). 
PR: https://git.openjdk.org/jdk/pull/9324 From xgong at openjdk.org Thu Jul 7 08:09:33 2022 From: xgong at openjdk.org (Xiaohong Gong) Date: Thu, 7 Jul 2022 08:09:33 GMT Subject: RFR: 8286941: Add mask IR for partial vector operations for ARM SVE [v9] In-Reply-To: References: Message-ID: On Thu, 7 Jul 2022 08:04:44 GMT, Nick Gasson wrote: >> Xiaohong Gong has updated the pull request incrementally with one additional commit since the last revision: >> >> Save vect_type to ReductionNode and VectorMaskOpNode > > Marked as reviewed by ngasson (Reviewer). Thanks for the review, @nick-arm! ------------- PR: https://git.openjdk.org/jdk/pull/9037 From dlong at openjdk.org Thu Jul 7 00:19:40 2022 From: dlong at openjdk.org (Dean Long) Date: Thu, 7 Jul 2022 00:19:40 GMT Subject: RFR: 8289856: [PPC64] SIGSEGV in C2Compiler::init_c2_runtime() after JDK-8289060 In-Reply-To: References: Message-ID: On Wed, 6 Jul 2022 20:48:14 GMT, Martin Doerr wrote: > We're currently calling `nullptr->is_valid()` and `nullptr->value()` which causes SIGSEGV on PPC64 (and is undefined behavior). See JBS for details. src/hotspot/share/opto/c2compiler.cpp line 69: > 67: for (OptoReg::Name i=OptoReg::Name(0); i 68: VMReg r = OptoReg::as_VMReg(i); > 69: if (r != nullptr && r->is_valid()) { Instead of changing shared code, how about changing ppc.ad to use VMRegImpl::Bad() instead of NULL? ------------- PR: https://git.openjdk.org/jdk/pull/9403 From dlong at openjdk.org Wed Jul 6 02:18:45 2022 From: dlong at openjdk.org (Dean Long) Date: Wed, 6 Jul 2022 02:18:45 GMT Subject: [jdk19] RFR: 8288949: serviceability/jvmti/vthread/ContStackDepthTest/ContStackDepthTest.java failing [v3] In-Reply-To: References: Message-ID: On Tue, 5 Jul 2022 08:31:31 GMT, Ron Pressler wrote: >> Please review the following bug fix: >> >> `Continuation.enterSpecial` is a generated special nmethod (albeit not a Java method), with a well-known frame layout that calls `Continuation.enter`.
>> >> Because it is compiled, it resolves the call to `Continuation.enter` to its compiled version, if available. But this results in the compiled `Continuation.enter` being called even when the thread is in interp_only_mode. >> >> This change does three things: >> >> 1. When entering interp_only_mode, `Continuation::set_cont_fastpath_thread_state` will clear enterSpecial's resolved callsite to Continuation.enter. >> 2. In interp_only_mode, `SharedRuntime::resolve_static_call_C` will return `Continuation.enter`'s c2i entry rather than `verified_code_entry`. >> 3. In interp_only_mode, the c2i stub will not patch the callsite. >> >> This fix isn't perfect, because a different thread, not in interp_only_mode, might patch the call. A longer-term solution is to create an "interpreted" version of `enterSpecial` and supporting an ad-hoc deoptimization. See https://bugs.openjdk.org/browse/JDK-8289128 >> >> >> Passes tiers 1-4 and Loom tiers 1-5. > > Ron Pressler has updated the pull request incrementally with two additional commits since the last revision: > > - Add an "i2i" entry to enterSpecial > - Fix comment src/hotspot/cpu/aarch64/sharedRuntime_aarch64.cpp line 1222: > 1220: OopMapSet* oop_maps = new OopMapSet(); > 1221: int interpreted_entry_offset = -1; > 1222: int compiled_entry_offset = -1; `compiled_entry_offset` is unused src/hotspot/cpu/x86/sharedRuntime_x86_64.cpp line 1535: > 1533: OopMapSet* oop_maps = new OopMapSet(); > 1534: int interpreted_entry_offset = -1; > 1535: int compiled_entry_offset = -1; `compiled_entry_offset` is unused ------------- PR: https://git.openjdk.org/jdk19/pull/66 From jbhateja at openjdk.org Wed Jul 6 13:17:57 2022 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Wed, 6 Jul 2022 13:17:57 GMT Subject: RFR: 8289186: Support predicated vector load/store operations over X86 AVX2 targets.
[v3] In-Reply-To: References: Message-ID: <9HXT1kEu2HOXzWGmJwQTPo13umP4cCeM8ywGA-lwdVg=.ebd2f0b2-cae4-46ad-b81e-8a4a90f15389@github.com> > Hi All, > > [JDK-8283667](https://bugs.openjdk.org/browse/JDK-8283667) added the support to handle masked loads on non-predicated targets by blending the loaded contents with zero vector iff unmasked portion of load does not span beyond array bounds. > > X86 AVX2 offers direct predicated vector loads/store instruction for non-sub word type. > > This patch adds the efficient backend implementation for predicated memory operations over int/long/float/double vectors. > > Please find below the JMH micro stats with and without patch. > > > > System : Intel(R) Xeon(R) Platinum 8280 CPU @ 2.70GHz [28C 2S Cascadelake Server] > > Baseline: > Benchmark (inSize) (outSize) Mode Cnt Score Error Units > LoadMaskedIOOBEBenchmark.byteLoadArrayMaskIOOBE 1026 1152 thrpt 2 712.218 ops/ms > LoadMaskedIOOBEBenchmark.doubleLoadArrayMaskIOOBE 1026 1152 thrpt 2 156.912 ops/ms > LoadMaskedIOOBEBenchmark.floatLoadArrayMaskIOOBE 1026 1152 thrpt 2 255.814 ops/ms > LoadMaskedIOOBEBenchmark.intLoadArrayMaskIOOBE 1026 1152 thrpt 2 267.688 ops/ms > LoadMaskedIOOBEBenchmark.longLoadArrayMaskIOOBE 1026 1152 thrpt 2 140.957 ops/ms > LoadMaskedIOOBEBenchmark.shortLoadArrayMaskIOOBE 1026 1152 thrpt 2 474.009 ops/ms > > > With Opt: > Benchmark (inSize) (outSize) Mode Cnt Score Error Units > LoadMaskedIOOBEBenchmark.byteLoadArrayMaskIOOBE 1026 1152 thrpt 2 742.781 ops/ms > LoadMaskedIOOBEBenchmark.doubleLoadArrayMaskIOOBE 1026 1152 thrpt 2 1241.021 ops/ms > LoadMaskedIOOBEBenchmark.floatLoadArrayMaskIOOBE 1026 1152 thrpt 2 2333.311 ops/ms > LoadMaskedIOOBEBenchmark.intLoadArrayMaskIOOBE 1026 1152 thrpt 2 3258.754 ops/ms > LoadMaskedIOOBEBenchmark.longLoadArrayMaskIOOBE 1026 1152 thrpt 2 1757.192 ops/ms > LoadMaskedIOOBEBenchmark.shortLoadArrayMaskIOOBE 1026 1152 thrpt 2 472.590 ops/ms > > > Predicated memory operation over sub-word type will be handled in a 
subsequent patch. > > Kindly review and share your feedback. > > Best Regards, > Jatin Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: 8289186: Review comments resolved. ------------- Changes: - all: https://git.openjdk.org/jdk/pull/9324/files - new: https://git.openjdk.org/jdk/pull/9324/files/b3c193f4..60a777ca Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=9324&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=9324&range=01-02 Stats: 167 lines in 12 files changed: 134 ins; 26 del; 7 mod Patch: https://git.openjdk.org/jdk/pull/9324.diff Fetch: git fetch https://git.openjdk.org/jdk pull/9324/head:pull/9324 PR: https://git.openjdk.org/jdk/pull/9324 From dlong at openjdk.org Thu Jul 7 08:04:37 2022 From: dlong at openjdk.org (Dean Long) Date: Thu, 7 Jul 2022 08:04:37 GMT Subject: RFR: 8289856: [PPC64] SIGSEGV in C2Compiler::init_c2_runtime() after JDK-8289060 [v2] In-Reply-To: References: Message-ID: On Thu, 7 Jul 2022 07:59:16 GMT, Martin Doerr wrote: >> We're currently calling `nullptr->is_valid()` and `nullptr->value()` which causes SIGSEGV on PPC64 (and is undefined behavior). See JBS for details. > > Martin Doerr has updated the pull request incrementally with one additional commit since the last revision: > > Use VMRegImpl::Bad() for vector registers and remove null check again. Marked as reviewed by dlong (Reviewer). src/hotspot/share/opto/c2compiler.cpp line 67: > 65: } > 66: > 67: for (OptoReg::Name i=OptoReg::Name(0); i References: Message-ID: On Tue, 5 Jul 2022 18:41:50 GMT, Vladimir Kozlov wrote: >> Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: >> >> 8289186: Review comments resolved. > > src/hotspot/share/opto/vectorIntrinsics.cpp line 313: > >> 311: if (!is_supported && (sopc == Op_StoreVectorMasked || sopc == Op_LoadVectorMasked)) { >> 312: return true; >> 313: } > > Still unclear for me. 
As I understand, `the upfront checks` you mention are the checks at lines 270 and 286. So we know that `StoreVectorMasked` and `LoadVectorMasked` are supported.
> The second part of your comment is talking about `non-predicated targets`. Does it mean that the real check should be the following?
>
> ```
> if (Matcher::has_predicated_vectors()) {
>   ....
> } else if (sopc == Op_StoreVectorMasked || sopc == Op_LoadVectorMasked) {
>   // your comment
>   return true;
> }
> ```
> Or am I still missing what you are trying to do here?

Could we change the `VectorMaskUseType` here https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/vectorIntrinsics.cpp#L1221 instead of doing the modification here? Originally we assumed that `LoadVectorMasked/StoreVectorMasked` are implemented with the predicated feature, so we added `VectorMaskUsePred`. If that assumption has changed now, we can just remove `VectorMaskUsePred`, so that only `match_rule_supported_vector` is checked for such ops.

-------------

PR: https://git.openjdk.org/jdk/pull/9324

From mdoerr at openjdk.org Thu Jul 7 10:26:40 2022
From: mdoerr at openjdk.org (Martin Doerr)
Date: Thu, 7 Jul 2022 10:26:40 GMT
Subject: RFR: 8289856: [PPC64] SIGSEGV in C2Compiler::init_c2_runtime() after JDK-8289060 [v3]
In-Reply-To: <4npz2hop627nZ4L41HlihXbwbxAOFwr9R4837X6nyYM=.bc6f8b7c-0bdc-4940-ae5d-9efd0fbaa427@github.com>
References: <4npz2hop627nZ4L41HlihXbwbxAOFwr9R4837X6nyYM=.bc6f8b7c-0bdc-4940-ae5d-9efd0fbaa427@github.com>
Message-ID:

On Thu, 7 Jul 2022 08:13:53 GMT, Martin Doerr wrote:

>> We're currently calling `nullptr->is_valid()` and `nullptr->value()` which causes SIGSEGV on PPC64 (and is undefined behavior). See JBS for details.
>
> Martin Doerr has updated the pull request incrementally with one additional commit since the last revision:
>
>   Revert whitespace change.

Thanks for the review!
------------- PR: https://git.openjdk.org/jdk/pull/9403 From mdoerr at openjdk.org Thu Jul 7 10:26:42 2022 From: mdoerr at openjdk.org (Martin Doerr) Date: Thu, 7 Jul 2022 10:26:42 GMT Subject: Integrated: 8289856: [PPC64] SIGSEGV in C2Compiler::init_c2_runtime() after JDK-8289060 In-Reply-To: References: Message-ID: On Wed, 6 Jul 2022 20:48:14 GMT, Martin Doerr wrote: > We're currently calling `nullptr->is_valid()` and `nullptr->value()` which causes SIGSEGV on PPC64 (and is undefined behavior). See JBS for details. This pull request has now been integrated. Changeset: e05b2f2c Author: Martin Doerr URL: https://git.openjdk.org/jdk/commit/e05b2f2c3b9b0276099766bc38a55ff835c989e1 Stats: 65 lines in 1 file changed: 0 ins; 0 del; 65 mod 8289856: [PPC64] SIGSEGV in C2Compiler::init_c2_runtime() after JDK-8289060 Reviewed-by: dlong, lucy ------------- PR: https://git.openjdk.org/jdk/pull/9403 From lucy at openjdk.org Thu Jul 7 09:38:45 2022 From: lucy at openjdk.org (Lutz Schmidt) Date: Thu, 7 Jul 2022 09:38:45 GMT Subject: RFR: 8289856: [PPC64] SIGSEGV in C2Compiler::init_c2_runtime() after JDK-8289060 [v3] In-Reply-To: <4npz2hop627nZ4L41HlihXbwbxAOFwr9R4837X6nyYM=.bc6f8b7c-0bdc-4940-ae5d-9efd0fbaa427@github.com> References: <4npz2hop627nZ4L41HlihXbwbxAOFwr9R4837X6nyYM=.bc6f8b7c-0bdc-4940-ae5d-9efd0fbaa427@github.com> Message-ID: On Thu, 7 Jul 2022 08:13:53 GMT, Martin Doerr wrote: >> We're currently calling `nullptr->is_valid()` and `nullptr->value()` which causes SIGSEGV on PPC64 (and is undefined behavior). See JBS for details. > > Martin Doerr has updated the pull request incrementally with one additional commit since the last revision: > > Revert whitespace change. LGTM Thanks for fixing! ------------- Marked as reviewed by lucy (Reviewer). 
PR: https://git.openjdk.org/jdk/pull/9403 From kvn at openjdk.org Thu Jul 7 19:04:54 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Thu, 7 Jul 2022 19:04:54 GMT Subject: RFR: 8289186: Support predicated vector load/store operations over X86 AVX2 targets. [v4] In-Reply-To: References: Message-ID: On Wed, 6 Jul 2022 13:24:27 GMT, Jatin Bhateja wrote: >> Hi All, >> >> [JDK-8283667](https://bugs.openjdk.org/browse/JDK-8283667) added the support to handle masked loads on non-predicated targets by blending the loaded contents with zero vector iff unmasked portion of load does not span beyond array bounds. >> >> X86 AVX2 offers direct predicated vector loads/store instruction for non-sub word type. >> >> This patch adds the efficient backend implementation for predicated memory operations over int/long/float/double vectors. >> >> Please find below the JMH micro stats with and without patch. >> >> >> >> System : Intel(R) Xeon(R) Platinum 8280 CPU @ 2.70GHz [28C 2S Cascadelake Server] >> >> Baseline: >> Benchmark (inSize) (outSize) Mode Cnt Score Error Units >> LoadMaskedIOOBEBenchmark.byteLoadArrayMaskIOOBE 1026 1152 thrpt 2 712.218 ops/ms >> LoadMaskedIOOBEBenchmark.doubleLoadArrayMaskIOOBE 1026 1152 thrpt 2 156.912 ops/ms >> LoadMaskedIOOBEBenchmark.floatLoadArrayMaskIOOBE 1026 1152 thrpt 2 255.814 ops/ms >> LoadMaskedIOOBEBenchmark.intLoadArrayMaskIOOBE 1026 1152 thrpt 2 267.688 ops/ms >> LoadMaskedIOOBEBenchmark.longLoadArrayMaskIOOBE 1026 1152 thrpt 2 140.957 ops/ms >> LoadMaskedIOOBEBenchmark.shortLoadArrayMaskIOOBE 1026 1152 thrpt 2 474.009 ops/ms >> >> >> With Opt: >> Benchmark (inSize) (outSize) Mode Cnt Score Error Units >> LoadMaskedIOOBEBenchmark.byteLoadArrayMaskIOOBE 1026 1152 thrpt 2 742.781 ops/ms >> LoadMaskedIOOBEBenchmark.doubleLoadArrayMaskIOOBE 1026 1152 thrpt 2 1241.021 ops/ms >> LoadMaskedIOOBEBenchmark.floatLoadArrayMaskIOOBE 1026 1152 thrpt 2 2333.311 ops/ms >> LoadMaskedIOOBEBenchmark.intLoadArrayMaskIOOBE 1026 1152 thrpt 2 3258.754 ops/ms 
>> LoadMaskedIOOBEBenchmark.longLoadArrayMaskIOOBE        1026       1152  thrpt    2  1757.192         ops/ms
>> LoadMaskedIOOBEBenchmark.shortLoadArrayMaskIOOBE       1026       1152  thrpt    2   472.590         ops/ms
>>
>>
>> Predicated memory operation over sub-word type will be handled in a subsequent patch.
>>
>> Kindly review and share your feedback.
>>
>> Best Regards,
>> Jatin
>
> Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision:
>
>   8289186: jcheck failure

Good. Let me test it.

-------------

PR: https://git.openjdk.org/jdk/pull/9324

From kvn at openjdk.org Thu Jul 7 19:01:47 2022
From: kvn at openjdk.org (Vladimir Kozlov)
Date: Thu, 7 Jul 2022 19:01:47 GMT
Subject: RFR: 8289186: Support predicated vector load/store operations over X86 AVX2 targets. [v2]
In-Reply-To:
References:
Message-ID: <6SCt5nj7said61mowAYvSRsb8boj3p4T6xr3tok3QGQ=.4a4c1bdd-f0fa-4147-9056-47e882e868fa@github.com>

On Wed, 6 Jul 2022 13:13:08 GMT, Jatin Bhateja wrote:

>> src/jdk.incubator.vector/share/classes/jdk/incubator/vector/ByteVector.java line 3923:
>>
>>> 3921:     // End of low-level memory operations.
>>> 3922:
>>> 3923:     @ForceInline
>>
>> Why this change? Was it missing before or you found that based on testing?
>> What is criteria to add `@ForceInline`?
>
> Thanks for highlighting this. `checkMaskFromIndexSize` is used to test illegal memory access cases with out-of-range offsets, i.e. tail scenarios, so the profile-based invocation count will always be low for these calls. I saw improved performance on targeted micros, but I get your point that aggressive forced inlining on infrequently taken paths can have adverse performance side effects. That may overshadow some of the performance gains from masked load/store support on tail paths, but we still see a modest gain on the order of 2-3x, versus the original 10x gain, for non-sub-word types over baseline.
> > > Benchmark (inSize) (outSize) Mode Cnt Score Error Units > LoadMaskedIOOBEBenchmark.byteLoadArrayMaskIOOBE 1026 1152 thrpt 2 748.793 ops/ms > LoadMaskedIOOBEBenchmark.doubleLoadArrayMaskIOOBE 1026 1152 thrpt 2 381.655 ops/ms > LoadMaskedIOOBEBenchmark.floatLoadArrayMaskIOOBE 1026 1152 thrpt 2 741.809 ops/ms > LoadMaskedIOOBEBenchmark.intLoadArrayMaskIOOBE 1026 1152 thrpt 2 757.433 ops/ms > LoadMaskedIOOBEBenchmark.longLoadArrayMaskIOOBE 1026 1152 thrpt 2 386.450 ops/ms > LoadMaskedIOOBEBenchmark.shortLoadArrayMaskIOOBE 1026 1152 thrpt 2 471.260 ops/ms okay. ------------- PR: https://git.openjdk.org/jdk/pull/9324 From duke at openjdk.org Thu Jul 7 20:01:34 2022 From: duke at openjdk.org (Cesar Soares) Date: Thu, 7 Jul 2022 20:01:34 GMT Subject: RFR: 8289943: Simplify some object allocation merges Message-ID: Hi there, can I please get some feedback on this approach to simplify object allocation merges in order to promote Scalar Replacement of the objects involved in the merge? The basic idea for this [approach was discussed in this thread](https://mail.openjdk.java.net/pipermail/hotspot-compiler-dev/2022-April/055189.html) and it consists of: 1) Identify Phi nodes that merge object allocations and replace them with a new IR node called ReducedAllocationMergeNode (RAM node). 2) Scalar Replace the incoming allocations to the RAM node. 3) Scalar Replace the RAM node itself. There are a few conditions for doing the replacement of the Phi by a RAM node though - Although I plan to work on removing them in subsequent PRs: - ~~The original Phi node should be merging Allocate nodes in all inputs.~~ - The only supported users of the original Phi are AddP->Load, SafePoints/Traps, DecodeN. These are the critical parts of the implementation and I'd appreciate it very much if you could tell me if what I implemented isn't violating any C2 IR constraints: - The way I identify/use the memory edges that will be used to find the last stored values to the merged object fields. 
- The way I check if there is an incoming Allocate node to the original Phi node. - The way I check if there is no store to the merged objects after they are merged. Testing: - Linux. fastdebug -> hotspot_all, renaissance, dacapo ------------- Commit messages: - Lift requirement for all inputs to be Allocate. - fix formatting - merge fix - work - work - work - work - work - work - work - ... and 14 more: https://git.openjdk.org/jdk/compare/68c5957b...e1d506f3 Changes: https://git.openjdk.org/jdk/pull/9073/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=9073&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8289943 Stats: 1718 lines in 21 files changed: 1670 ins; 6 del; 42 mod Patch: https://git.openjdk.org/jdk/pull/9073.diff Fetch: git fetch https://git.openjdk.org/jdk pull/9073/head:pull/9073 PR: https://git.openjdk.org/jdk/pull/9073 From kvn at openjdk.org Thu Jul 7 18:57:33 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Thu, 7 Jul 2022 18:57:33 GMT Subject: RFR: 8288883: C2: assert(allow_address || t != T_ADDRESS) failed after JDK-8283091 In-Reply-To: References: Message-ID: On Wed, 6 Jul 2022 07:51:01 GMT, Fei Gao wrote: > Superword doesn't vectorize any nodes of non-primitive types and > thus sets `allow_address` false when calling type2aelembytes() in > SuperWord::data_size()[1]. Therefore, when we try to resolve the > data size for a node of T_ADDRESS type, the assertion in > type2aelembytes()[2] takes effect. > > We try to resolve the data sizes for node s and node t in the > SuperWord::adjust_alignment_for_type_conversion()[3] when type > conversion between different data sizes happens. The issue is, > when node s is a ConvI2L node and node t is an AddP node of > T_ADDRESS type, type2aelembytes() will assert. To fix it, we > should filter out all non-primitive nodes, like the patch does > in SuperWord::adjust_alignment_for_type_conversion(). Since > it's a failure in the mid-end, all superword available platforms > are affected. 
In my local test, this failure can be reproduced > on both x86 and aarch64. With this patch, the failure can be fixed. > > Apart from fixing the bug, the patch also adds necessary type check > and does some clean-up in SuperWord::longer_type_for_conversion() > and VectorCastNode::implemented(). > > [1]https://github.com/openjdk/jdk/blob/dddd4e7c81fccd82b0fd37ea4583ce1a8e175919/src/hotspot/share/opto/superword.cpp#L1417 > [2]https://github.com/openjdk/jdk/blob/b96ba19807845739b36274efb168dd048db819a3/src/hotspot/share/utilities/globalDefinitions.cpp#L326 > [3]https://github.com/openjdk/jdk/blob/dddd4e7c81fccd82b0fd37ea4583ce1a8e175919/src/hotspot/share/opto/superword.cpp#L1454 In which call to `adjust_alignment_for_type_conversion()` you got AddP node? Should we add checks there too? ------------- PR: https://git.openjdk.org/jdk/pull/9391 From duke at openjdk.org Thu Jul 7 20:01:34 2022 From: duke at openjdk.org (Cesar Soares) Date: Thu, 7 Jul 2022 20:01:34 GMT Subject: RFR: 8289943: Simplify some object allocation merges In-Reply-To: References: Message-ID: <03QTzUvnv4sJVwZm6nOX7AGX1lEaCftBo8phytf7sfk=.4c91027d-a69e-4028-bc85-e970ac2546d6@github.com> On Tue, 7 Jun 2022 23:24:02 GMT, Cesar Soares wrote: > Hi there, can I please get some feedback on this approach to simplify object allocation merges in order to promote Scalar Replacement of the objects involved in the merge? > > The basic idea for this [approach was discussed in this thread](https://mail.openjdk.java.net/pipermail/hotspot-compiler-dev/2022-April/055189.html) and it consists of: > 1) Identify Phi nodes that merge object allocations and replace them with a new IR node called ReducedAllocationMergeNode (RAM node). > 2) Scalar Replace the incoming allocations to the RAM node. > 3) Scalar Replace the RAM node itself. 
> > There are a few conditions for doing the replacement of the Phi by a RAM node though - Although I plan to work on removing them in subsequent PRs: > > - ~~The original Phi node should be merging Allocate nodes in all inputs.~~ > - The only supported users of the original Phi are AddP->Load, SafePoints/Traps, DecodeN. > > These are the critical parts of the implementation and I'd appreciate it very much if you could tell me if what I implemented isn't violating any C2 IR constraints: > > - The way I identify/use the memory edges that will be used to find the last stored values to the merged object fields. > - The way I check if there is an incoming Allocate node to the original Phi node. > - The way I check if there is no store to the merged objects after they are merged. > > Testing: > - Linux. fastdebug -> hotspot_all, renaissance, dacapo Hi there, can someone please take a look and let me know if this is going in a reasonable direction? @vnkozlov - is this more or less on the lines of what you were thinking? ------------- PR: https://git.openjdk.org/jdk/pull/9073 From kvn at openjdk.org Thu Jul 7 19:01:44 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Thu, 7 Jul 2022 19:01:44 GMT Subject: RFR: 8289186: Support predicated vector load/store operations over X86 AVX2 targets. [v4] In-Reply-To: References: Message-ID: On Wed, 6 Jul 2022 13:13:04 GMT, Jatin Bhateja wrote: >> Does this mean the `mask` input can be a normal `vect_type ` like the mask input of `VectorBlend` for `LoadVectorMasked/StoreVectorMasked` over X86 AVX2 systems? They do not depend on the predicated feature for the AVX2 systems, right? > > IR was specifically added for predicated targets (AVX512 and ARM's SVE), this patch is re-using this IR for non-predicated AVX2 target where mask could be a vector type. Got it. 
------------- PR: https://git.openjdk.org/jdk/pull/9324 From kvn at openjdk.org Thu Jul 7 19:12:49 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Thu, 7 Jul 2022 19:12:49 GMT Subject: RFR: 8289186: Support predicated vector load/store operations over X86 AVX2 targets. [v4] In-Reply-To: References: Message-ID: On Wed, 6 Jul 2022 13:24:27 GMT, Jatin Bhateja wrote: >> Hi All, >> >> [JDK-8283667](https://bugs.openjdk.org/browse/JDK-8283667) added the support to handle masked loads on non-predicated targets by blending the loaded contents with zero vector iff unmasked portion of load does not span beyond array bounds. >> >> X86 AVX2 offers direct predicated vector loads/store instruction for non-sub word type. >> >> This patch adds the efficient backend implementation for predicated memory operations over int/long/float/double vectors. >> >> Please find below the JMH micro stats with and without patch. >> >> >> >> System : Intel(R) Xeon(R) Platinum 8280 CPU @ 2.70GHz [28C 2S Cascadelake Server] >> >> Baseline: >> Benchmark (inSize) (outSize) Mode Cnt Score Error Units >> LoadMaskedIOOBEBenchmark.byteLoadArrayMaskIOOBE 1026 1152 thrpt 2 712.218 ops/ms >> LoadMaskedIOOBEBenchmark.doubleLoadArrayMaskIOOBE 1026 1152 thrpt 2 156.912 ops/ms >> LoadMaskedIOOBEBenchmark.floatLoadArrayMaskIOOBE 1026 1152 thrpt 2 255.814 ops/ms >> LoadMaskedIOOBEBenchmark.intLoadArrayMaskIOOBE 1026 1152 thrpt 2 267.688 ops/ms >> LoadMaskedIOOBEBenchmark.longLoadArrayMaskIOOBE 1026 1152 thrpt 2 140.957 ops/ms >> LoadMaskedIOOBEBenchmark.shortLoadArrayMaskIOOBE 1026 1152 thrpt 2 474.009 ops/ms >> >> >> With Opt: >> Benchmark (inSize) (outSize) Mode Cnt Score Error Units >> LoadMaskedIOOBEBenchmark.byteLoadArrayMaskIOOBE 1026 1152 thrpt 2 742.781 ops/ms >> LoadMaskedIOOBEBenchmark.doubleLoadArrayMaskIOOBE 1026 1152 thrpt 2 1241.021 ops/ms >> LoadMaskedIOOBEBenchmark.floatLoadArrayMaskIOOBE 1026 1152 thrpt 2 2333.311 ops/ms >> LoadMaskedIOOBEBenchmark.intLoadArrayMaskIOOBE 1026 1152 thrpt 2 
3258.754 ops/ms >> LoadMaskedIOOBEBenchmark.longLoadArrayMaskIOOBE 1026 1152 thrpt 2 1757.192 ops/ms >> LoadMaskedIOOBEBenchmark.shortLoadArrayMaskIOOBE 1026 1152 thrpt 2 472.590 ops/ms >> >> >> Predicated memory operation over sub-word type will be handled in a subsequent patch. >> >> Kindly review and share your feedback. >> >> Best Regards, >> Jatin > > Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: > > 8289186: jcheck failure @jatin-bhateja, please update it to latest sources. I have conflict while applying patch. ------------- PR: https://git.openjdk.org/jdk/pull/9324 From duke at openjdk.org Fri Jul 8 03:10:59 2022 From: duke at openjdk.org (duke) Date: Fri, 8 Jul 2022 03:10:59 GMT Subject: Withdrawn: 8283232: x86: Improve vector broadcast operations In-Reply-To: References: Message-ID: <3kVB6o7RASf4cTtldfYCrl8g2zufvlljCyLYqhbT-Yg=.53fd3656-9e2e-4bf5-b6a2-b28477c6e0e8@github.com> On Wed, 16 Mar 2022 01:19:24 GMT, Quan Anh Mai wrote: > Hi, > > This patch improves the generation of broadcasting a scalar in several ways: > > - Avoid potential data bypass delay which can be observed on some platforms by using the correct type of instruction if it does not require extra instructions. > - As it has been pointed out, dumping the whole vector into the constant table is costly in terms of code size, this patch minimises this overhead for vector replicate of constants. Also, options are available for constants to be generated with more alignment so that vector load can be made efficiently without crossing cache lines. > - Vector broadcasting should prefer rematerialising to spilling when register pressure is high. > > This patch also removes some redundant code paths and rename some incorrectly named instructions. > > Thank you very much. This pull request has been closed without being integrated. 
------------- PR: https://git.openjdk.org/jdk/pull/7832 From fgao at openjdk.org Fri Jul 8 01:48:41 2022 From: fgao at openjdk.org (Fei Gao) Date: Fri, 8 Jul 2022 01:48:41 GMT Subject: RFR: 8288883: C2: assert(allow_address || t != T_ADDRESS) failed after JDK-8283091 In-Reply-To: References: Message-ID: On Thu, 7 Jul 2022 18:54:06 GMT, Vladimir Kozlov wrote: > In which call to `adjust_alignment_for_type_conversion()` you got AddP node? Should we add checks there too? Thanks for your review, @vnkozlov . When we called `adjust_alignment_for_type_conversion()` in `SuperWord::follow_def_uses()`, https://github.com/openjdk/jdk/blob/3f1174aa4709aabcfde8b40deec88b8ed466cc06/src/hotspot/share/opto/superword.cpp#L1525, we got AddP node. In this function, we also call `stmts_can_pack()` on the next line, which has checks to prevent unwanted pairs, https://github.com/openjdk/jdk/blob/3f1174aa4709aabcfde8b40deec88b8ed466cc06/src/hotspot/share/opto/superword.cpp#L1202. Maybe we don't have to add one more. WDYT? ------------- PR: https://git.openjdk.org/jdk/pull/9391 From duke at openjdk.org Fri Jul 8 00:15:20 2022 From: duke at openjdk.org (Yi-Fan Tsai) Date: Fri, 8 Jul 2022 00:15:20 GMT Subject: RFR: 8263377: Store method handle linkers in the 'non-nmethods' heap [v4] In-Reply-To: References: Message-ID: > 8263377: Store method handle linkers in the 'non-nmethods' heap Yi-Fan Tsai has updated the pull request with a new target base due to a merge or a rebase. 
The pull request now contains 18 commits:

 - Merge branch 'master' of https://github.com/yftsai/jdk into intrinsics
 - Post dynamic_code_generate event when MH intrinsic generated
 - Remove dead codes

   remove unused argument of NativeJump::check_verified_entry_alignment
   remove unused argument of NativeJump::patch_verified_entry
   remove dead codes in SharedRuntime::generate_method_handle_intrinsic_wrapper
 - Add PrintCodeCache support
 - Merge branch 'master' of https://github.com/yftsai/jdk into intrinsics
 - Merge branch 'master' of https://github.com/yftsai/jdk into intrinsics
 - Move to RuntimeBlob
 - Merge branch 'master' of https://github.com/yftsai/jdk into intrinsics
 - Move MHI to BufferBlob
 - Change _code to CodeBlob
 - ... and 8 more: https://git.openjdk.org/jdk/compare/35156041...d92b8647

-------------

Changes: https://git.openjdk.org/jdk/pull/8760/files
 Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=8760&range=03
  Stats: 588 lines in 58 files changed: 279 ins; 176 del; 133 mod
  Patch: https://git.openjdk.org/jdk/pull/8760.diff
  Fetch: git fetch https://git.openjdk.org/jdk pull/8760/head:pull/8760

PR: https://git.openjdk.org/jdk/pull/8760

From jbhateja at openjdk.org Fri Jul 8 08:38:36 2022
From: jbhateja at openjdk.org (Jatin Bhateja)
Date: Fri, 8 Jul 2022 08:38:36 GMT
Subject: RFR: 8289186: Support predicated vector load/store operations over X86 AVX2 targets. [v5]
In-Reply-To:
References:
Message-ID:

> Hi All,
>
> [JDK-8283667](https://bugs.openjdk.org/browse/JDK-8283667) added the support to handle masked loads on non-predicated targets by blending the loaded contents with zero vector iff unmasked portion of load does not span beyond array bounds.
>
> X86 AVX2 offers direct predicated vector loads/store instruction for non-sub word type.
>
> This patch adds the efficient backend implementation for predicated memory operations over int/long/float/double vectors.
>
> Please find below the JMH micro stats with and without patch.
>
>
>
> System : Intel(R) Xeon(R) Platinum 8280 CPU @ 2.70GHz [28C 2S Cascadelake Server]
>
> Baseline:
> Benchmark                                          (inSize)  (outSize)   Mode  Cnt     Score  Error   Units
> LoadMaskedIOOBEBenchmark.byteLoadArrayMaskIOOBE        1026       1152  thrpt    2   712.218         ops/ms
> LoadMaskedIOOBEBenchmark.doubleLoadArrayMaskIOOBE      1026       1152  thrpt    2   156.912         ops/ms
> LoadMaskedIOOBEBenchmark.floatLoadArrayMaskIOOBE       1026       1152  thrpt    2   255.814         ops/ms
> LoadMaskedIOOBEBenchmark.intLoadArrayMaskIOOBE         1026       1152  thrpt    2   267.688         ops/ms
> LoadMaskedIOOBEBenchmark.longLoadArrayMaskIOOBE        1026       1152  thrpt    2   140.957         ops/ms
> LoadMaskedIOOBEBenchmark.shortLoadArrayMaskIOOBE       1026       1152  thrpt    2   474.009         ops/ms
>
>
> With Opt:
> Benchmark                                          (inSize)  (outSize)   Mode  Cnt     Score  Error   Units
> LoadMaskedIOOBEBenchmark.byteLoadArrayMaskIOOBE        1026       1152  thrpt    2   742.781         ops/ms
> LoadMaskedIOOBEBenchmark.doubleLoadArrayMaskIOOBE      1026       1152  thrpt    2  1241.021         ops/ms
> LoadMaskedIOOBEBenchmark.floatLoadArrayMaskIOOBE       1026       1152  thrpt    2  2333.311         ops/ms
> LoadMaskedIOOBEBenchmark.intLoadArrayMaskIOOBE         1026       1152  thrpt    2  3258.754         ops/ms
> LoadMaskedIOOBEBenchmark.longLoadArrayMaskIOOBE        1026       1152  thrpt    2  1757.192         ops/ms
> LoadMaskedIOOBEBenchmark.shortLoadArrayMaskIOOBE       1026       1152  thrpt    2   472.590         ops/ms
>
>
> Predicated memory operation over sub-word type will be handled in a subsequent patch.
>
> Kindly review and share your feedback.
>
> Best Regards,
> Jatin

Jatin Bhateja has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains five commits:

 - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8289186
 - 8289186: jcheck failure
 - 8289186: Review comments resolved.
 - 8289186: Review comments resolved.
 - 8289186: Support predicated vector load/store operations over X86 AVX2 targets.
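
A minimal sketch of the two load strategies discussed in this thread, written as plain scalar Java rather than Vector API or HotSpot code. The class and method names (`MaskedLoadSketch`, `maskedLoad`, `loadThenBlend`) are hypothetical and only model the semantics, not the patch itself: a predicated load reads active lanes only, while the load-and-blend approach described for JDK-8283667 reads the full vector and is therefore usable only when the whole vector lies within the array bounds.

```java
import java.util.Arrays;

public class MaskedLoadSketch {

    // Scalar model of a predicated (masked) load: inactive lanes are never
    // read from memory and simply yield zero.
    static int[] maskedLoad(int[] src, int offset, boolean[] mask) {
        int[] lanes = new int[mask.length];
        for (int i = 0; i < mask.length; i++) {
            if (mask[i]) {
                // Throws ArrayIndexOutOfBoundsException only if an ACTIVE lane is out of range.
                lanes[i] = src[offset + i];
            }
        }
        return lanes;
    }

    // Scalar model of "full load, then blend with a zero vector". All lanes
    // are read, so the whole vector must be in bounds, whatever the mask says.
    static int[] loadThenBlend(int[] src, int offset, boolean[] mask) {
        if (offset < 0 || offset + mask.length > src.length) {
            throw new ArrayIndexOutOfBoundsException("full vector load would leave the array");
        }
        int[] lanes = Arrays.copyOfRange(src, offset, offset + mask.length);
        for (int i = 0; i < mask.length; i++) {
            if (!mask[i]) {
                lanes[i] = 0; // blend: inactive lanes take the zero vector's value
            }
        }
        return lanes;
    }

    public static void main(String[] args) {
        int[] src = {10, 20, 30, 40, 50, 60};
        boolean[] tailMask = {true, true, false, false};

        // Tail case: only two elements remain, yet the vector has four lanes.
        System.out.println(Arrays.toString(maskedLoad(src, 4, tailMask))); // prints [50, 60, 0, 0]

        // In-bounds case: both strategies agree.
        System.out.println(Arrays.toString(loadThenBlend(src, 2, tailMask))); // prints [30, 40, 0, 0]
    }
}
```

With a six-element array and offset 4, only the first two lanes are in bounds, so `loadThenBlend` cannot be used there, whereas `maskedLoad` succeeds because the inactive lanes are never read. That tail case is the one the `LoadMaskedIOOBEBenchmark` numbers above exercise.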
-------------

Changes: https://git.openjdk.org/jdk/pull/9324/files
 Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=9324&range=04
  Stats: 330 lines in 8 files changed: 290 ins; 39 del; 1 mod
  Patch: https://git.openjdk.org/jdk/pull/9324.diff
  Fetch: git fetch https://git.openjdk.org/jdk pull/9324/head:pull/9324

PR: https://git.openjdk.org/jdk/pull/9324

From epeter at openjdk.org Fri Jul 8 14:44:38 2022
From: epeter at openjdk.org (Emanuel Peter)
Date: Fri, 8 Jul 2022 14:44:38 GMT
Subject: RFR: 8288897: Clean up node dump code [v3]
In-Reply-To: <6IwOmnPSo59TyFBjrN_xGvQCoOL-0z7KmGLJ0WD64jI=.77dec7b9-7cff-4c57-b748-5267908f4d37@github.com>
References: <6IwOmnPSo59TyFBjrN_xGvQCoOL-0z7KmGLJ0WD64jI=.77dec7b9-7cff-4c57-b748-5267908f4d37@github.com>
Message-ID:

On Mon, 27 Jun 2022 07:01:41 GMT, Christian Hagedorn wrote:

>> I did think about renaming it to `dump_...`. But then I also find it important that the name says that we do filter / search / find.
>
> The filter/search action is probably implied but I don't have a strong opinion about it - it's fine to leave the name like that. But I suggest to make it plural (`find_nodes_by_dump()`) as we are possibly returning multiple nodes.

I like the idea with the plural. But I will keep the `find`.

-------------

PR: https://git.openjdk.org/jdk/pull/9234

From rrich at openjdk.org Fri Jul 8 13:38:05 2022
From: rrich at openjdk.org (Richard Reingruber)
Date: Fri, 8 Jul 2022 13:38:05 GMT
Subject: RFR: 8289925 Shared code shouldn't reference the platform specific method frame::interpreter_frame_last_sp()
Message-ID:

This removes the reference to the platform specific method `frame::interpreter_frame_last_sp()` from the shared method `Continuation::continuation_bottom_sender()`.

The change simply removes the special case for interpreted frames as I cannot see a reason for the distinction between interpreted and compiled frames.

Testing: hotspot_loom and jdk_loom on x86_64 and aarch64.
-------------

Commit messages:
 - Remove platform dependent method interpreter_frame_last_sp() from shared code

Changes: https://git.openjdk.org/jdk/pull/9411/files
 Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=9411&range=00
  Issue: https://bugs.openjdk.org/browse/JDK-8289925
  Stats: 2 lines in 1 file changed: 0 ins; 1 del; 1 mod
  Patch: https://git.openjdk.org/jdk/pull/9411.diff
  Fetch: git fetch https://git.openjdk.org/jdk pull/9411/head:pull/9411

PR: https://git.openjdk.org/jdk/pull/9411

From epeter at openjdk.org Fri Jul 8 14:39:32 2022
From: epeter at openjdk.org (Emanuel Peter)
Date: Fri, 8 Jul 2022 14:39:32 GMT
Subject: RFR: 8288897: Clean up node dump code [v3]
In-Reply-To:
References:
Message-ID:

> I recently did some work in the area of `Node::dump` and `Node::find`, see [JDK-8287647](https://bugs.openjdk.org/browse/JDK-8287647) and [JDK-8283775](https://bugs.openjdk.org/browse/JDK-8283775).
>
> This changeset cleans up the surrounding code and tries to reduce code duplication.
>
> Things I did:
> - remove Node::related. It was added 7 years ago, with [JDK-8004073](https://bugs.openjdk.org/browse/JDK-8004073). However, it was not extended to many nodes, and hence it is incomplete, and nobody I know seems to use it.
> - refactor `dump(int)` to use `dump_bfs` (reduce code duplication).
> - redefine categories in `dump_bfs`, focusing on output types. Mixed type is now also control if it has control output, and memory if it has memory output, etc. Plus, a node is also in the control category if it `is_CFG`. This makes `dump_bfs` much more usable, to traverse control and memory flow.
> - Other small cleanups, like replacing rarely used dump functions with dump, removing dead code, making some functions private
> - Adding `call from debugger` comment to VM functions that are useful in debugger

Emanuel Peter has updated the pull request with a new target base due to a merge or a rebase.
The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains six additional commits since the last revision: - Merge branch 'master' into JDK-8288897 - Apply suggestions from code review 2 style fixes by Christian Co-authored-by: Christian Hagedorn - cleanup, move debug functions to cpp to prevent inlining, add comment for debugger functions - make dump_bfs const, change datastructures, change some signatures to const - refactor dump to use dump_bfs, redefine categories through output types - 8288897: Clean up dump code for nodes ------------- Changes: - all: https://git.openjdk.org/jdk/pull/9234/files - new: https://git.openjdk.org/jdk/pull/9234/files/1a836616..b2a2f58f Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=9234&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=9234&range=01-02 Stats: 71737 lines in 1313 files changed: 42386 ins; 13502 del; 15849 mod Patch: https://git.openjdk.org/jdk/pull/9234.diff Fetch: git fetch https://git.openjdk.org/jdk pull/9234/head:pull/9234 PR: https://git.openjdk.org/jdk/pull/9234 From epeter at openjdk.org Fri Jul 8 15:47:18 2022 From: epeter at openjdk.org (Emanuel Peter) Date: Fri, 8 Jul 2022 15:47:18 GMT Subject: RFR: 8288897: Clean up node dump code [v4] In-Reply-To: References: Message-ID: On Fri, 24 Jun 2022 09:20:43 GMT, Christian Hagedorn wrote: >> Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: >> >> implementing Christians review suggestions > > src/hotspot/share/opto/node.cpp line 2205: > >> 2203: } >> 2204: >> 2205: bool PrintBFS::filter_category(const Node* n, Filter& filter) { > > Maybe you could add a method comment that you are not filtering on the category for `Mixed` but actually look at the outputs of it and also consider `is_CFG()`. 
done

> src/hotspot/share/opto/node.cpp line 2220:
>
>> 2218:   }
>> 2219:   if (filter._other && t->has_category(Type::Category::Other)) {
>> 2220:     return true;
>
> Just a suggestion: To make it clear that you are only special casing `Mixed` you could leave the `switch` statement and only do the additional checks for `Mixed`. Since this category check is specific to the filtering of `dump_bfs()` and not something you normally perform on a type, I suggest to move this function to the `Filter` class (if that's possible). This would also require to change the implementation of `has_category()` - if it's too complicated, just leave it as it is. It's fine like that.

I now made a single if statement out of it, and moved it to `Filter`. It was not evident to me how to use `switch`, because both the filter and the node can have multiple categories.

-------------

PR: https://git.openjdk.org/jdk/pull/9234

From epeter at openjdk.org Fri Jul 8 15:49:23 2022
From: epeter at openjdk.org (Emanuel Peter)
Date: Fri, 8 Jul 2022 15:49:23 GMT
Subject: RFR: 8288897: Clean up node dump code [v4]
In-Reply-To:
References:
Message-ID:

On Fri, 24 Jun 2022 09:59:55 GMT, Christian Hagedorn wrote:

>> Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision:
>>
>>   implementing Christians review suggestions
>
> Otherwise, nice cleanup! I think it's the right thing to remove unused and unmaintained `dump` methods and reduce code duplication.
>
> Have you checked that the printed node order with `dump(X)` is the same as before? I'm not sure if that is a strong requirement. I'm just thinking about `PrintIdeal` with which we do:
> https://github.com/openjdk/jdk/blob/17aacde50fb971bc686825772e29f6bfecadabda/src/hotspot/share/opto/compile.cpp#L554
>
> Some tools/scripts might depend on the previous order of `dump(X)`. But I'm currently not aware of any such order-dependent processing.
For the IR framework, the node order does not matter and if I see that correctly, the dump of an individual node is the same as before. So, it should be fine. @chhagedorn > Have you checked that the printed node order with `dump(X)` is the same as before? I'm not sure if that is a strong requirement. I did try to make sure that the output of `dump` stays equivalent. As far as I could tell from manual inspection, it does. The visit order is the same, and the same nodes are dumped. ------------- PR: https://git.openjdk.org/jdk/pull/9234 From epeter at openjdk.org Fri Jul 8 15:47:17 2022 From: epeter at openjdk.org (Emanuel Peter) Date: Fri, 8 Jul 2022 15:47:17 GMT Subject: RFR: 8288897: Clean up node dump code [v4] In-Reply-To: References: Message-ID: > I recently did some work in the area of `Node::dump` and `Node::find`, see [JDK-8287647](https://bugs.openjdk.org/browse/JDK-8287647) and [JDK-8283775](https://bugs.openjdk.org/browse/JDK-8283775). > > This change set cleans up the surrounding code and tries to reduce code duplication. > > Things I did: > - remove `Node::related`. It was added 7 years ago, with [JDK-8004073](https://bugs.openjdk.org/browse/JDK-8004073). However, it was not extended to many nodes, and hence it is incomplete, and nobody I know seems to use it. > - refactor `dump(int)` to use `dump_bfs` (reduce code duplication). > - redefine categories in `dump_bfs`, focusing on output types. Mixed type is now also control if it has control output, and memory if it has memory output, etc. Plus, a node is also in the control category if it `is_CFG`. This makes `dump_bfs` much more usable, to traverse control and memory flow.
> - Other small cleanups, like replacing rarely used dump functions with `dump`, removing dead code, and making some functions private > - Adding `call from debugger` comment to VM functions that are useful in the debugger Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: implementing Christians review suggestions ------------- Changes: - all: https://git.openjdk.org/jdk/pull/9234/files - new: https://git.openjdk.org/jdk/pull/9234/files/b2a2f58f..a95b1260 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=9234&range=03 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=9234&range=02-03 Stats: 45 lines in 1 file changed: 10 ins; 22 del; 13 mod Patch: https://git.openjdk.org/jdk/pull/9234.diff Fetch: git fetch https://git.openjdk.org/jdk pull/9234/head:pull/9234 PR: https://git.openjdk.org/jdk/pull/9234 From duke at openjdk.org Fri Jul 8 18:56:48 2022 From: duke at openjdk.org (duke) Date: Fri, 8 Jul 2022 18:56:48 GMT Subject: Withdrawn: 8282875: AArch64: [vectorapi] Optimize Vector.reduceLane for SVE 64/128 vector size In-Reply-To: <6KkmSzP3na8JUnMFh3Yfzm6yuzk1EWr1LORhDLDPxRM=.c4bf6878-3964-4b9e-9225-f0fe3da1fc16@github.com> References: <6KkmSzP3na8JUnMFh3Yfzm6yuzk1EWr1LORhDLDPxRM=.c4bf6878-3964-4b9e-9225-f0fe3da1fc16@github.com> Message-ID: On Mon, 28 Mar 2022 16:39:08 GMT, Eric Liu wrote: > This patch speeds up add/mul/min/max reductions for SVE for 64/128 > vector size. > > According to Neoverse N2/V1 software optimization guide[1][2], for > 128-bit vector size reduction operations, we prefer using NEON > instructions instead of SVE instructions. This patch adds some rules to > distinguish 64/128 bits vector size with others, so that for these two > special cases, they can generate code the same as NEON.
E.g., For > ByteVector.SPECIES_128, "ByteVector.reduceLanes(VectorOperators.ADD)" > generates code as below: > > > Before: > uaddv d17, p0, z16.b > smov x15, v17.b[0] > add w15, w14, w15, sxtb > > After: > addv b17, v16.16b > smov x12, v17.b[0] > add w12, w12, w16, sxtb > > No multiply reduction instruction in SVE, this patch generates code for > MulReductionVL by using scalar insnstructions for 128-bit vector size. > > With this patch, all of them have performance gain for specific vector > micro benchmarks in my SVE testing system. > > [1] https://developer.arm.com/documentation/pjdoc466751330-9685/latest/ > [2] https://developer.arm.com/documentation/PJDOC-466751330-18256/0001 > > Change-Id: I4bef0b3eb6ad1bac582e4236aef19787ccbd9b1c This pull request has been closed without being integrated. ------------- PR: https://git.openjdk.org/jdk/pull/7999 From kvn at openjdk.org Fri Jul 8 20:14:42 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Fri, 8 Jul 2022 20:14:42 GMT Subject: RFR: 8289186: Support predicated vector load/store operations over X86 AVX2 targets. [v5] In-Reply-To: References: Message-ID: <2s4pqMPNlfWxBbAlMIEN6XrM9GXJpjY8LOvPPO4OFBk=.06eb1a6e-151e-43a8-8c1f-3b16bd67a92b@github.com> On Fri, 8 Jul 2022 08:38:36 GMT, Jatin Bhateja wrote: >> Hi All, >> >> [JDK-8283667](https://bugs.openjdk.org/browse/JDK-8283667) added the support to handle masked loads on non-predicated targets by blending the loaded contents with zero vector iff unmasked portion of load does not span beyond array bounds. >> >> X86 AVX2 offers direct predicated vector loads/store instruction for non-sub word type. >> >> This patch adds the efficient backend implementation for predicated memory operations over int/long/float/double vectors. >> >> Please find below the JMH micro stats with and without patch. 
>> >> >> >> System : Intel(R) Xeon(R) Platinum 8280 CPU @ 2.70GHz [28C 2S Cascadelake Server] >> >> Baseline: >> Benchmark (inSize) (outSize) Mode Cnt Score Error Units >> LoadMaskedIOOBEBenchmark.byteLoadArrayMaskIOOBE 1026 1152 thrpt 2 712.218 ops/ms >> LoadMaskedIOOBEBenchmark.doubleLoadArrayMaskIOOBE 1026 1152 thrpt 2 156.912 ops/ms >> LoadMaskedIOOBEBenchmark.floatLoadArrayMaskIOOBE 1026 1152 thrpt 2 255.814 ops/ms >> LoadMaskedIOOBEBenchmark.intLoadArrayMaskIOOBE 1026 1152 thrpt 2 267.688 ops/ms >> LoadMaskedIOOBEBenchmark.longLoadArrayMaskIOOBE 1026 1152 thrpt 2 140.957 ops/ms >> LoadMaskedIOOBEBenchmark.shortLoadArrayMaskIOOBE 1026 1152 thrpt 2 474.009 ops/ms >> >> >> With Opt: >> Benchmark (inSize) (outSize) Mode Cnt Score Error Units >> LoadMaskedIOOBEBenchmark.byteLoadArrayMaskIOOBE 1026 1152 thrpt 2 742.781 ops/ms >> LoadMaskedIOOBEBenchmark.doubleLoadArrayMaskIOOBE 1026 1152 thrpt 2 1241.021 ops/ms >> LoadMaskedIOOBEBenchmark.floatLoadArrayMaskIOOBE 1026 1152 thrpt 2 2333.311 ops/ms >> LoadMaskedIOOBEBenchmark.intLoadArrayMaskIOOBE 1026 1152 thrpt 2 3258.754 ops/ms >> LoadMaskedIOOBEBenchmark.longLoadArrayMaskIOOBE 1026 1152 thrpt 2 1757.192 ops/ms >> LoadMaskedIOOBEBenchmark.shortLoadArrayMaskIOOBE 1026 1152 thrpt 2 472.590 ops/ms >> >> >> Predicated memory operation over sub-word type will be handled in a subsequent patch. >> >> Kindly review and share your feedback. >> >> Best Regards, >> Jatin > > Jatin Bhateja has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains five commits: > > - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8289186 > - 8289186: jcheck failure > - 8289186: Review comments resolved. > - 8289186: Review comments resolved. > - 8289186: Support predicated vector load/store operations over X86 AVX2 targets. I started testing. 
------------- PR: https://git.openjdk.org/jdk/pull/9324 From kvn at openjdk.org Fri Jul 8 20:35:37 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Fri, 8 Jul 2022 20:35:37 GMT Subject: RFR: 8288883: C2: assert(allow_address || t != T_ADDRESS) failed after JDK-8283091 In-Reply-To: References: Message-ID: <6KltnZFZDqf5kuIjF_t0ns-DVjqSaLj8kFfBxS6rwt0=.aa0372ae-3c0b-4b00-b005-5431a113c9f4@github.com> On Fri, 8 Jul 2022 01:43:11 GMT, Fei Gao wrote: >> In which call to `adjust_alignment_for_type_conversion()` you got AddP node? >> Should we add checks there too? > >> In which call to `adjust_alignment_for_type_conversion()` you got AddP node? Should we add checks there too? > > Thanks for your review, @vnkozlov . > > When we called `adjust_alignment_for_type_conversion()` in `SuperWord::follow_def_uses()`, https://github.com/openjdk/jdk/blob/3f1174aa4709aabcfde8b40deec88b8ed466cc06/src/hotspot/share/opto/superword.cpp#L1525, we got AddP node. In this function, we also call `stmts_can_pack()` on the next line, which has checks to prevent unwanted pairs, https://github.com/openjdk/jdk/blob/3f1174aa4709aabcfde8b40deec88b8ed466cc06/src/hotspot/share/opto/superword.cpp#L1202. Maybe we don't have to add one more. WDYT? @fg1417 `stmts_can_pack()` is also called in another place, where it is preceded by an `are_adjacent_refs()` call that has its own (different) primitive type check. I was thinking of converting the checks in `stmts_can_pack()` to `assert`s. But, on the other hand, `is_java_primitive(bt)` is cheap and I would prefer to keep the checks in `stmts_can_pack()` as they are in case we call it from another place. Anyway, after looking at the code I agree with your current changes. Let me test it. You also need a second review.
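For readers following the thread, the cheap guard discussed above can be sketched in Java as follows. This is only an illustration under stated assumptions: the `BasicType` enum, `isJavaPrimitive`, `elementBytes` and `sizesDiffer` are hypothetical stand-ins and are not HotSpot's C++ API. The point is that the primitive-type filter runs before any element size is requested, so an address-typed node never reaches the size lookup.

```java
/** Illustrative sketch only; all names here are hypothetical, not HotSpot code. */
final class PackGuardSketch {
    /** Hypothetical, reduced stand-in for a basic-type enumeration. */
    enum BasicType { T_BYTE, T_SHORT, T_INT, T_LONG, T_FLOAT, T_DOUBLE, T_ADDRESS }

    /** Cheap filter: in this sketch everything except T_ADDRESS counts as primitive. */
    static boolean isJavaPrimitive(BasicType bt) {
        return bt != BasicType.T_ADDRESS;
    }

    /** Element size in bytes; rejects non-primitive types, like an assertion would. */
    static int elementBytes(BasicType bt) {
        switch (bt) {
            case T_BYTE:   return 1;
            case T_SHORT:  return 2;
            case T_INT:
            case T_FLOAT:  return 4;
            case T_LONG:
            case T_DOUBLE: return 8;
            default:
                throw new IllegalArgumentException("no element size for " + bt);
        }
    }

    /** Filters non-primitive types first, then compares element sizes. */
    static boolean sizesDiffer(BasicType s, BasicType t) {
        if (!isJavaPrimitive(s) || !isJavaPrimitive(t)) {
            return false; // nothing to adjust; the size lookup is never reached
        }
        return elementBytes(s) != elementBytes(t);
    }

    public static void main(String[] args) {
        System.out.println(sizesDiffer(BasicType.T_INT, BasicType.T_LONG));
        System.out.println(sizesDiffer(BasicType.T_LONG, BasicType.T_ADDRESS));
    }
}
```

Keeping the filter inside the helper, rather than relying on every caller to check first, mirrors the preference stated above for leaving the checks in place in case the function is called from somewhere else.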
------------- PR: https://git.openjdk.org/jdk/pull/9391 From jbhateja at openjdk.org Fri Jul 8 22:17:29 2022 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Fri, 8 Jul 2022 22:17:29 GMT Subject: [jdk19] RFR: 8288112: C2: Error: ShouldNotReachHere() in Type::typerr() Message-ID: [JDK-8284960](https://bugs.openjdk.org/browse/JDK-8284960) added new vector IR nodes and target-specific backend support for Reverse Byte vector operations. Auto-vectorization analysis based on vector IR opcode existence and target backend implementation for existing Java SE APIs [Short/Character/Integer/Long].reverseBytes passes since both vector IR and backend support already exist for these operations. This bug fix patch handles the missing scalar reverse byte IR cases in the SLP optimizer to enable creation of the corresponding vector IR nodes. A new JBS issue [JDK-8290034](https://bugs.openjdk.org/browse/JDK-8290034) has been created to add the missing auto-vectorization support for the bit reverse operation targeting JDK mainline. Kindly review and share your feedback.
Best Regards, Jatin ------------- Commit messages: - 8288112: C2: Error: ShouldNotReachHere() in Type::typerr() Changes: https://git.openjdk.org/jdk19/pull/128/files Webrev: https://webrevs.openjdk.org/?repo=jdk19&pr=128&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8288112 Stats: 213 lines in 9 files changed: 206 ins; 0 del; 7 mod Patch: https://git.openjdk.org/jdk19/pull/128.diff Fetch: git fetch https://git.openjdk.org/jdk19 pull/128/head:pull/128 PR: https://git.openjdk.org/jdk19/pull/128 From dlong at openjdk.org Fri Jul 8 23:27:48 2022 From: dlong at openjdk.org (Dean Long) Date: Fri, 8 Jul 2022 23:27:48 GMT Subject: [jdk19] RFR: 8288112: C2: Error: ShouldNotReachHere() in Type::typerr() In-Reply-To: References: Message-ID: On Fri, 8 Jul 2022 21:57:33 GMT, Jatin Bhateja wrote: > [JDK-8284960](https://bugs.openjdk.org/browse/JDK-8284960) added new vector IR nodes and target-specific backend support for Reverse Byte vector operations. > Auto-vectorization analysis based on vector IR opcode existence and target backend implementation for existing Java SE APIs [Short/Character/Integer/Long].reverseBytes passes since both vector IR and backend support already exist for these operations. This bug fix patch handles the missing scalar reverse byte IR cases in the SLP optimizer to enable creation of the corresponding vector IR nodes. > > A new JBS issue [JDK-8290034](https://bugs.openjdk.org/browse/JDK-8290034) has been created to add the missing auto-vectorization support for the bit reverse operation targeting JDK mainline. > > Kindly review and share your feedback. > > Best Regards, > Jatin Looks good to me, but I still have a problem with the error message. It seems like we could give a better error message if we detected the missing vectorization support earlier. What do you think?
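As a point of reference for the discussion above, the kind of scalar loop that the auto-vectorizer would be looking at can be sketched as below. This is a minimal illustration, not the benchmark or test from the patch: the class and method names are made up, and whether such a loop is actually vectorized depends on the JVM build, flags and target CPU. Note that `Long.reverseBytes` returns `long` and `Short.reverseBytes` returns `short`, so the destination arrays use the matching element types.

```java
/** Illustrative sketch only; class and method names are hypothetical. */
final class ReverseBytesLoopSketch {
    /** Element-wise byte reversal over long values; a simple counted loop. */
    static void reverseLongs(long[] in, long[] out) {
        for (int i = 0; i < in.length; i++) {
            out[i] = Long.reverseBytes(in[i]);
        }
    }

    /** Same shape for a sub-word type. */
    static void reverseShorts(short[] in, short[] out) {
        for (int i = 0; i < in.length; i++) {
            out[i] = Short.reverseBytes(in[i]);
        }
    }

    public static void main(String[] args) {
        long[] longsIn = {0x0102030405060708L, 0L};
        long[] longsOut = new long[longsIn.length];
        reverseLongs(longsIn, longsOut);
        System.out.println(Long.toHexString(longsOut[0]));

        short[] shortsIn = {(short) 0x0102};
        short[] shortsOut = new short[shortsIn.length];
        reverseShorts(shortsIn, shortsOut);
        System.out.println(Integer.toHexString(shortsOut[0] & 0xFFFF));
    }
}
```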
------------- PR: https://git.openjdk.org/jdk19/pull/128 From kvn at openjdk.org Fri Jul 8 23:37:45 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Fri, 8 Jul 2022 23:37:45 GMT Subject: RFR: 8289943: Simplify some object allocation merges In-Reply-To: References: Message-ID: On Tue, 7 Jun 2022 23:24:02 GMT, Cesar Soares wrote: > Hi there, can I please get some feedback on this approach to simplify object allocation merges in order to promote Scalar Replacement of the objects involved in the merge? > > The basic idea for this [approach was discussed in this thread](https://mail.openjdk.java.net/pipermail/hotspot-compiler-dev/2022-April/055189.html) and it consists of: > 1) Identify Phi nodes that merge object allocations and replace them with a new IR node called ReducedAllocationMergeNode (RAM node). > 2) Scalar Replace the incoming allocations to the RAM node. > 3) Scalar Replace the RAM node itself. > > There are a few conditions for doing the replacement of the Phi by a RAM node though - Although I plan to work on removing them in subsequent PRs: > > - ~~The original Phi node should be merging Allocate nodes in all inputs.~~ > - The only supported users of the original Phi are AddP->Load, SafePoints/Traps, DecodeN. > > These are the critical parts of the implementation and I'd appreciate it very much if you could tell me if what I implemented isn't violating any C2 IR constraints: > > - The way I identify/use the memory edges that will be used to find the last stored values to the merged object fields. > - The way I check if there is an incoming Allocate node to the original Phi node. > - The way I check if there is no store to the merged objects after they are merged. > > Testing: > - Linux. fastdebug -> hotspot_all, renaissance, dacapo This is good starting point. To have new "Phi" type node to collect information about merged allocation. I need more time to dive into changes to give review. 
So far I have found one issue we need to discuss: merging allocations of different subclasses (of the same parent class), which may have different numbers of fields. The current implementation assumes the objects are the same, but I don't see a check for it during RAM node creation. Maybe we should have one in this initial implementation. What about the `adjust_scalar_replaceable_state()` code, which marks allocations as non-SR if they are merged? About input memory slices: since the merged allocations are SR, we should have some new memory Phi created in EA `split_memory_phi()`, which we can try to identify instead of adding all memory slices we find (I am talking about the RAM constructor). I see you bail out of the compilation to recompile in case you can't remove the RAM node. I think that is fine for an initial implementation. ------------- PR: https://git.openjdk.org/jdk/pull/9073 From kvn at openjdk.org Fri Jul 8 23:54:42 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Fri, 8 Jul 2022 23:54:42 GMT Subject: [jdk19] RFR: 8288112: C2: Error: ShouldNotReachHere() in Type::typerr() In-Reply-To: References: Message-ID: <66n1I1OY5osM-whrohpbSL9iZebbOwIWUC6shwbNybE=.c053c657-9de4-4118-a876-d741c5850d2c@github.com> On Fri, 8 Jul 2022 21:57:33 GMT, Jatin Bhateja wrote: > [JDK-8284960](https://bugs.openjdk.org/browse/JDK-8284960) added new vector IR nodes and target-specific backend support for Reverse Byte vector operations. > Auto-vectorization analysis based on vector IR opcode existence and target backend implementation for existing Java SE APIs [Short/Character/Integer/Long].reverseBytes passes since both vector IR and backend support already exist for these operations. This bug fix patch handles the missing scalar reverse byte IR cases in the SLP optimizer to enable creation of the corresponding vector IR nodes. > > A new JBS issue [JDK-8290034](https://bugs.openjdk.org/browse/JDK-8290034) has been created to add the missing auto-vectorization support for the bit reverse operation targeting JDK mainline.
> > Kindly review and share your feedback. > > Best Regards, > Jatin Looks good to me. Thank you for fixing it. I will run testing. ------------- PR: https://git.openjdk.org/jdk19/pull/128 From kvn at openjdk.org Sat Jul 9 04:25:27 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Sat, 9 Jul 2022 04:25:27 GMT Subject: RFR: 8288883: C2: assert(allow_address || t != T_ADDRESS) failed after JDK-8283091 In-Reply-To: References: Message-ID: <4WPJvLoIioVl4o0Ro5YcBOg82izdiv1T0Re1nTGwOEo=.3d3bbe1d-fb5e-4908-be4f-fd5266c2d04a@github.com> On Wed, 6 Jul 2022 07:51:01 GMT, Fei Gao wrote: > Superword doesn't vectorize any nodes of non-primitive types and > thus sets `allow_address` false when calling type2aelembytes() in > SuperWord::data_size()[1]. Therefore, when we try to resolve the > data size for a node of T_ADDRESS type, the assertion in > type2aelembytes()[2] takes effect. > > We try to resolve the data sizes for node s and node t in the > SuperWord::adjust_alignment_for_type_conversion()[3] when type > conversion between different data sizes happens. The issue is, > when node s is a ConvI2L node and node t is an AddP node of > T_ADDRESS type, type2aelembytes() will assert. To fix it, we > should filter out all non-primitive nodes, like the patch does > in SuperWord::adjust_alignment_for_type_conversion(). Since > it's a failure in the mid-end, all superword available platforms > are affected. In my local test, this failure can be reproduced > on both x86 and aarch64. With this patch, the failure can be fixed. > > Apart from fixing the bug, the patch also adds necessary type check > and does some clean-up in SuperWord::longer_type_for_conversion() > and VectorCastNode::implemented(). 
> > [1]https://github.com/openjdk/jdk/blob/dddd4e7c81fccd82b0fd37ea4583ce1a8e175919/src/hotspot/share/opto/superword.cpp#L1417 > [2]https://github.com/openjdk/jdk/blob/b96ba19807845739b36274efb168dd048db819a3/src/hotspot/share/utilities/globalDefinitions.cpp#L326 > [3]https://github.com/openjdk/jdk/blob/dddd4e7c81fccd82b0fd37ea4583ce1a8e175919/src/hotspot/share/opto/superword.cpp#L1454 Testing passed. ------------- Marked as reviewed by kvn (Reviewer). PR: https://git.openjdk.org/jdk/pull/9391 From kvn at openjdk.org Sat Jul 9 04:31:43 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Sat, 9 Jul 2022 04:31:43 GMT Subject: RFR: 8289186: Support predicated vector load/store operations over X86 AVX2 targets. [v5] In-Reply-To: References: Message-ID: On Fri, 8 Jul 2022 08:38:36 GMT, Jatin Bhateja wrote: >> Hi All, >> >> [JDK-8283667](https://bugs.openjdk.org/browse/JDK-8283667) added the support to handle masked loads on non-predicated targets by blending the loaded contents with zero vector iff unmasked portion of load does not span beyond array bounds. >> >> X86 AVX2 offers direct predicated vector loads/store instruction for non-sub word type. >> >> This patch adds the efficient backend implementation for predicated memory operations over int/long/float/double vectors. >> >> Please find below the JMH micro stats with and without patch. 
>> >> >> >> System : Intel(R) Xeon(R) Platinum 8280 CPU @ 2.70GHz [28C 2S Cascadelake Server] >> >> Baseline: >> Benchmark (inSize) (outSize) Mode Cnt Score Error Units >> LoadMaskedIOOBEBenchmark.byteLoadArrayMaskIOOBE 1026 1152 thrpt 2 712.218 ops/ms >> LoadMaskedIOOBEBenchmark.doubleLoadArrayMaskIOOBE 1026 1152 thrpt 2 156.912 ops/ms >> LoadMaskedIOOBEBenchmark.floatLoadArrayMaskIOOBE 1026 1152 thrpt 2 255.814 ops/ms >> LoadMaskedIOOBEBenchmark.intLoadArrayMaskIOOBE 1026 1152 thrpt 2 267.688 ops/ms >> LoadMaskedIOOBEBenchmark.longLoadArrayMaskIOOBE 1026 1152 thrpt 2 140.957 ops/ms >> LoadMaskedIOOBEBenchmark.shortLoadArrayMaskIOOBE 1026 1152 thrpt 2 474.009 ops/ms >> >> >> With Opt: >> Benchmark (inSize) (outSize) Mode Cnt Score Error Units >> LoadMaskedIOOBEBenchmark.byteLoadArrayMaskIOOBE 1026 1152 thrpt 2 742.781 ops/ms >> LoadMaskedIOOBEBenchmark.doubleLoadArrayMaskIOOBE 1026 1152 thrpt 2 1241.021 ops/ms >> LoadMaskedIOOBEBenchmark.floatLoadArrayMaskIOOBE 1026 1152 thrpt 2 2333.311 ops/ms >> LoadMaskedIOOBEBenchmark.intLoadArrayMaskIOOBE 1026 1152 thrpt 2 3258.754 ops/ms >> LoadMaskedIOOBEBenchmark.longLoadArrayMaskIOOBE 1026 1152 thrpt 2 1757.192 ops/ms >> LoadMaskedIOOBEBenchmark.shortLoadArrayMaskIOOBE 1026 1152 thrpt 2 472.590 ops/ms >> >> >> Predicated memory operation over sub-word type will be handled in a subsequent patch. >> >> Kindly review and share your feedback. >> >> Best Regards, >> Jatin > > Jatin Bhateja has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains five commits: > > - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8289186 > - 8289186: jcheck failure > - 8289186: Review comments resolved. > - 8289186: Review comments resolved. > - 8289186: Support predicated vector load/store operations over X86 AVX2 targets. Testing results are good ------------- Marked as reviewed by kvn (Reviewer). 
PR: https://git.openjdk.org/jdk/pull/9324 From jbhateja at openjdk.org Sat Jul 9 15:14:57 2022 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Sat, 9 Jul 2022 15:14:57 GMT Subject: RFR: 8289186: Support predicated vector load/store operations over X86 AVX2 targets. [v5] In-Reply-To: <2s4pqMPNlfWxBbAlMIEN6XrM9GXJpjY8LOvPPO4OFBk=.06eb1a6e-151e-43a8-8c1f-3b16bd67a92b@github.com> References: <2s4pqMPNlfWxBbAlMIEN6XrM9GXJpjY8LOvPPO4OFBk=.06eb1a6e-151e-43a8-8c1f-3b16bd67a92b@github.com> Message-ID: On Fri, 8 Jul 2022 20:11:29 GMT, Vladimir Kozlov wrote: >> Jatin Bhateja has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains five commits: >> >> - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8289186 >> - 8289186: jcheck failure >> - 8289186: Review comments resolved. >> - 8289186: Review comments resolved. >> - 8289186: Support predicated vector load/store operations over X86 AVX2 targets. > > I started testing. Thanks @vnkozlov , @XiaohongGong for reviews. ------------- PR: https://git.openjdk.org/jdk/pull/9324 From jbhateja at openjdk.org Sat Jul 9 15:16:14 2022 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Sat, 9 Jul 2022 15:16:14 GMT Subject: Integrated: 8289186: Support predicated vector load/store operations over X86 AVX2 targets. In-Reply-To: References: Message-ID: On Wed, 29 Jun 2022 09:07:48 GMT, Jatin Bhateja wrote: > Hi All, > > [JDK-8283667](https://bugs.openjdk.org/browse/JDK-8283667) added the support to handle masked loads on non-predicated targets by blending the loaded contents with zero vector iff unmasked portion of load does not span beyond array bounds. > > X86 AVX2 offers direct predicated vector loads/store instruction for non-sub word type. > > This patch adds the efficient backend implementation for predicated memory operations over int/long/float/double vectors. > > Please find below the JMH micro stats with and without patch. 
> > > > System : Intel(R) Xeon(R) Platinum 8280 CPU @ 2.70GHz [28C 2S Cascadelake Server] > > Baseline: > Benchmark (inSize) (outSize) Mode Cnt Score Error Units > LoadMaskedIOOBEBenchmark.byteLoadArrayMaskIOOBE 1026 1152 thrpt 2 712.218 ops/ms > LoadMaskedIOOBEBenchmark.doubleLoadArrayMaskIOOBE 1026 1152 thrpt 2 156.912 ops/ms > LoadMaskedIOOBEBenchmark.floatLoadArrayMaskIOOBE 1026 1152 thrpt 2 255.814 ops/ms > LoadMaskedIOOBEBenchmark.intLoadArrayMaskIOOBE 1026 1152 thrpt 2 267.688 ops/ms > LoadMaskedIOOBEBenchmark.longLoadArrayMaskIOOBE 1026 1152 thrpt 2 140.957 ops/ms > LoadMaskedIOOBEBenchmark.shortLoadArrayMaskIOOBE 1026 1152 thrpt 2 474.009 ops/ms > > > With Opt: > Benchmark (inSize) (outSize) Mode Cnt Score Error Units > LoadMaskedIOOBEBenchmark.byteLoadArrayMaskIOOBE 1026 1152 thrpt 2 742.781 ops/ms > LoadMaskedIOOBEBenchmark.doubleLoadArrayMaskIOOBE 1026 1152 thrpt 2 1241.021 ops/ms > LoadMaskedIOOBEBenchmark.floatLoadArrayMaskIOOBE 1026 1152 thrpt 2 2333.311 ops/ms > LoadMaskedIOOBEBenchmark.intLoadArrayMaskIOOBE 1026 1152 thrpt 2 3258.754 ops/ms > LoadMaskedIOOBEBenchmark.longLoadArrayMaskIOOBE 1026 1152 thrpt 2 1757.192 ops/ms > LoadMaskedIOOBEBenchmark.shortLoadArrayMaskIOOBE 1026 1152 thrpt 2 472.590 ops/ms > > > Predicated memory operation over sub-word type will be handled in a subsequent patch. > > Kindly review and share your feedback. > > Best Regards, > Jatin This pull request has now been integrated. Changeset: 81ee7d28 Author: Jatin Bhateja URL: https://git.openjdk.org/jdk/commit/81ee7d28f8cb9f6c7fb6d2c76a0f14fd5147d93c Stats: 330 lines in 8 files changed: 290 ins; 39 del; 1 mod 8289186: Support predicated vector load/store operations over X86 AVX2 targets. 
Reviewed-by: xgong, kvn ------------- PR: https://git.openjdk.org/jdk/pull/9324 From duke at openjdk.org Sun Jul 10 16:20:16 2022 From: duke at openjdk.org (Yi-Fan Tsai) Date: Sun, 10 Jul 2022 16:20:16 GMT Subject: RFR: 8263377: Store method handle linkers in the 'non-nmethods' heap [v5] In-Reply-To: References: Message-ID: > 8263377: Store method handle linkers in the 'non-nmethods' heap Yi-Fan Tsai has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 20 commits: - Fix merge difference - Merge branch 'master' of https://github.com/yftsai/jdk into intrinsics - Merge branch 'master' of https://github.com/yftsai/jdk into intrinsics - Post dynamic_code_generate event when MH intrinsic generated - Remove dead codes remove unused argument of NativeJump::check_verified_entry_alignment remove unused argument of NativeJumip::patch_verified_entry remove dead codes in SharedRuntime::generate_method_handle_intrinsic_wrapper - Add PrintCodeCache support - Merge branch 'master' of https://github.com/yftsai/jdk into intrinsics - Merge branch 'master' of https://github.com/yftsai/jdk into intrinsics - Move to RuntimeBlob - Merge branch 'master' of https://github.com/yftsai/jdk into intrinsics - ... 
and 10 more: https://git.openjdk.org/jdk/compare/87aa3ce0...f65f7c08 ------------- Changes: https://git.openjdk.org/jdk/pull/8760/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=8760&range=04 Stats: 586 lines in 58 files changed: 279 ins; 174 del; 133 mod Patch: https://git.openjdk.org/jdk/pull/8760.diff Fetch: git fetch https://git.openjdk.org/jdk pull/8760/head:pull/8760 PR: https://git.openjdk.org/jdk/pull/8760 From kvn at openjdk.org Sun Jul 10 22:45:47 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Sun, 10 Jul 2022 22:45:47 GMT Subject: [jdk19] RFR: 8288112: C2: Error: ShouldNotReachHere() in Type::typerr() In-Reply-To: References: Message-ID: On Fri, 8 Jul 2022 21:57:33 GMT, Jatin Bhateja wrote: > [JDK-8284960](https://bugs.openjdk.org/browse/JDK-8284960) added new vector IR nodes and target-specific backend support for Reverse Byte vector operations. > Auto-vectorization analysis based on vector IR opcode existence and target backend implementation for existing Java SE APIs [Short/Character/Integer/Long].reverseBytes passes since both vector IR and backend support already exist for these operations. This bug fix patch handles the missing scalar reverse byte IR cases in the SLP optimizer to enable creation of the corresponding vector IR nodes. > > A new JBS issue [JDK-8290034](https://bugs.openjdk.org/browse/JDK-8290034) has been created to add the missing auto-vectorization support for the bit reverse operation targeting JDK mainline. > > Kindly review and share your feedback.
> > Best Regards, > Jatin Failed: test\micro\org\openjdk\bench\java\lang\Longs.java:148: error: incompatible types: possible lossy conversion from long to int [2022-07-10T21:01:01,470Z] int r = Long.reverseBytes(longArraySmall[i]); ------------- PR: https://git.openjdk.org/jdk19/pull/128 From fgao at openjdk.org Mon Jul 11 01:37:39 2022 From: fgao at openjdk.org (Fei Gao) Date: Mon, 11 Jul 2022 01:37:39 GMT Subject: RFR: 8288883: C2: assert(allow_address || t != T_ADDRESS) failed after JDK-8283091 In-Reply-To: <6KltnZFZDqf5kuIjF_t0ns-DVjqSaLj8kFfBxS6rwt0=.aa0372ae-3c0b-4b00-b005-5431a113c9f4@github.com> References: <6KltnZFZDqf5kuIjF_t0ns-DVjqSaLj8kFfBxS6rwt0=.aa0372ae-3c0b-4b00-b005-5431a113c9f4@github.com> Message-ID: On Fri, 8 Jul 2022 20:32:24 GMT, Vladimir Kozlov wrote: >>> In which call to `adjust_alignment_for_type_conversion()` you got AddP node? Should we add checks there too? >> >> Thanks for your review, @vnkozlov . >> >> When we called `adjust_alignment_for_type_conversion()` in `SuperWord::follow_def_uses()`, https://github.com/openjdk/jdk/blob/3f1174aa4709aabcfde8b40deec88b8ed466cc06/src/hotspot/share/opto/superword.cpp#L1525, we got AddP node. In this function, we also call `stmts_can_pack()` on the next line, which has checks to prevent unwanted pairs, https://github.com/openjdk/jdk/blob/3f1174aa4709aabcfde8b40deec88b8ed466cc06/src/hotspot/share/opto/superword.cpp#L1202. Maybe we don't have to add one more. WDYT? > > @fg1417 `stmts_can_pack()` is called in an other place which is preceded by `are_adjacent_refs()` call which also has primitive type check (but different). I was thinking to convert checks in `stmts_can_pack()` to `assert`. But, on other hand, `is_java_primitive(bt)` is cheap and I would prefer to keep checks in `stmts_can_pack()` as they are in case we call it in an other place. > > Anyway. After looking on code I agree with your current changes. Let me test it. And you need second review. 
Thanks for your review and test work, @vnkozlov . May I have a second review please? ------------- PR: https://git.openjdk.org/jdk/pull/9391 From pli at openjdk.org Mon Jul 11 08:54:24 2022 From: pli at openjdk.org (Pengfei Li) Date: Mon, 11 Jul 2022 08:54:24 GMT Subject: [jdk19] RFR: 8289954: C2: Assert failed in PhaseCFG::verify() after JDK-8183390 Message-ID: Fuzzer tests report an assertion failure in the C2 global code motion phase. Git bisection shows the problem starts after our fix of post loop vectorization (JDK-8183390). After some narrowing down, we found that it is caused by the change below in that patch.

@@ -422,14 +404,7 @@
       cl->mark_passed_slp();
     }
     cl->mark_was_slp();
-    if (cl->is_main_loop()) {
-      cl->set_slp_max_unroll(local_loop_unroll_factor);
-    } else if (post_loop_allowed) {
-      if (!small_basic_type) {
-        // avoid replication context for small basic types in programmable masked loops
-        cl->set_slp_max_unroll(local_loop_unroll_factor);
-      }
-    }
+    cl->set_slp_max_unroll(local_loop_unroll_factor);
   }
 }

This change is in function `SuperWord::unrolling_analysis()`. AFAIK, it helps find a loop's max unroll count via some analysis. In the original code, we have loop type checks and the slp max unroll value is set for only some types of loops. But in JDK-8183390, the check was removed by mistake. In my current understanding, the slp max unroll value applies to slp candidate loops only - either main loops or RCE'd post loops - so that check shouldn't be removed. After restoring it we don't see the assertion failure any more. The new jtreg test created in this patch can reproduce the failed assertion, which checks `def_block->dominates(block)` - the domination relationship of two blocks. But in this case, I found the blocks are in an unreachable inner loop, which I think ought to be optimized away in some previous C2 phases.
As I'm not very familiar with C2's global code motion, I still don't understand how the slp max unroll count eventually causes that problem. This patch just restores the if condition which I removed incorrectly in JDK-8183390. But I still suspect that another hidden bug exists in C2. I would be glad if any reviewers can give me some guidance or suggestions. Tested hotspot::hotspot_all_no_apps, jdk::tier1~3 and langtools::tier1. ------------- Commit messages: - 8289954: C2: Assert failed in PhaseCFG::verify() after JDK-8183390 Changes: https://git.openjdk.org/jdk19/pull/130/files Webrev: https://webrevs.openjdk.org/?repo=jdk19&pr=130&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8289954 Stats: 89 lines in 2 files changed: 88 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk19/pull/130.diff Fetch: git fetch https://git.openjdk.org/jdk19 pull/130/head:pull/130 PR: https://git.openjdk.org/jdk19/pull/130 From dnsimon at openjdk.org Mon Jul 11 09:10:06 2022 From: dnsimon at openjdk.org (Doug Simon) Date: Mon, 11 Jul 2022 09:10:06 GMT Subject: RFR: 8290065: [JVMCI] only check HotSpotCompiledCode stream is empty if installation succeeds Message-ID: Decoding the HotSpotCompiledCode stream (see [JDK-8289094](https://bugs.openjdk.org/browse/JDK-8289094)) can be short-circuited if certain limits are encountered, such as the code cache being full or `JVMCINMethodSizeLimit` being exceeded. This PR omits the check that the complete stream has been read if such a limit is hit.
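The idea described above can be sketched in Java roughly as follows. This is a simplified illustration under stated assumptions, not the JVMCI implementation: `InstallResult`, `install` and `installAndVerify` are hypothetical names, and a plain byte stream with a size limit stands in for the real encoded stream and its limits. The sketch only shows the control flow, where the "stream fully consumed" check runs solely when installation reports success.

```java
import java.io.ByteArrayInputStream;

/** Illustrative sketch only; names and types are hypothetical, not JVMCI code. */
final class StreamCheckSketch {
    /** Hypothetical result codes for an installation attempt. */
    enum InstallResult { OK, SIZE_LIMIT_EXCEEDED }

    /**
     * Decodes the stream byte by byte and stops early once sizeLimit bytes
     * have been consumed while more remain, leaving those bytes unread.
     */
    static InstallResult install(ByteArrayInputStream stream, int sizeLimit) {
        int consumed = 0;
        while (stream.available() > 0) {
            if (consumed == sizeLimit) {
                return InstallResult.SIZE_LIMIT_EXCEEDED; // short circuit
            }
            stream.read();
            consumed++;
        }
        return InstallResult.OK;
    }

    /** Checks that the stream is empty only if installation succeeded. */
    static InstallResult installAndVerify(byte[] encoded, int sizeLimit) {
        ByteArrayInputStream stream = new ByteArrayInputStream(encoded);
        InstallResult result = install(stream, sizeLimit);
        if (result == InstallResult.OK && stream.available() != 0) {
            throw new IllegalStateException("stream not fully consumed");
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(installAndVerify(new byte[4], 8));
        System.out.println(installAndVerify(new byte[16], 8));
    }
}
```

With an unconditional check, the second call in `main` would fail on the leftover bytes even though stopping early was the intended behavior.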
------------- Commit messages: - only check empty HotSpotCompiledCode stream if CodeInstallResult == ok Changes: https://git.openjdk.org/jdk/pull/9446/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=9446&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8290065 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/9446.diff Fetch: git fetch https://git.openjdk.org/jdk pull/9446/head:pull/9446 PR: https://git.openjdk.org/jdk/pull/9446 From jbhateja at openjdk.org Mon Jul 11 14:00:25 2022 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Mon, 11 Jul 2022 14:00:25 GMT Subject: [jdk19] RFR: 8288112: C2: Error: ShouldNotReachHere() in Type::typerr() [v2] In-Reply-To: References: Message-ID: <95P_JGByJWfTuz66uQvpFaP9Ka57EHxZDQ5nlp7MAc0=.127084fa-0b7c-4e5b-bf58-f3fdadc44796@github.com> > [JDK-8284960](https://bugs.openjdk.org/browse/JDK-8284960) added new vector IR nodes and target specific backend support for Reverse Byte vector operations. > Auto-vectorization analysis based on vector IR opcode existence and target backed implementation for existing Java SE APIs [Short/Character/Integer/Long].reverseBytes passes since both vector IR and backend support already exist for these operations. This bug fix patch handled missing scalar reverse byte IR cases in SLP optimizer to enable creation of corresponding vector IR nodes. > > A new JBS issue [JDK-8290034](https://bugs.openjdk.org/browse/JDK-8290034) is created to add the missing auto-vectorization support for bit reverse operation targeting JDK mainline. > > Kindly review and share your feedback. > > Best Regards, > Jatin Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: 8288112: Correcting micros. 
------------- Changes: - all: https://git.openjdk.org/jdk19/pull/128/files - new: https://git.openjdk.org/jdk19/pull/128/files/9915cc79..b4fc2c7b Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk19&pr=128&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk19&pr=128&range=00-01 Stats: 12 lines in 2 files changed: 6 ins; 2 del; 4 mod Patch: https://git.openjdk.org/jdk19/pull/128.diff Fetch: git fetch https://git.openjdk.org/jdk19 pull/128/head:pull/128 PR: https://git.openjdk.org/jdk19/pull/128 From jbhateja at openjdk.org Mon Jul 11 14:00:26 2022 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Mon, 11 Jul 2022 14:00:26 GMT Subject: [jdk19] RFR: 8288112: C2: Error: ShouldNotReachHere() in Type::typerr() In-Reply-To: References: Message-ID: On Fri, 8 Jul 2022 21:57:33 GMT, Jatin Bhateja wrote: > [JDK-8284960](https://bugs.openjdk.org/browse/JDK-8284960) added new vector IR nodes and target specific backend support for Reverse Byte vector operations. > Auto-vectorization analysis based on vector IR opcode existence and target backed implementation for existing Java SE APIs [Short/Character/Integer/Long].reverseBytes passes since both vector IR and backend support already exist for these operations. This bug fix patch handled missing scalar reverse byte IR cases in SLP optimizer to enable creation of corresponding vector IR nodes. > > A new JBS issue [JDK-8290034](https://bugs.openjdk.org/browse/JDK-8290034) is created to add the missing auto-vectorization support for bit reverse operation targeting JDK mainline. > > Kindly review and share your feedback. > > Best Regards, > Jatin > test\micro\org\openjdk\bench\java\lang\Longs.java:148: error: incompatible types: possible lossy conversion from long to in > Looks good to me, but I still have a problem with the error message. It seems like we could give a better error message if we detected the missing vectorization support earlier. What do you think? 
Hi @dean-long, thanks for your comments. The error occurs because SLP analysis passes all the checks that enable creation of vector ReverseByte IR, but the SLP backend was missing this case. As a result, some of the IR nodes were vectorized and fed into a scalar ReverseByte node, so during value computation the compiler encountered a meet operation between the vector and scalar lattices. Since it is an error in the internals of the JIT compiler, it is of no use to the user and only indicates that the compiler selected an incorrect control path. With this patch we no longer hit the trap. ------------- PR: https://git.openjdk.org/jdk19/pull/128 From dnsimon at openjdk.org Mon Jul 11 16:50:13 2022 From: dnsimon at openjdk.org (Doug Simon) Date: Mon, 11 Jul 2022 16:50:13 GMT Subject: RFR: 8290065: [JVMCI] only check HotSpotCompiledCode stream is empty if installation succeeds In-Reply-To: References: Message-ID: <4DnvAOwooi5BX9v51z473HW3wtL4Ez6ts4VbfowQ0pA=.b2bede1a-04d6-47f0-a278-08f6282e5f87@github.com> On Mon, 11 Jul 2022 16:25:51 GMT, Vladimir Kozlov wrote: >> Decoding the HotSpotCompiledCode stream (see [JDK-8289094](https://bugs.openjdk.org/browse/JDK-8289094)) can be short-circuited if certain limits are encountered, such as the code cache being full or `JVMCINMethodSizeLimit` being exceeded. >> This PR omits the check that the complete stream has been read if such a limit is hit. > > Good and trivial. Thanks for the review @vnkozlov.
------------- PR: https://git.openjdk.org/jdk/pull/9446 From dnsimon at openjdk.org Mon Jul 11 16:50:13 2022 From: dnsimon at openjdk.org (Doug Simon) Date: Mon, 11 Jul 2022 16:50:13 GMT Subject: Integrated: 8290065: [JVMCI] only check HotSpotCompiledCode stream is empty if installation succeeds In-Reply-To: References: Message-ID: On Mon, 11 Jul 2022 08:59:20 GMT, Doug Simon wrote: > Decoding the HotSpotCompiledCode stream (see [JDK-8289094](https://bugs.openjdk.org/browse/JDK-8289094)) can be short circuited if certain limits are encountered such as the code cache being full or `JVMCINMethodSizeLimit` being exceeded. > This PR omits the check that the complete stream has been read should be emitted if such a limit is hit. This pull request has now been integrated. Changeset: 21db9a50 Author: Doug Simon URL: https://git.openjdk.org/jdk/commit/21db9a507b441dbf909720b0b394f563e03aafc3 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod 8290065: [JVMCI] only check HotSpotCompiledCode stream is empty if installation succeeds Reviewed-by: kvn ------------- PR: https://git.openjdk.org/jdk/pull/9446 From kvn at openjdk.org Mon Jul 11 16:52:48 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Mon, 11 Jul 2022 16:52:48 GMT Subject: [jdk19] RFR: 8289954: C2: Assert failed in PhaseCFG::verify() after JDK-8183390 In-Reply-To: References: Message-ID: <9AIZ8Ln_Yjwtkw9wepi-lS1G4NMjPaLXUVxuL6Xithk=.02591ccb-af5e-411f-bc1e-1fd89cc6a858@github.com> On Mon, 11 Jul 2022 08:41:21 GMT, Pengfei Li wrote: > Fuzzer tests report an assertion failure issue in C2 global code motion > phase. Git bisection shows the problem starts after our fix of post loop > vectorization (JDK-8183390). After some narrowing down work, we find it > is caused by below change in that patch. 
> > > @@ -422,14 +404,7 @@ > cl->mark_passed_slp(); > } > cl->mark_was_slp(); > - if (cl->is_main_loop()) { > - cl->set_slp_max_unroll(local_loop_unroll_factor); > - } else if (post_loop_allowed) { > - if (!small_basic_type) { > - // avoid replication context for small basic types in programmable masked loops > - cl->set_slp_max_unroll(local_loop_unroll_factor); > - } > - } > + cl->set_slp_max_unroll(local_loop_unroll_factor); > } > } > > > This change is in function `SuperWord::unrolling_analysis()`. AFAIK, it > helps find a loop's max unroll count via some analysis. In the original > code, we have loop type checks and the slp max unroll value is set for > only some types of loops. But in JDK-8183390, the check was removed by > mistake. In my current understanding, the slp max unroll value applies > to slp candidate loops only - either main loops or RCE'd post loops - > so that check shouldn't be removed. After restoring it we don't see the > assertion failure any more. > > The new jtreg created in this patch can reproduce the failed assertion, > which checks `def_block->dominates(block)` - the domination relationship > of two blocks. But in the case, I found the blocks are in an unreachable > inner loop, which I think ought to be optimized away in some previous C2 > phases. As I'm not quite familiar with the C2's global code motion, so > far I still don't understand how slp max unroll count eventually causes > that problem. This patch just restores the if condition which I removed > incorrectly in JDK-8183390. But I still suspect that there is another > hidden bug exists in C2. I would be glad if any reviewers can give me > some guidance or suggestions. > > Tested hotspot::hotspot_all_no_apps, jdk::tier1~3 and langtools::tier1. Good. I will test it. There could be code that uses the `slp_max_unroll` value as an indicator of a `main` loop. Or setting `slp_max_unroll` on a pre-/post-loop may have exposed a bug.
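For reference, the guard that the patch restores (the removed lines in the diff quoted above) can be sketched in isolation. This is only an illustration under stated assumptions: `LoopSketch` and `record_slp_unroll` are hypothetical names, and the struct models just the two members the diff touches, not C2's real loop node class.

```cpp
// Hypothetical stand-in for the loop node seen in the diff; only the members
// needed by the sketch are modelled.
struct LoopSketch {
    bool main_loop = false;
    int slp_max_unroll = 0;

    bool is_main_loop() const { return main_loop; }
    void set_slp_max_unroll(int factor) { slp_max_unroll = factor; }
};

// Mirrors the restored condition: the unroll factor is recorded for main loops,
// and for other loops only when post loop vectorization is allowed and the
// element type is not a small basic type.
void record_slp_unroll(LoopSketch& cl, int local_loop_unroll_factor,
                       bool post_loop_allowed, bool small_basic_type) {
    if (cl.is_main_loop()) {
        cl.set_slp_max_unroll(local_loop_unroll_factor);
    } else if (post_loop_allowed) {
        if (!small_basic_type) {
            cl.set_slp_max_unroll(local_loop_unroll_factor);
        }
    }
}
```

With the guard in place, a non-main loop analysed while post loop vectorization is disallowed keeps its default value of 0 in this sketch, whereas the unconditional version from JDK-8183390 would have recorded the factor for every loop.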
I suggest going with your fix for JDK 19 and maybe investigating the issue in JDK 20. ------------- PR: https://git.openjdk.org/jdk19/pull/130 From kvn at openjdk.org Mon Jul 11 16:41:43 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Mon, 11 Jul 2022 16:41:43 GMT Subject: RFR: 8290066: Remove KNL specific handling for new CPU target check in IR annotation In-Reply-To: References: Message-ID: <8O41Cd2Okto_f1iAJobk-wn6ohUBO2FzNkN2rgtFVzM=.83a57d23-1694-45b1-a893-c42e2adc8be0@github.com> On Mon, 11 Jul 2022 12:55:02 GMT, Jatin Bhateja wrote: > - Newly added annotations query the CPU feature using white box API which returns the list of features enabled during VM initialization. > - With JVM flag UseKNLSetting, during VM initialization AVX512 features not supported by KNL target are disabled, thus we do not need any special handling for KNL in newly introduced IR annotations (applyCPUFeature, applyCPUFeatureOr, applyCPUFeatureAnd). > > Please review and share your feedback. > > Best Regards, > Jatin test/hotspot/jtreg/compiler/lib/ir_framework/TestFramework.java line 138: > 136: "Xlog", > 137: "UseAVX", > 138: "UseKNLSetting", I don't think we should add these flags to the whitelist in these changes - they can affect generated code. A new RFE has already been filed to handle such flags: [8289801](https://bugs.openjdk.org/browse/JDK-8289801) ------------- PR: https://git.openjdk.org/jdk/pull/9452 From kvn at openjdk.org Mon Jul 11 16:27:52 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Mon, 11 Jul 2022 16:27:52 GMT Subject: RFR: 8290065: [JVMCI] only check HotSpotCompiledCode stream is empty if installation succeeds In-Reply-To: References: Message-ID: On Mon, 11 Jul 2022 08:59:20 GMT, Doug Simon wrote: > Decoding the HotSpotCompiledCode stream (see [JDK-8289094](https://bugs.openjdk.org/browse/JDK-8289094)) can be short circuited if certain limits are encountered such as the code cache being full or `JVMCINMethodSizeLimit` being exceeded.
> This PR omits the check that the complete stream has been read should be emitted if such a limit is hit. Good and trivial. ------------- Marked as reviewed by kvn (Reviewer). PR: https://git.openjdk.org/jdk/pull/9446 From dlong at openjdk.org Mon Jul 11 19:00:47 2022 From: dlong at openjdk.org (Dean Long) Date: Mon, 11 Jul 2022 19:00:47 GMT Subject: [jdk19] RFR: 8288112: C2: Error: ShouldNotReachHere() in Type::typerr() In-Reply-To: References: Message-ID: On Mon, 11 Jul 2022 13:56:23 GMT, Jatin Bhateja wrote: >> [JDK-8284960](https://bugs.openjdk.org/browse/JDK-8284960) added new vector IR nodes and target specific backend support for ReverseByte vector operations. >> For each scalar operation, auto-vectorizer analysis stage checks for existence of vector IR opcode and target specific backend implementation. While processing scalar IR nodes corresponding to Java SE APIs [Short/Character/Integer/Long].reverseBytes SLP analysis checks passes since relevant support already existed. This bug fix patch handles missing scalar reversebyte opcode checks in SLP backed to enable creation of corresponding vector IR nodes. >> >> A new JBS issue [JDK-8290034](https://bugs.openjdk.org/browse/JDK-8290034) is created to add the missing auto-vectorization support for bit reverse operation targeting JDK mainline. >> >> Kindly review and share your feedback. >> >> Best Regards, >> Jatin > >> test\micro\org\openjdk\bench\java\lang\Longs.java:148: error: incompatible types: possible lossy conversion from long to in > >> Looks good to me, but I still have a problem with the error message. It seems like we could give a better error message if we detected the missing vectorization support earlier. What do you think? 
> > Hi @dean-long , Thanks for your comments, error occurs because SLP analysis passes all the checks to enable creation of vector ReverseByte IR, but SLP backend had this case missing due to which some of the IR nodes were vectorized and were feeding into a scalar ReverseByte node, thus while doing a value computation compiler encounters a meet operation b/w vector and scalar lattice. Since its an error related to internals of JIT compiler it will not be adding any value to user and just represent incorrect control path selected by compiler. With this patch we do not hit the trap any more. @jatin-bhateja Right, a better error message wouldn't be a value-add to users, but if we could detect it sooner, like in SuperWord::output(), that might be useful to compiler engineers debugging issues. ------------- PR: https://git.openjdk.org/jdk19/pull/128 From duke at openjdk.org Mon Jul 11 19:38:44 2022 From: duke at openjdk.org (Cesar Soares) Date: Mon, 11 Jul 2022 19:38:44 GMT Subject: RFR: 8289943: Simplify some object allocation merges In-Reply-To: References: Message-ID: <11HraylVrw1hGLKWmAQBOSOcDxIXx7ZlY5Z7zp2vEvY=.923afa89-5606-4b2f-be93-48930480c076@github.com> On Fri, 8 Jul 2022 23:34:28 GMT, Vladimir Kozlov wrote: > This is good starting point. To have new "Phi" type node to collect information about merged allocation. > I need more time to dive into changes to give review. Thanks for taking the time to look into this! > I currently found one issue we need discuss - merge allocation of different subclasses (of the same parent class) which may have different number of fields. Current implementation assume objects are the same but I don't see the check for it during RAM node creation. May be we should have it at this initial implementation. Got it. I'll create some tests for this and see what happens. > What about adjust_scalar_replaceable_state() code mark allocation as non-SR if they are merged? I didn't get this part. Can you please clarify? 
> About input memory slices. Since merged allocation are SR we should have some new memory Phi created in EA split_memory_phi() which we can try to identify instead of adding all memory slices we find (I am talking about RAM constructor). I'll take a look into that. Thanks for the suggestion. > I see you bailed compilation to recompile in case you can't remove RAM node. I think it is fine for initial implementation. Great. ------------- PR: https://git.openjdk.org/jdk/pull/9073 From jbhateja at openjdk.org Mon Jul 11 20:50:38 2022 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Mon, 11 Jul 2022 20:50:38 GMT Subject: [jdk19] RFR: 8288112: C2: Error: ShouldNotReachHere() in Type::typerr() In-Reply-To: References: Message-ID: On Mon, 11 Jul 2022 13:56:23 GMT, Jatin Bhateja wrote: >> [JDK-8284960](https://bugs.openjdk.org/browse/JDK-8284960) added new vector IR nodes and target specific backend support for ReverseByte vector operations. >> For each scalar operation, auto-vectorizer analysis stage checks for existence of vector IR opcode and target specific backend implementation. While processing scalar IR nodes corresponding to Java SE APIs [Short/Character/Integer/Long].reverseBytes SLP analysis checks passes since relevant support already existed. This bug fix patch handles missing scalar reversebyte opcode checks in SLP backed to enable creation of corresponding vector IR nodes. >> >> A new JBS issue [JDK-8290034](https://bugs.openjdk.org/browse/JDK-8290034) is created to add the missing auto-vectorization support for bit reverse operation targeting JDK mainline. >> >> Kindly review and share your feedback. >> >> Best Regards, >> Jatin > >> test\micro\org\openjdk\bench\java\lang\Longs.java:148: error: incompatible types: possible lossy conversion from long to in > >> Looks good to me, but I still have a problem with the error message. It seems like we could give a better error message if we detected the missing vectorization support earlier. What do you think? 
> > Hi @dean-long , Thanks for your comments, error occurs because SLP analysis passes all the checks to enable creation of vector ReverseByte IR, but SLP backend had this case missing due to which some of the IR nodes were vectorized and were feeding into a scalar ReverseByte node, thus while doing a value computation compiler encounters a meet operation b/w vector and scalar lattice. Since its an error related to internals of JIT compiler it will not be adding any value to user and just represent incorrect control path selected by compiler. With this patch we do not hit the trap any more. > @jatin-bhateja Right, a better error message wouldn't be a value-add to users, but if we could detect it sooner, like in SuperWord::output(), that might be useful to compiler engineers debugging issues. Hi @dean-long, agreed. I have changed the error message generated with -XX:+TraceLoopOpts to be more explicit, for example: **SWPointer::output: Unhandled scalar opcode (ReverseBytesI), ShouldNotReachHere, exiting SuperWord** Since the patch already handles the missing scalar case, this message will not be generated. ------------- PR: https://git.openjdk.org/jdk19/pull/128 From jbhateja at openjdk.org Mon Jul 11 20:59:36 2022 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Mon, 11 Jul 2022 20:59:36 GMT Subject: [jdk19] RFR: 8288112: C2: Error: ShouldNotReachHere() in Type::typerr() [v3] In-Reply-To: References: Message-ID: > [JDK-8284960](https://bugs.openjdk.org/browse/JDK-8284960) added new vector IR nodes and target specific backend support for ReverseByte vector operations. > For each scalar operation, auto-vectorizer analysis stage checks for existence of vector IR opcode and target specific backend implementation. While processing scalar IR nodes corresponding to Java SE APIs [Short/Character/Integer/Long].reverseBytes SLP analysis checks passes since relevant support already existed.
This bug fix patch handles missing scalar reversebyte opcode checks in SLP backed to enable creation of corresponding vector IR nodes. > > A new JBS issue [JDK-8290034](https://bugs.openjdk.org/browse/JDK-8290034) is created to add the missing auto-vectorization support for bit reverse operation targeting JDK mainline. > > Kindly review and share your feedback. > > Best Regards, > Jatin Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: 8288112: Modifying SLP error generated with -XX:+TraceLoopOpts ------------- Changes: - all: https://git.openjdk.org/jdk19/pull/128/files - new: https://git.openjdk.org/jdk19/pull/128/files/b4fc2c7b..d3556cbb Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk19&pr=128&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk19&pr=128&range=01-02 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk19/pull/128.diff Fetch: git fetch https://git.openjdk.org/jdk19 pull/128/head:pull/128 PR: https://git.openjdk.org/jdk19/pull/128 From jbhateja at openjdk.org Mon Jul 11 21:03:30 2022 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Mon, 11 Jul 2022 21:03:30 GMT Subject: RFR: 8290066: Remove KNL specific handling for new CPU target check in IR annotation [v2] In-Reply-To: References: Message-ID: > - Newly added annotations query the CPU feature using white box API which returns the list of features enabled during VM initialization. > - With JVM flag UseKNLSetting, during VM initialization AVX512 features not supported by KNL target are disabled, thus we do not need any special handling for KNL in newly introduced IR annotations (applyCPUFeature, applyCPUFeatureOr, applyCPUFeatureAnd). > > Please review and share your feedback. > > Best Regards, > Jatin Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: 8290066: Removing newly added white listed options. 
------------- Changes: - all: https://git.openjdk.org/jdk/pull/9452/files - new: https://git.openjdk.org/jdk/pull/9452/files/923abfd0..c7036cde Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=9452&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=9452&range=00-01 Stats: 2 lines in 1 file changed: 0 ins; 2 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/9452.diff Fetch: git fetch https://git.openjdk.org/jdk pull/9452/head:pull/9452 PR: https://git.openjdk.org/jdk/pull/9452 From jbhateja at openjdk.org Mon Jul 11 21:03:30 2022 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Mon, 11 Jul 2022 21:03:30 GMT Subject: RFR: 8290066: Remove KNL specific handling for new CPU target check in IR annotation [v2] In-Reply-To: <8O41Cd2Okto_f1iAJobk-wn6ohUBO2FzNkN2rgtFVzM=.83a57d23-1694-45b1-a893-c42e2adc8be0@github.com> References: <8O41Cd2Okto_f1iAJobk-wn6ohUBO2FzNkN2rgtFVzM=.83a57d23-1694-45b1-a893-c42e2adc8be0@github.com> Message-ID: On Mon, 11 Jul 2022 16:38:12 GMT, Vladimir Kozlov wrote: >> Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: >> >> 8290066: Removing newly added white listed options. > > test/hotspot/jtreg/compiler/lib/ir_framework/TestFramework.java line 138: > >> 136: "Xlog", >> 137: "UseAVX", >> 138: "UseKNLSetting", > > I don't think we should add these flags to whitelist in these changes - they can affect generated code. > New RFE is filed already to handle such flags: [8289801](https://bugs.openjdk.org/browse/JDK-8289801) Done. 
------------- PR: https://git.openjdk.org/jdk/pull/9452 From dlong at openjdk.org Mon Jul 11 21:03:50 2022 From: dlong at openjdk.org (Dean Long) Date: Mon, 11 Jul 2022 21:03:50 GMT Subject: [jdk19] RFR: 8288112: C2: Error: ShouldNotReachHere() in Type::typerr() In-Reply-To: References: Message-ID: <3cX2IdssLz4arXwc028K93wtFv7PywXph9JYx7LTToc=.4be13b21-20c1-4aa2-bd25-ad8b07b69a9e@github.com> On Mon, 11 Jul 2022 20:47:02 GMT, Jatin Bhateja wrote: >>> test\micro\org\openjdk\bench\java\lang\Longs.java:148: error: incompatible types: possible lossy conversion from long to in >> >>> Looks good to me, but I still have a problem with the error message. It seems like we could give a better error message if we detected the missing vectorization support earlier. What do you think? >> >> Hi @dean-long , Thanks for your comments, error occurs because SLP analysis passes all the checks to enable creation of vector ReverseByte IR, but SLP backend had this case missing due to which some of the IR nodes were vectorized and were feeding into a scalar ReverseByte node, thus while doing a value computation compiler encounters a meet operation b/w vector and scalar lattice. Since its an error related to internals of JIT compiler it will not be adding any value to user and just represent incorrect control path selected by compiler. With this patch we do not hit the trap any more. > >> @jatin-bhateja Right, a better error message wouldn't be a value-add to users, but if we could detect it sooner, like in SuperWord::output(), that might be useful to compiler engineers debugging issues. > > Hi @dean-long , Agree, I have changed the error message generated with -XX:+TraceLoopOpts to be more explicit like. > **SWPointer::output: Unhandled scalar opcode (ReverseBytesI), ShouldNotReachHere, exiting SuperWord** > > Since patch already handles the missing scalar case, hence this message will not be generated. @jatin-bhateja OK, thanks. 
------------- PR: https://git.openjdk.org/jdk19/pull/128 From dlong at openjdk.org Mon Jul 11 22:43:02 2022 From: dlong at openjdk.org (Dean Long) Date: Mon, 11 Jul 2022 22:43:02 GMT Subject: RFR: 8290082: [PPC64] ZGC C2 load barrier stub needs to preserve vector registers [v2] In-Reply-To: References: Message-ID: <_TCVGOPioj8dj6ZJkBxBOj2S26Z5SrVI4oQ4Jxg_bG8=.941452c4-e0b8-4161-b09c-359e74c7cfb0@github.com> On Mon, 11 Jul 2022 15:36:35 GMT, Martin Doerr wrote: >> Preserve volatile vector registers in ZGC C2 load barrier stub. > > Martin Doerr has updated the pull request incrementally with one additional commit since the last revision: > > Avoid using more than the volatile program storage (288 Bytes) on stack below the SP. Does this need to be fixed in jdk19? ------------- PR: https://git.openjdk.org/jdk/pull/9453 From kvn at openjdk.org Mon Jul 11 23:45:38 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Mon, 11 Jul 2022 23:45:38 GMT Subject: RFR: 8289943: Simplify some object allocation merges In-Reply-To: <11HraylVrw1hGLKWmAQBOSOcDxIXx7ZlY5Z7zp2vEvY=.923afa89-5606-4b2f-be93-48930480c076@github.com> References: <11HraylVrw1hGLKWmAQBOSOcDxIXx7ZlY5Z7zp2vEvY=.923afa89-5606-4b2f-be93-48930480c076@github.com> Message-ID: On Mon, 11 Jul 2022 19:34:47 GMT, Cesar Soares wrote: > > What about adjust_scalar_replaceable_state() code mark allocation as non-SR if they are merged? > > I didn't get this part. Can you please clarify? My bad. After looking more on changes I noticed that you exit `compute_escape()` before `adjust_scalar_replaceable_state()` is called. So my comment is null. 
------------- PR: https://git.openjdk.org/jdk/pull/9073 From kvn at openjdk.org Mon Jul 11 23:47:55 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Mon, 11 Jul 2022 23:47:55 GMT Subject: [jdk19] RFR: 8289954: C2: Assert failed in PhaseCFG::verify() after JDK-8183390 In-Reply-To: References: Message-ID: On Mon, 11 Jul 2022 08:41:21 GMT, Pengfei Li wrote: > Fuzzer tests report an assertion failure issue in C2 global code motion > phase. Git bisection shows the problem starts after our fix of post loop > vectorization (JDK-8183390). After some narrowing down work, we find it > is caused by below change in that patch. > > > @@ -422,14 +404,7 @@ > cl->mark_passed_slp(); > } > cl->mark_was_slp(); > - if (cl->is_main_loop()) { > - cl->set_slp_max_unroll(local_loop_unroll_factor); > - } else if (post_loop_allowed) { > - if (!small_basic_type) { > - // avoid replication context for small basic types in programmable masked loops > - cl->set_slp_max_unroll(local_loop_unroll_factor); > - } > - } > + cl->set_slp_max_unroll(local_loop_unroll_factor); > } > } > > > This change is in function `SuperWord::unrolling_analysis()`. AFAIK, it > helps find a loop's max unroll count via some analysis. In the original > code, we have loop type checks and the slp max unroll value is set for > only some types of loops. But in JDK-8183390, the check was removed by > mistake. In my current understanding, the slp max unroll value applies > to slp candidate loops only - either main loops or RCE'd post loops - > so that check shouldn't be removed. After restoring it we don't see the > assertion failure any more. > > The new jtreg created in this patch can reproduce the failed assertion, > which checks `def_block->dominates(block)` - the domination relationship > of two blocks. But in the case, I found the blocks are in an unreachable > inner loop, which I think ought to be optimized away in some previous C2 > phases. 
As I'm not quite familiar with the C2's global code motion, so > far I still don't understand how slp max unroll count eventually causes > that problem. This patch just restores the if condition which I removed > incorrectly in JDK-8183390. But I still suspect that there is another > hidden bug exists in C2. I would be glad if any reviewers can give me > some guidance or suggestions. > > Tested hotspot::hotspot_all_no_apps, jdk::tier1~3 and langtools::tier1. Testing passed. ------------- Marked as reviewed by kvn (Reviewer). PR: https://git.openjdk.org/jdk19/pull/130 From kvn at openjdk.org Mon Jul 11 23:49:43 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Mon, 11 Jul 2022 23:49:43 GMT Subject: RFR: 8290066: Remove KNL specific handling for new CPU target check in IR annotation [v2] In-Reply-To: References: Message-ID: On Mon, 11 Jul 2022 21:03:30 GMT, Jatin Bhateja wrote: >> - Newly added annotations query the CPU feature using white box API which returns the list of features enabled during VM initialization. >> - With JVM flag UseKNLSetting, during VM initialization AVX512 features not supported by KNL target are disabled, thus we do not need any special handling for KNL in newly introduced IR annotations (applyCPUFeature, applyCPUFeatureOr, applyCPUFeatureAnd). >> >> Please review and share your feedback. >> >> Best Regards, >> Jatin > > Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: > > 8290066: Removing newly added white listed options. Good. I will start testing. 
------------- PR: https://git.openjdk.org/jdk/pull/9452 From duke at openjdk.org Tue Jul 12 00:59:23 2022 From: duke at openjdk.org (Cesar Soares) Date: Tue, 12 Jul 2022 00:59:23 GMT Subject: RFR: 8289943: Simplify some object allocation merges In-Reply-To: References: <11HraylVrw1hGLKWmAQBOSOcDxIXx7ZlY5Z7zp2vEvY=.923afa89-5606-4b2f-be93-48930480c076@github.com> Message-ID: <6WxVYuLGhNTyHtNdfDPzJMSecT6fTD4Ba4y-eMFwVA0=.5e9fa8f7-b0ea-449c-a2bd-4acd608477df@github.com> On Mon, 11 Jul 2022 23:42:13 GMT, Vladimir Kozlov wrote: > After looking more on changes I noticed that you exit compute_escape() before adjust_scalar_replaceable_state() is called. No worries. I'm currently working to make `reduce_allocation_merges` be executed as part of compute_escape so that we can take benefit from iterative EA executions. ------------- PR: https://git.openjdk.org/jdk/pull/9073 From pli at openjdk.org Tue Jul 12 01:50:29 2022 From: pli at openjdk.org (Pengfei Li) Date: Tue, 12 Jul 2022 01:50:29 GMT Subject: [jdk19] RFR: 8289954: C2: Assert failed in PhaseCFG::verify() after JDK-8183390 In-Reply-To: References: Message-ID: On Mon, 11 Jul 2022 23:44:25 GMT, Vladimir Kozlov wrote: >> Fuzzer tests report an assertion failure issue in C2 global code motion >> phase. Git bisection shows the problem starts after our fix of post loop >> vectorization (JDK-8183390). After some narrowing down work, we find it >> is caused by below change in that patch. >> >> >> @@ -422,14 +404,7 @@ >> cl->mark_passed_slp(); >> } >> cl->mark_was_slp(); >> - if (cl->is_main_loop()) { >> - cl->set_slp_max_unroll(local_loop_unroll_factor); >> - } else if (post_loop_allowed) { >> - if (!small_basic_type) { >> - // avoid replication context for small basic types in programmable masked loops >> - cl->set_slp_max_unroll(local_loop_unroll_factor); >> - } >> - } >> + cl->set_slp_max_unroll(local_loop_unroll_factor); >> } >> } >> >> >> This change is in function `SuperWord::unrolling_analysis()`. 
AFAIK, it >> helps find a loop's max unroll count via some analysis. In the original >> code, we have loop type checks and the slp max unroll value is set for >> only some types of loops. But in JDK-8183390, the check was removed by >> mistake. In my current understanding, the slp max unroll value applies >> to slp candidate loops only - either main loops or RCE'd post loops - >> so that check shouldn't be removed. After restoring it we don't see the >> assertion failure any more. >> >> The new jtreg created in this patch can reproduce the failed assertion, >> which checks `def_block->dominates(block)` - the domination relationship >> of two blocks. But in the case, I found the blocks are in an unreachable >> inner loop, which I think ought to be optimized away in some previous C2 >> phases. As I'm not quite familiar with the C2's global code motion, so >> far I still don't understand how slp max unroll count eventually causes >> that problem. This patch just restores the if condition which I removed >> incorrectly in JDK-8183390. But I still suspect that there is another >> hidden bug exists in C2. I would be glad if any reviewers can give me >> some guidance or suggestions. >> >> Tested hotspot::hotspot_all_no_apps, jdk::tier1~3 and langtools::tier1. > > Testing passed. @vnkozlov Thanks for looking at this. I think a 2nd review is required, right? ------------- PR: https://git.openjdk.org/jdk19/pull/130 From kvn at openjdk.org Tue Jul 12 02:00:39 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Tue, 12 Jul 2022 02:00:39 GMT Subject: [jdk19] RFR: 8289954: C2: Assert failed in PhaseCFG::verify() after JDK-8183390 In-Reply-To: References: Message-ID: On Mon, 11 Jul 2022 23:44:25 GMT, Vladimir Kozlov wrote: >> Fuzzer tests report an assertion failure issue in C2 global code motion >> phase. Git bisection shows the problem starts after our fix of post loop >> vectorization (JDK-8183390). 
After some narrowing down work, we find it >> is caused by below change in that patch. >> >> >> @@ -422,14 +404,7 @@ >> cl->mark_passed_slp(); >> } >> cl->mark_was_slp(); >> - if (cl->is_main_loop()) { >> - cl->set_slp_max_unroll(local_loop_unroll_factor); >> - } else if (post_loop_allowed) { >> - if (!small_basic_type) { >> - // avoid replication context for small basic types in programmable masked loops >> - cl->set_slp_max_unroll(local_loop_unroll_factor); >> - } >> - } >> + cl->set_slp_max_unroll(local_loop_unroll_factor); >> } >> } >> >> >> This change is in function `SuperWord::unrolling_analysis()`. AFAIK, it >> helps find a loop's max unroll count via some analysis. In the original >> code, we have loop type checks and the slp max unroll value is set for >> only some types of loops. But in JDK-8183390, the check was removed by >> mistake. In my current understanding, the slp max unroll value applies >> to slp candidate loops only - either main loops or RCE'd post loops - >> so that check shouldn't be removed. After restoring it we don't see the >> assertion failure any more. >> >> The new jtreg created in this patch can reproduce the failed assertion, >> which checks `def_block->dominates(block)` - the domination relationship >> of two blocks. But in the case, I found the blocks are in an unreachable >> inner loop, which I think ought to be optimized away in some previous C2 >> phases. As I'm not quite familiar with the C2's global code motion, so >> far I still don't understand how slp max unroll count eventually causes >> that problem. This patch just restores the if condition which I removed >> incorrectly in JDK-8183390. But I still suspect that there is another >> hidden bug exists in C2. I would be glad if any reviewers can give me >> some guidance or suggestions. >> >> Tested hotspot::hotspot_all_no_apps, jdk::tier1~3 and langtools::tier1. > > Testing passed. > @vnkozlov Thanks for looking at this. I think a 2nd review is required, right? 
yes ------------- PR: https://git.openjdk.org/jdk19/pull/130 From mdoerr at openjdk.org Tue Jul 12 04:56:26 2022 From: mdoerr at openjdk.org (Martin Doerr) Date: Tue, 12 Jul 2022 04:56:26 GMT Subject: RFR: 8290082: [PPC64] ZGC C2 load barrier stub needs to preserve vector registers [v2] In-Reply-To: References: Message-ID: On Mon, 11 Jul 2022 15:36:35 GMT, Martin Doerr wrote: >> Preserve volatile vector registers in ZGC C2 load barrier stub. > > Martin Doerr has updated the pull request incrementally with one additional commit since the last revision: > > Avoid using more than the volatile program storage (288 Bytes) on stack below the SP. Would be nice to have in 19, but it doesn't apply cleanly. There is a workaround. I prefer avoiding merging work for Oracle employees. We need it in 17u and 21 LTS. ------------- PR: https://git.openjdk.org/jdk/pull/9453 From rrich at openjdk.org Tue Jul 12 07:15:45 2022 From: rrich at openjdk.org (Richard Reingruber) Date: Tue, 12 Jul 2022 07:15:45 GMT Subject: RFR: 8290082: [PPC64] ZGC C2 load barrier stub needs to preserve vector registers [v2] In-Reply-To: References: Message-ID: On Mon, 11 Jul 2022 15:36:35 GMT, Martin Doerr wrote: >> Preserve volatile vector registers in ZGC C2 load barrier stub. > > Martin Doerr has updated the pull request incrementally with one additional commit since the last revision: > > Avoid using more than the volatile program storage (288 Bytes) on stack below the SP. src/hotspot/cpu/ppc/gc/z/zBarrierSetAssembler_ppc.cpp line 486: > 484: assert(SuperwordUseVSX, "or should not reach here"); > 485: VectorSRegister vs_reg = vm_reg->as_VectorSRegister(); > 486: if (vs_reg->encoding() >= VSR32->encoding() && vs_reg->encoding() <= VSR51->encoding()) { Why VSR32 as lower bound? I read in ppc.ad 1st 32 VSRs are aliases for the FPRs wich are already defined above. Could you please help and explain what this means? Why VSR51 as upper bound? 
I'd suggest to update the comment in register_ppc.hpp and explain the vector scalar registers. What is the difference between vector and vector scalar registers? ------------- PR: https://git.openjdk.org/jdk/pull/9453 From jbhateja at openjdk.org Tue Jul 12 08:03:50 2022 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Tue, 12 Jul 2022 08:03:50 GMT Subject: [jdk19] RFR: 8288112: C2: Error: ShouldNotReachHere() in Type::typerr() [v4] In-Reply-To: References: Message-ID: > [JDK-8284960](https://bugs.openjdk.org/browse/JDK-8284960) added new vector IR nodes and target specific backend support for ReverseByte vector operations. > For each scalar operation, auto-vectorizer analysis stage checks for existence of vector IR opcode and target specific backend implementation. While processing scalar IR nodes corresponding to Java SE APIs [Short/Character/Integer/Long].reverseBytes SLP analysis checks passes since relevant support already existed. This bug fix patch handles missing scalar reversebyte opcode checks in SLP backed to enable creation of corresponding vector IR nodes. > > A new JBS issue [JDK-8290034](https://bugs.openjdk.org/browse/JDK-8290034) is created to add the missing auto-vectorization support for bit reverse operation targeting JDK mainline. > > Kindly review and share your feedback. > > Best Regards, > Jatin Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: 8288112: Minor adjustment to benchmark. 
------------- Changes: - all: https://git.openjdk.org/jdk19/pull/128/files - new: https://git.openjdk.org/jdk19/pull/128/files/d3556cbb..43bfa40d Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk19&pr=128&range=03 - incr: https://webrevs.openjdk.org/?repo=jdk19&pr=128&range=02-03 Stats: 4 lines in 2 files changed: 0 ins; 2 del; 2 mod Patch: https://git.openjdk.org/jdk19/pull/128.diff Fetch: git fetch https://git.openjdk.org/jdk19 pull/128/head:pull/128 PR: https://git.openjdk.org/jdk19/pull/128 From mdoerr at openjdk.org Tue Jul 12 08:06:41 2022 From: mdoerr at openjdk.org (Martin Doerr) Date: Tue, 12 Jul 2022 08:06:41 GMT Subject: RFR: 8290082: [PPC64] ZGC C2 load barrier stub needs to preserve vector registers [v2] In-Reply-To: References: Message-ID: On Tue, 12 Jul 2022 07:13:42 GMT, Richard Reingruber wrote: >> Martin Doerr has updated the pull request incrementally with one additional commit since the last revision: >> >> Avoid using more than the volatile program storage (288 Bytes) on stack below the SP. > > src/hotspot/cpu/ppc/gc/z/zBarrierSetAssembler_ppc.cpp line 486: > >> 484: assert(SuperwordUseVSX, "or should not reach here"); >> 485: VectorSRegister vs_reg = vm_reg->as_VectorSRegister(); >> 486: if (vs_reg->encoding() >= VSR32->encoding() && vs_reg->encoding() <= VSR51->encoding()) { > > Why VSR32 as lower bound? I read in ppc.ad > > 1st 32 VSRs are aliases for the FPRs wich are already defined above. > > Could you please help and explain what this means? > > Why VSR51 as upper bound? > > I'd suggest to update the comment in register_ppc.hpp and explain the vector scalar registers. > What is the difference between vector and vector scalar registers? Thanks for looking at it! VSRs are not separate registers. They contain the regular FPRs (mapped to 0-31) and VRs (mapped to 32-63). FPRs are managed separately while the VRs are not defined elsewhere in the ppc.ad file. There are instructions which operate on VSRs and can access FPRs and VRs. 
This was tricky to implement in hotspot ([JDK-8188139](https://bugs.openjdk.org/browse/JDK-8188139) and many follow-up fixes). Only the VRs VR0-VR19 are volatile (see register_ppc.hpp), so only these ones need spilling. (Same is done for other register types.) VR0-VR19 = VSR32-VSR51 Note that only these ones are currently used by C2 (see `reg_class vs_reg` in ppc.ad). Reason is that we currently don't preserve the non-volatile ones in the Java entry frame. ------------- PR: https://git.openjdk.org/jdk/pull/9453 From rrich at openjdk.org Tue Jul 12 09:04:42 2022 From: rrich at openjdk.org (Richard Reingruber) Date: Tue, 12 Jul 2022 09:04:42 GMT Subject: RFR: 8290082: [PPC64] ZGC C2 load barrier stub needs to preserve vector registers [v2] In-Reply-To: References: Message-ID: On Tue, 12 Jul 2022 08:02:51 GMT, Martin Doerr wrote: > Thanks for looking at it! VSRs are not separate registers. They contain the > regular FPRs (mapped to 0-31) and VRs (mapped to 32-63). FPRs are managed > separately while the VRs are not defined elsewhere in the ppc.ad file. Thanks. I think this should be better explained in register_ppc.hpp. > There are instructions which operate on VSRs and can access FPRs and VRs. This > was tricky to implement in hotspot > ([JDK-8188139](https://bugs.openjdk.org/browse/JDK-8188139) and many follow-up > fixes). Only the VRs VR0-VR19 are volatile (see register_ppc.hpp), so only > these ones need spilling. (Same is done for other register types.) VR0-VR19 = > VSR32-VSR51 > Note that only these ones are currently used by C2 (see `reg_class > vs_reg` in ppc.ad). Reason is that we currently don't preserve the > non-volatile ones in the Java entry frame. I see. VSR52-VSR64 are declared SOC in ppc.ad. Shouldn't they be SOE then? 
------------- PR: https://git.openjdk.org/jdk/pull/9453 From mdoerr at openjdk.org Tue Jul 12 13:35:53 2022 From: mdoerr at openjdk.org (Martin Doerr) Date: Tue, 12 Jul 2022 13:35:53 GMT Subject: Integrated: 8290082: [PPC64] ZGC C2 load barrier stub needs to preserve vector registers In-Reply-To: References: Message-ID: On Mon, 11 Jul 2022 14:09:58 GMT, Martin Doerr wrote: > Preserve volatile vector registers in ZGC C2 load barrier stub. This pull request has now been integrated. Changeset: 393dc7ad Author: Martin Doerr URL: https://git.openjdk.org/jdk/commit/393dc7ade716485f4452d0185caf9e630e4c6139 Stats: 121 lines in 5 files changed: 41 ins; 6 del; 74 mod 8290082: [PPC64] ZGC C2 load barrier stub needs to preserve vector registers Reviewed-by: eosterlund, rrich ------------- PR: https://git.openjdk.org/jdk/pull/9453 From tholenstein at openjdk.org Tue Jul 12 14:14:28 2022 From: tholenstein at openjdk.org (Tobias Holenstein) Date: Tue, 12 Jul 2022 14:14:28 GMT Subject: RFR: JDK-8290016: IGV: Fix graph panning when mouse dragged outside of window Message-ID: <5cJlObLZkQGGa5rj2V-w-bamKwe9od2G7EunMdVY8eM=.47289b58-9475-4cbe-b9a3-5abd47488817@github.com> A graph in IGV can be moved by dragging it with the left mouse button (called panning). ![panning](https://user-images.githubusercontent.com/71546117/178509416-24dd900f-131b-484b-af47-c7a78e791434.png) If the mouse left the visible window of the graph during dragging, the diagram started to move in the opposite direction. This was annoying. Now panning stops as soon as the mouse leaves the window. ![stop reverse panning](https://user-images.githubusercontent.com/71546117/178509309-3df03b7a-ada4-45a3-b9a7-d6e10664033d.png) In selection mode, the graph still moves when the mouse is dragged outside the window, as this is meant to make a larger selection. 
![keep panning for selection](https://user-images.githubusercontent.com/71546117/178509302-74fa41d2-e611-40a3-b6b0-c937ef4b2462.png) ------------- Commit messages: - remove imports - JDK-8290016: IGV: Fix graph panning when mouse dragged outside of window Changes: https://git.openjdk.org/jdk/pull/9470/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=9470&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8290016 Stats: 77 lines in 2 files changed: 23 ins; 20 del; 34 mod Patch: https://git.openjdk.org/jdk/pull/9470.diff Fetch: git fetch https://git.openjdk.org/jdk pull/9470/head:pull/9470 PR: https://git.openjdk.org/jdk/pull/9470 From mdoerr at openjdk.org Tue Jul 12 09:37:35 2022 From: mdoerr at openjdk.org (Martin Doerr) Date: Tue, 12 Jul 2022 09:37:35 GMT Subject: RFR: 8290082: [PPC64] ZGC C2 load barrier stub needs to preserve vector registers [v3] In-Reply-To: References: Message-ID: > Preserve volatile vector registers in ZGC C2 load barrier stub. Martin Doerr has updated the pull request incrementally with one additional commit since the last revision: Update SOE spec for VSR regs. 
Add comment to register_ppc.hpp ------------- Changes: - all: https://git.openjdk.org/jdk/pull/9453/files - new: https://git.openjdk.org/jdk/pull/9453/files/bb0513c1..2d8fa980 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=9453&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=9453&range=01-02 Stats: 40 lines in 2 files changed: 4 ins; 0 del; 36 mod Patch: https://git.openjdk.org/jdk/pull/9453.diff Fetch: git fetch https://git.openjdk.org/jdk pull/9453/head:pull/9453 PR: https://git.openjdk.org/jdk/pull/9453 From fjiang at openjdk.org Tue Jul 12 09:43:16 2022 From: fjiang at openjdk.org (Feilong Jiang) Date: Tue, 12 Jul 2022 09:43:16 GMT Subject: RFR: 8290164: compiler/runtime/TestConstantsInError.java fails on riscv Message-ID: <9z5icyb1IZEfh1-p-yg_jDzHUmPQKSdcY7OG8kCtIz8=.174855d2-265a-448e-adf5-205192d0b9cd@github.com> compiler/runtime/TestConstantsInError.java fails on riscv with the following error: Execution failed: `main' threw exception: java.lang.RuntimeException: 'made not entrant' found in stdout Similar to AArch64, RISCV64 does not patch C1 compiled code (see [JDK-8223613](https://bugs.openjdk.org/browse/JDK-8223613)). So we should add `Platform.isRISCV64` too for the test. According to [JDK-8246494](https://bugs.openjdk.org/browse/JDK-8246494), `vm.flagless` will be excluded from runs w/ any other X / XX flags passed via -vmoption / -javaoption. We added the `-Xmx` option for all jtreg tests, so the failure was not aware before. 
------------- Commit messages: - Add Platform.isRISCV64 for compiler/runtime/TestConstantsInError.java Changes: https://git.openjdk.org/jdk/pull/9463/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=9463&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8290164 Stats: 4 lines in 1 file changed: 0 ins; 0 del; 4 mod Patch: https://git.openjdk.org/jdk/pull/9463.diff Fetch: git fetch https://git.openjdk.org/jdk pull/9463/head:pull/9463 PR: https://git.openjdk.org/jdk/pull/9463 From fyang at openjdk.org Tue Jul 12 09:49:41 2022 From: fyang at openjdk.org (Fei Yang) Date: Tue, 12 Jul 2022 09:49:41 GMT Subject: RFR: 8290164: compiler/runtime/TestConstantsInError.java fails on riscv In-Reply-To: <9z5icyb1IZEfh1-p-yg_jDzHUmPQKSdcY7OG8kCtIz8=.174855d2-265a-448e-adf5-205192d0b9cd@github.com> References: <9z5icyb1IZEfh1-p-yg_jDzHUmPQKSdcY7OG8kCtIz8=.174855d2-265a-448e-adf5-205192d0b9cd@github.com> Message-ID: On Tue, 12 Jul 2022 09:36:37 GMT, Feilong Jiang wrote: > compiler/runtime/TestConstantsInError.java fails on riscv with the following error: > > > Execution failed: `main' threw exception: java.lang.RuntimeException: 'made not entrant' found in stdout > > > Similar to AArch64, RISCV64 does not patch C1 compiled code (see [JDK-8223613](https://bugs.openjdk.org/browse/JDK-8223613)). So we should add `Platform.isRISCV64` too for the test. > > According to [JDK-8246494](https://bugs.openjdk.org/browse/JDK-8246494), `vm.flagless` will be excluded from runs w/ any other X / XX flags passed via -vmoption / -javaoption. We added the `-Xmx` option for all jtreg tests, so the failure was not aware before. Looks good and reasonable. ------------- Marked as reviewed by fyang (Reviewer). 
PR: https://git.openjdk.org/jdk/pull/9463 From mdoerr at openjdk.org Tue Jul 12 09:55:02 2022 From: mdoerr at openjdk.org (Martin Doerr) Date: Tue, 12 Jul 2022 09:55:02 GMT Subject: RFR: 8290082: [PPC64] ZGC C2 load barrier stub needs to preserve vector registers [v4] In-Reply-To: References: Message-ID: On Tue, 12 Jul 2022 09:43:54 GMT, Martin Doerr wrote: >> Preserve volatile vector registers in ZGC C2 load barrier stub. > > Martin Doerr has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains three commits: > > - Update SOE spec for VSR regs. Add comment to register_ppc.hpp > - Avoid using more than the volatile program storage (288 Bytes) on stack below the SP. > - 8290082: [PPC64] ZGC C2 load barrier stub needs to preserve vector registers Thanks for the reviews! I just fixed a typo in a comment. ------------- PR: https://git.openjdk.org/jdk/pull/9453 From rrich at openjdk.org Tue Jul 12 09:55:00 2022 From: rrich at openjdk.org (Richard Reingruber) Date: Tue, 12 Jul 2022 09:55:00 GMT Subject: RFR: 8290082: [PPC64] ZGC C2 load barrier stub needs to preserve vector registers [v3] In-Reply-To: References: Message-ID: On Tue, 12 Jul 2022 09:37:35 GMT, Martin Doerr wrote: >> Preserve volatile vector registers in ZGC C2 load barrier stub. > > Martin Doerr has refreshed the contents of this pull request, and previous commits have been removed. Incremental views are not available. Thanks Martin, your changes looks good to me now. The commenting in register_ppc.hpp could still be improved though. E.g. the comment refers to `v` and `vs` registers but the declared names are `VR` and `VSR`. Probably the declared names should be changed but that's nothing to be done in this pr. Thanks, Richard. ------------- Marked as reviewed by rrich (Reviewer). 
PR: https://git.openjdk.org/jdk/pull/9453

From duke at openjdk.org Tue Jul 12 11:52:26 2022
From: duke at openjdk.org (Bhavana-Kilambi)
Date: Tue, 12 Jul 2022 11:52:26 GMT
Subject: RFR: 8288107: Auto-vectorization for integer min/max
Message-ID:

When Math.min/max is invoked on integer arrays, it generates the CMP-CMOVE instructions instead of vectorizing the loop (if vectorizable and the relevant ISA is available) using the vector equivalents of the min/max instructions. Emitting MaxI/MinI nodes instead of Cmp/CmoveI nodes results in the loop eventually getting vectorized, and the architecture-specific min/max vector instructions are generated.

A test measuring the performance of Math.max/min and StrictMath.max/min is added. On aarch64, the smin/smax instructions are generated when the loop is vectorized. On x86-64, vectorization support for min/max is available only in SSE4 (pmaxsd/pminsd are generated) and AVX version >= 1 (vpmaxsd/vpminsd are generated). This patch generates these instructions only when the loop is vectorized, and generates the usual cmp-cmove instructions when the loop is not vectorizable or when the max/min operations are called outside of a loop.

Performance comparisons for the VectorIntMinMax.java test with and without the patch are given below.

Before this patch:

aarch64:
  Benchmark                         (length)  (seed)  Mode  Cnt     Score   Error  Units
  VectorIntMinMax.testMaxInt            2048       0  avgt   25  1593.510 ± 1.488  ns/op
  VectorIntMinMax.testMinInt            2048       0  avgt   25  1593.123 ± 1.365  ns/op
  VectorIntMinMax.testStrictMaxInt      2048       0  avgt   25  1593.112 ± 0.985  ns/op
  VectorIntMinMax.testStrictMinInt      2048       0  avgt   25  1593.290 ± 1.219  ns/op

x86-64:
  Benchmark                         (length)  (seed)  Mode  Cnt     Score   Error  Units
  VectorIntMinMax.testMaxInt            2048       0  avgt   25  2084.717 ± 4.780  ns/op
  VectorIntMinMax.testMinInt            2048       0  avgt   25  2087.322 ± 4.158  ns/op
  VectorIntMinMax.testStrictMaxInt      2048       0  avgt   25  2084.568 ± 4.838  ns/op
  VectorIntMinMax.testStrictMinInt      2048       0  avgt   25  2086.595 ± 4.025  ns/op

After this patch:

aarch64:
  Benchmark                         (length)  (seed)  Mode  Cnt     Score   Error  Units
  VectorIntMinMax.testMaxInt            2048       0  avgt   25   323.911 ± 0.206  ns/op
  VectorIntMinMax.testMinInt            2048       0  avgt   25   324.084 ± 0.231  ns/op
  VectorIntMinMax.testStrictMaxInt      2048       0  avgt   25   323.892 ± 0.234  ns/op
  VectorIntMinMax.testStrictMinInt      2048       0  avgt   25   323.990 ± 0.295  ns/op

x86-64:
  Benchmark                         (length)  (seed)  Mode  Cnt     Score   Error  Units
  VectorIntMinMax.testMaxInt            2048       0  avgt   25   387.639 ± 0.512  ns/op
  VectorIntMinMax.testMinInt            2048       0  avgt   25   387.999 ± 0.740  ns/op
  VectorIntMinMax.testStrictMaxInt      2048       0  avgt   25   387.605 ± 0.376  ns/op
  VectorIntMinMax.testStrictMinInt      2048       0  avgt   25   387.765 ± 0.498  ns/op

With auto-vectorization, both machines exhibit a significant performance gain: on both, the runtime is ~80% lower than without the patch.

I also ran the patch with -XX:-UseSuperWord to make sure the performance does not degrade in cases where vectorization does not happen. The performance numbers are shown below:

aarch64:
  Benchmark                         (length)  (seed)  Mode  Cnt     Score   Error  Units
  VectorIntMinMax.testMaxInt            2048       0  avgt   25  1449.792 ± 1.072  ns/op
  VectorIntMinMax.testMinInt            2048       0  avgt   25  1450.636 ± 1.057  ns/op
  VectorIntMinMax.testStrictMaxInt      2048       0  avgt   25  1450.214 ± 1.093  ns/op
  VectorIntMinMax.testStrictMinInt      2048       0  avgt   25  1450.615 ± 1.098  ns/op

x86-64:
  Benchmark                         (length)  (seed)  Mode  Cnt     Score   Error  Units
  VectorIntMinMax.testMaxInt            2048       0  avgt   25  2059.673 ± 4.726  ns/op
  VectorIntMinMax.testMinInt            2048       0  avgt   25  2059.853 ± 4.754  ns/op
  VectorIntMinMax.testStrictMaxInt      2048       0  avgt   25  2059.920 ± 4.658  ns/op
  VectorIntMinMax.testStrictMinInt      2048       0  avgt   25  2059.622 ± 4.768  ns/op

There is no degradation when vectorization is disabled.
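
For illustration, the loops in question have roughly the following shape. This is a minimal sketch, not the VectorIntMinMax benchmark from the patch: the class and method names are hypothetical, and whether C2 actually vectorizes a given loop depends on the platform, the available ISA, and the JVM flags in use.

```java
import java.util.Arrays;

// Hypothetical example class; not part of the patch under review.
public class IntMinMaxLoops {

    // Element-wise maximum of two int arrays: a counted loop whose body
    // is a single Math.max call on array elements.
    static void maxInto(int[] a, int[] b, int[] out) {
        for (int i = 0; i < out.length; i++) {
            out[i] = Math.max(a[i], b[i]);
        }
    }

    // Element-wise minimum, the same loop shape with Math.min.
    static void minInto(int[] a, int[] b, int[] out) {
        for (int i = 0; i < out.length; i++) {
            out[i] = Math.min(a[i], b[i]);
        }
    }

    public static void main(String[] args) {
        int[] a = {1, -5, 7, 0};
        int[] b = {3, -9, 7, -1};
        int[] out = new int[a.length];

        maxInto(a, b, out);
        System.out.println(Arrays.toString(out)); // [3, -5, 7, 0]

        minInto(a, b, out);
        System.out.println(Arrays.toString(out)); // [1, -9, 7, -1]
    }
}
```

The results are the same with or without vectorization; only the generated machine code differs, which is why the effect shows up in benchmark timings rather than in program output.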
This patch also implements Ideal transformations for MaxINode, similar to the ones defined for MinINode, to optimize a couple of commonly occurring patterns such as:

  MaxI(x + c0, MaxI(y + c1, z)) ==> MaxI(AddI(x, MAX2(c0, c1)), z)   when x == y
  MaxI(x + c0, y + c1) ==> AddI(x, MAX2(c0, c1))                     when x == y

-------------

Commit messages:
 - 8288107: Auto-vectorization for integer min/max

Changes: https://git.openjdk.org/jdk/pull/9466/files
 Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=9466&range=00
  Issue: https://bugs.openjdk.org/browse/JDK-8288107
  Stats: 561 lines in 7 files changed: 384 ins; 171 del; 6 mod
  Patch: https://git.openjdk.org/jdk/pull/9466.diff
  Fetch: git fetch https://git.openjdk.org/jdk pull/9466/head:pull/9466

PR: https://git.openjdk.org/jdk/pull/9466

From tholenstein at openjdk.org Tue Jul 12 13:29:18 2022
From: tholenstein at openjdk.org (Tobias Holenstein)
Date: Tue, 12 Jul 2022 13:29:18 GMT
Subject: RFR: JDK-8290069: IGV: Highlight both graphs of difference in outline
Message-ID:

Previously, IGV highlighted only one graph in the outline when a difference graph was selected using the sliders. Now, IGV highlights both graphs used to calculate the difference graph when they are in the same group.

highlight both graphs

IGV colors the nodes in a difference graph yellow/red/green to highlight the changes. This only worked if the difference graph was calculated using the sliders. Now, difference graphs are also colored when calculated via the context menu "Difference to current graph" in the outline.

Show colors

-------------

Commit messages:
 - JDK-8290069: IGV: Highlight both graphs of difference in outline

Changes: https://git.openjdk.org/jdk/pull/9468/files
 Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=9468&range=00
  Issue: https://bugs.openjdk.org/browse/JDK-8290069
  Stats: 36 lines in 5 files changed: 27 ins; 0 del; 9 mod
  Patch: https://git.openjdk.org/jdk/pull/9468.diff
  Fetch: git fetch https://git.openjdk.org/jdk pull/9468/head:pull/9468

PR: https://git.openjdk.org/jdk/pull/9468

From mdoerr at openjdk.org Tue Jul 12 09:43:54 2022
From: mdoerr at openjdk.org (Martin Doerr)
Date: Tue, 12 Jul 2022 09:43:54 GMT
Subject: RFR: 8290082: [PPC64] ZGC C2 load barrier stub needs to preserve vector registers [v4]
In-Reply-To:
References:
Message-ID:

> Preserve volatile vector registers in ZGC C2 load barrier stub.

Martin Doerr has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains three commits:

 - Update SOE spec for VSR regs. Add comment to register_ppc.hpp
 - Avoid using more than the volatile program storage (288 Bytes) on stack below the SP.
 - 8290082: [PPC64] ZGC C2 load barrier stub needs to preserve vector registers

-------------

Changes: https://git.openjdk.org/jdk/pull/9453/files
 Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=9453&range=03
  Stats: 121 lines in 5 files changed: 41 ins; 6 del; 74 mod
  Patch: https://git.openjdk.org/jdk/pull/9453.diff
  Fetch: git fetch https://git.openjdk.org/jdk pull/9453/head:pull/9453

PR: https://git.openjdk.org/jdk/pull/9453

From mdoerr at openjdk.org Tue Jul 12 09:46:16 2022
From: mdoerr at openjdk.org (Martin Doerr)
Date: Tue, 12 Jul 2022 09:46:16 GMT
Subject: RFR: 8290082: [PPC64] ZGC C2 load barrier stub needs to preserve vector registers [v2]
In-Reply-To:
References:
Message-ID:

On Tue, 12 Jul 2022 09:00:44 GMT, Richard Reingruber wrote:

>> Thanks for looking at it!
>> VSRs are not separate registers.
They contain the regular FPRs (mapped to 0-31) and VRs (mapped to 32-63). FPRs are managed separately while the VRs are not defined elsewhere in the ppc.ad file. There are instructions which operate on VSRs and can access FPRs and VRs. This was tricky to implement in hotspot ([JDK-8188139](https://bugs.openjdk.org/browse/JDK-8188139) and many follow-up fixes). >> Only the VRs VR0-VR19 are volatile (see register_ppc.hpp), so only these ones need spilling. (Same is done for other register types.) >> VR0-VR19 = VSR32-VSR51 >> Note that only these ones are currently used by C2 (see `reg_class vs_reg` in ppc.ad). Reason is that we currently don't preserve the non-volatile ones in the Java entry frame. > >> Thanks for looking at it! VSRs are not separate registers. They contain the >> regular FPRs (mapped to 0-31) and VRs (mapped to 32-63). FPRs are managed >> separately while the VRs are not defined elsewhere in the ppc.ad file. > > Thanks. I think this should be better explained in register_ppc.hpp. > >> There are instructions which operate on VSRs and can access FPRs and VRs. This >> was tricky to implement in hotspot >> ([JDK-8188139](https://bugs.openjdk.org/browse/JDK-8188139) and many follow-up >> fixes). Only the VRs VR0-VR19 are volatile (see register_ppc.hpp), so only >> these ones need spilling. (Same is done for other register types.) VR0-VR19 = >> VSR32-VSR51 >> Note that only these ones are currently used by C2 (see `reg_class >> vs_reg` in ppc.ad). Reason is that we currently don't preserve the >> non-volatile ones in the Java entry frame. > > I see. VSR52-VSR64 are declared SOC in ppc.ad. Shouldn't they be SOE then? I've added a comment to register_ppc.hpp. Right, they should be SOE. Changed. Note that this doesn't have any effect because the SOE registers are not allocated by C2. But should get fixed to avoid confusion and for possible future usage. 
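
The register numbering discussed in this thread can be summarized in a small sketch. This is illustrative Java, not HotSpot code, and the class and method names are hypothetical. It encodes only what the thread states: VSR0-VSR31 overlay the FPRs, VSR32-VSR63 overlay the VRs, and VR0-VR19 (that is, VSR32-VSR51) are the volatile vector registers that the stub has to preserve.

```java
// Hypothetical helper for illustration only; not part of HotSpot.
public class PpcVsrMapping {
    static final int NUM_VSRS = 64;          // VSR0..VSR63
    static final int FIRST_VR_VSR = 32;      // VSR32 is the same register as VR0
    static final int LAST_VOLATILE_VR = 19;  // VR0..VR19 are volatile per the thread

    private static void checkRange(int vsr) {
        if (vsr < 0 || vsr >= NUM_VSRS) {
            throw new IllegalArgumentException("not a VSR encoding: " + vsr);
        }
    }

    // VSR0..VSR31 overlay the floating-point registers.
    static boolean aliasesFpr(int vsr) {
        checkRange(vsr);
        return vsr < FIRST_VR_VSR;
    }

    // Index of the VR that a VSR in the upper half corresponds to.
    static int vrIndex(int vsr) {
        checkRange(vsr);
        if (vsr < FIRST_VR_VSR) {
            throw new IllegalArgumentException("VSR" + vsr + " overlays an FPR, not a VR");
        }
        return vsr - FIRST_VR_VSR;
    }

    // True for VSR32..VSR51, the range tested by the quoted stub code.
    static boolean isVolatileVectorRegister(int vsr) {
        checkRange(vsr);
        return vsr >= FIRST_VR_VSR && vrIndex(vsr) <= LAST_VOLATILE_VR;
    }

    public static void main(String[] args) {
        for (int vsr : new int[] {0, 31, 32, 51, 52, 63}) {
            System.out.println("VSR" + vsr
                + " aliasesFpr=" + aliasesFpr(vsr)
                + " volatileVector=" + isVolatileVectorRegister(vsr));
        }
    }
}
```

The bounds in `isVolatileVectorRegister` mirror the `VSR32`/`VSR51` comparison in the review comment; the FPR half is handled separately, as the thread notes for other register types.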
------------- PR: https://git.openjdk.org/jdk/pull/9453 From mdoerr at openjdk.org Tue Jul 12 09:54:59 2022 From: mdoerr at openjdk.org (Martin Doerr) Date: Tue, 12 Jul 2022 09:54:59 GMT Subject: RFR: 8290082: [PPC64] ZGC C2 load barrier stub needs to preserve vector registers [v5] In-Reply-To: References: Message-ID: > Preserve volatile vector registers in ZGC C2 load barrier stub. Martin Doerr has updated the pull request incrementally with one additional commit since the last revision: Fix typo in comment. ------------- Changes: - all: https://git.openjdk.org/jdk/pull/9453/files - new: https://git.openjdk.org/jdk/pull/9453/files/f6d238ed..fab3fa4a Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=9453&range=04 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=9453&range=03-04 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/9453.diff Fetch: git fetch https://git.openjdk.org/jdk pull/9453/head:pull/9453 PR: https://git.openjdk.org/jdk/pull/9453 From kvn at openjdk.org Tue Jul 12 21:27:55 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Tue, 12 Jul 2022 21:27:55 GMT Subject: RFR: 8290066: Remove KNL specific handling for new CPU target check in IR annotation [v2] In-Reply-To: References: Message-ID: On Mon, 11 Jul 2022 21:03:30 GMT, Jatin Bhateja wrote: >> - Newly added annotations query the CPU feature using white box API which returns the list of features enabled during VM initialization. >> - With JVM flag UseKNLSetting, during VM initialization AVX512 features not supported by KNL target are disabled, thus we do not need any special handling for KNL in newly introduced IR annotations (applyCPUFeature, applyCPUFeatureOr, applyCPUFeatureAnd). >> >> Please review and share your feedback. >> >> Best Regards, >> Jatin > > Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: > > 8290066: Removing newly added white listed options. Testing passed. 
------------- Marked as reviewed by kvn (Reviewer). PR: https://git.openjdk.org/jdk/pull/9452 From kvn at openjdk.org Tue Jul 12 21:34:49 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Tue, 12 Jul 2022 21:34:49 GMT Subject: [jdk19] RFR: 8288112: C2: Error: ShouldNotReachHere() in Type::typerr() [v4] In-Reply-To: References: Message-ID: <8HjAKIYovRJUxTw3ycaj-VO-P9K_JsL3ONmKQcBIT6k=.c260f0a9-aaea-4fe6-9e4a-1ec582a492e4@github.com> On Tue, 12 Jul 2022 08:03:50 GMT, Jatin Bhateja wrote: >> [JDK-8284960](https://bugs.openjdk.org/browse/JDK-8284960) added new vector IR nodes and target specific backend support for ReverseByte vector operations. >> For each scalar operation, auto-vectorizer analysis stage checks for existence of vector IR opcode and target specific backend implementation. While processing scalar IR nodes corresponding to Java SE APIs [Short/Character/Integer/Long].reverseBytes SLP analysis checks passes since relevant support already existed. This bug fix patch handles missing scalar reversebyte opcode checks in SLP backed to enable creation of corresponding vector IR nodes. >> >> A new JBS issue [JDK-8290034](https://bugs.openjdk.org/browse/JDK-8290034) is created to add the missing auto-vectorization support for bit reverse operation targeting JDK mainline. >> >> Kindly review and share your feedback. >> >> Best Regards, >> Jatin > > Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: > > 8288112: Minor adjustment to benchmark. Tests copyright header validation failed because you forgot `,` after `2022` year in TestLongVect.java and TestShortVect.java. Otherwise testing results are good. 
------------- PR: https://git.openjdk.org/jdk19/pull/128 From jbhateja at openjdk.org Wed Jul 13 02:17:47 2022 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Wed, 13 Jul 2022 02:17:47 GMT Subject: RFR: 8290066: Remove KNL specific handling for new CPU target check in IR annotation [v2] In-Reply-To: References: Message-ID: On Mon, 11 Jul 2022 21:03:30 GMT, Jatin Bhateja wrote: >> - Newly added annotations query the CPU feature using white box API which returns the list of features enabled during VM initialization. >> - With JVM flag UseKNLSetting, during VM initialization AVX512 features not supported by KNL target are disabled, thus we do not need any special handling for KNL in newly introduced IR annotations (applyCPUFeature, applyCPUFeatureOr, applyCPUFeatureAnd). >> >> Please review and share your feedback. >> >> Best Regards, >> Jatin > > Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: > > 8290066: Removing newly added white listed options. Hi @kvn do we need second approval for this. Facing integration issue. ------------- PR: https://git.openjdk.org/jdk/pull/9452 From jbhateja at openjdk.org Wed Jul 13 04:58:57 2022 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Wed, 13 Jul 2022 04:58:57 GMT Subject: [jdk19] RFR: 8288112: C2: Error: ShouldNotReachHere() in Type::typerr() [v5] In-Reply-To: References: Message-ID: > [JDK-8284960](https://bugs.openjdk.org/browse/JDK-8284960) added new vector IR nodes and target specific backend support for ReverseByte vector operations. > For each scalar operation, auto-vectorizer analysis stage checks for existence of vector IR opcode and target specific backend implementation. While processing scalar IR nodes corresponding to Java SE APIs [Short/Character/Integer/Long].reverseBytes SLP analysis checks passes since relevant support already existed. 
This bug fix patch handles missing scalar reversebyte opcode checks in SLP backed to enable creation of corresponding vector IR nodes. > > A new JBS issue [JDK-8290034](https://bugs.openjdk.org/browse/JDK-8290034) is created to add the missing auto-vectorization support for bit reverse operation targeting JDK mainline. > > Kindly review and share your feedback. > > Best Regards, > Jatin Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: 8288112: Correcting type on copyright header. ------------- Changes: - all: https://git.openjdk.org/jdk19/pull/128/files - new: https://git.openjdk.org/jdk19/pull/128/files/43bfa40d..eb517046 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk19&pr=128&range=04 - incr: https://webrevs.openjdk.org/?repo=jdk19&pr=128&range=03-04 Stats: 2 lines in 2 files changed: 0 ins; 0 del; 2 mod Patch: https://git.openjdk.org/jdk19/pull/128.diff Fetch: git fetch https://git.openjdk.org/jdk19 pull/128/head:pull/128 PR: https://git.openjdk.org/jdk19/pull/128 From jbhateja at openjdk.org Wed Jul 13 04:58:59 2022 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Wed, 13 Jul 2022 04:58:59 GMT Subject: [jdk19] RFR: 8288112: C2: Error: ShouldNotReachHere() in Type::typerr() [v4] In-Reply-To: References: Message-ID: On Tue, 12 Jul 2022 08:03:50 GMT, Jatin Bhateja wrote: >> [JDK-8284960](https://bugs.openjdk.org/browse/JDK-8284960) added new vector IR nodes and target specific backend support for ReverseByte vector operations. >> For each scalar operation, auto-vectorizer analysis stage checks for existence of vector IR opcode and target specific backend implementation. While processing scalar IR nodes corresponding to Java SE APIs [Short/Character/Integer/Long].reverseBytes SLP analysis checks passes since relevant support already existed. This bug fix patch handles missing scalar reversebyte opcode checks in SLP backed to enable creation of corresponding vector IR nodes. 
>> >> A new JBS issue [JDK-8290034](https://bugs.openjdk.org/browse/JDK-8290034) is created to add the missing auto-vectorization support for bit reverse operation targeting JDK mainline. >> >> Kindly review and share your feedback. >> >> Best Regards, >> Jatin > > Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: > > 8288112: Minor adjustment to benchmark. > Thanks, fixed now ------------- PR: https://git.openjdk.org/jdk19/pull/128 From kvn at openjdk.org Wed Jul 13 07:08:46 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Wed, 13 Jul 2022 07:08:46 GMT Subject: [jdk19] RFR: 8288112: C2: Error: ShouldNotReachHere() in Type::typerr() [v5] In-Reply-To: References: Message-ID: On Wed, 13 Jul 2022 04:58:57 GMT, Jatin Bhateja wrote: >> [JDK-8284960](https://bugs.openjdk.org/browse/JDK-8284960) added new vector IR nodes and target specific backend support for ReverseByte vector operations. >> For each scalar operation, auto-vectorizer analysis stage checks for existence of vector IR opcode and target specific backend implementation. While processing scalar IR nodes corresponding to Java SE APIs [Short/Character/Integer/Long].reverseBytes SLP analysis checks passes since relevant support already existed. This bug fix patch handles missing scalar reversebyte opcode checks in SLP backed to enable creation of corresponding vector IR nodes. >> >> A new JBS issue [JDK-8290034](https://bugs.openjdk.org/browse/JDK-8290034) is created to add the missing auto-vectorization support for bit reverse operation targeting JDK mainline. >> >> Kindly review and share your feedback. >> >> Best Regards, >> Jatin > > Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: > > 8288112: Correcting type on copyright header. Good. ------------- Marked as reviewed by kvn (Reviewer). 
PR: https://git.openjdk.org/jdk19/pull/128 From kvn at openjdk.org Wed Jul 13 07:09:49 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Wed, 13 Jul 2022 07:09:49 GMT Subject: RFR: 8290066: Remove KNL specific handling for new CPU target check in IR annotation [v2] In-Reply-To: References: Message-ID: On Wed, 13 Jul 2022 02:11:06 GMT, Jatin Bhateja wrote: > Hi @kvn do we need second approval for this. Facing integration issue. It is known issue which is investigated. Yes, you need second review. ------------- PR: https://git.openjdk.org/jdk/pull/9452 From dnsimon at openjdk.org Wed Jul 13 12:40:44 2022 From: dnsimon at openjdk.org (Doug Simon) Date: Wed, 13 Jul 2022 12:40:44 GMT Subject: RFR: 8290234: [JVMCI] use JVMCIKlassHandle to protect raw Klass* values from concurrent G1 scanning Message-ID: JVMCI Java code must never read a raw `Klass*` value from memory (using `Unsafe`) that is not already known to be wrapped in a `HotSpotResolvedObjectTypeImpl` without going through a VM call. The VM call is necessary so that the `Klass*` is handlized in a `JVMCIKlassHandle` to protect it from the concurrent scanning done by G1. This PR re-introduces the VM calls that were mistakenly optimized away in [JDK-8289094](https://bugs.openjdk.org/browse/JDK-8289094). 
-------------

Commit messages:
 - use JVMCIKlassHandle to protect raw Klass* values from concurrent G1 scanning

Changes: https://git.openjdk.org/jdk/pull/9480/files
 Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=9480&range=00
  Issue: https://bugs.openjdk.org/browse/JDK-8290234
  Stats: 21 lines in 3 files changed: 8 ins; 5 del; 8 mod
  Patch: https://git.openjdk.org/jdk/pull/9480.diff
  Fetch: git fetch https://git.openjdk.org/jdk pull/9480/head:pull/9480

PR: https://git.openjdk.org/jdk/pull/9480

From duke at openjdk.org  Wed Jul 13 13:19:01 2022
From: duke at openjdk.org (Bhavana-Kilambi)
Date: Wed, 13 Jul 2022 13:19:01 GMT
Subject: RFR: 8288107: Auto-vectorization for integer min/max
In-Reply-To:
References:
Message-ID:

On Tue, 12 Jul 2022 11:45:28 GMT, Bhavana-Kilambi wrote:

> When Math.min/max is invoked on integer arrays, it generates the CMP-CMOVE instructions instead of vectorizing the loop(if vectorizable and relevant ISA is available) using vector equivalent of min/max instructions. Emitting MaxI/MinI nodes instead of Cmp/CmoveI nodes results in the loop getting vectorized eventually and the architecture specific min/max vector instructions are generated.
> A test for the same to test the performance of Math.max/min and StrictMath.max/min is added. On aarch64, the smin/smax instructions are generated when the loop is vectorized. On x86-64, vectorization support for min/max is available only in SSE4 (pmaxsd/pminsd are generated) and AVX version >= 1 (vpmaxsd/vpminsd are generated). This patch generates these instructions only when the loop is vectorized and generates the usual cmp-cmove instructions when the loop is not vectorizable or when the max/min operations are called outside of the loop. Performance comparisons for the VectorIntMinMax.java test with and without the patch are given below :
>
> Before this patch:
> aarch64:
> Benchmark                         (length)  (seed)  Mode  Cnt     Score   Error  Units
> VectorIntMinMax.testMaxInt            2048       0  avgt   25  1593.510 ± 1.488  ns/op
> VectorIntMinMax.testMinInt            2048       0  avgt   25  1593.123 ± 1.365  ns/op
> VectorIntMinMax.testStrictMaxInt      2048       0  avgt   25  1593.112 ± 0.985  ns/op
> VectorIntMinMax.testStrictMinInt      2048       0  avgt   25  1593.290 ± 1.219  ns/op
>
> x86-64:
> Benchmark                         (length)  (seed)  Mode  Cnt     Score   Error  Units
> VectorIntMinMax.testMaxInt            2048       0  avgt   25  2084.717 ± 4.780  ns/op
> VectorIntMinMax.testMinInt            2048       0  avgt   25  2087.322 ± 4.158  ns/op
> VectorIntMinMax.testStrictMaxInt      2048       0  avgt   25  2084.568 ± 4.838  ns/op
> VectorIntMinMax.testStrictMinInt      2048       0  avgt   25  2086.595 ± 4.025  ns/op
>
> After this patch:
> aarch64:
> Benchmark                         (length)  (seed)  Mode  Cnt     Score   Error  Units
> VectorIntMinMax.testMaxInt            2048       0  avgt   25   323.911 ± 0.206  ns/op
> VectorIntMinMax.testMinInt            2048       0  avgt   25   324.084 ± 0.231  ns/op
> VectorIntMinMax.testStrictMaxInt      2048       0  avgt   25   323.892 ± 0.234  ns/op
> VectorIntMinMax.testStrictMinInt      2048       0  avgt   25   323.990 ± 0.295  ns/op
>
> x86-64:
> Benchmark                         (length)  (seed)  Mode  Cnt     Score   Error  Units
> VectorIntMinMax.testMaxInt            2048       0  avgt   25   387.639 ± 0.512  ns/op
> VectorIntMinMax.testMinInt            2048       0  avgt   25   387.999 ± 0.740  ns/op
> VectorIntMinMax.testStrictMaxInt      2048       0  avgt   25   387.605 ± 0.376  ns/op
> VectorIntMinMax.testStrictMinInt      2048       0  avgt   25   387.765 ± 0.498  ns/op
>
> With autovectorization, both the machines exhibit a significant performance gain. On both the machines the runtime is ~80% better than the case without the patch. Also ran the patch with -XX:-UseSuperWord to make sure the performance does not degrade in cases where vectorization does not happen. The performance numbers are shown below :
> aarch64:
> Benchmark                         (length)  (seed)  Mode  Cnt     Score   Error  Units
> VectorIntMinMax.testMaxInt            2048       0  avgt   25  1449.792 ± 1.072  ns/op
> VectorIntMinMax.testMinInt            2048       0  avgt   25  1450.636 ± 1.057  ns/op
> VectorIntMinMax.testStrictMaxInt      2048       0  avgt   25  1450.214 ± 1.093  ns/op
> VectorIntMinMax.testStrictMinInt      2048       0  avgt   25  1450.615 ± 1.098  ns/op
>
> x86-64:
> Benchmark                         (length)  (seed)  Mode  Cnt     Score   Error  Units
> VectorIntMinMax.testMaxInt            2048       0  avgt   25  2059.673 ± 4.726  ns/op
> VectorIntMinMax.testMinInt            2048       0  avgt   25  2059.853 ± 4.754  ns/op
> VectorIntMinMax.testStrictMaxInt      2048       0  avgt   25  2059.920 ± 4.658  ns/op
> VectorIntMinMax.testStrictMinInt      2048       0  avgt   25  2059.622 ± 4.768  ns/op
> There is no degradation when vectorization is disabled.
>
> This patch also implements Ideal transformations for the MaxINode which are similar to the ones defined for the MinINode to transform/optimize a couple of commonly occurring patterns such as -
> MaxI(x + c0, MaxI(y + c1, z)) ==> MaxI(AddI(x, MAX2(c0, c1)), z) when x == y
> MaxI(x + c0, y + c1) ==> AddI(x, MAX2(c0,c1)) when x == y

Hi, Thank you for your review. I have answered your questions below :

**- Is MaxINode::Ideal transformation is required for auto-vectorization?**
No, it is not required but these transformations are good to have as MaxI node does not have fundamental transformations defined, for ex - optimizing patterns like Max(a, a). So I have added the same transformations as MinINode so that it handles the above pattern and a couple of others. I bundled this together with the auto-vectorization code as I felt it would be good to have some basic optimizations defined for MaxINode (MinINode already has a few), which could help improve overall performance of this patch for certain cases. I can put it in a separate PR if that's better.

**- I don't see how it will "generates these instructions only when the loop is vectorized and generates the usual cmp-cmove instructions when the loop is not vectorizable or when the max/min operations are called outside of the loop."**
I have performed testing mainly on aarch64 and x86 machines. With this patch, I could see vector equivalent instructions for min/max operations (as described in the commit message) for vectorizable loops.
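[Editorial sketch, not part of the original message.] For readers who want to picture the kind of loop being discussed, here is a minimal Java example of an element-wise `Math.max`/`Math.min` loop over `int` arrays. The class and method names (`IntMinMaxLoop`, `maxInt`, `minInt`) are hypothetical and this is not the source of the `VectorIntMinMax` benchmark; whether C2 actually vectorizes such a loop depends on the JDK build, the CPU features, and flags such as `-XX:+UseSuperWord`.

```java
import java.util.Arrays;

public class IntMinMaxLoop {
    // Element-wise maximum of two int arrays (assumed to have equal length).
    // A simple counted loop with no cross-iteration dependency.
    static int[] maxInt(int[] a, int[] b) {
        int[] c = new int[a.length];
        for (int i = 0; i < a.length; i++) {
            c[i] = Math.max(a[i], b[i]);
        }
        return c;
    }

    // Element-wise minimum, same loop shape.
    static int[] minInt(int[] a, int[] b) {
        int[] c = new int[a.length];
        for (int i = 0; i < a.length; i++) {
            c[i] = Math.min(a[i], b[i]);
        }
        return c;
    }

    public static void main(String[] args) {
        int[] a = {3, -7, 10, 0};
        int[] b = {4, -9, 10, -1};
        System.out.println(Arrays.toString(maxInt(a, b))); // [4, -7, 10, 0]
        System.out.println(Arrays.toString(minInt(a, b))); // [3, -9, 10, -1]
    }
}
```

The example only fixes the loop shape and the expected results; it makes no claim about which machine instructions a given JVM emits for it.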
For other non-vectorizable or scalar versions, the MinI/MaxI nodes in the respective *.ad files translate to compare-conditional-move instructions.

- As for the code generation for the Arrays.copyOfRange() (from the comment on JDK-8039104), these are my findings on an x86_64 machine :
Without my patch, it generates a cmp-cmovl sequence for the 'min' operation in copyOfRange() while with my patch, it generates a cmp-cmovg. The assembly listing is shown below for both the cases -

Without my patch :
. . . .
0x00007fb3890b05ac: mov %ecx,%r11d
0x00007fb3890b05af: sub %edx,%r11d             ;*isub {reexecute=0 rethrow=0 return_oop=0}
                                               ; - java.util.Arrays::copyOfRange@2 (line 3819)
0x00007fb3890b05b2: test %r11d,%r11d
0x00007fb3890b05b5: jl 0x00007fb3890b0824      ;*ifge {reexecute=0 rethrow=0 return_oop=0}
                                               ; - java.util.Arrays::copyOfRange@5 (line 3820)
. . . . .
0x00007fb3890b05d0: mov %rsi,%rcx
0x00007fb3890b05d3: mov 0xc(%rsi),%r8d         ; implicit exception: dispatches to 0x00007fb3890b083c
                                               ;*arraylength {reexecute=0 rethrow=0 return_oop=0}
                                               ; - java.util.Arrays::copyOfRange@51 (line 3823)
0x00007fb3890b05d7: mov %edx,%r10d
. . . . .
0x00007fb3890b05e6: mov %r8d,%ebx
0x00007fb3890b05e9: sub %edx,%ebx              ;*isub {reexecute=0 rethrow=0 return_oop=0}
                                               ; - java.util.Arrays::copyOfRange@53 (line 3823)
**0x00007fb3890b05eb: cmp %r11d,%ebx**
0x00007fb3890b05ee: mov %r11d,%ebp
**0x00007fb3890b05f1: cmovl %ebx,%ebp          ;*invokestatic min {reexecute=0 rethrow=0 return_oop=0}
                                               ; - java.util.Arrays::copyOfRange@55 (line 3824)**

With my patch :
. . . . .
0x00007efce90b10ac: mov %ecx,%r11d
0x00007efce90b10af: sub %edx,%r11d             ;*isub {reexecute=0 rethrow=0 return_oop=0}
                                               ; - java.util.Arrays::copyOfRange@2 (line 3819)
0x00007efce90b10b2: test %r11d,%r11d
0x00007efce90b10b5: jl 0x00007efce90b1324      ;*ifge {reexecute=0 rethrow=0 return_oop=0}
                                               ; - java.util.Arrays::copyOfRange@5 (line 3820)
. . . . .
0x00007efce90b10d0: mov %rsi,%rcx
0x00007efce90b10d3: mov 0xc(%rsi),%r8d         ; implicit exception: dispatches to 0x00007efce90b133c
                                               ;*arraylength {reexecute=0 rethrow=0 return_oop=0}
                                               ; - java.util.Arrays::copyOfRange@51 (line 3823)
0x00007efce90b10d7: mov %edx,%r10d
. . . . .
0x00007efce90b10e6: mov %r8d,%ebp
0x00007efce90b10e9: sub %edx,%ebp
**0x00007efce90b10eb: cmp %r11d,%ebp
0x00007efce90b10ee: cmovg %r11d,%ebp           ;*invokestatic min {reexecute=0 rethrow=0 return_oop=0}
                                               ; - java.util.Arrays::copyOfRange@55 (line 3824)**

Although I suspected there should not be any degradation with my patch, I tried to do a quick bench-marking of this testcase with and without my patch to confirm, and here are the results -

With my patch -
ByteArrMinMax.copyRange  2048  0  avgt  30  8.297 ± 0.305  ns/op

Without my patch -
ByteArrMinMax.copyRange  2048  0  avgt  30  8.446 ± 0.105  ns/op

There isn't much difference in performance between the two cases. The above tests were run with -XX:-TieredCompilation flag to ensure the copyOfRange is being compiled by c2.

-------------

PR: https://git.openjdk.org/jdk/pull/9466

From kvn at openjdk.org  Wed Jul 13 14:44:04 2022
From: kvn at openjdk.org (Vladimir Kozlov)
Date: Wed, 13 Jul 2022 14:44:04 GMT
Subject: RFR: 8290234: [JVMCI] use JVMCIKlassHandle to protect raw Klass* values from concurrent G1 scanning
In-Reply-To:
References:
Message-ID:

On Wed, 13 Jul 2022 12:24:50 GMT, Doug Simon wrote:

> JVMCI Java code must never read a raw `Klass*` value from memory (using `Unsafe`) that is not already known to be wrapped in a `HotSpotResolvedObjectTypeImpl` without going through a VM call. The VM call is necessary so that the `Klass*` is handlized in a `JVMCIKlassHandle` to protect it from the concurrent scanning done by G1. This PR re-introduces the VM calls that were mistakenly optimized away in [JDK-8289094](https://bugs.openjdk.org/browse/JDK-8289094).

Good.

-------------

Marked as reviewed by kvn (Reviewer).

PR: https://git.openjdk.org/jdk/pull/9480

From kvn at openjdk.org  Wed Jul 13 14:45:04 2022
From: kvn at openjdk.org (Vladimir Kozlov)
Date: Wed, 13 Jul 2022 14:45:04 GMT
Subject: RFR: 8288107: Auto-vectorization for integer min/max
In-Reply-To:
References:
Message-ID:

On Tue, 12 Jul 2022 11:45:28 GMT, Bhavana-Kilambi wrote:

> When Math.min/max is invoked on integer arrays, it generates the CMP-CMOVE instructions instead of vectorizing the loop(if vectorizable and relevant ISA is available) using vector equivalent of min/max instructions. Emitting MaxI/MinI nodes instead of Cmp/CmoveI nodes results in the loop getting vectorized eventually and the architecture specific min/max vector instructions are generated.
> A test for the same to test the performance of Math.max/min and StrictMath.max/min is added. On aarch64, the smin/smax instructions are generated when the loop is vectorized. On x86-64, vectorization support for min/max is available only in SSE4 (pmaxsd/pminsd are generated) and AVX version >= 1 (vpmaxsd/vpminsd are generated). This patch generates these instructions only when the loop is vectorized and generates the usual cmp-cmove instructions when the loop is not vectorizable or when the max/min operations are called outside of the loop. Performance comparisons for the VectorIntMinMax.java test with and without the patch are given below :
>
> Before this patch:
> aarch64:
> Benchmark                         (length)  (seed)  Mode  Cnt     Score   Error  Units
> VectorIntMinMax.testMaxInt            2048       0  avgt   25  1593.510 ± 1.488  ns/op
> VectorIntMinMax.testMinInt            2048       0  avgt   25  1593.123 ± 1.365  ns/op
> VectorIntMinMax.testStrictMaxInt      2048       0  avgt   25  1593.112 ± 0.985  ns/op
> VectorIntMinMax.testStrictMinInt      2048       0  avgt   25  1593.290 ± 1.219  ns/op
>
> x86-64:
> Benchmark                         (length)  (seed)  Mode  Cnt     Score   Error  Units
> VectorIntMinMax.testMaxInt            2048       0  avgt   25  2084.717 ± 4.780  ns/op
> VectorIntMinMax.testMinInt            2048       0  avgt   25  2087.322 ± 4.158  ns/op
> VectorIntMinMax.testStrictMaxInt      2048       0  avgt   25  2084.568 ± 4.838  ns/op
> VectorIntMinMax.testStrictMinInt      2048       0  avgt   25  2086.595 ± 4.025  ns/op
>
> After this patch:
> aarch64:
> Benchmark                         (length)  (seed)  Mode  Cnt     Score   Error  Units
> VectorIntMinMax.testMaxInt            2048       0  avgt   25   323.911 ± 0.206  ns/op
> VectorIntMinMax.testMinInt            2048       0  avgt   25   324.084 ± 0.231  ns/op
> VectorIntMinMax.testStrictMaxInt      2048       0  avgt   25   323.892 ± 0.234  ns/op
> VectorIntMinMax.testStrictMinInt      2048       0  avgt   25   323.990 ± 0.295  ns/op
>
> x86-64:
> Benchmark                         (length)  (seed)  Mode  Cnt     Score   Error  Units
> VectorIntMinMax.testMaxInt            2048       0  avgt   25   387.639 ± 0.512  ns/op
> VectorIntMinMax.testMinInt            2048       0  avgt   25   387.999 ± 0.740  ns/op
> VectorIntMinMax.testStrictMaxInt      2048       0  avgt   25   387.605 ± 0.376  ns/op
> VectorIntMinMax.testStrictMinInt      2048       0  avgt   25   387.765 ± 0.498  ns/op
>
> With autovectorization, both the machines exhibit a significant performance gain. On both the machines the runtime is ~80% better than the case without the patch. Also ran the patch with -XX:-UseSuperWord to make sure the performance does not degrade in cases where vectorization does not happen. The performance numbers are shown below :
> aarch64:
> Benchmark                         (length)  (seed)  Mode  Cnt     Score   Error  Units
> VectorIntMinMax.testMaxInt            2048       0  avgt   25  1449.792 ± 1.072  ns/op
> VectorIntMinMax.testMinInt            2048       0  avgt   25  1450.636 ± 1.057  ns/op
> VectorIntMinMax.testStrictMaxInt      2048       0  avgt   25  1450.214 ± 1.093  ns/op
> VectorIntMinMax.testStrictMinInt      2048       0  avgt   25  1450.615 ± 1.098  ns/op
>
> x86-64:
> Benchmark                         (length)  (seed)  Mode  Cnt     Score   Error  Units
> VectorIntMinMax.testMaxInt            2048       0  avgt   25  2059.673 ± 4.726  ns/op
> VectorIntMinMax.testMinInt            2048       0  avgt   25  2059.853 ± 4.754  ns/op
> VectorIntMinMax.testStrictMaxInt      2048       0  avgt   25  2059.920 ± 4.658  ns/op
> VectorIntMinMax.testStrictMinInt      2048       0  avgt   25  2059.622 ± 4.768  ns/op
> There is no degradation when vectorization is disabled.
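[Editorial sketch, not part of the original message.] The two `MaxI` rewrites named in this PR description rest on the arithmetic identity max(x + c0, x + c1) = x + max(c0, c1). The Java check below exercises that identity and its nested form on a few sample values. The class and method names are hypothetical. The overflow guard is an assumption made for this sketch: the thread does not state which conditions `MaxINode::Ideal` checks, but with 32-bit wraparound the identity can fail, as the last line demonstrates.

```java
public class MaxAddIdentity {
    // True if x + c does not fit in a 32-bit int.
    static boolean addOverflows(int x, int c) {
        long sum = (long) x + (long) c;
        return sum != (int) sum;
    }

    // MaxI(x + c0, x + c1) compared with AddI(x, MAX2(c0, c1)).
    static boolean simpleFormHolds(int x, int c0, int c1) {
        return Math.max(x + c0, x + c1) == x + Math.max(c0, c1);
    }

    // MaxI(x + c0, MaxI(x + c1, z)) compared with MaxI(AddI(x, MAX2(c0, c1)), z).
    static boolean nestedFormHolds(int x, int c0, int c1, int z) {
        return Math.max(x + c0, Math.max(x + c1, z))
            == Math.max(x + Math.max(c0, c1), z);
    }

    public static void main(String[] args) {
        int[] samples = {Integer.MIN_VALUE, -100, -1, 0, 1, 7, 100, Integer.MAX_VALUE};
        for (int x : samples) {
            for (int c0 : samples) {
                for (int c1 : samples) {
                    // Only check inputs where neither addition wraps around.
                    if (addOverflows(x, c0) || addOverflows(x, c1)) {
                        continue;
                    }
                    if (!simpleFormHolds(x, c0, c1)) {
                        throw new AssertionError("simple form failed");
                    }
                    for (int z : samples) {
                        if (!nestedFormHolds(x, c0, c1, z)) {
                            throw new AssertionError("nested form failed");
                        }
                    }
                }
            }
        }
        System.out.println("identity holds on all non-overflowing samples");
        // With wraparound the identity breaks: MAX_VALUE + 1 wraps to MIN_VALUE.
        System.out.println(simpleFormHolds(Integer.MAX_VALUE, 1, 0)); // false
    }
}
```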
>
> This patch also implements Ideal transformations for the MaxINode which are similar to the ones defined for the MinINode to transform/optimize a couple of commonly occurring patterns such as -
> MaxI(x + c0, MaxI(y + c1, z)) ==> MaxI(AddI(x, MAX2(c0, c1)), z) when x == y
> MaxI(x + c0, y + c1) ==> AddI(x, MAX2(c0,c1)) when x == y

I agree that MaxINode::Ideal is useful but it should be done separately to see effects of intrinsic changes only.

-------------

PR: https://git.openjdk.org/jdk/pull/9466

From kvn at openjdk.org  Wed Jul 13 14:55:02 2022
From: kvn at openjdk.org (Vladimir Kozlov)
Date: Wed, 13 Jul 2022 14:55:02 GMT
Subject: RFR: 8288107: Auto-vectorization for integer min/max
In-Reply-To:
References:
Message-ID:

On Tue, 12 Jul 2022 11:45:28 GMT, Bhavana-Kilambi wrote:

> When Math.min/max is invoked on integer arrays, it generates the CMP-CMOVE instructions instead of vectorizing the loop(if vectorizable and relevant ISA is available) using vector equivalent of min/max instructions. Emitting MaxI/MinI nodes instead of Cmp/CmoveI nodes results in the loop getting vectorized eventually and the architecture specific min/max vector instructions are generated.
> A test for the same to test the performance of Math.max/min and StrictMath.max/min is added. On aarch64, the smin/smax instructions are generated when the loop is vectorized. On x86-64, vectorization support for min/max is available only in SSE4 (pmaxsd/pminsd are generated) and AVX version >= 1 (vpmaxsd/vpminsd are generated). This patch generates these instructions only when the loop is vectorized and generates the usual cmp-cmove instructions when the loop is not vectorizable or when the max/min operations are called outside of the loop. Performance comparisons for the VectorIntMinMax.java test with and without the patch are given below :
>
> Before this patch:
> aarch64:
> Benchmark                         (length)  (seed)  Mode  Cnt     Score   Error  Units
> VectorIntMinMax.testMaxInt            2048       0  avgt   25  1593.510 ± 1.488  ns/op
> VectorIntMinMax.testMinInt            2048       0  avgt   25  1593.123 ± 1.365  ns/op
> VectorIntMinMax.testStrictMaxInt      2048       0  avgt   25  1593.112 ± 0.985  ns/op
> VectorIntMinMax.testStrictMinInt      2048       0  avgt   25  1593.290 ± 1.219  ns/op
>
> x86-64:
> Benchmark                         (length)  (seed)  Mode  Cnt     Score   Error  Units
> VectorIntMinMax.testMaxInt            2048       0  avgt   25  2084.717 ± 4.780  ns/op
> VectorIntMinMax.testMinInt            2048       0  avgt   25  2087.322 ± 4.158  ns/op
> VectorIntMinMax.testStrictMaxInt      2048       0  avgt   25  2084.568 ± 4.838  ns/op
> VectorIntMinMax.testStrictMinInt      2048       0  avgt   25  2086.595 ± 4.025  ns/op
>
> After this patch:
> aarch64:
> Benchmark                         (length)  (seed)  Mode  Cnt     Score   Error  Units
> VectorIntMinMax.testMaxInt            2048       0  avgt   25   323.911 ± 0.206  ns/op
> VectorIntMinMax.testMinInt            2048       0  avgt   25   324.084 ± 0.231  ns/op
> VectorIntMinMax.testStrictMaxInt      2048       0  avgt   25   323.892 ± 0.234  ns/op
> VectorIntMinMax.testStrictMinInt      2048       0  avgt   25   323.990 ± 0.295  ns/op
>
> x86-64:
> Benchmark                         (length)  (seed)  Mode  Cnt     Score   Error  Units
> VectorIntMinMax.testMaxInt            2048       0  avgt   25   387.639 ± 0.512  ns/op
> VectorIntMinMax.testMinInt            2048       0  avgt   25   387.999 ± 0.740  ns/op
> VectorIntMinMax.testStrictMaxInt      2048       0  avgt   25   387.605 ± 0.376  ns/op
> VectorIntMinMax.testStrictMinInt      2048       0  avgt   25   387.765 ± 0.498  ns/op
>
> With autovectorization, both the machines exhibit a significant performance gain. On both the machines the runtime is ~80% better than the case without the patch. Also ran the patch with -XX:-UseSuperWord to make sure the performance does not degrade in cases where vectorization does not happen. The performance numbers are shown below :
> aarch64:
> Benchmark                         (length)  (seed)  Mode  Cnt     Score   Error  Units
> VectorIntMinMax.testMaxInt            2048       0  avgt   25  1449.792 ± 1.072  ns/op
> VectorIntMinMax.testMinInt            2048       0  avgt   25  1450.636 ± 1.057  ns/op
> VectorIntMinMax.testStrictMaxInt      2048       0  avgt   25  1450.214 ± 1.093  ns/op
> VectorIntMinMax.testStrictMinInt      2048       0  avgt   25  1450.615 ± 1.098  ns/op
>
> x86-64:
> Benchmark                         (length)  (seed)  Mode  Cnt     Score   Error  Units
> VectorIntMinMax.testMaxInt            2048       0  avgt   25  2059.673 ± 4.726  ns/op
> VectorIntMinMax.testMinInt            2048       0  avgt   25  2059.853 ± 4.754  ns/op
> VectorIntMinMax.testStrictMaxInt      2048       0  avgt   25  2059.920 ± 4.658  ns/op
> VectorIntMinMax.testStrictMinInt      2048       0  avgt   25  2059.622 ± 4.768  ns/op
> There is no degradation when vectorization is disabled.
>
> This patch also implements Ideal transformations for the MaxINode which are similar to the ones defined for the MinINode to transform/optimize a couple of commonly occurring patterns such as -
> MaxI(x + c0, MaxI(y + c1, z)) ==> MaxI(AddI(x, MAX2(c0, c1)), z) when x == y
> MaxI(x + c0, y + c1) ==> AddI(x, MAX2(c0,c1)) when x == y

Thank you for checking Arrays.copyOfRange() code generation.
I will start our benchmarks testing without **MaxINode::Ideal**

-------------

PR: https://git.openjdk.org/jdk/pull/9466

From jbhateja at openjdk.org  Wed Jul 13 16:49:06 2022
From: jbhateja at openjdk.org (Jatin Bhateja)
Date: Wed, 13 Jul 2022 16:49:06 GMT
Subject: RFR: 8290066: Remove KNL specific handling for new CPU target check in IR annotation [v2]
In-Reply-To:
References:
Message-ID:

On Mon, 11 Jul 2022 21:03:30 GMT, Jatin Bhateja wrote:

>> - Newly added annotations query the CPU feature using white box API which returns the list of features enabled during VM initialization.
>> - With JVM flag UseKNLSetting, during VM initialization AVX512 features not supported by KNL target are disabled, thus we do not need any special handling for KNL in newly introduced IR annotations (applyCPUFeature, applyCPUFeatureOr, applyCPUFeatureAnd).
>>
>> Please review and share your feedback.
>>
>> Best Regards,
>> Jatin
>
> Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision:
>
>   8290066: Removing newly added white listed options.

@chhagedorn , can you kindly review and approve.
------------- PR: https://git.openjdk.org/jdk/pull/9452 From jbhateja at openjdk.org Wed Jul 13 16:50:21 2022 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Wed, 13 Jul 2022 16:50:21 GMT Subject: [jdk19] RFR: 8288112: C2: Error: ShouldNotReachHere() in Type::typerr() In-Reply-To: <3cX2IdssLz4arXwc028K93wtFv7PywXph9JYx7LTToc=.4be13b21-20c1-4aa2-bd25-ad8b07b69a9e@github.com> References: <3cX2IdssLz4arXwc028K93wtFv7PywXph9JYx7LTToc=.4be13b21-20c1-4aa2-bd25-ad8b07b69a9e@github.com> Message-ID: On Mon, 11 Jul 2022 21:01:35 GMT, Dean Long wrote: >>> @jatin-bhateja Right, a better error message wouldn't be a value-add to users, but if we could detect it sooner, like in SuperWord::output(), that might be useful to compiler engineers debugging issues. >> >> Hi @dean-long , Agree, I have changed the error message generated with -XX:+TraceLoopOpts to be more explicit like. >> **SWPointer::output: Unhandled scalar opcode (ReverseBytesI), ShouldNotReachHere, exiting SuperWord** >> >> Since patch already handles the missing scalar case, hence this message will not be generated. > > @jatin-bhateja OK, thanks. Hi @dean-long , kindly approve if the patch version looks ok to you. ------------- PR: https://git.openjdk.org/jdk19/pull/128 From never at openjdk.org Wed Jul 13 17:34:05 2022 From: never at openjdk.org (Tom Rodriguez) Date: Wed, 13 Jul 2022 17:34:05 GMT Subject: RFR: 8290234: [JVMCI] use JVMCIKlassHandle to protect raw Klass* values from concurrent G1 scanning In-Reply-To: References: Message-ID: On Wed, 13 Jul 2022 12:24:50 GMT, Doug Simon wrote: > JVMCI Java code must never read a raw `Klass*` value from memory (using `Unsafe`) that is not already known to be wrapped in a `HotSpotResolvedObjectTypeImpl` without going through a VM call. The VM call is necessary so that the `Klass*` is handlized in a `JVMCIKlassHandle` to protect it from the concurrent scanning done by G1. 
This PR re-introduces the VM calls that were mistakenly optimized away in [JDK-8289094](https://bugs.openjdk.org/browse/JDK-8289094). Marked as reviewed by never (Reviewer). ------------- PR: https://git.openjdk.org/jdk/pull/9480 From dnsimon at openjdk.org Wed Jul 13 19:18:19 2022 From: dnsimon at openjdk.org (Doug Simon) Date: Wed, 13 Jul 2022 19:18:19 GMT Subject: RFR: 8290234: [JVMCI] use JVMCIKlassHandle to protect raw Klass* values from concurrent G1 scanning In-Reply-To: References: Message-ID: On Wed, 13 Jul 2022 12:24:50 GMT, Doug Simon wrote: > JVMCI Java code must never read a raw `Klass*` value from memory (using `Unsafe`) that is not already known to be wrapped in a `HotSpotResolvedObjectTypeImpl` without going through a VM call. The VM call is necessary so that the `Klass*` is handlized in a `JVMCIKlassHandle` to protect it from the concurrent scanning done by G1. This PR re-introduces the VM calls that were mistakenly optimized away in [JDK-8289094](https://bugs.openjdk.org/browse/JDK-8289094). Thanks for the reviews. ------------- PR: https://git.openjdk.org/jdk/pull/9480 From dnsimon at openjdk.org Wed Jul 13 19:18:21 2022 From: dnsimon at openjdk.org (Doug Simon) Date: Wed, 13 Jul 2022 19:18:21 GMT Subject: Integrated: 8290234: [JVMCI] use JVMCIKlassHandle to protect raw Klass* values from concurrent G1 scanning In-Reply-To: References: Message-ID: On Wed, 13 Jul 2022 12:24:50 GMT, Doug Simon wrote: > JVMCI Java code must never read a raw `Klass*` value from memory (using `Unsafe`) that is not already known to be wrapped in a `HotSpotResolvedObjectTypeImpl` without going through a VM call. The VM call is necessary so that the `Klass*` is handlized in a `JVMCIKlassHandle` to protect it from the concurrent scanning done by G1. This PR re-introduces the VM calls that were mistakenly optimized away in [JDK-8289094](https://bugs.openjdk.org/browse/JDK-8289094). This pull request has now been integrated. 
Changeset: 74ac5df9 Author: Doug Simon URL: https://git.openjdk.org/jdk/commit/74ac5df96fb4344f005180f8643cb0c9223b1556 Stats: 21 lines in 3 files changed: 8 ins; 5 del; 8 mod 8290234: [JVMCI] use JVMCIKlassHandle to protect raw Klass* values from concurrent G1 scanning Reviewed-by: kvn, never ------------- PR: https://git.openjdk.org/jdk/pull/9480 From dlong at openjdk.org Wed Jul 13 23:28:04 2022 From: dlong at openjdk.org (Dean Long) Date: Wed, 13 Jul 2022 23:28:04 GMT Subject: [jdk19] RFR: 8288112: C2: Error: ShouldNotReachHere() in Type::typerr() [v5] In-Reply-To: References: Message-ID: On Wed, 13 Jul 2022 04:58:57 GMT, Jatin Bhateja wrote: >> [JDK-8284960](https://bugs.openjdk.org/browse/JDK-8284960) added new vector IR nodes and target specific backend support for ReverseByte vector operations. >> For each scalar operation, auto-vectorizer analysis stage checks for existence of vector IR opcode and target specific backend implementation. While processing scalar IR nodes corresponding to Java SE APIs [Short/Character/Integer/Long].reverseBytes SLP analysis checks passes since relevant support already existed. This bug fix patch handles missing scalar reversebyte opcode checks in SLP backed to enable creation of corresponding vector IR nodes. >> >> A new JBS issue [JDK-8290034](https://bugs.openjdk.org/browse/JDK-8290034) is created to add the missing auto-vectorization support for bit reverse operation targeting JDK mainline. >> >> Kindly review and share your feedback. >> >> Best Regards, >> Jatin > > Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: > > 8288112: Correcting type on copyright header. Marked as reviewed by dlong (Reviewer). 
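[Editorial sketch, not part of the original message.] The 8288112 description above concerns loops that apply `[Short/Character/Integer/Long].reverseBytes` to array elements. A minimal Java example of that loop shape follows. The class and method names are hypothetical, this is not the test or benchmark code from the PR, and whether C2 turns such loops into vector instructions depends on the JDK build and the target CPU.

```java
public class ReverseBytesLoop {
    // Byte-swap every element of src into dst (dst assumed at least as long as src).
    static void reverseBytesInt(int[] src, int[] dst) {
        for (int i = 0; i < src.length; i++) {
            dst[i] = Integer.reverseBytes(src[i]);
        }
    }

    // Same loop shape for 16-bit elements.
    static void reverseBytesShort(short[] src, short[] dst) {
        for (int i = 0; i < src.length; i++) {
            dst[i] = Short.reverseBytes(src[i]);
        }
    }

    public static void main(String[] args) {
        int[] src = {0x01020304, 0x000000FF};
        int[] dst = new int[src.length];
        reverseBytesInt(src, dst);
        System.out.println(Integer.toHexString(dst[0])); // 4030201
        System.out.println(Integer.toHexString(dst[1])); // ff000000
    }
}
```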

-------------

PR: https://git.openjdk.org/jdk19/pull/128

From fjiang at openjdk.org  Thu Jul 14 01:19:59 2022
From: fjiang at openjdk.org (Feilong Jiang)
Date: Thu, 14 Jul 2022 01:19:59 GMT
Subject: RFR: 8290164: compiler/runtime/TestConstantsInError.java fails on riscv
In-Reply-To:
References: <9z5icyb1IZEfh1-p-yg_jDzHUmPQKSdcY7OG8kCtIz8=.174855d2-265a-448e-adf5-205192d0b9cd@github.com>
Message-ID: <-Ti0U-u2ZH-HD9T7HbrBOixA32I5TvVU7ZFoJJUlo5E=.58d0bbfa-5bd3-4fed-84a5-c6088cd83a6d@github.com>

On Tue, 12 Jul 2022 09:46:24 GMT, Fei Yang wrote:

>> compiler/runtime/TestConstantsInError.java fails on riscv with the following error:
>>
>>
>> Execution failed: `main' threw exception: java.lang.RuntimeException: 'made not entrant' found in stdout
>>
>>
>> Similar to AArch64, RISCV64 does not patch C1 compiled code (see [JDK-8223613](https://bugs.openjdk.org/browse/JDK-8223613)). So we should add `Platform.isRISCV64` too for the test.
>>
>> This test requires vm.flagless. According to [JDK-8246494](https://bugs.openjdk.org/browse/JDK-8246494), tests with `vm.flagless` will be excluded from runs w/ any other X / XX flags passed via -vmoption / -javaoption.
>> Since we added the `-Xmx` option for all jtreg tests, this failure did not manifest before.
>>
>> After this fix, compiler/runtime/TestConstantsInError.java passes without failure.
>
> Looks good and reasonable.

@RealFYang -- thanks for the review!
Integrate then.
------------- PR: https://git.openjdk.org/jdk/pull/9463 From jbhateja at openjdk.org Thu Jul 14 01:50:06 2022 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Thu, 14 Jul 2022 01:50:06 GMT Subject: [jdk19] RFR: 8288112: C2: Error: ShouldNotReachHere() in Type::typerr() [v4] In-Reply-To: <8HjAKIYovRJUxTw3ycaj-VO-P9K_JsL3ONmKQcBIT6k=.c260f0a9-aaea-4fe6-9e4a-1ec582a492e4@github.com> References: <8HjAKIYovRJUxTw3ycaj-VO-P9K_JsL3ONmKQcBIT6k=.c260f0a9-aaea-4fe6-9e4a-1ec582a492e4@github.com> Message-ID: <7quQEuy_HVCMe_gsIJDlXmLwhEdqmCZ2Z34NLkV_MLg=.8027f5a2-dcd5-4691-bb89-8c729993125e@github.com> On Tue, 12 Jul 2022 21:29:04 GMT, Vladimir Kozlov wrote: >> Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: >> >> 8288112: Minor adjustment to benchmark. > > Tests copyright header validation failed because you forgot `,` after `2022` year in TestLongVect.java and TestShortVect.java. > Otherwise testing results are good. Thanks @vnkozlov @dean-long. ------------- PR: https://git.openjdk.org/jdk19/pull/128 From jbhateja at openjdk.org Thu Jul 14 01:50:09 2022 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Thu, 14 Jul 2022 01:50:09 GMT Subject: [jdk19] Integrated: 8288112: C2: Error: ShouldNotReachHere() in Type::typerr() In-Reply-To: References: Message-ID: On Fri, 8 Jul 2022 21:57:33 GMT, Jatin Bhateja wrote: > [JDK-8284960](https://bugs.openjdk.org/browse/JDK-8284960) added new vector IR nodes and target specific backend support for ReverseByte vector operations. > For each scalar operation, auto-vectorizer analysis stage checks for existence of vector IR opcode and target specific backend implementation. While processing scalar IR nodes corresponding to Java SE APIs [Short/Character/Integer/Long].reverseBytes SLP analysis checks passes since relevant support already existed. This bug fix patch handles missing scalar reversebyte opcode checks in SLP backed to enable creation of corresponding vector IR nodes. 
>
> A new JBS issue [JDK-8290034](https://bugs.openjdk.org/browse/JDK-8290034) is created to add the missing auto-vectorization support for bit reverse operation targeting JDK mainline.
>
> Kindly review and share your feedback.
>
> Best Regards,
> Jatin

This pull request has now been integrated.

Changeset: fd89ab8d
Author:    Jatin Bhateja
URL:       https://git.openjdk.org/jdk19/commit/fd89ab8dacda1d6af5bd4be57a83362c8cdd5e20
Stats:     216 lines in 9 files changed: 208 ins; 0 del; 8 mod

8288112: C2: Error: ShouldNotReachHere() in Type::typerr()

Reviewed-by: dlong, kvn

-------------

PR: https://git.openjdk.org/jdk19/pull/128

From pli at openjdk.org  Thu Jul 14 02:46:01 2022
From: pli at openjdk.org (Pengfei Li)
Date: Thu, 14 Jul 2022 02:46:01 GMT
Subject: [jdk19] RFR: 8289954: C2: Assert failed in PhaseCFG::verify() after JDK-8183390
In-Reply-To:
References:
Message-ID:

On Mon, 11 Jul 2022 08:41:21 GMT, Pengfei Li wrote:

> Fuzzer tests report an assertion failure in C2's global code motion
> phase. Git bisection shows the problem starts after our fix of post loop
> vectorization (JDK-8183390). After some narrowing down, we find it
> is caused by the change below in that patch.
>
>
> @@ -422,14 +404,7 @@
>          cl->mark_passed_slp();
>        }
>        cl->mark_was_slp();
> -      if (cl->is_main_loop()) {
> -        cl->set_slp_max_unroll(local_loop_unroll_factor);
> -      } else if (post_loop_allowed) {
> -        if (!small_basic_type) {
> -          // avoid replication context for small basic types in programmable masked loops
> -          cl->set_slp_max_unroll(local_loop_unroll_factor);
> -        }
> -      }
> +      cl->set_slp_max_unroll(local_loop_unroll_factor);
>      }
>    }
>
>
> This change is in function `SuperWord::unrolling_analysis()`. AFAIK, it
> helps find a loop's max unroll count via some analysis. In the original
> code, we have loop type checks and the slp max unroll value is set for
> only some types of loops. But in JDK-8183390, the check was removed by
> mistake. In my current understanding, the slp max unroll value applies
> to slp candidate loops only - either main loops or RCE'd post loops -
> so that check shouldn't be removed. After restoring it we don't see the
> assertion failure any more.
>
> The new jtreg test created in this patch can reproduce the failed assertion,
> which checks `def_block->dominates(block)` - the domination relationship
> of two blocks. But in this case, I found the blocks are in an unreachable
> inner loop, which I think ought to be optimized away in some previous C2
> phases. As I'm not quite familiar with C2's global code motion, so
> far I still don't understand how the slp max unroll count eventually causes
> that problem. This patch just restores the if condition which I removed
> incorrectly in JDK-8183390. But I still suspect that there is another
> hidden bug in C2. I would be glad if any reviewers can give me
> some guidance or suggestions.
>
> Tested hotspot::hotspot_all_no_apps, jdk::tier1~3 and langtools::tier1.

@dean-long Do you have any comments or suggestions on this? The failure was reported from your fuzzer test.
------------- PR: https://git.openjdk.org/jdk19/pull/130 From yadongwang at openjdk.org Thu Jul 14 03:19:10 2022 From: yadongwang at openjdk.org (Yadong Wang) Date: Thu, 14 Jul 2022 03:19:10 GMT Subject: RFR: 8290164: compiler/runtime/TestConstantsInError.java fails on riscv In-Reply-To: <9z5icyb1IZEfh1-p-yg_jDzHUmPQKSdcY7OG8kCtIz8=.174855d2-265a-448e-adf5-205192d0b9cd@github.com> References: <9z5icyb1IZEfh1-p-yg_jDzHUmPQKSdcY7OG8kCtIz8=.174855d2-265a-448e-adf5-205192d0b9cd@github.com> Message-ID: On Tue, 12 Jul 2022 09:36:37 GMT, Feilong Jiang wrote: > compiler/runtime/TestConstantsInError.java fails on riscv with the following error: > > > Execution failed: `main' threw exception: java.lang.RuntimeException: 'made not entrant' found in stdout > > > Similar to AArch64, RISCV64 does not patch C1 compiled code (see [JDK-8223613](https://bugs.openjdk.org/browse/JDK-8223613)). So we should add `Platform.isRISCV64` too for the test. > > This test requires vm.flagless. According to [JDK-8246494](https://bugs.openjdk.org/browse/JDK-8246494), tests with `vm.flagless` will be excluded from runs w/ any other X / XX flags passed via -vmoption / -javaoption. > Since we added the `-Xmx` option for all jtreg tests, this failure did not manifest before. > > After this fix, compiler/runtime/TestConstantsInError.java passed without failure. lgtm ------------- Marked as reviewed by yadongwang (Author). 
PR: https://git.openjdk.org/jdk/pull/9463 From fjiang at openjdk.org Thu Jul 14 03:36:59 2022 From: fjiang at openjdk.org (Feilong Jiang) Date: Thu, 14 Jul 2022 03:36:59 GMT Subject: Integrated: 8290164: compiler/runtime/TestConstantsInError.java fails on riscv In-Reply-To: <9z5icyb1IZEfh1-p-yg_jDzHUmPQKSdcY7OG8kCtIz8=.174855d2-265a-448e-adf5-205192d0b9cd@github.com> References: <9z5icyb1IZEfh1-p-yg_jDzHUmPQKSdcY7OG8kCtIz8=.174855d2-265a-448e-adf5-205192d0b9cd@github.com> Message-ID: On Tue, 12 Jul 2022 09:36:37 GMT, Feilong Jiang wrote: > compiler/runtime/TestConstantsInError.java fails on riscv with the following error: > > > Execution failed: `main' threw exception: java.lang.RuntimeException: 'made not entrant' found in stdout > > > Similar to AArch64, RISCV64 does not patch C1 compiled code (see [JDK-8223613](https://bugs.openjdk.org/browse/JDK-8223613)). So we should add `Platform.isRISCV64` too for the test. > > This test requires vm.flagless. According to [JDK-8246494](https://bugs.openjdk.org/browse/JDK-8246494), tests with `vm.flagless` will be excluded from runs w/ any other X / XX flags passed via -vmoption / -javaoption. > Since we added the `-Xmx` option for all jtreg tests, this failure did not manifest before. > > After this fix, compiler/runtime/TestConstantsInError.java passed without failure. This pull request has now been integrated. 
Changeset: 3471ac9a Author: Feilong Jiang Committer: Jie Fu URL: https://git.openjdk.org/jdk/commit/3471ac9a907780d894d05bd58cf883c4c8d8838d Stats: 4 lines in 1 file changed: 0 ins; 0 del; 4 mod 8290164: compiler/runtime/TestConstantsInError.java fails on riscv Reviewed-by: fyang, yadongwang ------------- PR: https://git.openjdk.org/jdk/pull/9463 From dlong at openjdk.org Thu Jul 14 06:15:04 2022 From: dlong at openjdk.org (Dean Long) Date: Thu, 14 Jul 2022 06:15:04 GMT Subject: [jdk19] RFR: 8289954: C2: Assert failed in PhaseCFG::verify() after JDK-8183390 In-Reply-To: References: Message-ID: On Thu, 14 Jul 2022 02:42:21 GMT, Pengfei Li wrote: >> Fuzzer tests report an assertion failure issue in C2 global code motion >> phase. Git bisection shows the problem starts after our fix of post loop >> vectorization (JDK-8183390). After some narrowing down work, we find it >> is caused by below change in that patch. >> >> >> @@ -422,14 +404,7 @@ >> cl->mark_passed_slp(); >> } >> cl->mark_was_slp(); >> - if (cl->is_main_loop()) { >> - cl->set_slp_max_unroll(local_loop_unroll_factor); >> - } else if (post_loop_allowed) { >> - if (!small_basic_type) { >> - // avoid replication context for small basic types in programmable masked loops >> - cl->set_slp_max_unroll(local_loop_unroll_factor); >> - } >> - } >> + cl->set_slp_max_unroll(local_loop_unroll_factor); >> } >> } >> >> >> This change is in function `SuperWord::unrolling_analysis()`. AFAIK, it >> helps find a loop's max unroll count via some analysis. In the original >> code, we have loop type checks and the slp max unroll value is set for >> only some types of loops. But in JDK-8183390, the check was removed by >> mistake. In my current understanding, the slp max unroll value applies >> to slp candidate loops only - either main loops or RCE'd post loops - >> so that check shouldn't be removed. After restoring it we don't see the >> assertion failure any more. 
>> >> The new jtreg created in this patch can reproduce the failed assertion, >> which checks `def_block->dominates(block)` - the domination relationship >> of two blocks. But in the case, I found the blocks are in an unreachable >> inner loop, which I think ought to be optimized away in some previous C2 >> phases. As I'm not quite familiar with the C2's global code motion, so >> far I still don't understand how slp max unroll count eventually causes >> that problem. This patch just restores the if condition which I removed >> incorrectly in JDK-8183390. But I still suspect that there is another >> hidden bug exists in C2. I would be glad if any reviewers can give me >> some guidance or suggestions. >> >> Tested hotspot::hotspot_all_no_apps, jdk::tier1~3 and langtools::tier1. > > @dean-long Do you have any comments or suggestions on this? The failure was reported from your fuzzer test. @pfustc Sorry, I'm not enough of an expert on SuperWord to review the fix. The test was generated automatically by Java Fuzzer. ------------- PR: https://git.openjdk.org/jdk19/pull/130 From pli at openjdk.org Thu Jul 14 06:29:46 2022 From: pli at openjdk.org (Pengfei Li) Date: Thu, 14 Jul 2022 06:29:46 GMT Subject: [jdk19] RFR: 8289954: C2: Assert failed in PhaseCFG::verify() after JDK-8183390 In-Reply-To: References: Message-ID: <_EpfqAojmiLCRvA4JnBLQ5ziiBVyKmnPVFHEWbmExV4=.2c327de0-c159-482b-951b-54f4595164cb@github.com> On Thu, 14 Jul 2022 02:42:21 GMT, Pengfei Li wrote: >> Fuzzer tests report an assertion failure issue in C2 global code motion >> phase. Git bisection shows the problem starts after our fix of post loop >> vectorization (JDK-8183390). After some narrowing down work, we find it >> is caused by below change in that patch. 
>> >> >> @@ -422,14 +404,7 @@ >> cl->mark_passed_slp(); >> } >> cl->mark_was_slp(); >> - if (cl->is_main_loop()) { >> - cl->set_slp_max_unroll(local_loop_unroll_factor); >> - } else if (post_loop_allowed) { >> - if (!small_basic_type) { >> - // avoid replication context for small basic types in programmable masked loops >> - cl->set_slp_max_unroll(local_loop_unroll_factor); >> - } >> - } >> + cl->set_slp_max_unroll(local_loop_unroll_factor); >> } >> } >> >> >> This change is in function `SuperWord::unrolling_analysis()`. AFAIK, it >> helps find a loop's max unroll count via some analysis. In the original >> code, we have loop type checks and the slp max unroll value is set for >> only some types of loops. But in JDK-8183390, the check was removed by >> mistake. In my current understanding, the slp max unroll value applies >> to slp candidate loops only - either main loops or RCE'd post loops - >> so that check shouldn't be removed. After restoring it we don't see the >> assertion failure any more. >> >> The new jtreg created in this patch can reproduce the failed assertion, >> which checks `def_block->dominates(block)` - the domination relationship >> of two blocks. But in the case, I found the blocks are in an unreachable >> inner loop, which I think ought to be optimized away in some previous C2 >> phases. As I'm not quite familiar with the C2's global code motion, so >> far I still don't understand how slp max unroll count eventually causes >> that problem. This patch just restores the if condition which I removed >> incorrectly in JDK-8183390. But I still suspect that there is another >> hidden bug exists in C2. I would be glad if any reviewers can give me >> some guidance or suggestions. >> >> Tested hotspot::hotspot_all_no_apps, jdk::tier1~3 and langtools::tier1. > > @dean-long Do you have any comments or suggestions on this? The failure was reported from your fuzzer test. > @pfustc Sorry, I'm not enough of an expert on SuperWord to review the fix. 
The test was generated automatically by Java Fuzzer. May I ask how do you generate and run the Fuzzer tests? Is there any instructions we can follow? Recently we see a couple of SuperWord issues reported by corner cases which are generated by the Fuzzer. ------------- PR: https://git.openjdk.org/jdk19/pull/130 From xliu at openjdk.org Thu Jul 14 07:07:07 2022 From: xliu at openjdk.org (Xin Liu) Date: Thu, 14 Jul 2022 07:07:07 GMT Subject: RFR: 8288897: Clean up node dump code [v4] In-Reply-To: References: Message-ID: <3EPQUiSBD0TfKEMe-ZQ4q26qH4Ll0wpMP0aX1HxY6XY=.ffdf1f4d-6bf3-4882-ad86-5bc23cbbb76f@github.com> On Fri, 8 Jul 2022 15:47:17 GMT, Emanuel Peter wrote: >> I recently did some work in the area of `Node::dump` and `Node::find`, see [JDK-8287647](https://bugs.openjdk.org/browse/JDK-8287647) and [JDK-8283775](https://bugs.openjdk.org/browse/JDK-8283775). >> >> This change sets cleans up the code around, and tries to reduce code duplication. >> >> Things I did: >> - remove Node::related. It was added 7 years ago, with [JDK-8004073](https://bugs.openjdk.org/browse/JDK-8004073). However, it was not extended to many nodes, and hence it is incomplete, and nobody I know seems to use it. >> - refactor `dump(int)` to use `dump_bfs` (reduce code duplication). >> - redefine categories in `dump_bfs`, focusing on output types. Mixed type is now also control if it has control output, and memory if it has memory output, etc. Plus, a node is also in the control category if it `is_CFG`. This makes `dump_bfs` much more usable, to traverse control and memory flow. >> - Other small cleanups, like replacing rarely used dump functions with dump, making removing dead code, make some functions private >> - Adding `call from debugger` comment to VM functions that are useful in debugger >> - rename `find_node_by_name` to `find_nodes_by_name` and `find_node_by_dump` to `find_nodes_by_dump`. 
> > Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: > > implementing Christians review suggestions nice feature! LGTM. I am not a reviewer. we still needs other reviewers to approve it. src/hotspot/share/opto/node.cpp line 2079: > 2077: tty->print(" @: print old nodes - before matching (if available)\n"); > 2078: tty->print(" B: print scheduling blocks (if available)\n"); > 2079: tty->print(" $: dump only, no header, no other columns\n"); why don't we just list those options in alphabetical order? It's a little bit easier to interpret the default value 'cdmxo at B' if we flip order 'print m:... ' and 'print d:...' above. src/hotspot/share/opto/node.cpp line 2671: > 2669: // only_data: whether to regard data edges only during traversal. > 2670: static void collect_nodes_i(GrowableArray* queue, const Node* start, int direction, uint depth, bool include_start, bool only_ctrl, bool only_data) { > 2671: bool indent = depth <= PrintIdealIndentThreshold; hi, @eme64 , Could you also delete PrintIdealIndentThreshold from c2_globals.hpp? This is like a hack to let node::dump() indent. I don't think it's quite useful. it can be done in gdb pretty-print ------------- Marked as reviewed by xliu (Committer). PR: https://git.openjdk.org/jdk/pull/9234 From xliu at openjdk.org Thu Jul 14 07:28:06 2022 From: xliu at openjdk.org (Xin Liu) Date: Thu, 14 Jul 2022 07:28:06 GMT Subject: RFR: 8288897: Clean up node dump code [v4] In-Reply-To: References: Message-ID: On Fri, 24 Jun 2022 09:59:55 GMT, Christian Hagedorn wrote: >> Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: >> >> implementing Christians review suggestions > > Otherwise, nice cleanup! I think it's the right thing to remove unused and unmaintained `dump` methods and reduce code duplication. > > Have you checked that the printed node order with `dump(X)` is the same as before? 
I'm not sure if that is a strong requirement. I'm just thinking about `PrintIdeal` with which we do: > https://github.com/openjdk/jdk/blob/17aacde50fb971bc686825772e29f6bfecadabda/src/hotspot/share/opto/compile.cpp#L554 > > Some tools/scripts might depend on the previous order of `dump(X)`. But I'm currently not aware of any such order-dependent processing. For the IR framework, the node order does not matter and if I see that correctly, the dump of an individual node is the same as before. So, it should be fine. > @chhagedorn > > > Have you checked that the printed node order with `dump(X)` is the same as before? I'm not sure if that is a strong requirement. > > I did try to make sure that the output of `dump` stays equivalent. As far as I manually inspected, they are. The visit order is the same, and the same nodes are dumped. I also verify that. root()->dump(9999) is still same. furthermore, I update it with colorful style. it looks pretty cool,huh? - root()->dump(9999); + tty->print_raw("AFTER: "); + tty->print_raw_cr(phase_name); + root()->dump_bfs(9999, nullptr, "+#$S"); ![Screen Shot 2022-07-14 at 12 20 46 AM](https://user-images.githubusercontent.com/2386768/178925766-00fa2ddd-4272-4a0b-a207-92f2f64bfcc3.png) ------------- PR: https://git.openjdk.org/jdk/pull/9234 From epeter at openjdk.org Thu Jul 14 08:56:00 2022 From: epeter at openjdk.org (Emanuel Peter) Date: Thu, 14 Jul 2022 08:56:00 GMT Subject: RFR: 8288897: Clean up node dump code [v4] In-Reply-To: <3EPQUiSBD0TfKEMe-ZQ4q26qH4Ll0wpMP0aX1HxY6XY=.ffdf1f4d-6bf3-4882-ad86-5bc23cbbb76f@github.com> References: <3EPQUiSBD0TfKEMe-ZQ4q26qH4Ll0wpMP0aX1HxY6XY=.ffdf1f4d-6bf3-4882-ad86-5bc23cbbb76f@github.com> Message-ID: <-DYVsjIt1_hUt7JxIZltj-0888CcE1mUvZELp4JDMeA=.ad8df8b4-8af1-41c3-9d78-1cd31b5bd953@github.com> On Wed, 13 Jul 2022 22:07:50 GMT, Xin Liu wrote: >> Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: >> >> implementing Christians review 
suggestions > > src/hotspot/share/opto/node.cpp line 2671: > >> 2669: // only_data: whether to regard data edges only during traversal. >> 2670: static void collect_nodes_i(GrowableArray* queue, const Node* start, int direction, uint depth, bool include_start, bool only_ctrl, bool only_data) { >> 2671: bool indent = depth <= PrintIdealIndentThreshold; > > hi, @eme64 , > Could you also delete PrintIdealIndentThreshold from c2_globals.hpp? > This is like a hack to let node::dump() indent. I don't think it's quite useful. it can be done in gdb pretty-print Oh thank you so much, I overlooked this! Will delete it. ------------- PR: https://git.openjdk.org/jdk/pull/9234 From epeter at openjdk.org Thu Jul 14 09:05:03 2022 From: epeter at openjdk.org (Emanuel Peter) Date: Thu, 14 Jul 2022 09:05:03 GMT Subject: RFR: 8288897: Clean up node dump code [v4] In-Reply-To: <3EPQUiSBD0TfKEMe-ZQ4q26qH4Ll0wpMP0aX1HxY6XY=.ffdf1f4d-6bf3-4882-ad86-5bc23cbbb76f@github.com> References: <3EPQUiSBD0TfKEMe-ZQ4q26qH4Ll0wpMP0aX1HxY6XY=.ffdf1f4d-6bf3-4882-ad86-5bc23cbbb76f@github.com> Message-ID: <9hwkahx0MfA0NJHMDKt6G3ePQbS9mJZdA9NIa2UarW4=.ef87d1ef-d1ea-49b3-bfc4-2c9e917d28c8@github.com> On Thu, 14 Jul 2022 06:54:52 GMT, Xin Liu wrote: >> Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: >> >> implementing Christians review suggestions > > src/hotspot/share/opto/node.cpp line 2079: > >> 2077: tty->print(" @: print old nodes - before matching (if available)\n"); >> 2078: tty->print(" B: print scheduling blocks (if available)\n"); >> 2079: tty->print(" $: dump only, no header, no other columns\n"); > > why don't we just list those options in alphabetical order? > It's a little bit easier to interpret the default value 'cdmxo at B' if we flip order 'print m:... ' and 'print d:...' above. 
Thanks, will reorder, good idea ------------- PR: https://git.openjdk.org/jdk/pull/9234 From epeter at openjdk.org Thu Jul 14 09:20:00 2022 From: epeter at openjdk.org (Emanuel Peter) Date: Thu, 14 Jul 2022 09:20:00 GMT Subject: RFR: 8288897: Clean up node dump code [v5] In-Reply-To: References: Message-ID: > I recently did some work in the area of `Node::dump` and `Node::find`, see [JDK-8287647](https://bugs.openjdk.org/browse/JDK-8287647) and [JDK-8283775](https://bugs.openjdk.org/browse/JDK-8283775). > > This change sets cleans up the code around, and tries to reduce code duplication. > > Things I did: > - remove Node::related. It was added 7 years ago, with [JDK-8004073](https://bugs.openjdk.org/browse/JDK-8004073). However, it was not extended to many nodes, and hence it is incomplete, and nobody I know seems to use it. > - refactor `dump(int)` to use `dump_bfs` (reduce code duplication). > - redefine categories in `dump_bfs`, focusing on output types. Mixed type is now also control if it has control output, and memory if it has memory output, etc. Plus, a node is also in the control category if it `is_CFG`. This makes `dump_bfs` much more usable, to traverse control and memory flow. > - Other small cleanups, like replacing rarely used dump functions with dump, making removing dead code, make some functions private > - Adding `call from debugger` comment to VM functions that are useful in debugger > - rename `find_node_by_name` to `find_nodes_by_name` and `find_node_by_dump` to `find_nodes_by_dump`. Emanuel Peter has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. 
The pull request contains ten additional commits since the last revision: - Merge branch 'master' into JDK-8288897 - review suggestions from @navyxliu - implementing Christians review suggestions - Merge branch 'master' into JDK-8288897 - Apply suggestions from code review 2 style fixes by Christian Co-authored-by: Christian Hagedorn - cleanup, move debug functions to cpp to prevent inlining, add comment for debugger functions - make dump_bfs const, change datastructures, change some signatures to const - refactor dump to use dump_bfs, redefine categories through output types - 8288897: Clean up dump code for nodes ------------- Changes: - all: https://git.openjdk.org/jdk/pull/9234/files - new: https://git.openjdk.org/jdk/pull/9234/files/a95b1260..975c0e7c Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=9234&range=04 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=9234&range=03-04 Stats: 74600 lines in 1570 files changed: 38463 ins; 24775 del; 11362 mod Patch: https://git.openjdk.org/jdk/pull/9234.diff Fetch: git fetch https://git.openjdk.org/jdk pull/9234/head:pull/9234 PR: https://git.openjdk.org/jdk/pull/9234 From mdoerr at openjdk.org Thu Jul 14 09:33:58 2022 From: mdoerr at openjdk.org (Martin Doerr) Date: Thu, 14 Jul 2022 09:33:58 GMT Subject: RFR: 8288883: C2: assert(allow_address || t != T_ADDRESS) failed after JDK-8283091 In-Reply-To: References: Message-ID: On Wed, 6 Jul 2022 07:51:01 GMT, Fei Gao wrote: > Superword doesn't vectorize any nodes of non-primitive types and > thus sets `allow_address` false when calling type2aelembytes() in > SuperWord::data_size()[1]. Therefore, when we try to resolve the > data size for a node of T_ADDRESS type, the assertion in > type2aelembytes()[2] takes effect. > > We try to resolve the data sizes for node s and node t in the > SuperWord::adjust_alignment_for_type_conversion()[3] when type > conversion between different data sizes happens. 
The issue is, > when node s is a ConvI2L node and node t is an AddP node of > T_ADDRESS type, type2aelembytes() will assert. To fix it, we > should filter out all non-primitive nodes, like the patch does > in SuperWord::adjust_alignment_for_type_conversion(). Since > it's a failure in the mid-end, all superword available platforms > are affected. In my local test, this failure can be reproduced > on both x86 and aarch64. With this patch, the failure can be fixed. > > Apart from fixing the bug, the patch also adds necessary type check > and does some clean-up in SuperWord::longer_type_for_conversion() > and VectorCastNode::implemented(). > > [1]https://github.com/openjdk/jdk/blob/dddd4e7c81fccd82b0fd37ea4583ce1a8e175919/src/hotspot/share/opto/superword.cpp#L1417 > [2]https://github.com/openjdk/jdk/blob/b96ba19807845739b36274efb168dd048db819a3/src/hotspot/share/utilities/globalDefinitions.cpp#L326 > [3]https://github.com/openjdk/jdk/blob/dddd4e7c81fccd82b0fd37ea4583ce1a8e175919/src/hotspot/share/opto/superword.cpp#L1454 Your fix LGTM. The test doesn't show the problem on PPC64, but my original replay file has worked to verify the fix. ------------- Marked as reviewed by mdoerr (Reviewer). 
PR: https://git.openjdk.org/jdk/pull/9391 From jwilhelm at openjdk.org Thu Jul 14 13:08:58 2022 From: jwilhelm at openjdk.org (Jesper Wilhelmsson) Date: Thu, 14 Jul 2022 13:08:58 GMT Subject: RFR: Merge jdk19 Message-ID: Forwardport JDK 19 -> JDK 20 ------------- Commit messages: - Merge - 8288112: C2: Error: ShouldNotReachHere() in Type::typerr() - 8290209: jcup.md missing additional text - 8290207: Missing notice in dom.md The webrevs contain the adjustments done while merging with regards to each parent branch: - master: https://webrevs.openjdk.org/?repo=jdk&pr=9493&range=00.0 - jdk19: https://webrevs.openjdk.org/?repo=jdk&pr=9493&range=00.1 Changes: https://git.openjdk.org/jdk/pull/9493/files Stats: 242 lines in 11 files changed: 232 ins; 2 del; 8 mod Patch: https://git.openjdk.org/jdk/pull/9493.diff Fetch: git fetch https://git.openjdk.org/jdk pull/9493/head:pull/9493 PR: https://git.openjdk.org/jdk/pull/9493 From xliu at openjdk.org Thu Jul 14 16:29:11 2022 From: xliu at openjdk.org (Xin Liu) Date: Thu, 14 Jul 2022 16:29:11 GMT Subject: RFR: 8288897: Clean up node dump code [v5] In-Reply-To: References: Message-ID: <3HAbFXya3X-aAzyByYC0akdA4CzX8EwlRwyf4mgwIs4=.cca72da7-a8fc-4c42-b46a-54ba02f73021@github.com> On Thu, 14 Jul 2022 09:20:00 GMT, Emanuel Peter wrote: >> I recently did some work in the area of `Node::dump` and `Node::find`, see [JDK-8287647](https://bugs.openjdk.org/browse/JDK-8287647) and [JDK-8283775](https://bugs.openjdk.org/browse/JDK-8283775). >> >> This change sets cleans up the code around, and tries to reduce code duplication. >> >> Things I did: >> - remove Node::related. It was added 7 years ago, with [JDK-8004073](https://bugs.openjdk.org/browse/JDK-8004073). However, it was not extended to many nodes, and hence it is incomplete, and nobody I know seems to use it. >> - refactor `dump(int)` to use `dump_bfs` (reduce code duplication). >> - redefine categories in `dump_bfs`, focusing on output types. 
Mixed type is now also control if it has control output, and memory if it has memory output, etc. Plus, a node is also in the control category if it `is_CFG`. This makes `dump_bfs` much more usable, to traverse control and memory flow. >> - Other small cleanups, like replacing rarely used dump functions with dump, making removing dead code, make some functions private >> - Adding `call from debugger` comment to VM functions that are useful in debugger >> - rename `find_node_by_name` to `find_nodes_by_name` and `find_node_by_dump` to `find_nodes_by_dump`. >> - remove now unused dump indent compiler flag `PrintIdealIndentThreshold` (notproduct) > > Emanuel Peter has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains ten additional commits since the last revision: > > - Merge branch 'master' into JDK-8288897 > - review suggestions from @navyxliu > - implementing Christians review suggestions > - Merge branch 'master' into JDK-8288897 > - Apply suggestions from code review > > 2 style fixes by Christian > > Co-authored-by: Christian Hagedorn > - cleanup, move debug functions to cpp to prevent inlining, add comment for debugger functions > - make dump_bfs const, change datastructures, change some signatures to const > - refactor dump to use dump_bfs, redefine categories through output types > - 8288897: Clean up dump code for nodes still LGTM. thanks. ------------- Marked as reviewed by xliu (Committer). PR: https://git.openjdk.org/jdk/pull/9234 From jwilhelm at openjdk.org Thu Jul 14 16:34:10 2022 From: jwilhelm at openjdk.org (Jesper Wilhelmsson) Date: Thu, 14 Jul 2022 16:34:10 GMT Subject: Integrated: Merge jdk19 In-Reply-To: References: Message-ID: On Thu, 14 Jul 2022 13:02:21 GMT, Jesper Wilhelmsson wrote: > Forwardport JDK 19 -> JDK 20 This pull request has now been integrated. 
Changeset: 3ad39505 Author: Jesper Wilhelmsson URL: https://git.openjdk.org/jdk/commit/3ad39505605f8eab74adec9c68f211dd44796759 Stats: 242 lines in 11 files changed: 232 ins; 2 del; 8 mod Merge ------------- PR: https://git.openjdk.org/jdk/pull/9493 From kvn at openjdk.org Thu Jul 14 18:16:31 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Thu, 14 Jul 2022 18:16:31 GMT Subject: RFR: 8290246: test fails "assert(init != __null) failed: initialization not found" Message-ID: <_yDaLQbub4f4pLrQChlEnxceYbu6naNOuQH5kgYs9lY=.fd01b7ec-8de2-49a9-9a70-06c5ff563521@github.com> CTW test (which compiles methods without running them - no profiling) failed when run with stress flag and particular RNG seed `-XX:+StressIGVN -XX:StressSeed=1743550013`. The failure is intermittent because of RNG. The compiled method [BasicPopupMenuUI$BasicMenuKeyListener::menuKeyPressed()](https://github.com/openjdk/jdk/blob/master/src/java.desktop/share/classes/javax/swing/plaf/basic/BasicPopupMenuUI.java#L331) has allocation in loop at line [L360](https://github.com/openjdk/jdk/blob/master/src/java.desktop/share/classes/javax/swing/plaf/basic/BasicPopupMenuUI.java#L360) which is followed by call `item.isArmed()` at line L366. This call is not "linked" and uncommon trap is generated for it. As result the allocation result become un-used. Due to shuffling done with `StressIGVN` flag LoadRangeNode is processed after InitializeNode is removed from graph but AllocateArrayNode is not. We hit assert because of that. The fix replaces the assert with check. Tested with replay file from bug report. I was not able to reproduce failure with standalone test because it is hard to force LoadRangeNode processing at right time. I attached to bug report a test which work on. 
Testing tier1-3,xcomp ------------- Commit messages: - 8290246: test fails "assert(init != __null) failed: initialization not found" Changes: https://git.openjdk.org/jdk/pull/9497/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=9497&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8290246 Stats: 6 lines in 1 file changed: 4 ins; 0 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/9497.diff Fetch: git fetch https://git.openjdk.org/jdk pull/9497/head:pull/9497 PR: https://git.openjdk.org/jdk/pull/9497 From jbhateja at openjdk.org Thu Jul 14 18:33:50 2022 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Thu, 14 Jul 2022 18:33:50 GMT Subject: RFR: 8290322: Optimize Vector.rearrange over byte vectors for AVX512BW targets. Message-ID: Hi All, Currently, rearrange over 512-bit byte vectors is optimized for targets supporting the AVX512_VBMI feature; this patch generates an efficient JIT sequence to handle it for AVX512BW targets. The following performance results with the newly added benchmark show a significant speedup. System: Intel(R) Xeon(R) Platinum 8280 CPU @ 2.70GHz (CascadeLake 28C 2S)

Baseline:
=========
Benchmark                                     (size)   Mode  Cnt      Score   Error   Units
RearrangeBytesBenchmark.testRearrangeBytes16     512  thrpt    2  16350.330          ops/ms
RearrangeBytesBenchmark.testRearrangeBytes32     512  thrpt    2  15991.346          ops/ms
RearrangeBytesBenchmark.testRearrangeBytes64     512  thrpt    2     34.423          ops/ms
RearrangeBytesBenchmark.testRearrangeBytes8      512  thrpt    2  10873.348          ops/ms

With-opt:
=========
Benchmark                                     (size)   Mode  Cnt      Score   Error   Units
RearrangeBytesBenchmark.testRearrangeBytes16     512  thrpt    2  16062.624          ops/ms
RearrangeBytesBenchmark.testRearrangeBytes32     512  thrpt    2  16028.494          ops/ms
RearrangeBytesBenchmark.testRearrangeBytes64     512  thrpt    2   8741.901          ops/ms
RearrangeBytesBenchmark.testRearrangeBytes8      512  thrpt    2  10983.226          ops/ms

Kindly review and share your feedback. Best Regards, Jatin ------------- Commit messages: - 8290322: Optimize Vector.rearrange over byte vectors for AVX512BW targets.
Changes: https://git.openjdk.org/jdk/pull/9498/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=9498&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8290322 Stats: 197 lines in 4 files changed: 193 ins; 2 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/9498.diff Fetch: git fetch https://git.openjdk.org/jdk pull/9498/head:pull/9498 PR: https://git.openjdk.org/jdk/pull/9498 From dlong at openjdk.org Thu Jul 14 19:23:58 2022 From: dlong at openjdk.org (Dean Long) Date: Thu, 14 Jul 2022 19:23:58 GMT Subject: [jdk19] RFR: 8289954: C2: Assert failed in PhaseCFG::verify() after JDK-8183390 In-Reply-To: <_EpfqAojmiLCRvA4JnBLQ5ziiBVyKmnPVFHEWbmExV4=.2c327de0-c159-482b-951b-54f4595164cb@github.com> References: <_EpfqAojmiLCRvA4JnBLQ5ziiBVyKmnPVFHEWbmExV4=.2c327de0-c159-482b-951b-54f4595164cb@github.com> Message-ID: On Thu, 14 Jul 2022 06:26:06 GMT, Pengfei Li wrote: >> @dean-long Do you have any comments or suggestions on this? The failure was reported from your fuzzer test. > >> @pfustc Sorry, I'm not enough of an expert on SuperWord to review the fix. The test was generated automatically by Java Fuzzer. > > May I ask how do you generate and run the Fuzzer tests? Is there any instructions we can follow? Recently we see a couple of SuperWord issues reported by corner cases which are generated by the Fuzzer. @pfustc take a look at https://github.com/AzulSystems/JavaFuzzer. 
------------- PR: https://git.openjdk.org/jdk19/pull/130 From dlong at openjdk.org Thu Jul 14 19:51:02 2022 From: dlong at openjdk.org (Dean Long) Date: Thu, 14 Jul 2022 19:51:02 GMT Subject: RFR: 8290246: test fails "assert(init != __null) failed: initialization not found" In-Reply-To: <_yDaLQbub4f4pLrQChlEnxceYbu6naNOuQH5kgYs9lY=.fd01b7ec-8de2-49a9-9a70-06c5ff563521@github.com> References: <_yDaLQbub4f4pLrQChlEnxceYbu6naNOuQH5kgYs9lY=.fd01b7ec-8de2-49a9-9a70-06c5ff563521@github.com> Message-ID: On Thu, 14 Jul 2022 18:09:49 GMT, Vladimir Kozlov wrote: > CTW test (which compiles methods without running them - no profiling) failed when run with stress flag and particular RNG seed `-XX:+StressIGVN -XX:StressSeed=1743550013`. The failure is intermittent because of RNG. > > The compiled method [BasicPopupMenuUI$BasicMenuKeyListener::menuKeyPressed()](https://github.com/openjdk/jdk/blob/master/src/java.desktop/share/classes/javax/swing/plaf/basic/BasicPopupMenuUI.java#L331) has allocation in loop at line [L360](https://github.com/openjdk/jdk/blob/master/src/java.desktop/share/classes/javax/swing/plaf/basic/BasicPopupMenuUI.java#L360) which is followed by call `item.isArmed()` at line L366. This call is not "linked" and uncommon trap is generated for it. As result the allocation result become un-used. > Due to shuffling done with `StressIGVN` flag LoadRangeNode is processed after InitializeNode is removed from graph but AllocateArrayNode is not. We hit assert because of that. > > The fix replaces the assert with check. > > Tested with replay file from bug report. I was not able to reproduce failure with standalone test because it is hard to force LoadRangeNode processing at right time. I attached to bug report a test which work on. > > Testing tier1-3,xcomp src/hotspot/share/opto/callnode.cpp line 1633: > 1631: if (init == NULL) { > 1632: return NULL; // Return NULL if dead path > 1633: } Can all callers deal with NULL? 
It looks like GraphKit::array_ideal_length() will crash if make_ideal_length() returns NULL. This fix looks different from what you proposed in the bug report: > Or attach CastII(length) to Allocate node control projection if Initialize node is not present. As I said before we may have legal cases like that (as I remember when working on EA). ------------- PR: https://git.openjdk.org/jdk/pull/9497 From kvn at openjdk.org Thu Jul 14 20:35:06 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Thu, 14 Jul 2022 20:35:06 GMT Subject: RFR: 8290246: test fails "assert(init != __null) failed: initialization not found" In-Reply-To: References: <_yDaLQbub4f4pLrQChlEnxceYbu6naNOuQH5kgYs9lY=.fd01b7ec-8de2-49a9-9a70-06c5ff563521@github.com> Message-ID: On Thu, 14 Jul 2022 19:47:10 GMT, Dean Long wrote: >> CTW test (which compiles methods without running them - no profiling) failed when run with stress flag and particular RNG seed `-XX:+StressIGVN -XX:StressSeed=1743550013`. The failure is intermittent because of RNG. >> >> The compiled method [BasicPopupMenuUI$BasicMenuKeyListener::menuKeyPressed()](https://github.com/openjdk/jdk/blob/master/src/java.desktop/share/classes/javax/swing/plaf/basic/BasicPopupMenuUI.java#L331) has allocation in loop at line [L360](https://github.com/openjdk/jdk/blob/master/src/java.desktop/share/classes/javax/swing/plaf/basic/BasicPopupMenuUI.java#L360) which is followed by call `item.isArmed()` at line L366. This call is not "linked" and uncommon trap is generated for it. As result the allocation result become un-used. >> Due to shuffling done with `StressIGVN` flag LoadRangeNode is processed after InitializeNode is removed from graph but AllocateArrayNode is not. We hit assert because of that. >> >> The fix replaces the assert with check. >> >> Tested with replay file from bug report. I was not able to reproduce failure with standalone test because it is hard to force LoadRangeNode processing at right time. 
I attached to bug report a test which work on.
>>
>> Testing tier1-3,xcomp
>
> src/hotspot/share/opto/callnode.cpp line 1633:
>
>> 1631:   if (init == NULL) {
>> 1632:     return NULL; // Return NULL if dead path
>> 1633:   }
>
> Can all callers deal with NULL? It looks like GraphKit::array_ideal_length() will crash if make_ideal_length() returns NULL.
>
> This fix looks different from what you proposed in the bug report:
>> Or attach CastII(length) to Allocate node control projection if Initialize node is not present. As I said before we may have legal cases like that (as I remember when working on EA).

You are right, I should return the original `length` instead, without the Cast. I thought more about my original proposal to attach a new CastII(length) to the allocation's control projection, but decided that it could be risky. So I decided to keep the original length, but made a mistake by returning NULL instead. I will update the fix.

-------------

PR: https://git.openjdk.org/jdk/pull/9497

From kvn at openjdk.org Thu Jul 14 20:46:04 2022
From: kvn at openjdk.org (Vladimir Kozlov)
Date: Thu, 14 Jul 2022 20:46:04 GMT
Subject: RFR: 8290246: test fails "assert(init != __null) failed: initialization not found" [v2]
In-Reply-To: <_yDaLQbub4f4pLrQChlEnxceYbu6naNOuQH5kgYs9lY=.fd01b7ec-8de2-49a9-9a70-06c5ff563521@github.com>
References: <_yDaLQbub4f4pLrQChlEnxceYbu6naNOuQH5kgYs9lY=.fd01b7ec-8de2-49a9-9a70-06c5ff563521@github.com>
Message-ID:

> CTW test (which compiles methods without running them - no profiling) failed when run with stress flag and particular RNG seed `-XX:+StressIGVN -XX:StressSeed=1743550013`. The failure is intermittent because of RNG.
> > The compiled method [BasicPopupMenuUI$BasicMenuKeyListener::menuKeyPressed()](https://github.com/openjdk/jdk/blob/master/src/java.desktop/share/classes/javax/swing/plaf/basic/BasicPopupMenuUI.java#L331) has allocation in loop at line [L360](https://github.com/openjdk/jdk/blob/master/src/java.desktop/share/classes/javax/swing/plaf/basic/BasicPopupMenuUI.java#L360) which is followed by call `item.isArmed()` at line L366. This call is not "linked" and uncommon trap is generated for it. As result the allocation result become un-used. > Due to shuffling done with `StressIGVN` flag LoadRangeNode is processed after InitializeNode is removed from graph but AllocateArrayNode is not. We hit assert because of that. > > The fix replaces the assert with check. > > Tested with replay file from bug report. I was not able to reproduce failure with standalone test because it is hard to force LoadRangeNode processing at right time. I attached to bug report a test which work on. > > Testing tier1-3,xcomp Vladimir Kozlov has updated the pull request incrementally with one additional commit since the last revision: Return original length when InitializeNode is absent ------------- Changes: - all: https://git.openjdk.org/jdk/pull/9497/files - new: https://git.openjdk.org/jdk/pull/9497/files/fdf46971..73b101e2 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=9497&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=9497&range=00-01 Stats: 5 lines in 1 file changed: 1 ins; 2 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/9497.diff Fetch: git fetch https://git.openjdk.org/jdk pull/9497/head:pull/9497 PR: https://git.openjdk.org/jdk/pull/9497 From kvn at openjdk.org Thu Jul 14 20:54:05 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Thu, 14 Jul 2022 20:54:05 GMT Subject: RFR: 8290322: Optimize Vector.rearrange over byte vectors for AVX512BW targets. 
In-Reply-To:
References:
Message-ID:

On Thu, 14 Jul 2022 18:23:51 GMT, Jatin Bhateja wrote:

> Hi All,
>
> Currently re-arrange over 512bit bytevector is optimized for targets supporting AVX512_VBMI feature, this patch generates efficient JIT sequence to handle it for AVX512BW targets. Following performance results with newly added benchmark shows
> significant speedup.
>
> System: Intel(R) Xeon(R) Platinum 8280 CPU @ 2.70GHz (CascadeLake 28C 2S)
>
>
> Baseline:
> =========
> Benchmark                                     (size)   Mode  Cnt      Score  Error   Units
> RearrangeBytesBenchmark.testRearrangeBytes16     512  thrpt    2  16350.330         ops/ms
> RearrangeBytesBenchmark.testRearrangeBytes32     512  thrpt    2  15991.346         ops/ms
> RearrangeBytesBenchmark.testRearrangeBytes64     512  thrpt    2     34.423         ops/ms
> RearrangeBytesBenchmark.testRearrangeBytes8      512  thrpt    2  10873.348         ops/ms
>
>
> With-opt:
> =========
> Benchmark                                     (size)   Mode  Cnt      Score  Error   Units
> RearrangeBytesBenchmark.testRearrangeBytes16     512  thrpt    2  16062.624         ops/ms
> RearrangeBytesBenchmark.testRearrangeBytes32     512  thrpt    2  16028.494         ops/ms
> RearrangeBytesBenchmark.testRearrangeBytes64     512  thrpt    2   8741.901         ops/ms
> RearrangeBytesBenchmark.testRearrangeBytes8      512  thrpt    2  10983.226         ops/ms
>
>
> Kindly review and share your feedback.
>
> Best Regards,
> Jatin

Looks good.

-------------

PR: https://git.openjdk.org/jdk/pull/9498

From dlong at openjdk.org Thu Jul 14 22:23:02 2022
From: dlong at openjdk.org (Dean Long)
Date: Thu, 14 Jul 2022 22:23:02 GMT
Subject: RFR: 8290246: test fails "assert(init != __null) failed: initialization not found" [v2]
In-Reply-To:
References: <_yDaLQbub4f4pLrQChlEnxceYbu6naNOuQH5kgYs9lY=.fd01b7ec-8de2-49a9-9a70-06c5ff563521@github.com>
Message-ID:

On Thu, 14 Jul 2022 20:46:04 GMT, Vladimir Kozlov wrote:

>> CTW test (which compiles methods without running them - no profiling) failed when run with stress flag and particular RNG seed `-XX:+StressIGVN -XX:StressSeed=1743550013`. The failure is intermittent because of RNG.
>> >> The compiled method [BasicPopupMenuUI$BasicMenuKeyListener::menuKeyPressed()](https://github.com/openjdk/jdk/blob/master/src/java.desktop/share/classes/javax/swing/plaf/basic/BasicPopupMenuUI.java#L331) has allocation in loop at line [L360](https://github.com/openjdk/jdk/blob/master/src/java.desktop/share/classes/javax/swing/plaf/basic/BasicPopupMenuUI.java#L360) which is followed by call `item.isArmed()` at line L366. This call is not "linked" and uncommon trap is generated for it. As result the allocation result become un-used. >> Due to shuffling done with `StressIGVN` flag LoadRangeNode is processed after InitializeNode is removed from graph but AllocateArrayNode is not. We hit assert because of that. >> >> The fix replaces the assert with check. >> >> Tested with replay file from bug report. I was not able to reproduce failure with standalone test because it is hard to force LoadRangeNode processing at right time. I attached to bug report a test which work on. >> >> Testing tier1-3,xcomp > > Vladimir Kozlov has updated the pull request incrementally with one additional commit since the last revision: > > Return original length when InitializeNode is absent Marked as reviewed by dlong (Reviewer). ------------- PR: https://git.openjdk.org/jdk/pull/9497 From kvn at openjdk.org Thu Jul 14 23:04:02 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Thu, 14 Jul 2022 23:04:02 GMT Subject: RFR: 8290246: test fails "assert(init != __null) failed: initialization not found" [v2] In-Reply-To: References: <_yDaLQbub4f4pLrQChlEnxceYbu6naNOuQH5kgYs9lY=.fd01b7ec-8de2-49a9-9a70-06c5ff563521@github.com> Message-ID: <9FJn1u2R2ErcxZOo_9TQWxetKGQvsNXNZ6loG8ODj2I=.e97a8ee2-09cf-4c49-b741-5240a3c165f2@github.com> On Thu, 14 Jul 2022 20:46:04 GMT, Vladimir Kozlov wrote: >> CTW test (which compiles methods without running them - no profiling) failed when run with stress flag and particular RNG seed `-XX:+StressIGVN -XX:StressSeed=1743550013`. 
The failure is intermittent because of RNG. >> >> The compiled method [BasicPopupMenuUI$BasicMenuKeyListener::menuKeyPressed()](https://github.com/openjdk/jdk/blob/master/src/java.desktop/share/classes/javax/swing/plaf/basic/BasicPopupMenuUI.java#L331) has allocation in loop at line [L360](https://github.com/openjdk/jdk/blob/master/src/java.desktop/share/classes/javax/swing/plaf/basic/BasicPopupMenuUI.java#L360) which is followed by call `item.isArmed()` at line L366. This call is not "linked" and uncommon trap is generated for it. As result the allocation result become un-used. >> Due to shuffling done with `StressIGVN` flag LoadRangeNode is processed after InitializeNode is removed from graph but AllocateArrayNode is not. We hit assert because of that. >> >> The fix replaces the assert with check. >> >> Tested with replay file from bug report. I was not able to reproduce failure with standalone test because it is hard to force LoadRangeNode processing at right time. I attached to bug report a test which work on. >> >> Testing tier1-3,xcomp > > Vladimir Kozlov has updated the pull request incrementally with one additional commit since the last revision: > > Return original length when InitializeNode is absent Thank you, Dean ------------- PR: https://git.openjdk.org/jdk/pull/9497 From xliu at openjdk.org Thu Jul 14 23:36:05 2022 From: xliu at openjdk.org (Xin Liu) Date: Thu, 14 Jul 2022 23:36:05 GMT Subject: RFR: 8289943: Simplify some object allocation merges In-Reply-To: References: Message-ID: On Tue, 7 Jun 2022 23:24:02 GMT, Cesar Soares wrote: > Hi there, can I please get some feedback on this approach to simplify object allocation merges in order to promote Scalar Replacement of the objects involved in the merge? 
> > The basic idea for this [approach was discussed in this thread](https://mail.openjdk.java.net/pipermail/hotspot-compiler-dev/2022-April/055189.html) and it consists of: > 1) Identify Phi nodes that merge object allocations and replace them with a new IR node called ReducedAllocationMergeNode (RAM node). > 2) Scalar Replace the incoming allocations to the RAM node. > 3) Scalar Replace the RAM node itself. > > There are a few conditions for doing the replacement of the Phi by a RAM node though - Although I plan to work on removing them in subsequent PRs: > > - ~~The original Phi node should be merging Allocate nodes in all inputs.~~ > - The only supported users of the original Phi are AddP->Load, SafePoints/Traps, DecodeN. > > These are the critical parts of the implementation and I'd appreciate it very much if you could tell me if what I implemented isn't violating any C2 IR constraints: > > - The way I identify/use the memory edges that will be used to find the last stored values to the merged object fields. > - The way I check if there is an incoming Allocate node to the original Phi node. > - The way I check if there is no store to the merged objects after they are merged. > > Testing: > - Linux. 
fastdebug -> hotspot_all, renaissance, dacapo

I run into an error

$make test TEST=compiler/c2/irTests/scalarReplacement/AllocationMergesTests.java CONF=linux-x86_64-server-fastdebug

One or more @IR rules failed:

Failed IR Rules (1) of Methods (1)
----------------------------------
1) Method "int compiler.c2.irTests.scalarReplacement.AllocationMergesTests.testPollutedPolymorphic(boolean,int)" - [Failed IR rules: 1]:
   * @IR rule 1: "@compiler.lib.ir_framework.IR(applyIf={}, applyIfAnd={}, failOn={}, applyIfOr={}, counts={"(.*precise .*\\R((.*(?i:mov|xorl|nop|spill).*|\\s*|.*LGHI.*)\\R)*.*(?i:call,static).*wrapper for: _new_instance_Java)", "2"}, applyIfNot={})"
     - counts: Graph contains wrong number of nodes:
       * Regex 1: (.*precise .*\R((.*(?i:mov|xorl|nop|spill).*|\s*|.*LGHI.*)\R)*.*(?i:call,static).*wrapper for: _new_instance_Java)
         - Failed comparison: [found] 0 = 2 [given]
         - No nodes matched!

>>> Check stdout for compilation output of the failed methods

#############################################################
 - To only run the failed tests use -DTest, -DExclude,
   and/or -DScenarios.
 - To also get the standard output of the test VM run with
   -DReportStdout=true or for even more fine-grained logging
   use -DVerbose=true.
#############################################################

compiler.lib.ir_framework.driver.irmatching.IRViolationException: There were one or multiple IR rule failures. Please check stderr for more information.
        at compiler.lib.ir_framework.driver.irmatching.IRMatcher.throwIfNoSafepointWhilePrinting(IRMatcher.java:91)
        at compiler.lib.ir_framework.driver.irmatching.IRMatcher.reportFailures(IRMatcher.java:82)
        at compiler.lib.ir_framework.driver.irmatching.IRMatcher.applyIRRules(IRMatcher.java:54)
        at compiler.lib.ir_framework.driver.irmatching.IRMatcher.<init>(IRMatcher.java:43)
        at compiler.lib.ir_framework.TestFramework.runTestVM(TestFramework.java:729)
        at compiler.lib.ir_framework.TestFramework.start(TestFramework.java:698)
        at compiler.lib.ir_framework.TestFramework.start(TestFramework.java:329)
        at compiler.lib.ir_framework.TestFramework.runWithFlags(TestFramework.java:237)
        at compiler.c2.irTests.scalarReplacement.AllocationMergesTests.main(AllocationMergesTests.java:37)
        at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:104)
        at java.base/java.lang.reflect.Method.invoke(Method.java:578)
        at com.sun.javatest.regtest.agent.MainActionHelper$AgentVMRunnable.run(MainActionHelper.java:312)
        at java.base/java.lang.Thread.run(Thread.java:1596)

2 objects do get scalarized in `testPollutedPolymorphic`.

-------------

PR: https://git.openjdk.org/jdk/pull/9073

From iveresov at openjdk.org Thu Jul 14 23:41:59 2022
From: iveresov at openjdk.org (Igor Veresov)
Date: Thu, 14 Jul 2022 23:41:59 GMT
Subject: RFR: 8290246: test fails "assert(init != __null) failed: initialization not found" [v2]
In-Reply-To:
References: <_yDaLQbub4f4pLrQChlEnxceYbu6naNOuQH5kgYs9lY=.fd01b7ec-8de2-49a9-9a70-06c5ff563521@github.com>
Message-ID:

On Thu, 14 Jul 2022 20:46:04 GMT, Vladimir Kozlov wrote:

>> CTW test (which compiles methods without running them - no profiling) failed when run with stress flag and particular RNG seed `-XX:+StressIGVN -XX:StressSeed=1743550013`. The failure is intermittent because of RNG.
>> >> The compiled method [BasicPopupMenuUI$BasicMenuKeyListener::menuKeyPressed()](https://github.com/openjdk/jdk/blob/master/src/java.desktop/share/classes/javax/swing/plaf/basic/BasicPopupMenuUI.java#L331) has allocation in loop at line [L360](https://github.com/openjdk/jdk/blob/master/src/java.desktop/share/classes/javax/swing/plaf/basic/BasicPopupMenuUI.java#L360) which is followed by call `item.isArmed()` at line L366. This call is not "linked" and uncommon trap is generated for it. As result the allocation result become un-used. >> Due to shuffling done with `StressIGVN` flag LoadRangeNode is processed after InitializeNode is removed from graph but AllocateArrayNode is not. We hit assert because of that. >> >> The fix replaces the assert with check. >> >> Tested with replay file from bug report. I was not able to reproduce failure with standalone test because it is hard to force LoadRangeNode processing at right time. I attached to bug report a test which work on. >> >> Testing tier1-3,xcomp > > Vladimir Kozlov has updated the pull request incrementally with one additional commit since the last revision: > > Return original length when InitializeNode is absent Marked as reviewed by iveresov (Reviewer). ------------- PR: https://git.openjdk.org/jdk/pull/9497 From xliu at openjdk.org Fri Jul 15 00:58:54 2022 From: xliu at openjdk.org (Xin Liu) Date: Fri, 15 Jul 2022 00:58:54 GMT Subject: RFR: 8289943: Simplify some object allocation merges In-Reply-To: References: Message-ID: On Tue, 7 Jun 2022 23:24:02 GMT, Cesar Soares wrote: > Hi there, can I please get some feedback on this approach to simplify object allocation merges in order to promote Scalar Replacement of the objects involved in the merge? 
>
> The basic idea for this [approach was discussed in this thread](https://mail.openjdk.java.net/pipermail/hotspot-compiler-dev/2022-April/055189.html) and it consists of:
> 1) Identify Phi nodes that merge object allocations and replace them with a new IR node called ReducedAllocationMergeNode (RAM node).
> 2) Scalar Replace the incoming allocations to the RAM node.
> 3) Scalar Replace the RAM node itself.
>
> There are a few conditions for doing the replacement of the Phi by a RAM node though - Although I plan to work on removing them in subsequent PRs:
>
> - ~~The original Phi node should be merging Allocate nodes in all inputs.~~
> - The only supported users of the original Phi are AddP->Load, SafePoints/Traps, DecodeN.
>
> These are the critical parts of the implementation and I'd appreciate it very much if you could tell me if what I implemented isn't violating any C2 IR constraints:
>
> - The way I identify/use the memory edges that will be used to find the last stored values to the merged object fields.
> - The way I check if there is an incoming Allocate node to the original Phi node.
> - The way I check if there is no store to the merged objects after they are merged.
>
> Testing:
> - Linux. fastdebug -> hotspot_all, renaissance, dacapo

src/hotspot/share/opto/c2_globals.hpp line 474:

> 472: " register allocation.") \
> 473: \
> 474: product(bool, ReduceAllocationMerges, false, \

we need enable it anyway.

src/hotspot/share/opto/callnode.cpp line 1704:

> 1702: }
> 1703:
> 1704: bool ReducedAllocationMergeNode::register_use(Node* n) {

I think register_use/addp() don't fail in runtime. it will simplify your code.

src/hotspot/share/opto/callnode.hpp line 1031:

> 1029: // In some cases a reference to the whole merged object is needed and
> 1030: // we handle that by creating an SafePointScalarObjectNode.
> 1031: class ReducedAllocationMergeNode : public TypeNode {

Is that possible to make it just subclass of PhiNode? if so, you wouldn't need to touch cfgnode.[hpp/cpp]

src/hotspot/share/opto/escape.cpp line 647:

> 645: _igvn->_worklist.push(ram);
> 646:
> 647: // if (n->_idx == 257 && ram->_idx == 1239) {

some mysterious code? :)

-------------

PR: https://git.openjdk.org/jdk/pull/9073

From fgao at openjdk.org Fri Jul 15 01:23:05 2022
From: fgao at openjdk.org (Fei Gao)
Date: Fri, 15 Jul 2022 01:23:05 GMT
Subject: RFR: 8288883: C2: assert(allow_address || t != T_ADDRESS) failed after JDK-8283091
In-Reply-To:
References:
Message-ID: <2W6GMUP5HUpdi6E1LvIqTJAdaw5DVBKnPbmNO6FbHZQ=.20619693-2f98-405e-8df1-a9fd35d02cb9@github.com>

On Thu, 14 Jul 2022 09:30:26 GMT, Martin Doerr wrote:

> Your fix LGTM. The test doesn't show the problem on PPC64, but my original replay file has worked to verify the fix.

Thanks for your review and verification, @TheRealMDoerr .

-------------

PR: https://git.openjdk.org/jdk/pull/9391

From pli at openjdk.org Fri Jul 15 02:56:06 2022
From: pli at openjdk.org (Pengfei Li)
Date: Fri, 15 Jul 2022 02:56:06 GMT
Subject: [jdk19] RFR: 8289954: C2: Assert failed in PhaseCFG::verify() after JDK-8183390
In-Reply-To: <_EpfqAojmiLCRvA4JnBLQ5ziiBVyKmnPVFHEWbmExV4=.2c327de0-c159-482b-951b-54f4595164cb@github.com>
References: <_EpfqAojmiLCRvA4JnBLQ5ziiBVyKmnPVFHEWbmExV4=.2c327de0-c159-482b-951b-54f4595164cb@github.com>
Message-ID: <7V3HvH8UjENBhZlk2wvjAdhF1OSBa1InKWwIrGd5AE0=.e3e3730a-ad2c-4458-b109-0d042fcde570@github.com>

On Thu, 14 Jul 2022 06:26:06 GMT, Pengfei Li wrote:

>> @dean-long Do you have any comments or suggestions on this? The failure was reported from your fuzzer test.
>
>> @pfustc Sorry, I'm not enough of an expert on SuperWord to review the fix. The test was generated automatically by Java Fuzzer.
>
> May I ask how do you generate and run the Fuzzer tests? Is there any instructions we can follow? Recently we see a couple of SuperWord issues reported by corner cases which are generated by the Fuzzer.
> @pfustc take a look at https://github.com/AzulSystems/JavaFuzzer. Thanks Dean. I will investigate that project. ------------- PR: https://git.openjdk.org/jdk19/pull/130 From kvn at openjdk.org Fri Jul 15 05:10:50 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Fri, 15 Jul 2022 05:10:50 GMT Subject: Integrated: 8290246: test fails "assert(init != __null) failed: initialization not found" In-Reply-To: <_yDaLQbub4f4pLrQChlEnxceYbu6naNOuQH5kgYs9lY=.fd01b7ec-8de2-49a9-9a70-06c5ff563521@github.com> References: <_yDaLQbub4f4pLrQChlEnxceYbu6naNOuQH5kgYs9lY=.fd01b7ec-8de2-49a9-9a70-06c5ff563521@github.com> Message-ID: On Thu, 14 Jul 2022 18:09:49 GMT, Vladimir Kozlov wrote: > CTW test (which compiles methods without running them - no profiling) failed when run with stress flag and particular RNG seed `-XX:+StressIGVN -XX:StressSeed=1743550013`. The failure is intermittent because of RNG. > > The compiled method [BasicPopupMenuUI$BasicMenuKeyListener::menuKeyPressed()](https://github.com/openjdk/jdk/blob/master/src/java.desktop/share/classes/javax/swing/plaf/basic/BasicPopupMenuUI.java#L331) has allocation in loop at line [L360](https://github.com/openjdk/jdk/blob/master/src/java.desktop/share/classes/javax/swing/plaf/basic/BasicPopupMenuUI.java#L360) which is followed by call `item.isArmed()` at line L366. This call is not "linked" and uncommon trap is generated for it. As result the allocation result become un-used. > Due to shuffling done with `StressIGVN` flag LoadRangeNode is processed after InitializeNode is removed from graph but AllocateArrayNode is not. We hit assert because of that. > > The fix replaces the assert with check. > > Tested with replay file from bug report. I was not able to reproduce failure with standalone test because it is hard to force LoadRangeNode processing at right time. I attached to bug report a test which work on. > > Testing tier1-3,xcomp This pull request has now been integrated. 

Changeset: 70fce07a
Author: Vladimir Kozlov
URL: https://git.openjdk.org/jdk/commit/70fce07a382896a8091413d7269bb16f33122505
Stats: 7 lines in 1 file changed: 3 ins; 0 del; 4 mod

8290246: test fails "assert(init != __null) failed: initialization not found"

Reviewed-by: dlong, iveresov

-------------

PR: https://git.openjdk.org/jdk/pull/9497

From kvn at openjdk.org Fri Jul 15 05:18:50 2022
From: kvn at openjdk.org (Vladimir Kozlov)
Date: Fri, 15 Jul 2022 05:18:50 GMT
Subject: RFR: 8288107: Auto-vectorization for integer min/max
In-Reply-To:
References:
Message-ID: <2V-kIyntNK7S3v8OO38stJPgTDVHqBD4f4WFpqjTMD8=.89c4c28b-fb33-4da8-9b44-128c5cf3eb88@github.com>

On Tue, 12 Jul 2022 11:45:28 GMT, Bhavana-Kilambi wrote:

> When Math.min/max is invoked on integer arrays, it generates the CMP-CMOVE instructions instead of vectorizing the loop(if vectorizable and relevant ISA is available) using vector equivalent of min/max instructions. Emitting MaxI/MinI nodes instead of Cmp/CmoveI nodes results in the loop getting vectorized eventually and the architecture specific min/max vector instructions are generated.
> A test for the same to test the performance of Math.max/min and StrictMath.max/min is added. On aarch64, the smin/smax instructions are generated when the loop is vectorized. On x86-64, vectorization support for min/max is available only in SSE4 (pmaxsd/pminsd are generated) and AVX version >= 1 (vpmaxsd/vpminsd are generated). This patch generates these instructions only when the loop is vectorized and generates the usual cmp-cmove instructions when the loop is not vectorizable or when the max/min operations are called outside of the loop. Performance comparisons for the VectorIntMinMax.java test with and without the patch are given below :
>
> Before this patch:
> aarch64:
> Benchmark                         (length)  (seed)  Mode  Cnt     Score   Error  Units
> VectorIntMinMax.testMaxInt            2048       0  avgt   25  1593.510 ± 1.488  ns/op
> VectorIntMinMax.testMinInt            2048       0  avgt   25  1593.123 ± 1.365  ns/op
> VectorIntMinMax.testStrictMaxInt      2048       0  avgt   25  1593.112 ± 0.985  ns/op
> VectorIntMinMax.testStrictMinInt      2048       0  avgt   25  1593.290 ± 1.219  ns/op
>
> x86-64:
> Benchmark                         (length)  (seed)  Mode  Cnt     Score   Error  Units
> VectorIntMinMax.testMaxInt            2048       0  avgt   25  2084.717 ± 4.780  ns/op
> VectorIntMinMax.testMinInt            2048       0  avgt   25  2087.322 ± 4.158  ns/op
> VectorIntMinMax.testStrictMaxInt      2048       0  avgt   25  2084.568 ± 4.838  ns/op
> VectorIntMinMax.testStrictMinInt      2048       0  avgt   25  2086.595 ± 4.025  ns/op
>
> After this patch:
> aarch64:
> Benchmark                         (length)  (seed)  Mode  Cnt     Score   Error  Units
> VectorIntMinMax.testMaxInt            2048       0  avgt   25   323.911 ± 0.206  ns/op
> VectorIntMinMax.testMinInt            2048       0  avgt   25   324.084 ± 0.231  ns/op
> VectorIntMinMax.testStrictMaxInt      2048       0  avgt   25   323.892 ± 0.234  ns/op
> VectorIntMinMax.testStrictMinInt      2048       0  avgt   25   323.990 ± 0.295  ns/op
>
> x86-64:
> Benchmark                         (length)  (seed)  Mode  Cnt     Score   Error  Units
> VectorIntMinMax.testMaxInt            2048       0  avgt   25   387.639 ± 0.512  ns/op
> VectorIntMinMax.testMinInt            2048       0  avgt   25   387.999 ± 0.740  ns/op
> VectorIntMinMax.testStrictMaxInt      2048       0  avgt   25   387.605 ± 0.376  ns/op
> VectorIntMinMax.testStrictMinInt      2048       0  avgt   25   387.765 ± 0.498  ns/op
>
> With autovectorization, both the machines exhibit a significant performance gain. On both the machines the runtime is ~80% better than the case without the patch. Also ran the patch with -XX:-UseSuperWord to make sure the performance does not degrade in cases where vectorization does not happen. The performance numbers are shown below :
> aarch64:
> Benchmark                         (length)  (seed)  Mode  Cnt     Score   Error  Units
> VectorIntMinMax.testMaxInt            2048       0  avgt   25  1449.792 ± 1.072  ns/op
> VectorIntMinMax.testMinInt            2048       0  avgt   25  1450.636 ± 1.057  ns/op
> VectorIntMinMax.testStrictMaxInt      2048       0  avgt   25  1450.214 ± 1.093  ns/op
> VectorIntMinMax.testStrictMinInt      2048       0  avgt   25  1450.615 ± 1.098  ns/op
>
> x86-64:
> Benchmark                         (length)  (seed)  Mode  Cnt     Score   Error  Units
> VectorIntMinMax.testMaxInt            2048       0  avgt   25  2059.673 ± 4.726  ns/op
> VectorIntMinMax.testMinInt            2048       0  avgt   25  2059.853 ± 4.754  ns/op
> VectorIntMinMax.testStrictMaxInt      2048       0  avgt   25  2059.920 ± 4.658  ns/op
> VectorIntMinMax.testStrictMinInt      2048       0  avgt   25  2059.622 ± 4.768  ns/op
> There is no degradation when vectorization is disabled.
>
> This patch also implements Ideal transformations for the MaxINode which are similar to the ones defined for the MinINode to transform/optimize a couple of commonly occurring patterns such as -
> MaxI(x + c0, MaxI(y + c1, z)) ==> MaxI(AddI(x, MAX2(c0, c1)), z) when x == y
> MaxI(x + c0, y + c1) ==> AddI(x, MAX2(c0,c1)) when x == y

Performance results (specjvm2008, dacapo, renaissance) on linux-x64 are neutral, just small variations. So it is good.

-------------

PR: https://git.openjdk.org/jdk/pull/9466

From kvn at openjdk.org Fri Jul 15 05:10:50 2022
From: kvn at openjdk.org (Vladimir Kozlov)
Date: Fri, 15 Jul 2022 05:10:50 GMT
Subject: RFR: 8290246: test fails "assert(init != __null) failed: initialization not found" [v2]
In-Reply-To:
References: <_yDaLQbub4f4pLrQChlEnxceYbu6naNOuQH5kgYs9lY=.fd01b7ec-8de2-49a9-9a70-06c5ff563521@github.com>
Message-ID: <6DNV7rTtvXWPCPuE_SKk1yK3mhbKqYciAjxPTX29y6s=.0aad4f20-1fd3-4e47-b80e-8e96dc593663@github.com>

On Thu, 14 Jul 2022 20:46:04 GMT, Vladimir Kozlov wrote:

>> CTW test (which compiles methods without running them - no profiling) failed when run with stress flag and particular RNG seed `-XX:+StressIGVN -XX:StressSeed=1743550013`. The failure is intermittent because of RNG.
>> >> The compiled method [BasicPopupMenuUI$BasicMenuKeyListener::menuKeyPressed()](https://github.com/openjdk/jdk/blob/master/src/java.desktop/share/classes/javax/swing/plaf/basic/BasicPopupMenuUI.java#L331) has allocation in loop at line [L360](https://github.com/openjdk/jdk/blob/master/src/java.desktop/share/classes/javax/swing/plaf/basic/BasicPopupMenuUI.java#L360) which is followed by call `item.isArmed()` at line L366. This call is not "linked" and uncommon trap is generated for it. As result the allocation result become un-used. >> Due to shuffling done with `StressIGVN` flag LoadRangeNode is processed after InitializeNode is removed from graph but AllocateArrayNode is not. We hit assert because of that. >> >> The fix replaces the assert with check. >> >> Tested with replay file from bug report. I was not able to reproduce failure with standalone test because it is hard to force LoadRangeNode processing at right time. I attached to bug report a test which work on. >> >> Testing tier1-3,xcomp > > Vladimir Kozlov has updated the pull request incrementally with one additional commit since the last revision: > > Return original length when InitializeNode is absent Thank you, Igor. ------------- PR: https://git.openjdk.org/jdk/pull/9497 From pli at openjdk.org Fri Jul 15 08:02:08 2022 From: pli at openjdk.org (Pengfei Li) Date: Fri, 15 Jul 2022 08:02:08 GMT Subject: [jdk19] RFR: 8289954: C2: Assert failed in PhaseCFG::verify() after JDK-8183390 In-Reply-To: References: Message-ID: <3m9uniZCBv2EHhcakjjuLNJAx-BvFeA-oscBDxYr5a4=.1ff9f49d-a58f-414d-aa21-ddd81bbe071e@github.com> On Mon, 11 Jul 2022 08:41:21 GMT, Pengfei Li wrote: > Fuzzer tests report an assertion failure issue in C2 global code motion > phase. Git bisection shows the problem starts after our fix of post loop > vectorization (JDK-8183390). After some narrowing down work, we find it > is caused by below change in that patch. 
>
>
> @@ -422,14 +404,7 @@
>            cl->mark_passed_slp();
>          }
>          cl->mark_was_slp();
> -        if (cl->is_main_loop()) {
> -          cl->set_slp_max_unroll(local_loop_unroll_factor);
> -        } else if (post_loop_allowed) {
> -          if (!small_basic_type) {
> -            // avoid replication context for small basic types in programmable masked loops
> -            cl->set_slp_max_unroll(local_loop_unroll_factor);
> -          }
> -        }
> +        cl->set_slp_max_unroll(local_loop_unroll_factor);
>        }
>      }
>
>
> This change is in function `SuperWord::unrolling_analysis()`. AFAIK, it
> helps find a loop's max unroll count via some analysis. In the original
> code, we have loop type checks and the slp max unroll value is set for
> only some types of loops. But in JDK-8183390, the check was removed by
> mistake. In my current understanding, the slp max unroll value applies
> to slp candidate loops only - either main loops or RCE'd post loops -
> so that check shouldn't be removed. After restoring it we don't see the
> assertion failure any more.
>
> The new jtreg created in this patch can reproduce the failed assertion,
> which checks `def_block->dominates(block)` - the domination relationship
> of two blocks. But in the case, I found the blocks are in an unreachable
> inner loop, which I think ought to be optimized away in some previous C2
> phases. As I'm not quite familiar with the C2's global code motion, so
> far I still don't understand how slp max unroll count eventually causes
> that problem. This patch just restores the if condition which I removed
> incorrectly in JDK-8183390. But I still suspect that there is another
> hidden bug exists in C2. I would be glad if any reviewers can give me
> some guidance or suggestions.
>
> Tested hotspot::hotspot_all_no_apps, jdk::tier1~3 and langtools::tier1.

@rwestrel @TobiHartmann Would you like to review this fix for jdk19? The RDP1 will end in one week.
-------------

PR: https://git.openjdk.org/jdk19/pull/130

From pli at openjdk.org Fri Jul 15 08:16:44 2022
From: pli at openjdk.org (Pengfei Li)
Date: Fri, 15 Jul 2022 08:16:44 GMT
Subject: RFR: 8289996: Fix array range check hoisting for some scaled loop iv
Message-ID:

Recently we found some array range checks in loops are not hoisted by
C2's loop predication phase as expected. Below is a typical case.

  for (int i = 0; i < size; i++) {
    b[3 * i] = a[3 * i];
  }

Ideally, C2 can hoist the range check of an array access in loop if the
array index is a linear function of the loop's induction variable (iv).
Say, range check in `arr[exp]` can be hoisted if

  exp = k1 * iv + k2 + inv

where `k1` and `k2` are compile-time constants, and `inv` is an optional
loop invariant. But in above case, C2 igvn does some strength reduction
on the `MulINode` used to compute `3 * i`. It results in the linear index
expression not being recognized. So far we found 2 ideal transformations
that may affect linear expression recognition. They are

- `k * iv` --> `iv << m + iv << n` if k is the sum of 2 pow-of-2 values
- `k * iv` --> `iv << m - iv` if k+1 is a pow-of-2 value

To avoid range check hoisting and further optimizations being broken, we
have tried improving the linear recognition. But after some experiments,
we found complex and recursive pattern match does not always work well.
In this patch we propose to defer these 2 ideal transformations to the
phase of post loop igvn. In other words, these 2 strength reductions can
only be done after all loop optimizations are over.

Tested hotspot::hotspot_all_no_apps, jdk::tier1~3 and langtools::tier1.
We also tested the performance via JMH and see obvious improvement.

Benchmark                     Improvement
RangeCheckHoisting.ivScaled3  +21.2%
RangeCheckHoisting.ivScaled7  +6.6%

-------------

Commit messages:
 - 8289996: Fix array range check hoisting for some scaled loop iv

Changes: https://git.openjdk.org/jdk/pull/9508/files
 Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=9508&range=00
  Issue: https://bugs.openjdk.org/browse/JDK-8289996
  Stats: 164 lines in 3 files changed: 163 ins; 0 del; 1 mod
  Patch: https://git.openjdk.org/jdk/pull/9508.diff
  Fetch: git fetch https://git.openjdk.org/jdk pull/9508/head:pull/9508

PR: https://git.openjdk.org/jdk/pull/9508

From jiefu at openjdk.org Fri Jul 15 09:23:35 2022
From: jiefu at openjdk.org (Jie Fu)
Date: Fri, 15 Jul 2022 09:23:35 GMT
Subject: RFR: 8289801: [IR Framework] Add flags to whitelist which can be used to simulate a specific machine setup like UseAVX
Message-ID:

Hi all,

Please review this trivial change which adds `UseAVX, UseSSE and UseSVE` to the whitelist of IR test framework.

Thanks.
Best regards,
Jie

-------------

Commit messages:
 - 8289801: [IR Framework] Add flags to whitelist which can be used to simulate a specific machine setup like UseAVX

Changes: https://git.openjdk.org/jdk/pull/9509/files
 Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=9509&range=00
  Issue: https://bugs.openjdk.org/browse/JDK-8289801
  Stats: 3 lines in 1 file changed: 3 ins; 0 del; 0 mod
  Patch: https://git.openjdk.org/jdk/pull/9509.diff
  Fetch: git fetch https://git.openjdk.org/jdk pull/9509/head:pull/9509

PR: https://git.openjdk.org/jdk/pull/9509

From duke at openjdk.org Fri Jul 15 10:44:52 2022
From: duke at openjdk.org (Bhavana-Kilambi)
Date: Fri, 15 Jul 2022 10:44:52 GMT
Subject: RFR: 8288107: Auto-vectorization for integer min/max [v2]
In-Reply-To:
References:
Message-ID:

> When Math.min/max is invoked on integer arrays, it generates the CMP-CMOVE instructions instead of vectorizing the loop (if vectorizable and relevant ISA is available) using vector equivalent of min/max instructions.
Emitting MaxI/MinI nodes instead of Cmp/CmoveI nodes results in the loop getting vectorized eventually and the architecture specific min/max vector instructions are generated.
> A test for the same to test the performance of Math.max/min and StrictMath.max/min is added. On aarch64, the smin/smax instructions are generated when the loop is vectorized. On x86-64, vectorization support for min/max is available only in SSE4 (pmaxsd/pminsd are generated) and AVX version >= 1 (vpmaxsd/vpminsd are generated). This patch generates these instructions only when the loop is vectorized and generates the usual cmp-cmove instructions when the loop is not vectorizable or when the max/min operations are called outside of the loop. Performance comparisons for the VectorIntMinMax.java test with and without the patch are given below:
>
> Before this patch:
> aarch64:
> Benchmark                         (length)  (seed)  Mode  Cnt     Score   Error  Units
> VectorIntMinMax.testMaxInt            2048       0  avgt   25  1593.510 ± 1.488  ns/op
> VectorIntMinMax.testMinInt            2048       0  avgt   25  1593.123 ± 1.365  ns/op
> VectorIntMinMax.testStrictMaxInt      2048       0  avgt   25  1593.112 ± 0.985  ns/op
> VectorIntMinMax.testStrictMinInt      2048       0  avgt   25  1593.290 ± 1.219  ns/op
>
> x86-64:
> Benchmark                         (length)  (seed)  Mode  Cnt     Score   Error  Units
> VectorIntMinMax.testMaxInt            2048       0  avgt   25  2084.717 ± 4.780  ns/op
> VectorIntMinMax.testMinInt            2048       0  avgt   25  2087.322 ± 4.158  ns/op
> VectorIntMinMax.testStrictMaxInt      2048       0  avgt   25  2084.568 ± 4.838  ns/op
> VectorIntMinMax.testStrictMinInt      2048       0  avgt   25  2086.595 ± 4.025  ns/op
>
> After this patch:
> aarch64:
> Benchmark                         (length)  (seed)  Mode  Cnt     Score   Error  Units
> VectorIntMinMax.testMaxInt            2048       0  avgt   25   323.911 ± 0.206  ns/op
> VectorIntMinMax.testMinInt            2048       0  avgt   25   324.084 ± 0.231  ns/op
> VectorIntMinMax.testStrictMaxInt      2048       0  avgt   25   323.892 ± 0.234  ns/op
> VectorIntMinMax.testStrictMinInt      2048       0  avgt   25   323.990 ± 0.295  ns/op
>
> x86-64:
> Benchmark                         (length)  (seed)  Mode  Cnt     Score   Error  Units
> VectorIntMinMax.testMaxInt            2048       0  avgt   25   387.639 ± 0.512  ns/op
> VectorIntMinMax.testMinInt            2048       0  avgt   25   387.999 ± 0.740  ns/op
> VectorIntMinMax.testStrictMaxInt      2048       0  avgt   25   387.605 ± 0.376  ns/op
> VectorIntMinMax.testStrictMinInt      2048       0  avgt   25   387.765 ± 0.498  ns/op
>
> With autovectorization, both the machines exhibit a significant performance gain. On both the machines the runtime is ~80% better than the case without the patch. Also ran the patch with -XX:-UseSuperWord to make sure the performance does not degrade in cases where vectorization does not happen. The performance numbers are shown below:
> aarch64:
> Benchmark                         (length)  (seed)  Mode  Cnt     Score   Error  Units
> VectorIntMinMax.testMaxInt            2048       0  avgt   25  1449.792 ± 1.072  ns/op
> VectorIntMinMax.testMinInt            2048       0  avgt   25  1450.636 ± 1.057  ns/op
> VectorIntMinMax.testStrictMaxInt      2048       0  avgt   25  1450.214 ± 1.093  ns/op
> VectorIntMinMax.testStrictMinInt      2048       0  avgt   25  1450.615 ± 1.098  ns/op
>
> x86-64:
> Benchmark                         (length)  (seed)  Mode  Cnt     Score   Error  Units
> VectorIntMinMax.testMaxInt            2048       0  avgt   25  2059.673 ± 4.726  ns/op
> VectorIntMinMax.testMinInt            2048       0  avgt   25  2059.853 ± 4.754  ns/op
> VectorIntMinMax.testStrictMaxInt      2048       0  avgt   25  2059.920 ± 4.658  ns/op
> VectorIntMinMax.testStrictMinInt      2048       0  avgt   25  2059.622 ± 4.768  ns/op
> There is no degradation when vectorization is disabled.
>
> This patch also implements Ideal transformations for the MaxINode which are similar to the ones defined for the MinINode to transform/optimize a couple of commonly occurring patterns such as -
> MaxI(x + c0, MaxI(y + c1, z)) ==> MaxI(AddI(x, MAX2(c0, c1)), z) when x == y
> MaxI(x + c0, y + c1) ==> AddI(x, MAX2(c0,c1)) when x == y

Bhavana-Kilambi has updated the pull request incrementally with one additional commit since the last revision:

  8288107: Auto-vectorization for integer min/max

  When Math.min/max is invoked on integer arrays, it generates the CMP-CMOVE instructions instead of vectorizing the loop (if vectorizable and relevant ISA is available) using vector equivalent of min/max instructions. Emitting MaxI/MinI nodes instead of Cmp/CmoveI nodes results in the loop getting vectorized eventually and the architecture specific min/max vector instructions are generated.
  A test to assess the performance of Math.max/min and StrictMath.max/min is added. On aarch64, the smin/smax instructions are generated when the loop is vectorized. On x86-64, vectorization support for min/max operations is available only in SSE4 (where pmaxsd/pminsd are generated) and AVX version >= 1 (where vpmaxsd/vpminsd are generated). This patch generates these instructions only when the loop is vectorized. In cases where the loop is not vectorizable or when the Math.max/min operations are called outside of the loop, cmp-cmove instructions are generated (tested on aarch64, x86-64 machines which have cmp-cmove instructions defined for the scalar MaxI/MinI nodes). Performance comparisons for the VectorIntMinMax.java test with and without the patch are given below:

  Before this patch:
  aarch64:
  Benchmark                         (length)  (seed)  Mode  Cnt     Score   Error  Units
  VectorIntMinMax.testMaxInt            2048       0  avgt   25  1593.510 ± 1.488  ns/op
  VectorIntMinMax.testMinInt            2048       0  avgt   25  1593.123 ± 1.365  ns/op
  VectorIntMinMax.testStrictMaxInt      2048       0  avgt   25  1593.112 ± 0.985  ns/op
  VectorIntMinMax.testStrictMinInt      2048       0  avgt   25  1593.290 ± 1.219  ns/op

  x86-64:
  Benchmark                         (length)  (seed)  Mode  Cnt     Score   Error  Units
  VectorIntMinMax.testMaxInt            2048       0  avgt   25  2084.717 ± 4.780  ns/op
  VectorIntMinMax.testMinInt            2048       0  avgt   25  2087.322 ± 4.158  ns/op
  VectorIntMinMax.testStrictMaxInt      2048       0  avgt   25  2084.568 ± 4.838  ns/op
  VectorIntMinMax.testStrictMinInt      2048       0  avgt   25  2086.595 ± 4.025  ns/op

  After this patch:
  aarch64:
  Benchmark                         (length)  (seed)  Mode  Cnt     Score   Error  Units
  VectorIntMinMax.testMaxInt            2048       0  avgt   25   323.911 ± 0.206  ns/op
  VectorIntMinMax.testMinInt            2048       0  avgt   25   324.084 ± 0.231  ns/op
  VectorIntMinMax.testStrictMaxInt      2048       0  avgt   25   323.892 ± 0.234  ns/op
  VectorIntMinMax.testStrictMinInt      2048       0  avgt   25   323.990 ± 0.295  ns/op

  x86-64:
  Benchmark                         (length)  (seed)  Mode  Cnt     Score   Error  Units
  VectorIntMinMax.testMaxInt            2048       0  avgt   25   387.639 ± 0.512  ns/op
  VectorIntMinMax.testMinInt            2048       0  avgt   25   387.999 ± 0.740  ns/op
  VectorIntMinMax.testStrictMaxInt      2048       0  avgt   25   387.605 ± 0.376  ns/op
  VectorIntMinMax.testStrictMinInt      2048       0  avgt   25   387.765 ± 0.498  ns/op

  With auto-vectorization, both the machines exhibit a significant performance gain. On both the machines the runtime is ~80% better than the case without this patch. Also ran the patch with -XX:-UseSuperWord to make sure the performance does not degrade in cases where vectorization does not happen. The performance numbers are shown below:
  aarch64:
  Benchmark                         (length)  (seed)  Mode  Cnt     Score   Error  Units
  VectorIntMinMax.testMaxInt            2048       0  avgt   25  1449.792 ± 1.072  ns/op
  VectorIntMinMax.testMinInt            2048       0  avgt   25  1450.636 ± 1.057  ns/op
  VectorIntMinMax.testStrictMaxInt      2048       0  avgt   25  1450.214 ± 1.093  ns/op
  VectorIntMinMax.testStrictMinInt      2048       0  avgt   25  1450.615 ± 1.098  ns/op

  x86-64:
  Benchmark                         (length)  (seed)  Mode  Cnt     Score   Error  Units
  VectorIntMinMax.testMaxInt            2048       0  avgt   25  2059.673 ± 4.726  ns/op
  VectorIntMinMax.testMinInt            2048       0  avgt   25  2059.853 ± 4.754  ns/op
  VectorIntMinMax.testStrictMaxInt      2048       0  avgt   25  2059.920 ± 4.658  ns/op
  VectorIntMinMax.testStrictMinInt      2048       0  avgt   25  2059.622 ± 4.768  ns/op
  There is no degradation when vectorization is disabled.

-------------

Changes:
  - all: https://git.openjdk.org/jdk/pull/9466/files
  - new: https://git.openjdk.org/jdk/pull/9466/files/aa9b3f35..760f3b6d

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=jdk&pr=9466&range=01
 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=9466&range=00-01

  Stats: 174 lines in 4 files changed: 0 ins; 173 del; 1 mod
  Patch: https://git.openjdk.org/jdk/pull/9466.diff
  Fetch: git fetch https://git.openjdk.org/jdk pull/9466/head:pull/9466

PR: https://git.openjdk.org/jdk/pull/9466

From duke at openjdk.org Fri Jul 15 11:16:00 2022
From: duke at openjdk.org (Bhavana-Kilambi)
Date: Fri, 15 Jul 2022 11:16:00 GMT
Subject: RFR: 8288107: Auto-vectorization for integer min/max [v2]
In-Reply-To:
References:
Message-ID:

On Fri, 15 Jul 2022 10:44:52 GMT, Bhavana-Kilambi wrote:

>> When Math.min/max is invoked on integer arrays, it generates the CMP-CMOVE instructions instead of vectorizing the loop (if vectorizable and relevant ISA is available) using vector equivalent of min/max instructions. Emitting MaxI/MinI nodes instead of Cmp/CmoveI nodes results in the loop getting vectorized eventually and the architecture specific min/max vector instructions are generated.
>> A test to assess the performance of Math.max/min and StrictMath.max/min is added. On aarch64, the smin/smax instructions are generated when the loop is vectorized. On x86-64, vectorization support for min/max operations is available only in SSE4 (where pmaxsd/pminsd are generated) and AVX version >= 1 (where vpmaxsd/vpminsd are generated). This patch generates these instructions only when the loop is vectorized.
In cases where the loop is not vectorizable or when the Math.max/min operations are called outside of the loop, cmp-cmove instructions are generated (tested on aarch64, x86-64 machines which have cmp-cmove instructions defined for the scalar MaxI/MinI nodes). Performance comparisons for the VectorIntMinMax.java test with and without the patch are given below:
>>
>> Before this patch
>>
>> **aarch64:**
>>
>> Benchmark                         (length)  (seed)  Mode  Cnt     Score   Error  Units
>> VectorIntMinMax.testMaxInt            2048       0  avgt   25  1593.510 ± 1.488  ns/op
>> VectorIntMinMax.testMinInt            2048       0  avgt   25  1593.123 ± 1.365  ns/op
>> VectorIntMinMax.testStrictMaxInt      2048       0  avgt   25  1593.112 ± 0.985  ns/op
>> VectorIntMinMax.testStrictMinInt      2048       0  avgt   25  1593.290 ± 1.219  ns/op
>>
>> **x86-64:**
>>
>> Benchmark                         (length)  (seed)  Mode  Cnt     Score   Error  Units
>> VectorIntMinMax.testMaxInt            2048       0  avgt   25  2084.717 ± 4.780  ns/op
>> VectorIntMinMax.testMinInt            2048       0  avgt   25  2087.322 ± 4.158  ns/op
>> VectorIntMinMax.testStrictMaxInt      2048       0  avgt   25  2084.568 ± 4.838  ns/op
>> VectorIntMinMax.testStrictMinInt      2048       0  avgt   25  2086.595 ± 4.025  ns/op
>>
>> After this patch
>>
>> **aarch64:**
>>
>> Benchmark                         (length)  (seed)  Mode  Cnt     Score   Error  Units
>> VectorIntMinMax.testMaxInt            2048       0  avgt   25   323.911 ± 0.206  ns/op
>> VectorIntMinMax.testMinInt            2048       0  avgt   25   324.084 ± 0.231  ns/op
>> VectorIntMinMax.testStrictMaxInt      2048       0  avgt   25   323.892 ± 0.234  ns/op
>> VectorIntMinMax.testStrictMinInt      2048       0  avgt   25   323.990 ± 0.295  ns/op
>>
>> **x86-64:**
>>
>> Benchmark                         (length)  (seed)  Mode  Cnt     Score   Error  Units
>> VectorIntMinMax.testMaxInt            2048       0  avgt   25   387.639 ± 0.512  ns/op
>> VectorIntMinMax.testMinInt            2048       0  avgt   25   387.999 ± 0.740  ns/op
>> VectorIntMinMax.testStrictMaxInt      2048       0  avgt   25   387.605 ± 0.376  ns/op
>> VectorIntMinMax.testStrictMinInt      2048       0  avgt   25   387.765 ± 0.498  ns/op
>>
>> With auto-vectorization, both the machines exhibit a significant performance gain. On both the machines the runtime is ~80% better than the case without this patch. Also ran the patch with -XX:-UseSuperWord to make sure the performance does not degrade in cases where vectorization does not happen.
>>
>> Performance numbers
>>
>> **aarch64:**
>>
>> Benchmark                         (length)  (seed)  Mode  Cnt     Score   Error  Units
>> VectorIntMinMax.testMaxInt            2048       0  avgt   25  1449.792 ± 1.072  ns/op
>> VectorIntMinMax.testMinInt            2048       0  avgt   25  1450.636 ± 1.057  ns/op
>> VectorIntMinMax.testStrictMaxInt      2048       0  avgt   25  1450.214 ± 1.093  ns/op
>> VectorIntMinMax.testStrictMinInt      2048       0  avgt   25  1450.615 ± 1.098  ns/op
>>
>> **x86-64:**
>>
>> Benchmark                         (length)  (seed)  Mode  Cnt     Score   Error  Units
>> VectorIntMinMax.testMaxInt            2048       0  avgt   25  2059.673 ± 4.726  ns/op
>> VectorIntMinMax.testMinInt            2048       0  avgt   25  2059.853 ± 4.754  ns/op
>> VectorIntMinMax.testStrictMaxInt      2048       0  avgt   25  2059.920 ± 4.658  ns/op
>> VectorIntMinMax.testStrictMinInt      2048       0  avgt   25  2059.622 ± 4.768  ns/op
>>
>> There is no degradation when vectorization is disabled.
>
> Bhavana-Kilambi has updated the pull request incrementally with one additional commit since the last revision:
>
>   8288107: Auto-vectorization for integer min/max
>
>   When Math.min/max is invoked on integer arrays, it generates the CMP-CMOVE instructions instead of vectorizing the loop (if vectorizable and relevant ISA is available) using vector equivalent of min/max instructions. Emitting MaxI/MinI nodes instead of Cmp/CmoveI nodes results in the loop getting vectorized eventually and the architecture specific min/max vector instructions are generated.
>   A test to assess the performance of Math.max/min and StrictMath.max/min is added. On aarch64, the smin/smax instructions are generated when the loop is vectorized. On x86-64, vectorization support for min/max operations is available only in SSE4 (where pmaxsd/pminsd are generated) and AVX version >= 1 (where vpmaxsd/vpminsd are generated). This patch generates these instructions only when the loop is vectorized. In cases where the loop is not vectorizable or when the Math.max/min operations are called outside of the loop, cmp-cmove instructions are generated (tested on aarch64, x86-64 machines which have cmp-cmove instructions defined for the scalar MaxI/MinI nodes). Performance comparisons for the VectorIntMinMax.java test with and without the patch are given below:
>
>   Before this patch:
>   aarch64:
>   Benchmark                         (length)  (seed)  Mode  Cnt     Score   Error  Units
>   VectorIntMinMax.testMaxInt            2048       0  avgt   25  1593.510 ± 1.488  ns/op
>   VectorIntMinMax.testMinInt            2048       0  avgt   25  1593.123 ± 1.365  ns/op
>   VectorIntMinMax.testStrictMaxInt      2048       0  avgt   25  1593.112 ± 0.985  ns/op
>   VectorIntMinMax.testStrictMinInt      2048       0  avgt   25  1593.290 ± 1.219  ns/op
>
>   x86-64:
>   Benchmark                         (length)  (seed)  Mode  Cnt     Score   Error  Units
>   VectorIntMinMax.testMaxInt            2048       0  avgt   25  2084.717 ± 4.780  ns/op
>   VectorIntMinMax.testMinInt            2048       0  avgt   25  2087.322 ± 4.158  ns/op
>   VectorIntMinMax.testStrictMaxInt      2048       0  avgt   25  2084.568 ± 4.838  ns/op
>   VectorIntMinMax.testStrictMinInt      2048       0  avgt   25  2086.595 ± 4.025  ns/op
>
>   After this patch:
>   aarch64:
>   Benchmark                         (length)  (seed)  Mode  Cnt     Score   Error  Units
>   VectorIntMinMax.testMaxInt            2048       0  avgt   25   323.911 ± 0.206  ns/op
>   VectorIntMinMax.testMinInt            2048       0  avgt   25   324.084 ± 0.231  ns/op
>   VectorIntMinMax.testStrictMaxInt      2048       0  avgt   25   323.892 ± 0.234  ns/op
>   VectorIntMinMax.testStrictMinInt      2048       0  avgt   25   323.990 ± 0.295  ns/op
>
>   x86-64:
>   Benchmark                         (length)  (seed)  Mode  Cnt     Score   Error  Units
>   VectorIntMinMax.testMaxInt            2048       0  avgt   25   387.639 ± 0.512  ns/op
>   VectorIntMinMax.testMinInt            2048       0  avgt   25   387.999 ± 0.740  ns/op
>   VectorIntMinMax.testStrictMaxInt      2048       0  avgt   25   387.605 ± 0.376  ns/op
>   VectorIntMinMax.testStrictMinInt      2048       0  avgt   25   387.765 ± 0.498  ns/op
>
>   With auto-vectorization, both the machines exhibit a significant performance gain. On both the machines the runtime is ~80% better than the case without this patch. Also ran the patch with -XX:-UseSuperWord to make sure the performance does not degrade in cases where vectorization does not happen. The performance numbers are shown below:
>   aarch64:
>   Benchmark                         (length)  (seed)  Mode  Cnt     Score   Error  Units
>   VectorIntMinMax.testMaxInt            2048       0  avgt   25  1449.792 ± 1.072  ns/op
>   VectorIntMinMax.testMinInt            2048       0  avgt   25  1450.636 ± 1.057  ns/op
>   VectorIntMinMax.testStrictMaxInt      2048       0  avgt   25  1450.214 ± 1.093  ns/op
>   VectorIntMinMax.testStrictMinInt      2048       0  avgt   25  1450.615 ± 1.098  ns/op
>
>   x86-64:
>   Benchmark                         (length)  (seed)  Mode  Cnt     Score   Error  Units
>   VectorIntMinMax.testMaxInt            2048       0  avgt   25  2059.673 ± 4.726  ns/op
>   VectorIntMinMax.testMinInt            2048       0  avgt   25  2059.853 ± 4.754  ns/op
>   VectorIntMinMax.testStrictMaxInt      2048       0  avgt   25  2059.920 ± 4.658  ns/op
>   VectorIntMinMax.testStrictMinInt      2048       0  avgt   25  2059.622 ± 4.768  ns/op
>   There is no degradation when vectorization is disabled.

Added a new commit with the MaxINode::Ideal related code and tests stripped off, retaining only the code related to generating MinI/MaxI nodes for the Math.min/max intrinsics.

-------------

PR: https://git.openjdk.org/jdk/pull/9466

From roland at openjdk.org Fri Jul 15 11:47:04 2022
From: roland at openjdk.org (Roland Westrelin)
Date: Fri, 15 Jul 2022 11:47:04 GMT
Subject: RFR: 8289996: Fix array range check hoisting for some scaled loop iv
In-Reply-To:
References:
Message-ID: <60mgWuJ5JrkPp3if73MrqnT1wYdvo_IXlIySw_LoT3w=.378b988d-9eeb-4443-addf-bd46cd389d38@github.com>

On Fri, 15 Jul 2022 08:07:34 GMT, Pengfei Li wrote:

> Recently we found some array range checks in loops are not hoisted by
> C2's loop predication phase as expected. Below is a typical case.
>
>   for (int i = 0; i < size; i++) {
>     b[3 * i] = a[3 * i];
>   }
>
> Ideally, C2 can hoist the range check of an array access in loop if the
> array index is a linear function of the loop's induction variable (iv).
> Say, range check in `arr[exp]` can be hoisted if
>
>   exp = k1 * iv + k2 + inv
>
> where `k1` and `k2` are compile-time constants, and `inv` is an optional
> loop invariant. But in above case, C2 igvn does some strength reduction
> on the `MulINode` used to compute `3 * i`. It results in the linear index
> expression not being recognized. So far we found 2 ideal transformations
> that may affect linear expression recognition. They are
>
> - `k * iv` --> `iv << m + iv << n` if k is the sum of 2 pow-of-2 values
> - `k * iv` --> `iv << m - iv` if k+1 is a pow-of-2 value
>
> To avoid range check hoisting and further optimizations being broken, we
> have tried improving the linear recognition. But after some experiments,
> we found complex and recursive pattern match does not always work well.
> In this patch we propose to defer these 2 ideal transformations to the
> phase of post loop igvn. In other words, these 2 strength reductions can
> only be done after all loop optimizations are over.
>
> Tested hotspot::hotspot_all_no_apps, jdk::tier1~3 and langtools::tier1.
> We also tested the performance via JMH and see obvious improvement.
>
> Benchmark                     Improvement
> RangeCheckHoisting.ivScaled3  +21.2%
> RangeCheckHoisting.ivScaled7  +6.6%

Looks good to me.

-------------

Marked as reviewed by roland (Reviewer).

PR: https://git.openjdk.org/jdk/pull/9508

From roland at openjdk.org Fri Jul 15 12:39:48 2022
From: roland at openjdk.org (Roland Westrelin)
Date: Fri, 15 Jul 2022 12:39:48 GMT
Subject: [jdk19] RFR: 8289127: Apache Lucene triggers: DEBUG MESSAGE: duplicated predicate failed which is impossible
Message-ID:

Loop predication adds a skeleton predicate for a range check that computes:

(AddL (ConvI2L (MulI (CastII (AddI (OpaqueLoopInit 0) (SubI (OpaqueLoopStride ..) 1))) 7)) ..)

later transformed into:

(AddL (SubL (ConvI2L (LShiftI (CastII (AddI (OpaqueLoopInit 0) (OpaqueLoopStride ..))) 3)) (AddL (ConvI2L (CastII (AddI (OpaqueLoopInit 0) (OpaqueLoopStride ..)))) -1)) ..)

When pre/main/post loops are added, this expression is copied (above
the main loop) and updated. The logic performing the copy relies on
skeleton_follow_inputs() to find the OpaqueLoopInit nodes and clone
them with an updated initial value. Note there are 2 OpaqueLoopInit
nodes to update in the transformed expression. But because
skeleton_follow_inputs() doesn't include LShiftI, only one of the
OpaqueLoopInit nodes is cloned with an updated initial value. After loop
opts are over, the OpaqueLoopInit nodes are replaced by their input,
which results, in this particular case, in a predicate that always fails.

The fix is to make skeleton_follow_inputs() include LShiftI. I also added
verification code to catch similar issues in the future.
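
For illustration only (this is not part of the patch): the MulI by 7 in the first expression shows up as an LShiftI by 3 plus a subtraction in the second one, which appears to match the usual strength reduction 7 * x == (x << 3) - x. That rewrite uses x twice, which would explain why the transformed expression contains two OpaqueLoopInit nodes. A minimal Java sketch of the arithmetic identity only, with a hypothetical class and method name; it says nothing about how C2 implements the transformation:

```java
// Hypothetical illustration: multiply by 7 with a shift and a subtract.
// The identity holds for all int values because Java int arithmetic
// wraps around modulo 2^32.
class MulBy7Sketch {
    static int mulBy7Shifted(int x) {
        // 8 * x - x, written with a shift; x is used twice here
        return (x << 3) - x;
    }

    public static void main(String[] args) {
        int[] samples = {0, 1, -1, 5, 1000, Integer.MAX_VALUE, Integer.MIN_VALUE};
        for (int x : samples) {
            if (7 * x != mulBy7Shifted(x)) {
                throw new AssertionError("mismatch for " + x);
            }
        }
        System.out.println("7 * x == (x << 3) - x holds for all samples");
    }
}
```

Because the shifted form reads x on both sides of the subtraction, any logic that walks the inputs of such an expression has to follow the shift as well as the subtraction to reach every copy of x.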

-------------

Commit messages:
 - assert
 - test
 - fix

Changes: https://git.openjdk.org/jdk19/pull/143/files
 Webrev: https://webrevs.openjdk.org/?repo=jdk19&pr=143&range=00
  Issue: https://bugs.openjdk.org/browse/JDK-8289127
  Stats: 86 lines in 2 files changed: 83 ins; 0 del; 3 mod
  Patch: https://git.openjdk.org/jdk19/pull/143.diff
  Fetch: git fetch https://git.openjdk.org/jdk19 pull/143/head:pull/143

PR: https://git.openjdk.org/jdk19/pull/143

From jbhateja at openjdk.org Fri Jul 15 15:07:06 2022
From: jbhateja at openjdk.org (Jatin Bhateja)
Date: Fri, 15 Jul 2022 15:07:06 GMT
Subject: RFR: 8290066: Remove KNL specific handling for new CPU target check in IR annotation [v2]
In-Reply-To:
References:
Message-ID:

On Mon, 11 Jul 2022 21:03:30 GMT, Jatin Bhateja wrote:

>> - Newly added annotations query the CPU feature using white box API which returns the list of features enabled during VM initialization.
>> - With JVM flag UseKNLSetting, during VM initialization AVX512 features not supported by KNL target are disabled, thus we do not need any special handling for KNL in newly introduced IR annotations (applyCPUFeature, applyCPUFeatureOr, applyCPUFeatureAnd).
>>
>> Please review and share your feedback.
>>
>> Best Regards,
>> Jatin
>
> Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision:
>
>   8290066: Removing newly added white listed options.

Hi @dean-long , @chhagedorn , can you kindly check and approve this.

-------------

PR: https://git.openjdk.org/jdk/pull/9452

From kvn at openjdk.org Fri Jul 15 17:00:08 2022
From: kvn at openjdk.org (Vladimir Kozlov)
Date: Fri, 15 Jul 2022 17:00:08 GMT
Subject: RFR: 8290322: Optimize Vector.rearrange over byte vectors for AVX512BW targets.
In-Reply-To:
References:
Message-ID:

On Thu, 14 Jul 2022 18:23:51 GMT, Jatin Bhateja wrote:

> Hi All,
>
> Currently re-arrange over 512bit bytevector is optimized for targets supporting AVX512_VBMI feature, this patch generates efficient JIT sequence to handle it for AVX512BW targets. Following performance results with newly added benchmark shows
> significant speedup.
>
> System: Intel(R) Xeon(R) Platinum 8280 CPU @ 2.70GHz (CascadeLake 28C 2S)
>
>
> Baseline:
> =========
> Benchmark                                     (size)   Mode  Cnt      Score   Error   Units
> RearrangeBytesBenchmark.testRearrangeBytes16     512  thrpt    2  16350.330          ops/ms
> RearrangeBytesBenchmark.testRearrangeBytes32     512  thrpt    2  15991.346          ops/ms
> RearrangeBytesBenchmark.testRearrangeBytes64     512  thrpt    2     34.423          ops/ms
> RearrangeBytesBenchmark.testRearrangeBytes8      512  thrpt    2  10873.348          ops/ms
>
>
> With-opt:
> =========
> Benchmark                                     (size)   Mode  Cnt      Score   Error   Units
> RearrangeBytesBenchmark.testRearrangeBytes16     512  thrpt    2  16062.624          ops/ms
> RearrangeBytesBenchmark.testRearrangeBytes32     512  thrpt    2  16028.494          ops/ms
> RearrangeBytesBenchmark.testRearrangeBytes64     512  thrpt    2   8741.901          ops/ms
> RearrangeBytesBenchmark.testRearrangeBytes8      512  thrpt    2  10983.226          ops/ms
>
>
> Kindly review and share your feedback.
>
> Best Regards,
> Jatin

Testing tier1-3 passed.

-------------

Marked as reviewed by kvn (Reviewer).

PR: https://git.openjdk.org/jdk/pull/9498

From kvn at openjdk.org Fri Jul 15 17:02:05 2022
From: kvn at openjdk.org (Vladimir Kozlov)
Date: Fri, 15 Jul 2022 17:02:05 GMT
Subject: RFR: 8289996: Fix array range check hoisting for some scaled loop iv
In-Reply-To:
References:
Message-ID: <7E8NvLtxP5tsewHrnP7Lii341EmYbjvhIviGJI9pow4=.6fa2544e-ff85-4d69-8ce6-0ca57e9784fd@github.com>

On Fri, 15 Jul 2022 08:07:34 GMT, Pengfei Li wrote:

> Recently we found some array range checks in loops are not hoisted by
> C2's loop predication phase as expected. Below is a typical case.
>
>   for (int i = 0; i < size; i++) {
>     b[3 * i] = a[3 * i];
>   }
>
> Ideally, C2 can hoist the range check of an array access in loop if the
> array index is a linear function of the loop's induction variable (iv).
> Say, range check in `arr[exp]` can be hoisted if
>
>   exp = k1 * iv + k2 + inv
>
> where `k1` and `k2` are compile-time constants, and `inv` is an optional
> loop invariant. But in above case, C2 igvn does some strength reduction
> on the `MulINode` used to compute `3 * i`. It results in the linear index
> expression not being recognized. So far we found 2 ideal transformations
> that may affect linear expression recognition. They are
>
> - `k * iv` --> `iv << m + iv << n` if k is the sum of 2 pow-of-2 values
> - `k * iv` --> `iv << m - iv` if k+1 is a pow-of-2 value
>
> To avoid range check hoisting and further optimizations being broken, we
> have tried improving the linear recognition. But after some experiments,
> we found complex and recursive pattern match does not always work well.
> In this patch we propose to defer these 2 ideal transformations to the
> phase of post loop igvn. In other words, these 2 strength reductions can
> only be done after all loop optimizations are over.
>
> Tested hotspot::hotspot_all_no_apps, jdk::tier1~3 and langtools::tier1.
> We also tested the performance via JMH and see obvious improvement.
>
> Benchmark                     Improvement
> RangeCheckHoisting.ivScaled3  +21.2%
> RangeCheckHoisting.ivScaled7  +6.6%

Very nice analysis. Changes look good. Let me test it.

-------------

PR: https://git.openjdk.org/jdk/pull/9508

From kvn at openjdk.org Fri Jul 15 17:11:41 2022
From: kvn at openjdk.org (Vladimir Kozlov)
Date: Fri, 15 Jul 2022 17:11:41 GMT
Subject: RFR: 8289996: Fix array range check hoisting for some scaled loop iv
In-Reply-To:
References:
Message-ID:

On Fri, 15 Jul 2022 08:07:34 GMT, Pengfei Li wrote:

> Recently we found some array range checks in loops are not hoisted by
> C2's loop predication phase as expected. Below is a typical case.
>
>   for (int i = 0; i < size; i++) {
>     b[3 * i] = a[3 * i];
>   }
>
> Ideally, C2 can hoist the range check of an array access in loop if the
> array index is a linear function of the loop's induction variable (iv).
> Say, range check in `arr[exp]` can be hoisted if
>
>   exp = k1 * iv + k2 + inv
>
> where `k1` and `k2` are compile-time constants, and `inv` is an optional
> loop invariant. But in above case, C2 igvn does some strength reduction
> on the `MulINode` used to compute `3 * i`. It results in the linear index
> expression not being recognized. So far we found 2 ideal transformations
> that may affect linear expression recognition. They are
>
> - `k * iv` --> `iv << m + iv << n` if k is the sum of 2 pow-of-2 values
> - `k * iv` --> `iv << m - iv` if k+1 is a pow-of-2 value
>
> To avoid range check hoisting and further optimizations being broken, we
> have tried improving the linear recognition. But after some experiments,
> we found complex and recursive pattern match does not always work well.
> In this patch we propose to defer these 2 ideal transformations to the
> phase of post loop igvn. In other words, these 2 strength reductions can
> only be done after all loop optimizations are over.
>
> Tested hotspot::hotspot_all_no_apps, jdk::tier1~3 and langtools::tier1.
> We also tested the performance via JMH and see obvious improvement.
>
> Benchmark                     Improvement
> RangeCheckHoisting.ivScaled3  +21.2%
> RangeCheckHoisting.ivScaled7  +6.6%

I may need to run our benchmarks too. This may affect normal unrolling code (without vectors).

@pfustc, can you add micro cases for similar loops but without vectors? I would like to see how the indexes for each unrolled operation are calculated if `*` is kept.

-------------

PR: https://git.openjdk.org/jdk/pull/9508

From kvn at openjdk.org Fri Jul 15 17:23:51 2022
From: kvn at openjdk.org (Vladimir Kozlov)
Date: Fri, 15 Jul 2022 17:23:51 GMT
Subject: RFR: 8288107: Auto-vectorization for integer min/max [v2]
In-Reply-To:
References:
Message-ID:

On Fri, 15 Jul 2022 10:44:52 GMT, Bhavana-Kilambi wrote:

>> When Math.min/max is invoked on integer arrays, it generates the CMP-CMOVE instructions instead of vectorizing the loop (if vectorizable and relevant ISA is available) using vector equivalent of min/max instructions. Emitting MaxI/MinI nodes instead of Cmp/CmoveI nodes results in the loop getting vectorized eventually and the architecture specific min/max vector instructions are generated.
>> A test to assess the performance of Math.max/min and StrictMath.max/min is added. On aarch64, the smin/smax instructions are generated when the loop is vectorized. On x86-64, vectorization support for min/max operations is available only in SSE4 (where pmaxsd/pminsd are generated) and AVX version >= 1 (where vpmaxsd/vpminsd are generated). This patch generates these instructions only when the loop is vectorized. In cases where the loop is not vectorizable or when the Math.max/min operations are called outside of the loop, cmp-cmove instructions are generated (tested on aarch64, x86-64 machines which have cmp-cmove instructions defined for the scalar MaxI/MinI nodes). Performance comparisons for the VectorIntMinMax.java test with and without the patch are given below:
>>
Before this patch
>>
>> **aarch64:**
>>
>> Benchmark                         (length)  (seed)  Mode  Cnt     Score   Error  Units
>> VectorIntMinMax.testMaxInt            2048       0  avgt   25  1593.510 ± 1.488  ns/op
>> VectorIntMinMax.testMinInt            2048       0  avgt   25  1593.123 ± 1.365  ns/op
>> VectorIntMinMax.testStrictMaxInt      2048       0  avgt   25  1593.112 ± 0.985  ns/op
>> VectorIntMinMax.testStrictMinInt      2048       0  avgt   25  1593.290 ± 1.219  ns/op
>>
>> **x86-64:**
>>
>> Benchmark                         (length)  (seed)  Mode  Cnt     Score   Error  Units
>> VectorIntMinMax.testMaxInt            2048       0  avgt   25  2084.717 ± 4.780  ns/op
>> VectorIntMinMax.testMinInt            2048       0  avgt   25  2087.322 ± 4.158  ns/op
>> VectorIntMinMax.testStrictMaxInt      2048       0  avgt   25  2084.568 ± 4.838  ns/op
>> VectorIntMinMax.testStrictMinInt      2048       0  avgt   25  2086.595 ± 4.025  ns/op
>>
>>
>> >>
After this patch
>>
>> **aarch64:**
>>
>> Benchmark                         (length)  (seed)  Mode  Cnt     Score   Error  Units
>> VectorIntMinMax.testMaxInt            2048       0  avgt   25   323.911 ± 0.206  ns/op
>> VectorIntMinMax.testMinInt            2048       0  avgt   25   324.084 ± 0.231  ns/op
>> VectorIntMinMax.testStrictMaxInt      2048       0  avgt   25   323.892 ± 0.234  ns/op
>> VectorIntMinMax.testStrictMinInt      2048       0  avgt   25   323.990 ± 0.295  ns/op
>>
>> **x86-64:**
>>
>> Benchmark                         (length)  (seed)  Mode  Cnt     Score   Error  Units
>> VectorIntMinMax.testMaxInt            2048       0  avgt   25   387.639 ± 0.512  ns/op
>> VectorIntMinMax.testMinInt            2048       0  avgt   25   387.999 ± 0.740  ns/op
>> VectorIntMinMax.testStrictMaxInt      2048       0  avgt   25   387.605 ± 0.376  ns/op
>> VectorIntMinMax.testStrictMinInt      2048       0  avgt   25   387.765 ± 0.498  ns/op
>>
>>
>> >> With auto-vectorization, both the machines exhibit a significant performance gain. On both the machines the runtime is ~80% better than the case without this patch. Also ran the patch with -XX:-UseSuperWord to make sure the performance does not degrade in cases where vectorization does not happen. >> >>
Performance numbers
>>
>> **aarch64:**
>>
>> Benchmark                         (length)  (seed)  Mode  Cnt     Score   Error  Units
>> VectorIntMinMax.testMaxInt            2048       0  avgt   25  1449.792 ± 1.072  ns/op
>> VectorIntMinMax.testMinInt            2048       0  avgt   25  1450.636 ± 1.057  ns/op
>> VectorIntMinMax.testStrictMaxInt      2048       0  avgt   25  1450.214 ± 1.093  ns/op
>> VectorIntMinMax.testStrictMinInt      2048       0  avgt   25  1450.615 ± 1.098  ns/op
>>
>> **x86-64:**
>>
>> Benchmark                         (length)  (seed)  Mode  Cnt     Score   Error  Units
>> VectorIntMinMax.testMaxInt            2048       0  avgt   25  2059.673 ± 4.726  ns/op
>> VectorIntMinMax.testMinInt            2048       0  avgt   25  2059.853 ± 4.754  ns/op
>> VectorIntMinMax.testStrictMaxInt      2048       0  avgt   25  2059.920 ± 4.658  ns/op
>> VectorIntMinMax.testStrictMinInt      2048       0  avgt   25  2059.622 ± 4.768  ns/op
>>
>>
>>
>> There is no degradation when vectorization is disabled.
>
> Bhavana-Kilambi has updated the pull request incrementally with one additional commit since the last revision:
>
>   8288107: Auto-vectorization for integer min/max
>
>   When Math.min/max is invoked on integer arrays, it generates the CMP-CMOVE instructions instead of vectorizing the loop(if vectorizable and relevant ISA is available) using vector equivalent of min/max instructions. Emitting MaxI/MinI nodes instead of Cmp/CmoveI nodes results in the loop getting vectorized eventually and the architecture specific min/max vector instructions are generated.
>   A test to assess the performance of Math.max/min and StrictMath.max/min is added. On aarch64, the smin/smax instructions are generated when the loop is vectorized. On x86-64, vectorization support for min/max operations is available only in SSE4 (where pmaxsd/pminsd are generated) and AVX version >= 1 (where vpmaxsd/vpminsd are generated). This patch generates these instructions only when the loop is vectorized. In cases where the loop is not vectorizable or when the Math.max/min operations are called outside of the loop, cmp-cmove instructions are generated (tested on aarch64, x86-64 machines which have cmp-cmove instructions defined for the scalar MaxI/MinI nodes). Performance comparisons for the VectorIntMinMax.java test with and without the patch are given below :
>
>   Before this patch:
>   aarch64:
>   Benchmark                         (length)  (seed)  Mode  Cnt     Score   Error  Units
>   VectorIntMinMax.testMaxInt            2048       0  avgt   25  1593.510 ± 1.488  ns/op
>   VectorIntMinMax.testMinInt            2048       0  avgt   25  1593.123 ± 1.365  ns/op
>   VectorIntMinMax.testStrictMaxInt      2048       0  avgt   25  1593.112 ± 0.985  ns/op
>   VectorIntMinMax.testStrictMinInt      2048       0  avgt   25  1593.290 ± 1.219  ns/op
>
>   x86-64:
>   Benchmark                         (length)  (seed)  Mode  Cnt     Score   Error  Units
>   VectorIntMinMax.testMaxInt            2048       0  avgt   25  2084.717 ± 4.780  ns/op
>   VectorIntMinMax.testMinInt            2048       0  avgt   25  2087.322 ± 4.158  ns/op
>   VectorIntMinMax.testStrictMaxInt      2048       0  avgt   25  2084.568 ± 4.838  ns/op
>   VectorIntMinMax.testStrictMinInt      2048       0  avgt   25  2086.595 ± 4.025  ns/op
>
>   After this patch:
>   aarch64:
>   Benchmark                         (length)  (seed)  Mode  Cnt     Score   Error  Units
>   VectorIntMinMax.testMaxInt            2048       0  avgt   25   323.911 ± 0.206  ns/op
>   VectorIntMinMax.testMinInt            2048       0  avgt   25   324.084 ± 0.231  ns/op
>   VectorIntMinMax.testStrictMaxInt      2048       0  avgt   25   323.892 ± 0.234  ns/op
>   VectorIntMinMax.testStrictMinInt      2048       0  avgt   25   323.990 ± 0.295  ns/op
>
>   x86-64:
>   Benchmark                         (length)  (seed)  Mode  Cnt     Score   Error  Units
>   VectorIntMinMax.testMaxInt            2048       0  avgt   25   387.639 ± 0.512  ns/op
>   VectorIntMinMax.testMinInt            2048       0  avgt   25   387.999 ± 0.740  ns/op
>   VectorIntMinMax.testStrictMaxInt      2048       0  avgt   25   387.605 ± 0.376  ns/op
>   VectorIntMinMax.testStrictMinInt      2048       0  avgt   25   387.765 ± 0.498  ns/op
>
>   With auto-vectorization, both the machines exhibit a significant performance gain. On both the machines the runtime is ~80% better than the case without this patch. Also ran the patch with -XX:-UseSuperWord to make sure the performance does not degrade in cases where vectorization does not happen. The performance numbers are shown below :
>   aarch64:
>   Benchmark                         (length)  (seed)  Mode  Cnt     Score   Error  Units
>   VectorIntMinMax.testMaxInt            2048       0  avgt   25  1449.792 ± 1.072  ns/op
>   VectorIntMinMax.testMinInt            2048       0  avgt   25  1450.636 ± 1.057  ns/op
>   VectorIntMinMax.testStrictMaxInt      2048       0  avgt   25  1450.214 ± 1.093  ns/op
>   VectorIntMinMax.testStrictMinInt      2048       0  avgt   25  1450.615 ± 1.098  ns/op
>
>   x86-64:
>   Benchmark                         (length)  (seed)  Mode  Cnt     Score   Error  Units
>   VectorIntMinMax.testMaxInt            2048       0  avgt   25  2059.673 ± 4.726  ns/op
>   VectorIntMinMax.testMinInt            2048       0  avgt   25  2059.853 ± 4.754  ns/op
>   VectorIntMinMax.testStrictMaxInt      2048       0  avgt   25  2059.920 ± 4.658  ns/op
>   VectorIntMinMax.testStrictMinInt      2048       0  avgt   25  2059.622 ± 4.768  ns/op
>   There is no degradation when vectorization is disabled.

Good. You need a second review.
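For context, the kind of loop this PR targets looks roughly like the sketch below. This is an assumed shape, not the source of the VectorIntMinMax benchmark, which is not shown in this thread; the class and method names are hypothetical, and whether C2 actually vectorizes the loop depends on the CPU and the JVM flags in use.

```java
// Illustrative sketch only; not the benchmark from the PR. Names are hypothetical.
final class MinMaxLoopSketch {
    // Element-wise maximum of two int arrays, written so that each iteration
    // is independent and the body is a single Math.max call.
    static void maxInto(int[] a, int[] b, int[] out) {
        for (int i = 0; i < out.length; i++) {
            out[i] = Math.max(a[i], b[i]);
        }
    }

    // Element-wise minimum, same loop shape with Math.min.
    static void minInto(int[] a, int[] b, int[] out) {
        for (int i = 0; i < out.length; i++) {
            out[i] = Math.min(a[i], b[i]);
        }
    }

    public static void main(String[] args) {
        int[] a = {1, 5, -2, 9};
        int[] b = {4, 2, -3, 9};
        int[] hi = new int[a.length];
        int[] lo = new int[a.length];
        maxInto(a, b, hi);
        minInto(a, b, lo);
        System.out.println(java.util.Arrays.toString(hi));
        System.out.println(java.util.Arrays.toString(lo));
    }
}
```

Running such a loop with and without `-XX:-UseSuperWord`, as the PR author did for the benchmark, is one way to separate the effect of auto-vectorization from the scalar code path.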
------------- Marked as reviewed by kvn (Reviewer). PR: https://git.openjdk.org/jdk/pull/9466 From kvn at openjdk.org Fri Jul 15 17:26:15 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Fri, 15 Jul 2022 17:26:15 GMT Subject: [jdk19] RFR: 8289127: Apache Lucene triggers: DEBUG MESSAGE: duplicated predicate failed which is impossible In-Reply-To: References: Message-ID: On Fri, 15 Jul 2022 12:30:56 GMT, Roland Westrelin wrote: > Loop predication adds a skeleton predicate for a range check that computes: > > (AddL (ConvI2L (MulI (CastII (AddI (OpaqueLoopInit 0) (SubI (OpaqueLoopStride ..) 1))) 7)) ..) > > later transformed into: > > (AddL (SubL (ConvI2L (LShiftI (CastII (AddI (OpaqueLoopInit 0) (OpaqueLoopStride ..))) 3)) (AddL (ConvI2L (CastII (AddI (OpaqueLoopInit 0) (OpaqueLoopStride ..)))) -1)) ..) > > When pre/main/post loops are added, this expression is copied (above > the main loop) and updated. The logic performing the copy relies on > skeleton_follow_inputs() to find the OpaqueLoopInit nodes and clone it > with an updated initial value. Note there are 2 OpaqueLoopInit nodes > to update in the transformed expression. But because > skeleton_follow_inputs() doesn't include LShiftI only one of the > OpaqueLoopInit node is cloned with an updated initial value. > > After loop opts are over, the OpaqueLoopInit nodes are replaced by > their input which result in this particular case in a predicate that > always fails. > > The fix is to fix skeleton_follow_inputs() to include LShiftI. I also > added verification code to catch similar issues in the future. Good. I will test it. 
-------------

PR: https://git.openjdk.org/jdk19/pull/143

From kvn at openjdk.org  Fri Jul 15 17:41:59 2022
From: kvn at openjdk.org (Vladimir Kozlov)
Date: Fri, 15 Jul 2022 17:41:59 GMT
Subject: RFR: 8289996: Fix array range check hoisting for some scaled loop iv
In-Reply-To:
References:
Message-ID:

On Fri, 15 Jul 2022 17:08:13 GMT, Vladimir Kozlov wrote:

> @pfustc, can you add micro cases for similar loops but without vectors? I would like to see how the indexes for each unrolled operation are calculated if `*` is kept.

I mean to compare code generation and performance for unrolled loops with and without your changes. Maybe not new cases, but run the cases you have with `-XX:-UseSuperWord` (auto-vectorization off).

-------------

PR: https://git.openjdk.org/jdk/pull/9508

From coleenp at openjdk.org  Fri Jul 15 18:56:26 2022
From: coleenp at openjdk.org (Coleen Phillimore)
Date: Fri, 15 Jul 2022 18:56:26 GMT
Subject: RFR: 8290013: serviceability/jvmti/GetLocalVariable/GetLocalWithoutSuspendTest.java failed "assert(!in_vm) failed: Undersized StackShadowPages"
Message-ID:

Bumped up the PRODUCT stack shadow pages, since if I change the assert(!in_vm) to a guarantee, I get the failure in product mode too.
Tested with tier7 and failed test now passing.

-------------

Commit messages:
- Remove test from the problem list.
- 8290013: serviceability/jvmti/GetLocalVariable/GetLocalWithoutSuspendTest.java failed "assert(!in_vm) failed: Undersized StackShadowPages"

Changes: https://git.openjdk.org/jdk/pull/9514/files
 Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=9514&range=00
  Issue: https://bugs.openjdk.org/browse/JDK-8290013
  Stats: 3 lines in 2 files changed: 0 ins; 1 del; 2 mod
  Patch: https://git.openjdk.org/jdk/pull/9514.diff
  Fetch: git fetch https://git.openjdk.org/jdk pull/9514/head:pull/9514

PR: https://git.openjdk.org/jdk/pull/9514

From coleenp at openjdk.org  Fri Jul 15 18:56:26 2022
From: coleenp at openjdk.org (Coleen Phillimore)
Date: Fri, 15 Jul 2022 18:56:26 GMT
Subject: RFR: 8290013: serviceability/jvmti/GetLocalVariable/GetLocalWithoutSuspendTest.java failed "assert(!in_vm) failed: Undersized StackShadowPages"
In-Reply-To:
References:
Message-ID:

On Fri, 15 Jul 2022 12:31:44 GMT, Coleen Phillimore wrote:

> Bumped up the PRODUCT stack shadow pages, since if I change the assert(!in_vm) to a guarantee, I get the failure in product mode too.
> Tested with tier7 and failed test now passing.

> ⚠️ 8290013 is used in problem lists: [test/hotspot/jtreg/ProblemList-Xcomp.txt]

Thank you Skara bots. Rerunning tier7 with the test removed from the problem list.

I removed the test from the problem list and reran tier7, where it had been failing, with no failures.
------------- PR: https://git.openjdk.org/jdk/pull/9514 From lmesnik at openjdk.org Fri Jul 15 19:33:02 2022 From: lmesnik at openjdk.org (Leonid Mesnik) Date: Fri, 15 Jul 2022 19:33:02 GMT Subject: RFR: 8290013: serviceability/jvmti/GetLocalVariable/GetLocalWithoutSuspendTest.java failed "assert(!in_vm) failed: Undersized StackShadowPages" In-Reply-To: References: Message-ID: <5ucIO9XcRC5ZgjvXe9mypA9YghXf_gr-nFoyLEfTq4c=.09a8b91e-e4ec-4d19-bc4a-b449bb60462e@github.com> On Fri, 15 Jul 2022 12:31:44 GMT, Coleen Phillimore wrote: > Bumped up the PRODUCT stack shadow pages, since if I change the assert(!in_vm) to a guarantee, I get the failure in product mode too. > Tested with tier7 and failed test now passing. Marked as reviewed by lmesnik (Reviewer). ------------- PR: https://git.openjdk.org/jdk/pull/9514 From kvn at openjdk.org Fri Jul 15 19:41:07 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Fri, 15 Jul 2022 19:41:07 GMT Subject: [jdk19] RFR: 8289127: Apache Lucene triggers: DEBUG MESSAGE: duplicated predicate failed which is impossible In-Reply-To: References: Message-ID: On Fri, 15 Jul 2022 12:30:56 GMT, Roland Westrelin wrote: > Loop predication adds a skeleton predicate for a range check that computes: > > (AddL (ConvI2L (MulI (CastII (AddI (OpaqueLoopInit 0) (SubI (OpaqueLoopStride ..) 1))) 7)) ..) > > later transformed into: > > (AddL (SubL (ConvI2L (LShiftI (CastII (AddI (OpaqueLoopInit 0) (OpaqueLoopStride ..))) 3)) (AddL (ConvI2L (CastII (AddI (OpaqueLoopInit 0) (OpaqueLoopStride ..)))) -1)) ..) > > When pre/main/post loops are added, this expression is copied (above > the main loop) and updated. The logic performing the copy relies on > skeleton_follow_inputs() to find the OpaqueLoopInit nodes and clone it > with an updated initial value. Note there are 2 OpaqueLoopInit nodes > to update in the transformed expression. 
But because > skeleton_follow_inputs() doesn't include LShiftI only one of the > OpaqueLoopInit node is cloned with an updated initial value. > > After loop opts are over, the OpaqueLoopInit nodes are replaced by > their input which result in this particular case in a predicate that > always fails. > > The fix is to fix skeleton_follow_inputs() to include LShiftI. I also > added verification code to catch similar issues in the future. Testing results are good. ------------- Marked as reviewed by kvn (Reviewer). PR: https://git.openjdk.org/jdk19/pull/143 From dlong at openjdk.org Fri Jul 15 21:01:02 2022 From: dlong at openjdk.org (Dean Long) Date: Fri, 15 Jul 2022 21:01:02 GMT Subject: [jdk19] RFR: 8289127: Apache Lucene triggers: DEBUG MESSAGE: duplicated predicate failed which is impossible In-Reply-To: References: Message-ID: On Fri, 15 Jul 2022 12:30:56 GMT, Roland Westrelin wrote: > Loop predication adds a skeleton predicate for a range check that computes: > > (AddL (ConvI2L (MulI (CastII (AddI (OpaqueLoopInit 0) (SubI (OpaqueLoopStride ..) 1))) 7)) ..) > > later transformed into: > > (AddL (SubL (ConvI2L (LShiftI (CastII (AddI (OpaqueLoopInit 0) (OpaqueLoopStride ..))) 3)) (AddL (ConvI2L (CastII (AddI (OpaqueLoopInit 0) (OpaqueLoopStride ..)))) -1)) ..) > > When pre/main/post loops are added, this expression is copied (above > the main loop) and updated. The logic performing the copy relies on > skeleton_follow_inputs() to find the OpaqueLoopInit nodes and clone it > with an updated initial value. Note there are 2 OpaqueLoopInit nodes > to update in the transformed expression. But because > skeleton_follow_inputs() doesn't include LShiftI only one of the > OpaqueLoopInit node is cloned with an updated initial value. > > After loop opts are over, the OpaqueLoopInit nodes are replaced by > their input which result in this particular case in a predicate that > always fails. > > The fix is to fix skeleton_follow_inputs() to include LShiftI. 
I also > added verification code to catch similar issues in the future. Based on the description of the problem, the fix looks good. ------------- Marked as reviewed by dlong (Reviewer). PR: https://git.openjdk.org/jdk19/pull/143 From duke at openjdk.org Fri Jul 15 23:08:20 2022 From: duke at openjdk.org (Cesar Soares) Date: Fri, 15 Jul 2022 23:08:20 GMT Subject: RFR: 8289943: Simplify some object allocation merges In-Reply-To: References: Message-ID: On Thu, 14 Jul 2022 23:32:32 GMT, Xin Liu wrote: >> Hi there, can I please get some feedback on this approach to simplify object allocation merges in order to promote Scalar Replacement of the objects involved in the merge? >> >> The basic idea for this [approach was discussed in this thread](https://mail.openjdk.java.net/pipermail/hotspot-compiler-dev/2022-April/055189.html) and it consists of: >> 1) Identify Phi nodes that merge object allocations and replace them with a new IR node called ReducedAllocationMergeNode (RAM node). >> 2) Scalar Replace the incoming allocations to the RAM node. >> 3) Scalar Replace the RAM node itself. >> >> There are a few conditions for doing the replacement of the Phi by a RAM node though - Although I plan to work on removing them in subsequent PRs: >> >> - ~~The original Phi node should be merging Allocate nodes in all inputs.~~ >> - The only supported users of the original Phi are AddP->Load, SafePoints/Traps, DecodeN. >> >> These are the critical parts of the implementation and I'd appreciate it very much if you could tell me if what I implemented isn't violating any C2 IR constraints: >> >> - The way I identify/use the memory edges that will be used to find the last stored values to the merged object fields. >> - The way I check if there is an incoming Allocate node to the original Phi node. >> - The way I check if there is no store to the merged objects after they are merged. >> >> Testing: >> - Linux. 
fastdebug -> hotspot_all, renaissance, dacapo > > I run into an error > > $make test TEST=compiler/c2/irTests/scalarReplacement/AllocationMergesTests.java CONF=linux-x86_64-server-fastdebug > > > One or more @IR rules failed: > > Failed IR Rules (1) of Methods (1) > ---------------------------------- > 1) Method "int compiler.c2.irTests.scalarReplacement.AllocationMergesTests.testPollutedPolymorphic(boolean,int)" - [Failed IR rules: 1]: > * @IR rule 1: "@compiler.lib.ir_framework.IR(applyIf={}, applyIfAnd={}, failOn={}, applyIfOr={}, counts={"(.*precise .*\\R((.*(?i:mov|xorl|nop|spill).*|\\s*|.*LGHI.*)\\R)*.*(?i:call,static).*wrapper for: _new_instance_Java)", "2"}, applyIfNot={})" > - counts: Graph contains wrong number of nodes: > * Regex 1: (.*precise .*\R((.*(?i:mov|xorl|nop|spill).*|\s*|.*LGHI.*)\R)*.*(?i:call,static).*wrapper for: _new_instance_Java) > - Failed comparison: [found] 0 = 2 [given] > - No nodes matched! > >>>> Check stdout for compilation output of the failed methods > > > ############################################################# > - To only run the failed tests use -DTest, -DExclude, > and/or -DScenarios. > - To also get the standard output of the test VM run with > -DReportStdout=true or for even more fine-grained logging > use -DVerbose=true. > ############################################################# > > > compiler.lib.ir_framework.driver.irmatching.IRViolationException: There were one or multiple IR rule failures. Please check stderr for more information. 
>     at compiler.lib.ir_framework.driver.irmatching.IRMatcher.throwIfNoSafepointWhilePrinting(IRMatcher.java:91)
>     at compiler.lib.ir_framework.driver.irmatching.IRMatcher.reportFailures(IRMatcher.java:82)
>     at compiler.lib.ir_framework.driver.irmatching.IRMatcher.applyIRRules(IRMatcher.java:54)
>     at compiler.lib.ir_framework.driver.irmatching.IRMatcher.<init>(IRMatcher.java:43)
>     at compiler.lib.ir_framework.TestFramework.runTestVM(TestFramework.java:729)
>     at compiler.lib.ir_framework.TestFramework.start(TestFramework.java:698)
>     at compiler.lib.ir_framework.TestFramework.start(TestFramework.java:329)
>     at compiler.lib.ir_framework.TestFramework.runWithFlags(TestFramework.java:237)
>     at compiler.c2.irTests.scalarReplacement.AllocationMergesTests.main(AllocationMergesTests.java:37)
>     at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:104)
>     at java.base/java.lang.reflect.Method.invoke(Method.java:578)
>     at com.sun.javatest.regtest.agent.MainActionHelper$AgentVMRunnable.run(MainActionHelper.java:312)
>     at java.base/java.lang.Thread.run(Thread.java:1596)
>
>
> 2 objects do get scalarized in `testPollutedPolymorphic`.

Thanks for taking a look @navyxliu. I'm working on an improvement (and fixing a bug) and I'll include your suggestions in the next push.

> src/hotspot/share/opto/escape.cpp line 647:
>
>> 645:       _igvn->_worklist.push(ram);
>> 646:
>> 647:       // if (n->_idx == 257 && ram->_idx == 1239) {
>
> some mysterious code? :)

Oops (no pun intended).
-------------

PR: https://git.openjdk.org/jdk/pull/9073

From kvn at openjdk.org  Sat Jul 16 00:05:01 2022
From: kvn at openjdk.org (Vladimir Kozlov)
Date: Sat, 16 Jul 2022 00:05:01 GMT
Subject: RFR: 8289801: [IR Framework] Add flags to whitelist which can be used to simulate a specific machine setup like UseAVX
In-Reply-To:
References:
Message-ID:

On Fri, 15 Jul 2022 09:13:04 GMT, Jie Fu wrote:

> Hi all,
>
> Please review this trivial change which adds `UseAVX, UseSSE and UseSVE` to the whitelist of IR test framework.
>
> Thanks.
> Best regards,
> Jie

Consider adding the IR tests to the vector testing group, which we (Oracle) run with different AVX and SSE settings:

test/hotspot/jtreg/TEST.groups
@@ -84,6 +84,7 @@ hotspot_containers_extended = \

 hotspot_vector_1 = \
   compiler/c2/cr6340864 \
+  compiler/c2/irTests \
   compiler/codegen \
   compiler/loopopts/superword \
   compiler/vectorapi \

I submitted testing with this additional change.

-------------

PR: https://git.openjdk.org/jdk/pull/9509

From sviswanathan at openjdk.org  Sat Jul 16 01:04:03 2022
From: sviswanathan at openjdk.org (Sandhya Viswanathan)
Date: Sat, 16 Jul 2022 01:04:03 GMT
Subject: RFR: 8290066: Remove KNL specific handling for new CPU target check in IR annotation [v2]
In-Reply-To:
References:
Message-ID:

On Mon, 11 Jul 2022 21:03:30 GMT, Jatin Bhateja wrote:

>> - Newly added annotations query the CPU feature using white box API which returns the list of features enabled during VM initialization.
>> - With JVM flag UseKNLSetting, during VM initialization AVX512 features not supported by KNL target are disabled, thus we do not need any special handling for KNL in newly introduced IR annotations (applyCPUFeature, applyCPUFeatureOr, applyCPUFeatureAnd).
>>
>> Please review and share your feedback.
>>
>> Best Regards,
>> Jatin
>
> Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision:
>
>   8290066: Removing newly added white listed options.

Looks good to me.
------------- Marked as reviewed by sviswanathan (Reviewer). PR: https://git.openjdk.org/jdk/pull/9452 From jbhateja at openjdk.org Sat Jul 16 01:20:08 2022 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Sat, 16 Jul 2022 01:20:08 GMT Subject: RFR: 8290066: Remove KNL specific handling for new CPU target check in IR annotation [v2] In-Reply-To: References: Message-ID: On Sat, 16 Jul 2022 01:00:12 GMT, Sandhya Viswanathan wrote: >> Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: >> >> 8290066: Removing newly added white listed options. > > Looks good to me. Thanks @sviswa7 , @vnkozlov ------------- PR: https://git.openjdk.org/jdk/pull/9452 From jbhateja at openjdk.org Sat Jul 16 01:22:10 2022 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Sat, 16 Jul 2022 01:22:10 GMT Subject: Integrated: 8290066: Remove KNL specific handling for new CPU target check in IR annotation In-Reply-To: References: Message-ID: On Mon, 11 Jul 2022 12:55:02 GMT, Jatin Bhateja wrote: > - Newly added annotations query the CPU feature using white box API which returns the list of features enabled during VM initialization. > - With JVM flag UseKNLSetting, during VM initialization AVX512 features not supported by KNL target are disabled, thus we do not need any special handling for KNL in newly introduced IR annotations (applyCPUFeature, applyCPUFeatureOr, applyCPUFeatureAnd). > > Please review and share your feedback. > > Best Regards, > Jatin This pull request has now been integrated. 
Changeset: 2342684f Author: Jatin Bhateja URL: https://git.openjdk.org/jdk/commit/2342684f2cd91a2e5f43dd271e95836aa78e7d0a Stats: 190 lines in 4 files changed: 81 ins; 108 del; 1 mod 8290066: Remove KNL specific handling for new CPU target check in IR annotation Reviewed-by: kvn, sviswanathan ------------- PR: https://git.openjdk.org/jdk/pull/9452 From kvn at openjdk.org Sat Jul 16 14:40:49 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Sat, 16 Jul 2022 14:40:49 GMT Subject: RFR: 8289801: [IR Framework] Add flags to whitelist which can be used to simulate a specific machine setup like UseAVX In-Reply-To: References: Message-ID: <-maCsV9lcGwTwpA8ynSFwvX7mQ0X6AVYzahGBFkVpDo=.1276717e-e680-43c7-ada4-e1b467ada88a@github.com> On Fri, 15 Jul 2022 09:13:04 GMT, Jie Fu wrote: > Hi all, > > Please review this trivial change which adds `UseAVX, UseSSE and UseSVE` to the whitelist of IR test framework. > > Thanks. > Best regards, > Jie compiler/c2/irTests/TestVectorizeURShiftSubword.java test failed. Details in RFE. ------------- PR: https://git.openjdk.org/jdk/pull/9509 From kvn at openjdk.org Sat Jul 16 16:23:44 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Sat, 16 Jul 2022 16:23:44 GMT Subject: RFR: 8289996: Fix array range check hoisting for some scaled loop iv In-Reply-To: References: Message-ID: On Fri, 15 Jul 2022 08:07:34 GMT, Pengfei Li wrote: > Recently we found some array range checks in loops are not hoisted by > C2's loop predication phase as expected. Below is a typical case. > > for (int i = 0; i < size; i++) { > b[3 * i] = a[3 * i]; > } > > Ideally, C2 can hoist the range check of an array access in loop if the > array index is a linear function of the loop's induction variable (iv). > Say, range check in `arr[exp]` can be hoisted if > > exp = k1 * iv + k2 + inv > > where `k1` and `k2` are compile-time constants, and `inv` is an optional > loop invariant. 
But in above case, C2 igvn does some strength reduction > on the `MulINode` used to compute `3 * i`. It results in the linear index > expression not being recognized. So far we found 2 ideal transformations > that may affect linear expression recognition. They are > > - `k * iv` --> `iv << m + iv << n` if k is the sum of 2 pow-of-2 values > - `k * iv` --> `iv << m - iv` if k+1 is a pow-of-2 value > > To avoid range check hoisting and further optimizations being broken, we > have tried improving the linear recognition. But after some experiments, > we found complex and recursive pattern match does not always work well. > In this patch we propose to defer these 2 ideal transformations to the > phase of post loop igvn. In other words, these 2 strength reductions can > only be done after all loop optimizations are over. > > Tested hotspot::hotspot_all_no_apps, jdk::tier1~3 and langtools::tier1. > We also tested the performance via JMH and see obvious improvement. > > Benchmark Improvement > RangeCheckHoisting.ivScaled3 +21.2% > RangeCheckHoisting.ivScaled7 +6.6% Performance results are neutral - no change in benchmarks results. Good. ------------- Marked as reviewed by kvn (Reviewer). PR: https://git.openjdk.org/jdk/pull/9508 From duke at openjdk.org Sun Jul 17 06:58:09 2022 From: duke at openjdk.org (Quan Anh Mai) Date: Sun, 17 Jul 2022 06:58:09 GMT Subject: RFR: 8289996: Fix array range check hoisting for some scaled loop iv In-Reply-To: References: Message-ID: On Fri, 15 Jul 2022 08:07:34 GMT, Pengfei Li wrote: > Recently we found some array range checks in loops are not hoisted by > C2's loop predication phase as expected. Below is a typical case. > > for (int i = 0; i < size; i++) { > b[3 * i] = a[3 * i]; > } > > Ideally, C2 can hoist the range check of an array access in loop if the > array index is a linear function of the loop's induction variable (iv). 
> Say, range check in `arr[exp]` can be hoisted if > > exp = k1 * iv + k2 + inv > > where `k1` and `k2` are compile-time constants, and `inv` is an optional > loop invariant. But in above case, C2 igvn does some strength reduction > on the `MulINode` used to compute `3 * i`. It results in the linear index > expression not being recognized. So far we found 2 ideal transformations > that may affect linear expression recognition. They are > > - `k * iv` --> `iv << m + iv << n` if k is the sum of 2 pow-of-2 values > - `k * iv` --> `iv << m - iv` if k+1 is a pow-of-2 value > > To avoid range check hoisting and further optimizations being broken, we > have tried improving the linear recognition. But after some experiments, > we found complex and recursive pattern match does not always work well. > In this patch we propose to defer these 2 ideal transformations to the > phase of post loop igvn. In other words, these 2 strength reductions can > only be done after all loop optimizations are over. > > Tested hotspot::hotspot_all_no_apps, jdk::tier1~3 and langtools::tier1. > We also tested the performance via JMH and see obvious improvement. > > Benchmark Improvement > RangeCheckHoisting.ivScaled3 +21.2% > RangeCheckHoisting.ivScaled7 +6.6% src/hotspot/share/opto/mulnode.cpp line 277: > 275: phase->C->record_for_post_loop_opts_igvn(this); > 276: return NULL; > 277: } This defers the whole idealisation, should we only skip this particular transformation instead? Also should this be applied to `MulLNode`, too? Thanks. ------------- PR: https://git.openjdk.org/jdk/pull/9508 From pli at openjdk.org Sun Jul 17 08:13:59 2022 From: pli at openjdk.org (Pengfei Li) Date: Sun, 17 Jul 2022 08:13:59 GMT Subject: RFR: 8289996: Fix array range check hoisting for some scaled loop iv In-Reply-To: References: Message-ID: On Sat, 16 Jul 2022 16:20:33 GMT, Vladimir Kozlov wrote: > Performance results are neutral - no change in benchmarks results. Good. 
Yes, current superword in C2 cannot vectorize this case because memory references after unrolling are not adjacent. We have plans to support it (at least for AArch64) in the future but for now the improvement in this case does not come from auto-vectorization. ------------- PR: https://git.openjdk.org/jdk/pull/9508 From pli at openjdk.org Sun Jul 17 08:32:01 2022 From: pli at openjdk.org (Pengfei Li) Date: Sun, 17 Jul 2022 08:32:01 GMT Subject: RFR: 8289996: Fix array range check hoisting for some scaled loop iv In-Reply-To: References: Message-ID: On Sun, 17 Jul 2022 06:54:30 GMT, Quan Anh Mai wrote: >> Recently we found some array range checks in loops are not hoisted by >> C2's loop predication phase as expected. Below is a typical case. >> >> for (int i = 0; i < size; i++) { >> b[3 * i] = a[3 * i]; >> } >> >> Ideally, C2 can hoist the range check of an array access in loop if the >> array index is a linear function of the loop's induction variable (iv). >> Say, range check in `arr[exp]` can be hoisted if >> >> exp = k1 * iv + k2 + inv >> >> where `k1` and `k2` are compile-time constants, and `inv` is an optional >> loop invariant. But in above case, C2 igvn does some strength reduction >> on the `MulINode` used to compute `3 * i`. It results in the linear index >> expression not being recognized. So far we found 2 ideal transformations >> that may affect linear expression recognition. They are >> >> - `k * iv` --> `iv << m + iv << n` if k is the sum of 2 pow-of-2 values >> - `k * iv` --> `iv << m - iv` if k+1 is a pow-of-2 value >> >> To avoid range check hoisting and further optimizations being broken, we >> have tried improving the linear recognition. But after some experiments, >> we found complex and recursive pattern match does not always work well. >> In this patch we propose to defer these 2 ideal transformations to the >> phase of post loop igvn. In other words, these 2 strength reductions can >> only be done after all loop optimizations are over. 
>> >> Tested hotspot::hotspot_all_no_apps, jdk::tier1~3 and langtools::tier1. >> We also tested the performance via JMH and see obvious improvement. >> >> Benchmark Improvement >> RangeCheckHoisting.ivScaled3 +21.2% >> RangeCheckHoisting.ivScaled7 +6.6% > > src/hotspot/share/opto/mulnode.cpp line 277: > >> 275: phase->C->record_for_post_loop_opts_igvn(this); >> 276: return NULL; >> 277: } > > This defers the whole idealisation, should we only skip this particular transformation instead? Also should this be applied to `MulLNode`, too? Thanks. I think this only defers the transformation if scale value is a constant and has exactly 2 bits set in binary. Could you elaborate how to make it more particular? AFAIK, all normal Java array accesses are using 32-bit indices. When running on a 64-bit platform, we use a `ConvI2L` in element address computing but it's done after whole index expression computing. Is there any special array accesses that may use `MulL`? ------------- PR: https://git.openjdk.org/jdk/pull/9508 From jiefu at openjdk.org Sun Jul 17 10:15:52 2022 From: jiefu at openjdk.org (Jie Fu) Date: Sun, 17 Jul 2022 10:15:52 GMT Subject: RFR: 8289801: [IR Framework] Add flags to whitelist which can be used to simulate a specific machine setup like UseAVX [v2] In-Reply-To: References: Message-ID: > Hi all, > > Please review this trivial change which adds `UseAVX, UseSSE and UseSVE` to the whitelist of IR test framework. > > Thanks. 
> Best regards, > Jie Jie Fu has updated the pull request incrementally with one additional commit since the last revision: Address review comments ------------- Changes: - all: https://git.openjdk.org/jdk/pull/9509/files - new: https://git.openjdk.org/jdk/pull/9509/files/00efdca7..823699a5 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=9509&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=9509&range=00-01 Stats: 2 lines in 2 files changed: 1 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/9509.diff Fetch: git fetch https://git.openjdk.org/jdk pull/9509/head:pull/9509 PR: https://git.openjdk.org/jdk/pull/9509 From jiefu at openjdk.org Sun Jul 17 10:15:52 2022 From: jiefu at openjdk.org (Jie Fu) Date: Sun, 17 Jul 2022 10:15:52 GMT Subject: RFR: 8289801: [IR Framework] Add flags to whitelist which can be used to simulate a specific machine setup like UseAVX [v2] In-Reply-To: References: Message-ID: On Sat, 16 Jul 2022 00:01:13 GMT, Vladimir Kozlov wrote: > Consider adding IR tests to vector testing group which we (Oracle) run with different AVX,SSE settings: IR tests had been added into vector testing. ------------- PR: https://git.openjdk.org/jdk/pull/9509 From jiefu at openjdk.org Sun Jul 17 10:18:18 2022 From: jiefu at openjdk.org (Jie Fu) Date: Sun, 17 Jul 2022 10:18:18 GMT Subject: RFR: 8289801: [IR Framework] Add flags to whitelist which can be used to simulate a specific machine setup like UseAVX In-Reply-To: <-maCsV9lcGwTwpA8ynSFwvX7mQ0X6AVYzahGBFkVpDo=.1276717e-e680-43c7-ada4-e1b467ada88a@github.com> References: <-maCsV9lcGwTwpA8ynSFwvX7mQ0X6AVYzahGBFkVpDo=.1276717e-e680-43c7-ada4-e1b467ada88a@github.com> Message-ID: On Sat, 16 Jul 2022 14:37:04 GMT, Vladimir Kozlov wrote: > compiler/c2/irTests/TestVectorizeURShiftSubword.java test failed. Details in RFE. Thanks @vnkozlov for the review and testing. I can only reproduce the failure with `UseSSE < 4`, but passed with UseAVX=0 on x86. 
The reason is that `RShiftVB` isn't supported with `UseSSE < 4` on x86. The fix just skips the test when `UseSSE < 4` on x86. Thanks.

-------------

PR: https://git.openjdk.org/jdk/pull/9509

From duke at openjdk.org Sun Jul 17 16:57:08 2022
From: duke at openjdk.org (Quan Anh Mai)
Date: Sun, 17 Jul 2022 16:57:08 GMT
Subject: RFR: 8289996: Fix array range check hoisting for some scaled loop iv
In-Reply-To:
References:
Message-ID:

On Sun, 17 Jul 2022 08:28:41 GMT, Pengfei Li wrote:

>> src/hotspot/share/opto/mulnode.cpp line 277:
>>
>>> 275:     phase->C->record_for_post_loop_opts_igvn(this);
>>> 276:     return NULL;
>>> 277:   }
>>
>> This defers the whole idealisation, should we only skip this particular transformation instead? Also should this be applied to `MulLNode`, too? Thanks.
>
> I think this only defers the transformation if scale value is a constant and has exactly 2 bits set in binary. Could you elaborate how to make it more particular?
>
> AFAIK, all normal Java array accesses are using 32-bit indices. When running on a 64-bit platform, we use a `ConvI2L` in element address computing but it's done after whole index expression computing. Are there any special array accesses that may use `MulL`?

I mean this stops the whole idealisation as soon as the constant has exactly 2 bits set, I think we should still try other transformations in `MulNode::Ideal` in those cases.

IIRC, memory segment accesses use long arithmetic, so they need this change, too.

Thanks.
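For illustration only (this is not code from the patch): the two strength reductions discussed in this thread rewrite a multiplication by a constant as shifts plus an add or a subtract. The Java sketch below, with hypothetical class and method names, checks the equivalences for the scales 3 and 7 used in the benchmark. Java `int` arithmetic wraps on overflow, so both forms should agree for every `int` value.

```java
public class ScaledIndexSketch {
    // k = 3 = 2 + 1 is the sum of two powers of two,
    // so 3 * iv can be written as (iv << 1) + iv.
    static int scale3(int iv) {
        return (iv << 1) + iv;
    }

    // k = 7 and k + 1 = 8 is a power of two,
    // so 7 * iv can be written as (iv << 3) - iv.
    static int scale7(int iv) {
        return (iv << 3) - iv;
    }

    public static void main(String[] args) {
        int[] samples = {0, 1, 5, -4, Integer.MAX_VALUE, Integer.MIN_VALUE};
        for (int iv : samples) {
            // Compare the shift forms against the plain multiplications.
            if (scale3(iv) != 3 * iv || scale7(iv) != 7 * iv) {
                throw new AssertionError("mismatch for iv = " + iv);
            }
        }
        System.out.println("shift forms match the multiplications");
    }
}
```

The shift forms no longer look like `k1 * iv + k2`, which is why a pattern matcher that expects a single multiply can fail to recognize them.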
------------- PR: https://git.openjdk.org/jdk/pull/9508 From xgong at openjdk.org Mon Jul 18 02:47:03 2022 From: xgong at openjdk.org (Xiaohong Gong) Date: Mon, 18 Jul 2022 02:47:03 GMT Subject: RFR: 8289801: [IR Framework] Add flags to whitelist which can be used to simulate a specific machine setup like UseAVX [v2] In-Reply-To: References: Message-ID: On Sun, 17 Jul 2022 10:15:52 GMT, Jie Fu wrote: >> Hi all, >> >> Please review this trivial change which adds `UseAVX, UseSSE and UseSVE` to the whitelist of IR test framework. >> >> Thanks. >> Best regards, >> Jie > > Jie Fu has updated the pull request incrementally with one additional commit since the last revision: > > Address review comments Looks good to me! Thanks! ------------- Marked as reviewed by xgong (Committer). PR: https://git.openjdk.org/jdk/pull/9509 From fgao at openjdk.org Mon Jul 18 05:58:49 2022 From: fgao at openjdk.org (Fei Gao) Date: Mon, 18 Jul 2022 05:58:49 GMT Subject: Integrated: 8288883: C2: assert(allow_address || t != T_ADDRESS) failed after JDK-8283091 In-Reply-To: References: Message-ID: On Wed, 6 Jul 2022 07:51:01 GMT, Fei Gao wrote: > Superword doesn't vectorize any nodes of non-primitive types and > thus sets `allow_address` false when calling type2aelembytes() in > SuperWord::data_size()[1]. Therefore, when we try to resolve the > data size for a node of T_ADDRESS type, the assertion in > type2aelembytes()[2] takes effect. > > We try to resolve the data sizes for node s and node t in the > SuperWord::adjust_alignment_for_type_conversion()[3] when type > conversion between different data sizes happens. The issue is, > when node s is a ConvI2L node and node t is an AddP node of > T_ADDRESS type, type2aelembytes() will assert. To fix it, we > should filter out all non-primitive nodes, like the patch does > in SuperWord::adjust_alignment_for_type_conversion(). Since > it's a failure in the mid-end, all superword available platforms > are affected. 
In my local test, this failure can be reproduced > on both x86 and aarch64. With this patch, the failure can be fixed. > > Apart from fixing the bug, the patch also adds necessary type check > and does some clean-up in SuperWord::longer_type_for_conversion() > and VectorCastNode::implemented(). > > [1]https://github.com/openjdk/jdk/blob/dddd4e7c81fccd82b0fd37ea4583ce1a8e175919/src/hotspot/share/opto/superword.cpp#L1417 > [2]https://github.com/openjdk/jdk/blob/b96ba19807845739b36274efb168dd048db819a3/src/hotspot/share/utilities/globalDefinitions.cpp#L326 > [3]https://github.com/openjdk/jdk/blob/dddd4e7c81fccd82b0fd37ea4583ce1a8e175919/src/hotspot/share/opto/superword.cpp#L1454 This pull request has now been integrated. Changeset: 87340fd5 Author: Fei Gao Committer: Ningsheng Jian URL: https://git.openjdk.org/jdk/commit/87340fd5408d89d9343541ff4fcabde83548a598 Stats: 116 lines in 5 files changed: 89 ins; 9 del; 18 mod 8288883: C2: assert(allow_address || t != T_ADDRESS) failed after JDK-8283091 Reviewed-by: kvn, mdoerr ------------- PR: https://git.openjdk.org/jdk/pull/9391 From rrich at openjdk.org Mon Jul 18 06:41:57 2022 From: rrich at openjdk.org (Richard Reingruber) Date: Mon, 18 Jul 2022 06:41:57 GMT Subject: RFR: 8289925 Shared code shouldn't reference the platform specific method frame::interpreter_frame_last_sp() [v2] In-Reply-To: References: Message-ID: > This removes the reference to the platform specific method `frame::interpreter_frame_last_sp()` from the shared method `Continuation::continuation_bottom_sender()`. > > The change simply removes the special case for interpreted frames as I cannot see a reason for the distinction between interpreted and compiled frames. > > Testing: hotspot_loom and jdk_loom on x86_64 and aarch64. Richard Reingruber has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. 
The pull request contains two additional commits since the last revision: - Merge branch 'master' - Remove platform dependent method interpreter_frame_last_sp() from shared code ------------- Changes: - all: https://git.openjdk.org/jdk/pull/9411/files - new: https://git.openjdk.org/jdk/pull/9411/files/54d7db65..c3ad382c Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=9411&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=9411&range=00-01 Stats: 84184 lines in 1882 files changed: 43512 ins; 26168 del; 14504 mod Patch: https://git.openjdk.org/jdk/pull/9411.diff Fetch: git fetch https://git.openjdk.org/jdk pull/9411/head:pull/9411 PR: https://git.openjdk.org/jdk/pull/9411 From roland at openjdk.org Mon Jul 18 07:10:06 2022 From: roland at openjdk.org (Roland Westrelin) Date: Mon, 18 Jul 2022 07:10:06 GMT Subject: [jdk19] RFR: 8289127: Apache Lucene triggers: DEBUG MESSAGE: duplicated predicate failed which is impossible In-Reply-To: References: Message-ID: On Fri, 15 Jul 2022 19:37:50 GMT, Vladimir Kozlov wrote: >> Loop predication adds a skeleton predicate for a range check that computes: >> >> (AddL (ConvI2L (MulI (CastII (AddI (OpaqueLoopInit 0) (SubI (OpaqueLoopStride ..) 1))) 7)) ..) >> >> later transformed into: >> >> (AddL (SubL (ConvI2L (LShiftI (CastII (AddI (OpaqueLoopInit 0) (OpaqueLoopStride ..))) 3)) (AddL (ConvI2L (CastII (AddI (OpaqueLoopInit 0) (OpaqueLoopStride ..)))) -1)) ..) >> >> When pre/main/post loops are added, this expression is copied (above >> the main loop) and updated. The logic performing the copy relies on >> skeleton_follow_inputs() to find the OpaqueLoopInit nodes and clone it >> with an updated initial value. Note there are 2 OpaqueLoopInit nodes >> to update in the transformed expression. But because >> skeleton_follow_inputs() doesn't include LShiftI only one of the >> OpaqueLoopInit node is cloned with an updated initial value. 
>> >> After loop opts are over, the OpaqueLoopInit nodes are replaced by >> their input which result in this particular case in a predicate that >> always fails. >> >> The fix is to fix skeleton_follow_inputs() to include LShiftI. I also >> added verification code to catch similar issues in the future. > > Testing results are good. @vnkozlov @dean-long thanks for the reviews ------------- PR: https://git.openjdk.org/jdk19/pull/143 From roland at openjdk.org Mon Jul 18 07:12:19 2022 From: roland at openjdk.org (Roland Westrelin) Date: Mon, 18 Jul 2022 07:12:19 GMT Subject: [jdk19] Integrated: 8289127: Apache Lucene triggers: DEBUG MESSAGE: duplicated predicate failed which is impossible In-Reply-To: References: Message-ID: On Fri, 15 Jul 2022 12:30:56 GMT, Roland Westrelin wrote: > Loop predication adds a skeleton predicate for a range check that computes: > > (AddL (ConvI2L (MulI (CastII (AddI (OpaqueLoopInit 0) (SubI (OpaqueLoopStride ..) 1))) 7)) ..) > > later transformed into: > > (AddL (SubL (ConvI2L (LShiftI (CastII (AddI (OpaqueLoopInit 0) (OpaqueLoopStride ..))) 3)) (AddL (ConvI2L (CastII (AddI (OpaqueLoopInit 0) (OpaqueLoopStride ..)))) -1)) ..) > > When pre/main/post loops are added, this expression is copied (above > the main loop) and updated. The logic performing the copy relies on > skeleton_follow_inputs() to find the OpaqueLoopInit nodes and clone it > with an updated initial value. Note there are 2 OpaqueLoopInit nodes > to update in the transformed expression. But because > skeleton_follow_inputs() doesn't include LShiftI only one of the > OpaqueLoopInit node is cloned with an updated initial value. > > After loop opts are over, the OpaqueLoopInit nodes are replaced by > their input which result in this particular case in a predicate that > always fails. > > The fix is to fix skeleton_follow_inputs() to include LShiftI. I also > added verification code to catch similar issues in the future. This pull request has now been integrated. 
Changeset: 4f3f74c1
Author: Roland Westrelin
URL: https://git.openjdk.org/jdk19/commit/4f3f74c14121d0a80f0dcf1d593b4cf1c3e4a64c
Stats: 86 lines in 2 files changed: 83 ins; 0 del; 3 mod

8289127: Apache Lucene triggers: DEBUG MESSAGE: duplicated predicate failed which is impossible

Reviewed-by: kvn, dlong

-------------

PR: https://git.openjdk.org/jdk19/pull/143

From haosun at openjdk.org Mon Jul 18 07:54:41 2022
From: haosun at openjdk.org (Hao Sun)
Date: Mon, 18 Jul 2022 07:54:41 GMT
Subject: RFR: 8290169: adlc: Improve child constraints for vector unary operations
Message-ID:

As demonstrated in [1], the child constraint generated for *predicated vector unary operation* is the super set of that generated for the *unpredicated* version. As a result, there exists a risk for predicated vector unary operations to match the unpredicated rules by accident.

In this patch, we resolve this issue by generating one extra check "rChild == NULL" ONLY for vector unary operations. In this way, the child constraints for predicated/unpredicated vector unary operations are exclusive now.

Following the example in [1], the dfa state generated for AbsVI is shown below.

    void State::_sub_Op_AbsVI(const Node *n){
      if( STATE__VALID_CHILD(_kids[0], VREG) &&
          STATE__VALID_CHILD(_kids[1], PREGGOV) &&
          ( UseSVE > 0 ) )
      {
        unsigned int c = _kids[0]->_cost[VREG]+_kids[1]->_cost[PREGGOV] + SVE_COST;
        DFA_PRODUCTION(VREG, vabsI_masked_rule, c)
      }
      if( STATE__VALID_CHILD(_kids[0], VREG) &&
          _kids[1] == NULL &&                     <---- 1
          ( UseSVE > 0) )
      {
        unsigned int c = _kids[0]->_cost[VREG] + SVE_COST;
        if (STATE__NOT_YET_VALID(VREG) || _cost[VREG] > c) {
          DFA_PRODUCTION(VREG, vabsI_rule, c)
        }
      }
    ...

We can see that the constraint at line 1 cannot be matched for predicated AbsVI node now.

The main updates are made in the adlc/dfa part. Ideally, we should only add the extra check for affected platforms, i.e. AVX-512 and SVE.
But we didn't do that because it would be better not to introduce any architecture dependent implementation here. Besides, workarounds in both aarch64_sve.ad and x86.ad are removed. 1) Many "is_predicated_vector()" checks can be removed in aarch64_sve.ad file. 2) Default instruction cost is used for involving rules in x86.ad file. [1]. https://github.com/shqking/jdk/commit/50ec9b19 ------------- Commit messages: - 8290169: adlc: Improve child constraints for vector unary operations Changes: https://git.openjdk.org/jdk/pull/9534/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=9534&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8290169 Stats: 192 lines in 5 files changed: 28 ins; 97 del; 67 mod Patch: https://git.openjdk.org/jdk/pull/9534.diff Fetch: git fetch https://git.openjdk.org/jdk pull/9534/head:pull/9534 PR: https://git.openjdk.org/jdk/pull/9534 From jbhateja at openjdk.org Mon Jul 18 08:09:51 2022 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Mon, 18 Jul 2022 08:09:51 GMT Subject: RFR: 8290034: Auto vectorize reverse bit operations. Message-ID: Summary of changes: - Intrinsify scalar bit reverse APIs to emit efficient instruction sequence for X86 targets with and w/o GFNI feature. - Handle auto-vectorization of Integer/Long.reverse bit operations. - Backend implementation for these were added with 4th incubation of VectorAPIs. 
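As a rough illustration (not taken from the patch or its JMH benchmarks), the loops below show the shape of scalar code that this kind of auto-vectorization targets: each element is passed through `Integer.reverse` or `Long.reverse` independently. The class and method names are hypothetical, and whether C2 actually vectorizes such a loop depends on the platform and the flags in use.

```java
public class ReverseBitsSketch {
    // Reverse the bit order of every int in src, writing the result into dst.
    static void reverseInts(int[] src, int[] dst) {
        for (int i = 0; i < src.length; i++) {
            dst[i] = Integer.reverse(src[i]);
        }
    }

    // Same loop shape for long elements.
    static void reverseLongs(long[] src, long[] dst) {
        for (int i = 0; i < src.length; i++) {
            dst[i] = Long.reverse(src[i]);
        }
    }

    public static void main(String[] args) {
        int[] ints = {1, 0xF0, -1};
        int[] intsOut = new int[ints.length];
        reverseInts(ints, intsOut);

        long[] longs = {1L, 0L};
        long[] longsOut = new long[longs.length];
        reverseLongs(longs, longsOut);

        // Bit 0 moves to bit 31 for int and to bit 63 for long.
        System.out.println(Integer.toHexString(intsOut[0]));
        System.out.println(Long.toHexString(longsOut[0]));
    }
}
```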
Following are performance numbers for newly added JMH micro benchmarks:-

No-GFNI(CLX):
=============
Baseline:
Benchmark         (size)  Mode  Cnt  Score  Error  Units
Integers.reverse     500  avgt    2  1.085         us/op
Longs.reverse        500  avgt    2  1.236         us/op

WithOpt:
Benchmark         (size)  Mode  Cnt  Score  Error  Units
Integers.reverse     500  avgt    2  0.104         us/op
Longs.reverse        500  avgt    2  0.255         us/op

With-GFNI(ICX):
===============
Baseline:
Benchmark         (size)  Mode  Cnt  Score  Error  Units
Integers.reverse     500  avgt    2  0.887         us/op
Longs.reverse        500  avgt    2  1.095         us/op

Without:
Benchmark         (size)  Mode  Cnt  Score  Error  Units
Integers.reverse     500  avgt    2  0.037         us/op
Longs.reverse        500  avgt    2  0.145         us/op

Kindly review and share feedback.

Best Regards,
Jatin

-------------

Commit messages:
 - 8290034: Adding descriptive comments.
 - 8290034: Auto vectorize reverse bit operations.

Changes: https://git.openjdk.org/jdk/pull/9535/files
Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=9535&range=00
Issue: https://bugs.openjdk.org/browse/JDK-8290034
Stats: 425 lines in 18 files changed: 425 ins; 0 del; 0 mod
Patch: https://git.openjdk.org/jdk/pull/9535.diff
Fetch: git fetch https://git.openjdk.org/jdk pull/9535/head:pull/9535

PR: https://git.openjdk.org/jdk/pull/9535

From roland at openjdk.org Mon Jul 18 08:28:03 2022
From: roland at openjdk.org (Roland Westrelin)
Date: Mon, 18 Jul 2022 08:28:03 GMT
Subject: [jdk19] RFR: 8289954: C2: Assert failed in PhaseCFG::verify() after JDK-8183390
In-Reply-To:
References:
Message-ID: <06fBlXzg44r8kkvZoLU0mq_z67T0E0r_cK6tkDhoDj4=.e1ebd242-fb19-48b8-ac7b-cddd841430d9@github.com>

On Mon, 11 Jul 2022 08:41:21 GMT, Pengfei Li wrote:

> Fuzzer tests report an assertion failure issue in C2 global code motion
> phase. Git bisection shows the problem starts after our fix of post loop
> vectorization (JDK-8183390). After some narrowing down work, we find it
> is caused by below change in that patch.
>
>
> @@ -422,14 +404,7 @@
>            cl->mark_passed_slp();
>          }
>          cl->mark_was_slp();
> -        if (cl->is_main_loop()) {
> -          cl->set_slp_max_unroll(local_loop_unroll_factor);
> -        } else if (post_loop_allowed) {
> -          if (!small_basic_type) {
> -            // avoid replication context for small basic types in programmable masked loops
> -            cl->set_slp_max_unroll(local_loop_unroll_factor);
> -          }
> -        }
> +        cl->set_slp_max_unroll(local_loop_unroll_factor);
>        }
>      }
>
>
> This change is in function `SuperWord::unrolling_analysis()`. AFAIK, it
> helps find a loop's max unroll count via some analysis. In the original
> code, we have loop type checks and the slp max unroll value is set for
> only some types of loops. But in JDK-8183390, the check was removed by
> mistake. In my current understanding, the slp max unroll value applies
> to slp candidate loops only - either main loops or RCE'd post loops -
> so that check shouldn't be removed. After restoring it we don't see the
> assertion failure any more.
>
> The new jtreg created in this patch can reproduce the failed assertion,
> which checks `def_block->dominates(block)` - the domination relationship
> of two blocks. But in the case, I found the blocks are in an unreachable
> inner loop, which I think ought to be optimized away in some previous C2
> phases. As I'm not quite familiar with the C2's global code motion, so
> far I still don't understand how slp max unroll count eventually causes
> that problem. This patch just restores the if condition which I removed
> incorrectly in JDK-8183390. But I still suspect that there is another
> hidden bug exists in C2. I would be glad if any reviewers can give me
> some guidance or suggestions.
>
> Tested hotspot::hotspot_all_no_apps, jdk::tier1~3 and langtools::tier1.

To have this fixed in jdk 19, you need to open a PR against jdk 19.

Isn't SuperWord::unrolling_analysis() only called for main and rce post loops anyway? What loop type doesn't the extra check exclude in the case of your test?
-------------

PR: https://git.openjdk.org/jdk19/pull/130

From thartmann at openjdk.org Mon Jul 18 08:32:09 2022
From: thartmann at openjdk.org (Tobias Hartmann)
Date: Mon, 18 Jul 2022 08:32:09 GMT
Subject: [jdk19] RFR: 8289954: C2: Assert failed in PhaseCFG::verify() after JDK-8183390
In-Reply-To:
References:
Message-ID:

On Mon, 11 Jul 2022 08:41:21 GMT, Pengfei Li wrote:

> Fuzzer tests report an assertion failure issue in C2 global code motion
> phase. Git bisection shows the problem starts after our fix of post loop
> vectorization (JDK-8183390). After some narrowing down work, we find it
> is caused by below change in that patch.
>
>
> @@ -422,14 +404,7 @@
>            cl->mark_passed_slp();
>          }
>          cl->mark_was_slp();
> -        if (cl->is_main_loop()) {
> -          cl->set_slp_max_unroll(local_loop_unroll_factor);
> -        } else if (post_loop_allowed) {
> -          if (!small_basic_type) {
> -            // avoid replication context for small basic types in programmable masked loops
> -            cl->set_slp_max_unroll(local_loop_unroll_factor);
> -          }
> -        }
> +        cl->set_slp_max_unroll(local_loop_unroll_factor);
>        }
>      }
>
>
> This change is in function `SuperWord::unrolling_analysis()`. AFAIK, it
> helps find a loop's max unroll count via some analysis. In the original
> code, we have loop type checks and the slp max unroll value is set for
> only some types of loops. But in JDK-8183390, the check was removed by
> mistake. In my current understanding, the slp max unroll value applies
> to slp candidate loops only - either main loops or RCE'd post loops -
> so that check shouldn't be removed. After restoring it we don't see the
> assertion failure any more.
>
> The new jtreg created in this patch can reproduce the failed assertion,
> which checks `def_block->dominates(block)` - the domination relationship
> of two blocks. But in the case, I found the blocks are in an unreachable
> inner loop, which I think ought to be optimized away in some previous C2
> phases.
As I'm not quite familiar with the C2's global code motion, so > far I still don't understand how slp max unroll count eventually causes > that problem. This patch just restores the if condition which I removed > incorrectly in JDK-8183390. But I still suspect that there is another > hidden bug exists in C2. I would be glad if any reviewers can give me > some guidance or suggestions. > > Tested hotspot::hotspot_all_no_apps, jdk::tier1~3 and langtools::tier1. Looks reasonable to me. ------------- Marked as reviewed by thartmann (Reviewer). PR: https://git.openjdk.org/jdk19/pull/130 From thartmann at openjdk.org Mon Jul 18 08:32:10 2022 From: thartmann at openjdk.org (Tobias Hartmann) Date: Mon, 18 Jul 2022 08:32:10 GMT Subject: [jdk19] RFR: 8289954: C2: Assert failed in PhaseCFG::verify() after JDK-8183390 In-Reply-To: <06fBlXzg44r8kkvZoLU0mq_z67T0E0r_cK6tkDhoDj4=.e1ebd242-fb19-48b8-ac7b-cddd841430d9@github.com> References: <06fBlXzg44r8kkvZoLU0mq_z67T0E0r_cK6tkDhoDj4=.e1ebd242-fb19-48b8-ac7b-cddd841430d9@github.com> Message-ID: On Mon, 18 Jul 2022 08:24:53 GMT, Roland Westrelin wrote: > To have this fixed in jdk 19, you need to open a PR againsts jdk 19. But this is a PR against JDK 19, right? ------------- PR: https://git.openjdk.org/jdk19/pull/130 From roland at openjdk.org Mon Jul 18 08:37:03 2022 From: roland at openjdk.org (Roland Westrelin) Date: Mon, 18 Jul 2022 08:37:03 GMT Subject: [jdk19] RFR: 8289954: C2: Assert failed in PhaseCFG::verify() after JDK-8183390 In-Reply-To: References: <06fBlXzg44r8kkvZoLU0mq_z67T0E0r_cK6tkDhoDj4=.e1ebd242-fb19-48b8-ac7b-cddd841430d9@github.com> Message-ID: On Mon, 18 Jul 2022 08:28:45 GMT, Tobias Hartmann wrote: > To have this fixed in jdk 19, you need to open a PR againsts jdk 19. Sorry I missed this was indeed against jdk 19. 
------------- PR: https://git.openjdk.org/jdk19/pull/130 From pli at openjdk.org Mon Jul 18 09:22:58 2022 From: pli at openjdk.org (Pengfei Li) Date: Mon, 18 Jul 2022 09:22:58 GMT Subject: RFR: 8289996: Fix array range check hoisting for some scaled loop iv In-Reply-To: References: Message-ID: <7VdeD4p7xrJSo767f3ELfvhrDL1rUPuhX9OABHzvgYU=.1b127345-8777-4a72-a930-c6b52d0b328f@github.com> On Sun, 17 Jul 2022 16:53:23 GMT, Quan Anh Mai wrote: >> I think this only defers the transformation if scale value is a constant and has exactly 2 bits set in binary. Could you elaborate how to make it more particular? >> >> AFAIK, all normal Java array accesses are using 32-bit indices. When running on a 64-bit platform, we use a `ConvI2L` in element address computing but it's done after whole index expression computing. Is there any special array accesses that may use `MulL`? > > I mean this stops the whole idealisation as soon as the constant has exactly 2 bits set, I think we should still try other transformations in `MulNode::Ideal` in those cases. > > IIRC, memory segment accesses use long arithmetic, so they need this changes, too. > > Thanks. Thanks for your suggestions. I will update the code later after some tests. ------------- PR: https://git.openjdk.org/jdk/pull/9508 From pli at openjdk.org Mon Jul 18 09:23:01 2022 From: pli at openjdk.org (Pengfei Li) Date: Mon, 18 Jul 2022 09:23:01 GMT Subject: [jdk19] RFR: 8289954: C2: Assert failed in PhaseCFG::verify() after JDK-8183390 In-Reply-To: References: <06fBlXzg44r8kkvZoLU0mq_z67T0E0r_cK6tkDhoDj4=.e1ebd242-fb19-48b8-ac7b-cddd841430d9@github.com> Message-ID: On Mon, 18 Jul 2022 08:33:50 GMT, Roland Westrelin wrote: > To have this fixed in jdk 19, you need to open a PR againsts jdk 19. Yeah, this is indeed against jdk 19. > Isn't SuperWord::unrolling_analysis() only called for main and rce post loops anyway? What loop type doesn't the extra check exclude in the case of your test? It's a normal loop. 
I find the check is called from `IdealLoopTree::policy_unroll_slp_analysis()`. And that function is called from `IdealLoopTree::policy_unroll()` which accepts normal loops. I see the assert in `IdealLoopTree::policy_unroll()` says assert(cl->is_normal_loop() || cl->is_main_loop(), ""); We probably need another fix to avoid this. But to reduce risks, I still propose we just restore the incorrectly updated code in this PR for jdk 19 and do complete fix in jdk 20. ------------- PR: https://git.openjdk.org/jdk19/pull/130 From shade at openjdk.org Mon Jul 18 09:54:17 2022 From: shade at openjdk.org (Aleksey Shipilev) Date: Mon, 18 Jul 2022 09:54:17 GMT Subject: RFR: 8289046: Undefined Behaviour in x86 class Assembler [v5] In-Reply-To: References: Message-ID: On Tue, 5 Jul 2022 08:12:23 GMT, Andrew Haley wrote: >> All instances of type Register exhibit UB in the form of wild pointer (including null pointer) dereferences. This isn't very hard to fix: we should make Registers pointers to something rather than aliases of small integers. >> >> Here's an example of what was happening: >> >> ` rax->encoding();` >> >> Where rax is defined as `(Register *)0`. >> >> This patch things so that rax is now defined as a pointer to the start of a static array of RegisterImpl. >> >> >> typedef const RegisterImpl* Register; >> extern RegisterImpl all_Registers[RegisterImpl::number_of_declared_registers + 1] ; >> inline constexpr Register RegisterImpl::first() { return all_Registers + 1; }; >> inline constexpr Register as_Register(int encoding) { return RegisterImpl::first() + encoding; } >> constexpr Register rax = as_register(0); > > Andrew Haley has updated the pull request incrementally with one additional commit since the last revision: > > Delete changes to hotspot/shared. Looks okay to me, with minor nits. 
src/hotspot/cpu/x86/register_x86.cpp line 47: > 45: KRegisterImpl::max_slots_per_register * KRegisterImpl::number_of_registers; > 46: > 47: const char * RegisterImpl::name() const { Suggestion: const char* RegisterImpl::name() const { src/hotspot/cpu/x86/register_x86.hpp line 170: > 168: int raw_encoding() const { return this - first(); } > 169: int encoding() const { assert(is_valid(), "invalid register"); return raw_encoding(); } > 170: bool is_valid() const { return 0 <= raw_encoding() && raw_encoding() < number_of_registers; } Suggestion: bool is_valid() const { return 0 <= raw_encoding() && raw_encoding() < number_of_registers; } ------------- Marked as reviewed by shade (Reviewer). PR: https://git.openjdk.org/jdk/pull/9261 From thartmann at openjdk.org Mon Jul 18 10:29:04 2022 From: thartmann at openjdk.org (Tobias Hartmann) Date: Mon, 18 Jul 2022 10:29:04 GMT Subject: RFR: JDK-8290069: IGV: Highlight both graphs of difference in outline In-Reply-To: References: Message-ID: On Tue, 12 Jul 2022 13:10:37 GMT, Tobias Holenstein wrote: > Previously, IGV highlighted only one graph in the outline when a difference graph is selected using the sliders. > Now, IGV highlights both graphs used to calculate the difference graph when they are in the same group. > > highlight both graphs > > IGV colors the nodes in a difference graph with yellow/red/green to highlight the changes. This only worked if the difference graph is calculated using the sliders. Now, difference graphs is also coloured when calculated via the context menu "Difference to current graph" in the outline. > > Show colors Looks good. ------------- Marked as reviewed by thartmann (Reviewer). 
PR: https://git.openjdk.org/jdk/pull/9468 From thartmann at openjdk.org Mon Jul 18 11:13:59 2022 From: thartmann at openjdk.org (Tobias Hartmann) Date: Mon, 18 Jul 2022 11:13:59 GMT Subject: RFR: JDK-8290016: IGV: Fix graph panning when mouse dragged outside of window In-Reply-To: <5cJlObLZkQGGa5rj2V-w-bamKwe9od2G7EunMdVY8eM=.47289b58-9475-4cbe-b9a3-5abd47488817@github.com> References: <5cJlObLZkQGGa5rj2V-w-bamKwe9od2G7EunMdVY8eM=.47289b58-9475-4cbe-b9a3-5abd47488817@github.com> Message-ID: On Tue, 12 Jul 2022 13:48:16 GMT, Tobias Holenstein wrote: > A graph in IGV can be moved by dragging it with the left mouse button (called panning). > ![panning](https://user-images.githubusercontent.com/71546117/178509416-24dd900f-131b-484b-af47-c7a78e791434.png) > > If the mouse left the visible window of the graph during dragging, the diagram started to move in the opposite direction. This was annoying. Now panning stops as soon as the mouse leaves the window. > ![stop reverse panning](https://user-images.githubusercontent.com/71546117/178509309-3df03b7a-ada4-45a3-b9a7-d6e10664033d.png) > > In selection mode, the graph still moves when the mouse is dragged outside the window, as this is meant to make a larger selection. > ![keep panning for selection](https://user-images.githubusercontent.com/71546117/178509302-74fa41d2-e611-40a3-b6b0-c937ef4b2462.png) Changes requested by thartmann (Reviewer). src/utils/IdealGraphVisualizer/View/src/main/java/com/sun/hotspot/igv/view/actions/CustomizablePanAction.java line 149: > 147: scrollPane = null; > 148: } > 149: return state ? State.REJECTED : State.REJECTED; That does not look correct. 
------------- PR: https://git.openjdk.org/jdk/pull/9470 From thartmann at openjdk.org Mon Jul 18 11:18:09 2022 From: thartmann at openjdk.org (Tobias Hartmann) Date: Mon, 18 Jul 2022 11:18:09 GMT Subject: RFR: 8289801: [IR Framework] Add flags to whitelist which can be used to simulate a specific machine setup like UseAVX [v2] In-Reply-To: References: Message-ID: On Sun, 17 Jul 2022 10:15:52 GMT, Jie Fu wrote: >> Hi all, >> >> Please review this trivial change which adds `UseAVX, UseSSE and UseSVE` to the whitelist of IR test framework. >> >> Thanks. >> Best regards, >> Jie > > Jie Fu has updated the pull request incrementally with one additional commit since the last revision: > > Address review comments Looks good to me. I'll run testing and report back once it passed. ------------- Marked as reviewed by thartmann (Reviewer). PR: https://git.openjdk.org/jdk/pull/9509 From roland at openjdk.org Mon Jul 18 11:24:54 2022 From: roland at openjdk.org (Roland Westrelin) Date: Mon, 18 Jul 2022 11:24:54 GMT Subject: [jdk19] RFR: 8289954: C2: Assert failed in PhaseCFG::verify() after JDK-8183390 In-Reply-To: References: Message-ID: <5zf2d9bWEataJ-64cm-qt8Ml1hCyz6DVYcK4RR2sxG0=.e0601b61-8ad1-49b7-9641-269309eabb8e@github.com> On Mon, 11 Jul 2022 08:41:21 GMT, Pengfei Li wrote: > Fuzzer tests report an assertion failure issue in C2 global code motion > phase. Git bisection shows the problem starts after our fix of post loop > vectorization (JDK-8183390). After some narrowing down work, we find it > is caused by below change in that patch. 
>
>
> @@ -422,14 +404,7 @@
>            cl->mark_passed_slp();
>          }
>          cl->mark_was_slp();
> -        if (cl->is_main_loop()) {
> -          cl->set_slp_max_unroll(local_loop_unroll_factor);
> -        } else if (post_loop_allowed) {
> -          if (!small_basic_type) {
> -            // avoid replication context for small basic types in programmable masked loops
> -            cl->set_slp_max_unroll(local_loop_unroll_factor);
> -          }
> -        }
> +        cl->set_slp_max_unroll(local_loop_unroll_factor);
>        }
>      }
>
>
> This change is in function `SuperWord::unrolling_analysis()`. AFAIK, it
> helps find a loop's max unroll count via some analysis. In the original
> code, we have loop type checks and the slp max unroll value is set for
> only some types of loops. But in JDK-8183390, the check was removed by
> mistake. In my current understanding, the slp max unroll value applies
> to slp candidate loops only - either main loops or RCE'd post loops -
> so that check shouldn't be removed. After restoring it we don't see the
> assertion failure any more.
>
> The new jtreg created in this patch can reproduce the failed assertion,
> which checks `def_block->dominates(block)` - the domination relationship
> of two blocks. But in the case, I found the blocks are in an unreachable
> inner loop, which I think ought to be optimized away in some previous C2
> phases. As I'm not quite familiar with the C2's global code motion, so
> far I still don't understand how slp max unroll count eventually causes
> that problem. This patch just restores the if condition which I removed
> incorrectly in JDK-8183390. But I still suspect that there is another
> hidden bug exists in C2. I would be glad if any reviewers can give me
> some guidance or suggestions.
>
> Tested hotspot::hotspot_all_no_apps, jdk::tier1~3 and langtools::tier1.

Looks good to me.

-------------

Marked as reviewed by roland (Reviewer).
PR: https://git.openjdk.org/jdk19/pull/130 From roland at openjdk.org Mon Jul 18 11:24:56 2022 From: roland at openjdk.org (Roland Westrelin) Date: Mon, 18 Jul 2022 11:24:56 GMT Subject: [jdk19] RFR: 8289954: C2: Assert failed in PhaseCFG::verify() after JDK-8183390 In-Reply-To: References: <06fBlXzg44r8kkvZoLU0mq_z67T0E0r_cK6tkDhoDj4=.e1ebd242-fb19-48b8-ac7b-cddd841430d9@github.com> Message-ID: <2ijednHX3IN0rr3E7tDmy1jeDCWMdnbXxpI1bnlAEt4=.af1f2416-a607-41f2-842e-e60a07a25a55@github.com> On Mon, 18 Jul 2022 09:19:28 GMT, Pengfei Li wrote: > We probably need another fix to avoid this. But to reduce risks, I still propose we just restore the incorrectly updated code in this PR for jdk 19 and do a complete fix in jdk 20. That's reasonable. ------------- PR: https://git.openjdk.org/jdk19/pull/130 From aph at openjdk.org Mon Jul 18 12:08:57 2022 From: aph at openjdk.org (Andrew Haley) Date: Mon, 18 Jul 2022 12:08:57 GMT Subject: RFR: 8289046: Undefined Behaviour in x86 class Assembler [v6] In-Reply-To: References: Message-ID: > All instances of type Register exhibit UB in the form of wild pointer (including null pointer) dereferences. This isn't very hard to fix: we should make Registers pointers to something rather than aliases of small integers. > > Here's an example of what was happening: > > ` rax->encoding();` > > Where rax is defined as `(Register *)0`. > > This patch changes things so that rax is now defined as a pointer to the start of a static array of RegisterImpl.
> > > typedef const RegisterImpl* Register; > extern RegisterImpl all_Registers[RegisterImpl::number_of_declared_registers + 1]; > inline constexpr Register RegisterImpl::first() { return all_Registers + 1; }; > inline constexpr Register as_Register(int encoding) { return RegisterImpl::first() + encoding; } > constexpr Register rax = as_Register(0); Andrew Haley has updated the pull request incrementally with one additional commit since the last revision: Update src/hotspot/cpu/x86/register_x86.cpp Co-authored-by: Aleksey Shipilëv ------------- Changes: - all: https://git.openjdk.org/jdk/pull/9261/files - new: https://git.openjdk.org/jdk/pull/9261/files/df457ba6..3e2582da Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=9261&range=05 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=9261&range=04-05 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/9261.diff Fetch: git fetch https://git.openjdk.org/jdk pull/9261/head:pull/9261 PR: https://git.openjdk.org/jdk/pull/9261 From pli at openjdk.org Mon Jul 18 12:21:35 2022 From: pli at openjdk.org (Pengfei Li) Date: Mon, 18 Jul 2022 12:21:35 GMT Subject: [jdk19] Integrated: 8289954: C2: Assert failed in PhaseCFG::verify() after JDK-8183390 In-Reply-To: References: Message-ID: On Mon, 11 Jul 2022 08:41:21 GMT, Pengfei Li wrote: > Fuzzer tests report an assertion failure issue in C2 global code motion > phase. Git bisection shows the problem starts after our fix of post loop > vectorization (JDK-8183390). After some narrowing down work, we find it > is caused by below change in that patch.
> > > @@ -422,14 +404,7 @@ > cl->mark_passed_slp(); > } > cl->mark_was_slp(); > - if (cl->is_main_loop()) { > - cl->set_slp_max_unroll(local_loop_unroll_factor); > - } else if (post_loop_allowed) { > - if (!small_basic_type) { > - // avoid replication context for small basic types in programmable masked loops > - cl->set_slp_max_unroll(local_loop_unroll_factor); > - } > - } > + cl->set_slp_max_unroll(local_loop_unroll_factor); > } > } > > > This change is in function `SuperWord::unrolling_analysis()`. AFAIK, it > helps find a loop's max unroll count via some analysis. In the original > code, we have loop type checks and the slp max unroll value is set for > only some types of loops. But in JDK-8183390, the check was removed by > mistake. In my current understanding, the slp max unroll value applies > to slp candidate loops only - either main loops or RCE'd post loops - > so that check shouldn't be removed. After restoring it we don't see the > assertion failure any more. > > The new jtreg created in this patch can reproduce the failed assertion, > which checks `def_block->dominates(block)` - the domination relationship > of two blocks. But in the case, I found the blocks are in an unreachable > inner loop, which I think ought to be optimized away in some previous C2 > phases. As I'm not quite familiar with the C2's global code motion, so > far I still don't understand how slp max unroll count eventually causes > that problem. This patch just restores the if condition which I removed > incorrectly in JDK-8183390. But I still suspect that there is another > hidden bug exists in C2. I would be glad if any reviewers can give me > some guidance or suggestions. > > Tested hotspot::hotspot_all_no_apps, jdk::tier1~3 and langtools::tier1. This pull request has now been integrated. 
Changeset: 2677dd6d Author: Pengfei Li URL: https://git.openjdk.org/jdk19/commit/2677dd6d2318afb4afffde46f8e8e20276cb2894 Stats: 89 lines in 2 files changed: 88 ins; 0 del; 1 mod 8289954: C2: Assert failed in PhaseCFG::verify() after JDK-8183390 Reviewed-by: kvn, thartmann, roland ------------- PR: https://git.openjdk.org/jdk19/pull/130 From aph at openjdk.org Mon Jul 18 15:04:54 2022 From: aph at openjdk.org (Andrew Haley) Date: Mon, 18 Jul 2022 15:04:54 GMT Subject: RFR: 8289046: Undefined Behaviour in x86 class Assembler [v7] In-Reply-To: References: Message-ID: > All instances of type Register exhibit UB in the form of wild pointer (including null pointer) dereferences. This isn't very hard to fix: we should make Registers pointers to something rather than aliases of small integers. > > Here's an example of what was happening: > > ` rax->encoding();` > > Where rax is defined as `(Register *)0`. > > This patch changes things so that rax is now defined as a pointer to the start of a static array of RegisterImpl.
> > > typedef const RegisterImpl* Register; > extern RegisterImpl all_Registers[RegisterImpl::number_of_declared_registers + 1]; > inline constexpr Register RegisterImpl::first() { return all_Registers + 1; }; > inline constexpr Register as_Register(int encoding) { return RegisterImpl::first() + encoding; } > constexpr Register rax = as_Register(0); Andrew Haley has updated the pull request incrementally with one additional commit since the last revision: 8289046: Undefined Behaviour in x86 class Assembler ------------- Changes: - all: https://git.openjdk.org/jdk/pull/9261/files - new: https://git.openjdk.org/jdk/pull/9261/files/3e2582da..2375704b Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=9261&range=06 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=9261&range=05-06 Stats: 39 lines in 2 files changed: 38 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/9261.diff Fetch: git fetch https://git.openjdk.org/jdk pull/9261/head:pull/9261 PR: https://git.openjdk.org/jdk/pull/9261 From coleenp at openjdk.org Mon Jul 18 15:08:06 2022 From: coleenp at openjdk.org (Coleen Phillimore) Date: Mon, 18 Jul 2022 15:08:06 GMT Subject: RFR: 8290013: serviceability/jvmti/GetLocalVariable/GetLocalWithoutSuspendTest.java failed "assert(!in_vm) failed: Undersized StackShadowPages" In-Reply-To: References: Message-ID: <28EK8qwGGBXxUDtTvRoF2Nx9aUWjNgOGzc9oT6sSAQ0=.adec980e-86b1-419a-9641-b271d904d02d@github.com> On Fri, 15 Jul 2022 12:31:44 GMT, Coleen Phillimore wrote: > Bumped up the PRODUCT stack shadow pages, since if I change the assert(!in_vm) to a guarantee, I get the failure in product mode too. > Tested with tier7 and failed test now passing. Thanks Leonid!
------------- PR: https://git.openjdk.org/jdk/pull/9514 From coleenp at openjdk.org Mon Jul 18 15:08:08 2022 From: coleenp at openjdk.org (Coleen Phillimore) Date: Mon, 18 Jul 2022 15:08:08 GMT Subject: Integrated: 8290013: serviceability/jvmti/GetLocalVariable/GetLocalWithoutSuspendTest.java failed "assert(!in_vm) failed: Undersized StackShadowPages" In-Reply-To: References: Message-ID: <_FX41XmqGpRxFmfiyr9ESEQd6cXIPeMZUMrywhYfgmA=.e30322c4-516f-4bcb-91e1-2fa84fb5e526@github.com> On Fri, 15 Jul 2022 12:31:44 GMT, Coleen Phillimore wrote: > Bumped up the PRODUCT stack shadow pages, since if I change the assert(!in_vm) to a guarantee, I get the failure in product mode too. > Tested with tier7 and failed test now passing. This pull request has now been integrated. Changeset: 6882f0eb Author: Coleen Phillimore URL: https://git.openjdk.org/jdk/commit/6882f0eb39a1a1db1393925fab4143a725a96b6a Stats: 3 lines in 2 files changed: 0 ins; 1 del; 2 mod 8290013: serviceability/jvmti/GetLocalVariable/GetLocalWithoutSuspendTest.java failed "assert(!in_vm) failed: Undersized StackShadowPages" Reviewed-by: lmesnik ------------- PR: https://git.openjdk.org/jdk/pull/9514 From thartmann at openjdk.org Mon Jul 18 16:03:58 2022 From: thartmann at openjdk.org (Tobias Hartmann) Date: Mon, 18 Jul 2022 16:03:58 GMT Subject: RFR: 8289801: [IR Framework] Add flags to whitelist which can be used to simulate a specific machine setup like UseAVX [v2] In-Reply-To: References: Message-ID: On Sun, 17 Jul 2022 10:15:52 GMT, Jie Fu wrote: >> Hi all, >> >> Please review this trivial change which adds `UseAVX, UseSSE and UseSVE` to the whitelist of IR test framework. >> >> Thanks. >> Best regards, >> Jie > > Jie Fu has updated the pull request incrementally with one additional commit since the last revision: > > Address review comments All tests passed. 
------------- PR: https://git.openjdk.org/jdk/pull/9509 From aph at openjdk.org Mon Jul 18 16:31:02 2022 From: aph at openjdk.org (Andrew Haley) Date: Mon, 18 Jul 2022 16:31:02 GMT Subject: RFR: 8289046: Undefined Behaviour in x86 class Assembler [v8] In-Reply-To: References: Message-ID: > All instances of type Register exhibit UB in the form of wild pointer (including null pointer) dereferences. This isn't very hard to fix: we should make Registers pointers to something rather than aliases of small integers. > > Here's an example of what was happening: > > ` rax->encoding();` > > Where rax is defined as `(Register *)0`. > > This patch changes things so that rax is now defined as a pointer to the start of a static array of RegisterImpl. > > > typedef const RegisterImpl* Register; > extern RegisterImpl all_Registers[RegisterImpl::number_of_declared_registers + 1]; > inline constexpr Register RegisterImpl::first() { return all_Registers + 1; }; > inline constexpr Register as_Register(int encoding) { return RegisterImpl::first() + encoding; } > constexpr Register rax = as_Register(0); Andrew Haley has updated the pull request incrementally with one additional commit since the last revision: 8289046: Undefined Behaviour in x86 class Assembler ------------- Changes: - all: https://git.openjdk.org/jdk/pull/9261/files - new: https://git.openjdk.org/jdk/pull/9261/files/2375704b..c81b6294 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=9261&range=07 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=9261&range=06-07 Stats: 3 lines in 1 file changed: 0 ins; 0 del; 3 mod Patch: https://git.openjdk.org/jdk/pull/9261.diff Fetch: git fetch https://git.openjdk.org/jdk pull/9261/head:pull/9261 PR: https://git.openjdk.org/jdk/pull/9261 From kvn at openjdk.org Mon Jul 18 16:41:03 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Mon, 18 Jul 2022 16:41:03 GMT Subject: RFR: 8289801: [IR Framework] Add flags to whitelist which can be used to simulate a specific machine setup like
UseAVX [v2] In-Reply-To: References: Message-ID: On Sun, 17 Jul 2022 10:15:52 GMT, Jie Fu wrote: >> Hi all, >> >> Please review this trivial change which adds `UseAVX, UseSSE and UseSVE` to the whitelist of IR test framework. >> >> Thanks. >> Best regards, >> Jie > > Jie Fu has updated the pull request incrementally with one additional commit since the last revision: > > Address review comments Good. ------------- Marked as reviewed by kvn (Reviewer). PR: https://git.openjdk.org/jdk/pull/9509 From jwilhelm at openjdk.org Mon Jul 18 21:43:03 2022 From: jwilhelm at openjdk.org (Jesper Wilhelmsson) Date: Mon, 18 Jul 2022 21:43:03 GMT Subject: RFR: Merge jdk19 Message-ID: Forwardport JDK 19 -> JDK 20 ------------- Commit messages: - Merge remote-tracking branch 'jdk19/master' into Merge_jdk19 - 8289954: C2: Assert failed in PhaseCFG::verify() after JDK-8183390 - 8289127: Apache Lucene triggers: DEBUG MESSAGE: duplicated predicate failed which is impossible The webrevs contain the adjustments done while merging with regards to each parent branch: - master: https://webrevs.openjdk.org/?repo=jdk&pr=9544&range=00.0 - jdk19: https://webrevs.openjdk.org/?repo=jdk&pr=9544&range=00.1 Changes: https://git.openjdk.org/jdk/pull/9544/files Stats: 175 lines in 4 files changed: 171 ins; 0 del; 4 mod Patch: https://git.openjdk.org/jdk/pull/9544.diff Fetch: git fetch https://git.openjdk.org/jdk pull/9544/head:pull/9544 PR: https://git.openjdk.org/jdk/pull/9544 From jiefu at openjdk.org Mon Jul 18 22:52:04 2022 From: jiefu at openjdk.org (Jie Fu) Date: Mon, 18 Jul 2022 22:52:04 GMT Subject: RFR: 8289801: [IR Framework] Add flags to whitelist which can be used to simulate a specific machine setup like UseAVX [v2] In-Reply-To: References: Message-ID: On Sun, 17 Jul 2022 10:15:52 GMT, Jie Fu wrote: >> Hi all, >> >> Please review this trivial change which adds `UseAVX, UseSSE and UseSVE` to the whitelist of IR test framework. >> >> Thanks. 
>> Best regards, >> Jie > > Jie Fu has updated the pull request incrementally with one additional commit since the last revision: > > Address review comments Thank you all for the review and testing. ------------- PR: https://git.openjdk.org/jdk/pull/9509 From jiefu at openjdk.org Mon Jul 18 22:52:05 2022 From: jiefu at openjdk.org (Jie Fu) Date: Mon, 18 Jul 2022 22:52:05 GMT Subject: Integrated: 8289801: [IR Framework] Add flags to whitelist which can be used to simulate a specific machine setup like UseAVX In-Reply-To: References: Message-ID: On Fri, 15 Jul 2022 09:13:04 GMT, Jie Fu wrote: > Hi all, > > Please review this trivial change which adds `UseAVX, UseSSE and UseSVE` to the whitelist of IR test framework. > > Thanks. > Best regards, > Jie This pull request has now been integrated. Changeset: 4a4d8ed8 Author: Jie Fu URL: https://git.openjdk.org/jdk/commit/4a4d8ed83bea048cbfa6ab4c2ef6aa066cefe650 Stats: 5 lines in 3 files changed: 4 ins; 0 del; 1 mod 8289801: [IR Framework] Add flags to whitelist which can be used to simulate a specific machine setup like UseAVX Reviewed-by: kvn, xgong, thartmann ------------- PR: https://git.openjdk.org/jdk/pull/9509 From jwilhelm at openjdk.org Mon Jul 18 22:55:07 2022 From: jwilhelm at openjdk.org (Jesper Wilhelmsson) Date: Mon, 18 Jul 2022 22:55:07 GMT Subject: Integrated: Merge jdk19 In-Reply-To: References: Message-ID: On Mon, 18 Jul 2022 21:30:55 GMT, Jesper Wilhelmsson wrote: > Forwardport JDK 19 -> JDK 20 This pull request has now been integrated. 
Changeset: 6cd1c0c1 Author: Jesper Wilhelmsson URL: https://git.openjdk.org/jdk/commit/6cd1c0c14e7c9f9e8f77b32adcb792556645c0ac Stats: 175 lines in 4 files changed: 171 ins; 0 del; 4 mod Merge ------------- PR: https://git.openjdk.org/jdk/pull/9544 From pli at openjdk.org Tue Jul 19 02:06:17 2022 From: pli at openjdk.org (Pengfei Li) Date: Tue, 19 Jul 2022 02:06:17 GMT Subject: RFR: 8289996: Fix array range check hoisting for some scaled loop iv [v2] In-Reply-To: References: Message-ID: > Recently we found some array range checks in loops are not hoisted by > C2's loop predication phase as expected. Below is a typical case. > > for (int i = 0; i < size; i++) { > b[3 * i] = a[3 * i]; > } > > Ideally, C2 can hoist the range check of an array access in loop if the > array index is a linear function of the loop's induction variable (iv). > Say, range check in `arr[exp]` can be hoisted if > > exp = k1 * iv + k2 + inv > > where `k1` and `k2` are compile-time constants, and `inv` is an optional > loop invariant. But in above case, C2 igvn does some strength reduction > on the `MulINode` used to compute `3 * i`. It results in the linear index > expression not being recognized. So far we found 2 ideal transformations > that may affect linear expression recognition. They are > > - `k * iv` --> `iv << m + iv << n` if k is the sum of 2 pow-of-2 values > - `k * iv` --> `iv << m - iv` if k+1 is a pow-of-2 value > > To avoid range check hoisting and further optimizations being broken, we > have tried improving the linear recognition. But after some experiments, > we found complex and recursive pattern match does not always work well. > In this patch we propose to defer these 2 ideal transformations to the > phase of post loop igvn. In other words, these 2 strength reductions can > only be done after all loop optimizations are over. > > Tested hotspot::hotspot_all_no_apps, jdk::tier1~3 and langtools::tier1. > We also tested the performance via JMH and see obvious improvement. 
> > Benchmark Improvement > RangeCheckHoisting.ivScaled3 +21.2% > RangeCheckHoisting.ivScaled7 +6.6% Pengfei Li has updated the pull request incrementally with one additional commit since the last revision: Address comment from merykitty ------------- Changes: - all: https://git.openjdk.org/jdk/pull/9508/files - new: https://git.openjdk.org/jdk/pull/9508/files/f65be1c8..8a3c704a Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=9508&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=9508&range=00-01 Stats: 14 lines in 1 file changed: 9 ins; 2 del; 3 mod Patch: https://git.openjdk.org/jdk/pull/9508.diff Fetch: git fetch https://git.openjdk.org/jdk pull/9508/head:pull/9508 PR: https://git.openjdk.org/jdk/pull/9508 From pli at openjdk.org Tue Jul 19 02:06:18 2022 From: pli at openjdk.org (Pengfei Li) Date: Tue, 19 Jul 2022 02:06:18 GMT Subject: RFR: 8289996: Fix array range check hoisting for some scaled loop iv In-Reply-To: References: Message-ID: On Fri, 15 Jul 2022 08:07:34 GMT, Pengfei Li wrote: > Recently we found some array range checks in loops are not hoisted by > C2's loop predication phase as expected. Below is a typical case. > > for (int i = 0; i < size; i++) { > b[3 * i] = a[3 * i]; > } > > Ideally, C2 can hoist the range check of an array access in loop if the > array index is a linear function of the loop's induction variable (iv). > Say, range check in `arr[exp]` can be hoisted if > > exp = k1 * iv + k2 + inv > > where `k1` and `k2` are compile-time constants, and `inv` is an optional > loop invariant. But in above case, C2 igvn does some strength reduction > on the `MulINode` used to compute `3 * i`. It results in the linear index > expression not being recognized. So far we found 2 ideal transformations > that may affect linear expression recognition. 
They are > > - `k * iv` --> `iv << m + iv << n` if k is the sum of 2 pow-of-2 values > - `k * iv` --> `iv << m - iv` if k+1 is a pow-of-2 value > > To avoid range check hoisting and further optimizations being broken, we > have tried improving the linear recognition. But after some experiments, > we found complex and recursive pattern match does not always work well. > In this patch we propose to defer these 2 ideal transformations to the > phase of post loop igvn. In other words, these 2 strength reductions can > only be done after all loop optimizations are over. > > Tested hotspot::hotspot_all_no_apps, jdk::tier1~3 and langtools::tier1. > We also tested the performance via JMH and see obvious improvement. > > Benchmark Improvement > RangeCheckHoisting.ivScaled3 +21.2% > RangeCheckHoisting.ivScaled7 +6.6% Hi @merykitty , I have pushed a new commit based on your suggestions and re-tested it. Please have a look. ------------- PR: https://git.openjdk.org/jdk/pull/9508 From fyang at openjdk.org Tue Jul 19 06:43:11 2022 From: fyang at openjdk.org (Fei Yang) Date: Tue, 19 Jul 2022 06:43:11 GMT Subject: RFR: 8290496: riscv: Fix build warnings-as-errors with GCC 11 Message-ID: <8WSWNQMDGWmlp_0zzJFJJh1qMbBhuGiU_OvRcm23kx4=.8f1cc5f0-1084-4ead-9357-73e200699c46@github.com> Like the AArch64 port, the RISC-V port defines DEOPTIMIZE_WHEN_PATCHING and does not make use of C1 runtime patching. Then there is no need to implement class NativeMovRegMem in the port-specific code. But omitting it makes GCC 11 unhappy. One option would be to guard the C1 shared code that uses class NativeMovRegMem, such as class PatchingStub, under macro DEOPTIMIZE_WHEN_PATCHING. But it turns out that class PatchingStub is still partially used (mainly the PatchID enum) even under DEOPTIMIZE_WHEN_PATCHING. Like the AArch64 port, this PR simply works around the issue by implementing the NativeMovRegMem class for the RISC-V port.
Testing: release/fastdebug builds without --disable-warnings-as-errors ------------- Commit messages: - 8290496: riscv: Fix build warnings-as-errors with GCC 11 Changes: https://git.openjdk.org/jdk/pull/9550/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=9550&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8290496 Stats: 43 lines in 2 files changed: 6 ins; 14 del; 23 mod Patch: https://git.openjdk.org/jdk/pull/9550.diff Fetch: git fetch https://git.openjdk.org/jdk pull/9550/head:pull/9550 PR: https://git.openjdk.org/jdk/pull/9550 From xgong at openjdk.org Tue Jul 19 07:58:04 2022 From: xgong at openjdk.org (Xiaohong Gong) Date: Tue, 19 Jul 2022 07:58:04 GMT Subject: RFR: 8290034: Auto vectorize reverse bit operations. In-Reply-To: References: Message-ID: On Mon, 18 Jul 2022 08:01:09 GMT, Jatin Bhateja wrote: > Summary of changes: > - Intrinsify scalar bit reverse APIs to emit an efficient instruction sequence for X86 targets with and without the GFNI feature. > - Handle auto-vectorization of Integer/Long.reverse bit operations. > - The backend implementation for these was added with the 4th incubation of the Vector API. > > Following are performance numbers for the newly added JMH micro benchmarks: > > > No-GFNI(CLX): > ============= > Baseline: > Benchmark (size) Mode Cnt Score Error Units > Integers.reverse 500 avgt 2 1.085 us/op > Longs.reverse 500 avgt 2 1.236 us/op > WithOpt: > Benchmark (size) Mode Cnt Score Error Units > Integers.reverse 500 avgt 2 0.104 us/op > Longs.reverse 500 avgt 2 0.255 us/op > > With-GFNI(ICX): > =============== > Baseline: > Benchmark (size) Mode Cnt Score Error Units > Integers.reverse 500 avgt 2 0.887 us/op > Longs.reverse 500 avgt 2 1.095 us/op > > WithOpt: > Benchmark (size) Mode Cnt Score Error Units > Integers.reverse 500 avgt 2 0.037 us/op > Longs.reverse 500 avgt 2 0.145 us/op > > > Kindly review and share feedback. > > Best Regards, > Jatin The common code looks good to me. Just some style issues.
src/hotspot/share/opto/subnode.cpp line 1917: > 1915: const TypeInt* t1int = t1->isa_int(); > 1916: if (t1int && t1int->is_con()) { > 1917: jint res = reverse_bits(t1int->get_con()); There is an extra space in `res = reverse_bits` src/hotspot/share/opto/subnode.cpp line 1924: > 1922: > 1923: const Type* ReverseLNode::Value(PhaseGVN* phase) const { > 1924: const Type *t1 = phase->type( in(1) ); The same as in `ReverseINode::Value` src/hotspot/share/opto/subnode.cpp line 1930: > 1928: const TypeLong* t1long = t1->isa_long(); > 1929: if (t1long->is_con()) { > 1930: jint res = reverse_bits(t1long->get_con()); ditto ------------- PR: https://git.openjdk.org/jdk/pull/9535 From yadongwang at openjdk.org Tue Jul 19 08:20:57 2022 From: yadongwang at openjdk.org (Yadong Wang) Date: Tue, 19 Jul 2022 08:20:57 GMT Subject: RFR: 8290496: riscv: Fix build warnings-as-errors with GCC 11 In-Reply-To: <8WSWNQMDGWmlp_0zzJFJJh1qMbBhuGiU_OvRcm23kx4=.8f1cc5f0-1084-4ead-9357-73e200699c46@github.com> References: <8WSWNQMDGWmlp_0zzJFJJh1qMbBhuGiU_OvRcm23kx4=.8f1cc5f0-1084-4ead-9357-73e200699c46@github.com> Message-ID: On Tue, 19 Jul 2022 06:35:28 GMT, Fei Yang wrote: > Like the AArch64 port, the RISC-V port defines DEOPTIMIZE_WHEN_PATCHING and does not make use of C1 runtime patching. > Then there is no need to implement class NativeMovRegMem in the port-specific code. But omitting it makes GCC 11 unhappy. > > One option would be to guard the C1 shared code that uses class NativeMovRegMem, such as class PatchingStub, under macro DEOPTIMIZE_WHEN_PATCHING. But it turns out that class PatchingStub is still partially used (mainly the PatchID enum) even under DEOPTIMIZE_WHEN_PATCHING. > > So this PR takes another approach and fixes the warning by implementing the NativeMovRegMem class for RISC-V, as the AArch64 port does. > > Testing: release & fastdebug build OK without --disable-warnings-as-errors lgtm ------------- Marked as reviewed by yadongwang (Author).
PR: https://git.openjdk.org/jdk/pull/9550 From yadongwang at openjdk.org Tue Jul 19 09:14:03 2022 From: yadongwang at openjdk.org (Yadong Wang) Date: Tue, 19 Jul 2022 09:14:03 GMT Subject: RFR: 8290496: riscv: Fix build warnings-as-errors with GCC 11 In-Reply-To: <8WSWNQMDGWmlp_0zzJFJJh1qMbBhuGiU_OvRcm23kx4=.8f1cc5f0-1084-4ead-9357-73e200699c46@github.com> References: <8WSWNQMDGWmlp_0zzJFJJh1qMbBhuGiU_OvRcm23kx4=.8f1cc5f0-1084-4ead-9357-73e200699c46@github.com> Message-ID: On Tue, 19 Jul 2022 06:35:28 GMT, Fei Yang wrote: > Like the AArch64 port, the RISC-V port defines DEOPTIMIZE_WHEN_PATCHING and does not make use of C1 runtime patching. > Then there is no need to implement class NativeMovRegMem in the port-specific code. But omitting it makes GCC 11 unhappy. > > One option would be to guard the C1 shared code that uses class NativeMovRegMem, such as class PatchingStub, under macro DEOPTIMIZE_WHEN_PATCHING. But it turns out that class PatchingStub is still partially used (mainly the PatchID enum) even under DEOPTIMIZE_WHEN_PATCHING. > > So this PR takes another approach and fixes the warning by implementing the NativeMovRegMem class for RISC-V, as the AArch64 port does. > > Testing: release & fastdebug build OK without --disable-warnings-as-errors Try GCC 12 if possible? ------------- PR: https://git.openjdk.org/jdk/pull/9550 From aph at openjdk.org Tue Jul 19 10:10:10 2022 From: aph at openjdk.org (Andrew Haley) Date: Tue, 19 Jul 2022 10:10:10 GMT Subject: RFR: 8289046: Undefined Behaviour in x86 class Assembler [v9] In-Reply-To: References: Message-ID: > All instances of type Register exhibit UB in the form of wild pointer (including null pointer) dereferences. This isn't very hard to fix: we should make Registers pointers to something rather than aliases of small integers. > > Here's an example of what was happening: > > ` rax->encoding();` > > Where rax is defined as `(Register *)0`. > > This patch changes things so that rax is now defined as a pointer to the start of a static array of RegisterImpl.
> > > typedef const RegisterImpl* Register; > extern RegisterImpl all_Registers[RegisterImpl::number_of_declared_registers + 1]; > inline constexpr Register RegisterImpl::first() { return all_Registers + 1; }; > inline constexpr Register as_Register(int encoding) { return RegisterImpl::first() + encoding; } > constexpr Register rax = as_Register(0); Andrew Haley has updated the pull request incrementally with one additional commit since the last revision: Update src/hotspot/cpu/x86/register_x86.hpp Co-authored-by: Aleksey Shipilëv ------------- Changes: - all: https://git.openjdk.org/jdk/pull/9261/files - new: https://git.openjdk.org/jdk/pull/9261/files/c81b6294..3006d36a Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=9261&range=08 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=9261&range=07-08 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/9261.diff Fetch: git fetch https://git.openjdk.org/jdk pull/9261/head:pull/9261 PR: https://git.openjdk.org/jdk/pull/9261 From aph at openjdk.org Tue Jul 19 10:10:12 2022 From: aph at openjdk.org (Andrew Haley) Date: Tue, 19 Jul 2022 10:10:12 GMT Subject: RFR: 8289046: Undefined Behaviour in x86 class Assembler [v5] In-Reply-To: References: Message-ID: On Tue, 12 Jul 2022 16:24:28 GMT, Vladimir Ivanov wrote: >> Andrew Haley has updated the pull request incrementally with one additional commit since the last revision: >> >> Delete changes to hotspot/shared. > > src/hotspot/cpu/x86/assembler_x86.cpp line 2592: > >> 2590: vex_prefix(src, 0, dst->encoding(), VEX_SIMD_NONE, VEX_OPCODE_0F, &attributes); >> 2591: emit_int8((unsigned char)0x90); >> 2592: emit_operand(as_Register(dst->encoding()), src); > > I'm in favor of a KRegister-specific `emit_operand` overload here.
> > > static int raw_encode(KRegister kreg) { > assert(kreg == knoreg || kreg->is_valid(), "sanity"); > int kreg_enc = kreg->raw_encoding(); > assert(kreg_enc == -1 || is_valid_encoding(kreg_enc), "sanity"); > return kreg_enc; > } > > void Assembler::emit_operand(KRegister kreg, Address adr, > int rip_relative_correction) { > emit_operand(kreg, adr._base, adr._index, adr._scale, adr._disp, > adr._rspec, > rip_relative_correction); > } > > void Assembler::emit_operand(KRegister kreg, Register base, Register index, > Address::ScaleFactor scale, int disp, > RelocationHolder const& rspec, > int rip_relative_correction) { > assert(!index->is_valid() || index != rsp, "illegal addressing mode"); > emit_operand_helper(raw_encode(kreg), raw_encode(base), raw_encode(index), > scale, disp, rspec, rip_relative_correction); > } OK now? ------------- PR: https://git.openjdk.org/jdk/pull/9261 From duke at openjdk.org Tue Jul 19 10:17:46 2022 From: duke at openjdk.org (Quan Anh Mai) Date: Tue, 19 Jul 2022 10:17:46 GMT Subject: RFR: 8289996: Fix array range check hoisting for some scaled loop iv [v2] In-Reply-To: References: Message-ID: On Tue, 19 Jul 2022 02:06:17 GMT, Pengfei Li wrote: >> Recently we found some array range checks in loops are not hoisted by >> C2's loop predication phase as expected. Below is a typical case. >> >> for (int i = 0; i < size; i++) { >> b[3 * i] = a[3 * i]; >> } >> >> Ideally, C2 can hoist the range check of an array access in loop if the >> array index is a linear function of the loop's induction variable (iv). >> Say, range check in `arr[exp]` can be hoisted if >> >> exp = k1 * iv + k2 + inv >> >> where `k1` and `k2` are compile-time constants, and `inv` is an optional >> loop invariant. But in above case, C2 igvn does some strength reduction >> on the `MulINode` used to compute `3 * i`. It results in the linear index >> expression not being recognized. So far we found 2 ideal transformations >> that may affect linear expression recognition. 
They are >> >> - `k * iv` --> `iv << m + iv << n` if k is the sum of 2 pow-of-2 values >> - `k * iv` --> `iv << m - iv` if k+1 is a pow-of-2 value >> >> To avoid range check hoisting and further optimizations being broken, we >> have tried improving the linear recognition. But after some experiments, >> we found complex and recursive pattern match does not always work well. >> In this patch we propose to defer these 2 ideal transformations to the >> phase of post loop igvn. In other words, these 2 strength reductions can >> only be done after all loop optimizations are over. >> >> Tested hotspot::hotspot_all_no_apps, jdk::tier1~3 and langtools::tier1. >> We also tested the performance via JMH and see obvious improvement. >> >> Benchmark Improvement >> RangeCheckHoisting.ivScaled3 +21.2% >> RangeCheckHoisting.ivScaled7 +6.6% > > Pengfei Li has updated the pull request incrementally with one additional commit since the last revision: > > Address comment from merykitty Thanks, it looks good to me, maybe we would need similar tests for long range checks also. ------------- PR: https://git.openjdk.org/jdk/pull/9508 From fjiang at openjdk.org Tue Jul 19 12:29:04 2022 From: fjiang at openjdk.org (Feilong Jiang) Date: Tue, 19 Jul 2022 12:29:04 GMT Subject: RFR: 8290496: riscv: Fix build warnings-as-errors with GCC 11 In-Reply-To: References: <8WSWNQMDGWmlp_0zzJFJJh1qMbBhuGiU_OvRcm23kx4=.8f1cc5f0-1084-4ead-9357-73e200699c46@github.com> Message-ID: On Tue, 19 Jul 2022 09:10:37 GMT, Yadong Wang wrote: > Try GCC 12 if possible? I have tried the native release build with GCC 12 (on Ubuntu 22.04, gcc version 12.0.1 20220319 (experimental)). The warning did not appear after this patch. 
------------- PR: https://git.openjdk.org/jdk/pull/9550 From roland at openjdk.org Tue Jul 19 12:36:07 2022 From: roland at openjdk.org (Roland Westrelin) Date: Tue, 19 Jul 2022 12:36:07 GMT Subject: RFR: 8290529: C2: assert(BoolTest(btest).is_canonical()) failure Message-ID: For the test case: 1) In Parse::do_if(), tst0 is: (Bool#lt (CmpU 0 Parm0)) 2) transformed by gvn in tst: (Bool#gt (CmpU Parm0 0)) 3) That test is not canonical and is negated and retransformed which results in: (Bool#eq (CmpI Parm0 0)) The assert fires because that test is not canonical either. The root cause I think is that the (CmpU .. 0) -> (CmpI .. 0) only triggers if the condition of the CmpU is canonical (and results in a non canonical test). Tweaking it so it applies even if the condition is not leads to the following change in the steps above: 2) (Bool#ne (CmpI Parm0 0)) which is a canonical test. ------------- Commit messages: - test - more - fix Changes: https://git.openjdk.org/jdk/pull/9553/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=9553&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8290529 Stats: 49 lines in 2 files changed: 46 ins; 0 del; 3 mod Patch: https://git.openjdk.org/jdk/pull/9553.diff Fetch: git fetch https://git.openjdk.org/jdk pull/9553/head:pull/9553 PR: https://git.openjdk.org/jdk/pull/9553 From fjiang at openjdk.org Tue Jul 19 12:52:47 2022 From: fjiang at openjdk.org (Feilong Jiang) Date: Tue, 19 Jul 2022 12:52:47 GMT Subject: RFR: 8290496: riscv: Fix build warnings-as-errors with GCC 11 In-Reply-To: <8WSWNQMDGWmlp_0zzJFJJh1qMbBhuGiU_OvRcm23kx4=.8f1cc5f0-1084-4ead-9357-73e200699c46@github.com> References: <8WSWNQMDGWmlp_0zzJFJJh1qMbBhuGiU_OvRcm23kx4=.8f1cc5f0-1084-4ead-9357-73e200699c46@github.com> Message-ID: On Tue, 19 Jul 2022 06:35:28 GMT, Fei Yang wrote: > Like AArch64 port, RISC-V port defines DEOPTIMIZE_WHEN_PATCHING and does not make use of C1 runtime patching. 
> Then there is no need to implement class NativeMovRegMem in the port-specific code. But that will make GCC 11 unhappy. > > One way would be guarding the C1 shared code like class PatchingStub which uses class NativeMovRegMem under macro DEOPTIMIZE_WHEN_PATCHING. But turns out class PatchingStub are still partially used (mainly the PatchID enum) even under DEOPTIMIZE_WHEN_PATCHING. > > So PR takes another way fixing this warning by implementing the NativeMovRegMem class for RISC-V like Like AArch64 port. > > Testing: release & fastdebug build OK without --disable-warnings-as-errors Marked as reviewed by fjiang (Author). ------------- PR: https://git.openjdk.org/jdk/pull/9550 From jiefu at openjdk.org Tue Jul 19 15:14:37 2022 From: jiefu at openjdk.org (Jie Fu) Date: Tue, 19 Jul 2022 15:14:37 GMT Subject: RFR: 8290511: compiler/vectorapi/TestMaskedMacroLogicVector.java fails IR verification Message-ID: <9U75Rnv5w9IDxSXEzwmkUU3lrLf1K9gFqhmILdWAguA=.1952004d-4349-4ff5-aee1-7e7714f0dc49@github.com> Hi all, Please review this trivial patch which fixes the TestMaskedMacroLogicVector.java failure when `-XX:UseSSE < 4` on AVX512 machines. The reason is that masked `AndV` requires `Op_VectorBlend`. However, `Op_VectorBlend` is not supported when `UseSSE < 4`. So part of the sub-tests (which check the masked `AndV > 0`) should be skipped when `UseSSE < 4` Thanks. 
Best regards, Jie ------------- Commit messages: - 8290511: compiler/vectorapi/TestMaskedMacroLogicVector.java fails IR verification Changes: https://git.openjdk.org/jdk/pull/9559/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=9559&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8290511 Stats: 3 lines in 1 file changed: 0 ins; 0 del; 3 mod Patch: https://git.openjdk.org/jdk/pull/9559.diff Fetch: git fetch https://git.openjdk.org/jdk pull/9559/head:pull/9559 PR: https://git.openjdk.org/jdk/pull/9559 From kvn at openjdk.org Tue Jul 19 15:20:46 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Tue, 19 Jul 2022 15:20:46 GMT Subject: RFR: 8290511: compiler/vectorapi/TestMaskedMacroLogicVector.java fails IR verification In-Reply-To: <9U75Rnv5w9IDxSXEzwmkUU3lrLf1K9gFqhmILdWAguA=.1952004d-4349-4ff5-aee1-7e7714f0dc49@github.com> References: <9U75Rnv5w9IDxSXEzwmkUU3lrLf1K9gFqhmILdWAguA=.1952004d-4349-4ff5-aee1-7e7714f0dc49@github.com> Message-ID: On Tue, 19 Jul 2022 15:04:38 GMT, Jie Fu wrote: > Hi all, > > Please review this trivial patch which fixes the TestMaskedMacroLogicVector.java failure when `-XX:UseSSE < 4` on AVX512 machines. > > The reason is that masked `AndV` requires `Op_VectorBlend`. > However, `Op_VectorBlend` is not supported when `UseSSE < 4`. > > So part of the sub-tests (which check the masked `AndV > 0`) should be skipped when `UseSSE < 4` > > Thanks. > Best regards, > Jie Good. ------------- Marked as reviewed by kvn (Reviewer). 
PR: https://git.openjdk.org/jdk/pull/9559 From thartmann at openjdk.org Tue Jul 19 15:54:04 2022 From: thartmann at openjdk.org (Tobias Hartmann) Date: Tue, 19 Jul 2022 15:54:04 GMT Subject: RFR: 8290511: compiler/vectorapi/TestMaskedMacroLogicVector.java fails IR verification In-Reply-To: <9U75Rnv5w9IDxSXEzwmkUU3lrLf1K9gFqhmILdWAguA=.1952004d-4349-4ff5-aee1-7e7714f0dc49@github.com> References: <9U75Rnv5w9IDxSXEzwmkUU3lrLf1K9gFqhmILdWAguA=.1952004d-4349-4ff5-aee1-7e7714f0dc49@github.com> Message-ID: On Tue, 19 Jul 2022 15:04:38 GMT, Jie Fu wrote: > Hi all, > > Please review this trivial patch which fixes the TestMaskedMacroLogicVector.java failure when `-XX:UseSSE < 4` on AVX512 machines. > > The reason is that masked `AndV` requires `Op_VectorBlend`. > However, `Op_VectorBlend` is not supported when `UseSSE < 4`. > > So part of the sub-tests (which check the masked `AndV > 0`) should be skipped when `UseSSE < 4` > > Thanks. > Best regards, > Jie Looks good. ------------- Marked as reviewed by thartmann (Reviewer). PR: https://git.openjdk.org/jdk/pull/9559 From jiefu at openjdk.org Tue Jul 19 22:55:45 2022 From: jiefu at openjdk.org (Jie Fu) Date: Tue, 19 Jul 2022 22:55:45 GMT Subject: RFR: 8290511: compiler/vectorapi/TestMaskedMacroLogicVector.java fails IR verification In-Reply-To: References: <9U75Rnv5w9IDxSXEzwmkUU3lrLf1K9gFqhmILdWAguA=.1952004d-4349-4ff5-aee1-7e7714f0dc49@github.com> Message-ID: On Tue, 19 Jul 2022 15:17:37 GMT, Vladimir Kozlov wrote: >> Hi all, >> >> Please review this trivial patch which fixes the TestMaskedMacroLogicVector.java failure when `-XX:UseSSE < 4` on AVX512 machines. >> >> The reason is that masked `AndV` requires `Op_VectorBlend`. >> However, `Op_VectorBlend` is not supported when `UseSSE < 4`. >> >> So part of the sub-tests (which check the masked `AndV > 0`) should be skipped when `UseSSE < 4` >> >> Thanks. >> Best regards, >> Jie > > Good. Thanks @vnkozlov and @TobiHartmann for the review. 
------------- PR: https://git.openjdk.org/jdk/pull/9559 From jiefu at openjdk.org Tue Jul 19 22:55:46 2022 From: jiefu at openjdk.org (Jie Fu) Date: Tue, 19 Jul 2022 22:55:46 GMT Subject: Integrated: 8290511: compiler/vectorapi/TestMaskedMacroLogicVector.java fails IR verification In-Reply-To: <9U75Rnv5w9IDxSXEzwmkUU3lrLf1K9gFqhmILdWAguA=.1952004d-4349-4ff5-aee1-7e7714f0dc49@github.com> References: <9U75Rnv5w9IDxSXEzwmkUU3lrLf1K9gFqhmILdWAguA=.1952004d-4349-4ff5-aee1-7e7714f0dc49@github.com> Message-ID: On Tue, 19 Jul 2022 15:04:38 GMT, Jie Fu wrote: > Hi all, > > Please review this trivial patch which fixes the TestMaskedMacroLogicVector.java failure when `-XX:UseSSE < 4` on AVX512 machines. > > The reason is that masked `AndV` requires `Op_VectorBlend`. > However, `Op_VectorBlend` is not supported when `UseSSE < 4`. > > So part of the sub-tests (which check the masked `AndV > 0`) should be skipped when `UseSSE < 4` > > Thanks. > Best regards, > Jie This pull request has now been integrated. Changeset: 43588648 Author: Jie Fu URL: https://git.openjdk.org/jdk/commit/43588648cacaa79a586ace8540dfe43eb64f9a46 Stats: 3 lines in 1 file changed: 0 ins; 0 del; 3 mod 8290511: compiler/vectorapi/TestMaskedMacroLogicVector.java fails IR verification Reviewed-by: kvn, thartmann ------------- PR: https://git.openjdk.org/jdk/pull/9559 From fyang at openjdk.org Wed Jul 20 01:14:48 2022 From: fyang at openjdk.org (Fei Yang) Date: Wed, 20 Jul 2022 01:14:48 GMT Subject: RFR: 8290496: riscv: Fix build warnings-as-errors with GCC 11 In-Reply-To: References: <8WSWNQMDGWmlp_0zzJFJJh1qMbBhuGiU_OvRcm23kx4=.8f1cc5f0-1084-4ead-9357-73e200699c46@github.com> Message-ID: On Tue, 19 Jul 2022 09:10:37 GMT, Yadong Wang wrote: >> Like AArch64 port, RISC-V port defines DEOPTIMIZE_WHEN_PATCHING and does not make use of C1 runtime patching. >> Then there is no need to implement class NativeMovRegMem in the port-specific code. But that will make GCC 11 unhappy. 
>> >> One way would be guarding the C1 shared code like class PatchingStub which uses class NativeMovRegMem under macro DEOPTIMIZE_WHEN_PATCHING. But turns out class PatchingStub are still partially used (mainly the PatchID enum) even under DEOPTIMIZE_WHEN_PATCHING. >> >> So PR takes another way fixing this warning by implementing the NativeMovRegMem class for RISC-V like Like AArch64 port. >> >> Testing: release & fastdebug build OK without --disable-warnings-as-errors > > Try GCC 12 if possible? @yadongw @feilongjiang : Thanks. Could we have a Reviewer please? @shipilev ? ------------- PR: https://git.openjdk.org/jdk/pull/9550 From xgong at openjdk.org Wed Jul 20 07:08:44 2022 From: xgong at openjdk.org (Xiaohong Gong) Date: Wed, 20 Jul 2022 07:08:44 GMT Subject: RFR: 8290485: [vector] REVERSE_BYTES for byte type should not emit any instructions Message-ID: The Vector API unary operation "`REVERSE_BYTES`" should not emit any instructions for byte vectors. The same to the relative masked operation. Currently it emits `"mov dst, src"` on aarch64 when the "`dst`" and "`src`" are not the same register. But for the masked "`REVERSE_BYTES`", the compiler will always generate a "`VectorBlend`" which I think is redundant, since the first and second vector input is the same one. Please see the generated codes for the masked "`REVERSE_BYTES`" for byte type with NEON: ldr q16, [x15, #16] ; load the "src" vector mov v17.16b, v16.16b ; reverse bytes "src" ldr q18, [x13, #16] neg v18.16b, v18.16b ; load the vector mask bsl v18.16b, v17.16b, v16.16b ; vector blend The elements in register "`v17`" and "`v16`" are the same to each other, so the elements in result of "`bsl`" is the same to the original loaded values in "`v16`", no matter what the values in the vector mask are. To improve this, we can add the igvn transformations for "`ReverseBytesV`" and "`VectorBlend`" in compiler. For "`ReverseBytesV`", it can return the vector input if the basic element type is `T_BYTE`. 
And for "`VectorBlend`", it can return the first input if the first and the second input are the same one. Here is the performance data for the jmh benchmark [1] on ARM NEON: Benchmark (size) Mode Cnt Before After Units ByteMaxVector.REVERSE_BYTES 1024 thrpt 15 19457.641 19516.124 ops/ms ByteMaxVector.REVERSE_BYTESMasked 1024 thrpt 15 12498.416 20528.004 ops/ms This patch may not have any influence to the non-masked "`REVERSE_BYTES`" on ARM NEON, because the backend may not emit any instruction for it before. And here is the performance data on an x86 system: Benchmark (size) Mode Cnt Before After Units ByteMaxVector.REVERSE_BYTES 1024 thrpt 15 19358.941 20012.047 ops/ms ByteMaxVector.REVERSE_BYTESMasked 1024 thrpt 15 15759.788 20389.996 ops/ms [1] https://github.com/openjdk/panama-vector/blob/vectorIntrinsics/test/micro/org/openjdk/bench/jdk/incubator/vector/operation/ByteMaxVector.java#L2201 ------------- Commit messages: - 8290485: [vector] REVERSE_BYTES for byte type should not emit any instructions Changes: https://git.openjdk.org/jdk/pull/9565/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=9565&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8290485 Stats: 123 lines in 4 files changed: 121 ins; 0 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/9565.diff Fetch: git fetch https://git.openjdk.org/jdk pull/9565/head:pull/9565 PR: https://git.openjdk.org/jdk/pull/9565 From shade at openjdk.org Wed Jul 20 08:02:03 2022 From: shade at openjdk.org (Aleksey Shipilev) Date: Wed, 20 Jul 2022 08:02:03 GMT Subject: RFR: 8290496: riscv: Fix build warnings-as-errors with GCC 11 In-Reply-To: <8WSWNQMDGWmlp_0zzJFJJh1qMbBhuGiU_OvRcm23kx4=.8f1cc5f0-1084-4ead-9357-73e200699c46@github.com> References: <8WSWNQMDGWmlp_0zzJFJJh1qMbBhuGiU_OvRcm23kx4=.8f1cc5f0-1084-4ead-9357-73e200699c46@github.com> Message-ID: <5JjcW8isTPmCfVBWMG7WFthyWDckSgBlWaozz-OYoxA=.a6a3959e-1c01-4c39-b306-0c5390b36df3@github.com> On Tue, 19 Jul 2022 06:35:28 GMT, Fei Yang wrote: > Like AArch64 
port, RISC-V port defines DEOPTIMIZE_WHEN_PATCHING and does not make use of C1 runtime patching. > Then there is no need to implement class NativeMovRegMem in the port-specific code. But that will make GCC 11 unhappy. > > One way would be guarding the C1 shared code like class PatchingStub which uses class NativeMovRegMem under condition #ifndef DEOPTIMIZE_WHEN_PATCHING. But turns out class PatchingStub are still partially used (mainly the PatchID enum) even under DEOPTIMIZE_WHEN_PATCHING. > > So PR takes another way fixing this warning by implementing the NativeMovRegMem class for RISC-V like Like AArch64 port. > > Testing: release & fastdebug build OK without --disable-warnings-as-errors Looks okay! ------------- Marked as reviewed by shade (Reviewer). PR: https://git.openjdk.org/jdk/pull/9550 From fyang at openjdk.org Wed Jul 20 08:27:57 2022 From: fyang at openjdk.org (Fei Yang) Date: Wed, 20 Jul 2022 08:27:57 GMT Subject: RFR: 8290496: riscv: Fix build warnings-as-errors with GCC 11 In-Reply-To: <5JjcW8isTPmCfVBWMG7WFthyWDckSgBlWaozz-OYoxA=.a6a3959e-1c01-4c39-b306-0c5390b36df3@github.com> References: <8WSWNQMDGWmlp_0zzJFJJh1qMbBhuGiU_OvRcm23kx4=.8f1cc5f0-1084-4ead-9357-73e200699c46@github.com> <5JjcW8isTPmCfVBWMG7WFthyWDckSgBlWaozz-OYoxA=.a6a3959e-1c01-4c39-b306-0c5390b36df3@github.com> Message-ID: On Wed, 20 Jul 2022 08:00:00 GMT, Aleksey Shipilev wrote: > Looks okay! Thanks all for the review. 
------------- PR: https://git.openjdk.org/jdk/pull/9550 From fyang at openjdk.org Wed Jul 20 08:29:22 2022 From: fyang at openjdk.org (Fei Yang) Date: Wed, 20 Jul 2022 08:29:22 GMT Subject: Integrated: 8290496: riscv: Fix build warnings-as-errors with GCC 11 In-Reply-To: <8WSWNQMDGWmlp_0zzJFJJh1qMbBhuGiU_OvRcm23kx4=.8f1cc5f0-1084-4ead-9357-73e200699c46@github.com> References: <8WSWNQMDGWmlp_0zzJFJJh1qMbBhuGiU_OvRcm23kx4=.8f1cc5f0-1084-4ead-9357-73e200699c46@github.com> Message-ID: On Tue, 19 Jul 2022 06:35:28 GMT, Fei Yang wrote: > Like AArch64 port, RISC-V port defines DEOPTIMIZE_WHEN_PATCHING and does not make use of C1 runtime patching. > Then there is no need to implement class NativeMovRegMem in the port-specific code. But that will make GCC 11 unhappy. > > One way would be guarding the C1 shared code like class PatchingStub which uses class NativeMovRegMem under condition #ifndef DEOPTIMIZE_WHEN_PATCHING. But turns out class PatchingStub are still partially used (mainly the PatchID enum) even under DEOPTIMIZE_WHEN_PATCHING. > > So PR takes another way fixing this warning by implementing the NativeMovRegMem class for RISC-V like Like AArch64 port. > > Testing: release & fastdebug build OK without --disable-warnings-as-errors This pull request has now been integrated. 
Changeset: 5425573b Author: Fei Yang URL: https://git.openjdk.org/jdk/commit/5425573bb4de1a2434201bc7ec3700b527ce346b Stats: 43 lines in 2 files changed: 6 ins; 14 del; 23 mod 8290496: riscv: Fix build warnings-as-errors with GCC 11 Reviewed-by: yadongwang, fjiang, shade ------------- PR: https://git.openjdk.org/jdk/pull/9550 From shade at openjdk.org Wed Jul 20 09:53:44 2022 From: shade at openjdk.org (Aleksey Shipilev) Date: Wed, 20 Jul 2022 09:53:44 GMT Subject: RFR: 8290704: x86: TemplateTable::_new should not call eden_allocate() without contiguous allocs enabled Message-ID: I have been doing the thread register verification patches for better Loom debugging, and one of the failures it caught is calling `eden_allocate` with garbage thread register. That method actually shortcuts: void BarrierSetAssembler::eden_allocate(MacroAssembler* masm, Register thread, Register obj, Register var_size_in_bytes, int con_size_in_bytes, Register t1, Label& slow_case) { ... if (!Universe::heap()->supports_inline_contig_alloc()) { __ jmp(slow_case); ...and does not use the thread. But it is still confusing. Other ports gate the calls to `eden_allocate` with `allow_shared_alloc`, x86 should do the same. (This thing would be cleaner when/if we remove the support for contiguous inline allocs altogether, see [JDK-8290706](https://bugs.openjdk.org/browse/JDK-8290706)). 
Additional testing: - [x] Linux x86_32 fastdebug `tier1` ------------- Commit messages: - Fix Changes: https://git.openjdk.org/jdk/pull/9567/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=9567&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8290704 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/9567.diff Fetch: git fetch https://git.openjdk.org/jdk pull/9567/head:pull/9567 PR: https://git.openjdk.org/jdk/pull/9567 From ngasson at openjdk.org Wed Jul 20 14:29:03 2022 From: ngasson at openjdk.org (Nick Gasson) Date: Wed, 20 Jul 2022 14:29:03 GMT Subject: RFR: 8288107: Auto-vectorization for integer min/max [v2] In-Reply-To: References: Message-ID: <65Hz21tK1ccrHTk9kcW1bD9cdj0GNW6xvORgySt9084=.03ee754d-dc41-43f1-8474-0870b09e6936@github.com> On Fri, 15 Jul 2022 10:44:52 GMT, Bhavana-Kilambi wrote: >> When Math.min/max is invoked on integer arrays, it generates the CMP-CMOVE instructions instead of vectorizing the loop(if vectorizable and relevant ISA is available) using vector equivalent of min/max instructions. Emitting MaxI/MinI nodes instead of Cmp/CmoveI nodes results in the loop getting vectorized eventually and the architecture specific min/max vector instructions are generated. >> A test to assess the performance of Math.max/min and StrictMath.max/min is added. On aarch64, the smin/smax instructions are generated when the loop is vectorized. On x86-64, vectorization support for min/max operations is available only in SSE4 (where pmaxsd/pminsd are generated) and AVX version >= 1 (where vpmaxsd/vpminsd are generated). This patch generates these instructions only when the loop is vectorized. In cases where the loop is not vectorizable or when the Math.max/min operations are called outside of the loop, cmp-cmove instructions are generated (tested on aarch64, x86-64 machines which have cmp-cmove instructions defined for the scalar MaxI/MinI nodes). 
Performance comparisons for the VectorIntMinMax.java test with and without the patch are given below : >> >>
Before this patch >> >> **aarch64:** >> ``` >> Benchmark (length) (seed) Mode Cnt Score Error Units >> VectorIntMinMax.testMaxInt 2048 0 avgt 25 1593.510 ± 1.488 ns/op >> VectorIntMinMax.testMinInt 2048 0 avgt 25 1593.123 ± 1.365 ns/op >> VectorIntMinMax.testStrictMaxInt 2048 0 avgt 25 1593.112 ± 0.985 ns/op >> VectorIntMinMax.testStrictMinInt 2048 0 avgt 25 1593.290 ± 1.219 ns/op >> >> >> **x86-64:** >> >> Benchmark (length) (seed) Mode Cnt Score Error Units >> VectorIntMinMax.testMaxInt 2048 0 avgt 25 2084.717 ± 4.780 ns/op >> VectorIntMinMax.testMinInt 2048 0 avgt 25 2087.322 ± 4.158 ns/op >> VectorIntMinMax.testStrictMaxInt 2048 0 avgt 25 2084.568 ± 4.838 ns/op >> VectorIntMinMax.testStrictMinInt 2048 0 avgt 25 2086.595 ± 4.025 ns/op >> >>
>> >>
After this patch >> >> **aarch64:** >> >> Benchmark (length) (seed) Mode Cnt Score Error Units >> VectorIntMinMax.testMaxInt 2048 0 avgt 25 323.911 ± 0.206 ns/op >> VectorIntMinMax.testMinInt 2048 0 avgt 25 324.084 ± 0.231 ns/op >> VectorIntMinMax.testStrictMaxInt 2048 0 avgt 25 323.892 ± 0.234 ns/op >> VectorIntMinMax.testStrictMinInt 2048 0 avgt 25 323.990 ± 0.295 ns/op >> >> >> **x86-64:** >> >> Benchmark (length) (seed) Mode Cnt Score Error Units >> VectorIntMinMax.testMaxInt 2048 0 avgt 25 387.639 ± 0.512 ns/op >> VectorIntMinMax.testMinInt 2048 0 avgt 25 387.999 ± 0.740 ns/op >> VectorIntMinMax.testStrictMaxInt 2048 0 avgt 25 387.605 ± 0.376 ns/op >> VectorIntMinMax.testStrictMinInt 2048 0 avgt 25 387.765 ± 0.498 ns/op >> >> >>
>> >> With auto-vectorization, both the machines exhibit a significant performance gain. On both the machines the runtime is ~80% better than the case without this patch. Also ran the patch with -XX:-UseSuperWord to make sure the performance does not degrade in cases where vectorization does not happen. >> >>
Performance numbers >> >> **aarch64:** >> >> Benchmark (length) (seed) Mode Cnt Score Error Units >> VectorIntMinMax.testMaxInt 2048 0 avgt 25 1449.792 ± 1.072 ns/op >> VectorIntMinMax.testMinInt 2048 0 avgt 25 1450.636 ± 1.057 ns/op >> VectorIntMinMax.testStrictMaxInt 2048 0 avgt 25 1450.214 ± 1.093 ns/op >> VectorIntMinMax.testStrictMinInt 2048 0 avgt 25 1450.615 ± 1.098 ns/op >> >> >> **x86-64:** >> >> Benchmark (length) (seed) Mode Cnt Score Error Units >> VectorIntMinMax.testMaxInt 2048 0 avgt 25 2059.673 ± 4.726 ns/op >> VectorIntMinMax.testMinInt 2048 0 avgt 25 2059.853 ± 4.754 ns/op >> VectorIntMinMax.testStrictMaxInt 2048 0 avgt 25 2059.920 ± 4.658 ns/op >> VectorIntMinMax.testStrictMinInt 2048 0 avgt 25 2059.622 ± 4.768 ns/op >> >>
>> >> There is no degradation when vectorization is disabled. > > Bhavana-Kilambi has updated the pull request incrementally with one additional commit since the last revision: > > 8288107: Auto-vectorization for integer min/max > > When Math.min/max is invoked on integer arrays, it generates the CMP-CMOVE instructions instead of vectorizing the loop(if vectorizable and relevant ISA is available) using vector equivalent of min/max instructions. Emitting MaxI/MinI nodes instead of Cmp/CmoveI nodes results in the loop getting vectorized eventually and the architecture specific min/max vector instructions are generated. > A test to assess the performance of Math.max/min and StrictMath.max/min is added. On aarch64, the smin/smax instructions are generated when the loop is vectorized. On x86-64, vectorization support for min/max operations is available only in SSE4 (where pmaxsd/pminsd are generated) and AVX version >= 1 (where vpmaxsd/vpminsd are generated). This patch generates these instructions only when the loop is vectorized. In cases where the loop is not vectorizable or when the Math.max/min operations are called outside of the loop, cmp-cmove instructions are generated (tested on aarch64, x86-64 machines which have cmp-cmove instructions defined for the scalar MaxI/MinI nodes). Performance comparisons for the VectorIntMinMax.java test with and without the patch are given below : > > Before this patch: > aarch64: > Benchmark (length) (seed) Mode Cnt Score Error Units > VectorIntMinMax.testMaxInt 2048 0 avgt 25 1593.510 ± 1.488 ns/op > VectorIntMinMax.testMinInt 2048 0 avgt 25 1593.123 ± 1.365 ns/op > VectorIntMinMax.testStrictMaxInt 2048 0 avgt 25 1593.112 ± 0.985 ns/op > VectorIntMinMax.testStrictMinInt 2048 0 avgt 25 1593.290 ± 1.219 ns/op > > x86-64: > Benchmark (length) (seed) Mode Cnt Score Error Units > VectorIntMinMax.testMaxInt 2048 0 avgt 25 2084.717 ± 4.780 ns/op > VectorIntMinMax.testMinInt 2048 0 avgt 25 2087.322 ±
4.158 ns/op > VectorIntMinMax.testStrictMaxInt 2048 0 avgt 25 2084.568 ± 4.838 ns/op > VectorIntMinMax.testStrictMinInt 2048 0 avgt 25 2086.595 ± 4.025 ns/op > > After this patch: > aarch64: > Benchmark (length) (seed) Mode Cnt Score Error Units > VectorIntMinMax.testMaxInt 2048 0 avgt 25 323.911 ± 0.206 ns/op > VectorIntMinMax.testMinInt 2048 0 avgt 25 324.084 ± 0.231 ns/op > VectorIntMinMax.testStrictMaxInt 2048 0 avgt 25 323.892 ± 0.234 ns/op > VectorIntMinMax.testStrictMinInt 2048 0 avgt 25 323.990 ± 0.295 ns/op > > x86-64: > Benchmark (length) (seed) Mode Cnt Score Error Units > VectorIntMinMax.testMaxInt 2048 0 avgt 25 387.639 ± 0.512 ns/op > VectorIntMinMax.testMinInt 2048 0 avgt 25 387.999 ± 0.740 ns/op > VectorIntMinMax.testStrictMaxInt 2048 0 avgt 25 387.605 ± 0.376 ns/op > VectorIntMinMax.testStrictMinInt 2048 0 avgt 25 387.765 ± 0.498 ns/op > > With auto-vectorization, both the machines exhibit a significant performance gain. On both the machines the runtime is ~80% better than the case without this patch. Also ran the patch with -XX:-UseSuperWord to make sure the performance does not degrade in cases where vectorization does not happen. The performance numbers are shown below : > aarch64: > Benchmark (length) (seed) Mode Cnt Score Error Units > VectorIntMinMax.testMaxInt 2048 0 avgt 25 1449.792 ± 1.072 ns/op > VectorIntMinMax.testMinInt 2048 0 avgt 25 1450.636 ± 1.057 ns/op > VectorIntMinMax.testStrictMaxInt 2048 0 avgt 25 1450.214 ± 1.093 ns/op > VectorIntMinMax.testStrictMinInt 2048 0 avgt 25 1450.615 ± 1.098 ns/op > > x86-64: > Benchmark (length) (seed) Mode Cnt Score Error Units > VectorIntMinMax.testMaxInt 2048 0 avgt 25 2059.673 ± 4.726 ns/op > VectorIntMinMax.testMinInt 2048 0 avgt 25 2059.853 ± 4.754 ns/op > VectorIntMinMax.testStrictMaxInt 2048 0 avgt 25 2059.920 ± 4.658 ns/op > VectorIntMinMax.testStrictMinInt 2048 0 avgt 25 2059.622 ± 4.768 ns/op > There is no degradation when vectorization is disabled. Looks good to me.
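For readers following the thread, the loop shape that the quoted description refers to can be sketched roughly as follows. This is only an illustration and not the `VectorIntMinMax.java` benchmark from the patch: the class and method names (`IntMinMaxLoop`, `maxInt`, `minInt`) are made up here, and whether C2 actually vectorizes such a loop is an assumption that depends on the platform, the available ISA, and flags such as `-XX:-UseSuperWord`, as the description notes.

```java
import java.util.Arrays;

// Hypothetical illustration of the element-wise min/max loops discussed above.
// The names are assumptions for this sketch; they do not come from the patch.
final class IntMinMaxLoop {

    // c[i] = max(a[i], b[i]); a simple counted loop over int arrays,
    // the kind of shape the description says may be auto-vectorized.
    static void maxInt(int[] a, int[] b, int[] c) {
        for (int i = 0; i < c.length; i++) {
            c[i] = Math.max(a[i], b[i]);
        }
    }

    // c[i] = min(a[i], b[i]); same loop shape with Math.min.
    static void minInt(int[] a, int[] b, int[] c) {
        for (int i = 0; i < c.length; i++) {
            c[i] = Math.min(a[i], b[i]);
        }
    }

    public static void main(String[] args) {
        int[] a = {1, 5, -3};
        int[] b = {4, 2, -7};
        int[] c = new int[a.length];

        maxInt(a, b, c);
        System.out.println(Arrays.toString(c)); // prints [4, 5, -3]

        minInt(a, b, c);
        System.out.println(Arrays.toString(c)); // prints [1, 2, -7]
    }
}
```

The sketch only fixes the source-level shape; it makes no claim about which instructions a given JVM build emits for it.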
------------- Marked as reviewed by ngasson (Reviewer). PR: https://git.openjdk.org/jdk/pull/9466 From tholenstein at openjdk.org Wed Jul 20 14:36:14 2022 From: tholenstein at openjdk.org (Tobias Holenstein) Date: Wed, 20 Jul 2022 14:36:14 GMT Subject: RFR: JDK-8290069: IGV: Highlight both graphs of difference in outline In-Reply-To: <7EkUOt61nSPRSKyL8A8kFOHlVE6UEFRxwWbcFuuBDAo=.2f1acdfe-d9d8-4a82-8a07-5f075f3b939e@github.com> References: <7EkUOt61nSPRSKyL8A8kFOHlVE6UEFRxwWbcFuuBDAo=.2f1acdfe-d9d8-4a82-8a07-5f075f3b939e@github.com> Message-ID: On Tue, 12 Jul 2022 16:25:09 GMT, Vladimir Kozlov wrote: >> Previously, IGV highlighted only one graph in the outline when a difference graph is selected using the sliders. >> Now, IGV highlights both graphs used to calculate the difference graph when they are in the same group. >> >> highlight both graphs >> >> IGV colors the nodes in a difference graph with yellow/red/green to highlight the changes. This only worked if the difference graph is calculated using the sliders. Now, difference graphs is also coloured when calculated via the context menu "Difference to current graph" in the outline. >> >> Show colors > > Marked as reviewed by kvn (Reviewer). Thanks @vnkozlov and @TobiHartmann for the reviews! ------------- PR: https://git.openjdk.org/jdk/pull/9468 From tholenstein at openjdk.org Wed Jul 20 14:36:16 2022 From: tholenstein at openjdk.org (Tobias Holenstein) Date: Wed, 20 Jul 2022 14:36:16 GMT Subject: Integrated: JDK-8290069: IGV: Highlight both graphs of difference in outline In-Reply-To: References: Message-ID: On Tue, 12 Jul 2022 13:10:37 GMT, Tobias Holenstein wrote: > Previously, IGV highlighted only one graph in the outline when a difference graph is selected using the sliders. > Now, IGV highlights both graphs used to calculate the difference graph when they are in the same group. > > highlight both graphs > > IGV colors the nodes in a difference graph with yellow/red/green to highlight the changes. 
This only worked if the difference graph is calculated using the sliders. Now, difference graphs is also coloured when calculated via the context menu "Difference to current graph" in the outline. > > Show colors This pull request has now been integrated. Changeset: 3d3e3df8 Author: Tobias Holenstein URL: https://git.openjdk.org/jdk/commit/3d3e3df8f0845d1ce1776ef37b4a2b39461a328a Stats: 36 lines in 5 files changed: 27 ins; 0 del; 9 mod 8290069: IGV: Highlight both graphs of difference in outline Reviewed-by: kvn, thartmann ------------- PR: https://git.openjdk.org/jdk/pull/9468 From tholenstein at openjdk.org Wed Jul 20 15:10:47 2022 From: tholenstein at openjdk.org (Tobias Holenstein) Date: Wed, 20 Jul 2022 15:10:47 GMT Subject: RFR: JDK-8290016: IGV: Fix graph panning when mouse dragged outside of window [v2] In-Reply-To: <5cJlObLZkQGGa5rj2V-w-bamKwe9od2G7EunMdVY8eM=.47289b58-9475-4cbe-b9a3-5abd47488817@github.com> References: <5cJlObLZkQGGa5rj2V-w-bamKwe9od2G7EunMdVY8eM=.47289b58-9475-4cbe-b9a3-5abd47488817@github.com> Message-ID: > A graph in IGV can be moved by dragging it with the left mouse button (called panning). > ![panning](https://user-images.githubusercontent.com/71546117/178509416-24dd900f-131b-484b-af47-c7a78e791434.png) > > If the mouse left the visible window of the graph during dragging, the diagram started to move in the opposite direction. This was annoying. Now panning stops as soon as the mouse leaves the window. > ![stop reverse panning](https://user-images.githubusercontent.com/71546117/178509309-3df03b7a-ada4-45a3-b9a7-d6e10664033d.png) > > In selection mode, the graph still moves when the mouse is dragged outside the window, as this is meant to make a larger selection. 
> ![keep panning for selection](https://user-images.githubusercontent.com/71546117/178509302-74fa41d2-e611-40a3-b6b0-c937ef4b2462.png) Tobias Holenstein has updated the pull request incrementally with one additional commit since the last revision: fixed locking in CustomizablePanAction ------------- Changes: - all: https://git.openjdk.org/jdk/pull/9470/files - new: https://git.openjdk.org/jdk/pull/9470/files/942a34d5..e6647572 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=9470&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=9470&range=00-01 Stats: 53 lines in 1 file changed: 13 ins; 17 del; 23 mod Patch: https://git.openjdk.org/jdk/pull/9470.diff Fetch: git fetch https://git.openjdk.org/jdk pull/9470/head:pull/9470 PR: https://git.openjdk.org/jdk/pull/9470 From tholenstein at openjdk.org Wed Jul 20 15:33:05 2022 From: tholenstein at openjdk.org (Tobias Holenstein) Date: Wed, 20 Jul 2022 15:33:05 GMT Subject: RFR: JDK-8290016: IGV: Fix graph panning when mouse dragged outside of window [v2] In-Reply-To: References: <5cJlObLZkQGGa5rj2V-w-bamKwe9od2G7EunMdVY8eM=.47289b58-9475-4cbe-b9a3-5abd47488817@github.com> Message-ID: On Mon, 18 Jul 2022 11:09:59 GMT, Tobias Hartmann wrote: >> Tobias Holenstein has updated the pull request incrementally with one additional commit since the last revision: >> >> fixed locking in CustomizablePanAction > > src/utils/IdealGraphVisualizer/View/src/main/java/com/sun/hotspot/igv/view/actions/CustomizablePanAction.java line 149: > >> 147: scrollPane = null; >> 148: } >> 149: return state ? State.REJECTED : State.REJECTED; > > That does not look correct. Thanks for catching that! You are right, this is not correct. I updated the PR to make the locking more clear: `CustomizablePanAction` extends `WidgetAction.LockedAdapter` which is used for long-term actions like panning/moving. All methods return the `locked` state if `isLocked()` is true and `rejected` otherwise. 
The `locked` state means the event is processed and the processing has to stop immediately (no other action should process it). The `rejected` state means the event is not processed by the action and has to be processed by other actions too. In our case `CustomizablePanAction` should be `locked` when the user presses the mouse inside of the panning area (the graph). And `CustomizablePanAction` should be `unlocked` when the mouse button is released again. Leaving the panning area (`mouseExited`) and entering it again (`mouseEntered`) should not change the locking state as long as the user does not release the mouse. ------------- PR: https://git.openjdk.org/jdk/pull/9470 From duke at openjdk.org Wed Jul 20 15:41:08 2022 From: duke at openjdk.org (Bhavana-Kilambi) Date: Wed, 20 Jul 2022 15:41:08 GMT Subject: Integrated: 8288107: Auto-vectorization for integer min/max In-Reply-To: References: Message-ID: On Tue, 12 Jul 2022 11:45:28 GMT, Bhavana-Kilambi wrote: > When Math.min/max is invoked on integer arrays, it generates the CMP-CMOVE instructions instead of vectorizing the loop(if vectorizable and relevant ISA is available) using vector equivalent of min/max instructions. Emitting MaxI/MinI nodes instead of Cmp/CmoveI nodes results in the loop getting vectorized eventually and the architecture specific min/max vector instructions are generated. > A test to assess the performance of Math.max/min and StrictMath.max/min is added. On aarch64, the smin/smax instructions are generated when the loop is vectorized. On x86-64, vectorization support for min/max operations is available only in SSE4 (where pmaxsd/pminsd are generated) and AVX version >= 1 (where vpmaxsd/vpminsd are generated). This patch generates these instructions only when the loop is vectorized.
In cases where the loop is not vectorizable or when the Math.max/min operations are called outside of the loop, cmp-cmove instructions are generated (tested on aarch64, x86-64 machines which have cmp-cmove instructions defined for the scalar MaxI/MinI nodes). Performance comparisons for the VectorIntMinMax.java test with and without the patch are given below : > >
Before this patch
>
> **aarch64:**
>
> Benchmark                         (length)  (seed)  Mode  Cnt     Score   Error  Units
> VectorIntMinMax.testMaxInt            2048       0  avgt   25  1593.510 ± 1.488  ns/op
> VectorIntMinMax.testMinInt            2048       0  avgt   25  1593.123 ± 1.365  ns/op
> VectorIntMinMax.testStrictMaxInt      2048       0  avgt   25  1593.112 ± 0.985  ns/op
> VectorIntMinMax.testStrictMinInt      2048       0  avgt   25  1593.290 ± 1.219  ns/op
>
> **x86-64:**
>
> Benchmark                         (length)  (seed)  Mode  Cnt     Score   Error  Units
> VectorIntMinMax.testMaxInt            2048       0  avgt   25  2084.717 ± 4.780  ns/op
> VectorIntMinMax.testMinInt            2048       0  avgt   25  2087.322 ± 4.158  ns/op
> VectorIntMinMax.testStrictMaxInt      2048       0  avgt   25  2084.568 ± 4.838  ns/op
> VectorIntMinMax.testStrictMinInt      2048       0  avgt   25  2086.595 ± 4.025  ns/op
>
> >
After this patch
>
> **aarch64:**
>
> Benchmark                         (length)  (seed)  Mode  Cnt     Score   Error  Units
> VectorIntMinMax.testMaxInt            2048       0  avgt   25   323.911 ± 0.206  ns/op
> VectorIntMinMax.testMinInt            2048       0  avgt   25   324.084 ± 0.231  ns/op
> VectorIntMinMax.testStrictMaxInt      2048       0  avgt   25   323.892 ± 0.234  ns/op
> VectorIntMinMax.testStrictMinInt      2048       0  avgt   25   323.990 ± 0.295  ns/op
>
> **x86-64:**
>
> Benchmark                         (length)  (seed)  Mode  Cnt     Score   Error  Units
> VectorIntMinMax.testMaxInt            2048       0  avgt   25   387.639 ± 0.512  ns/op
> VectorIntMinMax.testMinInt            2048       0  avgt   25   387.999 ± 0.740  ns/op
> VectorIntMinMax.testStrictMaxInt      2048       0  avgt   25   387.605 ± 0.376  ns/op
> VectorIntMinMax.testStrictMinInt      2048       0  avgt   25   387.765 ± 0.498  ns/op
>
> > With auto-vectorization, both the machines exhibit a significant performance gain. On both the machines the runtime is ~80% better than the case without this patch. Also ran the patch with -XX:-UseSuperWord to make sure the performance does not degrade in cases where vectorization does not happen. > >
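For readers following the thread, the loop shape this patch description is about looks roughly like the sketch below. The class and method names are made up for illustration and are not taken from the patch or from the VectorIntMinMax benchmark; whether such a loop is actually vectorized depends on the JVM build, the flags in use (for example `-XX:-UseSuperWord` disables it, as noted above) and the CPU.

```java
// Illustrative sketch only: names are hypothetical, not the benchmark's source.
final class MinMaxLoopSketch {
    // Element-wise maximum of two int arrays (b must be at least as long as a).
    // A simple counted loop over Math.max like this is the kind of candidate
    // the patch description refers to.
    static int[] elementwiseMax(int[] a, int[] b) {
        int[] c = new int[a.length];
        for (int i = 0; i < a.length; i++) {
            c[i] = Math.max(a[i], b[i]);
        }
        return c;
    }

    // Same loop shape with Math.min.
    static int[] elementwiseMin(int[] a, int[] b) {
        int[] c = new int[a.length];
        for (int i = 0; i < a.length; i++) {
            c[i] = Math.min(a[i], b[i]);
        }
        return c;
    }

    public static void main(String[] args) {
        int[] a = {3, -7, 10, 0};
        int[] b = {5, -9, 10, -1};
        // prints [5, -7, 10, 0]
        System.out.println(java.util.Arrays.toString(elementwiseMax(a, b)));
        // prints [3, -9, 10, -1]
        System.out.println(java.util.Arrays.toString(elementwiseMin(a, b)));
    }
}
```

The sketch only shows the source-level pattern; it says nothing about which instructions a given JVM emits for it.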
Performance numbers
>
> **aarch64:**
>
> Benchmark                         (length)  (seed)  Mode  Cnt     Score   Error  Units
> VectorIntMinMax.testMaxInt            2048       0  avgt   25  1449.792 ± 1.072  ns/op
> VectorIntMinMax.testMinInt            2048       0  avgt   25  1450.636 ± 1.057  ns/op
> VectorIntMinMax.testStrictMaxInt      2048       0  avgt   25  1450.214 ± 1.093  ns/op
> VectorIntMinMax.testStrictMinInt      2048       0  avgt   25  1450.615 ± 1.098  ns/op
>
> **x86-64:**
>
> Benchmark                         (length)  (seed)  Mode  Cnt     Score   Error  Units
> VectorIntMinMax.testMaxInt            2048       0  avgt   25  2059.673 ± 4.726  ns/op
> VectorIntMinMax.testMinInt            2048       0  avgt   25  2059.853 ± 4.754  ns/op
> VectorIntMinMax.testStrictMaxInt      2048       0  avgt   25  2059.920 ± 4.658  ns/op
> VectorIntMinMax.testStrictMinInt      2048       0  avgt   25  2059.622 ± 4.768  ns/op
>
> > There is no degradation when vectorization is disabled. This pull request has now been integrated. Changeset: 89458e36 Author: Bhavana Kilambi Committer: Nick Gasson URL: https://git.openjdk.org/jdk/commit/89458e36afa8f09020d2afba1cbafdd8e32a6083 Stats: 387 lines in 4 files changed: 211 ins; 171 del; 5 mod 8288107: Auto-vectorization for integer min/max Reviewed-by: kvn, ngasson ------------- PR: https://git.openjdk.org/jdk/pull/9466 From kvn at openjdk.org Wed Jul 20 19:37:04 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Wed, 20 Jul 2022 19:37:04 GMT Subject: RFR: 8290529: C2: assert(BoolTest(btest).is_canonical()) failure In-Reply-To: References: Message-ID: <4gLGY-GPwaDaWIpV3sVHbdg7H2nIPpwG19toBs57WKM=.94c08a85-76de-4651-80ae-6aece48bb5f2@github.com> On Tue, 19 Jul 2022 12:28:10 GMT, Roland Westrelin wrote: > For the test case: > > 1) In Parse::do_if(), tst0 is: > > (Bool#lt (CmpU 0 Parm0)) > > 2) transformed by gvn in tst: > > (Bool#gt (CmpU Parm0 0)) > > 3) That test is not canonical and is negated and retransformed which > results in: > > (Bool#eq (CmpI Parm0 0)) > > The assert fires because that test is not canonical either. > > The root cause I think is that the (CmpU .. 0) -> (CmpI .. 0) only > triggers if the condition of the CmpU is canonical (and results in a > non canonical test). Tweaking it so it applies even if the condition > is not leads to the following change in the steps above: > > 2) (Bool#ne (CmpI Parm0 0)) > > which is a canonical test. Changes are good. If possible add IR framework test. src/hotspot/share/opto/subnode.cpp line 1621: > 1619: } > 1620: > 1621: // Change x u< 1 or x u<= 0 to x == 0 Update comment. 
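The unsigned-compare identities discussed in this review (the `x u< 1 or x u<= 0 to x == 0` comment, and the `(Bool#lt (CmpU 0 Parm0))` to `(Bool#ne (CmpI Parm0 0))` step) can be checked at the Java level with a small sketch. It uses `Integer.compareUnsigned` to stand in for the unsigned comparison; the class and method names are hypothetical, and the code is not C2's implementation.

```java
// Illustrative sketch only: checks the arithmetic identities, not C2 itself.
final class UnsignedZeroTestSketch {
    // x u< 1
    static boolean unsignedLessThanOne(int x) {
        return Integer.compareUnsigned(x, 1) < 0;
    }

    // x u<= 0
    static boolean unsignedAtMostZero(int x) {
        return Integer.compareUnsigned(x, 0) <= 0;
    }

    // 0 u< x
    static boolean zeroUnsignedLessThan(int x) {
        return Integer.compareUnsigned(0, x) < 0;
    }

    // True when all three unsigned tests agree with their signed rewrite for x:
    // the first two are equivalent to x == 0, the third to x != 0.
    static boolean identitiesHold(int x) {
        return unsignedLessThanOne(x) == (x == 0)
            && unsignedAtMostZero(x) == (x == 0)
            && zeroUnsignedLessThan(x) == (x != 0);
    }

    public static void main(String[] args) {
        int[] samples = {0, 1, -1, 2, Integer.MIN_VALUE, Integer.MAX_VALUE};
        for (int x : samples) {
            // prints "<x> -> true" for every sample
            System.out.println(x + " -> " + identitiesHold(x));
        }
    }
}
```

Zero is the only value that is unsigned-smaller than 1, which is why each unsigned test against 0 or 1 collapses to an equality or inequality test against 0.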
------------- PR: https://git.openjdk.org/jdk/pull/9553 From kvn at openjdk.org Wed Jul 20 20:13:00 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Wed, 20 Jul 2022 20:13:00 GMT Subject: RFR: 8290704: x86: TemplateTable::_new should not call eden_allocate() without contiguous allocs enabled In-Reply-To: References: Message-ID: On Wed, 20 Jul 2022 09:46:08 GMT, Aleksey Shipilev wrote: > I have been doing the thread register verification patches for better Loom debugging, and one of the failures it caught is calling `eden_allocate` with garbage thread register. That method actually shortcuts: > > > void BarrierSetAssembler::eden_allocate(MacroAssembler* masm, > Register thread, Register obj, > Register var_size_in_bytes, > int con_size_in_bytes, > Register t1, > Label& slow_case) { > ... > if (!Universe::heap()->supports_inline_contig_alloc()) { > __ jmp(slow_case); > > > ...and does not use the thread. But it is still confusing. Other ports gate the calls to `eden_allocate` with `allow_shared_alloc`, x86 should do the same. > > (This thing would be cleaner when/if we remove the support for contiguous inline allocs altogether, see [JDK-8290706](https://bugs.openjdk.org/browse/JDK-8290706)). > > Additional testing: > - [x] Linux x86_32 fastdebug `tier1` Good. ------------- Marked as reviewed by kvn (Reviewer). PR: https://git.openjdk.org/jdk/pull/9567 From kvn at openjdk.org Wed Jul 20 20:26:57 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Wed, 20 Jul 2022 20:26:57 GMT Subject: RFR: 8290034: Auto vectorize reverse bit operations. In-Reply-To: References: Message-ID: On Mon, 18 Jul 2022 08:01:09 GMT, Jatin Bhateja wrote: > Summary of changes: > - Intrinsify scalar bit reverse APIs to emit efficient instruction sequence for X86 targets with and w/o GFNI feature. > - Handle auto-vectorization of Integer/Long.reverse bit operations. > - Backend implementation for these were added with 4th incubation of VectorAPIs. 
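As a rough illustration of the second bullet above, a loop over `Integer.reverse` or `Long.reverse` has the shape sketched below. The class and method names are hypothetical and are not the JMH benchmarks mentioned in this thread; the sketch only shows the source pattern and the values the API returns, not what any particular JIT emits.

```java
// Illustrative sketch only: names are hypothetical, not the benchmark source.
final class ReverseBitsLoopSketch {
    // Reverse the bit order of every int in src into dst (same length assumed).
    static void reverseInts(int[] src, int[] dst) {
        for (int i = 0; i < src.length; i++) {
            dst[i] = Integer.reverse(src[i]);
        }
    }

    // Same loop shape for long elements.
    static void reverseLongs(long[] src, long[] dst) {
        for (int i = 0; i < src.length; i++) {
            dst[i] = Long.reverse(src[i]);
        }
    }

    public static void main(String[] args) {
        int[] in = {1, 0x0000FFFF, -1};
        int[] out = new int[in.length];
        reverseInts(in, out);
        for (int v : out) {
            // prints 80000000, ffff0000, ffffffff (one per line)
            System.out.println(Integer.toHexString(v));
        }
    }
}
```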
>
> Following are performance numbers for the newly added JMH micro benchmarks:
>
>
> No-GFNI(CLX):
> =============
> Baseline:
> Benchmark         (size)  Mode  Cnt  Score  Error  Units
> Integers.reverse     500  avgt    2  1.085         us/op
> Longs.reverse        500  avgt    2  1.236         us/op
> WithOpt:
> Benchmark         (size)  Mode  Cnt  Score  Error  Units
> Integers.reverse     500  avgt    2  0.104         us/op
> Longs.reverse        500  avgt    2  0.255         us/op
>
> With-GFNI(ICX):
> ===============
> Baseline:
> Benchmark         (size)  Mode  Cnt  Score  Error  Units
> Integers.reverse     500  avgt    2  0.887         us/op
> Longs.reverse        500  avgt    2  1.095         us/op
>
> Without:
> Benchmark         (size)  Mode  Cnt  Score  Error  Units
> Integers.reverse     500  avgt    2  0.037         us/op
> Longs.reverse        500  avgt    2  0.145         us/op
>
>
> Kindly review and share feedback.
>
> Best Regards,
> Jatin

Looks good. I submitted testing.

------------- PR: https://git.openjdk.org/jdk/pull/9535 From jrose at openjdk.org Wed Jul 20 22:40:02 2022 From: jrose at openjdk.org (John R Rose) Date: Wed, 20 Jul 2022 22:40:02 GMT Subject: RFR: 8289996: Fix array range check hoisting for some scaled loop iv [v2] In-Reply-To: References: Message-ID:

On Tue, 19 Jul 2022 02:06:17 GMT, Pengfei Li wrote:

>> Recently we found some array range checks in loops are not hoisted by
>> C2's loop predication phase as expected. Below is a typical case.
>>
>>     for (int i = 0; i < size; i++) {
>>         b[3 * i] = a[3 * i];
>>     }
>>
>> Ideally, C2 can hoist the range check of an array access in loop if the
>> array index is a linear function of the loop's induction variable (iv).
>> Say, range check in `arr[exp]` can be hoisted if
>>
>>     exp = k1 * iv + k2 + inv
>>
>> where `k1` and `k2` are compile-time constants, and `inv` is an optional
>> loop invariant. But in above case, C2 igvn does some strength reduction
>> on the `MulINode` used to compute `3 * i`. It results in the linear index
>> expression not being recognized. So far we found 2 ideal transformations
>> that may affect linear expression recognition.
They are >> >> - `k * iv` --> `iv << m + iv << n` if k is the sum of 2 pow-of-2 values >> - `k * iv` --> `iv << m - iv` if k+1 is a pow-of-2 value >> >> To avoid range check hoisting and further optimizations being broken, we >> have tried improving the linear recognition. But after some experiments, >> we found complex and recursive pattern match does not always work well. >> In this patch we propose to defer these 2 ideal transformations to the >> phase of post loop igvn. In other words, these 2 strength reductions can >> only be done after all loop optimizations are over. >> >> Tested hotspot::hotspot_all_no_apps, jdk::tier1~3 and langtools::tier1. >> We also tested the performance via JMH and see obvious improvement. >> >> Benchmark Improvement >> RangeCheckHoisting.ivScaled3 +21.2% >> RangeCheckHoisting.ivScaled7 +6.6% > > Pengfei Li has updated the pull request incrementally with one additional commit since the last revision: > > Address comment from merykitty src/hotspot/share/opto/mulnode.cpp line 369: > 367: Node *res = NULL; > 368: julong bit1 = abs_con & (0-abs_con); // Extract low bit > 369: if (bit1 == abs_con) { // Found a power of 2? May I suggest, since you are working on this code, maybe factoring out the idiom `x & -x`? (Sadly, I think there may be multiple opinions about exactly where to name this idiom, which might prevent us from starting the fix. If that's the case I suggest a follow-up bug.) diff --git a/src/hotspot/share/utilities/powerOfTwo.hpp b/src/hotspot/share/utilities/powerOfTwo.hpp index a98b81e8037..83713d373ee 100644 --- a/src/hotspot/share/utilities/powerOfTwo.hpp +++ b/src/hotspot/share/utilities/powerOfTwo.hpp @@ -119,4 +119,14 @@ inline T next_power_of_2(T value) { return round_up_power_of_2(value + 1); } +// Return the largest power of two that is a submultiple of the given value. +// This is the same as the numeric value of the least-significant set bit. +// For unsigned values, it replaces the old trick of (value & -value). 
+// precondition: value > 0.
+template<typename T, ENABLE_IF(std::is_integral<T>::value)>
+inline T submultiple_power_of_2(T value) {
+  assert(value > 0, "Invalid value");
+  return value & -value;
+}
+
 #endif // SHARE_UTILITIES_POWEROFTWO_HPP

src/hotspot/share/opto/mulnode.cpp line 376:

> 374: bit2 = bit2 & (0-bit2); // Extract 2nd bit
> 375: if (bit2 + bit1 == abs_con) { // Found all bits in con?
> 376: if (!phase->C->post_loop_opts_phase()) {

Thanks for adding the MulL case. Nowadays the range check optimizations don't apply only to arrays, and apply to both int and long indexes, mainly because of Panama.

------------- PR: https://git.openjdk.org/jdk/pull/9508 From kvn at openjdk.org Thu Jul 21 00:51:44 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Thu, 21 Jul 2022 00:51:44 GMT Subject: Integrated: 8290746: ProblemList compiler/vectorization/TestAutoVecIntMinMax.java Message-ID:

In order to reduce the noise in the JDK20 CI, I'm ProblemListing test on x64: compiler/vectorization/TestAutoVecIntMinMax.java until JDK-8290730 is fixed.
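Returning to the 8289996 review earlier in this digest: the `x & -x` idiom and the two strength reductions named there (`k * iv` as a sum of two shifts, or as a shift minus `iv`) can be sketched in Java as follows. This loosely mirrors the quoted `bit1`/`bit2` logic but is not HotSpot code; the class and method names are hypothetical, and it assumes a positive constant `k`.

```java
// Illustrative sketch only: loosely follows the quoted mulnode.cpp logic.
final class ScaledIvSketch {
    // Isolates the least-significant set bit. For value > 0 this is the
    // largest power of two that divides value (same as Long.lowestOneBit).
    static long lowestSetBit(long value) {
        return value & -value;
    }

    // Computes k * iv using only shifts, adds and subtracts when k > 0 is
    //   - a power of two:                  iv << m
    //   - a sum of two powers of two:      (iv << m) + (iv << n)
    //   - one less than a power of two:    (iv << m) - iv
    // and throws for any other k.
    static long multiplyByShifts(long k, long iv) {
        if (k <= 0) {
            throw new IllegalArgumentException("k must be positive: " + k);
        }
        long bit1 = lowestSetBit(k);              // extract low bit
        if (bit1 == k) {                          // k is a power of two
            return iv << Long.numberOfTrailingZeros(k);
        }
        long bit2 = lowestSetBit(k - bit1);       // extract second bit
        if (bit1 + bit2 == k) {                   // k has exactly two bits set
            return (iv << Long.numberOfTrailingZeros(bit2))
                 + (iv << Long.numberOfTrailingZeros(bit1));
        }
        long kPlus1 = k + 1;
        if (lowestSetBit(kPlus1) == kPlus1) {     // k + 1 is a power of two
            return (iv << Long.numberOfTrailingZeros(kPlus1)) - iv;
        }
        throw new IllegalArgumentException("no shift form for k = " + k);
    }

    public static void main(String[] args) {
        System.out.println(lowestSetBit(12));          // prints 4
        System.out.println(multiplyByShifts(3, 5));    // prints 15
        System.out.println(multiplyByShifts(7, 5));    // prints 35
    }
}
```

For `k = 3` the sketch takes the two-bit path, `(iv << 1) + iv`, and for `k = 7` the `(iv << 3) - iv` path, which are the two rewritten forms the patch description says can hide the linear index expression from range check hoisting.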
------------- Commit messages: - 8290746: ProblemList compiler/vectorization/TestAutoVecIntMinMax.java Changes: https://git.openjdk.org/jdk/pull/9580/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=9580&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8290746 Stats: 1 line in 1 file changed: 1 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/9580.diff Fetch: git fetch https://git.openjdk.org/jdk pull/9580/head:pull/9580 PR: https://git.openjdk.org/jdk/pull/9580 From dholmes at openjdk.org Thu Jul 21 00:51:44 2022 From: dholmes at openjdk.org (David Holmes) Date: Thu, 21 Jul 2022 00:51:44 GMT Subject: Integrated: 8290746: ProblemList compiler/vectorization/TestAutoVecIntMinMax.java In-Reply-To: References: Message-ID: <-9N6RY-QD2VCAxXhyicqQ9r_6R6yuovinNen10E2Roc=.1c8b5be1-5a7e-49e4-8ddf-f0335a8786c7@github.com> On Thu, 21 Jul 2022 00:29:57 GMT, Vladimir Kozlov wrote: > In order to reduce the noise in the JDK20 CI, I'm ProblemListing test on x64: > > compiler/vectorization/TestAutoVecIntMinMax.java > > until JDK-8290730 is fixed. LGTM! Thanks ------------- Marked as reviewed by dholmes (Reviewer). PR: https://git.openjdk.org/jdk/pull/9580 From kvn at openjdk.org Thu Jul 21 00:51:44 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Thu, 21 Jul 2022 00:51:44 GMT Subject: Integrated: 8290746: ProblemList compiler/vectorization/TestAutoVecIntMinMax.java In-Reply-To: References: Message-ID: On Thu, 21 Jul 2022 00:29:57 GMT, Vladimir Kozlov wrote: > In order to reduce the noise in the JDK20 CI, I'm ProblemListing test on x64: > > compiler/vectorization/TestAutoVecIntMinMax.java > > until JDK-8290730 is fixed. Thank you, David. 
------------- PR: https://git.openjdk.org/jdk/pull/9580 From kvn at openjdk.org Thu Jul 21 00:51:44 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Thu, 21 Jul 2022 00:51:44 GMT Subject: Integrated: 8290746: ProblemList compiler/vectorization/TestAutoVecIntMinMax.java In-Reply-To: References: Message-ID: On Thu, 21 Jul 2022 00:29:57 GMT, Vladimir Kozlov wrote: > In order to reduce the noise in the JDK20 CI, I'm ProblemListing test on x64: > > compiler/vectorization/TestAutoVecIntMinMax.java > > until JDK-8290730 is fixed. This pull request has now been integrated. Changeset: e8975be9 Author: Vladimir Kozlov URL: https://git.openjdk.org/jdk/commit/e8975be94bfef8fa787eb60ad1eac4cb1d4b9076 Stats: 1 line in 1 file changed: 1 ins; 0 del; 0 mod 8290746: ProblemList compiler/vectorization/TestAutoVecIntMinMax.java Reviewed-by: dholmes ------------- PR: https://git.openjdk.org/jdk/pull/9580 From eliu at openjdk.org Thu Jul 21 03:23:09 2022 From: eliu at openjdk.org (Eric Liu) Date: Thu, 21 Jul 2022 03:23:09 GMT Subject: RFR: 8290169: adlc: Improve child constraints for vector unary operations In-Reply-To: References: Message-ID: <12nBeqOz3m1MdJOJgCI9uvNPthCxxCPQ1bAEB-S2Dls=.aeae0826-7e1c-4e4b-9dc8-145c3ba6fd65@github.com> On Mon, 18 Jul 2022 07:46:17 GMT, Hao Sun wrote: > As demonstrated in [1], the child constraint generated for *predicated > vector unary operation* is the superset of that generated for the > *unpredicated* version. As a result, there exists a risk for predicated > vector unary operations to match the unpredicated rules by accident. > > In this patch, we resolve this issue by generating one extra check > "rChild == NULL" ONLY for vector unary operations. In this way, the > child constraints for predicated/unpredicated vector unary operations > are exclusive now. > > Following the example in [1], the dfa state generated for AbsVI is shown > below.
> > > void State::_sub_Op_AbsVI(const Node *n){ > if( STATE__VALID_CHILD(_kids[0], VREG) && STATE__VALID_CHILD(_kids[1], PREGGOV) && > ( UseSVE > 0 ) ) > { > unsigned int c = _kids[0]->_cost[VREG]+_kids[1]->_cost[PREGGOV] + SVE_COST; > DFA_PRODUCTION(VREG, vabsI_masked_rule, c) > } > if( STATE__VALID_CHILD(_kids[0], VREG) && _kids[1] == NULL && <---- 1 > ( UseSVE > 0) ) > { > unsigned int c = _kids[0]->_cost[VREG] + SVE_COST; > if (STATE__NOT_YET_VALID(VREG) || _cost[VREG] > c) { > DFA_PRODUCTION(VREG, vabsI_rule, c) > } > } > ... > > > We can see that the constraint at line 1 cannot be matched for > predicated AbsVI node now. > > The main updates are made in adlc/dfa part. Ideally, we should only > add the extra check for affected platforms, i.e. AVX-512 and SVE. But we > didn't do that because it would be better not to introduce any > architecture dependent implementation here. > > Besides, workarounds in both aarch64_sve.ad and x86.ad are removed. 1) > Many "is_predicated_vector()" checks can be removed in aarch64_sve.ad > file. 2) Default instruction cost is used for involving rules in x86.ad > file. > > [1]. https://github.com/shqking/jdk/commit/50ec9b19 Marked as reviewed by eliu (Author). LGTM. ------------- PR: https://git.openjdk.org/jdk/pull/9534 From thartmann at openjdk.org Thu Jul 21 06:47:56 2022 From: thartmann at openjdk.org (Tobias Hartmann) Date: Thu, 21 Jul 2022 06:47:56 GMT Subject: RFR: 8290704: x86: TemplateTable::_new should not call eden_allocate() without contiguous allocs enabled In-Reply-To: References: Message-ID: <7INGe5wLgxHIJNmJrLOoJSrw_8H_yqzHtKECi9-M3aU=.0ace6445-8176-48fb-b866-a15303c01c0d@github.com> On Wed, 20 Jul 2022 09:46:08 GMT, Aleksey Shipilev wrote: > I have been doing the thread register verification patches for better Loom debugging, and one of the failures it caught is calling `eden_allocate` with garbage thread register. 
That method actually shortcuts: > > > void BarrierSetAssembler::eden_allocate(MacroAssembler* masm, > Register thread, Register obj, > Register var_size_in_bytes, > int con_size_in_bytes, > Register t1, > Label& slow_case) { > ... > if (!Universe::heap()->supports_inline_contig_alloc()) { > __ jmp(slow_case); > > > ...and does not use the thread. But it is still confusing. Other ports gate the calls to `eden_allocate` with `allow_shared_alloc`, x86 should do the same. > > (This thing would be cleaner when/if we remove the support for contiguous inline allocs altogether, see [JDK-8290706](https://bugs.openjdk.org/browse/JDK-8290706)). > > Additional testing: > - [x] Linux x86_32 fastdebug `tier1` Marked as reviewed by thartmann (Reviewer). Looks good to me but this code is part of the interpreter and therefore owned by runtime. ------------- PR: https://git.openjdk.org/jdk/pull/9567 From denghui.ddh at alibaba-inc.com Thu Jul 21 07:11:37 2022 From: denghui.ddh at alibaba-inc.com (Denghui Dong) Date: Thu, 21 Jul 2022 15:11:37 +0800 Subject: Crash due to bad oop map Message-ID: <4f48521a-31a3-4a51-9c78-9346c6f968b9.denghui.ddh@alibaba-inc.com> Hi team, We encountered a crash due to a bad oop map. The following steps can quickly reproduce the problem on JDK master on Linux x64: 1.
Modify z_x86_64.ad and add 3 temporary registers of type rRegN to zLoadP (if this change is illegal, please let me know)
```
diff --git a/src/hotspot/cpu/x86/gc/z/z_x86_64.ad b/src/hotspot/cpu/x86/gc/z/z_x86_64.ad
index f3e19b41733..129c93d3da8 100644
--- a/src/hotspot/cpu/x86/gc/z/z_x86_64.ad
+++ b/src/hotspot/cpu/x86/gc/z/z_x86_64.ad
@@ -63,11 +63,11 @@ static void z_load_barrier_cmpxchg(MacroAssembler& _masm, const MachNode* node,
 %}
 // Load Pointer
-instruct zLoadP(rRegP dst, memory mem, rFlagsReg cr)
+instruct zLoadP(rRegP dst, memory mem, rRegN tmp1, rRegN tmp2, rRegN tmp3, rFlagsReg cr)
 %{
   predicate(UseZGC && n->as_Load()->barrier_data() != 0);
   match(Set dst (LoadP mem));
-  effect(KILL cr, TEMP dst);
+  effect(KILL cr, TEMP dst, TEMP tmp1, TEMP tmp2, TEMP tmp3);
   ins_cost(125);
```
2. Run the following code (./build/linux-x86_64-server-fastdebug/images/jdk/bin/java -XX:-Inline -XX:CompileCommand=compileonly,Test::call -XX:+PrintAssembly -XX:PrintIdealGraphLevel=2 -XX:PrintIdealGraphFile=call.xml -XX:+UseZGC -XX:+UseNewCode -XX:+PrintGC -Xmx500m -Xms500m -XX:-Inline Test)
```
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;

class Test {
  private Lock lock = new ReentrantLock();

  public static void main(String... args) throws Exception {
    new Thread(() -> {
      while (true) {
        byte[] b = new byte[10 * 1024 * 1024];
      }
    }).start();
    while (true) {
      new Test().call(() -> {
        new Object();
      });
    }
  }

  public void call(Runnable cb) {
    lock.lock();
    try {
      cb.run();
    } finally {
      lock.unlock();
    }
  }
}
```
It can be observed through the IGV Final Code that zLoadP(202, 204) and a MachTemp(81) it uses are placed in different blocks. When processing CallStaticJavaDirect(74) in the c2 buildOopMap step, MachTemp(81) is considered to be alive.
Because the register type specified by the above modification is rRegN (if it is specified as rRegI or rRegP, no crash will occur), so the type of MachTemp(81) is narrowOop, which will eventually be added to the oopMap of CallStaticJavaDirect(74). But logically, zLoadP doesn't depend on the original value of MachTemp(81), so I don't think it should be added to oopMap. I put the ideal graph file to http://cr.openjdk.java.net/~ddong/call.xml I think the problem may lie in two places: 1. gcm and regAlloc do not put MachTemp(81) in the correct location 2. buildOopMap should not treat MachTemp as Oop because its original value is meaningless I'm not a JIT expert and sorry if there is anything unnatural or non-standard in the above description. Any input is appreciated. Denghui Dong -------------- next part -------------- An HTML attachment was scrubbed... URL: From thartmann at openjdk.org Thu Jul 21 11:01:52 2022 From: thartmann at openjdk.org (Tobias Hartmann) Date: Thu, 21 Jul 2022 11:01:52 GMT Subject: RFR: 8290705: StringConcat::validate_mem_flow asserts with "unexpected user: StoreI" Message-ID: C2's string concatenation optimization (`OptimizeStringConcat`) does not correctly handle side effecting instructions between StringBuffer Allocate/Initialize and the call to the constructor. In the failing test, see `SideEffectBeforeConstructor::test`, a `result` field is incremented just before the constructor is invoked. The string concatenation optimization still merges the allocation, constructor and `toString` calls and incorrectly re-wires the store to before the concatenation. As a result, passing `null` to the constructor will incorrectly increment the field before throwing a NullPointerException. With a debug build, we hit an assert in `StringConcat::validate_mem_flow` due to the unexpected field store. This is an old bug and an extreme edge case as javac would not generate such code. 
The following comment suggests that this case should be covered by `StringConcat::validate_control_flow()`: https://github.com/openjdk/jdk/blob/3582fd9e93d9733c6fdf1f3848e0a093d44f6865/src/hotspot/share/opto/stringopts.cpp#L834-L838 However, the control flow analysis does not catch this case. I added the missing check. Thanks, Tobias ------------- Commit messages: - 8290705: StringConcat::validate_mem_flow asserts with "unexpected user: StoreI" Changes: https://git.openjdk.org/jdk/pull/9589/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=9589&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8290705 Stats: 121 lines in 3 files changed: 121 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/9589.diff Fetch: git fetch https://git.openjdk.org/jdk pull/9589/head:pull/9589 PR: https://git.openjdk.org/jdk/pull/9589 From jzhu at openjdk.org Thu Jul 21 11:44:05 2022 From: jzhu at openjdk.org (Joshua Zhu) Date: Thu, 21 Jul 2022 11:44:05 GMT Subject: RFR: 8290322: Optimize Vector.rearrange over byte vectors for AVX512BW targets. In-Reply-To: References: Message-ID: On Thu, 14 Jul 2022 18:23:51 GMT, Jatin Bhateja wrote: > Hi All, > > Currently re-arrange over 512bit bytevector is optimized for targets supporting AVX512_VBMI feature, this patch generates efficient JIT sequence to handle it for AVX512BW targets. Following performance results with newly added benchmark shows > significant speedup. 
> > System: Intel(R) Xeon(R) Platinum 8280 CPU @ 2.70GHz (CascadeLake 28C 2S) > > > Baseline: > ========= > Benchmark (size) Mode Cnt Score Error Units > RearrangeBytesBenchmark.testRearrangeBytes16 512 thrpt 2 16350.330 ops/ms > RearrangeBytesBenchmark.testRearrangeBytes32 512 thrpt 2 15991.346 ops/ms > RearrangeBytesBenchmark.testRearrangeBytes64 512 thrpt 2 34.423 ops/ms > RearrangeBytesBenchmark.testRearrangeBytes8 512 thrpt 2 10873.348 ops/ms > > > With-opt: > ========= > Benchmark (size) Mode Cnt Score Error Units > RearrangeBytesBenchmark.testRearrangeBytes16 512 thrpt 2 16062.624 ops/ms > RearrangeBytesBenchmark.testRearrangeBytes32 512 thrpt 2 16028.494 ops/ms > RearrangeBytesBenchmark.testRearrangeBytes64 512 thrpt 2 8741.901 ops/ms > RearrangeBytesBenchmark.testRearrangeBytes8 512 thrpt 2 10983.226 ops/ms > > > Kindly review and share your feedback. > > Best Regards, > Jatin Looks good to me. My Cascade Lake server benefits from this change. ------------- PR: https://git.openjdk.org/jdk/pull/9498 From thartmann at openjdk.org Thu Jul 21 12:37:06 2022 From: thartmann at openjdk.org (Tobias Hartmann) Date: Thu, 21 Jul 2022 12:37:06 GMT Subject: RFR: JDK-8290016: IGV: Fix graph panning when mouse dragged outside of window [v2] In-Reply-To: References: <5cJlObLZkQGGa5rj2V-w-bamKwe9od2G7EunMdVY8eM=.47289b58-9475-4cbe-b9a3-5abd47488817@github.com> Message-ID: On Wed, 20 Jul 2022 15:10:47 GMT, Tobias Holenstein wrote: >> A graph in IGV can be moved by dragging it with the left mouse button (called panning). >> ![panning](https://user-images.githubusercontent.com/71546117/178509416-24dd900f-131b-484b-af47-c7a78e791434.png) >> >> If the mouse left the visible window of the graph during dragging, the diagram started to move in the opposite direction. This was annoying. Now panning stops as soon as the mouse leaves the window. 
>> ![stop reverse panning](https://user-images.githubusercontent.com/71546117/178509309-3df03b7a-ada4-45a3-b9a7-d6e10664033d.png) >> >> In selection mode, the graph still moves when the mouse is dragged outside the window, as this is meant to make a larger selection. >> ![keep panning for selection](https://user-images.githubusercontent.com/71546117/178509302-74fa41d2-e611-40a3-b6b0-c937ef4b2462.png) > > Tobias Holenstein has updated the pull request incrementally with one additional commit since the last revision: > > fixed locking in CustomizablePanAction Looks reasonable. src/utils/IdealGraphVisualizer/View/src/main/java/com/sun/hotspot/igv/view/actions/CustomizablePanAction.java line 4: > 2: * DO NOT ALTER OR REMOVE COPYRIGHT NOTICES OR THIS HEADER. > 3: * > 4: * Copyright (c) 1997, 2015, 2022, Oracle and/or its affiliates. All rights reserved. Suggestion: * Copyright (c) 1997, 2022, Oracle and/or its affiliates. All rights reserved. ------------- Marked as reviewed by thartmann (Reviewer). PR: https://git.openjdk.org/jdk/pull/9470 From shade at openjdk.org Thu Jul 21 13:10:57 2022 From: shade at openjdk.org (Aleksey Shipilev) Date: Thu, 21 Jul 2022 13:10:57 GMT Subject: RFR: 8290704: x86: TemplateTable::_new should not call eden_allocate() without contiguous allocs enabled In-Reply-To: References: Message-ID: On Wed, 20 Jul 2022 09:46:08 GMT, Aleksey Shipilev wrote: > I have been doing the thread register verification patches for better Loom debugging, and one of the failures it caught is calling `eden_allocate` with garbage thread register. That method actually shortcuts: > > > void BarrierSetAssembler::eden_allocate(MacroAssembler* masm, > Register thread, Register obj, > Register var_size_in_bytes, > int con_size_in_bytes, > Register t1, > Label& slow_case) { > ... > if (!Universe::heap()->supports_inline_contig_alloc()) { > __ jmp(slow_case); > > > ...and does not use the thread. But it is still confusing. 
Other ports gate the calls to `eden_allocate` with `allow_shared_alloc`, x86 should do the same. > > (This thing would be cleaner when/if we remove the support for contiguous inline allocs altogether, see [JDK-8290706](https://bugs.openjdk.org/browse/JDK-8290706)). > > Additional testing: > - [x] Linux x86_32 fastdebug `tier1` All right, thanks. I guess I need to have a formal ack from runtime Reviewer then. :) ------------- PR: https://git.openjdk.org/jdk/pull/9567 From coleenp at openjdk.org Thu Jul 21 13:15:51 2022 From: coleenp at openjdk.org (Coleen Phillimore) Date: Thu, 21 Jul 2022 13:15:51 GMT Subject: RFR: 8290704: x86: TemplateTable::_new should not call eden_allocate() without contiguous allocs enabled In-Reply-To: References: Message-ID: On Wed, 20 Jul 2022 09:46:08 GMT, Aleksey Shipilev wrote: > I have been doing the thread register verification patches for better Loom debugging, and one of the failures it caught is calling `eden_allocate` with garbage thread register. That method actually shortcuts: > > > void BarrierSetAssembler::eden_allocate(MacroAssembler* masm, > Register thread, Register obj, > Register var_size_in_bytes, > int con_size_in_bytes, > Register t1, > Label& slow_case) { > ... > if (!Universe::heap()->supports_inline_contig_alloc()) { > __ jmp(slow_case); > > > ...and does not use the thread. But it is still confusing. Other ports gate the calls to `eden_allocate` with `allow_shared_alloc`, x86 should do the same. > > (This thing would be cleaner when/if we remove the support for contiguous inline allocs altogether, see [JDK-8290706](https://bugs.openjdk.org/browse/JDK-8290706)). > > Additional testing: > - [x] Linux x86_32 fastdebug `tier1` Seems good to me. ------------- Marked as reviewed by coleenp (Reviewer). 
PR: https://git.openjdk.org/jdk/pull/9567 From shade at openjdk.org Thu Jul 21 13:26:09 2022 From: shade at openjdk.org (Aleksey Shipilev) Date: Thu, 21 Jul 2022 13:26:09 GMT Subject: RFR: 8290704: x86: TemplateTable::_new should not call eden_allocate() without contiguous allocs enabled In-Reply-To: References: Message-ID: On Wed, 20 Jul 2022 09:46:08 GMT, Aleksey Shipilev wrote: > I have been doing the thread register verification patches for better Loom debugging, and one of the failures it caught is calling `eden_allocate` with garbage thread register. That method actually shortcuts: > > > void BarrierSetAssembler::eden_allocate(MacroAssembler* masm, > Register thread, Register obj, > Register var_size_in_bytes, > int con_size_in_bytes, > Register t1, > Label& slow_case) { > ... > if (!Universe::heap()->supports_inline_contig_alloc()) { > __ jmp(slow_case); > > > ...and does not use the thread. But it is still confusing. Other ports gate the calls to `eden_allocate` with `allow_shared_alloc`, x86 should do the same. > > (This thing would be cleaner when/if we remove the support for contiguous inline allocs altogether, see [JDK-8290706](https://bugs.openjdk.org/browse/JDK-8290706)). > > Additional testing: > - [x] Linux x86_32 fastdebug `tier1` Thanks all! ------------- PR: https://git.openjdk.org/jdk/pull/9567 From shade at openjdk.org Thu Jul 21 13:26:10 2022 From: shade at openjdk.org (Aleksey Shipilev) Date: Thu, 21 Jul 2022 13:26:10 GMT Subject: Integrated: 8290704: x86: TemplateTable::_new should not call eden_allocate() without contiguous allocs enabled In-Reply-To: References: Message-ID: <6OTfqJQfdAYOxatcdvnhp1gRehqA5setz-DBxAcFj2c=.fdc58c22-45f7-4958-826b-50656967a2cc@github.com> On Wed, 20 Jul 2022 09:46:08 GMT, Aleksey Shipilev wrote: > I have been doing the thread register verification patches for better Loom debugging, and one of the failures it caught is calling `eden_allocate` with garbage thread register. 
That method actually shortcuts: > > > void BarrierSetAssembler::eden_allocate(MacroAssembler* masm, > Register thread, Register obj, > Register var_size_in_bytes, > int con_size_in_bytes, > Register t1, > Label& slow_case) { > ... > if (!Universe::heap()->supports_inline_contig_alloc()) { > __ jmp(slow_case); > > > ...and does not use the thread. But it is still confusing. Other ports gate the calls to `eden_allocate` with `allow_shared_alloc`, x86 should do the same. > > (This thing would be cleaner when/if we remove the support for contiguous inline allocs altogether, see [JDK-8290706](https://bugs.openjdk.org/browse/JDK-8290706)). > > Additional testing: > - [x] Linux x86_32 fastdebug `tier1` This pull request has now been integrated. Changeset: 59e495e4 Author: Aleksey Shipilev URL: https://git.openjdk.org/jdk/commit/59e495e4d320b79d1b0ddff3f552f69a01d8dc8d Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod 8290704: x86: TemplateTable::_new should not call eden_allocate() without contiguous allocs enabled Reviewed-by: kvn, thartmann, coleenp ------------- PR: https://git.openjdk.org/jdk/pull/9567 From tholenstein at openjdk.org Thu Jul 21 14:55:04 2022 From: tholenstein at openjdk.org (Tobias Holenstein) Date: Thu, 21 Jul 2022 14:55:04 GMT Subject: RFR: JDK-8290016: IGV: Fix graph panning when mouse dragged outside of window [v2] In-Reply-To: References: <5cJlObLZkQGGa5rj2V-w-bamKwe9od2G7EunMdVY8eM=.47289b58-9475-4cbe-b9a3-5abd47488817@github.com> Message-ID: On Tue, 12 Jul 2022 16:24:05 GMT, Vladimir Kozlov wrote: >> Tobias Holenstein has updated the pull request incrementally with one additional commit since the last revision: >> >> fixed locking in CustomizablePanAction > > Good. thanks @vnkozlov and @TobiHartmann for the reviews! 
------------- PR: https://git.openjdk.org/jdk/pull/9470 From tholenstein at openjdk.org Thu Jul 21 14:57:28 2022 From: tholenstein at openjdk.org (Tobias Holenstein) Date: Thu, 21 Jul 2022 14:57:28 GMT Subject: Integrated: JDK-8290016: IGV: Fix graph panning when mouse dragged outside of window In-Reply-To: <5cJlObLZkQGGa5rj2V-w-bamKwe9od2G7EunMdVY8eM=.47289b58-9475-4cbe-b9a3-5abd47488817@github.com> References: <5cJlObLZkQGGa5rj2V-w-bamKwe9od2G7EunMdVY8eM=.47289b58-9475-4cbe-b9a3-5abd47488817@github.com> Message-ID: <-bUX6O2cSlaYqUPC6s3rlh5hRHZvAZDd3dX6n2fVsSk=.ae88c7a2-80a3-4a84-bdf6-cb0508e88a0d@github.com> On Tue, 12 Jul 2022 13:48:16 GMT, Tobias Holenstein wrote: > A graph in IGV can be moved by dragging it with the left mouse button (called panning). > ![panning](https://user-images.githubusercontent.com/71546117/178509416-24dd900f-131b-484b-af47-c7a78e791434.png) > > If the mouse left the visible window of the graph during dragging, the diagram started to move in the opposite direction. This was annoying. Now panning stops as soon as the mouse leaves the window. > ![stop reverse panning](https://user-images.githubusercontent.com/71546117/178509309-3df03b7a-ada4-45a3-b9a7-d6e10664033d.png) > > In selection mode, the graph still moves when the mouse is dragged outside the window, as this is meant to make a larger selection. > ![keep panning for selection](https://user-images.githubusercontent.com/71546117/178509302-74fa41d2-e611-40a3-b6b0-c937ef4b2462.png) This pull request has now been integrated. 
Changeset: 604a115a Author: Tobias Holenstein URL: https://git.openjdk.org/jdk/commit/604a115a5b8a4c8917a496f3bddb67f9f6468b99 Stats: 90 lines in 2 files changed: 25 ins; 26 del; 39 mod 8290016: IGV: Fix graph panning when mouse dragged outside of window Reviewed-by: kvn, thartmann ------------- PR: https://git.openjdk.org/jdk/pull/9470 From duke at openjdk.org Thu Jul 21 15:20:07 2022 From: duke at openjdk.org (Sacha Coppey) Date: Thu, 21 Jul 2022 15:20:07 GMT Subject: RFR: 8290154: [JVMCI] Implement JVMCI for RISC-V Message-ID: This patch adds a JVMCI implementation for RISC-V. It creates the jdk.vm.ci.riscv64 and jdk.vm.ci.hotspot.riscv64 packages, as well as implements a part of jvmciCodeInstaller_riscv64.cpp. To check for correctness, it enables JVMCI code installation tests on RISC-V. It should be tested soon in GraalVM Native Image as well. ------------- Commit messages: - 8290154: [JVMCI] Implement JVMCI for RISC-V Changes: https://git.openjdk.org/jdk/pull/9587/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=9587&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8290154 Stats: 1690 lines in 20 files changed: 1668 ins; 0 del; 22 mod Patch: https://git.openjdk.org/jdk/pull/9587.diff Fetch: git fetch https://git.openjdk.org/jdk pull/9587/head:pull/9587 PR: https://git.openjdk.org/jdk/pull/9587 From dcubed at openjdk.org Thu Jul 21 15:41:50 2022 From: dcubed at openjdk.org (Daniel D. Daugherty) Date: Thu, 21 Jul 2022 15:41:50 GMT Subject: RFR: 8290826: validate-source failures after JDK-8290016 Message-ID: A trivial fix for validate-source failures after JDK-8290016. 
------------- Commit messages: - 8290826: validate-source failures after JDK-8290016 Changes: https://git.openjdk.org/jdk/pull/9594/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=9594&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8290826 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/9594.diff Fetch: git fetch https://git.openjdk.org/jdk pull/9594/head:pull/9594 PR: https://git.openjdk.org/jdk/pull/9594 From azvegint at openjdk.org Thu Jul 21 15:41:50 2022 From: azvegint at openjdk.org (Alexander Zvegintsev) Date: Thu, 21 Jul 2022 15:41:50 GMT Subject: RFR: 8290826: validate-source failures after JDK-8290016 In-Reply-To: References: Message-ID: On Thu, 21 Jul 2022 15:23:36 GMT, Daniel D. Daugherty wrote: > A trivial fix for validate-source failures after JDK-8290016. Marked as reviewed by azvegint (Reviewer). ------------- PR: https://git.openjdk.org/jdk/pull/9594 From dcubed at openjdk.org Thu Jul 21 15:41:50 2022 From: dcubed at openjdk.org (Daniel D. Daugherty) Date: Thu, 21 Jul 2022 15:41:50 GMT Subject: RFR: 8290826: validate-source failures after JDK-8290016 In-Reply-To: References: Message-ID: <3HH-Yln8UWcYl9ywaHnGdoriF6RkmCpAa5T9Nzvytx4=.6aa69167-6d54-4637-a625-f190a5853077@github.com> On Thu, 21 Jul 2022 15:25:04 GMT, Alexander Zvegintsev wrote: >> A trivial fix for validate-source failures after JDK-8290016. > > Marked as reviewed by azvegint (Reviewer). @azvegint - Thanks for the lightning fast review! ------------- PR: https://git.openjdk.org/jdk/pull/9594 From dcubed at openjdk.org Thu Jul 21 15:46:03 2022 From: dcubed at openjdk.org (Daniel D. Daugherty) Date: Thu, 21 Jul 2022 15:46:03 GMT Subject: Integrated: 8290826: validate-source failures after JDK-8290016 In-Reply-To: References: Message-ID: On Thu, 21 Jul 2022 15:23:36 GMT, Daniel D. Daugherty wrote: > A trivial fix for validate-source failures after JDK-8290016. This pull request has now been integrated. 
Changeset: 6346c333 Author: Daniel D. Daugherty URL: https://git.openjdk.org/jdk/commit/6346c3338c23255a43b179cbd618990c31c2eabc Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod 8290826: validate-source failures after JDK-8290016 Reviewed-by: azvegint ------------- PR: https://git.openjdk.org/jdk/pull/9594 From duke at openjdk.org Thu Jul 21 16:37:03 2022 From: duke at openjdk.org (Evgeny Astigeevich) Date: Thu, 21 Jul 2022 16:37:03 GMT Subject: RFR: 8287393: AArch64: Remove trampoline_call1 [v2] In-Reply-To: References: Message-ID: > `trampoline_call` can do dummy code generation to calculate the size of C2 generated code. This is done in the output phase. In [src/hotspot/cpu/aarch64/sharedRuntime_aarch64.cpp#L1042](https://github.com/openjdk/jdk/blob/e0d361cea91d3dd1450aece73f660b4abb7ce5fa/src/hotspot/cpu/aarch64/sharedRuntime_aarch64.cpp#L1042) Loom code needed to generate a trampoline call outside of C2 and without the output phase. This caused test crashes. The project Loom added `trampoline_call1` to workaround the crashes. > > This PR improves detection of C2 output phase which makes `trampoline_call1` redundant. 
> > Tested the fastdebug/release builds: > - `'gtest`: Passed > - `tier1`...`tier2`: Passed Evgeny Astigeevich has updated the pull request incrementally with one additional commit since the last revision: Replace trampoline_call1 with trampoline_call ------------- Changes: - all: https://git.openjdk.org/jdk/pull/9592/files - new: https://git.openjdk.org/jdk/pull/9592/files/8ba17f7b..e732890b Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=9592&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=9592&range=00-01 Stats: 2 lines in 1 file changed: 0 ins; 0 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/9592.diff Fetch: git fetch https://git.openjdk.org/jdk pull/9592/head:pull/9592 PR: https://git.openjdk.org/jdk/pull/9592 From kvn at openjdk.org Thu Jul 21 16:47:06 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Thu, 21 Jul 2022 16:47:06 GMT Subject: RFR: 8290705: StringConcat::validate_mem_flow asserts with "unexpected user: StoreI" In-Reply-To: References: Message-ID: On Thu, 21 Jul 2022 10:55:41 GMT, Tobias Hartmann wrote: > C2's string concatenation optimization (`OptimizeStringConcat`) does not correctly handle side effecting instructions between StringBuffer Allocate/Initialize and the call to the constructor. In the failing test, see `SideEffectBeforeConstructor::test`, a `result` field is incremented just before the constructor is invoked. The string concatenation optimization still merges the allocation, constructor and `toString` calls and incorrectly re-wires the store to before the concatenation. As a result, passing `null` to the constructor will incorrectly increment the field before throwing a NullPointerException. With a debug build, we hit an assert in `StringConcat::validate_mem_flow` due to the unexpected field store. This is an old bug and an extreme edge case as javac would not generate such code. 
> > The following comment suggests that this case should be covered by `StringConcat::validate_control_flow()`: > https://github.com/openjdk/jdk/blob/3582fd9e93d9733c6fdf1f3848e0a093d44f6865/src/hotspot/share/opto/stringopts.cpp#L834-L838 > > However, the control flow analysis does not catch this case. I added the missing check. > > Thanks, > Tobias
src/hotspot/share/opto/stringopts.cpp line 1032:
> 1030: if (PrintOptimizeStringConcat) {
> 1031: tty->print_cr("unexpected control use of Initialize");
> 1032: use->dump(2);
What output did you get from `dump(2)` in your case? It could be more than needed if `use` has a lot of inputs. How about the following, to output only the interesting info?
ptr->in(0)->dump(); // Initialize node
use->dump(1);
tty->cr();
------------- PR: https://git.openjdk.org/jdk/pull/9589 From kvn at openjdk.org Thu Jul 21 17:06:06 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Thu, 21 Jul 2022 17:06:06 GMT Subject: RFR: 8290034: Auto vectorize reverse bit operations. In-Reply-To: References: Message-ID: On Mon, 18 Jul 2022 08:01:09 GMT, Jatin Bhateja wrote: > Summary of changes: > - Intrinsify scalar bit reverse APIs to emit efficient instruction sequence for X86 targets with and w/o GFNI feature. > - Handle auto-vectorization of Integer/Long.reverse bit operations. > - Backend implementations for these were added with the 4th incubation of the Vector API.
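For illustration only (this is not taken from the patch or its benchmarks): the summary above mentions auto-vectorizing `Integer.reverse` and `Long.reverse`. A loop of roughly the following shape is the kind of code that description appears to target. The class and method names are hypothetical, and whether a given JVM build actually vectorizes these loops depends on the CPU and flags in use.

```java
// Illustrative sketch; ReverseBitsSketch and its methods are hypothetical names.
final class ReverseBitsSketch {
    // Element-wise bit reversal of ints in a simple counted loop.
    // dst must be at least as long as src.
    static void reverseInts(int[] src, int[] dst) {
        for (int i = 0; i < src.length; i++) {
            dst[i] = Integer.reverse(src[i]);
        }
    }

    // Same loop shape for longs, using Long.reverse.
    static void reverseLongs(long[] src, long[] dst) {
        for (int i = 0; i < src.length; i++) {
            dst[i] = Long.reverse(src[i]);
        }
    }

    public static void main(String[] args) {
        int[] in = {1, 2, -1};
        int[] out = new int[in.length];
        reverseInts(in, out);
        // Bit 0 of the input becomes bit 31 of the output.
        System.out.println(Integer.toHexString(out[0])); // prints 80000000
    }
}
```

The scalar semantics are the same with or without vectorization, so a loop like this can also serve as a correctness reference when comparing compiled output.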
> > The following are the performance numbers for the newly added JMH micro benchmarks:
>
> No-GFNI(CLX):
> =============
> Baseline:
> Benchmark         (size)  Mode  Cnt  Score  Error  Units
> Integers.reverse     500  avgt    2  1.085         us/op
> Longs.reverse        500  avgt    2  1.236         us/op
> WithOpt:
> Benchmark         (size)  Mode  Cnt  Score  Error  Units
> Integers.reverse     500  avgt    2  0.104         us/op
> Longs.reverse        500  avgt    2  0.255         us/op
>
> With-GFNI(ICX):
> ===============
> Baseline:
> Benchmark         (size)  Mode  Cnt  Score  Error  Units
> Integers.reverse     500  avgt    2  0.887         us/op
> Longs.reverse        500  avgt    2  1.095         us/op
>
> WithOpt:
> Benchmark         (size)  Mode  Cnt  Score  Error  Units
> Integers.reverse     500  avgt    2  0.037         us/op
> Longs.reverse        500  avgt    2  0.145         us/op
>
> Kindly review and share feedback. > > Best Regards, > Jatin Testing passed. ------------- Marked as reviewed by kvn (Reviewer). PR: https://git.openjdk.org/jdk/pull/9535 From xliu at openjdk.org Thu Jul 21 20:05:33 2022 From: xliu at openjdk.org (Xin Liu) Date: Thu, 21 Jul 2022 20:05:33 GMT Subject: RFR: 8287385: Suppress superficial unstable_if traps Message-ID: 8287385: Suppress superficial unstable_if traps ------------- Commit messages: - 8287385: Suppress superficial unstable_if traps Changes: https://git.openjdk.org/jdk/pull/9601/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=9601&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8287385 Stats: 92 lines in 5 files changed: 71 ins; 6 del; 15 mod Patch: https://git.openjdk.org/jdk/pull/9601.diff Fetch: git fetch https://git.openjdk.org/jdk pull/9601/head:pull/9601 PR: https://git.openjdk.org/jdk/pull/9601 From duke at openjdk.org Thu Jul 21 20:28:01 2022 From: duke at openjdk.org (Evgeny Astigeevich) Date: Thu, 21 Jul 2022 20:28:01 GMT Subject: RFR: 8287393: AArch64: Remove trampoline_call1 [v2] In-Reply-To: References: Message-ID: On Thu, 21 Jul 2022 16:37:03 GMT, Evgeny Astigeevich wrote: >> `trampoline_call` can do dummy code generation to calculate the size of C2 generated code.
This is done in the output phase. In [src/hotspot/cpu/aarch64/sharedRuntime_aarch64.cpp#L1042](https://github.com/openjdk/jdk/blob/e0d361cea91d3dd1450aece73f660b4abb7ce5fa/src/hotspot/cpu/aarch64/sharedRuntime_aarch64.cpp#L1042) Loom code needed to generate a trampoline call outside of C2 and without the output phase. This caused test crashes. The project Loom added `trampoline_call1` to workaround the crashes. >> >> This PR improves detection of C2 output phase which makes `trampoline_call1` redundant. >> >> Tested the fastdebug/release builds: >> - `'gtest`: Passed >> - `tier1`...`tier2`: Passed > > Evgeny Astigeevich has updated the pull request incrementally with one additional commit since the last revision: > > Replace trampoline_call1 with trampoline_call Andrew (@theRealAph), could you please have a look? Thanks. ------------- PR: https://git.openjdk.org/jdk/pull/9592 From dean.long at oracle.com Fri Jul 22 03:28:23 2022 From: dean.long at oracle.com (dean.long at oracle.com) Date: Thu, 21 Jul 2022 20:28:23 -0700 Subject: Crash due to bad oop map In-Reply-To: <4f48521a-31a3-4a51-9c78-9346c6f968b9.denghui.ddh@alibaba-inc.com> References: <4f48521a-31a3-4a51-9c78-9346c6f968b9.denghui.ddh@alibaba-inc.com> Message-ID: Hi Denghui Dong. This looks like JDK-8051805 (https://bugs.openjdk.org/browse/JDK-8051805). As you discovered, it is triggered when RegN is used as a TEMP. As a work-around, have you tried using rRegI instead of rRegN for your TEMPs? dl On 7/21/22 12:11 AM, Denghui Dong wrote: > Hi team, > > We encountered a crash due to a bad oop map. > The following steps can quickly reproduce the problem on JDK master on > Linux x64: > 1. 
Modify z_x86_64.ad and add 3 temporary registers of type rRegN to > zLoadP (If this change is illegal, please let me know)
> ```
> diff --git a/src/hotspot/cpu/x86/gc/z/z_x86_64.ad b/src/hotspot/cpu/x86/gc/z/z_x86_64.ad
> index f3e19b41733..129c93d3da8 100644
> --- a/src/hotspot/cpu/x86/gc/z/z_x86_64.ad
> +++ b/src/hotspot/cpu/x86/gc/z/z_x86_64.ad
> @@ -63,11 +63,11 @@ static void z_load_barrier_cmpxchg(MacroAssembler& _masm, const MachNode* node,
>  %}
>  // Load Pointer
> -instruct zLoadP(rRegP dst, memory mem, rFlagsReg cr)
> +instruct zLoadP(rRegP dst, memory mem, rRegN tmp1, rRegN tmp2, rRegN tmp3, rFlagsReg cr)
>  %{
>    predicate(UseZGC && n->as_Load()->barrier_data() != 0);
>    match(Set dst (LoadP mem));
> -  effect(KILL cr, TEMP dst);
> +  effect(KILL cr, TEMP dst, TEMP tmp1, TEMP tmp2, TEMP tmp3);
>    ins_cost(125);
> ```
> 2. Run the following code > (./build/linux-x86_64-server-fastdebug/images/jdk/bin/java -XX:-Inline > -XX:CompileCommand=compileonly,Test::call -XX:+PrintAssembly > -XX:PrintIdealGraphLevel=2 -XX:PrintIdealGraphFile=call.xml -XX:+UseZGC > -XX:+UseNewCode -XX:+PrintGC -Xmx500m -Xms500m -XX:-Inline Test)
> ```
> import java.util.concurrent.locks.Lock;
> import java.util.concurrent.locks.ReentrantLock;
> class Test {
>   private Lock lock = new ReentrantLock();
>   public static void main(String... args) throws Exception {
>     new Thread(() -> {
>       while (true) {
>         byte[] b = new byte[10 * 1024 * 1024];
>       }
>     }).start();
>     while (true) {
>       new Test().call(() -> { new Object(); });
>     }
>   }
>   public void call(Runnable cb) {
>     lock.lock();
>     try {
>       cb.run();
>     } finally {
>       lock.unlock();
>     }
>   }
> }
> ```
> > It can be observed through the IGV Final Code that zLoadP(202, 204) and > a MachTemp(81) it uses are placed in different blocks. When processing > CallStaticJavaDirect(74) in the c2 buildOopMap step, MachTemp(81) is > considered to be alive.
Because the register type specified by the above > modification is rRegN (if it is specified as rRegI or rRegP, no crash > will occur), so the type of MachTemp(81) is narrowOop, which will > eventually be added to the oopMap of CallStaticJavaDirect(74). But > logically, zLoadP doesn't depend on the original value of MachTemp(81), > so I don't think it should be added to oopMap. > > I put the ideal graph file to http://cr.openjdk.java.net/~ddong/call.xml > > > I think the problem may lie in two places: > 1. gcm and regAlloc do not put MachTemp(81) in the correct location > 2. buildOopMap should not treat MachTemp as Oop because its original > value is meaningless > > I'm not a JIT expert and sorry if there is anything unnatural or > non-standard in the above description. > > Any input is appreciated. > > Denghui Dong From pli at openjdk.org Fri Jul 22 03:57:22 2022 From: pli at openjdk.org (Pengfei Li) Date: Fri, 22 Jul 2022 03:57:22 GMT Subject: RFR: 8289996: Fix array range check hoisting for some scaled loop iv [v3] In-Reply-To: References: Message-ID: > Recently we found some array range checks in loops are not hoisted by > C2's loop predication phase as expected. Below is a typical case. > > for (int i = 0; i < size; i++) { > b[3 * i] = a[3 * i]; > } > > Ideally, C2 can hoist the range check of an array access in loop if the > array index is a linear function of the loop's induction variable (iv). > Say, range check in `arr[exp]` can be hoisted if > > exp = k1 * iv + k2 + inv > > where `k1` and `k2` are compile-time constants, and `inv` is an optional > loop invariant. But in above case, C2 igvn does some strength reduction > on the `MulINode` used to compute `3 * i`. It results in the linear index > expression not being recognized. So far we found 2 ideal transformations > that may affect linear expression recognition. 
They are > > - `k * iv` --> `iv << m + iv << n` if k is the sum of 2 pow-of-2 values > - `k * iv` --> `iv << m - iv` if k+1 is a pow-of-2 value > > To avoid range check hoisting and further optimizations being broken, we > have tried improving the linear recognition. But after some experiments, > we found complex and recursive pattern match does not always work well. > In this patch we propose to defer these 2 ideal transformations to the > phase of post loop igvn. In other words, these 2 strength reductions can > only be done after all loop optimizations are over. > > Tested hotspot::hotspot_all_no_apps, jdk::tier1~3 and langtools::tier1. > We also tested the performance via JMH and see obvious improvement. > > Benchmark Improvement > RangeCheckHoisting.ivScaled3 +21.2% > RangeCheckHoisting.ivScaled7 +6.6% Pengfei Li has updated the pull request incrementally with one additional commit since the last revision: Address more comments ------------- Changes: - all: https://git.openjdk.org/jdk/pull/9508/files - new: https://git.openjdk.org/jdk/pull/9508/files/8a3c704a..c042d4ac Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=9508&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=9508&range=01-02 Stats: 79 lines in 3 files changed: 26 ins; 12 del; 41 mod Patch: https://git.openjdk.org/jdk/pull/9508.diff Fetch: git fetch https://git.openjdk.org/jdk pull/9508/head:pull/9508 PR: https://git.openjdk.org/jdk/pull/9508 From pli at openjdk.org Fri Jul 22 04:12:02 2022 From: pli at openjdk.org (Pengfei Li) Date: Fri, 22 Jul 2022 04:12:02 GMT Subject: RFR: 8289996: Fix array range check hoisting for some scaled loop iv [v2] In-Reply-To: References: Message-ID: On Tue, 19 Jul 2022 10:13:56 GMT, Quan Anh Mai wrote: >> Pengfei Li has updated the pull request incrementally with one additional commit since the last revision: >> >> Address comment from merykitty > > Thanks, it looks good to me, maybe we would need similar tests for long range checks also. 
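As a side note on the two strength reductions listed in the 8289996 description quoted above, the following is a small sketch (not code from the patch) that checks the arithmetic identities behind them for `k = 3` and `k = 7`, the two scale factors named in the benchmark table. The class and method names are hypothetical. Note that the notation `iv << m + iv << n` in the mail is informal: in Java source, `+` binds tighter than `<<`, so the shifts need parentheses.

```java
// Illustrative sketch; ScaledIvSketch and its methods are hypothetical names.
final class ScaledIvSketch {
    // 3 = 2 + 1 is a sum of two powers of two, so 3 * iv == (iv << 1) + (iv << 0).
    static int times3Shifted(int iv) {
        return (iv << 1) + iv;
    }

    // 7 + 1 = 8 is a power of two, so 7 * iv == (iv << 3) - iv.
    static int times7Shifted(int iv) {
        return (iv << 3) - iv;
    }

    public static void main(String[] args) {
        // Java int arithmetic wraps modulo 2^32, so the identities also hold on overflow.
        int[] samples = {0, 1, 5, -9, 1 << 20, Integer.MAX_VALUE, Integer.MIN_VALUE};
        for (int iv : samples) {
            if (3 * iv != times3Shifted(iv) || 7 * iv != times7Shifted(iv)) {
                throw new AssertionError("mismatch for iv=" + iv);
            }
        }
        System.out.println("shift forms agree with multiplication on all samples");
    }
}
```

The values are identical either way; the point made in the description is only that the shifted form no longer looks like `k1 * iv + k2 + inv` to the pattern matcher used for range check hoisting.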
@merykitty @rose00 Thanks for all your comments. Please see my updated code. After some investigation on memory segment accesses, I see this does help 64-bit long range checks. As memory segment access introduces quite a lot of IRs, I rewrite the jtreg case and turn to checking the output of `-XX:+TraceLoopPredicate` to make the test more accurate. I have re-tested tier1, also checked that the new jtreg would fail without my changes in `mulnode.cpp`. ------------- PR: https://git.openjdk.org/jdk/pull/9508 From haosun at openjdk.org Fri Jul 22 05:25:27 2022 From: haosun at openjdk.org (Hao Sun) Date: Fri, 22 Jul 2022 05:25:27 GMT Subject: RFR: 8285790: AArch64: Merge C2 NEON and SVE matching rules [v2] In-Reply-To: References: Message-ID: > **MOTIVATION** > > This is a big refactoring patch of merging rules in aarch64_sve.ad and > aarch64_neon.ad. The motivation can also be found at [1]. > > Currently AArch64 has aarch64_sve.ad and aarch64_neon.ad to support SVE > and NEON codegen respectively. 1) For SVE rules we use vReg operand to > match VecA for an arbitrary length of vector type, when SVE is enabled; > 2) For NEON rules we use vecX/vecD operands to match VecX/VecD for > 128-bit/64-bit vectors, when SVE is not enabled. > > This separation looked clean at the time of introducing SVE support. > However, there are two main drawbacks now. > > **Drawback-1**: NEON (Advanced SIMD) is the mandatory feature on AArch64 and > SVE vector registers share the lower 128 bits with NEON registers. For > some cases, even when SVE is enabled, we still prefer to match NEON > rules and emit NEON instructions. > > **Drawback-2**: With more and more vector rules added to support VectorAPI, > there are lots of rules in both two ad files with different predication > conditions, e.g., different values of UseSVE or vector type/size. > > Examples can be found in [1]. These two drawbacks make the code less > maintainable and increase the libjvm.so code size. 
> > **KEY UPDATES** > > In this patch, we mainly do two things, using generic vReg to match all > NEON/SVE vector registers and merging NEON/SVE matching rules. > > - Update-1: Use generic vReg to match all NEON/SVE vector registers > > Two different approaches were considered, and we prefer to use generic > vector solution but keep VecA operand for all >128-bit vectors. See the > last slide in [1]. All the changes lie in the AArch64 backend. > > 1) Some helpers are updated in aarch64.ad to enable generic vector on > AArch64. See vector_ideal_reg(), pd_specialize_generic_vector_operand(), > is_reg2reg_move() and is_generic_vector(). > > 2) Operand vecA is created to match VecA register, and vReg is updated > to match VecA/D/X registers dynamically. > > With the introduction of generic vReg, difference in register types > between NEON rules and SVE rules can be eliminated, which makes it easy > to merge these rules. > > - Update-2: Try to merge existing rules > > As updated in GensrcAdlc.gmk, new ad file, aarch64_vector.ad, is > introduced to hold the grouped and merged matching rules. > > 1) Similar rules with difference in vector type/size can be merged into > new rules, where different types and vector sizes are handled in the > codegen part, e.g., vadd(). This resolves **Drawback-2**. > > 2) In most cases, we tend to emit NEON instructions for 128-bit vector > operations on SVE platforms, e.g., vadd(). This resolves **Drawback-1**. > > It's important to note that there are some exceptions. > > Exception-1: For some rules, there are no direct NEON instructions, but > exists simple SVE implementation due to newly added SVE ISA. Such rules > include vloadconB, vmulL_neon, vminL_neon, vmaxL_neon, > reduce_addF_le128b (4F case), reduce_and/or/xor_neon, reduce_minL_neon, > reduce_maxL_neon, vcvtLtoF_neon, vcvtDtoI_neon, rearrange_HS_neon. 
> > Exception-2: Vector mask generation and operation rules are different > because vector mask is stored in different types of registers between > NEON and SVE, e.g., vmaskcmp_neon and vmask_truecount_neon rules. > > Exception-3: Shift right related rules are different because vector > shift right instructions differ a bit between NEON and SVE. > > For these exceptions, we emit NEON or SVE code simply based on UseSVE > options. > > **MINOR UPDATES and CODE REFACTORING** > > Since we've touched all lines of code during merging rules, we further > do more minor updates and refactoring. > > - Reduce regmask bits > > Stack slot alignment is handled specially for scalable vector, which > will firstly align to SlotsPerVecA, and then align to the real vector > length. We should guarantee SlotsPerVecA is no bigger than the real > vector length. Otherwise, unused stack space would be allocated. > > In AArch64 SVE, the vector length can be 128 to 2048 bits. However, > SlotsPerVecA is currently set as 8, i.e. 8 * 32 = 256 bits. As a result, > on a 128-bit SVE platform, the stack slot is aligned to 256 bits, > leaving the half 128 bits unused. In this patch, we reduce SlotsPerVecA > from 8 to 4. > > See the updates in register_aarch64.hpp, regmask.hpp and aarch64.ad > (chunk1 and vectora_reg). > > - Refactor NEON/SVE vector op support check. > > Merge NEON and SVE vector supported check into one single function. To > be consistent, SVE default size supported check now is relaxed from no > less than 64 bits to the same condition as NEON's min_vector_size(), > i.e. 4 bytes and 4/2 booleans are also supported. This should be fine, > as we assume at least we will emit NEON code for those small vectors, > with unified rules. > > - Some notes for new rules > > 1) Since new rules are unique and it makes no sense to set different > "ins_cost", we turn to use the default cost. > > 2) By default we use "pipe_slow" for matching rules in aarch64_vector.ad > now. 
Hence, many SIMD pipeline classes at aarch64.ad become unused and > can be removed. > > 3) Suffixes '_le128b/_gt128b' and '_neon/_sve' are appended in the > matching rule names if needed. > a) 'le128b' means the vector length is less than or equal to 128 bits. > This rule can be matched on both NEON and 128-bit SVE. > b) 'gt128b' means the vector length is greater than 128 bits. This rule > can only be matched on SVE. > c) 'neon' means this rule can only be matched on NEON, i.e. the > generated instruction is not better than those in 128-bit SVE. > d) 'sve' means this rule is only matched on SVE for all possible vector > length, i.e. not limited to gt128b. > > Note-1: m4 file is not introduced because many duplications are highly > reduced now. > Note-2: We guess the code review for this big patch would probably take > some time and we may need to merge latest code from master branch from > time to time. We prefer to keep aarch64_neon/sve.ad and the > corresponding m4 files for easy comparison and review. Of course, they > will be finally removed after some solid reviews before integration. > Note-3: Several other minor refactorings are done in this patch, but we > cannot list all of them in the commit message. We have reviewed and > tested the rules carefully to guarantee the quality. > > **TESTING** > > 1) Cross compilations on arm32/s390/ppc/riscv passed. > 2) tier1~3 jtreg passed on both x64 and aarch64 machines. > 3) vector tests: all the test cases under the following directories can > pass on both NEON and SVE systems with max vector length 16/32/64 bytes. > > "test/hotspot/jtreg/compiler/vectorapi/" > "test/jdk/jdk/incubator/vector/" > "test/hotspot/jtreg/compiler/vectorization/" > > 4) Performance evaluation: we chose vector micro-benchmarks from > panama-vector:vectorIntrinsics [2] to evaluate the performance of this > patch.
We've tested *MaxVectorTests.java cases on one 128-bit SVE > platform and one NEON platform, and didn't see any visible regression > with NEON and SVE. We will continue to verify more cases on other > platforms with NEON and different SVE vector sizes. > > **BENEFITS** > > The number of matching rules is reduced to ~ **42%**. > before: 373 (aarch64_neon.ad) + 380 (aarch64_sve.ad) = 753 > after : 313 (aarch64_vector.ad) > > Code size for libjvm.so (release build) on aarch64 is reduced to ~ **96%**. > before: 25246528 B (commit 7905788e969) > after : 24208776 B (**nearly 1 MB reduction**) > > [1] http://cr.openjdk.java.net/~njian/8285790/JDK-8285790.pdf > [2] https://github.com/openjdk/panama-vector/tree/vectorIntrinsics/test/micro/org/openjdk/bench/jdk/incubator/vector/operation > > Co-Developed-by: Ningsheng Jian > Co-Developed-by: Eric Liu Hao Sun has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains four commits: - Add m4 file Add the corresponding M4 file - Add VM_Version flag to control NEON instruction generation Add VM_Version flag use_neon_for_vector() to control whether to generate NEON instructions for 128-bit vector operations. Currently only vector length is checked inside and it returns true for existing SVE cores. More specific things might be checked in the near future, e.g., the basic data type or SVE CPU model. Besides, new macro assembler helpers neon_vector_extend/narrow() are introduced to make the code clean. Note: AddReductionVF/D rules are updated so that SVE instructions are generated for 64/128-bit vector operations, because floating point reduction add instructions are supported directly in SVE. - Merge branch 'master' as of 7th-July into 8285790-merge-rules - 8285790: AArch64: Merge C2 NEON and SVE matching rules MOTIVATION This is a big refactoring patch of merging rules in aarch64_sve.ad and aarch64_neon.ad. The motivation can also be found at [1].
Currently AArch64 has aarch64_sve.ad and aarch64_neon.ad to support SVE and NEON codegen respectively. 1) For SVE rules we use vReg operand to match VecA for an arbitrary length of vector type, when SVE is enabled; 2) For NEON rules we use vecX/vecD operands to match VecX/VecD for 128-bit/64-bit vectors, when SVE is not enabled. This separation looked clean at the time of introducing SVE support. However, there are two main drawbacks now. Drawback-1: NEON (Advanced SIMD) is the mandatory feature on AArch64 and SVE vector registers share the lower 128 bits with NEON registers. For some cases, even when SVE is enabled, we still prefer to match NEON rules and emit NEON instructions. Drawback-2: With more and more vector rules added to support VectorAPI, there are lots of rules in both two ad files with different predication conditions, e.g., different values of UseSVE or vector type/size. Examples can be found in [1]. These two drawbacks make the code less maintainable and increase the libjvm.so code size. KEY UPDATES In this patch, we mainly do two things, using generic vReg to match all NEON/SVE vector registers and merging NEON/SVE matching rules. Update-1: Use generic vReg to match all NEON/SVE vector registers Two different approaches were considered, and we prefer to use generic vector solution but keep VecA operand for all >128-bit vectors. See the last slide in [1]. All the changes lie in the AArch64 backend. 1) Some helpers are updated in aarch64.ad to enable generic vector on AArch64. See vector_ideal_reg(), pd_specialize_generic_vector_operand(), is_reg2reg_move() and is_generic_vector(). 2) Operand vecA is created to match VecA register, and vReg is updated to match VecA/D/X registers dynamically. With the introduction of generic vReg, difference in register types between NEON rules and SVE rules can be eliminated, which makes it easy to merge these rules. 
Update-2: Try to merge existing rules As updated in GensrcAdlc.gmk, new ad file, aarch64_vector.ad, is introduced to hold the grouped and merged matching rules. 1) Similar rules with difference in vector type/size can be merged into new rules, where different types and vector sizes are handled in the codegen part, e.g., vadd(). This resolves Drawback-2. 2) In most cases, we tend to emit NEON instructions for 128-bit vector operations on SVE platforms, e.g., vadd(). This resolves Drawback-1. It's important to note that there are some exceptions. Exception-1: For some rules, there are no direct NEON instructions, but exists simple SVE implementation due to newly added SVE ISA. Such rules include vloadconB, vmulL_neon, vminL_neon, vmaxL_neon, reduce_addF_le128b (4F case), reduce_and/or/xor_neon, reduce_minL_neon, reduce_maxL_neon, vcvtLtoF_neon, vcvtDtoI_neon, rearrange_HS_neon. Exception-2: Vector mask generation and operation rules are different because vector mask is stored in different types of registers between NEON and SVE, e.g., vmaskcmp_neon and vmask_truecount_neon rules. Exception-3: Shift right related rules are different because vector shift right instructions differ a bit between NEON and SVE. For these exceptions, we emit NEON or SVE code simply based on UseSVE options. MINOR UPDATES and CODE REFACTORING Since we've touched all lines of code during merging rules, we further do more minor updates and refactoring. 1. Reduce regmask bits Stack slot alignment is handled specially for scalable vector, which will firstly align to SlotsPerVecA, and then align to the real vector length. We should guarantee SlotsPerVecA is no bigger than the real vector length. Otherwise, unused stack space would be allocated. In AArch64 SVE, the vector length can be 128 to 2048 bits. However, SlotsPerVecA is currently set as 8, i.e. 8 * 32 = 256 bits. As a result, on a 128-bit SVE platform, the stack slot is aligned to 256 bits, leaving the half 128 bits unused. 
In this patch, we reduce SlotsPerVecA from 8 to 4. See the updates in register_aarch64.hpp, regmask.hpp and aarch64.ad (chunk1 and vectora_reg). 2. Refactor NEON/SVE vector op support check. Merge NEON and SVE vector supported check into one single function. To be consistent, SVE default size supported check now is relaxed from no less than 64 bits to the same condition as NEON's min_vector_size(), i.e. 4 bytes and 4/2 booleans are also supported. This should be fine, as we assume at least we will emit NEON code for those small vectors, with unified rules. 3. Some notes for new rules 1) Since new rules are unique and it makes no sense to set different "ins_cost", we turn to use the default cost. 2) By default we use "pipe_slow" for matching rules in aarch64_vector.ad now. Hence, many SIMD pipeline classes at aarch64.ad become unused and can be removed. 3) Suffixes '_le128b/_gt128b' and '_neon/_sve' are appended in the matching rule names if needed. a) 'le128b' means the vector length is less than or equal to 128 bits. This rule can be matched on both NEON and 128-bit SVE. b) 'gt128b' means the vector length is greater than 128 bits. This rule can only be matched on SVE. c) 'neon' means this rule can only be matched on NEON, i.e. the generated instruction is not better than those in 128-bit SVE. d) 'sve' means this rule is only matched on SVE for all possible vector length, i.e. not limited to gt128b. Note-1: m4 file is not introduced because many duplications are highly reduced now. Note-2: We guess the code review for this big patch would probably take some time and we may need to merge latest code from master branch from time to time. We prefer to keep aarch64_neon/sve.ad and the corresponding m4 files for easy comparison and review. Of course, they will be finally removed after some solid reviews before integration. Note-3: Several other minor refactorings are done in this patch, but we cannot list all of them in the commit message. 
We have reviewed and tested the rules carefully to guarantee the quality. TESTING 1) Cross compilations on arm32/s390/ppc/riscv passed. 2) tier1~3 jtreg passed on both x64 and aarch64 machines. 3) vector tests: all the test cases under the following directories pass on both NEON and SVE systems with max vector length 16/32/64 bytes. "test/hotspot/jtreg/compiler/vectorapi/" "test/jdk/jdk/incubator/vector/" "test/hotspot/jtreg/compiler/vectorization/" 4) Performance evaluation: we chose vector micro-benchmarks from panama-vector:vectorIntrinsics [2] to evaluate the performance of this patch. We've tested *MaxVectorTests.java cases on one 128-bit SVE platform and one NEON platform, and didn't see any visible regression with NEON and SVE. We will continue to verify more cases on other platforms with NEON and different SVE vector sizes. BENEFITS The number of matching rules is reduced to ~42%. before: 373 (aarch64_neon.ad) + 380 (aarch64_sve.ad) = 753 after : 313 (aarch64_vector.ad) Code size for libjvm.so (release build) on aarch64 is reduced to ~96%.
before: 25246528 B (commit 7905788e969) after : 24208776 B (nearly 1 MB reduction) [1] http://cr.openjdk.java.net/~njian/8285790/JDK-8285790.pdf [2] https://github.com/openjdk/panama-vector/tree/vectorIntrinsics/test/micro/org/openjdk/bench/jdk/incubator/vector/operation Co-Developed-by: Ningsheng Jian Co-Developed-by: Eric Liu ------------- Changes: https://git.openjdk.org/jdk/pull/9346/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=9346&range=01 Stats: 12309 lines in 15 files changed: 11604 ins; 582 del; 123 mod Patch: https://git.openjdk.org/jdk/pull/9346.diff Fetch: git fetch https://git.openjdk.org/jdk pull/9346/head:pull/9346 PR: https://git.openjdk.org/jdk/pull/9346 From thartmann at openjdk.org Fri Jul 22 05:38:01 2022 From: thartmann at openjdk.org (Tobias Hartmann) Date: Fri, 22 Jul 2022 05:38:01 GMT Subject: RFR: 8290705: StringConcat::validate_mem_flow asserts with "unexpected user: StoreI" [v2] In-Reply-To: References: Message-ID: > C2's string concatenation optimization (`OptimizeStringConcat`) does not correctly handle side effecting instructions between StringBuffer Allocate/Initialize and the call to the constructor. In the failing test, see `SideEffectBeforeConstructor::test`, a `result` field is incremented just before the constructor is invoked. The string concatenation optimization still merges the allocation, constructor and `toString` calls and incorrectly re-wires the store to before the concatenation. As a result, passing `null` to the constructor will incorrectly increment the field before throwing a NullPointerException. With a debug build, we hit an assert in `StringConcat::validate_mem_flow` due to the unexpected field store. This is an old bug and an extreme edge case as javac would not generate such code. 
> > The following comment suggests that this case should be covered by `StringConcat::validate_control_flow()`: > https://github.com/openjdk/jdk/blob/3582fd9e93d9733c6fdf1f3848e0a093d44f6865/src/hotspot/share/opto/stringopts.cpp#L834-L838 > > However, the control flow analysis does not catch this case. I added the missing check. > > Thanks, > Tobias Tobias Hartmann has updated the pull request incrementally with one additional commit since the last revision: Modified debug printing code ------------- Changes: - all: https://git.openjdk.org/jdk/pull/9589/files - new: https://git.openjdk.org/jdk/pull/9589/files/e1de6874..d299e2ea Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=9589&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=9589&range=00-01 Stats: 2 lines in 1 file changed: 1 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/9589.diff Fetch: git fetch https://git.openjdk.org/jdk pull/9589/head:pull/9589 PR: https://git.openjdk.org/jdk/pull/9589 From haosun at openjdk.org Fri Jul 22 05:38:04 2022 From: haosun at openjdk.org (Hao Sun) Date: Fri, 22 Jul 2022 05:38:04 GMT Subject: RFR: 8285790: AArch64: Merge C2 NEON and SVE matching rules In-Reply-To: <_ZcvC_kGQPJuD-WAwAocrGLlbHEF66ww-FgDhUl6av4=.06b039a0-ed2c-42cb-9e1e-c8f32c712789@github.com> References: <_l-poEFY80LRmKZyRhpxvSohm6nLv_ruaO1_WzKmTlQ=.9faff21d-d67c-42e0-8de7-be2ca9397b88@github.com> <_ZcvC_kGQPJuD-WAwAocrGLlbHEF66ww-FgDhUl6av4=.06b039a0-ed2c-42cb-9e1e-c8f32c712789@github.com> Message-ID: On Mon, 4 Jul 2022 12:51:22 GMT, Andrew Haley wrote: >> Aha! I was looking forward to this. >> >> On 7/1/22 11:46, Hao Sun wrote: >> > Note-1: m4 file is not introduced because many duplications are highly >> > reduced now. >> >> Yes, but there's still a lot of duplications. I'll make a few examples >> of where you should make simple changes that will usefully increase the >> level of abstraction. That will be a start. > >> @theRealAph Thanks for your comment. Yes. 
There are still duplicate code. I can easily list several ones, such as the reduce-and/or/xor, vector shift ops and several reg with imm rules. We're open to keep m4 file. >> >> But I would suggest that we may put our attention firstly on 1) our implementation on generic vector registers and 2) the merged rules (in particular those we share the codegen for NEON only platform and 128-bit vector ops on SVE platform). After that we may discuss whether to use m4 file and how to implement it if needed. > > We can do both: there's no sense in which one excludes the other, and we have time. > > However, just putting aside for a moment the lack of useful abstraction mechanisms, I note that there's a lot of code like this: > > > if (length_in_bytes <= 16) { > // ... Neon > } else { > assert(UseSVE > 0, "must be sve"); > // ... SVE > } > > > which is to say, there's an implicit assumption that if an operation can be done with Neon it will be, and SVE will only be used if not. What is the justification for that assumption? Hi @theRealAph , three commits are uploaded. Could you help take a look at them when you have spare time? Thanks. commit-1: merge the `master` branch as of 7th-July. Of course, more merges would be done during the follow-up review process. commit-2: add one `VM_Version` flag to control whether to generate NEON instructions for 64/128-bit vector operations on SVE. commit-3: add the m4 file. We tried to make as many abstractions as possible in the m4 file. Before m4 is introduced, `aarch64_vector.ad` file is ~5k LOC. And now with this commit, we use ~4k LOC `aarch64_vector_ad.m4` file, i.e. only ~20% reduction. I personally think the reduction is not that big, compared to the reductions between `aarch64_neon/sve_ad.m4` and `aarch64_neon/sve.ad`. 
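The NEON-versus-SVE choice discussed above can be modeled outside HotSpot. The Java sketch below is an illustration only: `preferNeonOnSve` is a hypothetical stand-in for the `VM_Version` flag mentioned in commit-2 (its real name is not given in this message), the enum and method names are invented, and the exceptions listed in the PR description (rules with no direct NEON instruction, mask rules, shift-right rules) are ignored.

```java
public class VectorIsaChoiceSketch {
    enum Isa { NEON, SVE }

    // lengthInBytes: vector length of the operation
    // useSVE: models the UseSVE option (0 means SVE is not available)
    // preferNeonOnSve: hypothetical flag, true means emit NEON for
    //                  64/128-bit operations even when SVE is enabled
    static Isa choose(int lengthInBytes, int useSVE, boolean preferNeonOnSve) {
        if (lengthInBytes > 16) {
            // Vectors wider than 128 bits can only be handled by SVE.
            if (useSVE <= 0) {
                throw new IllegalStateException("must be sve");
            }
            return Isa.SVE;
        }
        // 64/128-bit vectors: NEON is always available on AArch64.
        if (useSVE > 0 && !preferNeonOnSve) {
            return Isa.SVE;
        }
        return Isa.NEON;
    }

    public static void main(String[] args) {
        System.out.println(choose(16, 0, true));
        System.out.println(choose(16, 1, false));
        System.out.println(choose(32, 1, true));
    }
}
```

With the flag set, the sketch reproduces the implicit assumption in the quoted `if (length_in_bytes <= 16)` pattern; with it cleared, 128-bit operations on an SVE machine go to SVE instead.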
------------- PR: https://git.openjdk.org/jdk/pull/9346 From thartmann at openjdk.org Fri Jul 22 05:38:06 2022 From: thartmann at openjdk.org (Tobias Hartmann) Date: Fri, 22 Jul 2022 05:38:06 GMT Subject: RFR: 8290705: StringConcat::validate_mem_flow asserts with "unexpected user: StoreI" [v2] In-Reply-To: References: Message-ID: On Thu, 21 Jul 2022 16:43:11 GMT, Vladimir Kozlov wrote: >> Tobias Hartmann has updated the pull request incrementally with one additional commit since the last revision: >> >> Modified debug printing code > > src/hotspot/share/opto/stringopts.cpp line 1032: > >> 1030: if (PrintOptimizeStringConcat) { >> 1031: tty->print_cr("unexpected control use of Initialize"); >> 1032: use->dump(2); > > What output of `dump(2)` you got in your case? It could be more than needed if `use` has a lot of inputs. > How about next to output only interesting info?: > > ptr->in(0)->dump(); // Initialize node > use->dump(1); > tty->cr(); It prints: considering toString call in SideEffectBeforeConstructor::test @ bci:16 unexpected control use of Initialize 49 ConI === 0 [[ 50 ]] #int:1 48 LoadI === _ 7 47 [[ 50 ]] @java/lang/Class:exact+112 *, name=result, idx=11; #int !jvms: SideEffectBeforeConstructor::test @ bci:4 46 ConL === 0 [[ 47 ]] #long:112 45 ConP === 0 [[ 47 47 ]] #java/lang/Class:exact * Oop:java/lang/Class:exact * 3 Start === 3 0 [[ 3 5 6 7 8 9 10 ]] #{0:control, 1:abIO, 2:memory, 3:rawptr:BotPTR, 4:return_address, 5:java/lang/String:exact *} 38 Initialize === 30 1 41 1 1 37 [[ 39 40 ]] !jvms: SideEffectBeforeConstructor::test @ bci:0 50 AddI === _ 48 49 [[ 52 ]] !jvms: SideEffectBeforeConstructor::test @ bci:8 47 AddP === _ 45 45 46 [[ 48 52 ]] Oop:java/lang/Class:exact+112 * !jvms: SideEffectBeforeConstructor::test @ bci:4 7 Parm === 3 [[ 52 48 41 41 24 25 72 41 41 41 ]] Memory Memory: @BotPTR *+bot, idx=Bot; !jvms: SideEffectBeforeConstructor::test @ bci:-1 39 Proj === 38 [[ 53 42 52 ]] #0 !jvms: SideEffectBeforeConstructor::test @ bci:0 52 StoreI 
=== 39 7 47 50 [[ 24 ]] @java/lang/Class:exact+112 *, name=result, idx=11; Memory: @java/lang/Class:exact+112 *, name=result, idx=11; !jvms: SideEffectBeforeConstructor::test @ bci:9 You are right, `dump(1)` is sufficient: considering toString call in SideEffectBeforeConstructor::test @ bci:16 unexpected control use of Initialize 38 Initialize === 30 1 41 1 1 37 [[ 39 40 ]] !jvms: SideEffectBeforeConstructor::test @ bci:0 50 AddI === _ 48 49 [[ 52 ]] !jvms: SideEffectBeforeConstructor::test @ bci:8 47 AddP === _ 45 45 46 [[ 48 52 ]] Oop:java/lang/Class:exact+112 * !jvms: SideEffectBeforeConstructor::test @ bci:4 7 Parm === 3 [[ 52 48 41 41 24 25 72 41 41 41 ]] Memory Memory: @BotPTR *+bot, idx=Bot; !jvms: SideEffectBeforeConstructor::test @ bci:-1 39 Proj === 38 [[ 53 42 52 ]] #0 !jvms: SideEffectBeforeConstructor::test @ bci:0 52 StoreI === 39 7 47 50 [[ 24 ]] @java/lang/Class:exact+112 *, name=result, idx=11; Memory: @java/lang/Class:exact+112 *, name=result, idx=11; !jvms: SideEffectBeforeConstructor::test @ bci:9 The `tty->cr();` is not needed because it's printed by this code just below: https://github.com/openjdk/jdk/blob/3582fd9e93d9733c6fdf1f3848e0a093d44f6865/src/hotspot/share/opto/stringopts.cpp#L1075-L1077 ------------- PR: https://git.openjdk.org/jdk/pull/9589 From denghui.ddh at alibaba-inc.com Fri Jul 22 05:57:21 2022 From: denghui.ddh at alibaba-inc.com (Denghui Dong) Date: Fri, 22 Jul 2022 13:57:21 +0800 Subject: Re: Crash due to bad oop map In-Reply-To: References: <4f48521a-31a3-4a51-9c78-9346c6f968b9.denghui.ddh@alibaba-inc.com>, Message-ID: <892516fc-6123-4dec-96f4-a7f7ab3496dd.denghui.ddh@alibaba-inc.com> Hi @dean-long, I tried, and it works. Another way is to skip the processing of MachTemp when building the oop map. Do you think this is the correct way? Denghui Dong ------------------------------------------------------------------ From:dean.long Send Time:2022-07-22 (Fri)
11:28 To:undefined ; undefined Subject:Re: Crash due to bad oop map Hi Denghui Dong. This looks like JDK-8051805 (https://bugs.openjdk.org/browse/JDK-8051805). As you discovered, it is triggered when RegN is used as a TEMP. As a work-around, have you tried using rRegI instead of rRegN for your TEMPs? dl On 7/21/22 12:11 AM, Denghui Dong wrote: > Hi team, > > We encountered a crash due to a bad oop map. > The following steps can quickly reproduce the problem on JDK master on > Linux x64: > 1. Modify z_x86_64.ad and add 3 temporary registers of type rRegN to > zLoadP(If this change is illegal, please let me know) > ``` > diff --git a/src/hotspot/cpu/x86/gc/z/z_x86_64.ad > b/src/hotspot/cpu/x86/gc/z/z_x86_64.ad > index f3e19b41733..129c93d3da8 100644 > --- a/src/hotspot/cpu/x86/gc/z/z_x86_64.ad > +++ b/src/hotspot/cpu/x86/gc/z/z_x86_64.ad > @@ -63,11 +63,11 @@ static void z_load_barrier_cmpxchg(MacroAssembler& > _masm, const MachNode* node, > %} > // Load Pointer > -instruct zLoadP(rRegP dst, memory mem, rFlagsReg cr) > +instruct zLoadP(rRegP dst, memory mem, rRegN tmp1, rRegN tmp2, rRegN > tmp3, rFlagsReg cr) > %{ > predicate(UseZGC && n->as_Load()->barrier_data() != 0); > match(Set dst (LoadP mem)); > - effect(KILL cr, TEMP dst); > + effect(KILL cr, TEMP dst, TEMP tmp1, TEMP tmp2, TEMP tmp3); > ins_cost(125); > ``` > 2. Run the following code > (./build/linux-x86_64-server-fastdebug/images/jdk/bin/java -XX:-Inline > -XX:CompileCommand=compileonly,Test::call -XX:+PrintAssembly > -XX:PrintIdealGraphLevel=2 -XX:PrintIdealGraphFile=call.xml -XX:+UseZGC > -XX:+UseNewCode -XX:+PrintGC -Xmx500m -Xms500m -XX:-Inline Test) > ``` > import java.util.concurrent.locks.Lock; > import java.util.concurrent.locks.ReentrantLock; > class Test { > private Lock lock = new ReentrantLock(); > public static void main(String... 
args) throws Exception { > new Thread(() -> { > while (true) { > byte[] b = new byte[10 * 1024 * 1024]; > } > }).start(); > while (true) { > new Test().call(() -> { new Object(); }); > } > } > public void call(Runnable cb) { > lock.lock(); > try { > cb.run(); > } finally { > lock.unlock(); > } > } > } > ``` > > It can be observed through the IGV Final Code that zLoadP(202, 204) and > a MachTemp(81) it uses are placed in different blocks. When processing > CallStaticJavaDirect(74) in the c2 buildOopMap step, MachTemp(81) is > considered to be alive. Because the register type specified by the above > modification is rRegN (if it is specified as rRegI or rRegP, no crash > will occur), so the type of MachTemp(81) is narrowOop, which will > eventually be added to the oopMap of CallStaticJavaDirect(74). But > logically, zLoadP doesn't depend on the original value of MachTemp(81), > so I don't think it should be added to oopMap. > > I put the ideal graph file to http://cr.openjdk.java.net/~ddong/call.xml > > > I think the problem may lie in two places: > 1. gcm and regAlloc do not put MachTemp(81) in the correct location > 2. buildOopMap should not treat MachTemp as Oop because its original > value is meaningless > > I'm not a JIT expert and sorry if there is anything unnatural or > non-standard in the above description. > > Any input is appreciated. > > Denghui Dong From jwaters at openjdk.org Fri Jul 22 07:33:59 2022 From: jwaters at openjdk.org (Julian Waters) Date: Fri, 22 Jul 2022 07:33:59 GMT Subject: RFR: 8290834: Improve potentially confusing documentation on collection of profiling information [v2] In-Reply-To: References: Message-ID: > Documentation on the MethodData object incorrectly states that it is used when profiling in tiers 0 and 1, when it only does so for tier 0 (Interpreter), while tier 1 (Fully optimizing C1) does not collect any profile data at all.
Additionally, the description for the different execution tiers is slightly misleading, as it seems to imply that MethodData is used in tier 3 as well, when profiling with C1 is done through ciMethodData instead. This cleanup attempts to slightly better clarify how profiling is tied together between the Interpreter and C1, explain what MDO is an abbreviation for (MethodData object), and corrects the documentation for MethodData as well. Julian Waters has updated the pull request incrementally with one additional commit since the last revision: Correct comment with respect to review ------------- Changes: - all: https://git.openjdk.org/jdk/pull/9598/files - new: https://git.openjdk.org/jdk/pull/9598/files/57d60a68..7e485ac4 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=9598&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=9598&range=00-01 Stats: 2 lines in 1 file changed: 1 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/9598.diff Fetch: git fetch https://git.openjdk.org/jdk pull/9598/head:pull/9598 PR: https://git.openjdk.org/jdk/pull/9598 From jwaters at openjdk.org Fri Jul 22 07:34:02 2022 From: jwaters at openjdk.org (Julian Waters) Date: Fri, 22 Jul 2022 07:34:02 GMT Subject: RFR: 8290834: Improve potentially confusing documentation on collection of profiling information [v2] In-Reply-To: <1RRGLN3oyV27P-eCojRAE4rVVlFqddZfuVKOTHZfJpI=.828857c9-41b4-4299-8796-3a3eca4c1735@github.com> References: <1RRGLN3oyV27P-eCojRAE4rVVlFqddZfuVKOTHZfJpI=.828857c9-41b4-4299-8796-3a3eca4c1735@github.com> Message-ID: On Fri, 22 Jul 2022 01:13:50 GMT, Dean Long wrote: >> Julian Waters has updated the pull request incrementally with one additional commit since the last revision: >> >> Correct comment with respect to review > > src/hotspot/share/oops/methodData.hpp line 41: > >> 39: >> 40: // The MethodData object collects counts and other profile information >> 41: // during zeroth-tier (interpreter) execution. 
> > This should probably say levels 0 and 3. Updated, thanks for the correction ------------- PR: https://git.openjdk.org/jdk/pull/9598 From dean.long at oracle.com Fri Jul 22 07:34:27 2022 From: dean.long at oracle.com (dean.long at oracle.com) Date: Fri, 22 Jul 2022 00:34:27 -0700 Subject: [External] : Re: Crash due to bad oop map In-Reply-To: <892516fc-6123-4dec-96f4-a7f7ab3496dd.denghui.ddh@alibaba-inc.com> References: <4f48521a-31a3-4a51-9c78-9346c6f968b9.denghui.ddh@alibaba-inc.com> <892516fc-6123-4dec-96f4-a7f7ab3496dd.denghui.ddh@alibaba-inc.com> Message-ID: <304d5938-0d05-59b2-d87d-598d768ac140@oracle.com> On 7/21/22 10:57 PM, Denghui Dong wrote: > Hi @dean-long, > > I tried, and it works. > > Another way is to skip the processing of MachTemp when building the oop > map. Do you think this is the correct way? That might not be the best fix. It sounds like a special case to skip MachTemp for oop maps. The fact that the MachTemp gets scheduled into a different block seems like the real problem. dl > Denghui Dong > > ------------------------------------------------------------------ > From:dean.long > Send Time:2022-07-22 (Fri) 11:28 > To:undefined ; undefined > Subject:Re: Crash due to bad oop map > > Hi Denghui Dong. This looks like JDK-8051805 > (https://bugs.openjdk.org/browse/JDK-8051805). As you discovered, > it is > triggered when RegN is used as a TEMP. As a work-around, have you > tried > using rRegI instead of rRegN for your TEMPs? > > dl > > On 7/21/22 12:11 AM, Denghui Dong wrote: > > Hi team, > > > > We encountered a crash due to a bad oop map. > > The following steps can quickly reproduce the problem on JDK > master on > > Linux x64: > > 1.
Modify z_x86_64.ad and add 3 temporary registers of type rRegN to > > zLoadP(If this change is illegal, please let me know) > > ``` > > diff --git a/src/hotspot/cpu/x86/gc/z/z_x86_64.ad > > b/src/hotspot/cpu/x86/gc/z/z_x86_64.ad > > index f3e19b41733..129c93d3da8 100644 > > --- a/src/hotspot/cpu/x86/gc/z/z_x86_64.ad > > +++ b/src/hotspot/cpu/x86/gc/z/z_x86_64.ad > > @@ -63,11 +63,11 @@ static void > z_load_barrier_cmpxchg(MacroAssembler& > > _masm, const MachNode* node, > > %} > > // Load Pointer > > -instruct zLoadP(rRegP dst, memory mem, rFlagsReg cr) > > +instruct zLoadP(rRegP dst, memory mem, rRegN tmp1, rRegN tmp2, > rRegN > > tmp3, rFlagsReg cr) > > %{ > > predicate(UseZGC && n->as_Load()->barrier_data() != 0); > > match(Set dst (LoadP mem)); > > - effect(KILL cr, TEMP dst); > > + effect(KILL cr, TEMP dst, TEMP tmp1, TEMP tmp2, TEMP tmp3); > > ins_cost(125); > > ``` > > 2. Run the following code > > (./build/linux-x86_64-server-fastdebug/images/jdk/bin/java > -XX:-Inline > > -XX:CompileCommand=compileonly,Test::call -XX:+PrintAssembly > > -XX:PrintIdealGraphLevel=2 -XX:PrintIdealGraphFile=call.xml > -XX:+UseZGC > > -XX:+UseNewCode -XX:+PrintGC -Xmx500m -Xms500m -XX:-Inline Test) > > ``` > > import java.util.concurrent.locks.Lock; > > import java.util.concurrent.locks.ReentrantLock; > > class Test { > > private Lock lock = new ReentrantLock(); > > public static void main(String... args) throws Exception { > > new Thread(() -> { > > while (true) { > > byte[] b = new byte[10 * 1024 * 1024]; > > } > > }).start(); > > while (true) { > > new Test().call(() -> { new Object(); }); > > } > > } > > public void call(Runnable cb) { > > lock.lock(); > > try { > > cb.run(); > > } finally { > > lock.unlock(); > > } > >
} > > } > > ``` > > > > It can be observed through the IGV Final Code that zLoadP(202, > 204) and > > a MachTemp(81) it uses are placed in different blocks. When > processing > > CallStaticJavaDirect(74) in the c2 buildOopMap step, MachTemp(81) is > > considered to be alive. Because the register type specified by > the above > > modification is rRegN (if it is specified as rRegI or rRegP, no > crash > > will occur), so the type of MachTemp(81) is narrowOop, which will > > eventually be added to the oopMap of CallStaticJavaDirect(74). But > > logically, zLoadP doesn't depend on the original value of > MachTemp(81), > > so I don't think it should be added to oopMap. > > > > I put the ideal graph file to > http://cr.openjdk.java.net/~ddong/call.xml > > > > > > I think the problem may lie in two places: > > 1. gcm and regAlloc do not put MachTemp(81) in the correct location > > 2. buildOopMap should not treat MachTemp as Oop because its original > > value is meaningless > > > > I'm not a JIT expert and sorry if there is anything unnatural or > > non-standard in the above description. > > > > Any input is appreciated. > > > > Denghui Dong From jwaters at openjdk.org Fri Jul 22 07:43:11 2022 From: jwaters at openjdk.org (Julian Waters) Date: Fri, 22 Jul 2022 07:43:11 GMT Subject: RFR: 8290834: Improve potentially confusing documentation on collection of profiling information [v2] In-Reply-To: <1RRGLN3oyV27P-eCojRAE4rVVlFqddZfuVKOTHZfJpI=.828857c9-41b4-4299-8796-3a3eca4c1735@github.com> References: <1RRGLN3oyV27P-eCojRAE4rVVlFqddZfuVKOTHZfJpI=.828857c9-41b4-4299-8796-3a3eca4c1735@github.com> Message-ID: On Fri, 22 Jul 2022 01:11:04 GMT, Dean Long wrote: >> Julian Waters has updated the pull request incrementally with one additional commit since the last revision: >> >> Correct comment with respect to review > > src/hotspot/share/compiler/compilationPolicy.hpp line 48: > >> 46: * all data from the MDO will be loaded into the ciMethodData when it is first created. 
>> 47: * (See ciMethod::method_data() in ciMethod.cpp for more details) >> 48: * > > The ciMethodData is just a temporary snapshot. Updates to the profiling data is still done through the MethodData. The only place I could find a MethodData being created is in the method // Build a MethodData* object to hold information about this method // collected in the interpreter. void Method::build_interpreter_method_data(const methodHandle& method, TRAPS) which presumably constructs a MethodData for profiling in the Interpreter. Is there a different area where it's created (when profiling with C1) that I missed? ------------- PR: https://git.openjdk.org/jdk/pull/9598 From aph at openjdk.org Fri Jul 22 08:19:04 2022 From: aph at openjdk.org (Andrew Haley) Date: Fri, 22 Jul 2022 08:19:04 GMT Subject: RFR: 8287393: AArch64: Remove trampoline_call1 [v2] In-Reply-To: References: Message-ID: On Thu, 21 Jul 2022 16:37:03 GMT, Evgeny Astigeevich wrote: >> `trampoline_call` can do dummy code generation to calculate the size of C2 generated code. This is done in the output phase. In [src/hotspot/cpu/aarch64/sharedRuntime_aarch64.cpp#L1042](https://github.com/openjdk/jdk/blob/e0d361cea91d3dd1450aece73f660b4abb7ce5fa/src/hotspot/cpu/aarch64/sharedRuntime_aarch64.cpp#L1042) Loom code needed to generate a trampoline call outside of C2 and without the output phase. This caused test crashes. The project Loom added `trampoline_call1` to workaround the crashes. >> >> This PR improves detection of C2 output phase which makes `trampoline_call1` redundant. >> >> Tested the fastdebug/release builds: >> - `'gtest`: Passed >> - `tier1`...`tier2`: Passed > > Evgeny Astigeevich has updated the pull request incrementally with one additional commit since the last revision: > > Replace trampoline_call1 with trampoline_call src/hotspot/cpu/aarch64/macroAssembler_aarch64.cpp line 637: > 635: // code. 
> 636: PhaseOutput* phase_output = Compile::current()->output(); > 637: in_scratch_emit_size = Looks reasonable enough. The only change is to check for `Compile::current()->output()` being null, right? ------------- PR: https://git.openjdk.org/jdk/pull/9592 From dlong at openjdk.org Fri Jul 22 08:56:10 2022 From: dlong at openjdk.org (Dean Long) Date: Fri, 22 Jul 2022 08:56:10 GMT Subject: RFR: 8290834: Improve potentially confusing documentation on collection of profiling information [v2] In-Reply-To: References: Message-ID: On Fri, 22 Jul 2022 07:33:59 GMT, Julian Waters wrote: >> Documentation on the MethodData object incorrectly states that it is used when profiling in tiers 0 and 1, when it only does so for tier 0 (Interpreter), while tier 1 (Fully optimizing C1) does not collect any profile data at all. Additionally, the description for the different execution tiers is slightly misleading, as it seems to imply that MethodData is used in tier 3 as well, when profiling with C1 is done through ciMethodData instead. This cleanup attempts to slightly better clarify how profiling is tied together between the Interpreter and C1, explain what MDO is an abbreviation for (MethodData object), and corrects the documentation for MethodData as well. > > Julian Waters has updated the pull request incrementally with one additional commit since the last revision: > > Correct comment with respect to review C1 uses ciMethod::ensure_method_data(), which calls Method::build_interpreter_method_data(), to create an MDO if one wasn't already created by the interpreter. So the name build_interpreter_method_data() is a bit misleading, because C1 will use the same MDO as the interpreter. I also found a comment in c1_globals.hpp about C1UpdateMethodData that mentions tier1. I think the comment should be changed to say tier3. 
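As a quick reference for the tier numbers used in this thread, the Java sketch below encodes the five tiers of tiered compilation. It is an illustrative model, not HotSpot code: the enum, its constant names and the `updatesMethodData` field are invented, and the values follow the review comment earlier in the thread that the MethodData object is used at levels 0 and 3.

```java
public class TierSketch {
    enum Tier {
        INTERPRETER(0, true),          // interpreter, collects profile into the MDO
        C1_NO_PROFILING(1, false),     // fully optimizing C1, no profiling
        C1_LIMITED_PROFILE(2, false),  // C1 with invocation/backedge counters only
        C1_FULL_PROFILE(3, true),      // C1 with full profiling into the MDO
        C2(4, false);                  // C2 consumes the profile

        final int level;
        final boolean updatesMethodData;

        Tier(int level, boolean updatesMethodData) {
            this.level = level;
            this.updatesMethodData = updatesMethodData;
        }

        // Look up a tier by its numeric level.
        static Tier of(int level) {
            for (Tier t : values()) {
                if (t.level == level) {
                    return t;
                }
            }
            throw new IllegalArgumentException("unknown tier: " + level);
        }
    }

    public static void main(String[] args) {
        for (Tier t : Tier.values()) {
            System.out.println(t.level + " " + t + " updatesMethodData=" + t.updatesMethodData);
        }
    }
}
```

Read this way, a comment that ties MDO updates to "tier1" is suspicious, since tier 1 is the one C1 mode that does no profiling at all.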
------------- PR: https://git.openjdk.org/jdk/pull/9598 From duke at openjdk.org Fri Jul 22 09:47:25 2022 From: duke at openjdk.org (Bhavana-Kilambi) Date: Fri, 22 Jul 2022 09:47:25 GMT Subject: RFR: 8290730: compiler/vectorization/TestAutoVecIntMinMax.java failed with "IRViolationException: There were one or multiple IR rule failures." Message-ID: <2VU2x5lH3EHO00HwlqqOdVR-r1T_wIAABL-PUt2U5Uk=.e0b4e158-fa0f-4bea-88ca-e423c23d1dbf@github.com> The IR test - TestAutoVecIntMinMax.java was introduced in https://bugs.openjdk.org/browse/JDK-8288107 to test IR generation of MaxV and MinV nodes when the MinI/MaxI nodes are auto-vectorized. However, the corresponding vector ISA support for min/max on x64 machines is only available in SSE versions > 3 and AVX. The "@requires" annotation in the JTREG test has been modified to use the whitelisted flags instead. Deleting the entry for this JTREG test in test/hotspot/jtreg/ProblemList.txt. ------------- Commit messages: - 8290730: compiler/vectorization/TestAutoVecIntMinMax.java failed with "IRViolationException: There were one or multiple IR rule failures." Changes: https://git.openjdk.org/jdk/pull/9610/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=9610&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8290730 Stats: 4 lines in 2 files changed: 0 ins; 2 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/9610.diff Fetch: git fetch https://git.openjdk.org/jdk pull/9610/head:pull/9610 PR: https://git.openjdk.org/jdk/pull/9610 From jiefu at openjdk.org Fri Jul 22 11:31:05 2022 From: jiefu at openjdk.org (Jie Fu) Date: Fri, 22 Jul 2022 11:31:05 GMT Subject: RFR: 8290730: compiler/vectorization/TestAutoVecIntMinMax.java failed with "IRViolationException: There were one or multiple IR rule failures."
In-Reply-To: <2VU2x5lH3EHO00HwlqqOdVR-r1T_wIAABL-PUt2U5Uk=.e0b4e158-fa0f-4bea-88ca-e423c23d1dbf@github.com> References: <2VU2x5lH3EHO00HwlqqOdVR-r1T_wIAABL-PUt2U5Uk=.e0b4e158-fa0f-4bea-88ca-e423c23d1dbf@github.com> Message-ID: On Fri, 22 Jul 2022 09:40:37 GMT, Bhavana-Kilambi wrote: > The IR test - TestAutoVecIntMinMax.java was introduced in https://bugs.openjdk.org/browse/JDK-8288107 to test IR generation of MaxV and MinV nodes when the MinI/MaxI nodes are auto-vectorized. > However, the corresponding vector ISA support for min/max on x64 machines is only available in SSE versions > 3 and AVX. > The "@requires" annotation in the JTREG test has been modified to use the whitelisted flags instead. > Deleting the entry for this JTREG test in test/hotspot/jtreg/ProblemList.txt. LGTM ------------- Marked as reviewed by jiefu (Reviewer). PR: https://git.openjdk.org/jdk/pull/9610 From jwaters at openjdk.org Fri Jul 22 11:37:04 2022 From: jwaters at openjdk.org (Julian Waters) Date: Fri, 22 Jul 2022 11:37:04 GMT Subject: RFR: 8290834: Improve potentially confusing documentation on collection of profiling information [v3] In-Reply-To: References: Message-ID: <2kb66TqyzzMBV6rDkuLvzC4rXegtF6mm_xYT4TfgcD8=.ec45036a-2227-416a-bba6-25cb9f959893@github.com> > Documentation on the MethodData object incorrectly states that it is used when profiling in tiers 0 and 1, when it only does so for tier 0 (Interpreter), while tier 1 (Fully optimizing C1) does not collect any profile data at all. Additionally, the description for the different execution tiers is slightly misleading, as it seems to imply that MethodData is used in tier 3 as well, when profiling with C1 is done through ciMethodData instead.
This cleanup attempts to slightly better clarify how profiling is tied together between the Interpreter and C1, explain what MDO is an abbreviation for (MethodData object), and corrects the documentation for MethodData as well. Julian Waters has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains six additional commits since the last revision: - Merge branch 'profile' of https://github.com/TheShermanTanker/jdk into profile - Correct comment with respect to review - Update compilationPolicy.hpp - Minor comment cleanup - Merge remote-tracking branch 'upstream/master' into profile - Better clarify documentation ------------- Changes: - all: https://git.openjdk.org/jdk/pull/9598/files - new: https://git.openjdk.org/jdk/pull/9598/files/7e485ac4..fba6dd27 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=9598&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=9598&range=01-02 Stats: 318 lines in 45 files changed: 213 ins; 20 del; 85 mod Patch: https://git.openjdk.org/jdk/pull/9598.diff Fetch: git fetch https://git.openjdk.org/jdk pull/9598/head:pull/9598 PR: https://git.openjdk.org/jdk/pull/9598 From jwaters at openjdk.org Fri Jul 22 11:45:05 2022 From: jwaters at openjdk.org (Julian Waters) Date: Fri, 22 Jul 2022 11:45:05 GMT Subject: RFR: 8290834: Improve potentially confusing documentation on collection of profiling information [v4] In-Reply-To: References: Message-ID: > Documentation on the MethodData object incorrectly states that it is used when profiling in tiers 0 and 1, when it only does so for tier 0 (Interpreter), while tier 1 (Fully optimizing C1) does not collect any profile data at all. Additionally, the description for the different execution tiers is slightly misleading, as it seems to imply that MethodData is used in tier 3 as well, when profiling with C1 is done through ciMethodData instead. 
This cleanup attempts to slightly better clarify how profiling is tied together between the Interpreter and C1, explain what MDO is an abbreviation for (MethodData object), and corrects the documentation for MethodData as well. Julian Waters has updated the pull request incrementally with one additional commit since the last revision: Rectify incorrect Tier 1 message for C1UpdateMethodData ------------- Changes: - all: https://git.openjdk.org/jdk/pull/9598/files - new: https://git.openjdk.org/jdk/pull/9598/files/fba6dd27..f6dc3526 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=9598&range=03 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=9598&range=02-03 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/9598.diff Fetch: git fetch https://git.openjdk.org/jdk pull/9598/head:pull/9598 PR: https://git.openjdk.org/jdk/pull/9598 From jwaters at openjdk.org Fri Jul 22 11:47:09 2022 From: jwaters at openjdk.org (Julian Waters) Date: Fri, 22 Jul 2022 11:47:09 GMT Subject: RFR: 8290834: Improve potentially confusing documentation on collection of profiling information [v5] In-Reply-To: References: Message-ID: > Documentation on the MethodData object incorrectly states that it is used when profiling in tiers 0 and 1, when it only does so for tier 0 (Interpreter), while tier 1 (Fully optimizing C1) does not collect any profile data at all. Additionally, the description for the different execution tiers is slightly misleading, as it seems to imply that MethodData is used in tier 3 as well, when profiling with C1 is done through ciMethodData instead. This cleanup attempts to slightly better clarify how profiling is tied together between the Interpreter and C1, explain what MDO is an abbreviation for (MethodData object), and corrects the documentation for MethodData as well. 
Julian Waters has updated the pull request incrementally with one additional commit since the last revision: Quick fix ------------- Changes: - all: https://git.openjdk.org/jdk/pull/9598/files - new: https://git.openjdk.org/jdk/pull/9598/files/f6dc3526..bf838048 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=9598&range=04 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=9598&range=03-04 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/9598.diff Fetch: git fetch https://git.openjdk.org/jdk pull/9598/head:pull/9598 PR: https://git.openjdk.org/jdk/pull/9598 From duke at openjdk.org Fri Jul 22 11:57:08 2022 From: duke at openjdk.org (Bhavana-Kilambi) Date: Fri, 22 Jul 2022 11:57:08 GMT Subject: RFR: 8290730: compiler/vectorization/TestAutoVecIntMinMax.java failed with "IRViolationException: There were one or multiple IR rule failures." In-Reply-To: <2VU2x5lH3EHO00HwlqqOdVR-r1T_wIAABL-PUt2U5Uk=.e0b4e158-fa0f-4bea-88ca-e423c23d1dbf@github.com> References: <2VU2x5lH3EHO00HwlqqOdVR-r1T_wIAABL-PUt2U5Uk=.e0b4e158-fa0f-4bea-88ca-e423c23d1dbf@github.com> Message-ID: On Fri, 22 Jul 2022 09:40:37 GMT, Bhavana-Kilambi wrote: > ? "IRViolationException: There were one or multiple IR rule failures." > > The IR test - TestAutoVecIntMinMax.java was introduced in https://bugs.openjdk.org/browse/JDK-8288107 to test IR generation of MaxV and MinV nodes when the MinI/MaxI nodes are auto-vectorized. > However, the corresponding vector ISA support for min/max on x64 machines is only available in SSE versions > 3 and AVX. > The "@requires" annotation in the JTREG test has been modified to use the whitelisted flags instead. > Deleting the entry for this JTREG test in test/hotspot/jtreg/ProblemList.txt. There are a couple of macos failures due to "wget". Should these runs be triggered again? 
------------- PR: https://git.openjdk.org/jdk/pull/9610 From thartmann at openjdk.org Fri Jul 22 12:25:07 2022 From: thartmann at openjdk.org (Tobias Hartmann) Date: Fri, 22 Jul 2022 12:25:07 GMT Subject: RFR: 8290730: compiler/vectorization/TestAutoVecIntMinMax.java failed with "IRViolationException: There were one or multiple IR rule failures." In-Reply-To: <2VU2x5lH3EHO00HwlqqOdVR-r1T_wIAABL-PUt2U5Uk=.e0b4e158-fa0f-4bea-88ca-e423c23d1dbf@github.com> References: <2VU2x5lH3EHO00HwlqqOdVR-r1T_wIAABL-PUt2U5Uk=.e0b4e158-fa0f-4bea-88ca-e423c23d1dbf@github.com> Message-ID: On Fri, 22 Jul 2022 09:40:37 GMT, Bhavana-Kilambi wrote: > ? "IRViolationException: There were one or multiple IR rule failures." > > The IR test - TestAutoVecIntMinMax.java was introduced in https://bugs.openjdk.org/browse/JDK-8288107 to test IR generation of MaxV and MinV nodes when the MinI/MaxI nodes are auto-vectorized. > However, the corresponding vector ISA support for min/max on x64 machines is only available in SSE versions > 3 and AVX. > The "@requires" annotation in the JTREG test has been modified to use the whitelisted flags instead. > Deleting the entry for this JTREG test in test/hotspot/jtreg/ProblemList.txt. I'll run testing and report back once it passed. ------------- PR: https://git.openjdk.org/jdk/pull/9610 From jwaters at openjdk.org Fri Jul 22 13:42:09 2022 From: jwaters at openjdk.org (Julian Waters) Date: Fri, 22 Jul 2022 13:42:09 GMT Subject: RFR: 8290834: Improve potentially confusing documentation on collection of profiling information [v6] In-Reply-To: References: Message-ID: > Documentation on the MethodData object incorrectly states that it is used when profiling in tiers 0 and 1, when it only does so for tier 0 (Interpreter), while tier 1 (Fully optimizing C1) does not collect any profile data at all. 
Additionally, the description for the different execution tiers is slightly misleading, as it seems to imply that MethodData is used in tier 3 as well, when profiling with C1 is done through ciMethodData instead. This cleanup attempts to slightly better clarify how profiling is tied together between the Interpreter and C1, explain what MDO is an abbreviation for (MethodData object), and corrects the documentation for MethodData as well. Julian Waters has updated the pull request incrementally with one additional commit since the last revision: New changes ------------- Changes: - all: https://git.openjdk.org/jdk/pull/9598/files - new: https://git.openjdk.org/jdk/pull/9598/files/bf838048..c48d67f3 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=9598&range=05 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=9598&range=04-05 Stats: 15 lines in 2 files changed: 4 ins; 0 del; 11 mod Patch: https://git.openjdk.org/jdk/pull/9598.diff Fetch: git fetch https://git.openjdk.org/jdk pull/9598/head:pull/9598 PR: https://git.openjdk.org/jdk/pull/9598 From jwaters at openjdk.org Fri Jul 22 13:42:11 2022 From: jwaters at openjdk.org (Julian Waters) Date: Fri, 22 Jul 2022 13:42:11 GMT Subject: RFR: 8290834: Improve potentially confusing documentation on collection of profiling information [v6] In-Reply-To: References: <1RRGLN3oyV27P-eCojRAE4rVVlFqddZfuVKOTHZfJpI=.828857c9-41b4-4299-8796-3a3eca4c1735@github.com> Message-ID: On Fri, 22 Jul 2022 07:41:09 GMT, Julian Waters wrote: >> src/hotspot/share/compiler/compilationPolicy.hpp line 48: >> >>> 46: * all data from the MDO will be loaded into the ciMethodData when it is first created. >>> 47: * (See ciMethod::method_data() in ciMethod.cpp for more details) >>> 48: * >> >> The ciMethodData is just a temporary snapshot. Updates to the profiling data is still done through the MethodData. 
> > The only place I could find a MethodData being created is in the method > > // Build a MethodData* object to hold information about this method > // collected in the interpreter. > void Method::build_interpreter_method_data(const methodHandle& method, TRAPS) > > which presumably constructs a MethodData for profiling in the Interpreter. Is there a different area where it's created (when profiling with C1) that I missed? hopefully resolved by the newest changes ------------- PR: https://git.openjdk.org/jdk/pull/9598 From duke at openjdk.org Fri Jul 22 14:29:00 2022 From: duke at openjdk.org (Evgeny Astigeevich) Date: Fri, 22 Jul 2022 14:29:00 GMT Subject: RFR: 8287393: AArch64: Remove trampoline_call1 [v2] In-Reply-To: References: Message-ID: On Fri, 22 Jul 2022 08:15:50 GMT, Andrew Haley wrote: >> Evgeny Astigeevich has updated the pull request incrementally with one additional commit since the last revision: >> >> Replace trampoline_call1 with trampoline_call > > src/hotspot/cpu/aarch64/macroAssembler_aarch64.cpp line 637: > >> 635: // code. >> 636: PhaseOutput* phase_output = Compile::current()->output(); >> 637: in_scratch_emit_size = > > Looks reasonable enough. The only change is to check for `Compile::current()->output()` being null, right? Do you mean `Compile::current()`? `Compile::current()->output()` is checked in the expression for `in_scratch_emit_size` as `phase_output != NULL`. ------------- PR: https://git.openjdk.org/jdk/pull/9592 From jbhateja at openjdk.org Fri Jul 22 17:33:48 2022 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Fri, 22 Jul 2022 17:33:48 GMT Subject: [jdk19] RFR: 8287794: Reverse*VNode::Identity problem Message-ID: Hi All, - This bug fix patch fixes a missing case during reverse[bits|bytes] identity transformation. - Unlike AARCH64(SVE), X86(AVX512) ISA has no direct instruction to reverse[bits|bytes] of a vector lane hence a predicated operation is supported through blend instruction. 
- New IR framework based tests have been added for transforms relevant to AVX2, AVX512 and SVE. Kindly review and share your feedback. Best Regards, Jatin ------------- Commit messages: - 8287794: Reverse*VNode::Identity problem Changes: https://git.openjdk.org/jdk19/pull/153/files Webrev: https://webrevs.openjdk.org/?repo=jdk19&pr=153&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8287794 Stats: 414 lines in 3 files changed: 412 ins; 0 del; 2 mod Patch: https://git.openjdk.org/jdk19/pull/153.diff Fetch: git fetch https://git.openjdk.org/jdk19 pull/153/head:pull/153 PR: https://git.openjdk.org/jdk19/pull/153 From kvn at openjdk.org Fri Jul 22 18:02:04 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Fri, 22 Jul 2022 18:02:04 GMT Subject: RFR: 8290705: StringConcat::validate_mem_flow asserts with "unexpected user: StoreI" [v2] In-Reply-To: References: Message-ID: On Fri, 22 Jul 2022 05:38:01 GMT, Tobias Hartmann wrote: >> C2's string concatenation optimization (`OptimizeStringConcat`) does not correctly handle side effecting instructions between StringBuffer Allocate/Initialize and the call to the constructor. In the failing test, see `SideEffectBeforeConstructor::test`, a `result` field is incremented just before the constructor is invoked. The string concatenation optimization still merges the allocation, constructor and `toString` calls and incorrectly re-wires the store to before the concatenation. As a result, passing `null` to the constructor will incorrectly increment the field before throwing a NullPointerException. With a debug build, we hit an assert in `StringConcat::validate_mem_flow` due to the unexpected field store. This is an old bug and an extreme edge case as javac would not generate such code. 
>> >> The following comment suggests that this case should be covered by `StringConcat::validate_control_flow()`: >> https://github.com/openjdk/jdk/blob/3582fd9e93d9733c6fdf1f3848e0a093d44f6865/src/hotspot/share/opto/stringopts.cpp#L834-L838 >> >> However, the control flow analysis does not catch this case. I added the missing check. >> >> Thanks, >> Tobias > > Tobias Hartmann has updated the pull request incrementally with one additional commit since the last revision: > > Modified debug printing code Good. ------------- Marked as reviewed by kvn (Reviewer). PR: https://git.openjdk.org/jdk/pull/9589 From kvn at openjdk.org Fri Jul 22 18:16:21 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Fri, 22 Jul 2022 18:16:21 GMT Subject: RFR: 8290730: compiler/vectorization/TestAutoVecIntMinMax.java failed with "IRViolationException: There were one or multiple IR rule failures." In-Reply-To: <2VU2x5lH3EHO00HwlqqOdVR-r1T_wIAABL-PUt2U5Uk=.e0b4e158-fa0f-4bea-88ca-e423c23d1dbf@github.com> References: <2VU2x5lH3EHO00HwlqqOdVR-r1T_wIAABL-PUt2U5Uk=.e0b4e158-fa0f-4bea-88ca-e423c23d1dbf@github.com> Message-ID: On Fri, 22 Jul 2022 09:40:37 GMT, Bhavana-Kilambi wrote: > ? "IRViolationException: There were one or multiple IR rule failures." > > The IR test - TestAutoVecIntMinMax.java was introduced in https://bugs.openjdk.org/browse/JDK-8288107 to test IR generation of MaxV and MinV nodes when the MinI/MaxI nodes are auto-vectorized. > However, the corresponding vector ISA support for min/max on x64 machines is only available in SSE versions > 3 and AVX. > The "@requires" annotation in the JTREG test has been modified to use the whitelisted flags instead. > Deleting the entry for this JTREG test in test/hotspot/jtreg/ProblemList.txt. Good. I looked at Tobias's testing and verified that the test is skipped with `UseSSE` < 4 and that it is executed and passes in the other configurations. ------------- Marked as reviewed by kvn (Reviewer).
PR: https://git.openjdk.org/jdk/pull/9610 From dlong at openjdk.org Fri Jul 22 18:21:01 2022 From: dlong at openjdk.org (Dean Long) Date: Fri, 22 Jul 2022 18:21:01 GMT Subject: RFR: 8290834: Improve potentially confusing documentation on collection of profiling information [v6] In-Reply-To: References: Message-ID: On Fri, 22 Jul 2022 13:42:09 GMT, Julian Waters wrote: >> Documentation on the MethodData object incorrectly states that it is used when profiling in tiers 0 and 1, when it only does so for tier 0 (Interpreter), while tier 1 (Fully optimizing C1) does not collect any profile data at all. Additionally, the description for the different execution tiers is slightly misleading, as it seems to imply that MethodData is used in tier 3 as well, when profiling with C1 is done through ciMethodData instead. This cleanup attempts to slightly better clarify how profiling is tied together between the Interpreter and C1, explain what MDO is an abbreviation for (MethodData object), and corrects the documentation for MethodData as well. > > Julian Waters has updated the pull request incrementally with one additional commit since the last revision: > > New changes src/hotspot/share/compiler/compilationPolicy.hpp line 47: > 45: * for the interpreter and ciMethod::ensure_method_data, ciMethod.cpp for C1), and interacts > 46: * with C1 and C2 via the compiler interface. It is updated periodically as more profiling > 47: * information is gathered, directly in the case of the interpreter and through ciMethodData This is still a bit misleading. The information flow between MethodData and ciMethodData is one-way. Only MethodData are updated by the interpreter or generated code. ciMethodData is just a read-only snapshot used during compilation. 
------------- PR: https://git.openjdk.org/jdk/pull/9598 From kvn at openjdk.org Fri Jul 22 18:42:15 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Fri, 22 Jul 2022 18:42:15 GMT Subject: RFR: 8290834: Improve potentially confusing documentation on collection of profiling information [v6] In-Reply-To: References: Message-ID: On Fri, 22 Jul 2022 13:42:09 GMT, Julian Waters wrote: >> Documentation on the MethodData object incorrectly states that it is used when profiling in tiers 0 and 1, when it only does so for tier 0 (Interpreter), while tier 1 (Fully optimizing C1) does not collect any profile data at all. Additionally, the description for the different execution tiers is slightly misleading, as it seems to imply that MethodData is used in tier 3 as well, when profiling with C1 is done through ciMethodData instead. This cleanup attempts to slightly better clarify how profiling is tied together between the Interpreter and C1, explain what MDO is an abbreviation for (MethodData object), and corrects the documentation for MethodData as well. > > Julian Waters has updated the pull request incrementally with one additional commit since the last revision: > > New changes Please note. CI (compiler interface) and its API (ciMethod*) is used by JIT compilers C1 and C2 only **during** compilation. Compilers (C1 and C2) can create MDO (through CI) if it is missed but they don't update data in it. Only Interpreter and tier 3 (profiling) **compiled code** produced by C1 updates MDO. It does it directly without CI. 
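The one-way flow described in these two review comments can be pictured with a toy C++ model. This is only a sketch, not HotSpot code: `ToyMethodData`, `ToyCiMethodData` and all of their members are hypothetical names invented for the illustration. The live object is the only thing that is ever updated (by the interpreter or tier 3 compiled code), and the compiler-side object is a copy taken at compilation time that offers no way to write back.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the live profile (the MDO). Not HotSpot's MethodData.
struct ToyMethodData {
    uint32_t invocation_count = 0;
    uint32_t backedge_count = 0;

    // In this toy model these are called by the "interpreter" or by
    // "tier 3 compiled code", directly and without any compiler interface.
    void record_invocation() { ++invocation_count; }
    void record_backedge() { ++backedge_count; }
};

// Hypothetical stand-in for the compiler-side snapshot (ciMethodData).
// It copies the counters once and exposes only const accessors, so
// nothing flows from the snapshot back into the live profile.
class ToyCiMethodData {
public:
    explicit ToyCiMethodData(const ToyMethodData& mdo)
        : invocation_count_(mdo.invocation_count),
          backedge_count_(mdo.backedge_count) {}

    uint32_t invocation_count() const { return invocation_count_; }
    uint32_t backedge_count() const { return backedge_count_; }

private:
    const uint32_t invocation_count_;
    const uint32_t backedge_count_;
};
```

In this model, a snapshot taken after one recorded invocation keeps reporting one invocation even if the live object is updated again afterwards, which is the sense in which the comments above call the compiler-side data a read-only snapshot.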
------------- PR: https://git.openjdk.org/jdk/pull/9598 From kvn at openjdk.org Fri Jul 22 18:42:17 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Fri, 22 Jul 2022 18:42:17 GMT Subject: RFR: 8290834: Improve potentially confusing documentation on collection of profiling information [v2] In-Reply-To: References: Message-ID: On Fri, 22 Jul 2022 08:51:55 GMT, Dean Long wrote: >> Julian Waters has updated the pull request incrementally with one additional commit since the last revision: >> >> Correct comment with respect to review > > C1 uses ciMethod::ensure_method_data(), which calls Method::build_interpreter_method_data(), to create an MDO if one wasn't already created by the interpreter. So the name build_interpreter_method_data() is a bit misleading, because C1 will use the same MDO as the interpreter. > > I also found a comment in c1_globals.hpp about C1UpdateMethodData that mentions tier1. I think the comment should be changed to say tier3. @dean-long posted his comment before I finished my :) But we are saying the same thing. ------------- PR: https://git.openjdk.org/jdk/pull/9598 From kvn at openjdk.org Fri Jul 22 18:55:07 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Fri, 22 Jul 2022 18:55:07 GMT Subject: [jdk19] RFR: 8287794: Reverse*VNode::Identity problem In-Reply-To: References: Message-ID: On Fri, 22 Jul 2022 17:24:04 GMT, Jatin Bhateja wrote: > Hi All, > > - This bug fix patch fixes a missing case during reverse[bits|bytes] identity transformation. > - Unlike AARCH64(SVE), X86(AVX512) ISA has no direct instruction to reverse[bits|bytes] of a vector lane hence a predicated operation is supported through blend instruction. > - New IR framework based tests have been added for transforms relevant to AVX2, AVX512 and SVE. > > Kindly review and share your feedback. > > Best Regards, > Jatin This is P4 bug and we are in phase 2 for JDK 19 release. The bug's fix version is JDK 20. Please, create new PR based on latest JDK and not JDK 19. 
We can later backport it into JDK 19 update release. ------------- Changes requested by kvn (Reviewer). PR: https://git.openjdk.org/jdk19/pull/153 From dlong at openjdk.org Fri Jul 22 19:25:10 2022 From: dlong at openjdk.org (Dean Long) Date: Fri, 22 Jul 2022 19:25:10 GMT Subject: [jdk19] RFR: 8287794: Reverse*VNode::Identity problem In-Reply-To: References: Message-ID: <0nANxxruWp971ZlQRiTb9Az7l7NSMYQ8tpebyWB6as8=.a7070bb2-215a-4c37-ac29-8be544565cb4@github.com> On Fri, 22 Jul 2022 17:24:04 GMT, Jatin Bhateja wrote: > Hi All, > > - This bug fix patch fixes a missing case during reverse[bits|bytes] identity transformation. > - Unlike AARCH64(SVE), X86(AVX512) ISA has no direct instruction to reverse[bits|bytes] of a vector lane hence a predicated operation is supported through blend instruction. > - New IR framework based tests have been added for transforms relevant to AVX2, AVX512 and SVE. > > Kindly review and share your feedback. > > Best Regards, > Jatin The "else if" clause still returns the same value as the "else" clause, so why is the "else if" part needed? This change doesn't seem to address the question asked by the submitter of the bug: > Seems to me the first condition checks that MASKs are the same in both nodes. But if they are not, we are falling to "else" branch, where we do the same transformation anyway. So, there might be a bug lurking there when MASKs are different. ------------- Changes requested by dlong (Reviewer). PR: https://git.openjdk.org/jdk19/pull/153 From duke at openjdk.org Fri Jul 22 21:09:02 2022 From: duke at openjdk.org (Bhavana-Kilambi) Date: Fri, 22 Jul 2022 21:09:02 GMT Subject: RFR: 8290730: compiler/vectorization/TestAutoVecIntMinMax.java failed with "IRViolationException: There were one or multiple IR rule failures." 
In-Reply-To: <2VU2x5lH3EHO00HwlqqOdVR-r1T_wIAABL-PUt2U5Uk=.e0b4e158-fa0f-4bea-88ca-e423c23d1dbf@github.com> References: <2VU2x5lH3EHO00HwlqqOdVR-r1T_wIAABL-PUt2U5Uk=.e0b4e158-fa0f-4bea-88ca-e423c23d1dbf@github.com> Message-ID: On Fri, 22 Jul 2022 09:40:37 GMT, Bhavana-Kilambi wrote: > ? "IRViolationException: There were one or multiple IR rule failures." > > The IR test - TestAutoVecIntMinMax.java was introduced in https://bugs.openjdk.org/browse/JDK-8288107 to test IR generation of MaxV and MinV nodes when the MinI/MaxI nodes are auto-vectorized. > However, the corresponding vector ISA support for min/max on x64 machines is only available in SSE versions > 3 and AVX. > The "@requires" annotation in the JTREG test has been modified to use the whitelisted flags instead. > Deleting the entry for this JTREG test in test/hotspot/jtreg/ProblemList.txt. Thank you for testing this patch. ------------- PR: https://git.openjdk.org/jdk/pull/9610 From jbhateja at openjdk.org Sat Jul 23 01:17:51 2022 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Sat, 23 Jul 2022 01:17:51 GMT Subject: [jdk19] RFR: 8287794: Reverse*VNode::Identity problem In-Reply-To: References: Message-ID: On Fri, 22 Jul 2022 17:24:04 GMT, Jatin Bhateja wrote: > Hi All, > > - This bug fix patch fixes a missing case during reverse[bits|bytes] identity transformation. > - Unlike AARCH64(SVE), X86(AVX512) ISA has no direct instruction to reverse[bits|bytes] of a vector lane hence a predicated operation is supported through blend instruction. > - New IR framework based tests have been added for transforms relevant to AVX2, AVX512 and SVE. > > Kindly review and share your feedback. > > Best Regards, > Jatin Closing this PR. Will create one against JDK mainline with resolved review comments. 
------------- PR: https://git.openjdk.org/jdk19/pull/153 From jbhateja at openjdk.org Sat Jul 23 01:17:51 2022 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Sat, 23 Jul 2022 01:17:51 GMT Subject: [jdk19] Withdrawn: 8287794: Reverse*VNode::Identity problem In-Reply-To: References: Message-ID: On Fri, 22 Jul 2022 17:24:04 GMT, Jatin Bhateja wrote: > Hi All, > > - This bug fix patch fixes a missing case during reverse[bits|bytes] identity transformation. > - Unlike AARCH64(SVE), X86(AVX512) ISA has no direct instruction to reverse[bits|bytes] of a vector lane hence a predicated operation is supported through blend instruction. > - New IR framework based tests have been added for transforms relevant to AVX2, AVX512 and SVE. > > Kindly review and share your feedback. > > Best Regards, > Jatin This pull request has been closed without being integrated. ------------- PR: https://git.openjdk.org/jdk19/pull/153 From duke at openjdk.org Sat Jul 23 05:58:07 2022 From: duke at openjdk.org (duke) Date: Sat, 23 Jul 2022 05:58:07 GMT Subject: Withdrawn: 8283699: Improve the peephole mechanism of hotspot In-Reply-To: References: Message-ID: On Tue, 29 Mar 2022 23:58:39 GMT, Quan Anh Mai wrote: > Hi, > > The current peephole mechanism has several drawbacks: > - Can only match and remove adjacent instructions. > - Cannot match machine ideal nodes (e.g MachSpillCopyNode). > - Can only replace 1 instruction, the position of insertion is limited to the position at which the matched nodes reside. > - Is actually broken since the nodes are not connected properly and OptoScheduling requires true dependencies between nodes. > > The patch proposes to enhance the peephole mechanism by allowing a peep rule to call into a dedicated function, which takes the responsibility to perform all required transformations on the basic block. This allows the peephole mechanism to perform several transformations effectively in a more fine-grain manner. 
> > The patch uses the peephole optimisation to perform some classic peepholes, transforming on x86 the sequences:
>
>     mov r1, r2      ->  lea r1, [r2 + r3/i]
>     add r1, r3/i
>
> and
>
>     mov r1, r2      ->  lea r1, [r2 << i], with i = 1, 2, 3
>     shl r1, i
>
> On the added benchmarks, the transformations show positive results:
>
> Benchmark             Mode  Cnt     Score     Error  Units
> LeaPeephole.B_D_int   avgt    5  1200.490 ± 104.662  ns/op
> LeaPeephole.B_D_long  avgt    5  1211.439 ±  30.196  ns/op
> LeaPeephole.B_I_int   avgt    5  1118.831 ±   7.995  ns/op
> LeaPeephole.B_I_long  avgt    5  1112.389 ±  15.838  ns/op
> LeaPeephole.I_S_int   avgt    5  1262.528 ±   7.293  ns/op
> LeaPeephole.I_S_long  avgt    5  1223.820 ±  17.777  ns/op
>
> Benchmark             Mode  Cnt     Score     Error  Units
> LeaPeephole.B_D_int   avgt    5   860.889 ±   6.089  ns/op
> LeaPeephole.B_D_long  avgt    5   945.455 ±  21.603  ns/op
> LeaPeephole.B_I_int   avgt    5   849.109 ±   9.809  ns/op
> LeaPeephole.B_I_long  avgt    5   851.283 ±  16.921  ns/op
> LeaPeephole.I_S_int   avgt    5   976.594 ±  23.004  ns/op
> LeaPeephole.I_S_long  avgt    5   936.984 ±   9.601  ns/op
>
> A following patch would add IR tests for these transformations since the IR framework has not been able to parse the ideal scheduling yet although printing the scheduling itself has been made possible recently. > > Thank you very much. This pull request has been closed without being integrated. ------------- PR: https://git.openjdk.org/jdk/pull/8025 From jwaters at openjdk.org Sat Jul 23 08:00:02 2022 From: jwaters at openjdk.org (Julian Waters) Date: Sat, 23 Jul 2022 08:00:02 GMT Subject: RFR: 8290834: Improve potentially confusing documentation on collection of profiling information [v6] In-Reply-To: References: Message-ID: On Fri, 22 Jul 2022 18:36:54 GMT, Vladimir Kozlov wrote: > Please note. CI (compiler interface) and its API (ciMethod*) is used by JIT compilers C1 and C2 only **during** compilation. Compilers (C1 and C2) can create MDO (through CI) if it is missed but they don't update data in it.
Only Interpreter and tier 3 (profiling) **compiled code** produced by C1 updates MDO. It does it directly without CI. Ah, my mistake. I'll fix the issue asap ------------- PR: https://git.openjdk.org/jdk/pull/9598 From duke at openjdk.org Sat Jul 23 13:18:05 2022 From: duke at openjdk.org (Quan Anh Mai) Date: Sat, 23 Jul 2022 13:18:05 GMT Subject: RFR: 8283232: x86: Improve vector broadcast operations [v8] In-Reply-To: References: Message-ID: > Hi, > > This patch improves the generation of broadcasting a scalar in several ways: > > - As it has been pointed out, dumping the whole vector into the constant table is costly in terms of code size, this patch minimises this overhead for vector replicate of constants. Also, options are available for constants to be generated with more alignment so that vector load can be made efficiently without crossing cache lines. > - Vector broadcasting should prefer rematerialising to spilling when register pressure is high. > - Load vectors using the same kind (integral vs floating point) of instructions as that of the results to avoid potential data bypass delay > > With this patch, the result of the added benchmark, which performs some operations with a really high register pressure, on my machine with Intel i7-7700HQ (avx2) is as follows:
>
>                                             Before            After
> Benchmark                  Mode  Cnt   Score   Error    Score   Error  Units    Gain
> SpiltReplicate.testDouble  avgt    5  42.621 ± 0.598   38.771 ± 0.797  ns/op   +9.03%
> SpiltReplicate.testFloat   avgt    5  42.245 ± 1.464   38.603 ± 0.367  ns/op   +8.62%
> SpiltReplicate.testInt     avgt    5  20.581 ± 5.791   13.755 ± 0.375  ns/op  +33.17%
> SpiltReplicate.testLong    avgt    5  17.794 ± 4.781   13.663 ± 0.387  ns/op  +23.22%
>
> As expected, the constant table sizes shrink significantly from 1024 bytes to 256 bytes for `long`/`double` and 128 bytes for `int`/`float` cases. > > This patch also removes some redundant code paths and renames some incorrectly named instructions. > > Thank you very much.
Quan Anh Mai has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 18 commits: - rename - consolidate sse checks - benchmark - fix - Merge branch 'master' into improveReplicate - remove duplicate - unsignness - rematerializing input count - fix comparison - fix rematerialize, constant deduplication - ... and 8 more: https://git.openjdk.org/jdk/compare/0599a05f...6c10f9ad ------------- Changes: https://git.openjdk.org/jdk/pull/7832/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=7832&range=07 Stats: 563 lines in 14 files changed: 360 ins; 85 del; 118 mod Patch: https://git.openjdk.org/jdk/pull/7832.diff Fetch: git fetch https://git.openjdk.org/jdk pull/7832/head:pull/7832 PR: https://git.openjdk.org/jdk/pull/7832 From duke at openjdk.org Sat Jul 23 13:20:03 2022 From: duke at openjdk.org (Quan Anh Mai) Date: Sat, 23 Jul 2022 13:20:03 GMT Subject: RFR: 8283232: x86: Improve vector broadcast operations [v7] In-Reply-To: References: Message-ID: <-Zs75LtCJld4LgO6NVDKGTA5cuVrWTtAWRe_9-iOGX0=.f751e609-0c4c-47e8-9131-21566aba130c@github.com> On Fri, 18 Mar 2022 00:29:07 GMT, Quan Anh Mai wrote: >> Hi, >> >> This patch improves the generation of broadcasting a scalar in several ways: >> >> - As it has been pointed out, dumping the whole vector into the constant table is costly in terms of code size, this patch minimises this overhead for vector replicate of constants. Also, options are available for constants to be generated with more alignment so that vector load can be made efficiently without crossing cache lines. >> - Vector broadcasting should prefer rematerialising to spilling when register pressure is high. 
>> - Load vectors using the same kind (integral vs floating point) of instructions as that of the results to avoid potential data bypass delay >> >> With this patch, the result of the added benchmark, which performs some operations with a really high register pressure, on my machine with Intel i7-7700HQ (avx2) is as follows:
>>
>>                                             Before            After
>> Benchmark                  Mode  Cnt   Score   Error    Score   Error  Units    Gain
>> SpiltReplicate.testDouble  avgt    5  42.621 ± 0.598   38.771 ± 0.797  ns/op   +9.03%
>> SpiltReplicate.testFloat   avgt    5  42.245 ± 1.464   38.603 ± 0.367  ns/op   +8.62%
>> SpiltReplicate.testInt     avgt    5  20.581 ± 5.791   13.755 ± 0.375  ns/op  +33.17%
>> SpiltReplicate.testLong    avgt    5  17.794 ± 4.781   13.663 ± 0.387  ns/op  +23.22%
>>
>> As expected, the constant table sizes shrink significantly from 1024 bytes to 256 bytes for `long`/`double` and 128 bytes for `int`/`float` cases. >> >> This patch also removes some redundant code paths and renames some incorrectly named instructions. >> >> Thank you very much. > > Quan Anh Mai has updated the pull request incrementally with one additional commit since the last revision: > > remove duplicate I have come back to this PR with some modifications and added a benchmark for this patch. The description is also modified to better present the objective of this patch and show the results. Please review, thank you very much. ------------- PR: https://git.openjdk.org/jdk/pull/7832 From jwaters at openjdk.org Sat Jul 23 13:59:59 2022 From: jwaters at openjdk.org (Julian Waters) Date: Sat, 23 Jul 2022 13:59:59 GMT Subject: RFR: 8290834: Improve potentially confusing documentation on collection of profiling information [v7] In-Reply-To: References: Message-ID: > Documentation on the MethodData object incorrectly states that it is used when profiling in tiers 0 and 1, when it only does so for tier 0 (Interpreter), while tier 1 (Fully optimizing C1) does not collect any profile data at all.
Additionally, the description for the different execution tiers is slightly misleading, as it seems to imply that MethodData is used in tier 3 as well, when profiling with C1 is done through ciMethodData instead. This cleanup attempts to slightly better clarify how profiling is tied together between the Interpreter and C1, explain what MDO is an abbreviation for (MethodData object), and corrects the documentation for MethodData as well. Julian Waters has updated the pull request incrementally with one additional commit since the last revision: Fixup ------------- Changes: - all: https://git.openjdk.org/jdk/pull/9598/files - new: https://git.openjdk.org/jdk/pull/9598/files/c48d67f3..a6b75011 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=9598&range=06 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=9598&range=05-06 Stats: 12 lines in 2 files changed: 1 ins; 1 del; 10 mod Patch: https://git.openjdk.org/jdk/pull/9598.diff Fetch: git fetch https://git.openjdk.org/jdk pull/9598/head:pull/9598 PR: https://git.openjdk.org/jdk/pull/9598 From jwaters at openjdk.org Sat Jul 23 14:00:03 2022 From: jwaters at openjdk.org (Julian Waters) Date: Sat, 23 Jul 2022 14:00:03 GMT Subject: RFR: 8290834: Improve potentially confusing documentation on collection of profiling information [v6] In-Reply-To: References: Message-ID: On Fri, 22 Jul 2022 18:17:26 GMT, Dean Long wrote: >> Julian Waters has updated the pull request incrementally with one additional commit since the last revision: >> >> New changes > > src/hotspot/share/compiler/compilationPolicy.hpp line 47: > >> 45: * for the interpreter and ciMethod::ensure_method_data, ciMethod.cpp for C1), and interacts >> 46: * with C1 and C2 via the compiler interface. It is updated periodically as more profiling >> 47: * information is gathered, directly in the case of the interpreter and through ciMethodData > > This is still a bit misleading. The information flow between MethodData and ciMethodData is one-way. 
Only MethodData are updated by the interpreter or generated code. ciMethodData is just a read-only snapshot used during compilation. Revised ------------- PR: https://git.openjdk.org/jdk/pull/9598 From rahul.kandu at intel.com Sat Jul 23 15:45:59 2022 From: rahul.kandu at intel.com (Kandu, Rahul) Date: Sat, 23 Jul 2022 15:45:59 +0000 Subject: FW: RFR: 8290834: Improve potentially confusing documentation on collection of profiling information [v6] In-Reply-To: References: Message-ID: -----Original Message----- From: hotspot-compiler-dev On Behalf Of Julian Waters Sent: Saturday, July 23, 2022 7:00 AM To: hotspot-dev at openjdk.org; hotspot-compiler-dev at openjdk.org Subject: Re: RFR: 8290834: Improve potentially confusing documentation on collection of profiling information [v6] On Fri, 22 Jul 2022 18:17:26 GMT, Dean Long wrote: >> Julian Waters has updated the pull request incrementally with one additional commit since the last revision: >> >> New changes > > src/hotspot/share/compiler/compilationPolicy.hpp line 47: > >> 45: * for the interpreter and ciMethod::ensure_method_data, ciMethod.cpp for C1), and interacts >> 46: * with C1 and C2 via the compiler interface. It is updated periodically as more profiling >> 47: * information is gathered, directly in the case of the interpreter and through ciMethodData > > This is still a bit misleading. The information flow between MethodData and ciMethodData is one-way. Only MethodData are updated by the interpreter or generated code. ciMethodData is just a read-only snapshot used during compilation. Revised ------------- PR: https://git.openjdk.org/jdk/pull/9598 From jbhateja at openjdk.org Sun Jul 24 12:37:39 2022 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Sun, 24 Jul 2022 12:37:39 GMT Subject: RFR: 8287794: Reverse*VNode::Identity problem Message-ID: Hi All, - This bug fix patch fixes a missing case during reverse[bits|bytes] identity transformation. 
- Unlike AARCH64(SVE), X86(AVX512) ISA has no direct instruction to reverse[bits|bytes] of a vector lane hence a predicated operation is supported through blend instruction. - New IR framework based tests has been added to test transforms relevant to AVX2, AVX512 and SVE. Kindly review and share your feedback. Best Regards, Jatin ------------- Commit messages: - 8287794: Reverse*VNode::Identity problem Changes: https://git.openjdk.org/jdk/pull/9623/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=9623&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8287794 Stats: 335 lines in 2 files changed: 310 ins; 12 del; 13 mod Patch: https://git.openjdk.org/jdk/pull/9623.diff Fetch: git fetch https://git.openjdk.org/jdk pull/9623/head:pull/9623 PR: https://git.openjdk.org/jdk/pull/9623 From xgong at openjdk.org Mon Jul 25 02:27:55 2022 From: xgong at openjdk.org (Xiaohong Gong) Date: Mon, 25 Jul 2022 02:27:55 GMT Subject: RFR: 8290485: [vector] REVERSE_BYTES for byte type should not emit any instructions In-Reply-To: References: Message-ID: On Wed, 20 Jul 2022 06:58:53 GMT, Xiaohong Gong wrote: > The Vector API unary operation "`REVERSE_BYTES`" should not emit any instructions for byte vectors. The same to the relative masked operation. Currently it emits `"mov dst, src"` on aarch64 when the "`dst`" and "`src`" are not the same register. But for the masked "`REVERSE_BYTES`", the compiler will always generate a "`VectorBlend`" which I think is redundant, since the first and second vector input is the same one. 
Please see the generated code for the masked "`REVERSE_BYTES`" for byte type with NEON:
>
>     ldr q16, [x15, #16]            ; load the "src" vector
>     mov v17.16b, v16.16b           ; reverse bytes "src"
>     ldr q18, [x13, #16]
>     neg v18.16b, v18.16b           ; load the vector mask
>     bsl v18.16b, v17.16b, v16.16b  ; vector blend
>
> The elements in registers "`v17`" and "`v16`" are identical, so the result of "`bsl`" is the same as the originally loaded values in "`v16`", no matter what the values in the vector mask are.
>
> To improve this, we can add igvn transformations for "`ReverseBytesV`" and "`VectorBlend`" in the compiler. "`ReverseBytesV`" can return its vector input if the basic element type is `T_BYTE`, and "`VectorBlend`" can return its first input if the first and the second inputs are the same node.
>
> Here is the performance data for the jmh benchmark [1] on ARM NEON:
>
>     Benchmark                          (size)  Mode   Cnt     Before      After   Units
>     ByteMaxVector.REVERSE_BYTES          1024  thrpt   15  19457.641  19516.124  ops/ms
>     ByteMaxVector.REVERSE_BYTESMasked    1024  thrpt   15  12498.416  20528.004  ops/ms
>
> This patch may not have any effect on the non-masked "`REVERSE_BYTES`" on ARM NEON, because the backend may not have emitted any instruction for it before.
>
> And here is the performance data on an x86 system:
>
>     Benchmark                          (size)  Mode   Cnt     Before      After   Units
>     ByteMaxVector.REVERSE_BYTES          1024  thrpt   15  19358.941  20012.047  ops/ms
>     ByteMaxVector.REVERSE_BYTESMasked    1024  thrpt   15  15759.788  20389.996  ops/ms
>
> [1] https://github.com/openjdk/panama-vector/blob/vectorIntrinsics/test/micro/org/openjdk/bench/jdk/incubator/vector/operation/ByteMaxVector.java#L2201

Hi, could anyone please help to take a look at this simple patch? Thanks a lot for your time!
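For readers following the argument, the two identities the patch relies on can be checked with a scalar model. The sketch below is plain Java, not the Vector API and not C2 code; the class `BlendIdentitySketch` and its helpers are hypothetical names introduced only for illustration, and the blend is assumed to pick the second operand where the mask is set.

```java
import java.util.Arrays;

final class BlendIdentitySketch {
    // Scalar model of a lane-wise blend (assumption: a set mask bit selects b, otherwise a).
    static byte[] blend(byte[] a, byte[] b, boolean[] mask) {
        byte[] r = new byte[a.length];
        for (int i = 0; i < a.length; i++) {
            r[i] = mask[i] ? b[i] : a[i];
        }
        return r;
    }

    // Reverse the byte order inside each lane of laneBytes bytes.
    // v.length is assumed to be a multiple of laneBytes.
    static byte[] reverseBytes(byte[] v, int laneBytes) {
        byte[] r = new byte[v.length];
        for (int base = 0; base < v.length; base += laneBytes) {
            for (int j = 0; j < laneBytes; j++) {
                r[base + j] = v[base + laneBytes - 1 - j];
            }
        }
        return r;
    }

    public static void main(String[] args) {
        byte[] src = {1, -2, 3, -4};
        boolean[] mask = {true, false, false, true};

        // With one-byte lanes, reversing bytes changes nothing.
        byte[] reversed = reverseBytes(src, 1);
        System.out.println(Arrays.equals(reversed, src));

        // Blending a vector with an identical copy yields the vector, whatever the mask.
        System.out.println(Arrays.equals(blend(src, reversed, mask), src));
    }
}
```

Both checks print `true`, which is why neither the reverse nor the blend needs an instruction in the byte case.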
------------- PR: https://git.openjdk.org/jdk/pull/9565 From haosun at openjdk.org Mon Jul 25 03:47:24 2022 From: haosun at openjdk.org (Hao Sun) Date: Mon, 25 Jul 2022 03:47:24 GMT Subject: RFR: 8285790: AArch64: Merge C2 NEON and SVE matching rules [v3] In-Reply-To: References: Message-ID: > **MOTIVATION** > > This is a big refactoring patch of merging rules in aarch64_sve.ad and > aarch64_neon.ad. The motivation can also be found at [1]. > > Currently AArch64 has aarch64_sve.ad and aarch64_neon.ad to support SVE > and NEON codegen respectively. 1) For SVE rules we use vReg operand to > match VecA for an arbitrary length of vector type, when SVE is enabled; > 2) For NEON rules we use vecX/vecD operands to match VecX/VecD for > 128-bit/64-bit vectors, when SVE is not enabled. > > This separation looked clean at the time of introducing SVE support. > However, there are two main drawbacks now. > > **Drawback-1**: NEON (Advanced SIMD) is the mandatory feature on AArch64 and > SVE vector registers share the lower 128 bits with NEON registers. For > some cases, even when SVE is enabled, we still prefer to match NEON > rules and emit NEON instructions. > > **Drawback-2**: With more and more vector rules added to support VectorAPI, > there are lots of rules in both two ad files with different predication > conditions, e.g., different values of UseSVE or vector type/size. > > Examples can be found in [1]. These two drawbacks make the code less > maintainable and increase the libjvm.so code size. > > **KEY UPDATES** > > In this patch, we mainly do two things, using generic vReg to match all > NEON/SVE vector registers and merging NEON/SVE matching rules. > > - Update-1: Use generic vReg to match all NEON/SVE vector registers > > Two different approaches were considered, and we prefer to use generic > vector solution but keep VecA operand for all >128-bit vectors. See the > last slide in [1]. All the changes lie in the AArch64 backend. 
> > 1) Some helpers are updated in aarch64.ad to enable generic vector on > AArch64. See vector_ideal_reg(), pd_specialize_generic_vector_operand(), > is_reg2reg_move() and is_generic_vector(). > > 2) Operand vecA is created to match VecA register, and vReg is updated > to match VecA/D/X registers dynamically. > > With the introduction of generic vReg, difference in register types > between NEON rules and SVE rules can be eliminated, which makes it easy > to merge these rules. > > - Update-2: Try to merge existing rules > > As updated in GensrcAdlc.gmk, new ad file, aarch64_vector.ad, is > introduced to hold the grouped and merged matching rules. > > 1) Similar rules with difference in vector type/size can be merged into > new rules, where different types and vector sizes are handled in the > codegen part, e.g., vadd(). This resolves **Drawback-2**. > > 2) In most cases, we tend to emit NEON instructions for 128-bit vector > operations on SVE platforms, e.g., vadd(). This resolves **Drawback-1**. > > It's important to note that there are some exceptions. > > Exception-1: For some rules, there are no direct NEON instructions, but > exists simple SVE implementation due to newly added SVE ISA. Such rules > include vloadconB, vmulL_neon, vminL_neon, vmaxL_neon, > reduce_addF_le128b (4F case), reduce_and/or/xor_neon, reduce_minL_neon, > reduce_maxL_neon, vcvtLtoF_neon, vcvtDtoI_neon, rearrange_HS_neon. > > Exception-2: Vector mask generation and operation rules are different > because vector mask is stored in different types of registers between > NEON and SVE, e.g., vmaskcmp_neon and vmask_truecount_neon rules. > > Exception-3: Shift right related rules are different because vector > shift right instructions differ a bit between NEON and SVE. > > For these exceptions, we emit NEON or SVE code simply based on UseSVE > options. 
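The selection policy described under Update-2 and its exceptions can be pictured as a small decision function. This is only a reading of the text above, not the predicates used in aarch64_vector.ad; `RuleSelectionSketch`, `Isa`, and `select` are hypothetical names, and the `isException` flag stands in for the rules listed under Exception-1 to Exception-3.

```java
final class RuleSelectionSketch {
    enum Isa { NEON, SVE }

    // useSVE      : value of the UseSVE option (0 means SVE is disabled)
    // vectorBits  : vector length in bits (assumed <= 128 when useSVE == 0)
    // isException : true for rules that simply follow UseSVE
    static Isa select(int useSVE, int vectorBits, boolean isException) {
        if (useSVE == 0) {
            return Isa.NEON;              // SVE is disabled, NEON is the only choice
        }
        if (isException) {
            return Isa.SVE;               // exceptions are decided by UseSVE alone
        }
        // Regular merged rules: prefer NEON for vectors that fit in 128 bits.
        return vectorBits <= 128 ? Isa.NEON : Isa.SVE;
    }

    public static void main(String[] args) {
        System.out.println(select(2, 128, false)); // 128-bit add on an SVE machine
        System.out.println(select(2, 256, false)); // wider than NEON registers
        System.out.println(select(1, 128, true));  // e.g. a mask rule on an SVE machine
    }
}
```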
> > **MINOR UPDATES and CODE REFACTORING** > > Since we've touched all lines of code during merging rules, we further > do more minor updates and refactoring. > > - Reduce regmask bits > > Stack slot alignment is handled specially for scalable vector, which > will firstly align to SlotsPerVecA, and then align to the real vector > length. We should guarantee SlotsPerVecA is no bigger than the real > vector length. Otherwise, unused stack space would be allocated. > > In AArch64 SVE, the vector length can be 128 to 2048 bits. However, > SlotsPerVecA is currently set as 8, i.e. 8 * 32 = 256 bits. As a result, > on a 128-bit SVE platform, the stack slot is aligned to 256 bits, > leaving the half 128 bits unused. In this patch, we reduce SlotsPerVecA > from 8 to 4. > > See the updates in register_aarch64.hpp, regmask.hpp and aarch64.ad > (chunk1 and vectora_reg). > > - Refactor NEON/SVE vector op support check. > > Merge NEON and SVE vector supported check into one single function. To > be consistent, SVE default size supported check now is relaxed from no > less than 64 bits to the same condition as NEON's min_vector_size(), > i.e. 4 bytes and 4/2 booleans are also supported. This should be fine, > as we assume at least we will emit NEON code for those small vectors, > with unified rules. > > - Some notes for new rules > > 1) Since new rules are unique and it makes no sense to set different > "ins_cost", we turn to use the default cost. > > 2) By default we use "pipe_slow" for matching rules in aarch64_vector.ad > now. Hence, many SIMD pipeline classes at aarch64.ad become unused and > can be removed. > > 3) Suffixes '_le128b/_gt128b' and '_neon/_sve' are appended in the > matching rule names if needed. > a) 'le128b' means the vector length is less than or equal to 128 bits. > This rule can be matched on both NEON and 128-bit SVE. > b) 'gt128b' means the vector length is greater than 128 bits. This rule > can only be matched on SVE. 
> c) 'neon' means this rule can only be matched on NEON, i.e. the
> generated instruction is not better than those in 128-bit SVE.
> d) 'sve' means this rule is only matched on SVE for all possible vector
> length, i.e. not limited to gt128b.
>
> Note-1: m4 file is not introduced because many duplications are highly
> reduced now.
> Note-2: We guess the code review for this big patch would probably take
> some time and we may need to merge latest code from master branch from
> time to time. We prefer to keep aarch64_neon/sve.ad and the
> corresponding m4 files for easy comparison and review. Of course, they
> will be finally removed after some solid reviews before integration.
> Note-3: Several other minor refactorings are done in this patch, but we
> cannot list all of them in the commit message. We have reviewed and
> tested the rules carefully to guarantee the quality.
>
> **TESTING**
>
> 1) Cross compilations on arm32/s390/ppc/riscv passed.
> 2) tier1~3 jtreg passed on both x64 and aarch64 machines.
> 3) vector tests: all the test cases under the following directories can
> pass on both NEON and SVE systems with max vector length 16/32/64 bytes.
>
> "test/hotspot/jtreg/compiler/vectorapi/"
> "test/jdk/jdk/incubator/vector/"
> "test/hotspot/jtreg/compiler/vectorization/"
>
> 4) Performance evaluation: we choose vector micro-benchmarks from
> panama-vector:vectorIntrinsics [2] to evaluate the performance of this
> patch. We've tested *MaxVectorTests.java cases on one 128-bit SVE
> platform and one NEON platform, and didn't see any visible regression
> with NEON and SVE. We will continue to verify more cases on other
> platforms with NEON and different SVE vector sizes.
>
> **BENEFITS**
>
> The number of matching rules is reduced to ~ **42%**.
> before: 373 (aarch64_neon.ad) + 380 (aarch64_sve.ad) = 753
> after : 313 (aarch64_vector.ad)
>
> Code size for libjvm.so (release build) on aarch64 is reduced to ~ **96%**.
> before: 25246528 B (commit 7905788e969) > after : 24208776 B (**nearly 1 MB reduction**) > > [1] http://cr.openjdk.java.net/~njian/8285790/JDK-8285790.pdf > [2] https://github.com/openjdk/panama-vector/tree/vectorIntrinsics/test/micro/org/openjdk/bench/jdk/incubator/vector/operation > > Co-Developed-by: Ningsheng Jian > Co-Developed-by: Eric Liu Hao Sun has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains five commits: - Merge branch 'master' as of 22nd-Jul into 8285790-merge-rules Merge branch "master". - Add m4 file Add the corresponding M4 file - Add VM_Version flag to control NEON instruction generation Add VM_Version flag use_neon_for_vector() to control whether to generate NEON instructions for 128-bit vector operations. Currently only vector length is checked inside and it returns true for existing SVE cores. More specific things might be checked in the near future, e.g., the basic data type or SVE CPU model. Besides, new macro assembler helpers neon_vector_extend/narrow() are introduced to make the code clean. Note: AddReductionVF/D rules are updated so that SVE instructions are generated for 64/128-bit vector operations, because floating point reduction add instructions are supported directly in SVE. - Merge branch 'master' as of 7th-July into 8285790-merge-rules - 8285790: AArch64: Merge C2 NEON and SVE matching rules
------------- Changes: https://git.openjdk.org/jdk/pull/9346/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=9346&range=02 Stats: 12309 lines in 15 files changed: 11604 ins; 582 del; 123 mod Patch: https://git.openjdk.org/jdk/pull/9346.diff Fetch: git fetch https://git.openjdk.org/jdk pull/9346/head:pull/9346 PR: https://git.openjdk.org/jdk/pull/9346 From xgong at openjdk.org Mon Jul 25 03:54:03 2022 From: xgong at openjdk.org (Xiaohong Gong) Date: Mon, 25 Jul 2022 03:54:03 GMT Subject: RFR: 8287794: Reverse*VNode::Identity problem In-Reply-To: References: Message-ID: On Sat, 23 Jul 2022 17:39:27 GMT, Jatin Bhateja wrote: > Hi All, > > - This bug fix patch fixes a missing case during reverse[bits|bytes] identity transformation. > - Unlike AARCH64(SVE), X86(AVX512) ISA has no direct instruction to reverse[bits|bytes] of a vector lane hence a predicated operation is supported through blend instruction.
> - New IR framework based tests has been added to test transforms relevant to AVX2, AVX512 and SVE. > > Kindly review and share your feedback. > > Best Regards, > Jatin test/hotspot/jtreg/compiler/vectorapi/TestReverseByteTransforms.java line 80: > 78: > 79: @Test > 80: @IR(applyIfCPUFeature={"sve", "true"}, failOn = {"ReverseBytesV" , " > 0 "}) Use " failOn = "ReverseBytesV" " instead? test/hotspot/jtreg/compiler/vectorapi/TestReverseByteTransforms.java line 99: > 97: > 98: @Test > 99: @IR(applyIfCPUFeatureOr={"sve", "true", "simd", "true", "avx2", "true"}, counts = {"ReverseBytesV" , " > 0 "}) After https://github.com/openjdk/jdk/pull/9509 merged, I think we'd better to consider different vm flags like "UseAVX", "UseSVE" for each architecture. For example, if the cpu feature mathes "sve", but user may set "-XX:UseSVE=0". With such options, this IR test will also run and I'm afraid it will fail with "-XX:UseSVE=0". ------------- PR: https://git.openjdk.org/jdk/pull/9623 From xgong at openjdk.org Mon Jul 25 04:03:51 2022 From: xgong at openjdk.org (Xiaohong Gong) Date: Mon, 25 Jul 2022 04:03:51 GMT Subject: RFR: 8287794: Reverse*VNode::Identity problem In-Reply-To: References: Message-ID: On Sat, 23 Jul 2022 17:39:27 GMT, Jatin Bhateja wrote: > Hi All, > > - This bug fix patch fixes a missing case during reverse[bits|bytes] identity transformation. > - Unlike AARCH64(SVE), X86(AVX512) ISA has no direct instruction to reverse[bits|bytes] of a vector lane hence a predicated operation is supported through blend instruction. > - New IR framework based tests has been added to test transforms relevant to AVX2, AVX512 and SVE. > > Kindly review and share your feedback. > > Best Regards, > Jatin src/hotspot/share/opto/vectornode.cpp line 1857: > 1855: if (n->is_predicated_using_blend()) { > 1856: return n; > 1857: } The change in this patch looks fine to me! 
Just a concern the previous codes, for patterns like: VectorBlend X (ReverseBytesV (ReverseBytesV Y)) MASK we will miss the transformation to: VectorBlend X Y MASK right? Thanks! ------------- PR: https://git.openjdk.org/jdk/pull/9623 From jbhateja at openjdk.org Mon Jul 25 05:29:05 2022 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Mon, 25 Jul 2022 05:29:05 GMT Subject: RFR: 8287794: Reverse*VNode::Identity problem [v2] In-Reply-To: References: Message-ID: <3q1AbVHPhgTmbtzqVBmQtCsfrbbW64Kk6I8aUEJ0oTY=.ade9b2ce-f008-42b9-a3fc-ed69ba3580d1@github.com> > Hi All, > > - This bug fix patch fixes a missing case during reverse[bits|bytes] identity transformation. > - Unlike AARCH64(SVE), X86(AVX512) ISA has no direct instruction to reverse[bits|bytes] of a vector lane hence a predicated operation is supported through blend instruction. > - New IR framework based tests has been added to test transforms relevant to AVX2, AVX512 and SVE. > > Kindly review and share your feedback. > > Best Regards, > Jatin Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: 8287794: Review comments resolved. 
------------- Changes: - all: https://git.openjdk.org/jdk/pull/9623/files - new: https://git.openjdk.org/jdk/pull/9623/files/dee05b27..845b935d Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=9623&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=9623&range=00-01 Stats: 6 lines in 1 file changed: 0 ins; 0 del; 6 mod Patch: https://git.openjdk.org/jdk/pull/9623.diff Fetch: git fetch https://git.openjdk.org/jdk pull/9623/head:pull/9623 PR: https://git.openjdk.org/jdk/pull/9623 From jbhateja at openjdk.org Mon Jul 25 05:29:09 2022 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Mon, 25 Jul 2022 05:29:09 GMT Subject: RFR: 8287794: Reverse*VNode::Identity problem [v2] In-Reply-To: References: Message-ID: On Mon, 25 Jul 2022 04:00:10 GMT, Xiaohong Gong wrote: >> Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: >> >> 8287794: Review comments resolved. > > src/hotspot/share/opto/vectornode.cpp line 1857: > >> 1855: if (n->is_predicated_using_blend()) { >> 1856: return n; >> 1857: } > > The change in this patch looks fine to me! Just a concern the previous codes, for patterns like: > > VectorBlend X (ReverseBytesV (ReverseBytesV Y)) MASK > > we will miss the transformation to: > > VectorBlend X Y MASK > > right? Thanks! No, since the flag Predicated_with_blend is set over nodes if operation itself is predicated, here both ReverseBytesV nodes are non-predicated ones and this check is part of identity routines of reverse* operations. > test/hotspot/jtreg/compiler/vectorapi/TestReverseByteTransforms.java line 80: > >> 78: >> 79: @Test >> 80: @IR(applyIfCPUFeature={"sve", "true"}, failOn = {"ReverseBytesV" , " > 0 "}) > > Use " failOn = "ReverseBytesV" " instead? Thanks for noticing this, semantically, failOn NODE is same as failOn NODE , "> 0", tried and tested it, but it looks like a non-standard usage of the option. 
> test/hotspot/jtreg/compiler/vectorapi/TestReverseByteTransforms.java line 99: > >> 97: >> 98: @Test >> 99: @IR(applyIfCPUFeatureOr={"sve", "true", "simd", "true", "avx2", "true"}, counts = {"ReverseBytesV" , " > 0 "}) > > After https://github.com/openjdk/jdk/pull/9509 merged, I think we'd better to consider different vm flags like "UseAVX", "UseSVE" for each architecture. For example, if the cpu feature mathes "sve", but user may set "-XX:UseSVE=0". With such options, this IR test will also run and I'm afraid it will fail with "-XX:UseSVE=0". CPU features list is populated during VM startup, this list is later ON queried by applyIfCPUFeature during IR validations, I do not think UseSVE is a valid flag for X86, SPROMPT>java -XX:UseSVE=0 Unrecognized VM option 'UseSVE=0' Error: Could not create the Java Virtual Machine. Error: A fatal exception has occurred. Program will exit. SPROMPT>lscpu Architecture: x86_64 CPU op-mode(s): 32-bit, 64-bit Byte Order: Little Endian Address sizes: 46 bits physical, 48 bits virtual CPU(s): 72 On-line CPU(s) list: 0-71 Thread(s) per core: 2 Core(s) per socket: 18 Socket(s): 2 NUMA node(s): 2 Vendor ID: GenuineIntel CPU family: 6 Model: 85 Model name: Intel(R) Xeon(R) Gold 6139 CPU @ 2.30GHz ------------- PR: https://git.openjdk.org/jdk/pull/9623 From xgong at openjdk.org Mon Jul 25 05:41:01 2022 From: xgong at openjdk.org (Xiaohong Gong) Date: Mon, 25 Jul 2022 05:41:01 GMT Subject: RFR: 8287794: Reverse*VNode::Identity problem [v2] In-Reply-To: References: Message-ID: <8ahGNkHakMZhDPWeFVVR4ObhJ8bMarlcAf4j9Hz_r1A=.cbac7227-73ca-45e8-b9f9-4d500e7331e1@github.com> On Mon, 25 Jul 2022 05:24:34 GMT, Jatin Bhateja wrote: >> test/hotspot/jtreg/compiler/vectorapi/TestReverseByteTransforms.java line 99: >> >>> 97: >>> 98: @Test >>> 99: @IR(applyIfCPUFeatureOr={"sve", "true", "simd", "true", "avx2", "true"}, counts = {"ReverseBytesV" , " > 0 "}) >> >> After https://github.com/openjdk/jdk/pull/9509 merged, I think we'd better to consider 
different vm flags like "UseAVX", "UseSVE" for each architecture. For example, if the cpu feature mathes "sve", but user may set "-XX:UseSVE=0". With such options, this IR test will also run and I'm afraid it will fail with "-XX:UseSVE=0". > > CPU features list is populated during VM startup, this list is later ON queried by applyIfCPUFeature during IR validations, I do not think UseSVE is a valid flag for X86, > > > SPROMPT>java -XX:UseSVE=0 > Unrecognized VM option 'UseSVE=0' > Error: Could not create the Java Virtual Machine. > Error: A fatal exception has occurred. Program will exit. > SPROMPT>lscpu > Architecture: x86_64 > CPU op-mode(s): 32-bit, 64-bit > Byte Order: Little Endian > Address sizes: 46 bits physical, 48 bits virtual > CPU(s): 72 > On-line CPU(s) list: 0-71 > Thread(s) per core: 2 > Core(s) per socket: 18 > Socket(s): 2 > NUMA node(s): 2 > Vendor ID: GenuineIntel > CPU family: 6 > Model: 85 > Model name: Intel(R) Xeon(R) Gold 6139 CPU @ 2.30GHz Yes, it's invalid on x86. So maybe you could add the limitation to the "requires", but seems this could make the codes complex. ------------- PR: https://git.openjdk.org/jdk/pull/9623 From xgong at openjdk.org Mon Jul 25 06:16:49 2022 From: xgong at openjdk.org (Xiaohong Gong) Date: Mon, 25 Jul 2022 06:16:49 GMT Subject: RFR: 8287794: Reverse*VNode::Identity problem [v2] In-Reply-To: References: Message-ID: On Mon, 25 Jul 2022 05:24:48 GMT, Jatin Bhateja wrote: >> src/hotspot/share/opto/vectornode.cpp line 1857: >> >>> 1855: if (n->is_predicated_using_blend()) { >>> 1856: return n; >>> 1857: } >> >> The change in this patch looks fine to me! Just a concern the previous codes, for patterns like: >> >> VectorBlend X (ReverseBytesV (ReverseBytesV Y)) MASK >> >> we will miss the transformation to: >> >> VectorBlend X Y MASK >> >> right? Thanks! 
> > No, since the flag Predicated_with_blend is set over nodes if operation itself is predicated, here both ReverseBytesV nodes are non-predicated ones and this check is part of identity routines of reverse* operations. Oh, right! What I mean is the case like: VectorBlend (ReverseBytesV X) (ReverseBytesV (ReverseBytesV X)) MASK ==> VectorBlend (ReverseBytesV X) X MASK which is the same case with `test_reversebytes_long_transform2` for non-predicated systems. ------------- PR: https://git.openjdk.org/jdk/pull/9623 From fyang at openjdk.org Mon Jul 25 07:34:49 2022 From: fyang at openjdk.org (Fei Yang) Date: Mon, 25 Jul 2022 07:34:49 GMT Subject: RFR: 8290154: [JVMCI] Implement JVMCI for RISC-V In-Reply-To: References: Message-ID: On Thu, 21 Jul 2022 10:18:05 GMT, Sacha Coppey wrote: > This patch adds a JVMCI implementation for RISC-V. It creates the jdk.vm.ci.riscv64 and jdk.vm.ci.hotspot.riscv64 packages, as well as implements a part of jvmciCodeInstaller_riscv64.cpp. To check for correctness, it enables JVMCI code installation tests on RISC-V. It should be tested soon in GraalVM Native Image as well. 
Hi, I see some JVM crash when I try the following test with fastdebug build with your patch: test/hotspot/jtreg/compiler/jvmci/jdk.vm.ci.code.test/src/jdk/vm/ci/code/test/SimpleDebugInfoTest.java //////////////////////////////////////////////////// Internal Error (/home/fyang/openjdk-jdk/src/hotspot/cpu/riscv/nativeInst_riscv.cpp:118), pid=1154 063, tid=1154084 assert(NativeCall::is_call_at((address)this)) failed: unexpected code at call site //////////////////////////////////////////////////// ------------- PR: https://git.openjdk.org/jdk/pull/9587 From jbhateja at openjdk.org Mon Jul 25 07:51:48 2022 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Mon, 25 Jul 2022 07:51:48 GMT Subject: RFR: 8287794: Reverse*VNode::Identity problem [v2] In-Reply-To: References: Message-ID: On Mon, 25 Jul 2022 06:13:09 GMT, Xiaohong Gong wrote: >> No, since the flag Predicated_with_blend is set over nodes if operation itself is predicated, here both ReverseBytesV nodes are non-predicated ones and this check is part of identity routines of reverse* operations. > > Oh, right! What I mean is the case like: > > VectorBlend (ReverseBytesV X) (ReverseBytesV (ReverseBytesV X)) MASK ==> VectorBlend (ReverseBytesV X) X MASK > > which is the same case with `test_reversebytes_long_transform2` for non-predicated systems. As already [discussed](https://github.com/openjdk/panama-vector/pull/182#discussion_r927419799), we can handle this as a separate PR along with other complimentary operations. This bug fix patch is fixing a specific issue. >> CPU features list is populated during VM startup, this list is later ON queried by applyIfCPUFeature during IR validations, I do not think UseSVE is a valid flag for X86, >> >> >> SPROMPT>java -XX:UseSVE=0 >> Unrecognized VM option 'UseSVE=0' >> Error: Could not create the Java Virtual Machine. >> Error: A fatal exception has occurred. Program will exit. 
>> SPROMPT>lscpu >> Architecture: x86_64 >> CPU op-mode(s): 32-bit, 64-bit >> Byte Order: Little Endian >> Address sizes: 46 bits physical, 48 bits virtual >> CPU(s): 72 >> On-line CPU(s) list: 0-71 >> Thread(s) per core: 2 >> Core(s) per socket: 18 >> Socket(s): 2 >> NUMA node(s): 2 >> Vendor ID: GenuineIntel >> CPU family: 6 >> Model: 85 >> Model name: Intel(R) Xeon(R) Gold 6139 CPU @ 2.30GHz > > Yes, it's invalid on x86. So maybe you could add the limitation to the "requires", but seems this could make the codes complex. Correct. ------------- PR: https://git.openjdk.org/jdk/pull/9623 From xgong at openjdk.org Mon Jul 25 07:59:53 2022 From: xgong at openjdk.org (Xiaohong Gong) Date: Mon, 25 Jul 2022 07:59:53 GMT Subject: RFR: 8287794: Reverse*VNode::Identity problem [v2] In-Reply-To: References: Message-ID: <7RSnILoD_OseCCAaGJh99dWWM9zxo3_I8ohW3lL-NxE=.df5f1dc8-6552-47d2-afb0-70af7b9dcb99@github.com> On Mon, 25 Jul 2022 07:48:37 GMT, Jatin Bhateja wrote: >> Oh, right! What I mean is the case like: >> >> VectorBlend (ReverseBytesV X) (ReverseBytesV (ReverseBytesV X)) MASK ==> VectorBlend (ReverseBytesV X) X MASK >> >> which is the same case with `test_reversebytes_long_transform2` for non-predicated systems. > > As already [discussed](https://github.com/openjdk/panama-vector/pull/182#discussion_r927419799), we can handle this as a separate PR along with other complimentary operations. This bug fix patch is fixing a specific issue. Agree, thanks! 
------------- PR: https://git.openjdk.org/jdk/pull/9623 From thartmann at openjdk.org Mon Jul 25 09:07:52 2022 From: thartmann at openjdk.org (Tobias Hartmann) Date: Mon, 25 Jul 2022 09:07:52 GMT Subject: RFR: 8290705: StringConcat::validate_mem_flow asserts with "unexpected user: StoreI" [v2] In-Reply-To: References: Message-ID: On Fri, 22 Jul 2022 05:38:01 GMT, Tobias Hartmann wrote: >> C2's string concatenation optimization (`OptimizeStringConcat`) does not correctly handle side effecting instructions between StringBuffer Allocate/Initialize and the call to the constructor. In the failing test, see `SideEffectBeforeConstructor::test`, a `result` field is incremented just before the constructor is invoked. The string concatenation optimization still merges the allocation, constructor and `toString` calls and incorrectly re-wires the store to before the concatenation. As a result, passing `null` to the constructor will incorrectly increment the field before throwing a NullPointerException. With a debug build, we hit an assert in `StringConcat::validate_mem_flow` due to the unexpected field store. This is an old bug and an extreme edge case as javac would not generate such code. >> >> The following comment suggests that this case should be covered by `StringConcat::validate_control_flow()`: >> https://github.com/openjdk/jdk/blob/3582fd9e93d9733c6fdf1f3848e0a093d44f6865/src/hotspot/share/opto/stringopts.cpp#L834-L838 >> >> However, the control flow analysis does not catch this case. I added the missing check. >> >> Thanks, >> Tobias > > Tobias Hartmann has updated the pull request incrementally with one additional commit since the last revision: > > Modified debug printing code Thanks, Vladimir! 
------------- PR: https://git.openjdk.org/jdk/pull/9589 From thartmann at openjdk.org Mon Jul 25 09:09:00 2022 From: thartmann at openjdk.org (Tobias Hartmann) Date: Mon, 25 Jul 2022 09:09:00 GMT Subject: RFR: 8290730: compiler/vectorization/TestAutoVecIntMinMax.java failed with "IRViolationException: There were one or multiple IR rule failures." In-Reply-To: <2VU2x5lH3EHO00HwlqqOdVR-r1T_wIAABL-PUt2U5Uk=.e0b4e158-fa0f-4bea-88ca-e423c23d1dbf@github.com> References: <2VU2x5lH3EHO00HwlqqOdVR-r1T_wIAABL-PUt2U5Uk=.e0b4e158-fa0f-4bea-88ca-e423c23d1dbf@github.com> Message-ID: On Fri, 22 Jul 2022 09:40:37 GMT, Bhavana-Kilambi wrote: > ? "IRViolationException: There were one or multiple IR rule failures." > > The IR test - TestAutoVecIntMinMax.java was introduced in https://bugs.openjdk.org/browse/JDK-8288107 to test IR generation of MaxV and MinV nodes when the MinI/MaxI nodes are auto-vectorized. > However, the corresponding vector ISA support for min/max on x64 machines is only available in SSE versions > 3 and AVX. > The "@requires" annotation in the JTREG test has been modified to use the whitelisted flags instead. > Deleting the entry for this JTREG test in test/hotspot/jtreg/ProblemList.txt. Marked as reviewed by thartmann (Reviewer). ------------- PR: https://git.openjdk.org/jdk/pull/9610 From duke at openjdk.org Mon Jul 25 09:12:09 2022 From: duke at openjdk.org (Bhavana-Kilambi) Date: Mon, 25 Jul 2022 09:12:09 GMT Subject: Integrated: 8290730: compiler/vectorization/TestAutoVecIntMinMax.java failed with "IRViolationException: There were one or multiple IR rule failures." In-Reply-To: <2VU2x5lH3EHO00HwlqqOdVR-r1T_wIAABL-PUt2U5Uk=.e0b4e158-fa0f-4bea-88ca-e423c23d1dbf@github.com> References: <2VU2x5lH3EHO00HwlqqOdVR-r1T_wIAABL-PUt2U5Uk=.e0b4e158-fa0f-4bea-88ca-e423c23d1dbf@github.com> Message-ID: On Fri, 22 Jul 2022 09:40:37 GMT, Bhavana-Kilambi wrote: > ? "IRViolationException: There were one or multiple IR rule failures." 
> > The IR test - TestAutoVecIntMinMax.java was introduced in https://bugs.openjdk.org/browse/JDK-8288107 to test IR generation of MaxV and MinV nodes when the MinI/MaxI nodes are auto-vectorized. > However, the corresponding vector ISA support for min/max on x64 machines is only available in SSE versions > 3 and AVX. > The "@requires" annotation in the JTREG test has been modified to use the whitelisted flags instead. > Deleting the entry for this JTREG test in test/hotspot/jtreg/ProblemList.txt. This pull request has now been integrated. Changeset: 80dc6ceb Author: Bhavana Kilambi Committer: Tobias Hartmann URL: https://git.openjdk.org/jdk/commit/80dc6cebc90f7ed5c4a262e2dcd3bd54ce71eab1 Stats: 4 lines in 2 files changed: 0 ins; 2 del; 2 mod 8290730: compiler/vectorization/TestAutoVecIntMinMax.java failed with "IRViolationException: There were one or multiple IR rule failures." Reviewed-by: jiefu, kvn, thartmann ------------- PR: https://git.openjdk.org/jdk/pull/9610 From thartmann at openjdk.org Mon Jul 25 10:33:22 2022 From: thartmann at openjdk.org (Tobias Hartmann) Date: Mon, 25 Jul 2022 10:33:22 GMT Subject: RFR: 8283091: Support type conversion between different data sizes in SLP [v9] In-Reply-To: <7lCOZoReMvWJnID_7hsmiVFqy1Xt05x5hmSaoLykzV0=.0bb39ca4-a1d7-4856-bb75-ea92ec8f5ea0@github.com> References: <4qFrtxwRHZfnzJbdVl2z7l6XENvxR3N8Sruc2fraKkw=.1391f639-1514-4e50-899a-d56cdd4afa99@github.com> <7lCOZoReMvWJnID_7hsmiVFqy1Xt05x5hmSaoLykzV0=.0bb39ca4-a1d7-4856-bb75-ea92ec8f5ea0@github.com> Message-ID: On Tue, 14 Jun 2022 01:49:34 GMT, Fei Gao wrote: >> After JDK-8275317, C2's SLP vectorizer has supported type conversion between the same data size. 
We can also support conversions between different data sizes like:
>>     int <-> double
>>     float <-> long
>>     int <-> long
>>     float <-> double
>>
>> A typical test case:
>>
>>     int[] a;
>>     double[] b;
>>     for (int i = start; i < limit; i++) {
>>         b[i] = (double) a[i];
>>     }
>>
>> Our expected OptoAssembly code for one iteration is like below:
>>
>>     add R12, R2, R11, LShiftL #2
>>     vector_load V16,[R12, #16]
>>     vectorcast_i2d V16, V16 # convert I to D vector
>>     add R11, R1, R11, LShiftL #3 # ptr
>>     add R13, R11, #16 # ptr
>>     vector_store [R13], V16
>>
>> To enable the vectorization, the patch solves the following problems in the SLP.
>>
>> There are three main operations in the case above, LoadI, ConvI2D and StoreD. Assuming that the vector length is 128 bits, how many scalar nodes should be packed together to a vector? If we decide it separately for each operation node, like what we did before the patch in SuperWord::combine_packs(), a 128-bit vector will support 4 LoadI or 2 ConvI2D or 2 StoreD nodes. However, if we put these packed nodes in a vector node sequence, like loading 4 elements to a vector, then typecasting 2 elements and lastly storing these 2 elements, they become invalid. As a result, we should look through the whole def-use chain and then pick the minimum of these element counts, like the function SuperWord::max_vector_size_in_ud_chain() does in superword.cpp. In this case, we pack 2 LoadI, 2 ConvI2D and 2 StoreD nodes, and then generate a valid vector node sequence, like loading 2 elements, converting the 2 elements to another type and storing the 2 elements with the new type.
>>
>> After this, LoadI nodes don't make full use of the whole vector and only occupy part of it. So we adapt the code in SuperWord::get_vw_bytes_special() to the situation.
>>
>> In SLP, we calculate a kind of alignment as position trace for each scalar node in the whole vector. In this case, the alignments for the 2 LoadI nodes are 0, 4 while the alignments for the 2 ConvI2D nodes are 0, 8.
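The pack-size and alignment arithmetic described above can be worked through in a few lines. This is an illustrative sketch only, not code from the patch: the class and method names are hypothetical, and it assumes the pack size is simply the vector width divided by the widest element in the def-use chain.

```java
import java.util.Arrays;

final class PackSizeSketch {
    // Number of elements of the given size that fit in one vector.
    static int lanes(int vectorBytes, int elemBytes) {
        return vectorBytes / elemBytes;
    }

    // Lanes usable by a whole def-use chain: the minimum over all
    // element sizes in the chain, i.e. limited by the widest element.
    static int lanesForChain(int vectorBytes, int... elemBytes) {
        int min = Integer.MAX_VALUE;
        for (int size : elemBytes) {
            min = Math.min(min, lanes(vectorBytes, size));
        }
        return min;
    }

    // Byte offsets ("alignments") of the packed scalars of one element size.
    static int[] alignments(int laneCount, int elemBytes) {
        int[] result = new int[laneCount];
        for (int i = 0; i < laneCount; i++) {
            result[i] = i * elemBytes;
        }
        return result;
    }

    public static void main(String[] args) {
        int vectorBytes = 16; // 128-bit vector
        // LoadI reads 4-byte ints; ConvI2D and StoreD produce/store 8-byte doubles.
        int n = lanesForChain(vectorBytes, 4, 8, 8);
        System.out.println(n);                                 // 2
        System.out.println(Arrays.toString(alignments(n, 4))); // [0, 4]
        System.out.println(Arrays.toString(alignments(n, 8))); // [0, 8]
    }
}
```

Decided per node, the same vector would hold 4 ints but only 2 doubles; taking the minimum over the chain yields the 2-element packs and the 0, 4 versus 0, 8 alignments mentioned in the description.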
Sometimes, 4 for LoadI and 8 for ConvI2D work the same, both of which mark that this node is the second node in the whole vector, while the difference between 4 and 8 are just because of their own data sizes. In this situation, we should try to remove the impact caused by different data size in SLP. For example, in the stage of SuperWord::extend_packlist(), while determining if it's potential to pack a pair of def nodes in the function SuperWord::follow_use_defs(), we remove the side effect of different data size by transforming the target alignment from the use node. Because we believe that, assuming that the vector length is 512 bits, if the ConvI2D use nodes have alignments of 16-24 and their def nodes, LoadI, have alignments of 8-12, these two LoadI nodes should be packed as a pair as well.
>>
>> Similarly, when determining if the vectorization is profitable, type conversion between different data size takes a type of one size and produces a type of another size, hence the special checks on alignment and size should be applied, like what we do in SuperWord::is_vector_use().
>>
>> After solving these problems, we successfully implemented the vectorization of type conversion between different data sizes.
>>
>> Here is the test data (-XX:+UseSuperWord) on NEON:
>>
>> Before the patch:
>> Benchmark   (length)  Mode  Cnt    Score    Error  Units
>> convertD2F       523  avgt   15  216.431 ±  0.131  ns/op
>> convertD2I       523  avgt   15  220.522 ±  0.311  ns/op
>> convertF2D       523  avgt   15  217.034 ±  0.292  ns/op
>> convertF2L       523  avgt   15  231.634 ±  1.881  ns/op
>> convertI2D       523  avgt   15  229.538 ±  0.095  ns/op
>> convertI2L       523  avgt   15  214.822 ±  0.131  ns/op
>> convertL2F       523  avgt   15  230.188 ±  0.217  ns/op
>> convertL2I       523  avgt   15  162.234 ±  0.235  ns/op
>>
>> After the patch:
>> Benchmark   (length)  Mode  Cnt    Score    Error  Units
>> convertD2F       523  avgt   15  124.352 ±  1.079  ns/op
>> convertD2I       523  avgt   15  557.388 ±  8.166  ns/op
>> convertF2D       523  avgt   15  118.082 ±  4.026  ns/op
>> convertF2L       523  avgt   15  225.810 ± 11.180  ns/op
>> convertI2D       523  avgt   15  166.247 ±  0.120  ns/op
>> convertI2L       523  avgt   15  119.699 ±  2.925  ns/op
>> convertL2F       523  avgt   15  220.847 ±  0.053  ns/op
>> convertL2I       523  avgt   15  122.339 ±  2.738  ns/op
>>
>> perf data on X86:
>> Before the patch:
>> Benchmark   (length)  Mode  Cnt    Score    Error  Units
>> convertD2F       523  avgt   15  279.466 ±  0.069  ns/op
>> convertD2I       523  avgt   15  551.009 ±  7.459  ns/op
>> convertF2D       523  avgt   15  276.066 ±  0.117  ns/op
>> convertF2L       523  avgt   15  545.108 ±  5.697  ns/op
>> convertI2D       523  avgt   15  745.303 ±  0.185  ns/op
>> convertI2L       523  avgt   15  260.878 ±  0.044  ns/op
>> convertL2F       523  avgt   15  502.016 ±  0.172  ns/op
>> convertL2I       523  avgt   15  261.654 ±  3.326  ns/op
>>
>> After the patch:
>> Benchmark   (length)  Mode  Cnt    Score    Error  Units
>> convertD2F       523  avgt   15  106.975 ±  0.045  ns/op
>> convertD2I       523  avgt   15  546.866 ±  9.287  ns/op
>> convertF2D       523  avgt   15   82.414 ±  0.340  ns/op
>> convertF2L       523  avgt   15  542.235 ±  2.785  ns/op
>> convertI2D       523  avgt   15   92.966 ±  1.400  ns/op
>> convertI2L       523  avgt   15   79.960 ±  0.528  ns/op
>> convertL2F       523  avgt   15  504.712 ±  4.794  ns/op
>> convertL2I       523  avgt   15  129.753 ±  0.094  ns/op
>>
>> perf data on AVX512:
>> Before the patch:
>> Benchmark   (length)  Mode  Cnt    Score    Error  Units
>> convertD2F       523  avgt   15  282.984 ±  4.022  ns/op
>> convertD2I       523  avgt   15  543.080 ±  3.873  ns/op
>> convertF2D       523  avgt   15  273.950 ±  0.131  ns/op
>> convertF2L       523  avgt   15  539.568 ±  2.747  ns/op
>> convertI2D       523  avgt   15  745.238 ±  0.069  ns/op
>> convertI2L       523  avgt   15  260.935 ±  0.169  ns/op
>> convertL2F       523  avgt   15  501.870 ±  0.359  ns/op
>> convertL2I       523  avgt   15  257.508 ±  0.174  ns/op
>>
>> After the patch:
>> Benchmark   (length)  Mode  Cnt    Score    Error  Units
>> convertD2F       523  avgt   15   76.687 ±  0.530  ns/op
>> convertD2I       523  avgt   15  545.408 ±  4.657  ns/op
>> convertF2D       523  avgt   15  273.935 ±  0.099  ns/op
>> convertF2L       523  avgt   15  540.534 ±  3.032  ns/op
>> convertI2D       523  avgt   15  745.234 ±  0.053  ns/op
>> convertI2L       523  avgt   15  260.865 ±  0.104  ns/op
>> convertL2F       523  avgt   15   63.834 ±  4.777  ns/op
>> convertL2I       523  avgt   15   48.183 ±  0.990  ns/op
>
> Fei Gao has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 13 commits:
>
> - Add an IR framework testcase
>
>   Change-Id: Ifbcc8d233aa27dfe93acef548c7e42721d86376e
> - Merge branch 'master' into fg8283091
>
>   Change-Id: I9525ae9310c3c493da29490d034cbb8f223e7f80
> - Update to the latest JDK and fix the function name
>
>   Change-Id: Ie1907f86e2df7051aa2ddb7e5b05a371e887d1bc
> - Merge branch 'master' into fg8283091
>
>   Change-Id: I3ef746178c07004cc34c22081a3044fb40e87702
> - Add assertion line for opcode() and withdraw some common code as a function
>
>   Change-Id: I7b5dbe60fec6979de454f347d074e6fc01126dfe
> - Merge branch 'master' into fg8283091
>
>   Change-Id: I42bec08da55e86fb1f049bb691138f3fcf6dbed6
> - Implement an interface for auto-vectorization to consult supported match rules
>
>   Change-Id: I8dcfae69a40717356757396faa06ae2d6015d701
> - Merge branch 'master' into fg8283091
>
>   Change-Id: Ieb9a530571926520e478657159d9eea1b0f8a7dd
> - Merge branch 'master' into fg8283091
>
>   Change-Id: I8deeae48449f1fc159c9bb5f82773e1bc6b5105f
> - Merge branch 'master' into fg8283091
>
>   Change-Id: I1dfb4a6092302267e3796e08d411d0241b23df83
> - ... and 3 more: https://git.openjdk.org/jdk/compare/f1143b1b...49e6f56e

This change introduced a regression, see [JDK-8290910](https://bugs.openjdk.org/browse/JDK-8290910).

-------------

PR: https://git.openjdk.org/jdk/pull/7806

From duke at openjdk.org Mon Jul 25 12:02:30 2022
From: duke at openjdk.org (Sacha Coppey)
Date: Mon, 25 Jul 2022 12:02:30 GMT
Subject: RFR: 8290154: [JVMCI] Implement JVMCI for RISC-V [v2]
In-Reply-To:
References:
Message-ID: <4RWjt5pOsf8Qswdf7ViTiJMLkvdyNQ6KwVSuj6X09bo=.ac7e8054-280d-4893-9d9f-00d3b36ce813@github.com>

> This patch adds a JVMCI implementation for RISC-V.
It creates the jdk.vm.ci.riscv64 and jdk.vm.ci.hotspot.riscv64 packages, as well as implements a part of jvmciCodeInstaller_riscv64.cpp. To check for correctness, it enables JVMCI code installation tests on RISC-V. It should be tested soon in GraalVM Native Image as well. Sacha Coppey has refreshed the contents of this pull request, and previous commits have been removed. The incremental views will show differences compared to the previous content of the PR. The pull request contains one new commit since the last revision: 8290154: [JVMCI] Implement JVMCI for RISC-V ------------- Changes: - all: https://git.openjdk.org/jdk/pull/9587/files - new: https://git.openjdk.org/jdk/pull/9587/files/df247c0b..68882a86 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=9587&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=9587&range=00-01 Stats: 7 lines in 1 file changed: 2 ins; 1 del; 4 mod Patch: https://git.openjdk.org/jdk/pull/9587.diff Fetch: git fetch https://git.openjdk.org/jdk pull/9587/head:pull/9587 PR: https://git.openjdk.org/jdk/pull/9587 From thartmann at openjdk.org Mon Jul 25 12:55:04 2022 From: thartmann at openjdk.org (Tobias Hartmann) Date: Mon, 25 Jul 2022 12:55:04 GMT Subject: RFR: 8289996: Fix array range check hoisting for some scaled loop iv [v3] In-Reply-To: References: Message-ID: On Fri, 22 Jul 2022 03:57:22 GMT, Pengfei Li wrote: >> Recently we found some array range checks in loops are not hoisted by >> C2's loop predication phase as expected. Below is a typical case. >> >> for (int i = 0; i < size; i++) { >> b[3 * i] = a[3 * i]; >> } >> >> Ideally, C2 can hoist the range check of an array access in loop if the >> array index is a linear function of the loop's induction variable (iv). >> Say, range check in `arr[exp]` can be hoisted if >> >> exp = k1 * iv + k2 + inv >> >> where `k1` and `k2` are compile-time constants, and `inv` is an optional >> loop invariant. 
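To make the linear index form above concrete, here is a small sketch of why such a range check can be decided once, outside the loop. It is a model under stated assumptions, not C2's implementation: the class and method names are hypothetical, the loop is assumed to run `iv` from 0 to `size - 1` in steps of 1, and `long` arithmetic is used to sidestep overflow.

```java
final class RangeCheckHoistSketch {
    // Returns true if k1 * iv + k2 + inv stays inside [0, length)
    // for every iv in [0, size). A linear function is monotonic,
    // so checking the first and last iteration is sufficient.
    static boolean allInRange(int k1, int k2, int inv, int size, int length) {
        if (size <= 0) {
            return true; // the loop body never runs
        }
        long first = (long) k2 + inv;                  // index at iv = 0
        long last = (long) k1 * (size - 1) + k2 + inv; // index at iv = size - 1
        return Math.min(first, last) >= 0 && Math.max(first, last) < length;
    }

    // Brute-force reference: checks every iteration, as un-hoisted code would.
    static boolean allInRangeSlow(int k1, int k2, int inv, int size, int length) {
        for (int iv = 0; iv < size; iv++) {
            long idx = (long) k1 * iv + k2 + inv;
            if (idx < 0 || idx >= length) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // b[3 * i] = a[3 * i] for i in [0, 4) touches indices 0, 3, 6, 9.
        System.out.println(allInRange(3, 0, 0, 4, 10)); // true
        System.out.println(allInRange(3, 0, 0, 5, 10)); // false: index 12 is out of range
    }
}
```

The single check agrees with the per-iteration check whenever the index really is linear in `iv`, which is why recognizing the `k1 * iv + k2 + inv` shape matters for hoisting.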
But in above case, C2 igvn does some strength reduction >> on the `MulINode` used to compute `3 * i`. It results in the linear index >> expression not being recognized. So far we found 2 ideal transformations >> that may affect linear expression recognition. They are >> >> - `k * iv` --> `iv << m + iv << n` if k is the sum of 2 pow-of-2 values >> - `k * iv` --> `iv << m - iv` if k+1 is a pow-of-2 value >> >> To avoid range check hoisting and further optimizations being broken, we >> have tried improving the linear recognition. But after some experiments, >> we found complex and recursive pattern match does not always work well. >> In this patch we propose to defer these 2 ideal transformations to the >> phase of post loop igvn. In other words, these 2 strength reductions can >> only be done after all loop optimizations are over. >> >> Tested hotspot::hotspot_all_no_apps, jdk::tier1~3 and langtools::tier1. >> We also tested the performance via JMH and see obvious improvement. >> >> Benchmark Improvement >> RangeCheckHoisting.ivScaled3 +21.2% >> RangeCheckHoisting.ivScaled7 +6.6% > > Pengfei Li has updated the pull request incrementally with one additional commit since the last revision: > > Address more comments Looks good to me. I resubmitted testing and will report back once it passed. ------------- Marked as reviewed by thartmann (Reviewer). PR: https://git.openjdk.org/jdk/pull/9508 From thartmann at openjdk.org Mon Jul 25 12:57:10 2022 From: thartmann at openjdk.org (Tobias Hartmann) Date: Mon, 25 Jul 2022 12:57:10 GMT Subject: RFR: 8290834: Improve potentially confusing documentation on collection of profiling information [v7] In-Reply-To: References: Message-ID: On Sat, 23 Jul 2022 13:59:59 GMT, Julian Waters wrote: >> Documentation on the MethodData object incorrectly states that it is used when profiling in tiers 0 and 1, when it only does so for tier 0 (Interpreter) and tier 3, while tier 1 (Fully optimizing C1) does not collect any profile data at all. 
Additionally, the description for the different execution tiers is slightly confusing. This cleanup attempts to slightly better clarify how profiling is tied together between the Interpreter and C1, explain what MDO is an abbreviation for (MethodData object), and corrects the documentation for MethodData as well. > > Julian Waters has updated the pull request incrementally with one additional commit since the last revision: > > Fixup Looks good to me. ------------- Marked as reviewed by thartmann (Reviewer). PR: https://git.openjdk.org/jdk/pull/9598 From thartmann at openjdk.org Mon Jul 25 13:25:15 2022 From: thartmann at openjdk.org (Tobias Hartmann) Date: Mon, 25 Jul 2022 13:25:15 GMT Subject: RFR: 8290485: [vector] REVERSE_BYTES for byte type should not emit any instructions In-Reply-To: References: Message-ID: On Wed, 20 Jul 2022 06:58:53 GMT, Xiaohong Gong wrote: > The Vector API unary operation "`REVERSE_BYTES`" should not emit any instructions for byte vectors. The same to the relative masked operation. Currently it emits `"mov dst, src"` on aarch64 when the "`dst`" and "`src`" are not the same register. But for the masked "`REVERSE_BYTES`", the compiler will always generate a "`VectorBlend`" which I think is redundant, since the first and second vector input is the same one. Please see the generated codes for the masked "`REVERSE_BYTES`" for byte type with NEON: > > ldr q16, [x15, #16] ; load the "src" vector > mov v17.16b, v16.16b ; reverse bytes "src" > ldr q18, [x13, #16] > neg v18.16b, v18.16b ; load the vector mask > bsl v18.16b, v17.16b, v16.16b ; vector blend > > The elements in register "`v17`" and "`v16`" are the same to each other, so the elements in result of "`bsl`" is the same to the original loaded values in "`v16`", no matter what the values in the vector mask are. > > To improve this, we can add the igvn transformations for "`ReverseBytesV`" and "`VectorBlend`" in compiler. 
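The two identities behind this proposal can be checked with a scalar model. The sketch below is not HotSpot or Vector API code: vectors are plain `byte[]` arrays, the helper names are hypothetical, the blend is assumed to take its second input where the mask is set, and the array length is assumed to be a multiple of the lane size.

```java
import java.util.Arrays;

final class ByteBlendSketch {
    // Reverses the bytes inside each lane of laneBytes bytes; models ReverseBytesV.
    static byte[] reverseBytesPerLane(byte[] v, int laneBytes) {
        byte[] r = new byte[v.length];
        for (int base = 0; base + laneBytes <= v.length; base += laneBytes) {
            for (int j = 0; j < laneBytes; j++) {
                r[base + j] = v[base + laneBytes - 1 - j];
            }
        }
        return r;
    }

    // Models VectorBlend: lane i comes from b where mask[i] is set, else from a.
    static byte[] blend(byte[] a, byte[] b, boolean[] mask) {
        byte[] r = new byte[a.length];
        for (int i = 0; i < a.length; i++) {
            r[i] = mask[i] ? b[i] : a[i];
        }
        return r;
    }

    public static void main(String[] args) {
        byte[] src = {1, 2, 3, 4};
        boolean[] mask = {true, false, true, false};

        // With one-byte lanes there is nothing to reverse.
        byte[] reversed = reverseBytesPerLane(src, 1);
        System.out.println(Arrays.equals(reversed, src)); // true

        // Blending a vector with an identical vector ignores the mask.
        System.out.println(Arrays.equals(blend(src, reversed, mask), src)); // true
    }
}
```

For wider lanes the reversal does change the data (for example, two-byte lanes turn `{1, 2, 3, 4}` into `{2, 1, 4, 3}`), so the first identity only applies to byte elements.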
For "`ReverseBytesV`", it can return the vector input if the basic element type is `T_BYTE`. And for "`VectorBlend`", it can return the first input if the first and the second input are the same one. > > Here is the performance data for the jmh benchmark [1] on ARM NEON: > > Benchmark (size) Mode Cnt Before After Units > ByteMaxVector.REVERSE_BYTES 1024 thrpt 15 19457.641 19516.124 ops/ms > ByteMaxVector.REVERSE_BYTESMasked 1024 thrpt 15 12498.416 20528.004 ops/ms > > This patch may not have any influence to the non-masked "`REVERSE_BYTES`" on ARM NEON, because the backend may not emit any instruction for it before. > > And here is the performance data on an x86 system: > > Benchmark (size) Mode Cnt Before After Units > ByteMaxVector.REVERSE_BYTES 1024 thrpt 15 19358.941 20012.047 ops/ms > ByteMaxVector.REVERSE_BYTESMasked 1024 thrpt 15 15759.788 20389.996 ops/ms > > [1] https://github.com/openjdk/panama-vector/blob/vectorIntrinsics/test/micro/org/openjdk/bench/jdk/incubator/vector/operation/ByteMaxVector.java#L2201 Looks good to me. I'll run some testing and report back once it passed. ------------- Marked as reviewed by thartmann (Reviewer). PR: https://git.openjdk.org/jdk/pull/9565 From thartmann at openjdk.org Mon Jul 25 13:37:01 2022 From: thartmann at openjdk.org (Tobias Hartmann) Date: Mon, 25 Jul 2022 13:37:01 GMT Subject: RFR: 8287794: Reverse*VNode::Identity problem [v2] In-Reply-To: <3q1AbVHPhgTmbtzqVBmQtCsfrbbW64Kk6I8aUEJ0oTY=.ade9b2ce-f008-42b9-a3fc-ed69ba3580d1@github.com> References: <3q1AbVHPhgTmbtzqVBmQtCsfrbbW64Kk6I8aUEJ0oTY=.ade9b2ce-f008-42b9-a3fc-ed69ba3580d1@github.com> Message-ID: On Mon, 25 Jul 2022 05:29:05 GMT, Jatin Bhateja wrote: >> Hi All, >> >> - This bug fix patch fixes a missing case during reverse[bits|bytes] identity transformation. >> - Unlike AARCH64(SVE), X86(AVX512) ISA has no direct instruction to reverse[bits|bytes] of a vector lane hence a predicated operation is supported through blend instruction. 
>> - New IR framework based tests has been added to test transforms relevant to AVX2, AVX512 and SVE. >> >> Kindly review and share your feedback. >> >> Best Regards, >> Jatin > > Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: > > 8287794: Review comments resolved. Changes requested by thartmann (Reviewer). src/hotspot/share/opto/vectornode.cpp line 1864: > 1862: // OperationV (OperationV X) => X > 1863: } else if (!n->is_predicated_vector() && !in1->is_predicated_vector()) { > 1864: return in1->in(1); But this will still trigger the SonarCloud warning originally reported by @shipilev because both if and else branch contain the same code, right? Shouldn't the conditions be merged? ------------- PR: https://git.openjdk.org/jdk/pull/9623 From duke at openjdk.org Mon Jul 25 13:41:04 2022 From: duke at openjdk.org (Sacha Coppey) Date: Mon, 25 Jul 2022 13:41:04 GMT Subject: RFR: 8290154: [JVMCI] Implement JVMCI for RISC-V [v3] In-Reply-To: References: Message-ID: <8aRWtlLJUymEF1hJG0jEHZrPAE_W66D1yNPNCPWuPBs=.08b33aa3-c785-408a-a5f6-3d38fa739737@github.com> > This patch adds a JVMCI implementation for RISC-V. It creates the jdk.vm.ci.riscv64 and jdk.vm.ci.hotspot.riscv64 packages, as well as implements a part of jvmciCodeInstaller_riscv64.cpp. To check for correctness, it enables JVMCI code installation tests on RISC-V. It should be tested soon in GraalVM Native Image as well. 
Sacha Coppey has updated the pull request incrementally with one additional commit since the last revision: Use nativeInstruction_at instead of nativeCall_at to avoid wrongly initializating a call ------------- Changes: - all: https://git.openjdk.org/jdk/pull/9587/files - new: https://git.openjdk.org/jdk/pull/9587/files/68882a86..925a2651 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=9587&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=9587&range=01-02 Stats: 8 lines in 1 file changed: 1 ins; 1 del; 6 mod Patch: https://git.openjdk.org/jdk/pull/9587.diff Fetch: git fetch https://git.openjdk.org/jdk pull/9587/head:pull/9587 PR: https://git.openjdk.org/jdk/pull/9587 From shade at openjdk.org Mon Jul 25 13:41:06 2022 From: shade at openjdk.org (Aleksey Shipilev) Date: Mon, 25 Jul 2022 13:41:06 GMT Subject: RFR: 8287794: Reverse*VNode::Identity problem [v2] In-Reply-To: References: <3q1AbVHPhgTmbtzqVBmQtCsfrbbW64Kk6I8aUEJ0oTY=.ade9b2ce-f008-42b9-a3fc-ed69ba3580d1@github.com> Message-ID: On Mon, 25 Jul 2022 13:33:28 GMT, Tobias Hartmann wrote: >> Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: >> >> 8287794: Review comments resolved. > > src/hotspot/share/opto/vectornode.cpp line 1864: > >> 1862: // OperationV (OperationV X) => X >> 1863: } else if (!n->is_predicated_vector() && !in1->is_predicated_vector()) { >> 1864: return in1->in(1); > > But this will still trigger the SonarCloud warning originally reported by @shipilev because both if and else branch contain the same code, right? Shouldn't the conditions be merged? I don't think it would trigger a warning. The original warning, as I understand it, was to say that the *unpredicated* `else` branch is the same. So we were guaranteed to take either of branches, and thus the same code, irrelevant of the predicate. 
It is not the same here: we now have a third path, going out without entering either branch :) ------------- PR: https://git.openjdk.org/jdk/pull/9623 From thartmann at openjdk.org Mon Jul 25 14:25:49 2022 From: thartmann at openjdk.org (Tobias Hartmann) Date: Mon, 25 Jul 2022 14:25:49 GMT Subject: RFR: 8287794: Reverse*VNode::Identity problem [v2] In-Reply-To: <3q1AbVHPhgTmbtzqVBmQtCsfrbbW64Kk6I8aUEJ0oTY=.ade9b2ce-f008-42b9-a3fc-ed69ba3580d1@github.com> References: <3q1AbVHPhgTmbtzqVBmQtCsfrbbW64Kk6I8aUEJ0oTY=.ade9b2ce-f008-42b9-a3fc-ed69ba3580d1@github.com> Message-ID: On Mon, 25 Jul 2022 05:29:05 GMT, Jatin Bhateja wrote: >> Hi All, >> >> - This bug fix patch fixes a missing case during reverse[bits|bytes] identity transformation. >> - Unlike AARCH64(SVE), X86(AVX512) ISA has no direct instruction to reverse[bits|bytes] of a vector lane hence a predicated operation is supported through blend instruction. >> - New IR framework based tests has been added to test transforms relevant to AVX2, AVX512 and SVE. >> >> Kindly review and share your feedback. >> >> Best Regards, >> Jatin > > Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: > > 8287794: Review comments resolved. Looks good to me. I submitted testing and will report back once it passed. ------------- Marked as reviewed by thartmann (Reviewer). 
PR: https://git.openjdk.org/jdk/pull/9623 From thartmann at openjdk.org Mon Jul 25 14:25:49 2022 From: thartmann at openjdk.org (Tobias Hartmann) Date: Mon, 25 Jul 2022 14:25:49 GMT Subject: RFR: 8287794: Reverse*VNode::Identity problem [v2] In-Reply-To: References: <3q1AbVHPhgTmbtzqVBmQtCsfrbbW64Kk6I8aUEJ0oTY=.ade9b2ce-f008-42b9-a3fc-ed69ba3580d1@github.com> Message-ID: On Mon, 25 Jul 2022 13:38:59 GMT, Aleksey Shipilev wrote: >> src/hotspot/share/opto/vectornode.cpp line 1864: >> >>> 1862: // OperationV (OperationV X) => X >>> 1863: } else if (!n->is_predicated_vector() && !in1->is_predicated_vector()) { >>> 1864: return in1->in(1); >> >> But this will still trigger the SonarCloud warning originally reported by @shipilev because both if and else branch contain the same code, right? Shouldn't the conditions be merged? > > I don't think it would trigger a warning. The original warning, as I understand it, was to say that the *unpredicated* `else` branch is the same. So we were guaranteed to take either of branches, and thus the same code, irrelevant of the predicate. It is not the same here: we now have a third path, going out without entering either branch :) Ah, right. That makes sense. I still think the branches could be merged but I don't have a strong opinion. ------------- PR: https://git.openjdk.org/jdk/pull/9623 From duke at openjdk.org Mon Jul 25 14:38:26 2022 From: duke at openjdk.org (Sacha Coppey) Date: Mon, 25 Jul 2022 14:38:26 GMT Subject: RFR: 8290154: [JVMCI] Implement JVMCI for RISC-V [v4] In-Reply-To: References: Message-ID: > This patch adds a JVMCI implementation for RISC-V. It creates the jdk.vm.ci.riscv64 and jdk.vm.ci.hotspot.riscv64 packages, as well as implements a part of jvmciCodeInstaller_riscv64.cpp. To check for correctness, it enables JVMCI code installation tests on RISC-V. It should be tested soon in GraalVM Native Image as well.
Sacha Coppey has updated the pull request incrementally with one additional commit since the last revision: Avoid using set_destination when call is not jal ------------- Changes: - all: https://git.openjdk.org/jdk/pull/9587/files - new: https://git.openjdk.org/jdk/pull/9587/files/925a2651..9f7cbf6c Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=9587&range=03 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=9587&range=02-03 Stats: 13 lines in 2 files changed: 0 ins; 7 del; 6 mod Patch: https://git.openjdk.org/jdk/pull/9587.diff Fetch: git fetch https://git.openjdk.org/jdk pull/9587/head:pull/9587 PR: https://git.openjdk.org/jdk/pull/9587 From kvn at openjdk.org Mon Jul 25 17:20:40 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Mon, 25 Jul 2022 17:20:40 GMT Subject: RFR: 8290834: Improve potentially confusing documentation on collection of profiling information [v7] In-Reply-To: References: Message-ID: On Sat, 23 Jul 2022 13:59:59 GMT, Julian Waters wrote: >> Documentation on the MethodData object incorrectly states that it is used when profiling in tiers 0 and 1, when it only does so for tier 0 (Interpreter) and tier 3, while tier 1 (Fully optimizing C1) does not collect any profile data at all. Additionally, the description for the different execution tiers is slightly confusing. This cleanup attempts to slightly better clarify how profiling is tied together between the Interpreter and C1, explain what MDO is an abbreviation for (MethodData object), and corrects the documentation for MethodData as well. > > Julian Waters has updated the pull request incrementally with one additional commit since the last revision: > > Fixup Good. ------------- Marked as reviewed by kvn (Reviewer). 
PR: https://git.openjdk.org/jdk/pull/9598 From duke at openjdk.org Mon Jul 25 17:27:25 2022 From: duke at openjdk.org (Sacha Coppey) Date: Mon, 25 Jul 2022 17:27:25 GMT Subject: RFR: 8290154: [JVMCI] Implement JVMCI for RISC-V In-Reply-To: References: Message-ID: On Mon, 25 Jul 2022 07:30:08 GMT, Fei Yang wrote: > Hi, I see some JVM crashes when I try the following test with a fastdebug build with your patch: test/hotspot/jtreg/compiler/jvmci/jdk.vm.ci.code.test/src/jdk/vm/ci/code/test/SimpleDebugInfoTest.java Thank you for pointing this out, I did not run the patch with fastdebug. I saw other issues after solving this one, so I am taking some time to solve them as well. ------------- PR: https://git.openjdk.org/jdk/pull/9587 From jwaters at openjdk.org Mon Jul 25 17:32:48 2022 From: jwaters at openjdk.org (Julian Waters) Date: Mon, 25 Jul 2022 17:32:48 GMT Subject: RFR: 8290834: Improve potentially confusing documentation on collection of profiling information [v7] In-Reply-To: References: Message-ID: <6x_al97Iyv5cUa6ExZGsQZ4PJP0uus3vrAtw7OG4Mu4=.ee74aa10-4666-4fe9-ad70-833b58172767@github.com> On Sat, 23 Jul 2022 13:59:59 GMT, Julian Waters wrote: >> Documentation on the MethodData object incorrectly states that it is used when profiling in tiers 0 and 1, when it only does so for tier 0 (Interpreter) and tier 3, while tier 1 (Fully optimizing C1) does not collect any profile data at all. Additionally, the description for the different execution tiers is slightly confusing. This cleanup attempts to slightly better clarify how profiling is tied together between the Interpreter and C1, explain what MDO is an abbreviation for (MethodData object), and corrects the documentation for MethodData as well. > > Julian Waters has updated the pull request incrementally with one additional commit since the last revision: > > Fixup Thanks all for the reviews!
------------- PR: https://git.openjdk.org/jdk/pull/9598 From dlong at openjdk.org Mon Jul 25 22:05:26 2022 From: dlong at openjdk.org (Dean Long) Date: Mon, 25 Jul 2022 22:05:26 GMT Subject: RFR: 8290834: Improve potentially confusing documentation on collection of profiling information [v7] In-Reply-To: References: Message-ID: <1MQcV55lgirNA2zPA0N1YTJliGPaJ9JiwCZmhcpQN1s=.2b6d3278-449e-4ab1-b854-e6e2715a373a@github.com> On Sat, 23 Jul 2022 13:59:59 GMT, Julian Waters wrote: >> Documentation on the MethodData object incorrectly states that it is used when profiling in tiers 0 and 1, when it only does so for tier 0 (Interpreter) and tier 3, while tier 1 (Fully optimizing C1) does not collect any profile data at all. Additionally, the description for the different execution tiers is slightly confusing. This cleanup attempts to slightly better clarify how profiling is tied together between the Interpreter and C1, explain what MDO is an abbreviation for (MethodData object), and corrects the documentation for MethodData as well. > > Julian Waters has updated the pull request incrementally with one additional commit since the last revision: > > Fixup Marked as reviewed by dlong (Reviewer). ------------- PR: https://git.openjdk.org/jdk/pull/9598 From jwaters at openjdk.org Mon Jul 25 22:50:05 2022 From: jwaters at openjdk.org (Julian Waters) Date: Mon, 25 Jul 2022 22:50:05 GMT Subject: Integrated: 8290834: Improve potentially confusing documentation on collection of profiling information In-Reply-To: References: Message-ID: On Thu, 21 Jul 2022 18:36:43 GMT, Julian Waters wrote: > Documentation on the MethodData object incorrectly states that it is used when profiling in tiers 0 and 1, when it only does so for tier 0 (Interpreter) and tier 3, while tier 1 (Fully optimizing C1) does not collect any profile data at all. Additionally, the description for the different execution tiers is slightly confusing. 
This cleanup attempts to slightly better clarify how profiling is tied together between the Interpreter and C1, explain what MDO is an abbreviation for (MethodData object), and corrects the documentation for MethodData as well. This pull request has now been integrated. Changeset: 0ca5cb13 Author: Julian Waters Committer: Dean Long URL: https://git.openjdk.org/jdk/commit/0ca5cb13a38105a4334ac3508a9c7155fc00cac3 Stats: 16 lines in 3 files changed: 11 ins; 0 del; 5 mod 8290834: Improve potentially confusing documentation on collection of profiling information Reviewed-by: thartmann, kvn, dlong ------------- PR: https://git.openjdk.org/jdk/pull/9598 From xgong at openjdk.org Tue Jul 26 01:11:01 2022 From: xgong at openjdk.org (Xiaohong Gong) Date: Tue, 26 Jul 2022 01:11:01 GMT Subject: RFR: 8290485: [vector] REVERSE_BYTES for byte type should not emit any instructions In-Reply-To: References: Message-ID: <4Mmn_wRbuZRJ7XNC_RUQwNalLaDH7sKdCFerWwBAOe0=.6df9c06e-10d8-4ab5-bcdf-0b421890709c@github.com> On Mon, 25 Jul 2022 13:22:11 GMT, Tobias Hartmann wrote: >> The Vector API unary operation "`REVERSE_BYTES`" should not emit any instructions for byte vectors. The same to the relative masked operation. Currently it emits `"mov dst, src"` on aarch64 when the "`dst`" and "`src`" are not the same register. But for the masked "`REVERSE_BYTES`", the compiler will always generate a "`VectorBlend`" which I think is redundant, since the first and second vector input is the same one. 
Please see the generated codes for the masked "`REVERSE_BYTES`" for byte type with NEON: >> >> ldr q16, [x15, #16] ; load the "src" vector >> mov v17.16b, v16.16b ; reverse bytes "src" >> ldr q18, [x13, #16] >> neg v18.16b, v18.16b ; load the vector mask >> bsl v18.16b, v17.16b, v16.16b ; vector blend >> >> The elements in register "`v17`" and "`v16`" are the same to each other, so the elements in result of "`bsl`" is the same to the original loaded values in "`v16`", no matter what the values in the vector mask are. >> >> To improve this, we can add the igvn transformations for "`ReverseBytesV`" and "`VectorBlend`" in compiler. For "`ReverseBytesV`", it can return the vector input if the basic element type is `T_BYTE`. And for "`VectorBlend`", it can return the first input if the first and the second input are the same one. >> >> Here is the performance data for the jmh benchmark [1] on ARM NEON: >> >> Benchmark (size) Mode Cnt Before After Units >> ByteMaxVector.REVERSE_BYTES 1024 thrpt 15 19457.641 19516.124 ops/ms >> ByteMaxVector.REVERSE_BYTESMasked 1024 thrpt 15 12498.416 20528.004 ops/ms >> >> This patch may not have any influence to the non-masked "`REVERSE_BYTES`" on ARM NEON, because the backend may not emit any instruction for it before. >> >> And here is the performance data on an x86 system: >> >> Benchmark (size) Mode Cnt Before After Units >> ByteMaxVector.REVERSE_BYTES 1024 thrpt 15 19358.941 20012.047 ops/ms >> ByteMaxVector.REVERSE_BYTESMasked 1024 thrpt 15 15759.788 20389.996 ops/ms >> >> [1] https://github.com/openjdk/panama-vector/blob/vectorIntrinsics/test/micro/org/openjdk/bench/jdk/incubator/vector/operation/ByteMaxVector.java#L2201 > > Looks good to me. I'll run some testing and report back once it passed. Thanks for the review and testing @TobiHartmann ! 
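The two identities described in the quoted patch summary can be illustrated with a plain-Java scalar model. This is only a sketch, not the Vector API or the C2 implementation: `blend` and `reverseBytesPerLane` are hypothetical helpers written for this example, and the blend convention used here (a set mask lane selects the second input) is an assumption. The point it shows is that when both blend inputs hold the same lanes, the mask cannot change the result.

```java
import java.util.Arrays;

final class BlendIdentitySketch {
    // Hypothetical scalar model of a lane-wise blend: where the mask is set,
    // take the lane from b, otherwise take it from a.
    static byte[] blend(byte[] a, byte[] b, boolean[] mask) {
        byte[] r = new byte[a.length];
        for (int i = 0; i < a.length; i++) {
            r[i] = mask[i] ? b[i] : a[i];
        }
        return r;
    }

    // Each lane is a single byte, so reversing the bytes inside a lane
    // leaves it unchanged; this is modelled as a plain copy.
    static byte[] reverseBytesPerLane(byte[] v) {
        return v.clone();
    }

    public static void main(String[] args) {
        byte[] src = {1, -2, 3, -4};
        boolean[] mask = {true, false, true, false};
        byte[] reversed = reverseBytesPerLane(src);
        // Masked REVERSE_BYTES modelled as blend(src, reversed, mask).
        byte[] masked = blend(src, reversed, mask);
        System.out.println(Arrays.equals(masked, src)); // prints "true"
    }
}
```

Because `reversed` equals `src` lane for lane, any mask gives the same result, which is why both the reverse and the blend can be folded away for byte vectors.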
------------- PR: https://git.openjdk.org/jdk/pull/9565 From kvn at openjdk.org Tue Jul 26 02:39:57 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Tue, 26 Jul 2022 02:39:57 GMT Subject: RFR: 8290485: [vector] REVERSE_BYTES for byte type should not emit any instructions In-Reply-To: References: Message-ID: On Wed, 20 Jul 2022 06:58:53 GMT, Xiaohong Gong wrote: > The Vector API unary operation "`REVERSE_BYTES`" should not emit any instructions for byte vectors. The same to the relative masked operation. Currently it emits `"mov dst, src"` on aarch64 when the "`dst`" and "`src`" are not the same register. But for the masked "`REVERSE_BYTES`", the compiler will always generate a "`VectorBlend`" which I think is redundant, since the first and second vector input is the same one. Please see the generated codes for the masked "`REVERSE_BYTES`" for byte type with NEON: > > ldr q16, [x15, #16] ; load the "src" vector > mov v17.16b, v16.16b ; reverse bytes "src" > ldr q18, [x13, #16] > neg v18.16b, v18.16b ; load the vector mask > bsl v18.16b, v17.16b, v16.16b ; vector blend > > The elements in register "`v17`" and "`v16`" are the same to each other, so the elements in result of "`bsl`" is the same to the original loaded values in "`v16`", no matter what the values in the vector mask are. > > To improve this, we can add the igvn transformations for "`ReverseBytesV`" and "`VectorBlend`" in compiler. For "`ReverseBytesV`", it can return the vector input if the basic element type is `T_BYTE`. And for "`VectorBlend`", it can return the first input if the first and the second input are the same one. 
> > Here is the performance data for the jmh benchmark [1] on ARM NEON: > > Benchmark (size) Mode Cnt Before After Units > ByteMaxVector.REVERSE_BYTES 1024 thrpt 15 19457.641 19516.124 ops/ms > ByteMaxVector.REVERSE_BYTESMasked 1024 thrpt 15 12498.416 20528.004 ops/ms > > This patch may not have any influence to the non-masked "`REVERSE_BYTES`" on ARM NEON, because the backend may not emit any instruction for it before. > > And here is the performance data on an x86 system: > > Benchmark (size) Mode Cnt Before After Units > ByteMaxVector.REVERSE_BYTES 1024 thrpt 15 19358.941 20012.047 ops/ms > ByteMaxVector.REVERSE_BYTESMasked 1024 thrpt 15 15759.788 20389.996 ops/ms > > [1] https://github.com/openjdk/panama-vector/blob/vectorIntrinsics/test/micro/org/openjdk/bench/jdk/incubator/vector/operation/ByteMaxVector.java#L2201 Good. ------------- Marked as reviewed by kvn (Reviewer). PR: https://git.openjdk.org/jdk/pull/9565 From xgong at openjdk.org Tue Jul 26 02:59:02 2022 From: xgong at openjdk.org (Xiaohong Gong) Date: Tue, 26 Jul 2022 02:59:02 GMT Subject: RFR: 8290485: [vector] REVERSE_BYTES for byte type should not emit any instructions In-Reply-To: References: Message-ID: On Tue, 26 Jul 2022 02:36:31 GMT, Vladimir Kozlov wrote: >> The Vector API unary operation "`REVERSE_BYTES`" should not emit any instructions for byte vectors. The same to the relative masked operation. Currently it emits `"mov dst, src"` on aarch64 when the "`dst`" and "`src`" are not the same register. But for the masked "`REVERSE_BYTES`", the compiler will always generate a "`VectorBlend`" which I think is redundant, since the first and second vector input is the same one. 
Please see the generated codes for the masked "`REVERSE_BYTES`" for byte type with NEON: >> >> ldr q16, [x15, #16] ; load the "src" vector >> mov v17.16b, v16.16b ; reverse bytes "src" >> ldr q18, [x13, #16] >> neg v18.16b, v18.16b ; load the vector mask >> bsl v18.16b, v17.16b, v16.16b ; vector blend >> >> The elements in register "`v17`" and "`v16`" are the same to each other, so the elements in result of "`bsl`" is the same to the original loaded values in "`v16`", no matter what the values in the vector mask are. >> >> To improve this, we can add the igvn transformations for "`ReverseBytesV`" and "`VectorBlend`" in compiler. For "`ReverseBytesV`", it can return the vector input if the basic element type is `T_BYTE`. And for "`VectorBlend`", it can return the first input if the first and the second input are the same one. >> >> Here is the performance data for the jmh benchmark [1] on ARM NEON: >> >> Benchmark (size) Mode Cnt Before After Units >> ByteMaxVector.REVERSE_BYTES 1024 thrpt 15 19457.641 19516.124 ops/ms >> ByteMaxVector.REVERSE_BYTESMasked 1024 thrpt 15 12498.416 20528.004 ops/ms >> >> This patch may not have any influence to the non-masked "`REVERSE_BYTES`" on ARM NEON, because the backend may not emit any instruction for it before. >> >> And here is the performance data on an x86 system: >> >> Benchmark (size) Mode Cnt Before After Units >> ByteMaxVector.REVERSE_BYTES 1024 thrpt 15 19358.941 20012.047 ops/ms >> ByteMaxVector.REVERSE_BYTESMasked 1024 thrpt 15 15759.788 20389.996 ops/ms >> >> [1] https://github.com/openjdk/panama-vector/blob/vectorIntrinsics/test/micro/org/openjdk/bench/jdk/incubator/vector/operation/ByteMaxVector.java#L2201 > > Good. Thanks for the review @vnkozlov ! 
------------- PR: https://git.openjdk.org/jdk/pull/9565 From xgong at openjdk.org Tue Jul 26 02:59:04 2022 From: xgong at openjdk.org (Xiaohong Gong) Date: Tue, 26 Jul 2022 02:59:04 GMT Subject: Integrated: 8290485: [vector] REVERSE_BYTES for byte type should not emit any instructions In-Reply-To: References: Message-ID: On Wed, 20 Jul 2022 06:58:53 GMT, Xiaohong Gong wrote: > The Vector API unary operation "`REVERSE_BYTES`" should not emit any instructions for byte vectors. The same to the relative masked operation. Currently it emits `"mov dst, src"` on aarch64 when the "`dst`" and "`src`" are not the same register. But for the masked "`REVERSE_BYTES`", the compiler will always generate a "`VectorBlend`" which I think is redundant, since the first and second vector input is the same one. Please see the generated codes for the masked "`REVERSE_BYTES`" for byte type with NEON: > > ldr q16, [x15, #16] ; load the "src" vector > mov v17.16b, v16.16b ; reverse bytes "src" > ldr q18, [x13, #16] > neg v18.16b, v18.16b ; load the vector mask > bsl v18.16b, v17.16b, v16.16b ; vector blend > > The elements in register "`v17`" and "`v16`" are the same to each other, so the elements in result of "`bsl`" is the same to the original loaded values in "`v16`", no matter what the values in the vector mask are. > > To improve this, we can add the igvn transformations for "`ReverseBytesV`" and "`VectorBlend`" in compiler. For "`ReverseBytesV`", it can return the vector input if the basic element type is `T_BYTE`. And for "`VectorBlend`", it can return the first input if the first and the second input are the same one. 
> > Here is the performance data for the jmh benchmark [1] on ARM NEON: > > Benchmark (size) Mode Cnt Before After Units > ByteMaxVector.REVERSE_BYTES 1024 thrpt 15 19457.641 19516.124 ops/ms > ByteMaxVector.REVERSE_BYTESMasked 1024 thrpt 15 12498.416 20528.004 ops/ms > > This patch may not have any influence to the non-masked "`REVERSE_BYTES`" on ARM NEON, because the backend may not emit any instruction for it before. > > And here is the performance data on an x86 system: > > Benchmark (size) Mode Cnt Before After Units > ByteMaxVector.REVERSE_BYTES 1024 thrpt 15 19358.941 20012.047 ops/ms > ByteMaxVector.REVERSE_BYTESMasked 1024 thrpt 15 15759.788 20389.996 ops/ms > > [1] https://github.com/openjdk/panama-vector/blob/vectorIntrinsics/test/micro/org/openjdk/bench/jdk/incubator/vector/operation/ByteMaxVector.java#L2201 This pull request has now been integrated. Changeset: a6faf5d3 Author: Xiaohong Gong URL: https://git.openjdk.org/jdk/commit/a6faf5d33a09ca53e5d1c60a5ed82f2368a6e1b3 Stats: 123 lines in 4 files changed: 121 ins; 0 del; 2 mod 8290485: [vector] REVERSE_BYTES for byte type should not emit any instructions Reviewed-by: thartmann, kvn ------------- PR: https://git.openjdk.org/jdk/pull/9565 From kvn at openjdk.org Tue Jul 26 03:03:13 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Tue, 26 Jul 2022 03:03:13 GMT Subject: RFR: 8287385: Suppress superficial unstable_if traps In-Reply-To: References: Message-ID: <-WOTq68zVNUdFogau-S76PXvsDr0utYVDTdxjrDJF1c=.345530bc-8b7e-44fd-aa2b-d5d5a3e5db03@github.com> On Thu, 21 Jul 2022 19:54:11 GMT, Xin Liu wrote: > An unstable if trap is **superficial** if it can NOT prune any code. Sometimes, the else-section of program is empty. The superficial unstable_if traps not only complicate code shape but also consume codecache. C2 has to generate debuginfo for them. If the condition changed, HotSpot has to destroy the established nmethod and compile it again. 
Our analysis shows that roughly 20% of unstable_if traps are superficial. > > The algorithm which can identify and suppress superficial unstable_if traps derives from the definition. A non-superficial unstable_if trap must prune some code. The parser skips dead basic blocks (BBs). A trap is superficial if and only if its target BB is not dead! Otherwise the block would have been skipped, which contradicts the definition. As a result, we can suppress an unstable_if trap when C2 parses the target BB. This algorithm leaves alone those uncommon_traps that do prune code. > > For example, C2 generates an uncommon_trap for the else path if cond is very likely true. > > public static int foo(boolean cond, int i) { > Value x = new Value(0); > Value y = new Value(1); > Value z = new Value(i); > > if (cond) { > i++; > } > return x._value + y._value + z._value + i; > } > > > If we suppress this superficial unstable_if, the nmethod shrinks from 608 bytes to 520 bytes, or -14.5%. Most of the savings come from "scopes data/pcs". This is because a superficial unstable_if generates a trap like this: > > 037 call,static wrapper for: uncommon_trap(reason='unstable_if' action='reinterpret' debug_id='0') > # SuperficialIfTrap::foo @ bci:29 (line 32) L[0]=_ L[1]=rsp + #4 L[2]=#ScObj0 L[3]=#ScObj1 L[4]=#ScObj2 STK[0]=rsp + #0 > # ScObj0 SuperficialIfTrap$Value={ [_value :0]=#0 } > # ScObj1 SuperficialIfTrap$Value={ [_value :0]=#1 } > # ScObj2 SuperficialIfTrap$Value={ [_value :0]=rsp + #4 } > # OopMap {off=60/0x3c} > 03c stop # ShouldNotReachHere > > > Here is the breakdown of nmethod, generated by '-XX:+PrintAssembly' > > <-XX:-OptimizeUnstableIf> > Compiled method (c2) 346 17 4 SuperficialIfTrap::foo (53 bytes) > total in heap [0x00007f50f4970910,0x00007f50f4970b70] = 608 > relocation [0x00007f50f4970a70,0x00007f50f4970a80] = 16 > main code [0x00007f50f4970a80,0x00007f50f4970ad8] = 88 > stub code [0x00007f50f4970ad8,0x00007f50f4970af0] = 24 > oops [0x00007f50f4970af0,0x00007f50f4970b00] = 16 > metadata
[0x00007f50f4970b00,0x00007f50f4970b08] = 8 > scopes data [0x00007f50f4970b08,0x00007f50f4970b38] = 48 > scopes pcs [0x00007f50f4970b38,0x00007f50f4970b68] = 48 > dependencies [0x00007f50f4970b68,0x00007f50f4970b70] = 8 > > <-XX:+OptimizeUnstableIf> > Compiled method (c2) 309 17 4 SuperficialIfTrap::foo (53 bytes) > total in heap [0x00007f4090970910,0x00007f4090970b18] = 520 > relocation [0x00007f4090970a70,0x00007f4090970a80] = 16 > main code [0x00007f4090970a80,0x00007f4090970ac8] = 72 > stub code [0x00007f4090970ac8,0x00007f4090970ae0] = 24 > oops [0x00007f4090970ae0,0x00007f4090970ae8] = 8 > scopes data [0x00007f4090970ae8,0x00007f4090970af0] = 8 > scopes pcs [0x00007f4090970af0,0x00007f4090970b10] = 32 > dependencies [0x00007f4090970b10,0x00007f4090970b18] = 8 Did you address @merykitty's comment in the RFE? You said: `it looks like this JBS does have this downsize, I will investigate this problem` src/hotspot/share/opto/parse.hpp line 647: > 645: } > 646: > 647: void suppress(Parse* parser, Parse::Block* path); Add a comment describing this method. src/hotspot/share/opto/parse1.cpp line 665: > 663: record_for_igvn(unc); > 664: //tty->print("mark dead: "); > 665: //unc->dump(); Debug lines left over? src/hotspot/share/opto/parse2.cpp line 1586: > 1584: if (path_is_suitable_for_uncommon_trap(prob) && (!OptimizeUnstableIf || !path->is_merged())) { > 1585: sync_jvms(); > 1586: SafePointNode* sfpt = clone_map(); It seems `sync_jvms()` and `clone_map()` are only needed for `OptimizeUnstableIf`. You need a comment explaining why you added `!path->is_merged()` for the `OptimizeUnstableIf` case.
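For readers who want to try the `foo` example from the quoted description, a self-contained version might look like the sketch below. The `Value` class is not shown in the message; it is assumed here to be a trivial holder with an `int _value` field, which is what the quoted debug info (`SuperficialIfTrap$Value={ [_value :0]=... }`) suggests. The `main` driver is likewise an addition for illustration. Whether C2 actually emits or suppresses the trap depends on the build and on the collected profile, and the `OptimizeUnstableIf` flag in the quoted output belongs to the patch under review, so this sketch only reproduces the Java side.

```java
final class SuperficialIfTrap {
    // Assumed shape of Value; the original message does not show this class.
    static final class Value {
        final int _value;

        Value(int value) {
            _value = value;
        }
    }

    // Method body as given in the quoted description.
    public static int foo(boolean cond, int i) {
        Value x = new Value(0);
        Value y = new Value(1);
        Value z = new Value(i);

        if (cond) {
            i++;
        }
        return x._value + y._value + z._value + i;
    }

    public static void main(String[] args) {
        long sum = 0;
        // Always pass cond == true so the branch profile is heavily biased;
        // the iteration count is an arbitrary choice meant to make JIT
        // compilation of foo likely.
        for (int n = 0; n < 100_000; n++) {
            sum += foo(true, n & 7);
        }
        System.out.println(sum);
    }
}
```

The else path of `if (cond)` is empty, so a trap placed on it prunes no code, which is the situation the description calls superficial.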
------------- PR: https://git.openjdk.org/jdk/pull/9601 From xgong at openjdk.org Tue Jul 26 03:12:03 2022 From: xgong at openjdk.org (Xiaohong Gong) Date: Tue, 26 Jul 2022 03:12:03 GMT Subject: RFR: 8290485: [vector] REVERSE_BYTES for byte type should not emit any instructions In-Reply-To: References: Message-ID: <7MrA1CSI1OjDWTeCh6Ol8wFjzriRdAGGXLmJt4qZlM0=.99062a0e-4099-417c-8121-918954ea5145@github.com> On Wed, 20 Jul 2022 06:58:53 GMT, Xiaohong Gong wrote: > The Vector API unary operation "`REVERSE_BYTES`" should not emit any instructions for byte vectors. The same to the relative masked operation. Currently it emits `"mov dst, src"` on aarch64 when the "`dst`" and "`src`" are not the same register. But for the masked "`REVERSE_BYTES`", the compiler will always generate a "`VectorBlend`" which I think is redundant, since the first and second vector input is the same one. Please see the generated codes for the masked "`REVERSE_BYTES`" for byte type with NEON: > > ldr q16, [x15, #16] ; load the "src" vector > mov v17.16b, v16.16b ; reverse bytes "src" > ldr q18, [x13, #16] > neg v18.16b, v18.16b ; load the vector mask > bsl v18.16b, v17.16b, v16.16b ; vector blend > > The elements in register "`v17`" and "`v16`" are the same to each other, so the elements in result of "`bsl`" is the same to the original loaded values in "`v16`", no matter what the values in the vector mask are. > > To improve this, we can add the igvn transformations for "`ReverseBytesV`" and "`VectorBlend`" in compiler. For "`ReverseBytesV`", it can return the vector input if the basic element type is `T_BYTE`. And for "`VectorBlend`", it can return the first input if the first and the second input are the same one. 
> > Here is the performance data for the jmh benchmark [1] on ARM NEON: > > Benchmark (size) Mode Cnt Before After Units > ByteMaxVector.REVERSE_BYTES 1024 thrpt 15 19457.641 19516.124 ops/ms > ByteMaxVector.REVERSE_BYTESMasked 1024 thrpt 15 12498.416 20528.004 ops/ms > > This patch may not have any influence to the non-masked "`REVERSE_BYTES`" on ARM NEON, because the backend may not emit any instruction for it before. > > And here is the performance data on an x86 system: > > Benchmark (size) Mode Cnt Before After Units > ByteMaxVector.REVERSE_BYTES 1024 thrpt 15 19358.941 20012.047 ops/ms > ByteMaxVector.REVERSE_BYTESMasked 1024 thrpt 15 15759.788 20389.996 ops/ms > > [1] https://github.com/openjdk/panama-vector/blob/vectorIntrinsics/test/micro/org/openjdk/bench/jdk/incubator/vector/operation/ByteMaxVector.java#L2201 I'm sorry, I mistakenly thought that the tests had passed, so I integrated this PR. If any tests fail I will revert this PR and fix it. Thanks! ------------- PR: https://git.openjdk.org/jdk/pull/9565 From kvn at openjdk.org Tue Jul 26 03:12:12 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Tue, 26 Jul 2022 03:12:12 GMT Subject: RFR: 8283232: x86: Improve vector broadcast operations [v8] In-Reply-To: References: Message-ID: On Sat, 23 Jul 2022 13:18:05 GMT, Quan Anh Mai wrote: >> Hi, >> >> This patch improves the generation of broadcasting a scalar in several ways: >> >> - As it has been pointed out, dumping the whole vector into the constant table is costly in terms of code size, this patch minimises this overhead for vector replicate of constants. Also, options are available for constants to be generated with more alignment so that vector load can be made efficiently without crossing cache lines. >> - Vector broadcasting should prefer rematerialising to spilling when register pressure is high.
>> - Load vectors using the same kind (integral vs floating point) of instructions as that of the results to avoid potential data bypass delay >> >> With this patch, the result of the added benchmark, which performs some operations with a really high register pressure, on my machine with Intel i7-7700HQ (avx2) is as follows: >> >> Before After >> Benchmark Mode Cnt Score Error Score Error Units Gain >> SpiltReplicate.testDouble avgt 5 42.621 ± 0.598 38.771 ± 0.797 ns/op +9.03% >> SpiltReplicate.testFloat avgt 5 42.245 ± 1.464 38.603 ± 0.367 ns/op +8.62% >> SpiltReplicate.testInt avgt 5 20.581 ± 5.791 13.755 ± 0.375 ns/op +33.17% >> SpiltReplicate.testLong avgt 5 17.794 ± 4.781 13.663 ± 0.387 ns/op +23.22% >> >> As expected, the constant table sizes shrink significantly from 1024 bytes to 256 bytes for `long`/`double` and 128 bytes for `int`/`float` cases. >> >> This patch also removes some redundant code paths and renames some incorrectly named instructions. >> >> Thank you very much. > > Quan Anh Mai has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 18 commits: > > - rename > - consolidate sse checks > - benchmark > - fix > - Merge branch 'master' into improveReplicate > - remove duplicate > - unsignness > - rematerializing input count > - fix comparison > - fix rematerialize, constant deduplication > - ... and 8 more: https://git.openjdk.org/jdk/compare/0599a05f...6c10f9ad I submitted testing.
------------- PR: https://git.openjdk.org/jdk/pull/7832 From kvn at openjdk.org Tue Jul 26 03:17:59 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Tue, 26 Jul 2022 03:17:59 GMT Subject: RFR: 8290485: [vector] REVERSE_BYTES for byte type should not emit any instructions In-Reply-To: <7MrA1CSI1OjDWTeCh6Ol8wFjzriRdAGGXLmJt4qZlM0=.99062a0e-4099-417c-8121-918954ea5145@github.com> References: <7MrA1CSI1OjDWTeCh6Ol8wFjzriRdAGGXLmJt4qZlM0=.99062a0e-4099-417c-8121-918954ea5145@github.com> Message-ID: On Tue, 26 Jul 2022 03:09:53 GMT, Xiaohong Gong wrote: > I'm sorry, I mistakenly thought that the tests had passed, so I integrated this PR. If any tests fail I will revert this PR and fix it. Thanks! Testing passed and results are good. I looked. ------------- PR: https://git.openjdk.org/jdk/pull/9565 From xgong at openjdk.org Tue Jul 26 03:24:10 2022 From: xgong at openjdk.org (Xiaohong Gong) Date: Tue, 26 Jul 2022 03:24:10 GMT Subject: RFR: 8290485: [vector] REVERSE_BYTES for byte type should not emit any instructions In-Reply-To: References: <7MrA1CSI1OjDWTeCh6Ol8wFjzriRdAGGXLmJt4qZlM0=.99062a0e-4099-417c-8121-918954ea5145@github.com> Message-ID: On Tue, 26 Jul 2022 03:15:24 GMT, Vladimir Kozlov wrote: > > I'm sorry, I mistakenly thought that the tests had passed, so I integrated this PR. If any tests fail I will revert this PR and fix it. Thanks! > > Testing passed and results are good. I looked. That's great! Thanks so much for that!
------------- PR: https://git.openjdk.org/jdk/pull/9565 From jbhateja at openjdk.org Tue Jul 26 05:57:39 2022 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Tue, 26 Jul 2022 05:57:39 GMT Subject: RFR: 8283232: x86: Improve vector broadcast operations [v8] In-Reply-To: References: Message-ID: On Sat, 23 Jul 2022 13:18:05 GMT, Quan Anh Mai wrote: >> Hi, >> >> This patch improves the generation of broadcasting a scalar in several ways: >> >> - As it has been pointed out, dumping the whole vector into the constant table is costly in terms of code size, this patch minimises this overhead for vector replicate of constants. Also, options are available for constants to be generated with more alignment so that vector load can be made efficiently without crossing cache lines. >> - Vector broadcasting should prefer rematerialising to spilling when register pressure is high. >> - Load vectors using the same kind (integral vs floating point) of instructions as that of the results to avoid potential data bypass delay >> >> With this patch, the result of the added benchmark, which performs some operations with a really high register pressure, on my machine with Intel i7-7700HQ (avx2) is as follows: >> >> Before After >> Benchmark Mode Cnt Score Error Score Error Units Gain >> SpiltReplicate.testDouble avgt 5 42.621 ± 0.598 38.771 ± 0.797 ns/op +9.03% >> SpiltReplicate.testFloat avgt 5 42.245 ± 1.464 38.603 ± 0.367 ns/op +8.62% >> SpiltReplicate.testInt avgt 5 20.581 ± 5.791 13.755 ± 0.375 ns/op +33.17% >> SpiltReplicate.testLong avgt 5 17.794 ± 4.781 13.663 ± 0.387 ns/op +23.22% >> >> As expected, the constant table sizes shrink significantly from 1024 bytes to 256 bytes for `long`/`double` and 128 bytes for `int`/`float` cases. >> >> This patch also removes some redundant code paths and renames some incorrectly named instructions. >> >> Thank you very much. > > Quan Anh Mai has updated the pull request with a new target base due to a merge or a rebase.
The pull request now contains 18 commits: > > - rename > - consolidate sse checks > - benchmark > - fix > - Merge branch 'master' into improveReplicate > - remove duplicate > - unsignness > - rematerializing input count > - fix comparison > - fix rematerialize, constant deduplication > - ... and 8 more: https://git.openjdk.org/jdk/compare/0599a05f...6c10f9ad src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 1662: > 1660: case 64: vmovups(dst, src, Assembler::AVX_512bit); break; > 1661: default: ShouldNotReachHere(); > 1662: } Vector load/store from memory happens from dedicated ports; can you elaborate on why this change will be beneficial? src/hotspot/cpu/x86/macroAssembler_x86.cpp line 4388: > 4386: > 4387: void MacroAssembler::vallones(XMMRegister dst, int vector_len) { > 4388: // vpcmpeqd has special dependency treatment so it should be preferred to vpternlogd The comment is not clear; adding a relevant reference would add more value. ------------- PR: https://git.openjdk.org/jdk/pull/7832 From rrich at openjdk.org Tue Jul 26 07:55:50 2022 From: rrich at openjdk.org (Richard Reingruber) Date: Tue, 26 Jul 2022 07:55:50 GMT Subject: RFR: 8289925: Shared code shouldn't reference the platform specific method frame::interpreter_frame_last_sp() [v2] In-Reply-To: References: Message-ID: On Mon, 18 Jul 2022 06:41:57 GMT, Richard Reingruber wrote: >> The method `frame::interpreter_frame_last_sp()` is a platform method in the sense that it is not declared in a shared header file. It is declared and defined on some platforms though (x86 and aarch64 I think). >> >> `frame::interpreter_frame_last_sp()` existed on these platforms before vm continuations (aka loom). Shared code that is part of the vm continuations implementation references it. This breaks the platform abstraction. >> >> This fix simply removes the special case for interpreted frames in the shared method `Continuation::continuation_bottom_sender()`.
I cannot see a reason for the distinction between interpreted and compiled frames. The shared code reference to `frame::interpreter_frame_last_sp()` is thereby eliminated. >> >> Testing: hotspot_loom and jdk_loom on x86_64 and aarch64. > > Richard Reingruber has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains two additional commits since the last revision: > > - Merge branch 'master' > - Remove platform dependent method interpreter_frame_last_sp() from shared code Ping? The change is tiny. ------------- PR: https://git.openjdk.org/jdk/pull/9411 From eosterlund at openjdk.org Tue Jul 26 08:03:49 2022 From: eosterlund at openjdk.org (Erik =?UTF-8?B?w5ZzdGVybHVuZA==?=) Date: Tue, 26 Jul 2022 08:03:49 GMT Subject: RFR: 8289925: Shared code shouldn't reference the platform specific method frame::interpreter_frame_last_sp() [v2] In-Reply-To: References: Message-ID: On Mon, 18 Jul 2022 06:41:57 GMT, Richard Reingruber wrote: >> The method `frame::interpreter_frame_last_sp()` is a platform method in the sense that it is not declared in a shared header file. It is declared and defined on some platforms though (x86 and aarch64 I think). >> >> `frame::interpreter_frame_last_sp()` existed on these platforms before vm continuations (aka loom). Shared code that is part of the vm continuations implementation references it. This breaks the platform abstraction. >> >> This fix simply removes the special case for interpreted frames in the shared method `Continuation::continuation_bottom_sender()`. I cannot see a reason for the distinction between interpreted and compiled frames. The shared code reference to `frame::interpreter_frame_last_sp()` is thereby eliminated. >> >> Testing: hotspot_loom and jdk_loom on x86_64 and aarch64. > > Richard Reingruber has updated the pull request with a new target base due to a merge or a rebase. 
The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains two additional commits since the last revision: > > - Merge branch 'master' > - Remove platform dependent method interpreter_frame_last_sp() from shared code Looks good. ------------- Marked as reviewed by eosterlund (Reviewer). PR: https://git.openjdk.org/jdk/pull/9411 From jbhateja at openjdk.org Tue Jul 26 08:07:47 2022 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Tue, 26 Jul 2022 08:07:47 GMT Subject: RFR: 8283232: x86: Improve vector broadcast operations [v8] In-Reply-To: References: Message-ID: On Tue, 26 Jul 2022 05:53:31 GMT, Jatin Bhateja wrote: > Vector Load/store from memory happens from dedicated ports, can you elaborate why this change will benefit. The above reference to section 3.5.5.2 also states that FP loads add another cycle of latency, but the saving of the cycle penalty due to the bypass between the FP and SIMD domains still holds. So maybe for loads there is no pressing need, and the existing load vector handling can be kept as it is. Overall, the savings from the constant table size reductions are very impressive. Thanks. ------------- PR: https://git.openjdk.org/jdk/pull/7832 From rrich at openjdk.org Tue Jul 26 08:08:51 2022 From: rrich at openjdk.org (Richard Reingruber) Date: Tue, 26 Jul 2022 08:08:51 GMT Subject: RFR: 8289925: Shared code shouldn't reference the platform specific method frame::interpreter_frame_last_sp() [v2] In-Reply-To: References: Message-ID: On Tue, 26 Jul 2022 08:00:09 GMT, Erik Österlund wrote: > Looks good. That was prompt, thanks a lot!
:) ------------- PR: https://git.openjdk.org/jdk/pull/9411 From fyang at openjdk.org Tue Jul 26 08:38:03 2022 From: fyang at openjdk.org (Fei Yang) Date: Tue, 26 Jul 2022 08:38:03 GMT Subject: [jdk19] RFR: 8289954: C2: Assert failed in PhaseCFG::verify() after JDK-8183390 In-Reply-To: References: Message-ID: On Mon, 11 Jul 2022 08:41:21 GMT, Pengfei Li wrote: > Fuzzer tests report an assertion failure issue in C2 global code motion > phase. Git bisection shows the problem starts after our fix of post loop > vectorization (JDK-8183390). After some narrowing down work, we find it > is caused by below change in that patch. > > > @@ -422,14 +404,7 @@ > cl->mark_passed_slp(); > } > cl->mark_was_slp(); > - if (cl->is_main_loop()) { > - cl->set_slp_max_unroll(local_loop_unroll_factor); > - } else if (post_loop_allowed) { > - if (!small_basic_type) { > - // avoid replication context for small basic types in programmable masked loops > - cl->set_slp_max_unroll(local_loop_unroll_factor); > - } > - } > + cl->set_slp_max_unroll(local_loop_unroll_factor); > } > } > > > This change is in function `SuperWord::unrolling_analysis()`. AFAIK, it > helps find a loop's max unroll count via some analysis. In the original > code, we have loop type checks and the slp max unroll value is set for > only some types of loops. But in JDK-8183390, the check was removed by > mistake. In my current understanding, the slp max unroll value applies > to slp candidate loops only - either main loops or RCE'd post loops - > so that check shouldn't be removed. After restoring it we don't see the > assertion failure any more. > > The new jtreg created in this patch can reproduce the failed assertion, > which checks `def_block->dominates(block)` - the domination relationship > of two blocks. But in the case, I found the blocks are in an unreachable > inner loop, which I think ought to be optimized away in some previous C2 > phases. 
As I'm not quite familiar with the C2's global code motion, so > far I still don't understand how slp max unroll count eventually causes > that problem. This patch just restores the if condition which I removed > incorrectly in JDK-8183390. But I still suspect that there is another > hidden bug exists in C2. I would be glad if any reviewers can give me > some guidance or suggestions. > > Tested hotspot::hotspot_all_no_apps, jdk::tier1~3 and langtools::tier1. Sorry, but seems that the same assertion failure is still happening when running the newly added test case with fastdebug build on linux-riscv64 platform. And I have attached the hs_err and reply files on the JBS issue. Please take another look. ------------- PR: https://git.openjdk.org/jdk19/pull/130 From pli at openjdk.org Tue Jul 26 10:07:01 2022 From: pli at openjdk.org (Pengfei Li) Date: Tue, 26 Jul 2022 10:07:01 GMT Subject: [jdk19] RFR: 8289954: C2: Assert failed in PhaseCFG::verify() after JDK-8183390 In-Reply-To: References: Message-ID: On Tue, 26 Jul 2022 08:34:42 GMT, Fei Yang wrote: >> Fuzzer tests report an assertion failure issue in C2 global code motion >> phase. Git bisection shows the problem starts after our fix of post loop >> vectorization (JDK-8183390). After some narrowing down work, we find it >> is caused by below change in that patch. >> >> >> @@ -422,14 +404,7 @@ >> cl->mark_passed_slp(); >> } >> cl->mark_was_slp(); >> - if (cl->is_main_loop()) { >> - cl->set_slp_max_unroll(local_loop_unroll_factor); >> - } else if (post_loop_allowed) { >> - if (!small_basic_type) { >> - // avoid replication context for small basic types in programmable masked loops >> - cl->set_slp_max_unroll(local_loop_unroll_factor); >> - } >> - } >> + cl->set_slp_max_unroll(local_loop_unroll_factor); >> } >> } >> >> >> This change is in function `SuperWord::unrolling_analysis()`. AFAIK, it >> helps find a loop's max unroll count via some analysis. 
In the original >> code, we have loop type checks and the slp max unroll value is set for >> only some types of loops. But in JDK-8183390, the check was removed by >> mistake. In my current understanding, the slp max unroll value applies >> to slp candidate loops only - either main loops or RCE'd post loops - >> so that check shouldn't be removed. After restoring it we don't see the >> assertion failure any more. >> >> The new jtreg created in this patch can reproduce the failed assertion, >> which checks `def_block->dominates(block)` - the domination relationship >> of two blocks. But in the case, I found the blocks are in an unreachable >> inner loop, which I think ought to be optimized away in some previous C2 >> phases. As I'm not quite familiar with the C2's global code motion, so >> far I still don't understand how slp max unroll count eventually causes >> that problem. This patch just restores the if condition which I removed >> incorrectly in JDK-8183390. But I still suspect that there is another >> hidden bug exists in C2. I would be glad if any reviewers can give me >> some guidance or suggestions. >> >> Tested hotspot::hotspot_all_no_apps, jdk::tier1~3 and langtools::tier1. > > Sorry, but seems that the same assertion failure is still happening when running the newly added test case with fastdebug build on linux-riscv64 platform. And I have attached the hs_err and reply files on the JBS issue. Please take another look. Thanks @RealFYang for the information. I'm still investigating this in jdk 20 but so far I haven't got a clear clue. Just find if I ban `SuperWord::unrolling_analysis()` for normal loops, specifically with below change, that assertion fails again. 
diff --git a/src/hotspot/share/opto/superword.cpp b/src/hotspot/share/opto/superword.cpp
index ef66840628f..997fae367b1 100644
--- a/src/hotspot/share/opto/superword.cpp
+++ b/src/hotspot/share/opto/superword.cpp
@@ -217,6 +217,7 @@ void SuperWord::unrolling_analysis(int &local_loop_unroll_factor) {
   int *ignored_loop_nodes = NEW_RESOURCE_ARRAY(int, ignored_size);
   Node_Stack nstack((int)ignored_size);
   CountedLoopNode *cl = lpt()->_head->as_CountedLoop();
+  if (cl->is_normal_loop()) return;
   Node *cl_exit = cl->loopexit_or_null();
   int rpo_idx = _post_block.length();

And I saw [JDK-8275330](https://github.com/openjdk/jdk/pull/6429) by Roland fixed the same assertion failure before. And the test case which causes the failure looks similar with this one - there is an inner dead loop which fails to be optimized away by C2.

Hi @rwestrel , may I ask if you have any ideas or hints about this?

-------------

PR: https://git.openjdk.org/jdk19/pull/130

From thartmann at openjdk.org Tue Jul 26 10:36:25 2022
From: thartmann at openjdk.org (Tobias Hartmann)
Date: Tue, 26 Jul 2022 10:36:25 GMT
Subject: RFR: 8287794: Reverse*VNode::Identity problem [v2]
In-Reply-To: <3q1AbVHPhgTmbtzqVBmQtCsfrbbW64Kk6I8aUEJ0oTY=.ade9b2ce-f008-42b9-a3fc-ed69ba3580d1@github.com>
References: <3q1AbVHPhgTmbtzqVBmQtCsfrbbW64Kk6I8aUEJ0oTY=.ade9b2ce-f008-42b9-a3fc-ed69ba3580d1@github.com>
Message-ID:

On Mon, 25 Jul 2022 05:29:05 GMT, Jatin Bhateja wrote:

>> Hi All,
>>
>> - This bug fix patch fixes a missing case during reverse[bits|bytes] identity transformation.
>> - Unlike AARCH64(SVE), X86(AVX512) ISA has no direct instruction to reverse[bits|bytes] of a vector lane hence a predicated operation is supported through blend instruction.
>> - New IR framework based tests has been added to test transforms relevant to AVX2, AVX512 and SVE.
>>
>> Kindly review and share your feedback.
>> >> Best Regards, >> Jatin > > Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: > > 8287794: Review comments resolved. All tests passed. ------------- PR: https://git.openjdk.org/jdk/pull/9623 From thartmann at openjdk.org Tue Jul 26 10:40:06 2022 From: thartmann at openjdk.org (Tobias Hartmann) Date: Tue, 26 Jul 2022 10:40:06 GMT Subject: RFR: 8290705: StringConcat::validate_mem_flow asserts with "unexpected user: StoreI" [v2] In-Reply-To: References: Message-ID: On Fri, 22 Jul 2022 05:38:01 GMT, Tobias Hartmann wrote: >> C2's string concatenation optimization (`OptimizeStringConcat`) does not correctly handle side effecting instructions between StringBuffer Allocate/Initialize and the call to the constructor. In the failing test, see `SideEffectBeforeConstructor::test`, a `result` field is incremented just before the constructor is invoked. The string concatenation optimization still merges the allocation, constructor and `toString` calls and incorrectly re-wires the store to before the concatenation. As a result, passing `null` to the constructor will incorrectly increment the field before throwing a NullPointerException. With a debug build, we hit an assert in `StringConcat::validate_mem_flow` due to the unexpected field store. This is an old bug and an extreme edge case as javac would not generate such code. >> >> The following comment suggests that this case should be covered by `StringConcat::validate_control_flow()`: >> https://github.com/openjdk/jdk/blob/3582fd9e93d9733c6fdf1f3848e0a093d44f6865/src/hotspot/share/opto/stringopts.cpp#L834-L838 >> >> However, the control flow analysis does not catch this case. I added the missing check. >> >> Thanks, >> Tobias > > Tobias Hartmann has updated the pull request incrementally with one additional commit since the last revision: > > Modified debug printing code Anyone up for a second review? 
------------- PR: https://git.openjdk.org/jdk/pull/9589 From thartmann at openjdk.org Tue Jul 26 10:40:47 2022 From: thartmann at openjdk.org (Tobias Hartmann) Date: Tue, 26 Jul 2022 10:40:47 GMT Subject: RFR: 8289996: Fix array range check hoisting for some scaled loop iv [v3] In-Reply-To: References: Message-ID: <8hRxoV9kaBCaORQWauuwPKDpimaFTVf0MFjOIu11heM=.fe359182-ebb0-4c63-99d2-1becc4579e42@github.com> On Fri, 22 Jul 2022 03:57:22 GMT, Pengfei Li wrote: >> Recently we found some array range checks in loops are not hoisted by >> C2's loop predication phase as expected. Below is a typical case. >> >> for (int i = 0; i < size; i++) { >> b[3 * i] = a[3 * i]; >> } >> >> Ideally, C2 can hoist the range check of an array access in loop if the >> array index is a linear function of the loop's induction variable (iv). >> Say, range check in `arr[exp]` can be hoisted if >> >> exp = k1 * iv + k2 + inv >> >> where `k1` and `k2` are compile-time constants, and `inv` is an optional >> loop invariant. But in above case, C2 igvn does some strength reduction >> on the `MulINode` used to compute `3 * i`. It results in the linear index >> expression not being recognized. So far we found 2 ideal transformations >> that may affect linear expression recognition. They are >> >> - `k * iv` --> `iv << m + iv << n` if k is the sum of 2 pow-of-2 values >> - `k * iv` --> `iv << m - iv` if k+1 is a pow-of-2 value >> >> To avoid range check hoisting and further optimizations being broken, we >> have tried improving the linear recognition. But after some experiments, >> we found complex and recursive pattern match does not always work well. >> In this patch we propose to defer these 2 ideal transformations to the >> phase of post loop igvn. In other words, these 2 strength reductions can >> only be done after all loop optimizations are over. >> >> Tested hotspot::hotspot_all_no_apps, jdk::tier1~3 and langtools::tier1. >> We also tested the performance via JMH and see obvious improvement. 
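The two strength reductions named in the quoted description can be checked with a few lines of Java. This is only an illustrative sketch: the class and method names are hypothetical, and the code shows the arithmetic identities, not how C2 implements or defers them. Note that written as Java source the shifts need parentheses, because `+` and `-` bind tighter than `<<`.

```java
// Hypothetical illustration of the two rewrites of k * iv mentioned above.
// It only demonstrates that the shifted forms equal the multiplication
// (including on int overflow); it does not model C2's IGVN in any way.
public class ScaledIvIdentity {

    // k = 3 = 2 + 1, i.e. the sum of two powers of two: (iv << 1) + (iv << 0)
    static int times3ViaShifts(int iv) {
        return (iv << 1) + iv;
    }

    // k = 7, where k + 1 = 8 is a power of two: (iv << 3) - iv
    static int times7ViaShifts(int iv) {
        return (iv << 3) - iv;
    }

    public static void main(String[] args) {
        int[] samples = {0, 1, -1, 12345, Integer.MAX_VALUE, Integer.MIN_VALUE};
        for (int iv : samples) {
            if (3 * iv != times3ViaShifts(iv) || 7 * iv != times7ViaShifts(iv)) {
                throw new AssertionError("mismatch for iv=" + iv);
            }
        }
        System.out.println("identities hold for all samples");
    }
}
```

Once the multiplication has been rewritten into this shape, the index is no longer syntactically of the form `k1 * iv + k2 + inv`, which is the reason the description gives for the range check not being hoisted.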
>>
>> Benchmark                      Improvement
>> RangeCheckHoisting.ivScaled3   +21.2%
>> RangeCheckHoisting.ivScaled7   +6.6%
>
> Pengfei Li has updated the pull request incrementally with one additional commit since the last revision:
>
>   Address more comments

All tests passed.

-------------

PR: https://git.openjdk.org/jdk/pull/9508

From thartmann at openjdk.org Tue Jul 26 10:43:43 2022
From: thartmann at openjdk.org (Tobias Hartmann)
Date: Tue, 26 Jul 2022 10:43:43 GMT
Subject: [jdk19] RFR: 8289954: C2: Assert failed in PhaseCFG::verify() after JDK-8183390
In-Reply-To:
References:
Message-ID:

On Tue, 26 Jul 2022 10:03:44 GMT, Pengfei Li wrote:

>> Sorry, but seems that the same assertion failure is still happening when running the newly added test case with fastdebug build on linux-riscv64 platform. And I have attached the hs_err and reply files on the JBS issue. Please take another look.
>
> Thanks @RealFYang for the information. I'm still investigating this in jdk 20 but so far I haven't got a clear clue. Just find if I ban `SuperWord::unrolling_analysis()` for normal loops, specifically with below change, that assertion fails again.
>
> diff --git a/src/hotspot/share/opto/superword.cpp b/src/hotspot/share/opto/superword.cpp
> index ef66840628f..997fae367b1 100644
> --- a/src/hotspot/share/opto/superword.cpp
> +++ b/src/hotspot/share/opto/superword.cpp
> @@ -217,6 +217,7 @@ void SuperWord::unrolling_analysis(int &local_loop_unroll_factor) {
>    int *ignored_loop_nodes = NEW_RESOURCE_ARRAY(int, ignored_size);
>    Node_Stack nstack((int)ignored_size);
>    CountedLoopNode *cl = lpt()->_head->as_CountedLoop();
> +  if (cl->is_normal_loop()) return;
>    Node *cl_exit = cl->loopexit_or_null();
>    int rpo_idx = _post_block.length();
>
>
> And I saw [JDK-8275330](https://github.com/openjdk/jdk/pull/6429) by Roland fixed the same assertion failure before. And the test case which causes the failure looks similar with this one - there is an inner dead loop which fails to be optimized away by C2.
> > Hi @rwestrel , may I ask if you have any ideas or hints about this? @pfustc, @RealFYang Please file a new bug for the remaining issue. Thanks! ------------- PR: https://git.openjdk.org/jdk19/pull/130 From jwaters at openjdk.org Tue Jul 26 12:19:34 2022 From: jwaters at openjdk.org (Julian Waters) Date: Tue, 26 Jul 2022 12:19:34 GMT Subject: RFR: 8291002: Rename Method::build_interpreter_method_data to Method::build_profiling_method_data Message-ID: As mentioned in the review process for [JDK-8290834](https://bugs.openjdk.org/browse/JDK-8290834) `build_interpreter_method_data` is misleading because it is actually used for creating MethodData*s throughout HotSpot, not just in the interpreter. Renamed the method to `build_profiling_method_data` instead to more accurately describe what it is used for. ------------- Commit messages: - Rename build_interpreter_method_data -> build_profiling_method_data Changes: https://git.openjdk.org/jdk/pull/9637/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=9637&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8291002 Stats: 12 lines in 9 files changed: 0 ins; 0 del; 12 mod Patch: https://git.openjdk.org/jdk/pull/9637.diff Fetch: git fetch https://git.openjdk.org/jdk pull/9637/head:pull/9637 PR: https://git.openjdk.org/jdk/pull/9637 From duke at openjdk.org Tue Jul 26 12:32:12 2022 From: duke at openjdk.org (Quan Anh Mai) Date: Tue, 26 Jul 2022 12:32:12 GMT Subject: RFR: 8283232: x86: Improve vector broadcast operations [v9] In-Reply-To: References: Message-ID: > Hi, > > This patch improves the generation of broadcasting a scalar in several ways: > > - As it has been pointed out, dumping the whole vector into the constant table is costly in terms of code size, this patch minimises this overhead for vector replicate of constants. Also, options are available for constants to be generated with more alignment so that vector load can be made efficiently without crossing cache lines. 
> - Vector broadcasting should prefer rematerialising to spilling when register pressure is high.
> - Load vectors using the same kind (integral vs floating point) of instructions as that of the results to avoid potential data bypass delay
>
> With this patch, the result of the added benchmark, which performs some operations with a really high register pressure, on my machine with Intel i7-7700HQ (avx2) is as follow:
>
>                                           Before             After
> Benchmark                   Mode  Cnt   Score    Error     Score    Error   Units    Gain
> SpiltReplicate.testDouble   avgt    5  42.621 ± 0.598    38.771 ± 0.797   ns/op   +9.03%
> SpiltReplicate.testFloat    avgt    5  42.245 ± 1.464    38.603 ± 0.367   ns/op   +8.62%
> SpiltReplicate.testInt      avgt    5  20.581 ± 5.791    13.755 ± 0.375   ns/op  +33.17%
> SpiltReplicate.testLong     avgt    5  17.794 ± 4.781    13.663 ± 0.387   ns/op  +23.22%
>
> As expected, the constant table sizes shrink significantly from 1024 bytes to 256 bytes for `long`/`double` and 128 bytes for `int`/`float` cases.
>
> This patch also removes some redundant code paths and renames some incorrectly named instructions.
>
> Thank you very much.
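The constant table sizes quoted above are consistent with simple arithmetic, sketched here under two assumptions that the mail does not state: that the benchmark replicates 32 distinct constants, and that each full vector is 32 bytes (AVX2). The class and method names are hypothetical and are not part of the patch.

```java
// Hypothetical sketch: constant table footprint when each replicated constant
// is stored as a whole vector versus as a single scalar element that is
// broadcast at run time. The count of 32 constants and the 32-byte vector
// width are assumptions chosen because they reproduce the quoted sizes.
public class ConstantTableSize {

    // Every constant occupies a full vector in the constant table.
    static int fullVectorBytes(int constants, int vectorBytes) {
        return constants * vectorBytes;
    }

    // Every constant occupies only one element; the broadcast fills the lanes.
    static int scalarBytes(int constants, int elementBytes) {
        return constants * elementBytes;
    }

    public static void main(String[] args) {
        int constants = 32;   // assumed, not taken from the mail
        int vectorBytes = 32; // assumed AVX2 vector width in bytes
        System.out.println("full vectors: " + fullVectorBytes(constants, vectorBytes) + " bytes");
        System.out.println("long/double : " + scalarBytes(constants, Long.BYTES) + " bytes");
        System.out.println("int/float   : " + scalarBytes(constants, Integer.BYTES) + " bytes");
    }
}
```

Under these assumptions the saving is a factor of 4 for 8-byte elements and a factor of 8 for 4-byte elements, independent of how many constants are replicated.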
Quan Anh Mai has updated the pull request incrementally with one additional commit since the last revision: address comments ------------- Changes: - all: https://git.openjdk.org/jdk/pull/7832/files - new: https://git.openjdk.org/jdk/pull/7832/files/6c10f9ad..6ec8519f Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=7832&range=08 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=7832&range=07-08 Stats: 38 lines in 4 files changed: 0 ins; 14 del; 24 mod Patch: https://git.openjdk.org/jdk/pull/7832.diff Fetch: git fetch https://git.openjdk.org/jdk pull/7832/head:pull/7832 PR: https://git.openjdk.org/jdk/pull/7832 From duke at openjdk.org Tue Jul 26 12:32:14 2022 From: duke at openjdk.org (Quan Anh Mai) Date: Tue, 26 Jul 2022 12:32:14 GMT Subject: RFR: 8283232: x86: Improve vector broadcast operations [v2] In-Reply-To: References: <1FBk3MauXFxUsyHz9kuhqGI-CtLRgHYmHn1eyyaDLvs=.6d4d94b0-32a0-42dc-a181-87df8d8f3b65@github.com> Message-ID: On Wed, 16 Mar 2022 17:25:53 GMT, Jatin Bhateja wrote: >>> Hi, forwarding results within the same bypass domain does not result in delay, data bypass delay happens when the data crosses different domains, according to "Intel? 64 and IA-32 Architectures Optimization Reference Manual" >>> >>> > When a source of a micro-op executed in one stack comes from a micro-op executed in another stack, a delay can occur. The delay occurs also for transitions between Intel SSE integer and Intel SSE floating-point operations. In some of the cases, the data transition is done using a micro-op that is added to the instruction flow. >>> >>> The manual mentions the guideline at section 3.5.2.2 >>> >>> ![image](https://user-images.githubusercontent.com/49088128/158618209-c0674ba7-1c93-4014-a7e1-330f4e5846da.png) >>> >>> Thanks. >> >> Thanks meant to refer to above text. I have removed incorrect reference. 
> >> > Hi, forwarding results within the same bypass domain does not result in delay, data bypass delay happens when the data crosses different domains, according to "Intel? 64 and IA-32 Architectures Optimization Reference Manual" >> > > When a source of a micro-op executed in one stack comes from a micro-op executed in another stack, a delay can occur. The delay occurs also for transitions between Intel SSE integer and Intel SSE floating-point operations. In some of the cases, the data transition is done using a micro-op that is added to the instruction flow. >> > >> > >> > The manual mentions the guideline at section 3.5.2.2 >> > ![image](https://user-images.githubusercontent.com/49088128/158618209-c0674ba7-1c93-4014-a7e1-330f4e5846da.png) >> > Thanks. >> >> Thanks meant to refer to above text. I have removed incorrect reference. > > It will still be good if we can come up with a micro benchmark, that shows the gain with the patch. @jatin-bhateja Thanks a lot for your comments, I have addressed them in the last commit ------------- PR: https://git.openjdk.org/jdk/pull/7832 From duke at openjdk.org Tue Jul 26 12:32:16 2022 From: duke at openjdk.org (Quan Anh Mai) Date: Tue, 26 Jul 2022 12:32:16 GMT Subject: RFR: 8283232: x86: Improve vector broadcast operations [v8] In-Reply-To: References: Message-ID: On Tue, 26 Jul 2022 08:04:55 GMT, Jatin Bhateja wrote: >> src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 1662: >> >>> 1660: case 64: vmovups(dst, src, Assembler::AVX_512bit); break; >>> 1661: default: ShouldNotReachHere(); >>> 1662: } >> >> Vector Load/store from memory happens from dedicated ports, can you elaborate why this change will benefit. > >> Vector Load/store from memory happens from dedicated ports, can you elaborate why this change will benefit. > > Above reference to section 3.5.5.2 also states that FP loads adds another cycle of latency, but saving the cycles penalty due to bypass b/w FP and SIMD domains still holds good. 
So may be for load there is no pressing need and existing load vector handling can be kept as it is. > > Overall savings from constant table size reductions are very impressive. Thanks. Thanks for your sharing, I have reverted the change here ------------- PR: https://git.openjdk.org/jdk/pull/7832 From duke at openjdk.org Tue Jul 26 12:52:16 2022 From: duke at openjdk.org (Quan Anh Mai) Date: Tue, 26 Jul 2022 12:52:16 GMT Subject: RFR: 8283232: x86: Improve vector broadcast operations [v8] In-Reply-To: References: Message-ID: <0TH2Cv2t4pTvoEZ9c4MLAtMqTWF2_tHYwFq-Z_pmbbQ=.5ab14625-26c7-4cb8-914e-51b3059a69fb@github.com> On Tue, 26 Jul 2022 05:53:26 GMT, Jatin Bhateja wrote: >> Quan Anh Mai has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 18 commits: >> >> - rename >> - consolidate sse checks >> - benchmark >> - fix >> - Merge branch 'master' into improveReplicate >> - remove duplicate >> - unsignness >> - rematerializing input count >> - fix comparison >> - fix rematerialize, constant deduplication >> - ... and 8 more: https://git.openjdk.org/jdk/compare/0599a05f...6c10f9ad > > src/hotspot/cpu/x86/macroAssembler_x86.cpp line 4388: > >> 4386: >> 4387: void MacroAssembler::vallones(XMMRegister dst, int vector_len) { >> 4388: // vpcmpeqd has special dependency treatment so it should be preferred to vpternlogd > > Comment is not clear, adding relevant reference will add more value. I have remeasured the statement, it seems that only the non-vex encoding version receives the special dependency treatment, so I reverted this change and added a comment for clarification. The optimisation can be found noticed in [The microarchitecture of Intel, AMD and VIA CPUs: An optimization guide for assembly programmers and compiler makers](https://www.agner.org/optimize/) on several architectures such as in section 9.8 (Register allocation and renaming in Sandy Bridge and Ivy Bridge pipeline). 
I have performed measurements on uica.uops.info. While this sequence gives 1.37 cycles/iteration on Skylake and Icelake

    pcmpeqd xmm0, xmm0
    paddd xmm0, xmm1
    paddd xmm0, xmm1
    paddd xmm0, xmm1

This version has the throughput of 4 cycles/iteration

    vpcmpeqd xmm0, xmm0, xmm0
    vpaddd xmm0, xmm1, xmm0
    vpaddd xmm0, xmm1, xmm0
    vpaddd xmm0, xmm1, xmm0

Which indicates the `vpcmpeqd` failing to break dependencies on `xmm0` as opposed to the `pcmpeqd` instruction.

Thanks.

-------------

PR: https://git.openjdk.org/jdk/pull/7832

From aph at openjdk.org Tue Jul 26 13:13:03 2022
From: aph at openjdk.org (Andrew Haley)
Date: Tue, 26 Jul 2022 13:13:03 GMT
Subject: RFR: 8285790: AArch64: Merge C2 NEON and SVE matching rules [v3]
In-Reply-To:
References:
Message-ID:

On Mon, 25 Jul 2022 03:47:24 GMT, Hao Sun wrote:

>> **MOTIVATION**
>>
>> This is a big refactoring patch of merging rules in aarch64_sve.ad and
>> aarch64_neon.ad. The motivation can also be found at [1].
>>
>> Currently AArch64 has aarch64_sve.ad and aarch64_neon.ad to support SVE
>> and NEON codegen respectively. 1) For SVE rules we use vReg operand to
>> match VecA for an arbitrary length of vector type, when SVE is enabled;
>> 2) For NEON rules we use vecX/vecD operands to match VecX/VecD for
>> 128-bit/64-bit vectors, when SVE is not enabled.
>>
>> This separation looked clean at the time of introducing SVE support.
>> However, there are two main drawbacks now.
>>
>> **Drawback-1**: NEON (Advanced SIMD) is the mandatory feature on AArch64 and
>> SVE vector registers share the lower 128 bits with NEON registers. For
>> some cases, even when SVE is enabled, we still prefer to match NEON
>> rules and emit NEON instructions.
>>
>> **Drawback-2**: With more and more vector rules added to support VectorAPI,
>> there are lots of rules in both two ad files with different predication
>> conditions, e.g., different values of UseSVE or vector type/size.
>>
>> Examples can be found in [1].
These two drawbacks make the code less >> maintainable and increase the libjvm.so code size. >> >> **KEY UPDATES** >> >> In this patch, we mainly do two things, using generic vReg to match all >> NEON/SVE vector registers and merging NEON/SVE matching rules. >> >> - Update-1: Use generic vReg to match all NEON/SVE vector registers >> >> Two different approaches were considered, and we prefer to use generic >> vector solution but keep VecA operand for all >128-bit vectors. See the >> last slide in [1]. All the changes lie in the AArch64 backend. >> >> 1) Some helpers are updated in aarch64.ad to enable generic vector on >> AArch64. See vector_ideal_reg(), pd_specialize_generic_vector_operand(), >> is_reg2reg_move() and is_generic_vector(). >> >> 2) Operand vecA is created to match VecA register, and vReg is updated >> to match VecA/D/X registers dynamically. >> >> With the introduction of generic vReg, difference in register types >> between NEON rules and SVE rules can be eliminated, which makes it easy >> to merge these rules. >> >> - Update-2: Try to merge existing rules >> >> As updated in GensrcAdlc.gmk, new ad file, aarch64_vector.ad, is >> introduced to hold the grouped and merged matching rules. >> >> 1) Similar rules with difference in vector type/size can be merged into >> new rules, where different types and vector sizes are handled in the >> codegen part, e.g., vadd(). This resolves **Drawback-2**. >> >> 2) In most cases, we tend to emit NEON instructions for 128-bit vector >> operations on SVE platforms, e.g., vadd(). This resolves **Drawback-1**. >> >> It's important to note that there are some exceptions. >> >> Exception-1: For some rules, there are no direct NEON instructions, but >> exists simple SVE implementation due to newly added SVE ISA. 
Such rules >> include vloadconB, vmulL_neon, vminL_neon, vmaxL_neon, >> reduce_addF_le128b (4F case), reduce_and/or/xor_neon, reduce_minL_neon, >> reduce_maxL_neon, vcvtLtoF_neon, vcvtDtoI_neon, rearrange_HS_neon. >> >> Exception-2: Vector mask generation and operation rules are different >> because vector mask is stored in different types of registers between >> NEON and SVE, e.g., vmaskcmp_neon and vmask_truecount_neon rules. >> >> Exception-3: Shift right related rules are different because vector >> shift right instructions differ a bit between NEON and SVE. >> >> For these exceptions, we emit NEON or SVE code simply based on UseSVE >> options. >> >> **MINOR UPDATES and CODE REFACTORING** >> >> Since we've touched all lines of code during merging rules, we further >> do more minor updates and refactoring. >> >> - Reduce regmask bits >> >> Stack slot alignment is handled specially for scalable vector, which >> will firstly align to SlotsPerVecA, and then align to the real vector >> length. We should guarantee SlotsPerVecA is no bigger than the real >> vector length. Otherwise, unused stack space would be allocated. >> >> In AArch64 SVE, the vector length can be 128 to 2048 bits. However, >> SlotsPerVecA is currently set as 8, i.e. 8 * 32 = 256 bits. As a result, >> on a 128-bit SVE platform, the stack slot is aligned to 256 bits, >> leaving the half 128 bits unused. In this patch, we reduce SlotsPerVecA >> from 8 to 4. >> >> See the updates in register_aarch64.hpp, regmask.hpp and aarch64.ad >> (chunk1 and vectora_reg). >> >> - Refactor NEON/SVE vector op support check. >> >> Merge NEON and SVE vector supported check into one single function. To >> be consistent, SVE default size supported check now is relaxed from no >> less than 64 bits to the same condition as NEON's min_vector_size(), >> i.e. 4 bytes and 4/2 booleans are also supported. This should be fine, >> as we assume at least we will emit NEON code for those small vectors, >> with unified rules. 
>> >> - Some notes for new rules >> >> 1) Since new rules are unique and it makes no sense to set different >> "ins_cost", we turn to use the default cost. >> >> 2) By default we use "pipe_slow" for matching rules in aarch64_vector.ad >> now. Hence, many SIMD pipeline classes at aarch64.ad become unused and >> can be removed. >> >> 3) Suffixes '_le128b/_gt128b' and '_neon/_sve' are appended in the >> matching rule names if needed. >> a) 'le128b' means the vector length is less than or equal to 128 bits. >> This rule can be matched on both NEON and 128-bit SVE. >> b) 'gt128b' means the vector length is greater than 128 bits. This rule >> can only be matched on SVE. >> c) 'neon' means this rule can only be matched on NEON, i.e. the >> generated instruction is not better than those in 128-bit SVE. >> d) 'sve' means this rule is only matched on SVE for all possible vector >> length, i.e. not limited to gt128b. >> >> Note-1: m4 file is not introduced because many duplications are highly >> reduced now. >> Note-2: We guess the code review for this big patch would probably take >> some time and we may need to merge latest code from master branch from >> time to time. We prefer to keep aarch64_neon/sve.ad and the >> corresponding m4 files for easy comparison and review. Of course, they >> will be finally removed after some solid reviews before integration. >> Note-3: Several other minor refactorings are done in this patch, but we >> cannot list all of them in the commit message. We have reviewed and >> tested the rules carefully to guarantee the quality. >> >> **TESTING** >> >> 1) Cross compilations on arm32/s390/pps/riscv passed. >> 2) tier1~3 jtreg passed on both x64 and aarch64 machines. >> 3) vector tests: all the test cases under the following directories can >> pass on both NEON and SVE systems with max vector length 16/32/64 bytes. 
>> >> "test/hotspot/jtreg/compiler/vectorapi/" >> "test/jdk/jdk/incubator/vector/" >> "test/hotspot/jtreg/compiler/vectorization/" >> >> 4) Performance evaluation: we choose vector micro-benchmarks from >> panama-vector:vectorIntrinsics [2] to evaluate the performance of this >> patch. We've tested *MaxVectorTests.java cases on one 128-bit SVE >> platform and one NEON platform, and didn't see any visiable regression >> with NEON and SVE. We will continue to verify more cases on other >> platforms with NEON and different SVE vector sizes. >> >> **BENEFITS** >> >> The number of matching rules is reduced to ~ **42%**. >> before: 373 (aarch64_neon.ad) + 380 (aarch64_sve.ad) = 753 >> after : 313 (aarch64_vector.ad) >> >> Code size for libjvm.so (release build) on aarch64 is reduced to ~ **96%**. >> before: 25246528 B (commit 7905788e969) >> after : 24208776 B (**nearly 1 MB reduction**) >> >> [1] http://cr.openjdk.java.net/~njian/8285790/JDK-8285790.pdf >> [2] https://github.com/openjdk/panama-vector/tree/vectorIntrinsics/test/micro/org/openjdk/bench/jdk/incubator/vector/operation >> >> Co-Developed-by: Ningsheng Jian >> Co-Developed-by: Eric Liu > > Hao Sun has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains five commits: > > - Merge branch 'master' as of 22nd-Jul into 8285790-merge-rules > > Merge branch "master". > - Add m4 file > > Add the corresponding M4 file > - Add VM_Version flag to control NEON instruction generation > > Add VM_Version flag use_neon_for_vector() to control whether to generate > NEON instructions for 128-bit vector operations. > > Currently only vector length is checked inside and it returns true for > existing SVE cores. More specific things might be checked in the near > future, e.g., the basic data type or SVE CPU model. > > Besides, new macro assembler helpers neon_vector_extend/narrow() are > introduced to make the code clean. 
> > Note: AddReductionVF/D rules are updated so that SVE instructions are > generated for 64/128-bit vector operations, because floating point > reduction add instructions are supported directly in SVE. > - Merge branch 'master' as of 7th-July into 8285790-merge-rules > - 8285790: AArch64: Merge C2 NEON and SVE matching rules > > MOTIVATION > > This is a big refactoring patch of merging rules in aarch64_sve.ad and > aarch64_neon.ad. The motivation can also be found at [1]. > > Currently AArch64 has aarch64_sve.ad and aarch64_neon.ad to support SVE > and NEON codegen respectively. 1) For SVE rules we use vReg operand to > match VecA for an arbitrary length of vector type, when SVE is enabled; > 2) For NEON rules we use vecX/vecD operands to match VecX/VecD for > 128-bit/64-bit vectors, when SVE is not enabled. > > This separation looked clean at the time of introducing SVE support. > However, there are two main drawbacks now. > > Drawback-1: NEON (Advanced SIMD) is the mandatory feature on AArch64 and > SVE vector registers share the lower 128 bits with NEON registers. For > some cases, even when SVE is enabled, we still prefer to match NEON > rules and emit NEON instructions. > > Drawback-2: With more and more vector rules added to support VectorAPI, > there are lots of rules in both two ad files with different predication > conditions, e.g., different values of UseSVE or vector type/size. > > Examples can be found in [1]. These two drawbacks make the code less > maintainable and increase the libjvm.so code size. > > KEY UPDATES > > In this patch, we mainly do two things, using generic vReg to match all > NEON/SVE vector registers and merging NEON/SVE matching rules. > > Update-1: Use generic vReg to match all NEON/SVE vector registers > > Two different approaches were considered, and we prefer to use generic > vector solution but keep VecA operand for all >128-bit vectors. See the > last slide in [1]. All the changes lie in the AArch64 backend. 
> > 1) Some helpers are updated in aarch64.ad to enable generic vector on > AArch64. See vector_ideal_reg(), pd_specialize_generic_vector_operand(), > is_reg2reg_move() and is_generic_vector(). > > 2) Operand vecA is created to match VecA register, and vReg is updated > to match VecA/D/X registers dynamically. > > With the introduction of generic vReg, difference in register types > between NEON rules and SVE rules can be eliminated, which makes it easy > to merge these rules. > > Update-2: Try to merge existing rules > > As updated in GensrcAdlc.gmk, new ad file, aarch64_vector.ad, is > introduced to hold the grouped and merged matching rules. > > 1) Similar rules with difference in vector type/size can be merged into > new rules, where different types and vector sizes are handled in the > codegen part, e.g., vadd(). This resolves Drawback-2. > > 2) In most cases, we tend to emit NEON instructions for 128-bit vector > operations on SVE platforms, e.g., vadd(). This resolves Drawback-1. > > It's important to note that there are some exceptions. > > Exception-1: For some rules, there are no direct NEON instructions, but > exists simple SVE implementation due to newly added SVE ISA. Such rules > include vloadconB, vmulL_neon, vminL_neon, vmaxL_neon, > reduce_addF_le128b (4F case), reduce_and/or/xor_neon, reduce_minL_neon, > reduce_maxL_neon, vcvtLtoF_neon, vcvtDtoI_neon, rearrange_HS_neon. > > Exception-2: Vector mask generation and operation rules are different > because vector mask is stored in different types of registers between > NEON and SVE, e.g., vmaskcmp_neon and vmask_truecount_neon rules. > > Exception-3: Shift right related rules are different because vector > shift right instructions differ a bit between NEON and SVE. > > For these exceptions, we emit NEON or SVE code simply based on UseSVE > options. > > MINOR UPDATES and CODE REFACTORING > > Since we've touched all lines of code during merging rules, we further > do more minor updates and refactoring. 
> > 1. Reduce regmask bits > > Stack slot alignment is handled specially for scalable vector, which > will firstly align to SlotsPerVecA, and then align to the real vector > length. We should guarantee SlotsPerVecA is no bigger than the real > vector length. Otherwise, unused stack space would be allocated. > > In AArch64 SVE, the vector length can be 128 to 2048 bits. However, > SlotsPerVecA is currently set as 8, i.e. 8 * 32 = 256 bits. As a result, > on a 128-bit SVE platform, the stack slot is aligned to 256 bits, > leaving the half 128 bits unused. In this patch, we reduce SlotsPerVecA > from 8 to 4. > > See the updates in register_aarch64.hpp, regmask.hpp and aarch64.ad > (chunk1 and vectora_reg). > > 2. Refactor NEON/SVE vector op support check. > > Merge NEON and SVE vector supported check into one single function. To > be consistent, SVE default size supported check now is relaxed from no > less than 64 bits to the same condition as NEON's min_vector_size(), > i.e. 4 bytes and 4/2 booleans are also supported. This should be fine, > as we assume at least we will emit NEON code for those small vectors, > with unified rules. > > 3. Some notes for new rules > > 1) Since new rules are unique and it makes no sense to set different > "ins_cost", we turn to use the default cost. > > 2) By default we use "pipe_slow" for matching rules in aarch64_vector.ad > now. Hence, many SIMD pipeline classes at aarch64.ad become unused and > can be removed. > > 3) Suffixes '_le128b/_gt128b' and '_neon/_sve' are appended in the > matching rule names if needed. > a) 'le128b' means the vector length is less than or equal to 128 bits. > This rule can be matched on both NEON and 128-bit SVE. > b) 'gt128b' means the vector length is greater than 128 bits. This rule > can only be matched on SVE. > c) 'neon' means this rule can only be matched on NEON, i.e. the > generated instruction is not better than those in 128-bit SVE. 
> d) 'sve' means this rule is only matched on SVE for all possible vector > length, i.e. not limited to gt128b. > > Note-1: m4 file is not introduced because many duplications are highly > reduced now. > Note-2: We guess the code review for this big patch would probably take > some time and we may need to merge latest code from master branch from > time to time. We prefer to keep aarch64_neon/sve.ad and the > corresponding m4 files for easy comparison and review. Of course, they > will be finally removed after some solid reviews before integration. > Note-3: Several other minor refactorings are done in this patch, but we > cannot list all of them in the commit message. We have reviewed and > tested the rules carefully to guarantee the quality. > > TESTING > > 1) Cross compilations on arm32/s390/ppc/riscv passed. > 2) tier1~3 jtreg passed on both x64 and aarch64 machines. > 3) vector tests: all the test cases under the following directories can > pass on both NEON and SVE systems with max vector length 16/32/64 bytes. > > "test/hotspot/jtreg/compiler/vectorapi/" > "test/jdk/jdk/incubator/vector/" > "test/hotspot/jtreg/compiler/vectorization/" > > 4) Performance evaluation: we choose vector micro-benchmarks from > panama-vector:vectorIntrinsics [2] to evaluate the performance of this > patch. We've tested *MaxVectorTests.java cases on one 128-bit SVE > platform and one NEON platform, and didn't see any visible regression > with NEON and SVE. We will continue to verify more cases on other > platforms with NEON and different SVE vector sizes. > > BENEFITS > > The number of matching rules is reduced to ~ 42%. > before: 373 (aarch64_neon.ad) + 380 (aarch64_sve.ad) = 753 > after : 313 (aarch64_vector.ad) > > Code size for libjvm.so (release build) on aarch64 is reduced to ~ 96%.
> before: 25246528 B (commit 7905788e969) > after : 24208776 B (nearly 1 MB reduction) > > [1] http://cr.openjdk.java.net/~njian/8285790/JDK-8285790.pdf > [2] https://github.com/openjdk/panama-vector/tree/vectorIntrinsics/test/micro/org/openjdk/bench/jdk/incubator/vector/operation > > Co-Developed-by: Ningsheng Jian > Co-Developed-by: Eric Liu src/hotspot/cpu/aarch64/aarch64_vector_ad.m4 line 27: > 25: > 26: dnl Generate the warning > 27: // This file is automatically generated by running "m4 aarch64_vector_ad.m4". Do not edit! We had to put a message like this one around every method in `aarch64.ad` or people edited them by hand. Maybe that won't be a problem in this file because it's an entire file, not parts of one. src/hotspot/cpu/aarch64/aarch64_vector_ad.m4 line 1887: > 1885: dnl REDUCE_BITWISE_OP_NEON($1, $2 $3 $4 ) > 1886: dnl REDUCE_BITWISE_OP_NEON(insn_name, is_long, type, op_name) > 1887: define(`REDUCE_BITWIESE_OP_NEON', ` `REDUCE_BITWIESE_OP_NEON` doesn't look right, ------------- PR: https://git.openjdk.org/jdk/pull/9346 From duke at openjdk.org Tue Jul 26 13:26:04 2022 From: duke at openjdk.org (Quan Anh Mai) Date: Tue, 26 Jul 2022 13:26:04 GMT Subject: RFR: 8283232: x86: Improve vector broadcast operations [v10] In-Reply-To: References: Message-ID: > Hi, > > This patch improves the generation of broadcasting a scalar in several ways: > > - As it has been pointed out, dumping the whole vector into the constant table is costly in terms of code size, this patch minimises this overhead for vector replicate of constants. Also, options are available for constants to be generated with more alignment so that vector load can be made efficiently without crossing cache lines. > - Vector broadcasting should prefer rematerialising to spilling when register pressure is high. 
> - Load vectors using the same kind (integral vs floating point) of instructions as that of the results to avoid potential data bypass delay > > With this patch, the result of the added benchmark, which performs some operations with a really high register pressure, on my machine with Intel i7-7700HQ (avx2) is as follows: > > Before After > Benchmark Mode Cnt Score Error Score Error Units Gain > SpiltReplicate.testDouble avgt 5 42.621 ± 0.598 38.771 ± 0.797 ns/op +9.03% > SpiltReplicate.testFloat avgt 5 42.245 ± 1.464 38.603 ± 0.367 ns/op +8.62% > SpiltReplicate.testInt avgt 5 20.581 ± 5.791 13.755 ± 0.375 ns/op +33.17% > SpiltReplicate.testLong avgt 5 17.794 ± 4.781 13.663 ± 0.387 ns/op +23.22% > > As expected, the constant table sizes shrink significantly from 1024 bytes to 256 bytes for `long`/`double` and 128 bytes for `int`/`float` cases. > > This patch also removes some redundant code paths and renames some incorrectly named instructions. > > Thank you very much. Quan Anh Mai has updated the pull request incrementally with one additional commit since the last revision: replI_mem ------------- Changes: - all: https://git.openjdk.org/jdk/pull/7832/files - new: https://git.openjdk.org/jdk/pull/7832/files/6ec8519f..c049d542 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=7832&range=09 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=7832&range=08-09 Stats: 3 lines in 1 file changed: 0 ins; 2 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/7832.diff Fetch: git fetch https://git.openjdk.org/jdk pull/7832/head:pull/7832 PR: https://git.openjdk.org/jdk/pull/7832 From pli at openjdk.org Tue Jul 26 13:37:08 2022 From: pli at openjdk.org (Pengfei Li) Date: Tue, 26 Jul 2022 13:37:08 GMT Subject: [jdk19] RFR: 8289954: C2: Assert failed in PhaseCFG::verify() after JDK-8183390 In-Reply-To: References: Message-ID: On Tue, 26 Jul 2022 08:34:42 GMT, Fei Yang wrote: >> Fuzzer tests report an assertion failure issue in C2 global code motion >> phase.
Git bisection shows the problem starts after our fix of post loop >> vectorization (JDK-8183390). After some narrowing down work, we find it >> is caused by the change below in that patch. >> >> >> @@ -422,14 +404,7 @@ >> cl->mark_passed_slp(); >> } >> cl->mark_was_slp(); >> - if (cl->is_main_loop()) { >> - cl->set_slp_max_unroll(local_loop_unroll_factor); >> - } else if (post_loop_allowed) { >> - if (!small_basic_type) { >> - // avoid replication context for small basic types in programmable masked loops >> - cl->set_slp_max_unroll(local_loop_unroll_factor); >> - } >> - } >> + cl->set_slp_max_unroll(local_loop_unroll_factor); >> } >> } >> >> >> This change is in function `SuperWord::unrolling_analysis()`. AFAIK, it >> helps find a loop's max unroll count via some analysis. In the original >> code, we have loop type checks and the slp max unroll value is set for >> only some types of loops. But in JDK-8183390, the check was removed by >> mistake. In my current understanding, the slp max unroll value applies >> to slp candidate loops only - either main loops or RCE'd post loops - >> so that check shouldn't be removed. After restoring it we don't see the >> assertion failure any more. >> >> The new jtreg test created in this patch can reproduce the failed assertion, >> which checks `def_block->dominates(block)` - the domination relationship >> of two blocks. But in this case, I found the blocks are in an unreachable >> inner loop, which I think ought to be optimized away in some previous C2 >> phases. As I'm not quite familiar with C2's global code motion, so >> far I still don't understand how slp max unroll count eventually causes >> that problem. This patch just restores the if condition which I removed >> incorrectly in JDK-8183390. But I still suspect that another >> hidden bug exists in C2. I would be glad if any reviewers can give me >> some guidance or suggestions. >> >> Tested hotspot::hotspot_all_no_apps, jdk::tier1~3 and langtools::tier1.
> > Sorry, but it seems that the same assertion failure is still happening when running the newly added test case with fastdebug build on linux-riscv64 platform. And I have attached the hs_err and replay files on the JBS issue. Please take another look. @RealFYang, I have created a new JBS: https://bugs.openjdk.org/browse/JDK-8291025 and attached your hs_err_* file. Feel free to edit it if you have something to add. BTW: It may be helpful if you could provide the output with VM option `-XX:+TraceSuperWordLoopUnrollAnalysis`. ------------- PR: https://git.openjdk.org/jdk19/pull/130 From pli at openjdk.org Tue Jul 26 13:41:05 2022 From: pli at openjdk.org (Pengfei Li) Date: Tue, 26 Jul 2022 13:41:05 GMT Subject: RFR: 8289996: Fix array range check hoisting for some scaled loop iv [v3] In-Reply-To: <8hRxoV9kaBCaORQWauuwPKDpimaFTVf0MFjOIu11heM=.fe359182-ebb0-4c63-99d2-1becc4579e42@github.com> References: <8hRxoV9kaBCaORQWauuwPKDpimaFTVf0MFjOIu11heM=.fe359182-ebb0-4c63-99d2-1becc4579e42@github.com> Message-ID: On Tue, 26 Jul 2022 10:37:44 GMT, Tobias Hartmann wrote: > All tests passed. Thanks for testing. I will integrate this. ------------- PR: https://git.openjdk.org/jdk/pull/9508 From pli at openjdk.org Tue Jul 26 13:50:10 2022 From: pli at openjdk.org (Pengfei Li) Date: Tue, 26 Jul 2022 13:50:10 GMT Subject: Integrated: 8289996: Fix array range check hoisting for some scaled loop iv In-Reply-To: References: Message-ID: On Fri, 15 Jul 2022 08:07:34 GMT, Pengfei Li wrote: > Recently we found some array range checks in loops are not hoisted by > C2's loop predication phase as expected. Below is a typical case. > > for (int i = 0; i < size; i++) { > b[3 * i] = a[3 * i]; > } > > Ideally, C2 can hoist the range check of an array access in loop if the > array index is a linear function of the loop's induction variable (iv).
> Say, range check in `arr[exp]` can be hoisted if > > exp = k1 * iv + k2 + inv > > where `k1` and `k2` are compile-time constants, and `inv` is an optional > loop invariant. But in above case, C2 igvn does some strength reduction > on the `MulINode` used to compute `3 * i`. It results in the linear index > expression not being recognized. So far we found 2 ideal transformations > that may affect linear expression recognition. They are > > - `k * iv` --> `iv << m + iv << n` if k is the sum of 2 pow-of-2 values > - `k * iv` --> `iv << m - iv` if k+1 is a pow-of-2 value > > To avoid range check hoisting and further optimizations being broken, we > have tried improving the linear recognition. But after some experiments, > we found complex and recursive pattern match does not always work well. > In this patch we propose to defer these 2 ideal transformations to the > phase of post loop igvn. In other words, these 2 strength reductions can > only be done after all loop optimizations are over. > > Tested hotspot::hotspot_all_no_apps, jdk::tier1~3 and langtools::tier1. > We also tested the performance via JMH and see obvious improvement. > > Benchmark Improvement > RangeCheckHoisting.ivScaled3 +21.2% > RangeCheckHoisting.ivScaled7 +6.6% This pull request has now been integrated. 
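[Editorial illustration, not part of the patch] The two strength reductions named in the description above can be checked with plain Java `int` arithmetic. The sketch below assumes k = 3 (the sum of the powers of two 2 and 1) and k = 7 (where k + 1 = 8 is a power of two); the class and method names are hypothetical and do not appear in the JDK sources. It only shows that the shifted forms compute the same value as `k * iv`, which is why the rewrite is legal but hides the linear shape `k1 * iv + k2 + inv` from a pattern matcher that looks for a multiply.

```java
public class ScaledIvDemo {
    // k = 3 = 2 + 1, so 3 * iv can be rewritten as (iv << 1) + (iv << 0).
    static int times3Shifted(int iv) {
        return (iv << 1) + iv;
    }

    // k = 7 and k + 1 = 8 is a power of two, so 7 * iv can be rewritten as (iv << 3) - iv.
    static int times7Shifted(int iv) {
        return (iv << 3) - iv;
    }

    // True when both shifted forms agree with the plain multiplications.
    // Java int arithmetic wraps modulo 2^32, so this also holds on overflow.
    static boolean formsAgree(int iv) {
        return times3Shifted(iv) == 3 * iv && times7Shifted(iv) == 7 * iv;
    }

    public static void main(String[] args) {
        int[] samples = {0, 1, 5, -4, 1000, Integer.MAX_VALUE, Integer.MIN_VALUE};
        for (int iv : samples) {
            if (!formsAgree(iv)) {
                throw new AssertionError("mismatch for iv = " + iv);
            }
        }
        System.out.println("shifted forms match k * iv for all samples");
    }
}
```

Since the values are identical, deferring the rewrite until after loop optimizations, as the patch proposes, changes only when the shifted form appears in the IR, not what is computed.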
Changeset: 89390955 Author: Pengfei Li URL: https://git.openjdk.org/jdk/commit/893909558b0439e7727208eeb582416ffc4d9b37 Stats: 193 lines in 4 files changed: 186 ins; 2 del; 5 mod 8289996: Fix array range check hoisting for some scaled loop iv Co-authored-by: John R Rose Reviewed-by: roland, kvn, thartmann ------------- PR: https://git.openjdk.org/jdk/pull/9508 From duke at openjdk.org Tue Jul 26 14:26:05 2022 From: duke at openjdk.org (Evgeny Astigeevich) Date: Tue, 26 Jul 2022 14:26:05 GMT Subject: RFR: 8287393: AArch64: Remove trampoline_call1 [v2] In-Reply-To: References: Message-ID: On Fri, 22 Jul 2022 08:15:50 GMT, Andrew Haley wrote: >> Evgeny Astigeevich has updated the pull request incrementally with one additional commit since the last revision: >> >> Replace trampoline_call1 with trampoline_call > > src/hotspot/cpu/aarch64/macroAssembler_aarch64.cpp line 637: > >> 635: // code. >> 636: PhaseOutput* phase_output = Compile::current()->output(); >> 637: in_scratch_emit_size = > > Looks reasonable enough. The only change is to check for `Compile::current()->output()` being null, right? Hi @theRealAph, I am sorry I did not get your comment. Could you please explain it? Thanks, Evgeny ------------- PR: https://git.openjdk.org/jdk/pull/9592 From aph at openjdk.org Tue Jul 26 14:46:05 2022 From: aph at openjdk.org (Andrew Haley) Date: Tue, 26 Jul 2022 14:46:05 GMT Subject: RFR: 8287393: AArch64: Remove trampoline_call1 [v2] In-Reply-To: References: Message-ID: On Tue, 26 Jul 2022 14:22:42 GMT, Evgeny Astigeevich wrote: >> src/hotspot/cpu/aarch64/macroAssembler_aarch64.cpp line 637: >> >>> 635: // code. >>> 636: PhaseOutput* phase_output = Compile::current()->output(); >>> 637: in_scratch_emit_size = >> >> Looks reasonable enough. The only change is to check for `Compile::current()->output()` being null, right? > > Hi @theRealAph, > I am sorry I did not get your comment. Could you please explain it? 
> > Thanks, > Evgeny The addition is 'PhaseOutput* phase_output = Compile::current()->output();' then 'phase_output != NULL && phase_output->in_scratch_emit_size()' so AFAICS `Compile::current()->output()` is now checked for null, where it was not before. ------------- PR: https://git.openjdk.org/jdk/pull/9592 From adinn at openjdk.org Tue Jul 26 15:53:09 2022 From: adinn at openjdk.org (Andrew Dinn) Date: Tue, 26 Jul 2022 15:53:09 GMT Subject: RFR: 8285790: AArch64: Merge C2 NEON and SVE matching rules [v3] In-Reply-To: References: Message-ID: On Mon, 25 Jul 2022 03:47:24 GMT, Hao Sun wrote: >> **MOTIVATION** >> >> This is a big refactoring patch of merging rules in aarch64_sve.ad and >> aarch64_neon.ad. The motivation can also be found at [1]. >> >> Currently AArch64 has aarch64_sve.ad and aarch64_neon.ad to support SVE >> and NEON codegen respectively. 1) For SVE rules we use vReg operand to >> match VecA for an arbitrary length of vector type, when SVE is enabled; >> 2) For NEON rules we use vecX/vecD operands to match VecX/VecD for >> 128-bit/64-bit vectors, when SVE is not enabled. >> >> This separation looked clean at the time of introducing SVE support. >> However, there are two main drawbacks now. >> >> **Drawback-1**: NEON (Advanced SIMD) is the mandatory feature on AArch64 and >> SVE vector registers share the lower 128 bits with NEON registers. For >> some cases, even when SVE is enabled, we still prefer to match NEON >> rules and emit NEON instructions. >> >> **Drawback-2**: With more and more vector rules added to support VectorAPI, >> there are lots of rules in both two ad files with different predication >> conditions, e.g., different values of UseSVE or vector type/size. >> >> Examples can be found in [1]. These two drawbacks make the code less >> maintainable and increase the libjvm.so code size. 
>> >> **KEY UPDATES** >> >> In this patch, we mainly do two things, using generic vReg to match all >> NEON/SVE vector registers and merging NEON/SVE matching rules. >> >> - Update-1: Use generic vReg to match all NEON/SVE vector registers >> >> Two different approaches were considered, and we prefer to use generic >> vector solution but keep VecA operand for all >128-bit vectors. See the >> last slide in [1]. All the changes lie in the AArch64 backend. >> >> 1) Some helpers are updated in aarch64.ad to enable generic vector on >> AArch64. See vector_ideal_reg(), pd_specialize_generic_vector_operand(), >> is_reg2reg_move() and is_generic_vector(). >> >> 2) Operand vecA is created to match VecA register, and vReg is updated >> to match VecA/D/X registers dynamically. >> >> With the introduction of generic vReg, difference in register types >> between NEON rules and SVE rules can be eliminated, which makes it easy >> to merge these rules. >> >> - Update-2: Try to merge existing rules >> >> As updated in GensrcAdlc.gmk, new ad file, aarch64_vector.ad, is >> introduced to hold the grouped and merged matching rules. >> >> 1) Similar rules with difference in vector type/size can be merged into >> new rules, where different types and vector sizes are handled in the >> codegen part, e.g., vadd(). This resolves **Drawback-2**. >> >> 2) In most cases, we tend to emit NEON instructions for 128-bit vector >> operations on SVE platforms, e.g., vadd(). This resolves **Drawback-1**. >> >> It's important to note that there are some exceptions. >> >> Exception-1: For some rules, there are no direct NEON instructions, but >> exists simple SVE implementation due to newly added SVE ISA. Such rules >> include vloadconB, vmulL_neon, vminL_neon, vmaxL_neon, >> reduce_addF_le128b (4F case), reduce_and/or/xor_neon, reduce_minL_neon, >> reduce_maxL_neon, vcvtLtoF_neon, vcvtDtoI_neon, rearrange_HS_neon. 
>> >> Exception-2: Vector mask generation and operation rules are different >> because vector mask is stored in different types of registers between >> NEON and SVE, e.g., vmaskcmp_neon and vmask_truecount_neon rules. >> >> Exception-3: Shift right related rules are different because vector >> shift right instructions differ a bit between NEON and SVE. >> >> For these exceptions, we emit NEON or SVE code simply based on UseSVE >> options. >> >> **MINOR UPDATES and CODE REFACTORING** >> >> Since we've touched all lines of code during merging rules, we further >> do more minor updates and refactoring. >> >> - Reduce regmask bits >> >> Stack slot alignment is handled specially for scalable vector, which >> will firstly align to SlotsPerVecA, and then align to the real vector >> length. We should guarantee SlotsPerVecA is no bigger than the real >> vector length. Otherwise, unused stack space would be allocated. >> >> In AArch64 SVE, the vector length can be 128 to 2048 bits. However, >> SlotsPerVecA is currently set as 8, i.e. 8 * 32 = 256 bits. As a result, >> on a 128-bit SVE platform, the stack slot is aligned to 256 bits, >> leaving the half 128 bits unused. In this patch, we reduce SlotsPerVecA >> from 8 to 4. >> >> See the updates in register_aarch64.hpp, regmask.hpp and aarch64.ad >> (chunk1 and vectora_reg). >> >> - Refactor NEON/SVE vector op support check. >> >> Merge NEON and SVE vector supported check into one single function. To >> be consistent, SVE default size supported check now is relaxed from no >> less than 64 bits to the same condition as NEON's min_vector_size(), >> i.e. 4 bytes and 4/2 booleans are also supported. This should be fine, >> as we assume at least we will emit NEON code for those small vectors, >> with unified rules. >> >> - Some notes for new rules >> >> 1) Since new rules are unique and it makes no sense to set different >> "ins_cost", we turn to use the default cost. 
>> >> 2) By default we use "pipe_slow" for matching rules in aarch64_vector.ad >> now. Hence, many SIMD pipeline classes at aarch64.ad become unused and >> can be removed. >> >> 3) Suffixes '_le128b/_gt128b' and '_neon/_sve' are appended in the >> matching rule names if needed. >> a) 'le128b' means the vector length is less than or equal to 128 bits. >> This rule can be matched on both NEON and 128-bit SVE. >> b) 'gt128b' means the vector length is greater than 128 bits. This rule >> can only be matched on SVE. >> c) 'neon' means this rule can only be matched on NEON, i.e. the >> generated instruction is not better than those in 128-bit SVE. >> d) 'sve' means this rule is only matched on SVE for all possible vector >> length, i.e. not limited to gt128b. >> >> Note-1: m4 file is not introduced because many duplications are highly >> reduced now. >> Note-2: We guess the code review for this big patch would probably take >> some time and we may need to merge latest code from master branch from >> time to time. We prefer to keep aarch64_neon/sve.ad and the >> corresponding m4 files for easy comparison and review. Of course, they >> will be finally removed after some solid reviews before integration. >> Note-3: Several other minor refactorings are done in this patch, but we >> cannot list all of them in the commit message. We have reviewed and >> tested the rules carefully to guarantee the quality. >> >> **TESTING** >> >> 1) Cross compilations on arm32/s390/ppc/riscv passed. >> 2) tier1~3 jtreg passed on both x64 and aarch64 machines. >> 3) vector tests: all the test cases under the following directories can >> pass on both NEON and SVE systems with max vector length 16/32/64 bytes. >> >> "test/hotspot/jtreg/compiler/vectorapi/" >> "test/jdk/jdk/incubator/vector/" >> "test/hotspot/jtreg/compiler/vectorization/" >> >> 4) Performance evaluation: we choose vector micro-benchmarks from >> panama-vector:vectorIntrinsics [2] to evaluate the performance of this >> patch.
We've tested *MaxVectorTests.java cases on one 128-bit SVE >> platform and one NEON platform, and didn't see any visible regression >> with NEON and SVE. We will continue to verify more cases on other >> platforms with NEON and different SVE vector sizes. >> >> **BENEFITS** >> >> The number of matching rules is reduced to ~ **42%**. >> before: 373 (aarch64_neon.ad) + 380 (aarch64_sve.ad) = 753 >> after : 313 (aarch64_vector.ad) >> >> Code size for libjvm.so (release build) on aarch64 is reduced to ~ **96%**. >> before: 25246528 B (commit 7905788e969) >> after : 24208776 B (**nearly 1 MB reduction**) >> >> [1] http://cr.openjdk.java.net/~njian/8285790/JDK-8285790.pdf >> [2] https://github.com/openjdk/panama-vector/tree/vectorIntrinsics/test/micro/org/openjdk/bench/jdk/incubator/vector/operation >> >> Co-Developed-by: Ningsheng Jian >> Co-Developed-by: Eric Liu > > Hao Sun has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains five commits: > > - Merge branch 'master' as of 22nd-Jul into 8285790-merge-rules > > Merge branch "master". > - Add m4 file > > Add the corresponding M4 file > - Add VM_Version flag to control NEON instruction generation > > Add VM_Version flag use_neon_for_vector() to control whether to generate > NEON instructions for 128-bit vector operations. > > Currently only vector length is checked inside and it returns true for > existing SVE cores. More specific things might be checked in the near > future, e.g., the basic data type or SVE CPU model. > > Besides, new macro assembler helpers neon_vector_extend/narrow() are > introduced to make the code clean. > > Note: AddReductionVF/D rules are updated so that SVE instructions are > generated for 64/128-bit vector operations, because floating point > reduction add instructions are supported directly in SVE.
> - Merge branch 'master' as of 7th-July into 8285790-merge-rules > - 8285790: AArch64: Merge C2 NEON and SVE matching rules > > MOTIVATION > > This is a big refactoring patch of merging rules in aarch64_sve.ad and > aarch64_neon.ad. The motivation can also be found at [1]. > > Currently AArch64 has aarch64_sve.ad and aarch64_neon.ad to support SVE > and NEON codegen respectively. 1) For SVE rules we use vReg operand to > match VecA for an arbitrary length of vector type, when SVE is enabled; > 2) For NEON rules we use vecX/vecD operands to match VecX/VecD for > 128-bit/64-bit vectors, when SVE is not enabled. > > This separation looked clean at the time of introducing SVE support. > However, there are two main drawbacks now. > > Drawback-1: NEON (Advanced SIMD) is the mandatory feature on AArch64 and > SVE vector registers share the lower 128 bits with NEON registers. For > some cases, even when SVE is enabled, we still prefer to match NEON > rules and emit NEON instructions. > > Drawback-2: With more and more vector rules added to support VectorAPI, > there are lots of rules in both two ad files with different predication > conditions, e.g., different values of UseSVE or vector type/size. > > Examples can be found in [1]. These two drawbacks make the code less > maintainable and increase the libjvm.so code size. > > KEY UPDATES > > In this patch, we mainly do two things, using generic vReg to match all > NEON/SVE vector registers and merging NEON/SVE matching rules. > > Update-1: Use generic vReg to match all NEON/SVE vector registers > > Two different approaches were considered, and we prefer to use generic > vector solution but keep VecA operand for all >128-bit vectors. See the > last slide in [1]. All the changes lie in the AArch64 backend. > > 1) Some helpers are updated in aarch64.ad to enable generic vector on > AArch64. See vector_ideal_reg(), pd_specialize_generic_vector_operand(), > is_reg2reg_move() and is_generic_vector(). 
> > 2) Operand vecA is created to match VecA register, and vReg is updated > to match VecA/D/X registers dynamically. > > With the introduction of generic vReg, difference in register types > between NEON rules and SVE rules can be eliminated, which makes it easy > to merge these rules. > > Update-2: Try to merge existing rules > > As updated in GensrcAdlc.gmk, new ad file, aarch64_vector.ad, is > introduced to hold the grouped and merged matching rules. > > 1) Similar rules with difference in vector type/size can be merged into > new rules, where different types and vector sizes are handled in the > codegen part, e.g., vadd(). This resolves Drawback-2. > > 2) In most cases, we tend to emit NEON instructions for 128-bit vector > operations on SVE platforms, e.g., vadd(). This resolves Drawback-1. > > It's important to note that there are some exceptions. > > Exception-1: For some rules, there are no direct NEON instructions, but > exists simple SVE implementation due to newly added SVE ISA. Such rules > include vloadconB, vmulL_neon, vminL_neon, vmaxL_neon, > reduce_addF_le128b (4F case), reduce_and/or/xor_neon, reduce_minL_neon, > reduce_maxL_neon, vcvtLtoF_neon, vcvtDtoI_neon, rearrange_HS_neon. > > Exception-2: Vector mask generation and operation rules are different > because vector mask is stored in different types of registers between > NEON and SVE, e.g., vmaskcmp_neon and vmask_truecount_neon rules. > > Exception-3: Shift right related rules are different because vector > shift right instructions differ a bit between NEON and SVE. > > For these exceptions, we emit NEON or SVE code simply based on UseSVE > options. > > MINOR UPDATES and CODE REFACTORING > > Since we've touched all lines of code during merging rules, we further > do more minor updates and refactoring. > > 1. Reduce regmask bits > > Stack slot alignment is handled specially for scalable vector, which > will firstly align to SlotsPerVecA, and then align to the real vector > length. 
We should guarantee SlotsPerVecA is no bigger than the real > vector length. Otherwise, unused stack space would be allocated. > > In AArch64 SVE, the vector length can be 128 to 2048 bits. However, > SlotsPerVecA is currently set as 8, i.e. 8 * 32 = 256 bits. As a result, > on a 128-bit SVE platform, the stack slot is aligned to 256 bits, > leaving the half 128 bits unused. In this patch, we reduce SlotsPerVecA > from 8 to 4. > > See the updates in register_aarch64.hpp, regmask.hpp and aarch64.ad > (chunk1 and vectora_reg). > > 2. Refactor NEON/SVE vector op support check. > > Merge NEON and SVE vector supported check into one single function. To > be consistent, SVE default size supported check now is relaxed from no > less than 64 bits to the same condition as NEON's min_vector_size(), > i.e. 4 bytes and 4/2 booleans are also supported. This should be fine, > as we assume at least we will emit NEON code for those small vectors, > with unified rules. > > 3. Some notes for new rules > > 1) Since new rules are unique and it makes no sense to set different > "ins_cost", we turn to use the default cost. > > 2) By default we use "pipe_slow" for matching rules in aarch64_vector.ad > now. Hence, many SIMD pipeline classes at aarch64.ad become unused and > can be removed. > > 3) Suffixes '_le128b/_gt128b' and '_neon/_sve' are appended in the > matching rule names if needed. > a) 'le128b' means the vector length is less than or equal to 128 bits. > This rule can be matched on both NEON and 128-bit SVE. > b) 'gt128b' means the vector length is greater than 128 bits. This rule > can only be matched on SVE. > c) 'neon' means this rule can only be matched on NEON, i.e. the > generated instruction is not better than those in 128-bit SVE. > d) 'sve' means this rule is only matched on SVE for all possible vector > length, i.e. not limited to gt128b. > > Note-1: m4 file is not introduced because many duplications are highly > reduced now. 
> Note-2: We guess the code review for this big patch would probably take > some time and we may need to merge latest code from master branch from > time to time. We prefer to keep aarch64_neon/sve.ad and the > corresponding m4 files for easy comparison and review. Of course, they > will be finally removed after some solid reviews before integration. > Note-3: Several other minor refactorings are done in this patch, but we > cannot list all of them in the commit message. We have reviewed and > tested the rules carefully to guarantee the quality. > > TESTING > > 1) Cross compilations on arm32/s390/ppc/riscv passed. > 2) tier1~3 jtreg passed on both x64 and aarch64 machines. > 3) vector tests: all the test cases under the following directories can > pass on both NEON and SVE systems with max vector length 16/32/64 bytes. > > "test/hotspot/jtreg/compiler/vectorapi/" > "test/jdk/jdk/incubator/vector/" > "test/hotspot/jtreg/compiler/vectorization/" > > 4) Performance evaluation: we choose vector micro-benchmarks from > panama-vector:vectorIntrinsics [2] to evaluate the performance of this > patch. We've tested *MaxVectorTests.java cases on one 128-bit SVE > platform and one NEON platform, and didn't see any visible regression > with NEON and SVE. We will continue to verify more cases on other > platforms with NEON and different SVE vector sizes. > > BENEFITS > > The number of matching rules is reduced to ~ 42%. > before: 373 (aarch64_neon.ad) + 380 (aarch64_sve.ad) = 753 > after : 313 (aarch64_vector.ad) > > Code size for libjvm.so (release build) on aarch64 is reduced to ~ 96%.
> before: 25246528 B (commit 7905788e969) > after : 24208776 B (nearly 1 MB reduction) > > [1] http://cr.openjdk.java.net/~njian/8285790/JDK-8285790.pdf > [2] https://github.com/openjdk/panama-vector/tree/vectorIntrinsics/test/micro/org/openjdk/bench/jdk/incubator/vector/operation > > Co-Developed-by: Ningsheng Jian > Co-Developed-by: Eric Liu src/hotspot/cpu/aarch64/aarch64_vector_ad.m4 line 169: > 167: case Op_MulReductionVI: > 168: case Op_MulReductionVL: > 169: // No multiply reduction instructions, and we emit scalar This comment is a little unclear. Is this what you actually mean? "No vector multiply reduction instructions, but we do emit scalar instructions for 64/128-bit vectors" ------------- PR: https://git.openjdk.org/jdk/pull/9346 From adinn at openjdk.org Tue Jul 26 16:11:09 2022 From: adinn at openjdk.org (Andrew Dinn) Date: Tue, 26 Jul 2022 16:11:09 GMT Subject: RFR: 8285790: AArch64: Merge C2 NEON and SVE matching rules [v3] In-Reply-To: References: Message-ID: <7S4ft1YHNSsIQsitw-IFquASigpuu-zBqrpNmFpNEow=.6ac24e0e-5015-4b98-92c9-1e890f477700@github.com> On Mon, 25 Jul 2022 03:47:24 GMT, Hao Sun wrote: > [...] src/hotspot/cpu/aarch64/aarch64_vector_ad.m4 line 3026: > 3024: EXTRACT_FP(D, fmovd, 2, D, 3) > 3025: > 3026: // ------------------------------ Vector mask loat/store ----------------------- Should be "load/store" ------------- PR: https://git.openjdk.org/jdk/pull/9346
From adinn at openjdk.org Tue Jul 26 16:22:55 2022 From: adinn at openjdk.org (Andrew Dinn) Date: Tue, 26 Jul 2022 16:22:55 GMT Subject: RFR: 8285790: AArch64: Merge C2 NEON and SVE matching rules [v3] In-Reply-To: References: Message-ID: On Mon, 25 Jul 2022 03:47:24 GMT, Hao Sun wrote: >> **MOTIVATION** >> >> This is a big refactoring patch of merging rules in aarch64_sve.ad and >> aarch64_neon.ad. The motivation can also be found at [1]. >> >> Currently AArch64 has aarch64_sve.ad and aarch64_neon.ad to support SVE >> and NEON codegen respectively. 1) For SVE rules we use vReg operand to >> match VecA for an arbitrary length of vector type, when SVE is enabled; >> 2) For NEON rules we use vecX/vecD operands to match VecX/VecD for >> 128-bit/64-bit vectors, when SVE is not enabled. >> >> This separation looked clean at the time of introducing SVE support. >> However, there are two main drawbacks now. >> >> **Drawback-1**: NEON (Advanced SIMD) is the mandatory feature on AArch64 and >> SVE vector registers share the lower 128 bits with NEON registers. For >> some cases, even when SVE is enabled, we still prefer to match NEON >> rules and emit NEON instructions.
>> >> **Drawback-2**: With more and more vector rules added to support VectorAPI, >> there are lots of rules in both two ad files with different predication >> conditions, e.g., different values of UseSVE or vector type/size. >> >> Examples can be found in [1]. These two drawbacks make the code less >> maintainable and increase the libjvm.so code size. >> >> **KEY UPDATES** >> >> In this patch, we mainly do two things, using generic vReg to match all >> NEON/SVE vector registers and merging NEON/SVE matching rules. >> >> - Update-1: Use generic vReg to match all NEON/SVE vector registers >> >> Two different approaches were considered, and we prefer to use generic >> vector solution but keep VecA operand for all >128-bit vectors. See the >> last slide in [1]. All the changes lie in the AArch64 backend. >> >> 1) Some helpers are updated in aarch64.ad to enable generic vector on >> AArch64. See vector_ideal_reg(), pd_specialize_generic_vector_operand(), >> is_reg2reg_move() and is_generic_vector(). >> >> 2) Operand vecA is created to match VecA register, and vReg is updated >> to match VecA/D/X registers dynamically. >> >> With the introduction of generic vReg, difference in register types >> between NEON rules and SVE rules can be eliminated, which makes it easy >> to merge these rules. >> >> - Update-2: Try to merge existing rules >> >> As updated in GensrcAdlc.gmk, new ad file, aarch64_vector.ad, is >> introduced to hold the grouped and merged matching rules. >> >> 1) Similar rules with difference in vector type/size can be merged into >> new rules, where different types and vector sizes are handled in the >> codegen part, e.g., vadd(). This resolves **Drawback-2**. >> >> 2) In most cases, we tend to emit NEON instructions for 128-bit vector >> operations on SVE platforms, e.g., vadd(). This resolves **Drawback-1**. >> >> It's important to note that there are some exceptions. 
>> >> Exception-1: For some rules, there are no direct NEON instructions, but >> exists simple SVE implementation due to newly added SVE ISA. Such rules >> include vloadconB, vmulL_neon, vminL_neon, vmaxL_neon, >> reduce_addF_le128b (4F case), reduce_and/or/xor_neon, reduce_minL_neon, >> reduce_maxL_neon, vcvtLtoF_neon, vcvtDtoI_neon, rearrange_HS_neon. >> >> Exception-2: Vector mask generation and operation rules are different >> because vector mask is stored in different types of registers between >> NEON and SVE, e.g., vmaskcmp_neon and vmask_truecount_neon rules. >> >> Exception-3: Shift right related rules are different because vector >> shift right instructions differ a bit between NEON and SVE. >> >> For these exceptions, we emit NEON or SVE code simply based on UseSVE >> options. >> >> **MINOR UPDATES and CODE REFACTORING** >> >> Since we've touched all lines of code during merging rules, we further >> do more minor updates and refactoring. >> >> - Reduce regmask bits >> >> Stack slot alignment is handled specially for scalable vector, which >> will firstly align to SlotsPerVecA, and then align to the real vector >> length. We should guarantee SlotsPerVecA is no bigger than the real >> vector length. Otherwise, unused stack space would be allocated. >> >> In AArch64 SVE, the vector length can be 128 to 2048 bits. However, >> SlotsPerVecA is currently set as 8, i.e. 8 * 32 = 256 bits. As a result, >> on a 128-bit SVE platform, the stack slot is aligned to 256 bits, >> leaving the half 128 bits unused. In this patch, we reduce SlotsPerVecA >> from 8 to 4. >> >> See the updates in register_aarch64.hpp, regmask.hpp and aarch64.ad >> (chunk1 and vectora_reg). >> >> - Refactor NEON/SVE vector op support check. >> >> Merge NEON and SVE vector supported check into one single function. To >> be consistent, SVE default size supported check now is relaxed from no >> less than 64 bits to the same condition as NEON's min_vector_size(), >> i.e. 
4 bytes and 4/2 booleans are also supported. This should be fine, >> as we assume at least we will emit NEON code for those small vectors, >> with unified rules. >> >> - Some notes for new rules >> >> 1) Since new rules are unique and it makes no sense to set different >> "ins_cost", we turn to use the default cost. >> >> 2) By default we use "pipe_slow" for matching rules in aarch64_vector.ad >> now. Hence, many SIMD pipeline classes at aarch64.ad become unused and >> can be removed. >> >> 3) Suffixes '_le128b/_gt128b' and '_neon/_sve' are appended in the >> matching rule names if needed. >> a) 'le128b' means the vector length is less than or equal to 128 bits. >> This rule can be matched on both NEON and 128-bit SVE. >> b) 'gt128b' means the vector length is greater than 128 bits. This rule >> can only be matched on SVE. >> c) 'neon' means this rule can only be matched on NEON, i.e. the >> generated instruction is not better than those in 128-bit SVE. >> d) 'sve' means this rule is only matched on SVE for all possible vector >> length, i.e. not limited to gt128b. >> >> Note-1: m4 file is not introduced because many duplications are highly >> reduced now. >> Note-2: We guess the code review for this big patch would probably take >> some time and we may need to merge latest code from master branch from >> time to time. We prefer to keep aarch64_neon/sve.ad and the >> corresponding m4 files for easy comparison and review. Of course, they >> will be finally removed after some solid reviews before integration. >> Note-3: Several other minor refactorings are done in this patch, but we >> cannot list all of them in the commit message. We have reviewed and >> tested the rules carefully to guarantee the quality. >> >> **TESTING** >> >> 1) Cross compilations on arm32/s390/ppc/riscv passed. >> 2) tier1~3 jtreg passed on both x64 and aarch64 machines.
>> 3) vector tests: all the test cases under the following directories can >> pass on both NEON and SVE systems with max vector length 16/32/64 bytes. >> >> "test/hotspot/jtreg/compiler/vectorapi/" >> "test/jdk/jdk/incubator/vector/" >> "test/hotspot/jtreg/compiler/vectorization/" >> >> 4) Performance evaluation: we choose vector micro-benchmarks from >> panama-vector:vectorIntrinsics [2] to evaluate the performance of this >> patch. We've tested *MaxVectorTests.java cases on one 128-bit SVE >> platform and one NEON platform, and didn't see any visible regression >> with NEON and SVE. We will continue to verify more cases on other >> platforms with NEON and different SVE vector sizes. >> >> **BENEFITS** >> >> The number of matching rules is reduced to ~ **42%**. >> before: 373 (aarch64_neon.ad) + 380 (aarch64_sve.ad) = 753 >> after : 313 (aarch64_vector.ad) >> >> Code size for libjvm.so (release build) on aarch64 is reduced to ~ **96%**. >> before: 25246528 B (commit 7905788e969) >> after : 24208776 B (**nearly 1 MB reduction**) >> >> [1] http://cr.openjdk.java.net/~njian/8285790/JDK-8285790.pdf >> [2] https://github.com/openjdk/panama-vector/tree/vectorIntrinsics/test/micro/org/openjdk/bench/jdk/incubator/vector/operation >> >> Co-Developed-by: Ningsheng Jian >> Co-Developed-by: Eric Liu > > Hao Sun has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains five commits: > > - Merge branch 'master' as of 22nd-Jul into 8285790-merge-rules > > Merge branch "master". > - Add m4 file > > Add the corresponding M4 file > - Add VM_Version flag to control NEON instruction generation > > Add VM_Version flag use_neon_for_vector() to control whether to generate > NEON instructions for 128-bit vector operations. > > Currently only vector length is checked inside and it returns true for > existing SVE cores. More specific things might be checked in the near > future, e.g., the basic data type or SVE CPU model.
> > Besides, new macro assembler helpers neon_vector_extend/narrow() are > introduced to make the code clean. > > Note: AddReductionVF/D rules are updated so that SVE instructions are > generated for 64/128-bit vector operations, because floating point > reduction add instructions are supported directly in SVE. > - Merge branch 'master' as of 7th-July into 8285790-merge-rules > - 8285790: AArch64: Merge C2 NEON and SVE matching rules > > MOTIVATION > > This is a big refactoring patch of merging rules in aarch64_sve.ad and > aarch64_neon.ad. The motivation can also be found at [1]. > > Currently AArch64 has aarch64_sve.ad and aarch64_neon.ad to support SVE > and NEON codegen respectively. 1) For SVE rules we use vReg operand to > match VecA for an arbitrary length of vector type, when SVE is enabled; > 2) For NEON rules we use vecX/vecD operands to match VecX/VecD for > 128-bit/64-bit vectors, when SVE is not enabled. > > This separation looked clean at the time of introducing SVE support. > However, there are two main drawbacks now. > > Drawback-1: NEON (Advanced SIMD) is the mandatory feature on AArch64 and > SVE vector registers share the lower 128 bits with NEON registers. For > some cases, even when SVE is enabled, we still prefer to match NEON > rules and emit NEON instructions. > > Drawback-2: With more and more vector rules added to support VectorAPI, > there are lots of rules in both two ad files with different predication > conditions, e.g., different values of UseSVE or vector type/size. > > Examples can be found in [1]. These two drawbacks make the code less > maintainable and increase the libjvm.so code size. > > KEY UPDATES > > In this patch, we mainly do two things, using generic vReg to match all > NEON/SVE vector registers and merging NEON/SVE matching rules. 
> > Update-1: Use generic vReg to match all NEON/SVE vector registers > > Two different approaches were considered, and we prefer to use generic > vector solution but keep VecA operand for all >128-bit vectors. See the > last slide in [1]. All the changes lie in the AArch64 backend. > > 1) Some helpers are updated in aarch64.ad to enable generic vector on > AArch64. See vector_ideal_reg(), pd_specialize_generic_vector_operand(), > is_reg2reg_move() and is_generic_vector(). > > 2) Operand vecA is created to match VecA register, and vReg is updated > to match VecA/D/X registers dynamically. > > With the introduction of generic vReg, difference in register types > between NEON rules and SVE rules can be eliminated, which makes it easy > to merge these rules. > > Update-2: Try to merge existing rules > > As updated in GensrcAdlc.gmk, new ad file, aarch64_vector.ad, is > introduced to hold the grouped and merged matching rules. > > 1) Similar rules with difference in vector type/size can be merged into > new rules, where different types and vector sizes are handled in the > codegen part, e.g., vadd(). This resolves Drawback-2. > > 2) In most cases, we tend to emit NEON instructions for 128-bit vector > operations on SVE platforms, e.g., vadd(). This resolves Drawback-1. > > It's important to note that there are some exceptions. > > Exception-1: For some rules, there are no direct NEON instructions, but > exists simple SVE implementation due to newly added SVE ISA. Such rules > include vloadconB, vmulL_neon, vminL_neon, vmaxL_neon, > reduce_addF_le128b (4F case), reduce_and/or/xor_neon, reduce_minL_neon, > reduce_maxL_neon, vcvtLtoF_neon, vcvtDtoI_neon, rearrange_HS_neon. > > Exception-2: Vector mask generation and operation rules are different > because vector mask is stored in different types of registers between > NEON and SVE, e.g., vmaskcmp_neon and vmask_truecount_neon rules. 
> > Exception-3: Shift right related rules are different because vector > shift right instructions differ a bit between NEON and SVE. > > For these exceptions, we emit NEON or SVE code simply based on UseSVE > options. > > MINOR UPDATES and CODE REFACTORING > > Since we've touched all lines of code during merging rules, we further > do more minor updates and refactoring. > > 1. Reduce regmask bits > > Stack slot alignment is handled specially for scalable vector, which > will firstly align to SlotsPerVecA, and then align to the real vector > length. We should guarantee SlotsPerVecA is no bigger than the real > vector length. Otherwise, unused stack space would be allocated. > > In AArch64 SVE, the vector length can be 128 to 2048 bits. However, > SlotsPerVecA is currently set as 8, i.e. 8 * 32 = 256 bits. As a result, > on a 128-bit SVE platform, the stack slot is aligned to 256 bits, > leaving the half 128 bits unused. In this patch, we reduce SlotsPerVecA > from 8 to 4. > > See the updates in register_aarch64.hpp, regmask.hpp and aarch64.ad > (chunk1 and vectora_reg). > > 2. Refactor NEON/SVE vector op support check. > > Merge NEON and SVE vector supported check into one single function. To > be consistent, SVE default size supported check now is relaxed from no > less than 64 bits to the same condition as NEON's min_vector_size(), > i.e. 4 bytes and 4/2 booleans are also supported. This should be fine, > as we assume at least we will emit NEON code for those small vectors, > with unified rules. > > 3. Some notes for new rules > > 1) Since new rules are unique and it makes no sense to set different > "ins_cost", we turn to use the default cost. > > 2) By default we use "pipe_slow" for matching rules in aarch64_vector.ad > now. Hence, many SIMD pipeline classes at aarch64.ad become unused and > can be removed. > > 3) Suffixes '_le128b/_gt128b' and '_neon/_sve' are appended in the > matching rule names if needed. 
> a) 'le128b' means the vector length is less than or equal to 128 bits.
> This rule can be matched on both NEON and 128-bit SVE.
> b) 'gt128b' means the vector length is greater than 128 bits. This rule
> can only be matched on SVE.
> c) 'neon' means this rule can only be matched on NEON, i.e. the
> generated instruction is not better than those in 128-bit SVE.
> d) 'sve' means this rule is only matched on SVE for all possible vector
> length, i.e. not limited to gt128b.
>
> Note-1: m4 file is not introduced because many duplications are highly
> reduced now.
> Note-2: We guess the code review for this big patch would probably take
> some time and we may need to merge latest code from master branch from
> time to time. We prefer to keep aarch64_neon/sve.ad and the
> corresponding m4 files for easy comparison and review. Of course, they
> will be finally removed after some solid reviews before integration.
> Note-3: Several other minor refactorings are done in this patch, but we
> cannot list all of them in the commit message. We have reviewed and
> tested the rules carefully to guarantee the quality.
>
> TESTING
>
> 1) Cross compilations on arm32/s390/ppc/riscv passed.
> 2) tier1~3 jtreg passed on both x64 and aarch64 machines.
> 3) vector tests: all the test cases under the following directories can
> pass on both NEON and SVE systems with max vector length 16/32/64 bytes.
>
> "test/hotspot/jtreg/compiler/vectorapi/"
> "test/jdk/jdk/incubator/vector/"
> "test/hotspot/jtreg/compiler/vectorization/"
>
> 4) Performance evaluation: we choose vector micro-benchmarks from
> panama-vector:vectorIntrinsics [2] to evaluate the performance of this
> patch. We've tested *MaxVectorTests.java cases on one 128-bit SVE
> platform and one NEON platform, and didn't see any visible regression
> with NEON and SVE. We will continue to verify more cases on other
> platforms with NEON and different SVE vector sizes.
>
> BENEFITS
>
> The number of matching rules is reduced to ~ 42%.
> before: 373 (aarch64_neon.ad) + 380 (aarch64_sve.ad) = 753 > after : 313(aarch64_vector.ad) > > Code size for libjvm.so (release build) on aarch64 is reduced to ~ 96%. > before: 25246528 B (commit 7905788e969) > after : 24208776 B (nearly 1 MB reduction) > > [1] http://cr.openjdk.java.net/~njian/8285790/JDK-8285790.pdf > [2] https://github.com/openjdk/panama-vector/tree/vectorIntrinsics/test/micro/org/openjdk/bench/jdk/incubator/vector/operation > > Co-Developed-by: Ningsheng Jian > Co-Developed-by: Eric Liu src/hotspot/cpu/aarch64/c2_MacroAssembler_aarch64.cpp line 1144: > 1142: // 4B to 4I > 1143: assert(dst_vlen_in_bytes == 16 && dst_bt == T_INT, "unsupported"); > 1144: sxtl(dst, T8H, src, T8B); Is this line a copy paste error? ------------- PR: https://git.openjdk.org/jdk/pull/9346 From adinn at openjdk.org Tue Jul 26 16:22:55 2022 From: adinn at openjdk.org (Andrew Dinn) Date: Tue, 26 Jul 2022 16:22:55 GMT Subject: RFR: 8285790: AArch64: Merge C2 NEON and SVE matching rules [v3] In-Reply-To: References: Message-ID: <5tvoj4XvXz_KxQU_Do81brOPbCoityw72WbiowEF3JU=.fb20d73b-a7f3-4eb8-a4fd-80ca320db9fc@github.com> On Tue, 26 Jul 2022 16:18:05 GMT, Andrew Dinn wrote: >> Hao Sun has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains five commits: >> >> - Merge branch 'master' as of 22nd-Jul into 8285790-merge-rules >> >> Merge branch "master". >> - Add m4 file >> >> Add the corresponding M4 file >> - Add VM_Version flag to control NEON instruction generation >> >> Add VM_Version flag use_neon_for_vector() to control whether to generate >> NEON instructions for 128-bit vector operations. >> >> Currently only vector length is checked inside and it returns true for >> existing SVE cores. More specific things might be checked in the near >> future, e.g., the basic data type or SVE CPU model. >> >> Besides, new macro assembler helpers neon_vector_extend/narrow() are >> introduced to make the code clean. 
>>
>> [...]
>>
>> BENEFITS
>>
>> The number of matching rules is reduced to ~ 42%.
>>
>> before: 373 (aarch64_neon.ad) + 380 (aarch64_sve.ad) = 753
>> after : 313 (aarch64_vector.ad)
>>
>> Code size for libjvm.so (release build) on aarch64 is reduced to ~ 96%.
>> before: 25246528 B (commit 7905788e969)
>> after : 24208776 B (nearly 1 MB reduction)
>>
>> [1] http://cr.openjdk.java.net/~njian/8285790/JDK-8285790.pdf
>> [2] https://github.com/openjdk/panama-vector/tree/vectorIntrinsics/test/micro/org/openjdk/bench/jdk/incubator/vector/operation
>>
>> Co-Developed-by: Ningsheng Jian
>> Co-Developed-by: Eric Liu
>
> src/hotspot/cpu/aarch64/c2_MacroAssembler_aarch64.cpp line 1144:
>
>> 1142: // 4B to 4I
>> 1143: assert(dst_vlen_in_bytes == 16 && dst_bt == T_INT, "unsupported");
>> 1144: sxtl(dst, T8H, src, T8B);
>
> Is this line a copy paste error?

Ah, sorry I see now -- it needs two steps to expand.

-------------

PR: https://git.openjdk.org/jdk/pull/9346

From kvn at openjdk.org Tue Jul 26 16:34:47 2022
From: kvn at openjdk.org (Vladimir Kozlov)
Date: Tue, 26 Jul 2022 16:34:47 GMT
Subject: RFR: 8291002: Rename Method::build_interpreter_method_data to Method::build_profiling_method_data
In-Reply-To:
References:
Message-ID:

On Tue, 26 Jul 2022 08:45:59 GMT, Julian Waters wrote:

> As mentioned in the review process for [JDK-8290834](https://bugs.openjdk.org/browse/JDK-8290834) `build_interpreter_method_data` is misleading because it is actually used for creating MethodData*s throughout HotSpot, not just in the interpreter. Renamed the method to `build_profiling_method_data` instead to more accurately describe what it is used for.

Good. Originally only Interpreter collected profiling data.

-------------

Marked as reviewed by kvn (Reviewer).
PR: https://git.openjdk.org/jdk/pull/9637

From kvn at openjdk.org Tue Jul 26 16:43:03 2022
From: kvn at openjdk.org (Vladimir Kozlov)
Date: Tue, 26 Jul 2022 16:43:03 GMT
Subject: RFR: 8283232: x86: Improve vector broadcast operations [v10]
In-Reply-To:
References:
Message-ID:

On Tue, 26 Jul 2022 13:26:04 GMT, Quan Anh Mai wrote:

>> Hi,
>>
>> This patch improves the generation of broadcasting a scalar in several ways:
>>
>> - As it has been pointed out, dumping the whole vector into the constant table is costly in terms of code size, this patch minimises this overhead for vector replicate of constants. Also, options are available for constants to be generated with more alignment so that vector load can be made efficiently without crossing cache lines.
>> - Vector broadcasting should prefer rematerialising to spilling when register pressure is high.
>> - Load vectors using the same kind (integral vs floating point) of instructions as that of the results to avoid potential data bypass delay
>>
>> With this patch, the result of the added benchmark, which performs some operations with a really high register pressure, on my machine with Intel i7-7700HQ (avx2) is as follows:
>>
>>                                           Before              After
>> Benchmark                  Mode  Cnt   Score    Error     Score    Error  Units     Gain
>> SpiltReplicate.testDouble  avgt    5  42.621 ±  0.598    38.771 ±  0.797  ns/op   +9.03%
>> SpiltReplicate.testFloat   avgt    5  42.245 ±  1.464    38.603 ±  0.367  ns/op   +8.62%
>> SpiltReplicate.testInt     avgt    5  20.581 ±  5.791    13.755 ±  0.375  ns/op  +33.17%
>> SpiltReplicate.testLong    avgt    5  17.794 ±  4.781    13.663 ±  0.387  ns/op  +23.22%
>>
>> As expected, the constant table sizes shrink significantly from 1024 bytes to 256 bytes for `long`/`double` and 128 bytes for `int`/`float` cases.
>>
>> This patch also removes some redundant code paths and renames some incorrectly named instructions.
>>
>> Thank you very much.
>
> Quan Anh Mai has updated the pull request incrementally with one additional commit since the last revision:
>
>   replI_mem

Testing of version 07 failed when running the vector tests with `-XX:UseAVX=0 -XX:UseSSE=2`:

# Internal Error (/workspace/open/src/hotspot/share/opto/constantTable.cpp:217), pid=2750036, tid=2750067
# assert((constant_addr - _masm.code()->consts()->start()) == con.offset()) failed: must be: 8 == 0

Current CompileTask:
C2: 287 29 % b compiler.codegen.TestByteVect::test_ci @ 2 (20 bytes)

Stack: [0x00007f7abf144000,0x00007f7abf245000], sp=0x00007f7abf23fa30, free space=1006k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
V [libjvm.so+0xb731c8] ConstantTable::emit(CodeBuffer&) const+0x1c8
V [libjvm.so+0x17c3673] PhaseOutput::fill_buffer(CodeBuffer*, unsigned int*)+0x293
V [libjvm.so+0xb191bb] Compile::Code_Gen()+0x42b
V [libjvm.so+0xb1e899] Compile::Compile(ciEnv*, ciMethod*, int, Options, DirectiveSet*)+0x1699

and

# Internal Error (/workspace/open/src/hotspot/cpu/x86/assembler_x86.cpp:5095), pid=1431469, tid=1431493
# Error: assert(VM_Version::supports_ssse3()) failed

Current CompileTask:
C2: 468 240 % b 4 java.util.Arrays::fill @ 5 (21 bytes)

Stack: [0x00007fdecd422000,0x00007fdecd523000], sp=0x00007fdecd51d8c0, free space=1006k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
V [libjvm.so+0x73079c] Assembler::pshufb(XMMRegisterImpl*, XMMRegisterImpl*)+0x13c
V [libjvm.so+0x4005d1] ReplB_regNode::emit(CodeBuffer&, PhaseRegAlloc*) const+0x1a1
V [libjvm.so+0x17be04e] PhaseOutput::scratch_emit_size(Node const*)+0x45e
V [libjvm.so+0x17b4548] PhaseOutput::shorten_branches(unsigned int*)+0x2d8
V [libjvm.so+0x17c6faa] PhaseOutput::Output()+0xcfa
V [libjvm.so+0xb191bb] Compile::Code_Gen()+0x42b
V [libjvm.so+0xb1e899] Compile::Compile(ciEnv*, ciMethod*, int, Options, DirectiveSet*)+0x1699

-------------

PR: https://git.openjdk.org/jdk/pull/7832

From shade at openjdk.org Tue Jul 26 17:34:35
2022 From: shade at openjdk.org (Aleksey Shipilev) Date: Tue, 26 Jul 2022 17:34:35 GMT Subject: RFR: 8291048: x86: compiler/c2/irTests/TestAutoVectorization2DArray.java fails with lower SSE Message-ID: [JDK-8289801](https://bugs.openjdk.org/browse/JDK-8289801) whitelisted the UseSSE/UseAVX flags, but missed update in the test. So when we test x86_32 with lower SSE, it fails, as no `LoadVector`/etc nodes are getting emitted. Additional testing: - [x] Linux x86_32 fastdebug `c2/irTests` with `-XX:UseAVX=0 -XX:UseSSE=0` (now passes) - [x] Linux x86_32 fastdebug `c2/irTests` with `-XX:UseAVX=0 -XX:UseSSE=1` (now passes) - [x] Linux x86_32 fastdebug `c2/irTests` with `-XX:UseAVX=0 -XX:UseSSE=2` (still passes) - [x] Linux x86_32 fastdebug `c2/irTests` default (still passes) - [x] Linux x86_64 fastdebug `c2/irTests` with `-XX:UseAVX=0 -XX:UseSSE=0` (still passes) - [x] Linux x86_64 fastdebug `c2/irTests` with `-XX:UseAVX=0 -XX:UseSSE=1` (still passes) - [x] Linux x86_64 fastdebug `c2/irTests` with `-XX:UseAVX=0 -XX:UseSSE=2` (still passes) - [x] Linux x86_64 fastdebug `c2/irTests` default (still passes) ------------- Commit messages: - Fix Changes: https://git.openjdk.org/jdk/pull/9646/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=9646&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8291048 Stats: 1 line in 1 file changed: 1 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/9646.diff Fetch: git fetch https://git.openjdk.org/jdk pull/9646/head:pull/9646 PR: https://git.openjdk.org/jdk/pull/9646 From kvn at openjdk.org Tue Jul 26 18:24:19 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Tue, 26 Jul 2022 18:24:19 GMT Subject: RFR: 8291048: x86: compiler/c2/irTests/TestAutoVectorization2DArray.java fails with lower SSE In-Reply-To: References: Message-ID: On Tue, 26 Jul 2022 17:24:38 GMT, Aleksey Shipilev wrote: > [JDK-8289801](https://bugs.openjdk.org/browse/JDK-8289801) whitelisted the UseSSE/UseAVX flags, but missed update in the test. 
So when we test x86_32 with lower SSE, it fails, as no `LoadVector`/etc nodes are getting emitted. > > Additional testing: > - [x] Linux x86_32 fastdebug `c2/irTests` with `-XX:UseAVX=0 -XX:UseSSE=0` (now passes) > - [x] Linux x86_32 fastdebug `c2/irTests` with `-XX:UseAVX=0 -XX:UseSSE=1` (now passes) > - [x] Linux x86_32 fastdebug `c2/irTests` with `-XX:UseAVX=0 -XX:UseSSE=2` (still passes) > - [x] Linux x86_32 fastdebug `c2/irTests` default (still passes) > - [x] Linux x86_64 fastdebug `c2/irTests` with `-XX:UseAVX=0 -XX:UseSSE=0` (still passes) > - [x] Linux x86_64 fastdebug `c2/irTests` with `-XX:UseAVX=0 -XX:UseSSE=1` (still passes) > - [x] Linux x86_64 fastdebug `c2/irTests` with `-XX:UseAVX=0 -XX:UseSSE=2` (still passes) > - [x] Linux x86_64 fastdebug `c2/irTests` default (still passes) Good. ------------- Marked as reviewed by kvn (Reviewer). PR: https://git.openjdk.org/jdk/pull/9646 From aturbanov at openjdk.org Tue Jul 26 19:15:14 2022 From: aturbanov at openjdk.org (Andrey Turbanov) Date: Tue, 26 Jul 2022 19:15:14 GMT Subject: RFR: 8290154: [JVMCI] Implement JVMCI for RISC-V [v4] In-Reply-To: References: Message-ID: On Mon, 25 Jul 2022 14:38:26 GMT, Sacha Coppey wrote: >> This patch adds a partial JVMCI implementation for RISC-V, to allow using the GraalVM Native Image RISC-V LLVM backend, which does not use JVMCI for code emission. >> It creates the jdk.vm.ci.riscv64 and jdk.vm.ci.hotspot.riscv64 packages, as well as implements a part of jvmciCodeInstaller_riscv64.cpp. To check for correctness, it enables JVMCI code installation tests on RISC-V. It will be tested soon in Native Image as well. 
> > Sacha Coppey has updated the pull request incrementally with one additional commit since the last revision: > > Avoid using set_destination when call is not jal src/jdk.internal.vm.ci/share/classes/jdk.vm.ci.riscv64/src/jdk/vm/ci/riscv64/RISCV64Kind.java line 111: > 109: > 110: public boolean isFP() { > 111: switch(this) { let's add space after `switch` ------------- PR: https://git.openjdk.org/jdk/pull/9587 From dlong at openjdk.org Tue Jul 26 20:47:06 2022 From: dlong at openjdk.org (Dean Long) Date: Tue, 26 Jul 2022 20:47:06 GMT Subject: RFR: 8289925: Shared code shouldn't reference the platform specific method frame::interpreter_frame_last_sp() [v2] In-Reply-To: References: Message-ID: On Mon, 18 Jul 2022 06:41:57 GMT, Richard Reingruber wrote: >> The method `frame::interpreter_frame_last_sp()` is a platform method in the sense that it is not declared in a shared header file. It is declared and defined on some platforms though (x86 and aarch64 I think). >> >> `frame::interpreter_frame_last_sp()` existed on these platforms before vm continuations (aka loom). Shared code that is part of the vm continuations implementation references it. This breaks the platform abstraction. >> >> This fix simply removes the special case for interpreted frames in the shared method `Continuation::continuation_bottom_sender()`. I cannot see a reason for the distinction between interpreted and compiled frames. The shared code reference to `frame::interpreter_frame_last_sp()` is thereby eliminated. >> >> Testing: hotspot_loom and jdk_loom on x86_64 and aarch64. > > Richard Reingruber has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. 
The pull request contains two additional commits since the last revision: > > - Merge branch 'master' > - Remove platform dependent method interpreter_frame_last_sp() from shared code Are you sure unextended_sp() returns the same thing as interpreter_frame_last_sp() on all platforms? I didn't think that was true for aarch64. Maybe what we need is a new shared API that will return what the continuation code expects, or promote interpreter_frame_last_sp() to be shared. It seems that all platforms implement it. @theRealAph @pron ------------- PR: https://git.openjdk.org/jdk/pull/9411 From jiefu at openjdk.org Tue Jul 26 23:22:52 2022 From: jiefu at openjdk.org (Jie Fu) Date: Tue, 26 Jul 2022 23:22:52 GMT Subject: RFR: 8291048: x86: compiler/c2/irTests/TestAutoVectorization2DArray.java fails with lower SSE In-Reply-To: References: Message-ID: <3FIv6TsBSXPHQUFXQwTR8cymcy5mdp-0lKxT0nS1lPo=.6f21afcb-40bc-4034-9a11-efb4fcf4e83f@github.com> On Tue, 26 Jul 2022 17:24:38 GMT, Aleksey Shipilev wrote: > [JDK-8289801](https://bugs.openjdk.org/browse/JDK-8289801) whitelisted the UseSSE/UseAVX flags, but missed update in the test. So when we test x86_32 with lower SSE, it fails, as no `LoadVector`/etc nodes are getting emitted. > > Additional testing: > - [x] Linux x86_32 fastdebug `c2/irTests` with `-XX:UseAVX=0 -XX:UseSSE=0` (now passes) > - [x] Linux x86_32 fastdebug `c2/irTests` with `-XX:UseAVX=0 -XX:UseSSE=1` (now passes) > - [x] Linux x86_32 fastdebug `c2/irTests` with `-XX:UseAVX=0 -XX:UseSSE=2` (still passes) > - [x] Linux x86_32 fastdebug `c2/irTests` default (still passes) > - [x] Linux x86_64 fastdebug `c2/irTests` with `-XX:UseAVX=0 -XX:UseSSE=0` (still passes) > - [x] Linux x86_64 fastdebug `c2/irTests` with `-XX:UseAVX=0 -XX:UseSSE=1` (still passes) > - [x] Linux x86_64 fastdebug `c2/irTests` with `-XX:UseAVX=0 -XX:UseSSE=2` (still passes) > - [x] Linux x86_64 fastdebug `c2/irTests` default (still passes) Marked as reviewed by jiefu (Reviewer). 
-------------

PR: https://git.openjdk.org/jdk/pull/9646

From xliu at openjdk.org Wed Jul 27 00:53:03 2022
From: xliu at openjdk.org (Xin Liu)
Date: Wed, 27 Jul 2022 00:53:03 GMT
Subject: RFR: 8290705: StringConcat::validate_mem_flow asserts with "unexpected user: StoreI" [v2]
In-Reply-To:
References:
Message-ID:

On Fri, 22 Jul 2022 05:38:01 GMT, Tobias Hartmann wrote:

>> C2's string concatenation optimization (`OptimizeStringConcat`) does not correctly handle side effecting instructions between StringBuffer Allocate/Initialize and the call to the constructor. In the failing test, see `SideEffectBeforeConstructor::test`, a `result` field is incremented just before the constructor is invoked. The string concatenation optimization still merges the allocation, constructor and `toString` calls and incorrectly re-wires the store to before the concatenation. As a result, passing `null` to the constructor will incorrectly increment the field before throwing a NullPointerException. With a debug build, we hit an assert in `StringConcat::validate_mem_flow` due to the unexpected field store. This is an old bug and an extreme edge case as javac would not generate such code.
>>
>> The following comment suggests that this case should be covered by `StringConcat::validate_control_flow()`:
>> https://github.com/openjdk/jdk/blob/3582fd9e93d9733c6fdf1f3848e0a093d44f6865/src/hotspot/share/opto/stringopts.cpp#L834-L838
>>
>> However, the control flow analysis does not catch this case. I added the missing check.
>>
>> Thanks,
>> Tobias
>
> Tobias Hartmann has updated the pull request incrementally with one additional commit since the last revision:
>
>   Modified debug printing code

test/hotspot/jtreg/compiler/stringopts/SideEffectBeforeConstructor.jasm line 54:

> 52: putstatic Field result:"I";
> 53: aload_0;
> 54: invokespecial Method java/lang/StringBuffer."<init>":"(Ljava/lang/String;)V";

Hi @TobiHartmann,
Is this the reason why you said "javac would not generate such code"?
I don't think javac will insert "SideEffectBeforeConstructor::result++" between new and invokespecial.

-------------

PR: https://git.openjdk.org/jdk/pull/9589

From xliu at openjdk.org Wed Jul 27 02:07:10 2022
From: xliu at openjdk.org (Xin Liu)
Date: Wed, 27 Jul 2022 02:07:10 GMT
Subject: RFR: 8290705: StringConcat::validate_mem_flow asserts with "unexpected user: StoreI" [v2]
In-Reply-To:
References:
Message-ID:

On Fri, 22 Jul 2022 05:38:01 GMT, Tobias Hartmann wrote:

>> C2's string concatenation optimization (`OptimizeStringConcat`) does not correctly handle side effecting instructions between StringBuffer Allocate/Initialize and the call to the constructor. In the failing test, see `SideEffectBeforeConstructor::test`, a `result` field is incremented just before the constructor is invoked. The string concatenation optimization still merges the allocation, constructor and `toString` calls and incorrectly re-wires the store to before the concatenation. As a result, passing `null` to the constructor will incorrectly increment the field before throwing a NullPointerException. With a debug build, we hit an assert in `StringConcat::validate_mem_flow` due to the unexpected field store. This is an old bug and an extreme edge case as javac would not generate such code.
>>
>> The following comment suggests that this case should be covered by `StringConcat::validate_control_flow()`:
>> https://github.com/openjdk/jdk/blob/3582fd9e93d9733c6fdf1f3848e0a093d44f6865/src/hotspot/share/opto/stringopts.cpp#L834-L838
>>
>> However, the control flow analysis does not catch this case. I added the missing check.
>>
>> Thanks,
>> Tobias
>
> Tobias Hartmann has updated the pull request incrementally with one additional commit since the last revision:
>
>   Modified debug printing code

This fix is reasonable. LGTM. (I am not a reviewer).

A side note to myself: any node with a side effect between Initialize and `<init>()` must commit, because `<init>` may throw an exception.
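
For illustration only, here is a minimal Java sketch of the observable behaviour that has to be preserved. It is not taken from the patch or from the jasm test: the class and method names are hypothetical, and because this is plain Java source the increment happens before `new` rather than between `new` and `invokespecial` as in the hand-written bytecode. It only shows the expectation that a side effect preceding a throwing `StringBuffer` constructor stays visible, exactly once, after the exception.

```java
public class SideEffectOrder {
    // Plays the role of the `result` field in the jasm test (hypothetical stand-in).
    static int result = 0;

    // Performs a side effect, then calls a constructor that may throw.
    // Returns true if the constructor threw a NullPointerException.
    static boolean incrementThenConstruct(String s) {
        try {
            result++;             // side effect that must remain visible
            new StringBuffer(s);  // StringBuffer(String) throws NPE when s is null
            return false;
        } catch (NullPointerException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        boolean threw = incrementThenConstruct(null);
        // Expected: the exception is observed and the field was incremented once.
        System.out.println("threw=" + threw + ", result=" + result);
    }
}
```

Run in the interpreter this prints `threw=true, result=1`; compiled code is expected to give the same answer.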
-------------

Marked as reviewed by xliu (Committer).

PR: https://git.openjdk.org/jdk/pull/9589

From xliu at openjdk.org Wed Jul 27 02:12:04 2022
From: xliu at openjdk.org (Xin Liu)
Date: Wed, 27 Jul 2022 02:12:04 GMT
Subject: RFR: 8290705: StringConcat::validate_mem_flow asserts with "unexpected user: StoreI" [v2]
In-Reply-To:
References:
Message-ID:

On Wed, 27 Jul 2022 00:49:52 GMT, Xin Liu wrote:

>> Tobias Hartmann has updated the pull request incrementally with one additional commit since the last revision:
>>
>>   Modified debug printing code
>
> test/hotspot/jtreg/compiler/stringopts/SideEffectBeforeConstructor.jasm line 54:
>
>> 52: putstatic Field result:"I";
>> 53: aload_0;
>> 54: invokespecial Method java/lang/StringBuffer."<init>":"(Ljava/lang/String;)V";
>
> Hi @TobiHartmann,
> Is this the reason why you said "javac would not generate such code"?
> I don't think javac will insert "SideEffectBeforeConstructor::result++" between new and invokespecial.

I tried that. I don't think there's a way to generate code like that using javac. So we fix this bug because somebody may emit weird bytecode sequences using asm?

-------------

PR: https://git.openjdk.org/jdk/pull/9589

From haosun at openjdk.org Wed Jul 27 04:12:03 2022
From: haosun at openjdk.org (Hao Sun)
Date: Wed, 27 Jul 2022 04:12:03 GMT
Subject: RFR: 8285790: AArch64: Merge C2 NEON and SVE matching rules [v3]
In-Reply-To:
References:
Message-ID:

On Tue, 26 Jul 2022 13:09:05 GMT, Andrew Haley wrote:

>> Hao Sun has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains five commits:
>>
>> - Merge branch 'master' as of 22nd-Jul into 8285790-merge-rules
>>
>> Merge branch "master".
>> - Add m4 file
>>
>> Add the corresponding M4 file
>> - Add VM_Version flag to control NEON instruction generation
>>
>> Add VM_Version flag use_neon_for_vector() to control whether to generate
>> NEON instructions for 128-bit vector operations.
>> >> Currently only vector length is checked inside and it returns true for >> existing SVE cores. More specific things might be checked in the near >> future, e.g., the basic data type or SVE CPU model. >> >> Besides, new macro assembler helpers neon_vector_extend/narrow() are >> introduced to make the code clean. >> >> Note: AddReductionVF/D rules are updated so that SVE instructions are >> generated for 64/128-bit vector operations, because floating point >> reduction add instructions are supported directly in SVE. >> - Merge branch 'master' as of 7th-July into 8285790-merge-rules >> - 8285790: AArch64: Merge C2 NEON and SVE matching rules >> >> MOTIVATION >> >> This is a big refactoring patch of merging rules in aarch64_sve.ad and >> aarch64_neon.ad. The motivation can also be found at [1]. >> >> Currently AArch64 has aarch64_sve.ad and aarch64_neon.ad to support SVE >> and NEON codegen respectively. 1) For SVE rules we use vReg operand to >> match VecA for an arbitrary length of vector type, when SVE is enabled; >> 2) For NEON rules we use vecX/vecD operands to match VecX/VecD for >> 128-bit/64-bit vectors, when SVE is not enabled. >> >> This separation looked clean at the time of introducing SVE support. >> However, there are two main drawbacks now. >> >> Drawback-1: NEON (Advanced SIMD) is the mandatory feature on AArch64 and >> SVE vector registers share the lower 128 bits with NEON registers. For >> some cases, even when SVE is enabled, we still prefer to match NEON >> rules and emit NEON instructions. >> >> Drawback-2: With more and more vector rules added to support VectorAPI, >> there are lots of rules in both two ad files with different predication >> conditions, e.g., different values of UseSVE or vector type/size. >> >> Examples can be found in [1]. These two drawbacks make the code less >> maintainable and increase the libjvm.so code size. 
>> >> KEY UPDATES >> >> In this patch, we mainly do two things, using generic vReg to match all >> NEON/SVE vector registers and merging NEON/SVE matching rules. >> >> Update-1: Use generic vReg to match all NEON/SVE vector registers >> >> Two different approaches were considered, and we prefer to use generic >> vector solution but keep VecA operand for all >128-bit vectors. See the >> last slide in [1]. All the changes lie in the AArch64 backend. >> >> 1) Some helpers are updated in aarch64.ad to enable generic vector on >> AArch64. See vector_ideal_reg(), pd_specialize_generic_vector_operand(), >> is_reg2reg_move() and is_generic_vector(). >> >> 2) Operand vecA is created to match VecA register, and vReg is updated >> to match VecA/D/X registers dynamically. >> >> With the introduction of generic vReg, difference in register types >> between NEON rules and SVE rules can be eliminated, which makes it easy >> to merge these rules. >> >> Update-2: Try to merge existing rules >> >> As updated in GensrcAdlc.gmk, new ad file, aarch64_vector.ad, is >> introduced to hold the grouped and merged matching rules. >> >> 1) Similar rules with difference in vector type/size can be merged into >> new rules, where different types and vector sizes are handled in the >> codegen part, e.g., vadd(). This resolves Drawback-2. >> >> 2) In most cases, we tend to emit NEON instructions for 128-bit vector >> operations on SVE platforms, e.g., vadd(). This resolves Drawback-1. >> >> It's important to note that there are some exceptions. >> >> Exception-1: For some rules, there are no direct NEON instructions, but >> exists simple SVE implementation due to newly added SVE ISA. Such rules >> include vloadconB, vmulL_neon, vminL_neon, vmaxL_neon, >> reduce_addF_le128b (4F case), reduce_and/or/xor_neon, reduce_minL_neon, >> reduce_maxL_neon, vcvtLtoF_neon, vcvtDtoI_neon, rearrange_HS_neon. 
>> >> Exception-2: Vector mask generation and operation rules are different >> because vector mask is stored in different types of registers between >> NEON and SVE, e.g., vmaskcmp_neon and vmask_truecount_neon rules. >> >> Exception-3: Shift right related rules are different because vector >> shift right instructions differ a bit between NEON and SVE. >> >> For these exceptions, we emit NEON or SVE code simply based on UseSVE >> options. >> >> MINOR UPDATES and CODE REFACTORING >> >> Since we've touched all lines of code during merging rules, we further >> do more minor updates and refactoring. >> >> 1. Reduce regmask bits >> >> Stack slot alignment is handled specially for scalable vector, which >> will firstly align to SlotsPerVecA, and then align to the real vector >> length. We should guarantee SlotsPerVecA is no bigger than the real >> vector length. Otherwise, unused stack space would be allocated. >> >> In AArch64 SVE, the vector length can be 128 to 2048 bits. However, >> SlotsPerVecA is currently set as 8, i.e. 8 * 32 = 256 bits. As a result, >> on a 128-bit SVE platform, the stack slot is aligned to 256 bits, >> leaving the half 128 bits unused. In this patch, we reduce SlotsPerVecA >> from 8 to 4. >> >> See the updates in register_aarch64.hpp, regmask.hpp and aarch64.ad >> (chunk1 and vectora_reg). >> >> 2. Refactor NEON/SVE vector op support check. >> >> Merge NEON and SVE vector supported check into one single function. To >> be consistent, SVE default size supported check now is relaxed from no >> less than 64 bits to the same condition as NEON's min_vector_size(), >> i.e. 4 bytes and 4/2 booleans are also supported. This should be fine, >> as we assume at least we will emit NEON code for those small vectors, >> with unified rules. >> >> 3. Some notes for new rules >> >> 1) Since new rules are unique and it makes no sense to set different >> "ins_cost", we turn to use the default cost. 
>> >> 2) By default we use "pipe_slow" for matching rules in aarch64_vector.ad >> now. Hence, many SIMD pipeline classes at aarch64.ad become unused and >> can be removed. >> >> 3) Suffixes '_le128b/_gt128b' and '_neon/_sve' are appended in the >> matching rule names if needed. >> a) 'le128b' means the vector length is less than or equal to 128 bits. >> This rule can be matched on both NEON and 128-bit SVE. >> b) 'gt128b' means the vector length is greater than 128 bits. This rule >> can only be matched on SVE. >> c) 'neon' means this rule can only be matched on NEON, i.e. the >> generated instruction is not better than those in 128-bit SVE. >> d) 'sve' means this rule is only matched on SVE for all possible vector >> length, i.e. not limited to gt128b. >> >> Note-1: m4 file is not introduced because many duplications are highly >> reduced now. >> Note-2: We guess the code review for this big patch would probably take >> some time and we may need to merge latest code from master branch from >> time to time. We prefer to keep aarch64_neon/sve.ad and the >> corresponding m4 files for easy comparison and review. Of course, they >> will be finally removed after some solid reviews before integration. >> Note-3: Several other minor refactorings are done in this patch, but we >> cannot list all of them in the commit message. We have reviewed and >> tested the rules carefully to guarantee the quality. >> >> TESTING >> >> 1) Cross compilations on arm32/s390/pps/riscv passed. >> 2) tier1~3 jtreg passed on both x64 and aarch64 machines. >> 3) vector tests: all the test cases under the following directories can >> pass on both NEON and SVE systems with max vector length 16/32/64 bytes. >> >> "test/hotspot/jtreg/compiler/vectorapi/" >> "test/jdk/jdk/incubator/vector/" >> "test/hotspot/jtreg/compiler/vectorization/" >> >> 4) Performance evaluation: we choose vector micro-benchmarks from >> panama-vector:vectorIntrinsics [2] to evaluate the performance of this >> patch. 
We've tested *MaxVectorTests.java cases on one 128-bit SVE >> platform and one NEON platform, and didn't see any visiable regression >> with NEON and SVE. We will continue to verify more cases on other >> platforms with NEON and different SVE vector sizes. >> >> BENEFITS >> >> The number of matching rules is reduced to ~ 42%. >> before: 373 (aarch64_neon.ad) + 380 (aarch64_sve.ad) = 753 >> after : 313(aarch64_vector.ad) >> >> Code size for libjvm.so (release build) on aarch64 is reduced to ~ 96%. >> before: 25246528 B (commit 7905788e969) >> after : 24208776 B (nearly 1 MB reduction) >> >> [1] http://cr.openjdk.java.net/~njian/8285790/JDK-8285790.pdf >> [2] https://github.com/openjdk/panama-vector/tree/vectorIntrinsics/test/micro/org/openjdk/bench/jdk/incubator/vector/operation >> >> Co-Developed-by: Ningsheng Jian >> Co-Developed-by: Eric Liu > > src/hotspot/cpu/aarch64/aarch64_vector_ad.m4 line 27: > >> 25: >> 26: dnl Generate the warning >> 27: // This file is automatically generated by running "m4 aarch64_vector_ad.m4". Do not edit! > > We had to put a message like this one around every method in `aarch64.ad` or people edited them by hand. Maybe that won't be a problem in this file because it's an entire file, not parts of one. Different from `aarch64.ad` file, the scope of this message is the whole file, not some sections inside the file. Hence, I think current version is fine. > src/hotspot/cpu/aarch64/aarch64_vector_ad.m4 line 1887: > >> 1885: dnl REDUCE_BITWISE_OP_NEON($1, $2 $3 $4 ) >> 1886: dnl REDUCE_BITWISE_OP_NEON(insn_name, is_long, type, op_name) >> 1887: define(`REDUCE_BITWIESE_OP_NEON', ` > > `REDUCE_BITWIESE_OP_NEON` doesn't look right, Thanks for pointing this out! Updated here and other similar macros. 
------------- PR: https://git.openjdk.org/jdk/pull/9346 From haosun at openjdk.org Wed Jul 27 04:11:57 2022 From: haosun at openjdk.org (Hao Sun) Date: Wed, 27 Jul 2022 04:11:57 GMT Subject: RFR: 8285790: AArch64: Merge C2 NEON and SVE matching rules [v4] In-Reply-To: References: Message-ID: > **MOTIVATION** > > This is a big refactoring patch of merging rules in aarch64_sve.ad and > aarch64_neon.ad. The motivation can also be found at [1]. > > Currently AArch64 has aarch64_sve.ad and aarch64_neon.ad to support SVE > and NEON codegen respectively. 1) For SVE rules we use vReg operand to > match VecA for an arbitrary length of vector type, when SVE is enabled; > 2) For NEON rules we use vecX/vecD operands to match VecX/VecD for > 128-bit/64-bit vectors, when SVE is not enabled. > > This separation looked clean at the time of introducing SVE support. > However, there are two main drawbacks now. > > **Drawback-1**: NEON (Advanced SIMD) is the mandatory feature on AArch64 and > SVE vector registers share the lower 128 bits with NEON registers. For > some cases, even when SVE is enabled, we still prefer to match NEON > rules and emit NEON instructions. > > **Drawback-2**: With more and more vector rules added to support VectorAPI, > there are lots of rules in both two ad files with different predication > conditions, e.g., different values of UseSVE or vector type/size. > > Examples can be found in [1]. These two drawbacks make the code less > maintainable and increase the libjvm.so code size. > > **KEY UPDATES** > > In this patch, we mainly do two things, using generic vReg to match all > NEON/SVE vector registers and merging NEON/SVE matching rules. > > - Update-1: Use generic vReg to match all NEON/SVE vector registers > > Two different approaches were considered, and we prefer to use generic > vector solution but keep VecA operand for all >128-bit vectors. See the > last slide in [1]. All the changes lie in the AArch64 backend. 
> > 1) Some helpers are updated in aarch64.ad to enable generic vector on > AArch64. See vector_ideal_reg(), pd_specialize_generic_vector_operand(), > is_reg2reg_move() and is_generic_vector(). > > 2) Operand vecA is created to match VecA register, and vReg is updated > to match VecA/D/X registers dynamically. > > With the introduction of generic vReg, difference in register types > between NEON rules and SVE rules can be eliminated, which makes it easy > to merge these rules. > > - Update-2: Try to merge existing rules > > As updated in GensrcAdlc.gmk, new ad file, aarch64_vector.ad, is > introduced to hold the grouped and merged matching rules. > > 1) Similar rules with difference in vector type/size can be merged into > new rules, where different types and vector sizes are handled in the > codegen part, e.g., vadd(). This resolves **Drawback-2**. > > 2) In most cases, we tend to emit NEON instructions for 128-bit vector > operations on SVE platforms, e.g., vadd(). This resolves **Drawback-1**. > > It's important to note that there are some exceptions. > > Exception-1: For some rules, there are no direct NEON instructions, but > exists simple SVE implementation due to newly added SVE ISA. Such rules > include vloadconB, vmulL_neon, vminL_neon, vmaxL_neon, > reduce_addF_le128b (4F case), reduce_and/or/xor_neon, reduce_minL_neon, > reduce_maxL_neon, vcvtLtoF_neon, vcvtDtoI_neon, rearrange_HS_neon. > > Exception-2: Vector mask generation and operation rules are different > because vector mask is stored in different types of registers between > NEON and SVE, e.g., vmaskcmp_neon and vmask_truecount_neon rules. > > Exception-3: Shift right related rules are different because vector > shift right instructions differ a bit between NEON and SVE. > > For these exceptions, we emit NEON or SVE code simply based on UseSVE > options. 
> > **MINOR UPDATES and CODE REFACTORING** > > Since we've touched all lines of code during merging rules, we further > do more minor updates and refactoring. > > - Reduce regmask bits > > Stack slot alignment is handled specially for scalable vector, which > will firstly align to SlotsPerVecA, and then align to the real vector > length. We should guarantee SlotsPerVecA is no bigger than the real > vector length. Otherwise, unused stack space would be allocated. > > In AArch64 SVE, the vector length can be 128 to 2048 bits. However, > SlotsPerVecA is currently set as 8, i.e. 8 * 32 = 256 bits. As a result, > on a 128-bit SVE platform, the stack slot is aligned to 256 bits, > leaving the half 128 bits unused. In this patch, we reduce SlotsPerVecA > from 8 to 4. > > See the updates in register_aarch64.hpp, regmask.hpp and aarch64.ad > (chunk1 and vectora_reg). > > - Refactor NEON/SVE vector op support check. > > Merge NEON and SVE vector supported check into one single function. To > be consistent, SVE default size supported check now is relaxed from no > less than 64 bits to the same condition as NEON's min_vector_size(), > i.e. 4 bytes and 4/2 booleans are also supported. This should be fine, > as we assume at least we will emit NEON code for those small vectors, > with unified rules. > > - Some notes for new rules > > 1) Since new rules are unique and it makes no sense to set different > "ins_cost", we turn to use the default cost. > > 2) By default we use "pipe_slow" for matching rules in aarch64_vector.ad > now. Hence, many SIMD pipeline classes at aarch64.ad become unused and > can be removed. > > 3) Suffixes '_le128b/_gt128b' and '_neon/_sve' are appended in the > matching rule names if needed. > a) 'le128b' means the vector length is less than or equal to 128 bits. > This rule can be matched on both NEON and 128-bit SVE. > b) 'gt128b' means the vector length is greater than 128 bits. This rule > can only be matched on SVE. 
> c) 'neon' means this rule can only be matched on NEON, i.e. the > generated instruction is not better than those in 128-bit SVE. > d) 'sve' means this rule is only matched on SVE for all possible vector > lengths, i.e. not limited to gt128b. > > Note-1: m4 file is not introduced because many duplications are highly > reduced now. > Note-2: We guess the code review for this big patch would probably take > some time and we may need to merge latest code from master branch from > time to time. We prefer to keep aarch64_neon/sve.ad and the > corresponding m4 files for easy comparison and review. Of course, they > will be finally removed after some solid reviews before integration. > Note-3: Several other minor refactorings are done in this patch, but we > cannot list all of them in the commit message. We have reviewed and > tested the rules carefully to guarantee the quality. > > **TESTING** > > 1) Cross compilations on arm32/s390/pps/riscv passed. > 2) tier1~3 jtreg passed on both x64 and aarch64 machines. > 3) vector tests: all the test cases under the following directories can > pass on both NEON and SVE systems with max vector length 16/32/64 bytes. > > "test/hotspot/jtreg/compiler/vectorapi/" > "test/jdk/jdk/incubator/vector/" > "test/hotspot/jtreg/compiler/vectorization/" > > 4) Performance evaluation: we chose vector micro-benchmarks from > panama-vector:vectorIntrinsics [2] to evaluate the performance of this > patch. We've tested *MaxVectorTests.java cases on one 128-bit SVE > platform and one NEON platform, and didn't see any visible regression > with NEON and SVE. We will continue to verify more cases on other > platforms with NEON and different SVE vector sizes. > > **BENEFITS** > > The number of matching rules is reduced to ~ **42%**. > before: 373 (aarch64_neon.ad) + 380 (aarch64_sve.ad) = 753 > after : 313 (aarch64_vector.ad) > > Code size for libjvm.so (release build) on aarch64 is reduced to ~ **96%**. 
> before: 25246528 B (commit 7905788e969) > after : 24208776 B (**nearly 1 MB reduction**) > > [1] http://cr.openjdk.java.net/~njian/8285790/JDK-8285790.pdf > [2] https://github.com/openjdk/panama-vector/tree/vectorIntrinsics/test/micro/org/openjdk/bench/jdk/incubator/vector/operation > > Co-Developed-by: Ningsheng Jian > Co-Developed-by: Eric Liu Hao Sun has updated the pull request incrementally with one additional commit since the last revision: Improve comments and fix typos 1. Improve the comment for MulReductionV*. 2. Fix typos for REDUCE_BITWISE_OP_XX macros. 3. For the typo in vector mask load/store rules. ------------- Changes: - all: https://git.openjdk.org/jdk/pull/9346/files - new: https://git.openjdk.org/jdk/pull/9346/files/196fdbad..3e85e2ad Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=9346&range=03 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=9346&range=02-03 Stats: 27 lines in 2 files changed: 0 ins; 0 del; 27 mod Patch: https://git.openjdk.org/jdk/pull/9346.diff Fetch: git fetch https://git.openjdk.org/jdk pull/9346/head:pull/9346 PR: https://git.openjdk.org/jdk/pull/9346 From haosun at openjdk.org Wed Jul 27 04:12:06 2022 From: haosun at openjdk.org (Hao Sun) Date: Wed, 27 Jul 2022 04:12:06 GMT Subject: RFR: 8285790: AArch64: Merge C2 NEON and SVE matching rules [v3] In-Reply-To: References: Message-ID: On Tue, 26 Jul 2022 15:49:06 GMT, Andrew Dinn wrote: >> Hao Sun has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains five commits: >> >> - Merge branch 'master' as of 22nd-Jul into 8285790-merge-rules >> >> Merge branch "master". >> - Add m4 file >> >> Add the corresponding M4 file >> - Add VM_Version flag to control NEON instruction generation >> >> Add VM_Version flag use_neon_for_vector() to control whether to generate >> NEON instructions for 128-bit vector operations. >> >> Currently only vector length is checked inside and it returns true for >> existing SVE cores. 
More specific things might be checked in the near >> future, e.g., the basic data type or SVE CPU model. >> >> Besides, new macro assembler helpers neon_vector_extend/narrow() are >> introduced to make the code clean. >> >> Note: AddReductionVF/D rules are updated so that SVE instructions are >> generated for 64/128-bit vector operations, because floating point >> reduction add instructions are supported directly in SVE. >> - Merge branch 'master' as of 7th-July into 8285790-merge-rules >> - 8285790: AArch64: Merge C2 NEON and SVE matching rules >> >> MOTIVATION >> >> This is a big refactoring patch of merging rules in aarch64_sve.ad and >> aarch64_neon.ad. The motivation can also be found at [1]. >> >> Currently AArch64 has aarch64_sve.ad and aarch64_neon.ad to support SVE >> and NEON codegen respectively. 1) For SVE rules we use vReg operand to >> match VecA for an arbitrary length of vector type, when SVE is enabled; >> 2) For NEON rules we use vecX/vecD operands to match VecX/VecD for >> 128-bit/64-bit vectors, when SVE is not enabled. >> >> This separation looked clean at the time of introducing SVE support. >> However, there are two main drawbacks now. >> >> Drawback-1: NEON (Advanced SIMD) is the mandatory feature on AArch64 and >> SVE vector registers share the lower 128 bits with NEON registers. For >> some cases, even when SVE is enabled, we still prefer to match NEON >> rules and emit NEON instructions. >> >> Drawback-2: With more and more vector rules added to support VectorAPI, >> there are lots of rules in both two ad files with different predication >> conditions, e.g., different values of UseSVE or vector type/size. >> >> Examples can be found in [1]. These two drawbacks make the code less >> maintainable and increase the libjvm.so code size. >> >> KEY UPDATES >> >> In this patch, we mainly do two things, using generic vReg to match all >> NEON/SVE vector registers and merging NEON/SVE matching rules. 
>> >> Update-1: Use generic vReg to match all NEON/SVE vector registers >> >> Two different approaches were considered, and we prefer to use generic >> vector solution but keep VecA operand for all >128-bit vectors. See the >> last slide in [1]. All the changes lie in the AArch64 backend. >> >> 1) Some helpers are updated in aarch64.ad to enable generic vector on >> AArch64. See vector_ideal_reg(), pd_specialize_generic_vector_operand(), >> is_reg2reg_move() and is_generic_vector(). >> >> 2) Operand vecA is created to match VecA register, and vReg is updated >> to match VecA/D/X registers dynamically. >> >> With the introduction of generic vReg, difference in register types >> between NEON rules and SVE rules can be eliminated, which makes it easy >> to merge these rules. >> >> Update-2: Try to merge existing rules >> >> As updated in GensrcAdlc.gmk, new ad file, aarch64_vector.ad, is >> introduced to hold the grouped and merged matching rules. >> >> 1) Similar rules with difference in vector type/size can be merged into >> new rules, where different types and vector sizes are handled in the >> codegen part, e.g., vadd(). This resolves Drawback-2. >> >> 2) In most cases, we tend to emit NEON instructions for 128-bit vector >> operations on SVE platforms, e.g., vadd(). This resolves Drawback-1. >> >> It's important to note that there are some exceptions. >> >> Exception-1: For some rules, there are no direct NEON instructions, but >> exists simple SVE implementation due to newly added SVE ISA. Such rules >> include vloadconB, vmulL_neon, vminL_neon, vmaxL_neon, >> reduce_addF_le128b (4F case), reduce_and/or/xor_neon, reduce_minL_neon, >> reduce_maxL_neon, vcvtLtoF_neon, vcvtDtoI_neon, rearrange_HS_neon. >> >> Exception-2: Vector mask generation and operation rules are different >> because vector mask is stored in different types of registers between >> NEON and SVE, e.g., vmaskcmp_neon and vmask_truecount_neon rules. 
>> >> Exception-3: Shift right related rules are different because vector >> shift right instructions differ a bit between NEON and SVE. >> >> For these exceptions, we emit NEON or SVE code simply based on UseSVE >> options. >> >> MINOR UPDATES and CODE REFACTORING >> >> Since we've touched all lines of code during merging rules, we further >> do more minor updates and refactoring. >> >> 1. Reduce regmask bits >> >> Stack slot alignment is handled specially for scalable vector, which >> will firstly align to SlotsPerVecA, and then align to the real vector >> length. We should guarantee SlotsPerVecA is no bigger than the real >> vector length. Otherwise, unused stack space would be allocated. >> >> In AArch64 SVE, the vector length can be 128 to 2048 bits. However, >> SlotsPerVecA is currently set as 8, i.e. 8 * 32 = 256 bits. As a result, >> on a 128-bit SVE platform, the stack slot is aligned to 256 bits, >> leaving the half 128 bits unused. In this patch, we reduce SlotsPerVecA >> from 8 to 4. >> >> See the updates in register_aarch64.hpp, regmask.hpp and aarch64.ad >> (chunk1 and vectora_reg). >> >> 2. Refactor NEON/SVE vector op support check. >> >> Merge NEON and SVE vector supported check into one single function. To >> be consistent, SVE default size supported check now is relaxed from no >> less than 64 bits to the same condition as NEON's min_vector_size(), >> i.e. 4 bytes and 4/2 booleans are also supported. This should be fine, >> as we assume at least we will emit NEON code for those small vectors, >> with unified rules. >> >> 3. Some notes for new rules >> >> 1) Since new rules are unique and it makes no sense to set different >> "ins_cost", we turn to use the default cost. >> >> 2) By default we use "pipe_slow" for matching rules in aarch64_vector.ad >> now. Hence, many SIMD pipeline classes at aarch64.ad become unused and >> can be removed. >> >> 3) Suffixes '_le128b/_gt128b' and '_neon/_sve' are appended in the >> matching rule names if needed. 
>> a) 'le128b' means the vector length is less than or equal to 128 bits. >> This rule can be matched on both NEON and 128-bit SVE. >> b) 'gt128b' means the vector length is greater than 128 bits. This rule >> can only be matched on SVE. >> c) 'neon' means this rule can only be matched on NEON, i.e. the >> generated instruction is not better than those in 128-bit SVE. >> d) 'sve' means this rule is only matched on SVE for all possible vector >> length, i.e. not limited to gt128b. >> >> Note-1: m4 file is not introduced because many duplications are highly >> reduced now. >> Note-2: We guess the code review for this big patch would probably take >> some time and we may need to merge latest code from master branch from >> time to time. We prefer to keep aarch64_neon/sve.ad and the >> corresponding m4 files for easy comparison and review. Of course, they >> will be finally removed after some solid reviews before integration. >> Note-3: Several other minor refactorings are done in this patch, but we >> cannot list all of them in the commit message. We have reviewed and >> tested the rules carefully to guarantee the quality. >> >> TESTING >> >> 1) Cross compilations on arm32/s390/pps/riscv passed. >> 2) tier1~3 jtreg passed on both x64 and aarch64 machines. >> 3) vector tests: all the test cases under the following directories can >> pass on both NEON and SVE systems with max vector length 16/32/64 bytes. >> >> "test/hotspot/jtreg/compiler/vectorapi/" >> "test/jdk/jdk/incubator/vector/" >> "test/hotspot/jtreg/compiler/vectorization/" >> >> 4) Performance evaluation: we choose vector micro-benchmarks from >> panama-vector:vectorIntrinsics [2] to evaluate the performance of this >> patch. We've tested *MaxVectorTests.java cases on one 128-bit SVE >> platform and one NEON platform, and didn't see any visiable regression >> with NEON and SVE. We will continue to verify more cases on other >> platforms with NEON and different SVE vector sizes. 
>> >> BENEFITS >> >> The number of matching rules is reduced to ~ 42%. >> before: 373 (aarch64_neon.ad) + 380 (aarch64_sve.ad) = 753 >> after : 313(aarch64_vector.ad) >> >> Code size for libjvm.so (release build) on aarch64 is reduced to ~ 96%. >> before: 25246528 B (commit 7905788e969) >> after : 24208776 B (nearly 1 MB reduction) >> >> [1] http://cr.openjdk.java.net/~njian/8285790/JDK-8285790.pdf >> [2] https://github.com/openjdk/panama-vector/tree/vectorIntrinsics/test/micro/org/openjdk/bench/jdk/incubator/vector/operation >> >> Co-Developed-by: Ningsheng Jian >> Co-Developed-by: Eric Liu > > src/hotspot/cpu/aarch64/aarch64_vector_ad.m4 line 169: > >> 167: case Op_MulReductionVI: >> 168: case Op_MulReductionVL: >> 169: // No multiply reduction instructions, and we emit scalar > > This comment is a little unclear. Is this what you actually mean? > > "No vector multiply reduction instructions, but we do emit scalar instructions for 64/128-bit vectors" Yes. You're right. Updated in the new revision. Thanks. > src/hotspot/cpu/aarch64/aarch64_vector_ad.m4 line 3026: > >> 3024: EXTRACT_FP(D, fmovd, 2, D, 3) >> 3025: >> 3026: // ------------------------------ Vector mask loat/store ----------------------- > > Should be "load/store" Yes, it's a typo. Updated in the new revision. 
------------- PR: https://git.openjdk.org/jdk/pull/9346 From haosun at openjdk.org Wed Jul 27 04:18:16 2022 From: haosun at openjdk.org (Hao Sun) Date: Wed, 27 Jul 2022 04:18:16 GMT Subject: RFR: 8285790: AArch64: Merge C2 NEON and SVE matching rules In-Reply-To: <_ZcvC_kGQPJuD-WAwAocrGLlbHEF66ww-FgDhUl6av4=.06b039a0-ed2c-42cb-9e1e-c8f32c712789@github.com> References: <_l-poEFY80LRmKZyRhpxvSohm6nLv_ruaO1_WzKmTlQ=.9faff21d-d67c-42e0-8de7-be2ca9397b88@github.com> <_ZcvC_kGQPJuD-WAwAocrGLlbHEF66ww-FgDhUl6av4=.06b039a0-ed2c-42cb-9e1e-c8f32c712789@github.com> Message-ID: <5PnVG6a78nf5DYpwf3ILjlHtAV09TuGU8aYGsDJRNHE=.6b53a8fb-e543-42b5-866c-149fd7bf321a@github.com> On Mon, 4 Jul 2022 12:51:22 GMT, Andrew Haley wrote: >> Aha! I was looking forward to this. >> >> On 7/1/22 11:46, Hao Sun wrote: >> > Note-1: m4 file is not introduced because many duplications are highly >> > reduced now. >> >> Yes, but there's still a lot of duplications. I'll make a few examples >> of where you should make simple changes that will usefully increase the >> level of abstraction. That will be a start. > >> @theRealAph Thanks for your comment. Yes. There are still duplicate code. I can easily list several ones, such as the reduce-and/or/xor, vector shift ops and several reg with imm rules. We're open to keep m4 file. >> >> But I would suggest that we may put our attention firstly on 1) our implementation on generic vector registers and 2) the merged rules (in particular those we share the codegen for NEON only platform and 128-bit vector ops on SVE platform). After that we may discuss whether to use m4 file and how to implement it if needed. > > We can do both: there's no sense in which one excludes the other, and we have time. > > However, just putting aside for a moment the lack of useful abstraction mechanisms, I note that there's a lot of code like this: > > > if (length_in_bytes <= 16) { > // ... Neon > } else { > assert(UseSVE > 0, "must be sve"); > // ... 
SVE > } > > > which is to say, there's an implicit assumption that if an operation can be done with Neon it will be, and SVE will only be used if not. What is the justification for that assumption? @theRealAph @adinn Thanks for your reviews! I have updated the patch based on the review comments. Would you mind taking another look? Thanks! ------------- PR: https://git.openjdk.org/jdk/pull/9346 From fgao at openjdk.org Wed Jul 27 06:26:59 2022 From: fgao at openjdk.org (Fei Gao) Date: Wed, 27 Jul 2022 06:26:59 GMT Subject: RFR: 8289422: Fix and re-enable vector conditional move Message-ID: <6uthI29shZjAeLK-eV3Kxqao06qoa9U9zQ5g_oDLmkI=.3e171aae-2003-46c9-88ac-9a63fecc5d96@github.com> // float[] a, float[] b, float[] c; for (int i = 0; i < a.length; i++) { c[i] = (a[i] > b[i]) ? a[i] : b[i]; } After [JDK-8139340](https://bugs.openjdk.org/browse/JDK-8139340) and [JDK-8192846](https://bugs.openjdk.org/browse/JDK-8192846), we hope to vectorize the case above by enabling -XX:+UseCMoveUnconditionally and -XX:+UseVectorCmov. But the transformation here[1] is going to optimize the BoolNode with constant input to a constant and break the design logic of cmove vector node[2]. We can't prevent all GVN transformation to the BoolNode before matcher, so the patch keeps the condition input as a constant while creating a cmove vector node, and then restructures it into a binary tree before matching. When the input order of original cmp node is different from the input order of original cmove node, like: // float[] a, float[] b, float[] c; for (int i = 0; i < a.length; i++) { c[i] = (a[i] < b[i]) ? a[i] : b[i]; } the patch negates the mask of the BoolNode before creating the cmove vector node in SuperWord::output(). We can also use VectorNode::implemented() to consult if vector conditional move is supported in the backend. So, the patch cleans the related code in SuperWord::implemented(). 
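To make the two loop shapes concrete, here is a small scalar sketch in Java. It only models what a vectorized conditional move computes per lane (a compare that produces a mask, followed by a select); it does not show how C2 builds or orders the inputs of the cmove vector node. The class name `CMoveSketch` and the helper `blend` are hypothetical and exist only for this illustration.

```java
import java.util.Arrays;

public class CMoveSketch {
    // First loop from the description: c[i] = (a[i] > b[i]) ? a[i] : b[i]
    static float[] selectGreater(float[] a, float[] b) {
        float[] c = new float[a.length];
        for (int i = 0; i < a.length; i++) {
            c[i] = (a[i] > b[i]) ? a[i] : b[i];
        }
        return c;
    }

    // Second loop from the description: c[i] = (a[i] < b[i]) ? a[i] : b[i]
    static float[] selectLess(float[] a, float[] b) {
        float[] c = new float[a.length];
        for (int i = 0; i < a.length; i++) {
            c[i] = (a[i] < b[i]) ? a[i] : b[i];
        }
        return c;
    }

    // Lane-wise model of a mask-driven select: take ifTrue[i] where the
    // mask is set, otherwise ifFalse[i].
    static float[] blend(boolean[] mask, float[] ifTrue, float[] ifFalse) {
        float[] c = new float[mask.length];
        for (int i = 0; i < mask.length; i++) {
            c[i] = mask[i] ? ifTrue[i] : ifFalse[i];
        }
        return c;
    }

    public static void main(String[] args) {
        float[] a = {1f, 5f, 3f};
        float[] b = {4f, 2f, 3f};

        // Build the "greater than" mask separately, as a vector compare would.
        boolean[] gt = new boolean[a.length];
        for (int i = 0; i < a.length; i++) {
            gt[i] = a[i] > b[i];
        }

        System.out.println(Arrays.toString(selectGreater(a, b)));
        System.out.println(Arrays.toString(blend(gt, a, b)));
        System.out.println(Arrays.toString(selectLess(a, b)));
    }
}
```

The first two printed arrays should be identical, which is the property the mask-plus-select form has to preserve for each lane.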
With the patch, the performance uplift is:
(The micro-benchmark functions are included in the file test/micro/org/openjdk/bench/vm/compiler/TypeVectorOperations.java)

AArch64:
Benchmark  (length)  Mode  Cnt  uplift(ns/op)
cmoveD          523  avgt   15         68.89%
cmoveF          523  avgt   15         72.40%

X86:
Benchmark  (length)  Mode  Cnt  uplift(ns/op)
cmoveD          523  avgt   15         73.12%
cmoveF          523  avgt   15         85.45%

[1]https://github.com/openjdk/jdk/blob/779b4e1d1959bc15a27492b7e2b951678e39cca8/src/hotspot/share/opto/subnode.cpp#L1310
[2]https://github.com/openjdk/jdk/blob/779b4e1d1959bc15a27492b7e2b951678e39cca8/src/hotspot/share/opto/matcher.cpp#L2365

-------------

Commit messages:
 - 8289422: Fix and re-enable vector conditional move

Changes: https://git.openjdk.org/jdk/pull/9652/files
 Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=9652&range=00
  Issue: https://bugs.openjdk.org/browse/JDK-8289422
  Stats: 290 lines in 9 files changed: 275 ins; 7 del; 8 mod
  Patch: https://git.openjdk.org/jdk/pull/9652.diff
  Fetch: git fetch https://git.openjdk.org/jdk pull/9652/head:pull/9652

PR: https://git.openjdk.org/jdk/pull/9652

From haosun at openjdk.org Wed Jul 27 07:12:29 2022
From: haosun at openjdk.org (Hao Sun)
Date: Wed, 27 Jul 2022 07:12:29 GMT
Subject: RFR: 8290943: Fix several IR test issues on SVE after JDK-8289801
Message-ID: <_plA8ToDPu9Uwsf8mtxZ7UIN_JpjtXXIQMWMrGaI97s=.aacee138-76e5-48da-a01c-131e0c62f54b@github.com>

This patch was motivated by the test failure of TestPopCountVectorLong.java on SVE with UseSVE=0 after JDK-8289801.

Both CPU feature CPU_SVE and VM option UseSVE should be checked to determine whether the SVE codegen part is available.

Hence, 1) if one test case is designed to test multiple backends, we should specify this full check in the "requires" annotation, e.g., TestPopCountVectorLong.java.
2) if one test case is designed to test only SVE backend, we may also specify UseSVE option to "runWithFlags()" in "main" function and only check CPU feature in "requires" annotation part, e.g., VectorMaskedNotTest.java[1]. This test failure can be easily resolved via adding the full check in the "requires" annotation. We further revisited all the SVE-oriented JTREG test cases and found two potential issues. 1) AllBitsSetVectorMatchRuleTest.java It's designed to test both NEON and SVE codegen. Since 1) ASIMD is the mandatory feature on SVE platforms and 2) SVE codegen would be selected by default on SVE platforms, it's no need to require SVE feature. 2) TestPopulateIndex.java Similar to TestPopCountVectorLong.java, this case is expected to fail as well on SVE with UseSVE=0. However, it didn't fail. The root cause is that it's not correct to simply check the "PopulateIndex" string because the test name i.e. TestPopulateIndex.java contains this string as well. Instead, we turn to check IRNode name. Testing: We ran the following vector test cases on 1) NEON-only platform, 2) SVE platform with UseSVE=0 specified, 3) SVE platform with UseSVE not specified(default on), 4) x64 platform. All the test cases passed. 
hotspot:compiler/vectorapi jdk:jdk/incubator/vector hotspot:compiler/vectorization [1] https://github.com/openjdk/jdk/blob/master/test/hotspot/jtreg/compiler/vectorapi/VectorMaskedNotTest.java#L116 ------------- Commit messages: - 8290943: Fix several IR test issues on SVE after JDK-8289801 Changes: https://git.openjdk.org/jdk/pull/9653/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=9653&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8290943 Stats: 8 lines in 4 files changed: 2 ins; 0 del; 6 mod Patch: https://git.openjdk.org/jdk/pull/9653.diff Fetch: git fetch https://git.openjdk.org/jdk pull/9653/head:pull/9653 PR: https://git.openjdk.org/jdk/pull/9653 From rrich at openjdk.org Wed Jul 27 08:25:19 2022 From: rrich at openjdk.org (Richard Reingruber) Date: Wed, 27 Jul 2022 08:25:19 GMT Subject: RFR: 8289925: Shared code shouldn't reference the platform specific method frame::interpreter_frame_last_sp() [v2] In-Reply-To: References: Message-ID: On Mon, 18 Jul 2022 06:41:57 GMT, Richard Reingruber wrote: >> The method `frame::interpreter_frame_last_sp()` is a platform method in the sense that it is not declared in a shared header file. It is declared and defined on some platforms though (x86 and aarch64 I think). >> >> `frame::interpreter_frame_last_sp()` existed on these platforms before vm continuations (aka loom). Shared code that is part of the vm continuations implementation references it. This breaks the platform abstraction. >> >> This fix simply removes the special case for interpreted frames in the shared method `Continuation::continuation_bottom_sender()`. I cannot see a reason for the distinction between interpreted and compiled frames. The shared code reference to `frame::interpreter_frame_last_sp()` is thereby eliminated. >> >> Testing: hotspot_loom and jdk_loom on x86_64 and aarch64. > > Richard Reingruber has updated the pull request with a new target base due to a merge or a rebase. 
The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains two additional commits since the last revision:
>
> - Merge branch 'master'
> - Remove platform dependent method interpreter_frame_last_sp() from shared code

Hi Dean,

thanks for looking at the PR.

> Are you sure unextended_sp() returns the same thing as
> interpreter_frame_last_sp() on all platforms? I didn't think that was true for
> aarch64.

I'm not sure. The question is difficult to answer because it is a platform detail. It is even hard to make assumptions about the unextended_sp that are true for all platforms[1]. This was a major issue in the loom port to ppc. Shared code expects to find call parameters at the caller's unextended_sp, which is not the case on ppc. IMHO uses of unextended_sp in shared code should be replaced with abstractions such as SharedRuntime::out_preserve_stack_slots().

> Maybe what we need is a new shared API that will return what the
> continuation code expects, or promote interpreter_frame_last_sp() to be
> shared. It seems that all platforms implement it. @theRealAph @pron

I'd think any address within the frame is good for calling Continuation::get_continuation_entry_for_sp().

Thanks, Richard.

[1] On ppc unextended_sp < sp is possible.
See https://github.com/reinrich/loom/blob/4b79c83284f18dd7193ecfe12e97e07674d34405/src/hotspot/cpu/ppc/continuationFreezeThaw_ppc.inline.hpp#L570-L639

-------------

PR: https://git.openjdk.org/jdk/pull/9411

From duke at openjdk.org Wed Jul 27 09:38:30 2022
From: duke at openjdk.org (Quan Anh Mai)
Date: Wed, 27 Jul 2022 09:38:30 GMT
Subject: RFR: 8283232: x86: Improve vector broadcast operations [v11]
In-Reply-To:
References:
Message-ID:

> Hi,
>
> This patch improves the generation of broadcasting a scalar in several ways:
>
> - As it has been pointed out, dumping the whole vector into the constant table is costly in terms of code size, this patch minimises this overhead for vector replicate of constants. Also, options are available for constants to be generated with more alignment so that vector load can be made efficiently without crossing cache lines.
> - Vector broadcasting should prefer rematerialising to spilling when register pressure is high.
> - Load vectors using the same kind (integral vs floating point) of instructions as that of the results to avoid potential data bypass delay
>
> With this patch, the result of the added benchmark, which performs some operations with a really high register pressure, on my machine with Intel i7-7700HQ (avx2) is as follows:
>
>                                          Before            After
> Benchmark                  Mode  Cnt    Score   Error     Score   Error  Units    Gain
> SpiltReplicate.testDouble  avgt    5   42.621 ± 0.598    38.771 ± 0.797  ns/op   +9.03%
> SpiltReplicate.testFloat   avgt    5   42.245 ± 1.464    38.603 ± 0.367  ns/op   +8.62%
> SpiltReplicate.testInt     avgt    5   20.581 ± 5.791    13.755 ± 0.375  ns/op  +33.17%
> SpiltReplicate.testLong    avgt    5   17.794 ± 4.781    13.663 ± 0.387  ns/op  +23.22%
>
> As expected, the constant table sizes shrink significantly from 1024 bytes to 256 bytes for `long`/`double` and 128 bytes for `int`/`float` cases.
>
> This patch also removes some redundant code paths and renames some incorrectly named instructions.
>
> Thank you very much.
Quan Anh Mai has updated the pull request incrementally with one additional commit since the last revision:

  fix errors regarding low sse

-------------

Changes:
  - all: https://git.openjdk.org/jdk/pull/7832/files
  - new: https://git.openjdk.org/jdk/pull/7832/files/c049d542..6193233f

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=jdk&pr=7832&range=10
 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=7832&range=09-10

  Stats: 31 lines in 1 file changed: 17 ins; 2 del; 12 mod
  Patch: https://git.openjdk.org/jdk/pull/7832.diff
  Fetch: git fetch https://git.openjdk.org/jdk pull/7832/head:pull/7832

PR: https://git.openjdk.org/jdk/pull/7832

From duke at openjdk.org Wed Jul 27 09:40:45 2022
From: duke at openjdk.org (Quan Anh Mai)
Date: Wed, 27 Jul 2022 09:40:45 GMT
Subject: RFR: 8283232: x86: Improve vector broadcast operations [v12]
In-Reply-To:
References:
Message-ID:

> Hi,
>
> This patch improves the generation of broadcasting a scalar in several ways:
>
> - As it has been pointed out, dumping the whole vector into the constant table is costly in terms of code size, this patch minimises this overhead for vector replicate of constants. Also, options are available for constants to be generated with more alignment so that vector load can be made efficiently without crossing cache lines.
> - Vector broadcasting should prefer rematerialising to spilling when register pressure is high.
> - Load vectors using the same kind (integral vs floating point) of instructions as that of the results to avoid potential data bypass delay
>
> With this patch, the result of the added benchmark, which performs some operations with a really high register pressure, on my machine with Intel i7-7700HQ (avx2) is as follows:
>
>                                          Before            After
> Benchmark                  Mode  Cnt    Score   Error     Score   Error  Units    Gain
> SpiltReplicate.testDouble  avgt    5   42.621 ± 0.598    38.771 ± 0.797  ns/op   +9.03%
> SpiltReplicate.testFloat   avgt    5   42.245 ± 1.464    38.603 ± 0.367  ns/op   +8.62%
> SpiltReplicate.testInt     avgt    5   20.581 ± 5.791    13.755 ± 0.375  ns/op  +33.17%
> SpiltReplicate.testLong    avgt    5   17.794 ± 4.781    13.663 ± 0.387  ns/op  +23.22%
>
> As expected, the constant table sizes shrink significantly from 1024 bytes to 256 bytes for `long`/`double` and 128 bytes for `int`/`float` cases.
>
> This patch also removes some redundant code paths and renames some incorrectly named instructions.
>
> Thank you very much.

Quan Anh Mai has updated the pull request incrementally with one additional commit since the last revision:

  unnecessary TEMP dst

-------------

Changes:
  - all: https://git.openjdk.org/jdk/pull/7832/files
  - new: https://git.openjdk.org/jdk/pull/7832/files/6193233f..bc01c21b

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=jdk&pr=7832&range=11
 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=7832&range=10-11

  Stats: 1 line in 1 file changed: 0 ins; 1 del; 0 mod
  Patch: https://git.openjdk.org/jdk/pull/7832.diff
  Fetch: git fetch https://git.openjdk.org/jdk pull/7832/head:pull/7832

PR: https://git.openjdk.org/jdk/pull/7832

From jiefu at openjdk.org Wed Jul 27 09:48:02 2022
From: jiefu at openjdk.org (Jie Fu)
Date: Wed, 27 Jul 2022 09:48:02 GMT
Subject: RFR: 8290943: Fix several IR test issues on SVE after JDK-8289801
In-Reply-To: <_plA8ToDPu9Uwsf8mtxZ7UIN_JpjtXXIQMWMrGaI97s=.aacee138-76e5-48da-a01c-131e0c62f54b@github.com>
References: <_plA8ToDPu9Uwsf8mtxZ7UIN_JpjtXXIQMWMrGaI97s=.aacee138-76e5-48da-a01c-131e0c62f54b@github.com>
Message-ID:

On Wed, 27 Jul 2022 07:04:25 GMT, Hao Sun wrote:

> This patch was motivated by the test failure of
> TestPopCountVectorLong.java on SVE with UseSVE=0 after JDK-8289801.
>
> Both CPU feature CPU_SVE and VM option UseSVE should be checked to
> determine whether the SVE codegen part is available.
>
> Hence, 1) if one test case is designed to test multiple backends, we
> should specify this full check in the "requires" annotation, e.g.,
> TestPopCountVectorLong.java.
2) if one test case is designed to test > only SVE backend, we may also specify UseSVE option to "runWithFlags()" > in "main" function and only check CPU feature in "requires" annotation > part, e.g., VectorMaskedNotTest.java[1]. > > This test failure can be easily resolved via adding the full check in > the "requires" annotation. > > We further revisited all the SVE-oriented JTREG test cases and found two > potential issues. > > 1) AllBitsSetVectorMatchRuleTest.java > It's designed to test both NEON and SVE codegen. Since 1) ASIMD is the > mandatory feature on SVE platforms and 2) SVE codegen would be selected > by default on SVE platforms, it's no need to require SVE feature. > > 2) TestPopulateIndex.java > Similar to TestPopCountVectorLong.java, this case is expected to fail as > well on SVE with UseSVE=0. However, it didn't fail. The root cause is > that it's not correct to simply check the "PopulateIndex" string because > the test name i.e. TestPopulateIndex.java contains this string as well. > Instead, we turn to check IRNode name. > > Testing: > We ran the following vector test cases on 1) NEON-only platform, 2) SVE > platform with UseSVE=0 specified, 3) SVE platform with UseSVE not > specified(default on), 4) x64 platform. All the test cases passed. > > > hotspot:compiler/vectorapi > jdk:jdk/incubator/vector > hotspot:compiler/vectorization > > > [1] https://github.com/openjdk/jdk/blob/master/test/hotspot/jtreg/compiler/vectorapi/VectorMaskedNotTest.java#L116 Looks reasonable to me. ------------- Marked as reviewed by jiefu (Reviewer). 
PR: https://git.openjdk.org/jdk/pull/9653 From duke at openjdk.org Wed Jul 27 09:49:06 2022 From: duke at openjdk.org (Quan Anh Mai) Date: Wed, 27 Jul 2022 09:49:06 GMT Subject: RFR: 8283232: x86: Improve vector broadcast operations [v10] In-Reply-To: References: Message-ID: <72OWdtfwNfhRBroOvJ1-EgVIsSGGCnnhdBDkcGUDd4w=.4ae9e3bb-2934-41d6-bf14-b44054ef97b1@github.com> On Tue, 26 Jul 2022 16:40:57 GMT, Vladimir Kozlov wrote: >> Quan Anh Mai has updated the pull request incrementally with one additional commit since the last revision: >> >> replI_mem > > The testing of version 07 got failure when run vector tests with `-XX:UseAVX=0 -XX:UseSSE=2`: > > # Internal Error (/workspace/open/src/hotspot/share/opto/constantTable.cpp:217), pid=2750036, tid=2750067 > # assert((constant_addr - _masm.code()->consts()->start()) == con.offset()) failed: must be: 8 == 0 > > Current CompileTask: > C2: 287 29 % b compiler.codegen.TestByteVect::test_ci @ 2 (20 bytes) > > Stack: [0x00007f7abf144000,0x00007f7abf245000], sp=0x00007f7abf23fa30, free space=1006k > Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code) > V [libjvm.so+0xb731c8] ConstantTable::emit(CodeBuffer&) const+0x1c8 > V [libjvm.so+0x17c3673] PhaseOutput::fill_buffer(CodeBuffer*, unsigned int*)+0x293 > V [libjvm.so+0xb191bb] Compile::Code_Gen()+0x42b > V [libjvm.so+0xb1e899] Compile::Compile(ciEnv*, ciMethod*, int, Options, DirectiveSet*)+0x1699 > > > and > > > # Internal Error (/workspace/open/src/hotspot/cpu/x86/assembler_x86.cpp:5095), pid=1431469, tid=1431493 > # Error: assert(VM_Version::supports_ssse3()) failed > > Current CompileTask: > C2: 468 240 % b 4 java.util.Arrays::fill @ 5 (21 bytes) > > Stack: [0x00007fdecd422000,0x00007fdecd523000], sp=0x00007fdecd51d8c0, free space=1006k > Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code) > V [libjvm.so+0x73079c] Assembler::pshufb(XMMRegisterImpl*, XMMRegisterImpl*)+0x13c > V [libjvm.so+0x4005d1] 
ReplB_regNode::emit(CodeBuffer&, PhaseRegAlloc*) const+0x1a1 > V [libjvm.so+0x17be04e] PhaseOutput::scratch_emit_size(Node const*)+0x45e > V [libjvm.so+0x17b4548] PhaseOutput::shorten_branches(unsigned int*)+0x2d8 > V [libjvm.so+0x17c6faa] PhaseOutput::Output()+0xcfa > V [libjvm.so+0xb191bb] Compile::Code_Gen()+0x42b > V [libjvm.so+0xb1e899] Compile::Compile(ciEnv*, ciMethod*, int, Options, DirectiveSet*)+0x1699 @vnkozlov I have fixed those errors in the last commits. The second one is due to `pshufb` being supported only on ssse3 machines. And the first one is because the constant table itself is not aligned enough given that currently, it is only aligned at 8 bytes. I chose to avoid the problem and only emit constants requiring at most 8 bytes of alignment as this patch has already touched many areas. A proper solution would be in a separate issue. What do you think? Thanks a lot. https://github.com/openjdk/jdk/blob/2bd90c2149bfee4b045c8f376e8bcdf4420ccb5d/src/hotspot/share/asm/codeBuffer.hpp#L742 ------------- PR: https://git.openjdk.org/jdk/pull/7832 From adinn at openjdk.org Wed Jul 27 10:26:59 2022 From: adinn at openjdk.org (Andrew Dinn) Date: Wed, 27 Jul 2022 10:26:59 GMT Subject: RFR: 8285790: AArch64: Merge C2 NEON and SVE matching rules In-Reply-To: <5PnVG6a78nf5DYpwf3ILjlHtAV09TuGU8aYGsDJRNHE=.6b53a8fb-e543-42b5-866c-149fd7bf321a@github.com> References: <_l-poEFY80LRmKZyRhpxvSohm6nLv_ruaO1_WzKmTlQ=.9faff21d-d67c-42e0-8de7-be2ca9397b88@github.com> <_ZcvC_kGQPJuD-WAwAocrGLlbHEF66ww-FgDhUl6av4=.06b039a0-ed2c-42cb-9e1e-c8f32c712789@github.com> <5PnVG6a78nf5DYpwf3ILjlHtAV09TuGU8aYGsDJRNHE=.6b53a8fb-e543-42b5-866c-149fd7bf321a@github.com> Message-ID: On Wed, 27 Jul 2022 04:14:08 GMT, Hao Sun wrote: >>> @theRealAph Thanks for your comment. Yes. There are still duplicate code. I can easily list several ones, such as the reduce-and/or/xor, vector shift ops and several reg with imm rules. We're open to keep m4 file. 
>>> >>> But I would suggest that we may put our attention firstly on 1) our implementation on generic vector registers and 2) the merged rules (in particular those we share the codegen for NEON only platform and 128-bit vector ops on SVE platform). After that we may discuss whether to use m4 file and how to implement it if needed. >> >> We can do both: there's no sense in which one excludes the other, and we have time. >> >> However, just putting aside for a moment the lack of useful abstraction mechanisms, I note that there's a lot of code like this: >> >> >> if (length_in_bytes <= 16) { >> // ... Neon >> } else { >> assert(UseSVE > 0, "must be sve"); >> // ... SVE >> } >> >> >> which is to say, there's an implicit assumption that if an operation can be done with Neon it will be, and SVE will only be used if not. What is the justification for that assumption? > > @theRealAph @adinn > Thanks for your reviews! > I have updated the patch based on the review comments. Would you mind taking another look? Thanks! @shqking Thanks for the update. This looks very impressive. You have factored out the different cases very cleanly into distinct groups, including using m4 to avoid a lot of 'cookie cutter' repetition, while still providing the ability to allow easy overrides for special cases, especially cases where we want to be able to choose whether we steer the implementation towards sve or neon. This change modifies a great deal of rule code and my concern is that this introduces a lot of opportunity for regressions. @theRealAph and I have both looked over the m4 and ad file rules and have not been able to spot anything that is wrong. Of course, that is no guarantee that everything is right but I am not sure we will find any more errors by reading the code. So, we are left with reliance on test coverage. Can you comment on how far you think the existing tests actually exercise the modified rules? Do you think we need to introduce more tests before committing this change? 
In particular

1. How good is our overall coverage for Neon
2. Are there rules for specific operations that you think are not well covered.

-------------

PR: https://git.openjdk.org/jdk/pull/9346

From thartmann at openjdk.org Wed Jul 27 10:47:25 2022
From: thartmann at openjdk.org (Tobias Hartmann)
Date: Wed, 27 Jul 2022 10:47:25 GMT
Subject: RFR: 8290705: StringConcat::validate_mem_flow asserts with "unexpected user: StoreI" [v2]
In-Reply-To:
References:
Message-ID:

On Wed, 27 Jul 2022 02:05:00 GMT, Xin Liu wrote:

>> Tobias Hartmann has updated the pull request incrementally with one additional commit since the last revision:
>>
>> Modified debug printing code
>
> This fix is reasonable. LGTM. (I am not a reviewer).
>
> A side note to myself: any nodes with side effect between Initialize and <init>() must commit because <init> may throw an exception.

Thanks for the review, @navyxliu!

-------------

PR: https://git.openjdk.org/jdk/pull/9589

From thartmann at openjdk.org Wed Jul 27 10:47:28 2022
From: thartmann at openjdk.org (Tobias Hartmann)
Date: Wed, 27 Jul 2022 10:47:28 GMT
Subject: RFR: 8290705: StringConcat::validate_mem_flow asserts with "unexpected user: StoreI" [v2]
In-Reply-To:
References:
Message-ID:

On Wed, 27 Jul 2022 02:08:43 GMT, Xin Liu wrote:

>> test/hotspot/jtreg/compiler/stringopts/SideEffectBeforeConstructor.jasm line 54:
>>
>>> 52: putstatic Field result:"I";
>>> 53: aload_0;
>>> 54: invokespecial Method java/lang/StringBuffer."<init>":"(Ljava/lang/String;)V";
>>
>> hi, @TobiHartmann ,
>> Is here the reason why you said "javac would not generate such code"?
>> I don't think javac will insert "SideEffectBeforeConstructor::result++" between new and invokespecial.
>
> I tried that. I don't think there's a way to generate code like that using javac.
> So we fix this bug because somebody may emit weird bytecode sequences using asm?

Yes, I don't think javac would ever put something between new and the invokespecial of the constructor.
At least I was not able to trigger that. > So we fix this bug because somebody may emit weird bytecode sequences using asm? Yes. The JVM needs to handle **all** valid bytecode, not only bytecode generated by javac. Not only are there other Java compilers but also different languages (like Scala) that compile to bytecode and run on the JVM. ------------- PR: https://git.openjdk.org/jdk/pull/9589 From thartmann at openjdk.org Wed Jul 27 10:47:30 2022 From: thartmann at openjdk.org (Tobias Hartmann) Date: Wed, 27 Jul 2022 10:47:30 GMT Subject: Integrated: 8290705: StringConcat::validate_mem_flow asserts with "unexpected user: StoreI" In-Reply-To: References: Message-ID: On Thu, 21 Jul 2022 10:55:41 GMT, Tobias Hartmann wrote: > C2's string concatenation optimization (`OptimizeStringConcat`) does not correctly handle side effecting instructions between StringBuffer Allocate/Initialize and the call to the constructor. In the failing test, see `SideEffectBeforeConstructor::test`, a `result` field is incremented just before the constructor is invoked. The string concatenation optimization still merges the allocation, constructor and `toString` calls and incorrectly re-wires the store to before the concatenation. As a result, passing `null` to the constructor will incorrectly increment the field before throwing a NullPointerException. With a debug build, we hit an assert in `StringConcat::validate_mem_flow` due to the unexpected field store. This is an old bug and an extreme edge case as javac would not generate such code. > > The following comment suggests that this case should be covered by `StringConcat::validate_control_flow()`: > https://github.com/openjdk/jdk/blob/3582fd9e93d9733c6fdf1f3848e0a093d44f6865/src/hotspot/share/opto/stringopts.cpp#L834-L838 > > However, the control flow analysis does not catch this case. I added the missing check. > > Thanks, > Tobias This pull request has now been integrated. 
Changeset: 61e072d1 Author: Tobias Hartmann URL: https://git.openjdk.org/jdk/commit/61e072d11c8e0cb5879bb733ed1fdd2144326bfd Stats: 122 lines in 3 files changed: 122 ins; 0 del; 0 mod 8290705: StringConcat::validate_mem_flow asserts with "unexpected user: StoreI" Reviewed-by: kvn, xliu ------------- PR: https://git.openjdk.org/jdk/pull/9589 From duke at openjdk.org Wed Jul 27 10:58:02 2022 From: duke at openjdk.org (Evgeny Astigeevich) Date: Wed, 27 Jul 2022 10:58:02 GMT Subject: RFR: 8287393: AArch64: Remove trampoline_call1 [v2] In-Reply-To: References: Message-ID: On Tue, 26 Jul 2022 14:44:01 GMT, Andrew Haley wrote: >> Hi @theRealAph, >> I am sorry I did not get your comment. Could you please explain it? >> >> Thanks, >> Evgeny > > The addition is > 'PhaseOutput* phase_output = Compile::current()->output();' > then > 'phase_output != NULL && phase_output->in_scratch_emit_size()' > > so AFAICS `Compile::current()->output()` is now checked for null, where it was not before. Now I get it. Thank you. I agree this looks suspicious. I could not recall why I added it. Debugging helped me to find out. During the parsing phase of C2 compilation `ciTypeFlow::StateVector::do_invoke` causes `LinkResolver::resolve_static_call` which now has the following code: if (resolved_method->is_continuation_enter_intrinsic() && resolved_method->from_interpreted_entry() == NULL) { // does a load_acquire methodHandle mh(THREAD, resolved_method); // Generate a compiled form of the enterSpecial intrinsic. AdapterHandlerLibrary::create_native_wrapper(mh); } We generate a wrapper which is `nmethod` with trampoline calls. As we are in the parsing phase the output is not created. I can move `Compile::current()->output() != NULL` into the preceding IF and update the comment to the following: Make sure this is code generation of a C2 compilation when Compile::current()->output() is not NULL. C2 can generate native wrappers for the continuation enter intrinsic before code generation. 
C1 allocates space only for trampoline stubs generated by Call LIR ops. ------------- PR: https://git.openjdk.org/jdk/pull/9592 From duke at openjdk.org Wed Jul 27 11:25:22 2022 From: duke at openjdk.org (Sacha Coppey) Date: Wed, 27 Jul 2022 11:25:22 GMT Subject: RFR: 8290154: [JVMCI] Implement JVMCI for RISC-V [v5] In-Reply-To: References: Message-ID: > This patch adds a partial JVMCI implementation for RISC-V, to allow using the GraalVM Native Image RISC-V LLVM backend, which does not use JVMCI for code emission. > It creates the jdk.vm.ci.riscv64 and jdk.vm.ci.hotspot.riscv64 packages, as well as implements a part of jvmciCodeInstaller_riscv64.cpp. To check for correctness, it enables JVMCI code installation tests on RISC-V. It will be tested soon in Native Image as well. Sacha Coppey has updated the pull request incrementally with one additional commit since the last revision: Add space in switch ------------- Changes: - all: https://git.openjdk.org/jdk/pull/9587/files - new: https://git.openjdk.org/jdk/pull/9587/files/9f7cbf6c..8742b9b2 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=9587&range=04 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=9587&range=03-04 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/9587.diff Fetch: git fetch https://git.openjdk.org/jdk pull/9587/head:pull/9587 PR: https://git.openjdk.org/jdk/pull/9587 From duke at openjdk.org Wed Jul 27 11:25:26 2022 From: duke at openjdk.org (Sacha Coppey) Date: Wed, 27 Jul 2022 11:25:26 GMT Subject: RFR: 8290154: [JVMCI] Implement JVMCI for RISC-V [v4] In-Reply-To: References: Message-ID: On Tue, 26 Jul 2022 19:11:23 GMT, Andrey Turbanov wrote: >> Sacha Coppey has updated the pull request incrementally with one additional commit since the last revision: >> >> Avoid using set_destination when call is not jal > > src/jdk.internal.vm.ci/share/classes/jdk.vm.ci.riscv64/src/jdk/vm/ci/riscv64/RISCV64Kind.java line 111: > >> 109: >> 110: public 
boolean isFP() { >> 111: switch(this) { > > let's add space after `switch` Sure! ------------- PR: https://git.openjdk.org/jdk/pull/9587 From adinn at openjdk.org Wed Jul 27 11:29:57 2022 From: adinn at openjdk.org (Andrew Dinn) Date: Wed, 27 Jul 2022 11:29:57 GMT Subject: RFR: 8290943: Fix several IR test issues on SVE after JDK-8289801 In-Reply-To: <_plA8ToDPu9Uwsf8mtxZ7UIN_JpjtXXIQMWMrGaI97s=.aacee138-76e5-48da-a01c-131e0c62f54b@github.com> References: <_plA8ToDPu9Uwsf8mtxZ7UIN_JpjtXXIQMWMrGaI97s=.aacee138-76e5-48da-a01c-131e0c62f54b@github.com> Message-ID: On Wed, 27 Jul 2022 07:04:25 GMT, Hao Sun wrote: > This patch was motivated by the test failure of > TestPopCountVectorLong.java on SVE with UseSVE=0 after JDK-8289801. > > Both CPU feature CPU_SVE and VM option UseSVE should be checked to > determine whether the SVE codegen part is available. > > Hence, 1) if one test case is designed to test multiple backends, we > should specify this full check in the "requires" annotation, e.g., > TestPopCoundVectorLong.java. 2) if one test case is designed to test > only SVE backend, we may also specify UseSVE option to "runWithFlags()" > in "main" function and only check CPU feature in "requires" annotation > part, e.g., VectorMaskedNotTest.java[1]. > > This test failure can be easily resolved via adding the full check in > the "requires" annotation. > > We further revisited all the SVE-oriented JTREG test cases and found two > potential issues. > > 1) AllBitsSetVectorMatchRuleTest.java > It's designed to test both NEON and SVE codegen. Since 1) ASIMD is the > mandatory feature on SVE platforms and 2) SVE codegen would be selected > by default on SVE platforms, it's no need to require SVE feature. > > 2) TestPopulateIndex.java > Similar to TestPopCountVectorLong.java, this case is expected to fail as > well on SVE with UseSVE=0. However, it didn't fail. The root cause is > that it's not correct to simply check the "PopulateIndex" string because > the test name i.e. 
TestPopulateIndex.java contains this string as well. > Instead, we turn to check IRNode name. > > Testing: > We ran the following vector test cases on 1) NEON-only platform, 2) SVE > platform with UseSVE=0 specified, 3) SVE platform with UseSVE not > specified(default on), 4) x64 platform. All the test cases passed. > > > hotspot:compiler/vectorapi > jdk:jdk/incubator/vector > hotspot:compiler/vectorization > > > [1] https://github.com/openjdk/jdk/blob/master/test/hotspot/jtreg/compiler/vectorapi/VectorMaskedNotTest.java#L116 Good. ------------- Marked as reviewed by adinn (Reviewer). PR: https://git.openjdk.org/jdk/pull/9653 From thartmann at openjdk.org Wed Jul 27 11:50:10 2022 From: thartmann at openjdk.org (Tobias Hartmann) Date: Wed, 27 Jul 2022 11:50:10 GMT Subject: RFR: 8291002: Rename Method::build_interpreter_method_data to Method::build_profiling_method_data In-Reply-To: References: Message-ID: <7RdpDjKe-aSRWUceJijcvti-_PW3W7LIW3CWEklSpa4=.35e8ecba-e929-4a21-8d7b-5760bfdf3e6f@github.com> On Tue, 26 Jul 2022 08:45:59 GMT, Julian Waters wrote: > As mentioned in the review process for [JDK-8290834](https://bugs.openjdk.org/browse/JDK-8290834) `build_interpreter_method_data` is misleading because it is actually used for creating MethodData*s throughout HotSpot, not just in the interpreter. Renamed the method to `build_profiling_method_data` instead to more accurately describe what it is used for. Marked as reviewed by thartmann (Reviewer). 
------------- PR: https://git.openjdk.org/jdk/pull/9637 From jwaters at openjdk.org Wed Jul 27 11:56:02 2022 From: jwaters at openjdk.org (Julian Waters) Date: Wed, 27 Jul 2022 11:56:02 GMT Subject: Integrated: 8291002: Rename Method::build_interpreter_method_data to Method::build_profiling_method_data In-Reply-To: References: Message-ID: On Tue, 26 Jul 2022 08:45:59 GMT, Julian Waters wrote: > As mentioned in the review process for [JDK-8290834](https://bugs.openjdk.org/browse/JDK-8290834) `build_interpreter_method_data` is misleading because it is actually used for creating MethodData*s throughout HotSpot, not just in the interpreter. Renamed the method to `build_profiling_method_data` instead to more accurately describe what it is used for. This pull request has now been integrated. Changeset: 8ec31976 Author: Julian Waters Committer: Tobias Hartmann URL: https://git.openjdk.org/jdk/commit/8ec319768399ba83a3ac04c2034666216ebc9cba Stats: 12 lines in 9 files changed: 0 ins; 0 del; 12 mod 8291002: Rename Method::build_interpreter_method_data to Method::build_profiling_method_data Reviewed-by: kvn, thartmann ------------- PR: https://git.openjdk.org/jdk/pull/9637 From kvn at openjdk.org Wed Jul 27 15:10:00 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Wed, 27 Jul 2022 15:10:00 GMT Subject: RFR: 8283232: x86: Improve vector broadcast operations [v10] In-Reply-To: References: Message-ID: <1GecXeb51hK7x9nrbXnhnAKlY_u5eDnMMapdJ8RoGN4=.471c903d-e659-4bb1-915b-b8615915dd84@github.com> On Tue, 26 Jul 2022 16:40:57 GMT, Vladimir Kozlov wrote: >> Quan Anh Mai has updated the pull request incrementally with one additional commit since the last revision: >> >> replI_mem > > The testing of version 07 got failure when run vector tests with `-XX:UseAVX=0 -XX:UseSSE=2`: > > # Internal Error (/workspace/open/src/hotspot/share/opto/constantTable.cpp:217), pid=2750036, tid=2750067 > # assert((constant_addr - _masm.code()->consts()->start()) == con.offset()) failed: must be: 8 
== 0 > > Current CompileTask: > C2: 287 29 % b compiler.codegen.TestByteVect::test_ci @ 2 (20 bytes) > > Stack: [0x00007f7abf144000,0x00007f7abf245000], sp=0x00007f7abf23fa30, free space=1006k > Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code) > V [libjvm.so+0xb731c8] ConstantTable::emit(CodeBuffer&) const+0x1c8 > V [libjvm.so+0x17c3673] PhaseOutput::fill_buffer(CodeBuffer*, unsigned int*)+0x293 > V [libjvm.so+0xb191bb] Compile::Code_Gen()+0x42b > V [libjvm.so+0xb1e899] Compile::Compile(ciEnv*, ciMethod*, int, Options, DirectiveSet*)+0x1699 > > > and > > > # Internal Error (/workspace/open/src/hotspot/cpu/x86/assembler_x86.cpp:5095), pid=1431469, tid=1431493 > # Error: assert(VM_Version::supports_ssse3()) failed > > Current CompileTask: > C2: 468 240 % b 4 java.util.Arrays::fill @ 5 (21 bytes) > > Stack: [0x00007fdecd422000,0x00007fdecd523000], sp=0x00007fdecd51d8c0, free space=1006k > Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code) > V [libjvm.so+0x73079c] Assembler::pshufb(XMMRegisterImpl*, XMMRegisterImpl*)+0x13c > V [libjvm.so+0x4005d1] ReplB_regNode::emit(CodeBuffer&, PhaseRegAlloc*) const+0x1a1 > V [libjvm.so+0x17be04e] PhaseOutput::scratch_emit_size(Node const*)+0x45e > V [libjvm.so+0x17b4548] PhaseOutput::shorten_branches(unsigned int*)+0x2d8 > V [libjvm.so+0x17c6faa] PhaseOutput::Output()+0xcfa > V [libjvm.so+0xb191bb] Compile::Code_Gen()+0x42b > V [libjvm.so+0xb1e899] Compile::Compile(ciEnv*, ciMethod*, int, Options, DirectiveSet*)+0x1699 > @vnkozlov I have fixed those errors in the last commits. The second one is due to `pshufb` being supported only on ssse3 machines. And the first one is because the constant table itself is not aligned enough given that currently, it is only aligned at 8 bytes. I chose to avoid the problem and only emit constants requiring at most 8 bytes of alignment as this patch has already touched many areas. A proper solution would be in a separate issue. 
What do you think? > I agree with doing that in a separate change. And I will start new testing. ------------- PR: https://git.openjdk.org/jdk/pull/7832 From jbhateja at openjdk.org Wed Jul 27 15:13:44 2022 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Wed, 27 Jul 2022 15:13:44 GMT Subject: RFR: 8287794: Reverse*VNode::Identity problem [v3] In-Reply-To: References: Message-ID: > Hi All, > > - This bug fix patch fixes a missing case during reverse[bits|bytes] identity transformation. > - Unlike AARCH64(SVE), X86(AVX512) ISA has no direct instruction to reverse[bits|bytes] of a vector lane, hence a predicated operation is supported through a blend instruction. > - New IR framework based tests have been added to test transforms relevant to AVX2, AVX512 and SVE. > > Kindly review and share your feedback. > > Best Regards, > Jatin Jatin Bhateja has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains three commits: - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8287794 - 8287794: Review comments resolved. - 8287794: Reverse*VNode::Identity problem ------------- Changes: https://git.openjdk.org/jdk/pull/9623/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=9623&range=02 Stats: 349 lines in 2 files changed: 322 ins; 25 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/9623.diff Fetch: git fetch https://git.openjdk.org/jdk pull/9623/head:pull/9623 PR: https://git.openjdk.org/jdk/pull/9623 From jbhateja at openjdk.org Wed Jul 27 15:16:30 2022 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Wed, 27 Jul 2022 15:16:30 GMT Subject: RFR: 8287794: Reverse*VNode::Identity problem [v2] In-Reply-To: References: <3q1AbVHPhgTmbtzqVBmQtCsfrbbW64Kk6I8aUEJ0oTY=.ade9b2ce-f008-42b9-a3fc-ed69ba3580d1@github.com> Message-ID: On Mon, 25 Jul 2022 14:22:29 GMT, Tobias Hartmann wrote: >> I don't think it would trigger a warning.
The original warning, as I understand it, was to say that the *unpredicated* `else` branch is the same. So we were guaranteed to take either of the branches, and thus the same code, irrespective of the predicate. It is not the same here: we now have a third path, going out without entering either branch :) > > Ah, right. That makes sense. I still think the branches could be merged but I don't have a strong opinion. Due to expression short circuiting and GCSE an optimizing compiler should produce the same code with merged branches. ------------- PR: https://git.openjdk.org/jdk/pull/9623 From jbhateja at openjdk.org Wed Jul 27 15:55:11 2022 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Wed, 27 Jul 2022 15:55:11 GMT Subject: RFR: 8290034: Auto vectorize reverse bit operations. [v2] In-Reply-To: References: Message-ID: > Summary of changes: > - Intrinsify scalar bit reverse APIs to emit efficient instruction sequence for X86 targets with and w/o GFNI feature. > - Handle auto-vectorization of Integer/Long.reverse bit operations. > - Backend implementations for these were added with 4th incubation of VectorAPIs. > > Following are performance numbers for newly added JMH micro benchmarks:- > > > No-GFNI(CLX): > ============= > Baseline: > Benchmark (size) Mode Cnt Score Error Units > Integers.reverse 500 avgt 2 1.085 us/op > Longs.reverse 500 avgt 2 1.236 us/op > WithOpt: > Benchmark (size) Mode Cnt Score Error Units > Integers.reverse 500 avgt 2 0.104 us/op > Longs.reverse 500 avgt 2 0.255 us/op > > With-GFNI(ICX): > =============== > Baseline: > Benchmark (size) Mode Cnt Score Error Units > Integers.reverse 500 avgt 2 0.887 us/op > Longs.reverse 500 avgt 2 1.095 us/op > > WithOpt: > Benchmark (size) Mode Cnt Score Error Units > Integers.reverse 500 avgt 2 0.037 us/op > Longs.reverse 500 avgt 2 0.145 us/op > > > Kindly review and share feedback. > > Best Regards, > Jatin Jatin Bhateja has updated the pull request with a new target base due to a merge or a rebase.
The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains four additional commits since the last revision: - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8290034 - 8290034: Styling comments resolved. - 8290034: Adding descriptive comments. - 8290034: Auto vectorize reverse bit operations. ------------- Changes: - all: https://git.openjdk.org/jdk/pull/9535/files - new: https://git.openjdk.org/jdk/pull/9535/files/53157a29..864c4d09 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=9535&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=9535&range=00-01 Stats: 13282 lines in 450 files changed: 8194 ins; 3656 del; 1432 mod Patch: https://git.openjdk.org/jdk/pull/9535.diff Fetch: git fetch https://git.openjdk.org/jdk pull/9535/head:pull/9535 PR: https://git.openjdk.org/jdk/pull/9535 From jbhateja at openjdk.org Wed Jul 27 16:13:57 2022 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Wed, 27 Jul 2022 16:13:57 GMT Subject: RFR: 8290322: Optimize Vector.rearrange over byte vectors for AVX512BW targets. [v2] In-Reply-To: References: Message-ID: <_9ZVr02yz3qTVqVW2ti9CkG85TA46_189gzW3QT9QPA=.40ba701e-2f52-4fa1-8494-837d4f5be3e5@github.com> > Hi All, > > Currently, rearrange over 512-bit byte vectors is optimized for targets supporting the AVX512_VBMI feature; this patch generates an efficient JIT sequence to handle it for AVX512BW targets. The following performance results with the newly added benchmark show > significant speedup.
> > System: Intel(R) Xeon(R) Platinum 8280 CPU @ 2.70GHz (CascadeLake 28C 2S) > > > Baseline: > ========= > Benchmark (size) Mode Cnt Score Error Units > RearrangeBytesBenchmark.testRearrangeBytes16 512 thrpt 2 16350.330 ops/ms > RearrangeBytesBenchmark.testRearrangeBytes32 512 thrpt 2 15991.346 ops/ms > RearrangeBytesBenchmark.testRearrangeBytes64 512 thrpt 2 34.423 ops/ms > RearrangeBytesBenchmark.testRearrangeBytes8 512 thrpt 2 10873.348 ops/ms > > > With-opt: > ========= > Benchmark (size) Mode Cnt Score Error Units > RearrangeBytesBenchmark.testRearrangeBytes16 512 thrpt 2 16062.624 ops/ms > RearrangeBytesBenchmark.testRearrangeBytes32 512 thrpt 2 16028.494 ops/ms > RearrangeBytesBenchmark.testRearrangeBytes64 512 thrpt 2 8741.901 ops/ms > RearrangeBytesBenchmark.testRearrangeBytes8 512 thrpt 2 10983.226 ops/ms > > > Kindly review and share your feedback. > > Best Regards, > Jatin Jatin Bhateja has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains two additional commits since the last revision: - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8290322 - 8290322: Optimize Vector.rearrange over byte vectors for AVX512BW targets. 
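For readers unfamiliar with the operation being benchmarked, here is a minimal scalar sketch of what a byte-vector rearrange computes: each output lane takes the value of the source lane named by the shuffle. The class and method names (`RearrangeModel`, `rearrange`) are hypothetical and are not part of the patch or of the Vector API; the sketch only models the lane semantics, not the AVX512BW instruction sequence the JIT emits.

```java
import java.util.Arrays;

public class RearrangeModel {
    // Scalar model (an assumption for illustration): output lane i receives
    // the source lane selected by shuffle[i].
    static byte[] rearrange(byte[] src, int[] shuffle) {
        byte[] dst = new byte[src.length];
        for (int lane = 0; lane < dst.length; lane++) {
            dst[lane] = src[shuffle[lane]];
        }
        return dst;
    }

    public static void main(String[] args) {
        byte[] src = {10, 20, 30, 40};
        int[] reverse = {3, 2, 1, 0}; // shuffle that reverses the lane order
        System.out.println(Arrays.toString(rearrange(src, reverse))); // prints [40, 30, 20, 10]
    }
}
```

In the figures quoted above, only the 64-lane case (`testRearrangeBytes64`) changes materially, from 34.423 to 8741.901 ops/ms, which is roughly a 250-fold improvement; the 8-, 16- and 32-lane cases stay within a few percent of the baseline.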
------------- Changes: - all: https://git.openjdk.org/jdk/pull/9498/files - new: https://git.openjdk.org/jdk/pull/9498/files/8e80f639..71a8436e Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=9498&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=9498&range=00-01 Stats: 16048 lines in 515 files changed: 10319 ins; 4047 del; 1682 mod Patch: https://git.openjdk.org/jdk/pull/9498.diff Fetch: git fetch https://git.openjdk.org/jdk pull/9498/head:pull/9498 PR: https://git.openjdk.org/jdk/pull/9498 From shade at openjdk.org Wed Jul 27 17:02:12 2022 From: shade at openjdk.org (Aleksey Shipilev) Date: Wed, 27 Jul 2022 17:02:12 GMT Subject: RFR: 8291048: x86: compiler/c2/irTests/TestAutoVectorization2DArray.java fails with lower SSE In-Reply-To: References: Message-ID: On Tue, 26 Jul 2022 17:24:38 GMT, Aleksey Shipilev wrote: > [JDK-8289801](https://bugs.openjdk.org/browse/JDK-8289801) whitelisted the UseSSE/UseAVX flags, but missed update in the test. So when we test x86_32 with lower SSE, it fails, as no `LoadVector`/etc nodes are getting emitted. > > Additional testing: > - [x] Linux x86_32 fastdebug `c2/irTests` with `-XX:UseAVX=0 -XX:UseSSE=0` (now passes) > - [x] Linux x86_32 fastdebug `c2/irTests` with `-XX:UseAVX=0 -XX:UseSSE=1` (now passes) > - [x] Linux x86_32 fastdebug `c2/irTests` with `-XX:UseAVX=0 -XX:UseSSE=2` (still passes) > - [x] Linux x86_32 fastdebug `c2/irTests` default (still passes) > - [x] Linux x86_64 fastdebug `c2/irTests` with `-XX:UseAVX=0 -XX:UseSSE=0` (still passes) > - [x] Linux x86_64 fastdebug `c2/irTests` with `-XX:UseAVX=0 -XX:UseSSE=1` (still passes) > - [x] Linux x86_64 fastdebug `c2/irTests` with `-XX:UseAVX=0 -XX:UseSSE=2` (still passes) > - [x] Linux x86_64 fastdebug `c2/irTests` default (still passes) Thank you, I am integrating. 
------------- PR: https://git.openjdk.org/jdk/pull/9646 From shade at openjdk.org Wed Jul 27 17:02:12 2022 From: shade at openjdk.org (Aleksey Shipilev) Date: Wed, 27 Jul 2022 17:02:12 GMT Subject: Integrated: 8291048: x86: compiler/c2/irTests/TestAutoVectorization2DArray.java fails with lower SSE In-Reply-To: References: Message-ID: On Tue, 26 Jul 2022 17:24:38 GMT, Aleksey Shipilev wrote: > [JDK-8289801](https://bugs.openjdk.org/browse/JDK-8289801) whitelisted the UseSSE/UseAVX flags, but missed update in the test. So when we test x86_32 with lower SSE, it fails, as no `LoadVector`/etc nodes are getting emitted. > > Additional testing: > - [x] Linux x86_32 fastdebug `c2/irTests` with `-XX:UseAVX=0 -XX:UseSSE=0` (now passes) > - [x] Linux x86_32 fastdebug `c2/irTests` with `-XX:UseAVX=0 -XX:UseSSE=1` (now passes) > - [x] Linux x86_32 fastdebug `c2/irTests` with `-XX:UseAVX=0 -XX:UseSSE=2` (still passes) > - [x] Linux x86_32 fastdebug `c2/irTests` default (still passes) > - [x] Linux x86_64 fastdebug `c2/irTests` with `-XX:UseAVX=0 -XX:UseSSE=0` (still passes) > - [x] Linux x86_64 fastdebug `c2/irTests` with `-XX:UseAVX=0 -XX:UseSSE=1` (still passes) > - [x] Linux x86_64 fastdebug `c2/irTests` with `-XX:UseAVX=0 -XX:UseSSE=2` (still passes) > - [x] Linux x86_64 fastdebug `c2/irTests` default (still passes) This pull request has now been integrated. 
Changeset: dc74ea21 Author: Aleksey Shipilev URL: https://git.openjdk.org/jdk/commit/dc74ea21f104f49c137476142b6f6340fd34af62 Stats: 1 line in 1 file changed: 1 ins; 0 del; 0 mod 8291048: x86: compiler/c2/irTests/TestAutoVectorization2DArray.java fails with lower SSE Reviewed-by: kvn, jiefu ------------- PR: https://git.openjdk.org/jdk/pull/9646 From kvn at openjdk.org Wed Jul 27 20:56:44 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Wed, 27 Jul 2022 20:56:44 GMT Subject: RFR: 8283232: x86: Improve vector broadcast operations [v12] In-Reply-To: References: Message-ID: On Wed, 27 Jul 2022 09:40:45 GMT, Quan Anh Mai wrote: >> Hi, >> >> This patch improves the generation of broadcasting a scalar in several ways: >> >> - As has been pointed out, dumping the whole vector into the constant table is costly in terms of code size; this patch minimises this overhead for vector replicate of constants. Also, options are available for constants to be generated with more alignment so that vector load can be made efficiently without crossing cache lines. >> - Vector broadcasting should prefer rematerialising to spilling when register pressure is high. >> - Load vectors using the same kind (integral vs floating point) of instructions as that of the results to avoid potential data bypass delay >> >> With this patch, the result of the added benchmark, which performs some operations with a really high register pressure, on my machine with Intel i7-7700HQ (avx2) is as follows: >> >> Before After >> Benchmark Mode Cnt Score Error Score Error Units Gain >> SpiltReplicate.testDouble avgt 5 42.621 ± 0.598 38.771 ± 0.797 ns/op +9.03% >> SpiltReplicate.testFloat avgt 5 42.245 ± 1.464 38.603 ± 0.367 ns/op +8.62% >> SpiltReplicate.testInt avgt 5 20.581 ± 5.791 13.755 ± 0.375 ns/op +33.17% >> SpiltReplicate.testLong avgt 5 17.794 ± 4.781 13.663 ±
0.387 ns/op +23.22% >> >> As expected, the constant table sizes shrink significantly from 1024 bytes to 256 bytes for `long`/`double` and 128 bytes for `int`/`float` cases. >> >> This patch also removes some redundant code paths and renames some incorrectly named instructions. >> >> Thank you very much. > > Quan Anh Mai has updated the pull request incrementally with one additional commit since the last revision: > > unnecessary TEMP dst Got new failure (and testing still running). Test compiler/c2/cr7200264/TestSSE2IntVect.java failed with `-Xcomp`: java.lang.RuntimeException: Unexpected SubVI number: expected 2 >= 4 at jdk.test.lib.Asserts.fail(Asserts.java:594) at jdk.test.lib.Asserts.assertGreaterThanOrEqual(Asserts.java:288) at jdk.test.lib.Asserts.assertGTE(Asserts.java:259) at compiler.c2.cr7200264.TestDriver.verifyVectorizationNumber(TestDriver.java:65) at compiler.c2.cr7200264.TestDriver.run(TestDriver.java:43) at compiler.c2.cr7200264.TestSSE2IntVect.main(TestSSE2IntVect.java:48) ------------- PR: https://git.openjdk.org/jdk/pull/7832 From haosun at openjdk.org Wed Jul 27 23:39:39 2022 From: haosun at openjdk.org (Hao Sun) Date: Wed, 27 Jul 2022 23:39:39 GMT Subject: RFR: 8290943: Fix several IR test issues on SVE after JDK-8289801 In-Reply-To: <_plA8ToDPu9Uwsf8mtxZ7UIN_JpjtXXIQMWMrGaI97s=.aacee138-76e5-48da-a01c-131e0c62f54b@github.com> References: <_plA8ToDPu9Uwsf8mtxZ7UIN_JpjtXXIQMWMrGaI97s=.aacee138-76e5-48da-a01c-131e0c62f54b@github.com> Message-ID: On Wed, 27 Jul 2022 07:04:25 GMT, Hao Sun wrote: > This patch was motivated by the test failure of > TestPopCountVectorLong.java on SVE with UseSVE=0 after JDK-8289801. > > Both CPU feature CPU_SVE and VM option UseSVE should be checked to > determine whether the SVE codegen part is available. > > Hence, 1) if one test case is designed to test multiple backends, we > should specify this full check in the "requires" annotation, e.g., > TestPopCoundVectorLong.java. 
2) if one test case is designed to test > only SVE backend, we may also specify UseSVE option to "runWithFlags()" > in "main" function and only check CPU feature in "requires" annotation > part, e.g., VectorMaskedNotTest.java[1]. > > This test failure can be easily resolved via adding the full check in > the "requires" annotation. > > We further revisited all the SVE-oriented JTREG test cases and found two > potential issues. > > 1) AllBitsSetVectorMatchRuleTest.java > It's designed to test both NEON and SVE codegen. Since 1) ASIMD is the > mandatory feature on SVE platforms and 2) SVE codegen would be selected > by default on SVE platforms, it's no need to require SVE feature. > > 2) TestPopulateIndex.java > Similar to TestPopCountVectorLong.java, this case is expected to fail as > well on SVE with UseSVE=0. However, it didn't fail. The root cause is > that it's not correct to simply check the "PopulateIndex" string because > the test name i.e. TestPopulateIndex.java contains this string as well. > Instead, we turn to check IRNode name. > > Testing: > We ran the following vector test cases on 1) NEON-only platform, 2) SVE > platform with UseSVE=0 specified, 3) SVE platform with UseSVE not > specified(default on), 4) x64 platform. All the test cases passed. > > > hotspot:compiler/vectorapi > jdk:jdk/incubator/vector > hotspot:compiler/vectorization > > > [1] https://github.com/openjdk/jdk/blob/master/test/hotspot/jtreg/compiler/vectorapi/VectorMaskedNotTest.java#L116 Thanks for your review! 
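The false negative described for TestPopulateIndex.java can be reproduced with a small, self-contained sketch. Everything below is hypothetical and only illustrates the pitfall: the class `IrMatchSketch`, the sample output strings, and the method name `indexArrayFill` are assumptions, and this is not how the IR test framework matches nodes (the actual fix switches to checking the IRNode name).

```java
import java.util.regex.Pattern;

public class IrMatchSketch {
    // Naive check: true if the node name occurs anywhere in the text,
    // including inside a longer identifier such as a test class name.
    static boolean naiveContains(String output, String nodeName) {
        return output.contains(nodeName);
    }

    // Stricter check (illustrative only): the node name must not be
    // preceded or followed by another identifier character.
    static boolean wholeTokenMatch(String output, String nodeName) {
        Pattern p = Pattern.compile(
            "(?<![A-Za-z0-9_])" + Pattern.quote(nodeName) + "(?![A-Za-z0-9_])");
        return p.matcher(output).find();
    }

    public static void main(String[] args) {
        // Hypothetical compile-log line that mentions only the test class.
        String onlyTestName = "Compiling compiler.vectorization.TestPopulateIndex::indexArrayFill";
        // Hypothetical node dump line that really contains the node.
        String realNode = " 42  PopulateIndex  === _ 10 11";

        System.out.println(naiveContains(onlyTestName, "PopulateIndex"));   // true: false positive
        System.out.println(wholeTokenMatch(onlyTestName, "PopulateIndex")); // false
        System.out.println(wholeTokenMatch(realNode, "PopulateIndex"));     // true
    }
}
```

With the naive check, a test named after the node it looks for can pass even when the node is never generated, which matches the observation that the case did not fail with UseSVE=0.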
------------- PR: https://git.openjdk.org/jdk/pull/9653 From haosun at openjdk.org Wed Jul 27 23:44:50 2022 From: haosun at openjdk.org (Hao Sun) Date: Wed, 27 Jul 2022 23:44:50 GMT Subject: Integrated: 8290943: Fix several IR test issues on SVE after JDK-8289801 In-Reply-To: <_plA8ToDPu9Uwsf8mtxZ7UIN_JpjtXXIQMWMrGaI97s=.aacee138-76e5-48da-a01c-131e0c62f54b@github.com> References: <_plA8ToDPu9Uwsf8mtxZ7UIN_JpjtXXIQMWMrGaI97s=.aacee138-76e5-48da-a01c-131e0c62f54b@github.com> Message-ID: On Wed, 27 Jul 2022 07:04:25 GMT, Hao Sun wrote: > This patch was motivated by the test failure of > TestPopCountVectorLong.java on SVE with UseSVE=0 after JDK-8289801. > > Both CPU feature CPU_SVE and VM option UseSVE should be checked to > determine whether the SVE codegen part is available. > > Hence, 1) if one test case is designed to test multiple backends, we > should specify this full check in the "requires" annotation, e.g., > TestPopCoundVectorLong.java. 2) if one test case is designed to test > only SVE backend, we may also specify UseSVE option to "runWithFlags()" > in "main" function and only check CPU feature in "requires" annotation > part, e.g., VectorMaskedNotTest.java[1]. > > This test failure can be easily resolved via adding the full check in > the "requires" annotation. > > We further revisited all the SVE-oriented JTREG test cases and found two > potential issues. > > 1) AllBitsSetVectorMatchRuleTest.java > It's designed to test both NEON and SVE codegen. Since 1) ASIMD is the > mandatory feature on SVE platforms and 2) SVE codegen would be selected > by default on SVE platforms, it's no need to require SVE feature. > > 2) TestPopulateIndex.java > Similar to TestPopCountVectorLong.java, this case is expected to fail as > well on SVE with UseSVE=0. However, it didn't fail. The root cause is > that it's not correct to simply check the "PopulateIndex" string because > the test name i.e. TestPopulateIndex.java contains this string as well. 
> Instead, we turn to check IRNode name. > > Testing: > We ran the following vector test cases on 1) NEON-only platform, 2) SVE > platform with UseSVE=0 specified, 3) SVE platform with UseSVE not > specified(default on), 4) x64 platform. All the test cases passed. > > > hotspot:compiler/vectorapi > jdk:jdk/incubator/vector > hotspot:compiler/vectorization > > > [1] https://github.com/openjdk/jdk/blob/master/test/hotspot/jtreg/compiler/vectorapi/VectorMaskedNotTest.java#L116 This pull request has now been integrated. Changeset: 16a12752 Author: Hao Sun Committer: Jie Fu URL: https://git.openjdk.org/jdk/commit/16a127524c78b85f17a13ba1072707bd9e851002 Stats: 8 lines in 4 files changed: 2 ins; 0 del; 6 mod 8290943: Fix several IR test issues on SVE after JDK-8289801 Reviewed-by: jiefu, adinn ------------- PR: https://git.openjdk.org/jdk/pull/9653 From xliu at openjdk.org Thu Jul 28 01:10:24 2022 From: xliu at openjdk.org (Xin Liu) Date: Thu, 28 Jul 2022 01:10:24 GMT Subject: RFR: 8287385: Suppress superficial unstable_if traps In-Reply-To: <-WOTq68zVNUdFogau-S76PXvsDr0utYVDTdxjrDJF1c=.345530bc-8b7e-44fd-aa2b-d5d5a3e5db03@github.com> References: <-WOTq68zVNUdFogau-S76PXvsDr0utYVDTdxjrDJF1c=.345530bc-8b7e-44fd-aa2b-d5d5a3e5db03@github.com> Message-ID: On Tue, 26 Jul 2022 02:49:49 GMT, Vladimir Kozlov wrote: >> An unstable if trap is **superficial** if it can NOT prune any code. Sometimes, the else-section of program is empty. The superficial unstable_if traps not only complicate code shape but also consume codecache. C2 has to generate debuginfo for them. If the condition changed, HotSpot has to destroy the established nmethod and compile it again. Our analysis shows that rough 20% unstable_if traps are superficial. >> >> The algorithm which can identify and suppress superficial unstable if traps derives from its definition. A non-superficial unstable_if trap must prune some code. Parser skips parsing dead basic blocks(BBs). 
A trap is superficial if and only if its target BB is not dead! Or, it will be skipped(contradict from definition). As a result, we can suppress an unstable_if trap when c2 parse the target BB. This algorithm leaves alone those uncommon_traps do prune code. >> >> For example, C2 generates an uncommon_trap for the else if cond is very likely true. >> >> public static int foo(boolean cond, int i) { >> Value x = new Value(0); >> Value y = new Value(1); >> Value z = new Value(i); >> >> if (cond) { >> i++; >> } >> return x._value + y._value + z._value + i; >> } >> >> >> If we suppress this superficial unstable_if, the nmethod reduces from 608 bytes to 520 bytes, or -14.5%. Most of them come from "scopes data/pcs". It's because superficial unstable_if generates a trap like this >> >> 037 call,static wrapper for: uncommon_trap(reason='unstable_if' action='reinterpret' debug_id='0') >> # SuperficialIfTrap::foo @ bci:29 (line 32) L[0]=_ L[1]=rsp + #4 L[2]=#ScObj0 L[3]=#ScObj1 L[4]=#ScObj2 STK[0]=rsp + #0 >> # ScObj0 SuperficialIfTrap$Value={ [_value :0]=#0 } >> # ScObj1 SuperficialIfTrap$Value={ [_value :0]=#1 } >> # ScObj2 SuperficialIfTrap$Value={ [_value :0]=rsp + #4 } >> # OopMap {off=60/0x3c} >> 03c stop # ShouldNotReachHere >> >> >> Here is the breakdown of nmethod, generated by '-XX:+PrintAssembly' >> >> <-XX:-OptimizeUnstableIf> >> Compiled method (c2) 346 17 4 SuperficialIfTrap::foo (53 bytes) >> total in heap [0x00007f50f4970910,0x00007f50f4970b70] = 608 >> relocation [0x00007f50f4970a70,0x00007f50f4970a80] = 16 >> main code [0x00007f50f4970a80,0x00007f50f4970ad8] = 88 >> stub code [0x00007f50f4970ad8,0x00007f50f4970af0] = 24 >> oops [0x00007f50f4970af0,0x00007f50f4970b00] = 16 >> metadata [0x00007f50f4970b00,0x00007f50f4970b08] = 8 >> scopes data [0x00007f50f4970b08,0x00007f50f4970b38] = 48 >> scopes pcs [0x00007f50f4970b38,0x00007f50f4970b68] = 48 >> dependencies [0x00007f50f4970b68,0x00007f50f4970b70] = 8 >> >> <-XX:+OptimizeUnstableIf> >> Compiled method (c2) 
309 17 4 SuperficialIfTrap::foo (53 bytes) >> total in heap [0x00007f4090970910,0x00007f4090970b18] = 520 >> relocation [0x00007f4090970a70,0x00007f4090970a80] = 16 >> main code [0x00007f4090970a80,0x00007f4090970ac8] = 72 >> stub code [0x00007f4090970ac8,0x00007f4090970ae0] = 24 >> oops [0x00007f4090970ae0,0x00007f4090970ae8] = 8 >> scopes data [0x00007f4090970ae8,0x00007f4090970af0] = 8 >> scopes pcs [0x00007f4090970af0,0x00007f4090970b10] = 32 >> dependencies [0x00007f4090970b10,0x00007f4090970b18] = 8 > > src/hotspot/share/opto/parse1.cpp line 665: > >> 663: record_for_igvn(unc); >> 664: //tty->print("mark dead: "); >> 665: //unc->dump(); > > Debug lines left over? oh, sorry for that. ------------- PR: https://git.openjdk.org/jdk/pull/9601 From xliu at openjdk.org Thu Jul 28 01:35:31 2022 From: xliu at openjdk.org (Xin Liu) Date: Thu, 28 Jul 2022 01:35:31 GMT Subject: RFR: 8287385: Suppress superficial unstable_if traps In-Reply-To: <-WOTq68zVNUdFogau-S76PXvsDr0utYVDTdxjrDJF1c=.345530bc-8b7e-44fd-aa2b-d5d5a3e5db03@github.com> References: <-WOTq68zVNUdFogau-S76PXvsDr0utYVDTdxjrDJF1c=.345530bc-8b7e-44fd-aa2b-d5d5a3e5db03@github.com> Message-ID: On Tue, 26 Jul 2022 02:59:41 GMT, Vladimir Kozlov wrote: >> An unstable if trap is **superficial** if it can NOT prune any code. Sometimes, the else-section of program is empty. The superficial unstable_if traps not only complicate code shape but also consume codecache. C2 has to generate debuginfo for them. If the condition changed, HotSpot has to destroy the established nmethod and compile it again. Our analysis shows that rough 20% unstable_if traps are superficial. >> >> The algorithm which can identify and suppress superficial unstable if traps derives from its definition. A non-superficial unstable_if trap must prune some code. Parser skips parsing dead basic blocks(BBs). A trap is superficial if and only if its target BB is not dead! Or, it will be skipped(contradict from definition). 
As a result, we can suppress an unstable_if trap when c2 parse the target BB. This algorithm leaves alone those uncommon_traps do prune code. >> >> For example, C2 generates an uncommon_trap for the else if cond is very likely true. >> >> public static int foo(boolean cond, int i) { >> Value x = new Value(0); >> Value y = new Value(1); >> Value z = new Value(i); >> >> if (cond) { >> i++; >> } >> return x._value + y._value + z._value + i; >> } >> >> >> If we suppress this superficial unstable_if, the nmethod reduces from 608 bytes to 520 bytes, or -14.5%. Most of them come from "scopes data/pcs". It's because superficial unstable_if generates a trap like this >> >> 037 call,static wrapper for: uncommon_trap(reason='unstable_if' action='reinterpret' debug_id='0') >> # SuperficialIfTrap::foo @ bci:29 (line 32) L[0]=_ L[1]=rsp + #4 L[2]=#ScObj0 L[3]=#ScObj1 L[4]=#ScObj2 STK[0]=rsp + #0 >> # ScObj0 SuperficialIfTrap$Value={ [_value :0]=#0 } >> # ScObj1 SuperficialIfTrap$Value={ [_value :0]=#1 } >> # ScObj2 SuperficialIfTrap$Value={ [_value :0]=rsp + #4 } >> # OopMap {off=60/0x3c} >> 03c stop # ShouldNotReachHere >> >> >> Here is the breakdown of nmethod, generated by '-XX:+PrintAssembly' >> >> <-XX:-OptimizeUnstableIf> >> Compiled method (c2) 346 17 4 SuperficialIfTrap::foo (53 bytes) >> total in heap [0x00007f50f4970910,0x00007f50f4970b70] = 608 >> relocation [0x00007f50f4970a70,0x00007f50f4970a80] = 16 >> main code [0x00007f50f4970a80,0x00007f50f4970ad8] = 88 >> stub code [0x00007f50f4970ad8,0x00007f50f4970af0] = 24 >> oops [0x00007f50f4970af0,0x00007f50f4970b00] = 16 >> metadata [0x00007f50f4970b00,0x00007f50f4970b08] = 8 >> scopes data [0x00007f50f4970b08,0x00007f50f4970b38] = 48 >> scopes pcs [0x00007f50f4970b38,0x00007f50f4970b68] = 48 >> dependencies [0x00007f50f4970b68,0x00007f50f4970b70] = 8 >> >> <-XX:+OptimizeUnstableIf> >> Compiled method (c2) 309 17 4 SuperficialIfTrap::foo (53 bytes) >> total in heap [0x00007f4090970910,0x00007f4090970b18] = 520 >> 
relocation [0x00007f4090970a70,0x00007f4090970a80] = 16 >> main code [0x00007f4090970a80,0x00007f4090970ac8] = 72 >> stub code [0x00007f4090970ac8,0x00007f4090970ae0] = 24 >> oops [0x00007f4090970ae0,0x00007f4090970ae8] = 8 >> scopes data [0x00007f4090970ae8,0x00007f4090970af0] = 8 >> scopes pcs [0x00007f4090970af0,0x00007f4090970b10] = 32 >> dependencies [0x00007f4090970b10,0x00007f4090970b18] = 8 > > Did you address @merykitty comment in RFE? You said: > `it looks like this JBS does have this downsize, I will investigate this problem` hi, @vnkozlov and @merykitty > Did you address @merykitty comment in RFE? You said: `it looks like this JBS does have this downsize, I will investigate this problem` I have thought about this. First of all, I admit that this does impact "peak" performance. In a nutshell, that is where a tracing JIT is superior to method-based compilation. But c2 is a method-based compiler with heroic optimizations. This change makes Java execution more predictable, with fewer unstable_if traps. Secondly, c2 is adaptive. too_many_traps() and PerBytecodeTrapLimit both lead the final revision of the nmethod to include both paths like this patch does, unless the real execution never takes the other path. Code like that is doubtful. I ran both SpecJVM2008 and Renaissance and haven't seen a difference in peak performance. When it comes to constant propagation, I think it has 2 positive outcomes: a constant can simplify control flow and reduce the strength of arithmetic computation. In the first case, actually, it won't change too much because c2 profiles branches and we still prune non-superficial paths on the basis of probabilities even though values are not constant. For the second one, I think it should make a difference. To mitigate it, I came up with an improvement. I can't guarantee to detect all cases where merging hinders constant folding, because constant propagation happens in the optimizer but this patch works at parsing time.
I think I can detect a simple case like this in 'UnstableIfTrap::suppress'. int i = x; if (cond) { i = 0; } When we attempt to create a phi node for i and realize that the previous value is a constant (ConI#0 in this case), we give up suppressing. How about it? ------------- PR: https://git.openjdk.org/jdk/pull/9601 From duke at openjdk.org Thu Jul 28 02:55:44 2022 From: duke at openjdk.org (Quan Anh Mai) Date: Thu, 28 Jul 2022 02:55:44 GMT Subject: RFR: 8283232: x86: Improve vector broadcast operations [v12] In-Reply-To: References: Message-ID: On Wed, 27 Jul 2022 09:40:45 GMT, Quan Anh Mai wrote: >> Hi, >> >> This patch improves the generation of broadcasting a scalar in several ways: >> >> - As has been pointed out, dumping the whole vector into the constant table is costly in terms of code size; this patch minimises this overhead for vector replicate of constants. Also, options are available for constants to be generated with more alignment so that vector load can be made efficiently without crossing cache lines. >> - Vector broadcasting should prefer rematerialising to spilling when register pressure is high. >> - Load vectors using the same kind (integral vs floating point) of instructions as that of the results to avoid potential data bypass delay >> >> With this patch, the result of the added benchmark, which performs some operations with a really high register pressure, on my machine with Intel i7-7700HQ (avx2) is as follows: >> >> Before After >> Benchmark Mode Cnt Score Error Score Error Units Gain >> SpiltReplicate.testDouble avgt 5 42.621 ± 0.598 38.771 ± 0.797 ns/op +9.03% >> SpiltReplicate.testFloat avgt 5 42.245 ± 1.464 38.603 ± 0.367 ns/op +8.62% >> SpiltReplicate.testInt avgt 5 20.581 ± 5.791 13.755 ± 0.375 ns/op +33.17% >> SpiltReplicate.testLong avgt 5 17.794 ± 4.781 13.663 ±
0.387 ns/op +23.22% >> >> As expected, the constant table sizes shrink significantly from 1024 bytes to 256 bytes for `long`/`double` and 128 bytes for `int`/`float` cases. >> >> This patch also removes some redundant code paths and renames some incorrectly named instructions. >> >> Thank you very much. > > Quan Anh Mai has updated the pull request incrementally with one additional commit since the last revision: > > unnecessary TEMP dst It does not seem related, as this patch has effects only after matching, so it should not change the IR graph of the compilations. ------------- PR: https://git.openjdk.org/jdk/pull/7832 From xgong at openjdk.org Thu Jul 28 02:58:38 2022 From: xgong at openjdk.org (Xiaohong Gong) Date: Thu, 28 Jul 2022 02:58:38 GMT Subject: RFR: 8290034: Auto vectorize reverse bit operations. [v2] In-Reply-To: References: Message-ID: On Wed, 27 Jul 2022 15:55:11 GMT, Jatin Bhateja wrote: >> Summary of changes: >> - Intrinsify scalar bit reverse APIs to emit efficient instruction sequence for X86 targets with and w/o GFNI feature. >> - Handle auto-vectorization of Integer/Long.reverse bit operations. >> - Backend implementations for these were added with 4th incubation of VectorAPIs. >> >> Following are performance numbers for newly added JMH micro benchmarks:- >> >> >> No-GFNI(CLX): >> ============= >> Baseline: >> Benchmark (size) Mode Cnt Score Error Units >> Integers.reverse 500 avgt 2 1.085 us/op >> Longs.reverse 500 avgt 2 1.236 us/op >> WithOpt: >> Benchmark (size) Mode Cnt Score Error Units >> Integers.reverse 500 avgt 2 0.104 us/op >> Longs.reverse 500 avgt 2 0.255 us/op >> >> With-GFNI(ICX): >> =============== >> Baseline: >> Benchmark (size) Mode Cnt Score Error Units >> Integers.reverse 500 avgt 2 0.887 us/op >> Longs.reverse 500 avgt 2 1.095 us/op >> >> WithOpt: >> Benchmark (size) Mode Cnt Score Error Units >> Integers.reverse 500 avgt 2 0.037 us/op >> Longs.reverse 500 avgt 2 0.145 us/op >> >> >> Kindly review and share feedback.
>> >> Best Regards, >> Jatin > > Jatin Bhateja has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains four additional commits since the last revision: > > - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8290034 > - 8290034: Styling comments resolved. > - 8290034: Adding descriptive comments. > - 8290034: Auto vectorize reverse bit operations. Looks good to me! Thanks! ------------- Marked as reviewed by xgong (Committer). PR: https://git.openjdk.org/jdk/pull/9535 From xgong at openjdk.org Thu Jul 28 03:12:39 2022 From: xgong at openjdk.org (Xiaohong Gong) Date: Thu, 28 Jul 2022 03:12:39 GMT Subject: RFR: 8287794: Reverse*VNode::Identity problem [v3] In-Reply-To: References: Message-ID: On Wed, 27 Jul 2022 15:13:44 GMT, Jatin Bhateja wrote: >> Hi All, >> >> - This bug fix patch fixes a missing case during reverse[bits|bytes] identity transformation. >> - Unlike AARCH64(SVE), X86(AVX512) ISA has no direct instruction to reverse[bits|bytes] of a vector lane hence a predicated operation is supported through blend instruction. >> - New IR framework based tests has been added to test transforms relevant to AVX2, AVX512 and SVE. >> >> Kindly review and share your feedback. >> >> Best Regards, >> Jatin > > Jatin Bhateja has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains three commits: > > - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8287794 > - 8287794: Review comments resolved. 
> - 8287794: Reverse*VNode::Identity problem src/hotspot/share/opto/vectornode.cpp line 1859: > 1857: } > 1858: if (n->Opcode() == in1->Opcode()) { > 1859: // OperationV (OperationV X , MASK) , MASK => X Code style, suggest to: // (OperationV (OperationV X MASK) MASK) => X test/hotspot/jtreg/compiler/vectorapi/TestReverseByteTransforms.java line 80: > 78: > 79: @Test > 80: @IR(applyIfCPUFeature={"sve", "true"}, failOn = {"ReverseBytesV" , " > 0 "}) Use "IRNode.REVERSE_BYTES_V" instead? test/hotspot/jtreg/compiler/vectorapi/TestReverseByteTransforms.java line 118: > 116: > 117: @Test > 118: @IR(applyIfCPUFeatureOr={"sve", "true", "simd", "true", "avx2", "true"}, counts = {"ReverseBytesV" , " > 0 "}) Cpu feature "sve" contains "simd", so I think only keep `"simd", "true"` is fine. test/hotspot/jtreg/compiler/vectorapi/TestReverseByteTransforms.java line 156: > 154: > 155: @Test > 156: @IR(applyIfCPUFeature={"sve", "true"}, failOn = {"ReverseBytesV" , " > 0 "}) Can we test the IR with x86 avx-512 predicated feature instead of "sve"? SVE is different that we also need to specify "-XX:UseSVE=1" to make sure the predicated feature enabled. ------------- PR: https://git.openjdk.org/jdk/pull/9623 From xgong at openjdk.org Thu Jul 28 03:12:40 2022 From: xgong at openjdk.org (Xiaohong Gong) Date: Thu, 28 Jul 2022 03:12:40 GMT Subject: RFR: 8287794: Reverse*VNode::Identity problem [v3] In-Reply-To: References: Message-ID: On Mon, 25 Jul 2022 07:48:28 GMT, Jatin Bhateja wrote: >> Yes, it's invalid on x86. So maybe you could add the limitation to the "requires", but seems this could make the codes complex. > > Correct. So for such cases, cpu feature "sve" contains "simd", so I think only keep `"simd", "true"` is fine. ------------- PR: https://git.openjdk.org/jdk/pull/9623 From jbhateja at openjdk.org Thu Jul 28 04:45:57 2022 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Thu, 28 Jul 2022 04:45:57 GMT Subject: Integrated: 8290034: Auto vectorize reverse bit operations. 
In-Reply-To: References: Message-ID: On Mon, 18 Jul 2022 08:01:09 GMT, Jatin Bhateja wrote: > Summary of changes: > - Intrinsify scalar bit reverse APIs to emit efficient instruction sequence for X86 targets with and w/o GFNI feature. > - Handle auto-vectorization of Integer/Long.reverse bit operations. > - Backend implementation for these were added with 4th incubation of VectorAPIs. > > Following are performance number for newly added JMH mocro benchmarks:- > > > No-GFNI(CLX): > ============= > Baseline: > Benchmark (size) Mode Cnt Score Error Units > Integers.reverse 500 avgt 2 1.085 us/op > Longs.reverse 500 avgt 2 1.236 us/op > WithOpt: > Benchmark (size) Mode Cnt Score Error Units > Integers.reverse 500 avgt 2 0.104 us/op > Longs.reverse 500 avgt 2 0.255 us/op > > With-GFNI(ICX): > =============== > Baseline: > Benchmark (size) Mode Cnt Score Error Units > Integers.reverse 500 avgt 2 0.887 us/op > Longs.reverse 500 avgt 2 1.095 us/op > > Without: > Benchmark (size) Mode Cnt Score Error Units > Integers.reverse 500 avgt 2 0.037 us/op > Longs.reverse 500 avgt 2 0.145 us/op > > > Kindly review and share feedback. > > Best Regards, > Jatin This pull request has now been integrated. Changeset: 5d82d67a Author: Jatin Bhateja URL: https://git.openjdk.org/jdk/commit/5d82d67a9e1303e235f475c199eb1435c3d69006 Stats: 425 lines in 18 files changed: 425 ins; 0 del; 0 mod 8290034: Auto vectorize reverse bit operations. Reviewed-by: xgong, kvn ------------- PR: https://git.openjdk.org/jdk/pull/9535 From jbhateja at openjdk.org Thu Jul 28 05:41:42 2022 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Thu, 28 Jul 2022 05:41:42 GMT Subject: RFR: 8287794: Reverse*VNode::Identity problem [v4] In-Reply-To: References: Message-ID: > Hi All, > > - This bug fix patch fixes a missing case during reverse[bits|bytes] identity transformation. 
> - Unlike AARCH64(SVE), X86(AVX512) ISA has no direct instruction to reverse[bits|bytes] of a vector lane hence a predicated operation is supported through blend instruction. > - New IR framework based tests has been added to test transforms relevant to AVX2, AVX512 and SVE. > > Kindly review and share your feedback. > > Best Regards, > Jatin Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: 8287794: Review comments resolved. ------------- Changes: - all: https://git.openjdk.org/jdk/pull/9623/files - new: https://git.openjdk.org/jdk/pull/9623/files/26741a6d..6c2d5a4f Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=9623&range=03 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=9623&range=02-03 Stats: 13 lines in 2 files changed: 0 ins; 0 del; 13 mod Patch: https://git.openjdk.org/jdk/pull/9623.diff Fetch: git fetch https://git.openjdk.org/jdk pull/9623/head:pull/9623 PR: https://git.openjdk.org/jdk/pull/9623 From jbhateja at openjdk.org Thu Jul 28 05:41:44 2022 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Thu, 28 Jul 2022 05:41:44 GMT Subject: RFR: 8287794: Reverse*VNode::Identity problem [v4] In-Reply-To: References: Message-ID: On Thu, 28 Jul 2022 03:08:28 GMT, Xiaohong Gong wrote: >> Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: >> >> 8287794: Review comments resolved. > > test/hotspot/jtreg/compiler/vectorapi/TestReverseByteTransforms.java line 156: > >> 154: >> 155: @Test >> 156: @IR(applyIfCPUFeature={"sve", "true"}, failOn = {"ReverseBytesV" , " > 0 "}) > > Can we test the IR with x86 avx-512 predicated feature instead of "sve"? SVE is different that we also need to specify "-XX:UseSVE=1" to make sure the predicated feature enabled. IR check is not applicable for AVX512, test point is added to cover transform for SVE which supports direct predicated vector instruction. 
Feature checks are much more strict and "sve" feature will be available only if UseSVE is set to 1. Thanks, removed redundant SVE check from IR annotations from other places. ------------- PR: https://git.openjdk.org/jdk/pull/9623 From xgong at openjdk.org Thu Jul 28 06:41:36 2022 From: xgong at openjdk.org (Xiaohong Gong) Date: Thu, 28 Jul 2022 06:41:36 GMT Subject: RFR: 8287794: Reverse*VNode::Identity problem [v4] In-Reply-To: References: Message-ID: <0UNSzXR_9X4VG8SPNvFBz7HQam6oiUwpQCGlFxHNnc8=.a0995af9-f288-4e44-bdc7-9465e4744e60@github.com> On Thu, 28 Jul 2022 05:36:52 GMT, Jatin Bhateja wrote: >> test/hotspot/jtreg/compiler/vectorapi/TestReverseByteTransforms.java line 156: >> >>> 154: >>> 155: @Test >>> 156: @IR(applyIfCPUFeature={"sve", "true"}, failOn = {"ReverseBytesV" , " > 0 "}) >> >> Can we test the IR with x86 avx-512 predicated feature instead of "sve"? SVE is different that we also need to specify "-XX:UseSVE=1" to make sure the predicated feature enabled. > > IR check is not applicable for AVX512, test point is added to cover transform for SVE which supports direct predicated vector instruction. Feature checks are much more strict and "sve" feature will be available only if UseSVE is set to 1. > > Thanks, removed redundant SVE check from IR annotations from other places. Thanks for updating the tests! I ran the test in our internal testing system, and unfortunately, this case will fail with "-XX:UseSVE=0" as expected. The reason is actually like what I said above. So to fix this, could you please: 1. Limit the whole test on aarch64 os systems by adding "`@requires vm.cpu.features ~= ".*simd.*"`" before the test. And limit the single IR test by adding "`applyIf={"UseSVE", ">0"}`" . OR 2. Fix the IR framework, to make "applyIf" and "applyIfCPUFeature" can co-work with each other. WDYT? 
------------- PR: https://git.openjdk.org/jdk/pull/9623 From jbhateja at openjdk.org Thu Jul 28 07:08:33 2022 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Thu, 28 Jul 2022 07:08:33 GMT Subject: RFR: 8287794: Reverse*VNode::Identity problem [v4] In-Reply-To: <0UNSzXR_9X4VG8SPNvFBz7HQam6oiUwpQCGlFxHNnc8=.a0995af9-f288-4e44-bdc7-9465e4744e60@github.com> References: <0UNSzXR_9X4VG8SPNvFBz7HQam6oiUwpQCGlFxHNnc8=.a0995af9-f288-4e44-bdc7-9465e4744e60@github.com> Message-ID: On Thu, 28 Jul 2022 06:37:52 GMT, Xiaohong Gong wrote: >> IR check is not applicable for AVX512, test point is added to cover transform for SVE which supports direct predicated vector instruction. Feature checks are much more strict and "sve" feature will be available only if UseSVE is set to 1. >> >> Thanks, removed redundant SVE check from IR annotations from other places. > > Thanks for updating the tests! I ran the test in our internal testing system, and unfortunately, this case will fail with "-XX:UseSVE=0" as expected. The reason is actually like what I said above. So to fix this, could you please: > > 1. Limit the whole test on aarch64 os systems by adding "`@requires vm.cpu.features ~= ".*simd.*"`" before the test. And limit the single IR test by adding "`applyIf={"UseSVE", ">0"}`" . > > OR > > 2. Fix the IR framework, to make "applyIf" and "applyIfCPUFeature" can co-work with each other. > > WDYT? It is not clear to me: if UseSVE=0, why is the "sve" feature still populated during VM initialization? The entire handling of applyIfCPUFeature* is based on the WhiteBox CPU feature API, which queries the feature list populated during VM initialization; we do not directly query the target features seen in /proc/cpuinfo. 
------------- PR: https://git.openjdk.org/jdk/pull/9623 From xgong at openjdk.org Thu Jul 28 07:24:34 2022 From: xgong at openjdk.org (Xiaohong Gong) Date: Thu, 28 Jul 2022 07:24:34 GMT Subject: RFR: 8287794: Reverse*VNode::Identity problem [v4] In-Reply-To: References: <0UNSzXR_9X4VG8SPNvFBz7HQam6oiUwpQCGlFxHNnc8=.a0995af9-f288-4e44-bdc7-9465e4744e60@github.com> Message-ID: On Thu, 28 Jul 2022 07:04:44 GMT, Jatin Bhateja wrote: >> Thanks for updating the tests! I ran the test in our internal testing system, and unfortunately, this case will fail with "-XX:UseSVE=0" as expected. The reason is actually like what I said above. So to fix this, could you please: >> >> 1. Limit the whole test on aarch64 os systems by adding "`@requires vm.cpu.features ~= ".*simd.*"`" before the test. And limit the single IR test by adding "`applyIf={"UseSVE", ">0"}`" . >> >> OR >> >> 2. Fix the IR framework, to make "applyIf" and "applyIfCPUFeature" can co-work with each other. >> >> WDYT? > > I am not clear, if UseSVE=0 then why does "sve" feature getting populated during VM initialization? > Entire handling of applyIfCPUFeature* is based on white box CPU feature API which queries the feature list populated during VM initialization, we are not directly queries the target features seen in /proc/cpuinfo. Yeah, that's the difference from X86. `UseSVE` is a VM flag that lets the user choose whether or not to use the SVE feature. The "sve" CPU feature itself is still reported and is not influenced by the VM flag. 
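To make the distinction concrete, here is a minimal Java sketch of a check that treats the "sve" CPU feature and the `UseSVE` flag as two independent conditions. It is only an illustration under stated assumptions: the class and method names are hypothetical and this is not how the IR framework itself evaluates `applyIf` or `applyIfCPUFeature`.

```java
import java.util.Set;

// Hypothetical model for illustration; not part of the IR test framework.
final class SvePredicateSketch {
    // An SVE-only IR rule should apply only when the CPU reports "sve"
    // AND the user has not switched SVE off with -XX:UseSVE=0.
    static boolean shouldApplySveRule(Set<String> cpuFeatures, int useSVE) {
        return cpuFeatures.contains("sve") && useSVE > 0;
    }

    public static void main(String[] args) {
        Set<String> sveMachine = Set.of("simd", "sve");
        // Feature present, flag enabled: the rule applies.
        System.out.println(shouldApplySveRule(sveMachine, 1));
        // Feature present, flag disabled: the rule must be skipped.
        System.out.println(shouldApplySveRule(sveMachine, 0));
        // Feature absent: the rule never applies.
        System.out.println(shouldApplySveRule(Set.of("simd"), 1));
    }
}
```

The second case is the one that fails in the test as written: checking the CPU feature alone cannot see that SVE was disabled on the command line.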
------------- PR: https://git.openjdk.org/jdk/pull/9623 From jbhateja at openjdk.org Thu Jul 28 08:18:52 2022 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Thu, 28 Jul 2022 08:18:52 GMT Subject: RFR: 8287794: Reverse*VNode::Identity problem [v4] In-Reply-To: References: <0UNSzXR_9X4VG8SPNvFBz7HQam6oiUwpQCGlFxHNnc8=.a0995af9-f288-4e44-bdc7-9465e4744e60@github.com> Message-ID: On Thu, 28 Jul 2022 07:20:51 GMT, Xiaohong Gong wrote: >> I am not clear, if UseSVE=0 then why does "sve" feature getting populated during VM initialization? >> Entire handling of applyIfCPUFeature* is based on white box CPU feature API which queries the feature list populated during VM initialization, we are not directly queries the target features seen in /proc/cpuinfo. > > Yeah, that's the difference from X86. `UseSVE` is a vm flag which lets user to choose whether use the sve feature or not. And the sve cpu feature is still there that will not been influenced by the VM flag. I think we should not be enabling sve feature if user explicitly passes VM flag UseSVE = 0, may be it should be fixed separately. Option 1 seems like a good interim solution to me, will update the patch. ------------- PR: https://git.openjdk.org/jdk/pull/9623 From aph at openjdk.org Thu Jul 28 08:28:06 2022 From: aph at openjdk.org (Andrew Haley) Date: Thu, 28 Jul 2022 08:28:06 GMT Subject: RFR: 8287393: AArch64: Remove trampoline_call1 [v2] In-Reply-To: References: Message-ID: On Wed, 27 Jul 2022 10:54:26 GMT, Evgeny Astigeevich wrote: >> The addition is >> `PhaseOutput* phase_output = Compile::current()->output();` >> then >> `phase_output != NULL && phase_output->in_scratch_emit_size()` >> >> so AFAICS `Compile::current()->output()` is now checked for null, where it was not before. > > Now I get it. Thank you. > > I agree this looks suspicious. I could not recall why I added it. > Debugging helped me to find out. 
> During the parsing phase of C2 compilation `ciTypeFlow::StateVector::do_invoke` causes `LinkResolver::resolve_static_call` which now has the following code: > > if (resolved_method->is_continuation_enter_intrinsic() > && resolved_method->from_interpreted_entry() == NULL) { // does a load_acquire > methodHandle mh(THREAD, resolved_method); > // Generate a compiled form of the enterSpecial intrinsic. > AdapterHandlerLibrary::create_native_wrapper(mh); > } > > We generate a wrapper which is `nmethod` with trampoline calls. > As we are in the parsing phase the output is not created. > I can move `Compile::current()->output() != NULL` into the preceding IF and update the comment to the following: > > Make sure this is code generation of a C2 compilation when Compile::current()->output() is not NULL. > C2 can generate native wrappers for the continuation enter intrinsic before code generation. > C1 allocates space only for trampoline stubs generated by Call LIR ops. This is all rather complicated and obscure. It seems to me that passing a bool `check_emit_size` is exactly what we should do: it's more explicit and helps the reader. ------------- PR: https://git.openjdk.org/jdk/pull/9592 From aph at openjdk.org Thu Jul 28 08:54:44 2022 From: aph at openjdk.org (Andrew Haley) Date: Thu, 28 Jul 2022 08:54:44 GMT Subject: RFR: 8289925: Shared code shouldn't reference the platform specific method frame::interpreter_frame_last_sp() [v2] In-Reply-To: References: Message-ID: On Tue, 26 Jul 2022 20:43:48 GMT, Dean Long wrote: > Are you sure unextended_sp() returns the same thing as interpreter_frame_last_sp() on all platforms? I didn't think that was true for aarch64. Maybe what we need is a new shared API that will return what the continuation code expects, or promote interpreter_frame_last_sp() to be shared. Indeed. 
AArch64 makes a strong distinction between the machine SP and the interpreter's expression SP, and much of the difficulty with that port was because it was difficult to tell which was which in shared and x86 code. I wouldn't be able to guarantee that using unextended_sp() was always safe. ------------- PR: https://git.openjdk.org/jdk/pull/9411 From jbhateja at openjdk.org Thu Jul 28 09:04:02 2022 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Thu, 28 Jul 2022 09:04:02 GMT Subject: RFR: 8287794: Reverse*VNode::Identity problem [v5] In-Reply-To: References: Message-ID: > Hi All, > > - This bug fix patch fixes a missing case during reverse[bits|bytes] identity transformation. > - Unlike AARCH64(SVE), X86(AVX512) ISA has no direct instruction to reverse[bits|bytes] of a vector lane hence a predicated operation is supported through blend instruction. > - New IR framework based tests has been added to test transforms relevant to AVX2, AVX512 and SVE. > > Kindly review and share your feedback. > > Best Regards, > Jatin Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: 8287794: Separating out SVE related test point since currently VM sve feature is enabled based on CPU feature check even if UseSVE=0. 
------------- Changes: - all: https://git.openjdk.org/jdk/pull/9623/files - new: https://git.openjdk.org/jdk/pull/9623/files/6c2d5a4f..969b4771 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=9623&range=04 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=9623&range=03-04 Stats: 227 lines in 2 files changed: 139 ins; 60 del; 28 mod Patch: https://git.openjdk.org/jdk/pull/9623.diff Fetch: git fetch https://git.openjdk.org/jdk pull/9623/head:pull/9623 PR: https://git.openjdk.org/jdk/pull/9623 From jbhateja at openjdk.org Thu Jul 28 09:05:53 2022 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Thu, 28 Jul 2022 09:05:53 GMT Subject: RFR: 8287794: Reverse*VNode::Identity problem [v5] In-Reply-To: References: <0UNSzXR_9X4VG8SPNvFBz7HQam6oiUwpQCGlFxHNnc8=.a0995af9-f288-4e44-bdc7-9465e4744e60@github.com> Message-ID: <_zY6aH2JookMjREJlEM_FspGpDMqKoxnIi6P01y3I20=.090a38ac-5604-4fdd-80a7-d931c41addff@github.com> On Thu, 28 Jul 2022 07:20:51 GMT, Xiaohong Gong wrote: >> I am not clear, if UseSVE=0 then why does "sve" feature getting populated during VM initialization? >> Entire handling of applyIfCPUFeature* is based on white box CPU feature API which queries the feature list populated during VM initialization, we are not directly queries the target features seen in /proc/cpuinfo. > > Yeah, that's the difference from X86. `UseSVE` is a vm flag which lets user to choose whether use the sve feature or not. And the sve cpu feature is still there that will not been influenced by the VM flag. Hi @XiaohongGong , kindly check and approve. 
------------- PR: https://git.openjdk.org/jdk/pull/9623 From xgong at openjdk.org Thu Jul 28 09:44:55 2022 From: xgong at openjdk.org (Xiaohong Gong) Date: Thu, 28 Jul 2022 09:44:55 GMT Subject: RFR: 8287794: Reverse*VNode::Identity problem [v5] In-Reply-To: References: Message-ID: On Thu, 28 Jul 2022 09:04:02 GMT, Jatin Bhateja wrote: >> Hi All, >> >> - This bug fix patch fixes a missing case during reverse[bits|bytes] identity transformation. >> - Unlike AARCH64(SVE), X86(AVX512) ISA has no direct instruction to reverse[bits|bytes] of a vector lane hence a predicated operation is supported through blend instruction. >> - New IR framework based tests has been added to test transforms relevant to AVX2, AVX512 and SVE. >> >> Kindly review and share your feedback. >> >> Best Regards, >> Jatin > > Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: > > 8287794: Separating out SVE related test point since currently VM sve feature is enabled based on CPU feature check even if UseSVE=0. Marked as reviewed by xgong (Committer). ------------- PR: https://git.openjdk.org/jdk/pull/9623 From xgong at openjdk.org Thu Jul 28 09:44:56 2022 From: xgong at openjdk.org (Xiaohong Gong) Date: Thu, 28 Jul 2022 09:44:56 GMT Subject: RFR: 8287794: Reverse*VNode::Identity problem [v5] In-Reply-To: <_zY6aH2JookMjREJlEM_FspGpDMqKoxnIi6P01y3I20=.090a38ac-5604-4fdd-80a7-d931c41addff@github.com> References: <0UNSzXR_9X4VG8SPNvFBz7HQam6oiUwpQCGlFxHNnc8=.a0995af9-f288-4e44-bdc7-9465e4744e60@github.com> <_zY6aH2JookMjREJlEM_FspGpDMqKoxnIi6P01y3I20=.090a38ac-5604-4fdd-80a7-d931c41addff@github.com> Message-ID: On Thu, 28 Jul 2022 09:03:43 GMT, Jatin Bhateja wrote: >> Yeah, that's the difference from X86. `UseSVE` is a vm flag which lets user to choose whether use the sve feature or not. And the sve cpu feature is still there that will not been influenced by the VM flag. > > Hi @XiaohongGong , kindly check and approve. 
Tests pass, and thanks for the updating! ------------- PR: https://git.openjdk.org/jdk/pull/9623 From dlong at openjdk.org Thu Jul 28 09:51:17 2022 From: dlong at openjdk.org (Dean Long) Date: Thu, 28 Jul 2022 09:51:17 GMT Subject: RFR: 8289925: Shared code shouldn't reference the platform specific method frame::interpreter_frame_last_sp() [v2] In-Reply-To: References: Message-ID: <5lyHlo5kKg2IkhloMvb1D6h4PYcBVXLobHstLkRls9E=.9a5fba5a-1170-4e50-83a2-bf52725d4bc7@github.com> On Wed, 27 Jul 2022 08:21:24 GMT, Richard Reingruber wrote: > I'd think any address within the frame is good for calling Continuation::get_continuation_entry_for_sp(). Given the fact that Continuation::is_frame_in_continuation() uses f.unextended_sp(), and it is called for interpreted frames, I would tentatively agree, but let's see what @pron says. ------------- PR: https://git.openjdk.org/jdk/pull/9411 From duke at openjdk.org Thu Jul 28 09:53:38 2022 From: duke at openjdk.org (Evgeny Astigeevich) Date: Thu, 28 Jul 2022 09:53:38 GMT Subject: RFR: 8287393: AArch64: Remove trampoline_call1 [v3] In-Reply-To: References: Message-ID: > `trampoline_call` can do dummy code generation to calculate the size of C2 generated code. This is done in the output phase. In [src/hotspot/cpu/aarch64/sharedRuntime_aarch64.cpp#L1042](https://github.com/openjdk/jdk/blob/e0d361cea91d3dd1450aece73f660b4abb7ce5fa/src/hotspot/cpu/aarch64/sharedRuntime_aarch64.cpp#L1042) Loom code needed to generate a trampoline call outside of C2 and without the output phase. This caused test crashes. The project Loom added `trampoline_call1` to workaround the crashes. > > This PR improves detection of C2 output phase which makes `trampoline_call1` redundant. 
> > Tested the fastdebug/release builds: > - `'gtest`: Passed > - `tier1`...`tier2`: Passed Evgeny Astigeevich has updated the pull request incrementally with one additional commit since the last revision: Restore check_emit_size parameter ------------- Changes: - all: https://git.openjdk.org/jdk/pull/9592/files - new: https://git.openjdk.org/jdk/pull/9592/files/e732890b..4b5953b7 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=9592&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=9592&range=01-02 Stats: 16 lines in 3 files changed: 1 ins; 4 del; 11 mod Patch: https://git.openjdk.org/jdk/pull/9592.diff Fetch: git fetch https://git.openjdk.org/jdk pull/9592/head:pull/9592 PR: https://git.openjdk.org/jdk/pull/9592 From duke at openjdk.org Thu Jul 28 09:57:51 2022 From: duke at openjdk.org (Evgeny Astigeevich) Date: Thu, 28 Jul 2022 09:57:51 GMT Subject: RFR: 8287393: AArch64: Remove trampoline_call1 [v2] In-Reply-To: References: Message-ID: On Thu, 28 Jul 2022 08:25:46 GMT, Andrew Haley wrote: >> Now I get it. Thank you. >> >> I agree this looks suspicious. I could not recall why I added it. >> Debugging helped me to find out. >> During the parsing phase of C2 compilation `ciTypeFlow::StateVector::do_invoke` causes `LinkResolver::resolve_static_call` which now has the following code: >> >> if (resolved_method->is_continuation_enter_intrinsic() >> && resolved_method->from_interpreted_entry() == NULL) { // does a load_acquire >> methodHandle mh(THREAD, resolved_method); >> // Generate a compiled form of the enterSpecial intrinsic. >> AdapterHandlerLibrary::create_native_wrapper(mh); >> } >> >> We generate a wrapper which is `nmethod` with trampoline calls. >> As we are in the parsing phase the output is not created. >> I can move `Compile::current()->output() != NULL` into the preceding IF and update the comment to the following: >> >> Make sure this is code generation of a C2 compilation when Compile::current()->output() is not NULL. 
>> C2 can generate native wrappers for the continuation enter intrinsic before code generation. >> C1 allocates space only for trampoline stubs generated by Call LIR ops. > > This is all rather complicated and obscure. It seems to me that passing a bool `check_emit_size` is exactly what we should do: it's more explicit and helps the reader. I've restored `check_emit_size` and created an assert to guard it is properly used. I'll remove `cbuf` by fixing: https://bugs.openjdk.org/browse/JDK-8287394 ------------- PR: https://git.openjdk.org/jdk/pull/9592 From duke at openjdk.org Thu Jul 28 11:11:51 2022 From: duke at openjdk.org (Quan Anh Mai) Date: Thu, 28 Jul 2022 11:11:51 GMT Subject: RFR: 8290322: Optimize Vector.rearrange over byte vectors for AVX512BW targets. [v2] In-Reply-To: <_9ZVr02yz3qTVqVW2ti9CkG85TA46_189gzW3QT9QPA=.40ba701e-2f52-4fa1-8494-837d4f5be3e5@github.com> References: <_9ZVr02yz3qTVqVW2ti9CkG85TA46_189gzW3QT9QPA=.40ba701e-2f52-4fa1-8494-837d4f5be3e5@github.com> Message-ID: On Wed, 27 Jul 2022 16:13:57 GMT, Jatin Bhateja wrote: >> Hi All, >> >> Currently re-arrange over 512bit bytevector is optimized for targets supporting AVX512_VBMI feature, this patch generates efficient JIT sequence to handle it for AVX512BW targets. Following performance results with newly added benchmark shows >> significant speedup. 
>> >> System: Intel(R) Xeon(R) Platinum 8280 CPU @ 2.70GHz (CascadeLake 28C 2S) >> >> >> Baseline: >> ========= >> Benchmark (size) Mode Cnt Score Error Units >> RearrangeBytesBenchmark.testRearrangeBytes16 512 thrpt 2 16350.330 ops/ms >> RearrangeBytesBenchmark.testRearrangeBytes32 512 thrpt 2 15991.346 ops/ms >> RearrangeBytesBenchmark.testRearrangeBytes64 512 thrpt 2 34.423 ops/ms >> RearrangeBytesBenchmark.testRearrangeBytes8 512 thrpt 2 10873.348 ops/ms >> >> >> With-opt: >> ========= >> Benchmark (size) Mode Cnt Score Error Units >> RearrangeBytesBenchmark.testRearrangeBytes16 512 thrpt 2 16062.624 ops/ms >> RearrangeBytesBenchmark.testRearrangeBytes32 512 thrpt 2 16028.494 ops/ms >> RearrangeBytesBenchmark.testRearrangeBytes64 512 thrpt 2 8741.901 ops/ms >> RearrangeBytesBenchmark.testRearrangeBytes8 512 thrpt 2 10983.226 ops/ms >> >> >> Kindly review and share your feedback. >> >> Best Regards, >> Jatin > > Jatin Bhateja has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains two additional commits since the last revision: > > - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8290322 > - 8290322: Optimize Vector.rearrange over byte vectors for AVX512BW targets. Otherwise looks good to me. Thanks. src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 5568: > 5566: #endif > 5567: > 5568: void C2_MacroAssembler::rearrange_bytes(XMMRegister dst, XMMRegister shuffle, XMMRegister src, XMMRegister xtmp1, Can we use the same approach as that used for 256-bit vector. 
Something similar to: vpshufb(xtmp1, src, shuffle); // All elements are at the correct place modulo 16 vpxor(dst, dst, dst); vpslld(xtmp2, shuffle, 3); // Push the digit signifying the parity of 128-bit lane to the sign digit vpcmpb(ktmp, xtmp2, dst, lt); vshufi32x4(xtmp2, xtmp1, xtmp1, 0b10110001); // Shuffle the 128-bit lanes to get 1 - 0 - 3 - 2 vpblendmb(xtmp1, ktmp, xtmp1, xtmp2); // All elements are at the correct place modulo 32 vpslld(xtmp2, shuffle, 2); // Push the digit signifying the parity of 256-bit lane to the sign digit vpcmpb(ktmp, xtmp2, dst, lt); vshufi32x4(xtmp2, xtmp1, xtmp1, 0b01001110); // Shuffle the 128-bit lanes to get 2 - 3 - 0 - 1 vpblendmb(dst, ktmp, xtmp1, xtmp2); // All elements are at the correct place modulo 64 src/hotspot/cpu/x86/x86.ad line 1851: > 1849: } else if (size_in_bits == 256 && UseAVX < 2) { > 1850: return false; // Implementation limitation > 1851: } else if (is_subword_type(bt) && size_in_bits > 256 && !VM_Version::supports_avx512bw()) { This is not needed as a 512-bit subword type vector is only supported on avx512bw anyway. ------------- PR: https://git.openjdk.org/jdk/pull/9498 From jbhateja at openjdk.org Thu Jul 28 11:12:12 2022 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Thu, 28 Jul 2022 11:12:12 GMT Subject: RFR: 8287794: Reverse*VNode::Identity problem [v2] In-Reply-To: References: <3q1AbVHPhgTmbtzqVBmQtCsfrbbW64Kk6I8aUEJ0oTY=.ade9b2ce-f008-42b9-a3fc-ed69ba3580d1@github.com> Message-ID: On Tue, 26 Jul 2022 10:33:14 GMT, Tobias Hartmann wrote: >> Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: >> >> 8287794: Review comments resolved. > > All tests passed. 
Thanks @TobiHartmann , @XiaohongGong ------------- PR: https://git.openjdk.org/jdk/pull/9623 From jbhateja at openjdk.org Thu Jul 28 11:12:13 2022 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Thu, 28 Jul 2022 11:12:13 GMT Subject: Integrated: 8287794: Reverse*VNode::Identity problem In-Reply-To: References: Message-ID: On Sat, 23 Jul 2022 17:39:27 GMT, Jatin Bhateja wrote: > Hi All, > > - This bug fix patch fixes a missing case during reverse[bits|bytes] identity transformation. > - Unlike AARCH64(SVE), X86(AVX512) ISA has no direct instruction to reverse[bits|bytes] of a vector lane hence a predicated operation is supported through blend instruction. > - New IR framework based tests has been added to test transforms relevant to AVX2, AVX512 and SVE. > > Kindly review and share your feedback. > > Best Regards, > Jatin This pull request has now been integrated. Changeset: 471a427d Author: Jatin Bhateja URL: https://git.openjdk.org/jdk/commit/471a427d1023ee5948d9e58ba04ecabaa7a4db97 Stats: 428 lines in 3 files changed: 401 ins; 25 del; 2 mod 8287794: Reverse*VNode::Identity problem Reviewed-by: thartmann, xgong ------------- PR: https://git.openjdk.org/jdk/pull/9623 From bulasevich at openjdk.org Thu Jul 28 12:07:26 2022 From: bulasevich at openjdk.org (Boris Ulasevich) Date: Thu, 28 Jul 2022 12:07:26 GMT Subject: RFR: 8291003: ARM32: constant_table.size assertion Message-ID: This change fixes assertion condition as per the recent [JDK-8287373](https://bugs.openjdk.org/browse/JDK-8287373) change: the size of constants section is aligned up according to the settings of the next section (instructions section). 
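As a rough illustration of the arithmetic behind this, the following Java sketch rounds a section size up to a power-of-two alignment. It is not HotSpot's `CodeSection` code (which is C++); the helper name and the alignment values 8 and 16 are assumptions chosen only for the example.

```java
// Hypothetical helper for illustration; HotSpot implements this in C++.
final class AlignUpSketch {
    // Rounds size up to the next multiple of alignment.
    // alignment must be a positive power of two.
    static int alignUp(int size, int alignment) {
        if (alignment <= 0 || (alignment & (alignment - 1)) != 0) {
            throw new IllegalArgumentException("alignment must be a positive power of two");
        }
        return (size + alignment - 1) & -alignment;
    }

    public static void main(String[] args) {
        int constsSize = 20; // raw size of the constants section (assumed)
        // Aligned with the constants section's own setting (assumed 8).
        System.out.println(alignUp(constsSize, 8));
        // Aligned with the following instructions section's setting (assumed 16).
        System.out.println(alignUp(constsSize, 16));
    }
}
```

With these assumed values the two results differ (24 versus 32), which is the kind of mismatch an assertion comparing against the wrong alignment would trip over.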
------------- Commit messages: - 8291003: ARM32: constant_table.size assertion Changes: https://git.openjdk.org/jdk/pull/9672/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=9672&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8291003 Stats: 2 lines in 1 file changed: 1 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/9672.diff Fetch: git fetch https://git.openjdk.org/jdk pull/9672/head:pull/9672 PR: https://git.openjdk.org/jdk/pull/9672 From rrich at openjdk.org Thu Jul 28 12:09:11 2022 From: rrich at openjdk.org (Richard Reingruber) Date: Thu, 28 Jul 2022 12:09:11 GMT Subject: RFR: 8291479: ProblemList compiler/rangechecks/TestRangeCheckHoistingScaledIV.java on ppc64le Message-ID: ProblemListing compiler/rangechecks/TestRangeCheckHoistingScaledIV.java on ppc64le It fails always. ------------- Commit messages: - 8291479: ProblemList compiler/rangechecks/TestRangeCheckHoistingScaledIV.java on ppc64le Changes: https://git.openjdk.org/jdk/pull/9674/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=9674&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8291479 Stats: 1 line in 1 file changed: 1 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/9674.diff Fetch: git fetch https://git.openjdk.org/jdk pull/9674/head:pull/9674 PR: https://git.openjdk.org/jdk/pull/9674 From thartmann at openjdk.org Thu Jul 28 12:16:34 2022 From: thartmann at openjdk.org (Tobias Hartmann) Date: Thu, 28 Jul 2022 12:16:34 GMT Subject: RFR: 8291479: ProblemList compiler/rangechecks/TestRangeCheckHoistingScaledIV.java on ppc64le In-Reply-To: References: Message-ID: On Thu, 28 Jul 2022 11:56:24 GMT, Richard Reingruber wrote: > ProblemListing compiler/rangechecks/TestRangeCheckHoistingScaledIV.java on ppc64le > It fails always. Good and trivial. ------------- Marked as reviewed by thartmann (Reviewer). 
PR: https://git.openjdk.org/jdk/pull/9674 From goetz at openjdk.org Thu Jul 28 12:52:53 2022 From: goetz at openjdk.org (Goetz Lindenmaier) Date: Thu, 28 Jul 2022 12:52:53 GMT Subject: RFR: 8291479: ProblemList compiler/rangechecks/TestRangeCheckHoistingScaledIV.java on ppc64le In-Reply-To: References: Message-ID: On Thu, 28 Jul 2022 11:56:24 GMT, Richard Reingruber wrote: > ProblemListing compiler/rangechecks/TestRangeCheckHoistingScaledIV.java on ppc64le > It fails always. LGTM I think you can integrate this to get our CI clean. ------------- Marked as reviewed by goetz (Reviewer). PR: https://git.openjdk.org/jdk/pull/9674 From rrich at openjdk.org Thu Jul 28 13:03:46 2022 From: rrich at openjdk.org (Richard Reingruber) Date: Thu, 28 Jul 2022 13:03:46 GMT Subject: RFR: 8291479: ProblemList compiler/rangechecks/TestRangeCheckHoistingScaledIV.java on ppc64le In-Reply-To: References: Message-ID: On Thu, 28 Jul 2022 12:13:13 GMT, Tobias Hartmann wrote: >> ProblemListing compiler/rangechecks/TestRangeCheckHoistingScaledIV.java on ppc64le >> It fails always. > > Good and trivial. Thanks for the reviews @TobiHartmann and @GoeLin . I shall integrate this trivial change later today. ------------- PR: https://git.openjdk.org/jdk/pull/9674 From rrich at openjdk.org Thu Jul 28 14:10:05 2022 From: rrich at openjdk.org (Richard Reingruber) Date: Thu, 28 Jul 2022 14:10:05 GMT Subject: Integrated: 8291479: ProblemList compiler/rangechecks/TestRangeCheckHoistingScaledIV.java on ppc64le In-Reply-To: References: Message-ID: On Thu, 28 Jul 2022 11:56:24 GMT, Richard Reingruber wrote: > ProblemListing compiler/rangechecks/TestRangeCheckHoistingScaledIV.java on ppc64le > It fails always. This pull request has now been integrated. 
Changeset: 5214a17d Author: Richard Reingruber URL: https://git.openjdk.org/jdk/commit/5214a17d81b1cd3fbdaf90ffe4b37026e31d273d Stats: 1 line in 1 file changed: 1 ins; 0 del; 0 mod 8291479: ProblemList compiler/rangechecks/TestRangeCheckHoistingScaledIV.java on ppc64le Reviewed-by: thartmann, goetz ------------- PR: https://git.openjdk.org/jdk/pull/9674 From shade at openjdk.org Thu Jul 28 14:11:20 2022 From: shade at openjdk.org (Aleksey Shipilev) Date: Thu, 28 Jul 2022 14:11:20 GMT Subject: RFR: 8291003: ARM32: constant_table.size assertion In-Reply-To: References: Message-ID: On Thu, 28 Jul 2022 08:24:22 GMT, Boris Ulasevich wrote: > This change fixes assertion condition as per the recent [JDK-8287373](https://bugs.openjdk.org/browse/JDK-8287373) change: the size of constants section is aligned up according to the settings of the next section (instructions section). This is equivalent to the following, right? int consts_size = consts_section->align_at_start(consts_section->size(), CodeBuffer::SECT_INSTS); In fact, we can make it better by doing e.g.: diff --git a/src/hotspot/cpu/arm/arm.ad b/src/hotspot/cpu/arm/arm.ad index a99d8599cd3..c8e4b482778 100644 --- a/src/hotspot/cpu/arm/arm.ad +++ b/src/hotspot/cpu/arm/arm.ad @@ -236,7 +236,8 @@ void MachConstantBaseNode::emit(CodeBuffer& cbuf, PhaseRegAlloc* ra_) const { Register r = as_Register(ra_->get_encode(this)); CodeSection* consts_section = __ code()->consts(); - int consts_size = consts_section->align_at_start(consts_section->size()); + // constants section size is aligned according to the align_at_start settings of the next section + int consts_size = CodeSection::align_at_start(consts_section->size(), CodeBuffer::SECT_INSTS); assert(constant_table.size() == consts_size, "must be: %d == %d", constant_table.size(), consts_size); // Materialize the constant table base. 
diff --git a/src/hotspot/share/asm/codeBuffer.hpp b/src/hotspot/share/asm/codeBuffer.hpp index e96f9c07e76..4cad3b4d30e 100644 --- a/src/hotspot/share/asm/codeBuffer.hpp +++ b/src/hotspot/share/asm/codeBuffer.hpp @@ -261,12 +261,12 @@ class CodeSection { // Slop between sections, used only when allocating temporary BufferBlob buffers. static csize_t end_slop() { return MAX2((int)sizeof(jdouble), (int)CodeEntryAlignment); } - csize_t align_at_start(csize_t off, int section) const { + static csize_t align_at_start(csize_t off, int section) { return (csize_t) align_up(off, alignment(section)); } csize_t align_at_start(csize_t off) const { - return (csize_t) align_up(off, alignment(_index)); + return align_at_start(off, _index); } // Ensure there's enough space left in the current section. ------------- PR: https://git.openjdk.org/jdk/pull/9672 From bulasevich at openjdk.org Thu Jul 28 15:13:48 2022 From: bulasevich at openjdk.org (Boris Ulasevich) Date: Thu, 28 Jul 2022 15:13:48 GMT Subject: RFR: 8291003: ARM32: constant_table.size assertion In-Reply-To: References: Message-ID: On Thu, 28 Jul 2022 14:09:02 GMT, Aleksey Shipilev wrote: > This is equivalent to the following, right? Yes. I like your change in codeBuffer.hpp. Though your change in arm.ad brings a little bit too much details. And mine is shorter :) What do you think? 
- int consts_size = __ code()->insts()->align_at_start(consts_section->size()); + int consts_size = CodeSection::align_at_start(consts_section->size(), CodeBuffer::SECT_INSTS); ------------- PR: https://git.openjdk.org/jdk/pull/9672 From shade at openjdk.org Thu Jul 28 15:17:56 2022 From: shade at openjdk.org (Aleksey Shipilev) Date: Thu, 28 Jul 2022 15:17:56 GMT Subject: RFR: 8291003: ARM32: constant_table.size assertion In-Reply-To: References: Message-ID: On Thu, 28 Jul 2022 08:24:22 GMT, Boris Ulasevich wrote: > This change fixes assertion condition as per the recent [JDK-8287373](https://bugs.openjdk.org/browse/JDK-8287373) change: the size of constants section is aligned up according to the settings of the next section (instructions section). > CodeSection::align_at_start(consts_section->size(), CodeBuffer::SECT_INSTS); > > This is equivalent to the following, right? > > Yes. I like your change in codeBuffer.hpp. Though your change in arm.ad brings a little bit too much details. And mine is shorter :) What do you think? > > ``` > - int consts_size = __ code()->insts()->align_at_start(consts_section->size()); > + int consts_size = CodeSection::align_at_start(consts_section->size(), CodeBuffer::SECT_INSTS); > ``` What confused me is calling `code()->insts()->align_at_start`. I think explicitly calling out the section index as the `align_at_start` parameter is cleaner. But I don't care much, really. 
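The two `align_at_start` forms discussed above can be sketched in isolation. This is a simplified model and not the HotSpot source: the section indices, the alignment values (8 for constants, 64 for instructions) and the `align_up_sketch` helper are assumptions chosen purely for illustration.

```cpp
#include <cassert>

typedef int csize_t;

// Hypothetical section indices, loosely mirroring CodeBuffer::SECT_*.
enum { SECT_CONSTS = 0, SECT_INSTS = 1 };

// Round off up to the next multiple of alignment (must be a power of two).
static csize_t align_up_sketch(csize_t off, int alignment) {
  return (off + alignment - 1) & ~(alignment - 1);
}

class CodeSection {
  int     _index;
  csize_t _size;

 public:
  CodeSection(int index, csize_t size) : _index(index), _size(size) {}
  csize_t size() const { return _size; }

  // Assumed alignments for this sketch only; real values are platform-dependent.
  static int alignment(int section) {
    return section == SECT_INSTS ? 64 : 8;
  }

  // Static form: the caller names the section whose alignment applies.
  static csize_t align_at_start(csize_t off, int section) {
    return align_up_sketch(off, alignment(section));
  }

  // Member form: delegates to the static form with this section's own index.
  csize_t align_at_start(csize_t off) const {
    return align_at_start(off, _index);
  }
};
```

In this model, a 20-byte constants section aligned for the instructions section comes out at 64 bytes, while aligning it by its own section index gives 24; passing the section index explicitly makes it visible at the call site which of the two is intended.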
------------- PR: https://git.openjdk.org/jdk/pull/9672 From shade at openjdk.org Thu Jul 28 15:33:37 2022 From: shade at openjdk.org (Aleksey Shipilev) Date: Thu, 28 Jul 2022 15:33:37 GMT Subject: RFR: 8291003: ARM32: constant_table.size assertion In-Reply-To: References: Message-ID: On Thu, 28 Jul 2022 08:24:22 GMT, Boris Ulasevich wrote: > This change fixes assertion condition as per the recent [JDK-8287373](https://bugs.openjdk.org/browse/JDK-8287373) change: the size of constants section is aligned up according to the settings of the next section (instructions section). This is fine as well, but consider making it a bit cleaner, as discussed. ------------- Marked as reviewed by shade (Reviewer). PR: https://git.openjdk.org/jdk/pull/9672 From duke at openjdk.org Thu Jul 28 15:49:59 2022 From: duke at openjdk.org (Quan Anh Mai) Date: Thu, 28 Jul 2022 15:49:59 GMT Subject: RFR: 8290322: Optimize Vector.rearrange over byte vectors for AVX512BW targets. [v2] In-Reply-To: References: <_9ZVr02yz3qTVqVW2ti9CkG85TA46_189gzW3QT9QPA=.40ba701e-2f52-4fa1-8494-837d4f5be3e5@github.com> Message-ID: On Thu, 28 Jul 2022 11:04:18 GMT, Quan Anh Mai wrote: >> Jatin Bhateja has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains two additional commits since the last revision: >> >> - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8290322 >> - 8290322: Optimize Vector.rearrange over byte vectors for AVX512BW targets. > > src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 5568: > >> 5566: #endif >> 5567: >> 5568: void C2_MacroAssembler::rearrange_bytes(XMMRegister dst, XMMRegister shuffle, XMMRegister src, XMMRegister xtmp1, > > Can we use the same approach as that used for 256-bit vector. 
Something similar to: > > vpshufb(xtmp1, src, shuffle); // All elements are at the correct place modulo 16 > vpxor(dst, dst, dst); > vpslld(xtmp2, shuffle, 3); // Push the digit signifying the parity of 128-bit lane to the sign digit > vpcmpb(ktmp, xtmp2, dst, lt); > vshufi32x4(xtmp2, xtmp1, xtmp1, 0b10110001); // Shuffle the 128-bit lanes to get 1 - 0 - 3 - 2 > vpblendmb(xtmp1, ktmp, xtmp1, xtmp2); // All elements are at the correct place modulo 32 > vpslld(xtmp2, shuffle, 2); // Push the digit signifying the parity of 256-bit lane to the sign digit > vpcmpb(ktmp, xtmp2, dst, lt); > vshufi32x4(xtmp2, xtmp1, xtmp1, 0b01001110); // Shuffle the 128-bit lanes to get 2 - 3 - 0 - 1 > vpblendmb(dst, ktmp, xtmp1, xtmp2); // All elements are at the correct place modulo 64 Actually, it is my bad, this should not work. Sorry for the noise. ------------- PR: https://git.openjdk.org/jdk/pull/9498 From kvn at openjdk.org Thu Jul 28 16:09:41 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Thu, 28 Jul 2022 16:09:41 GMT Subject: RFR: 8289046: Undefined Behaviour in x86 class Assembler [v9] In-Reply-To: References: Message-ID: On Tue, 19 Jul 2022 10:10:10 GMT, Andrew Haley wrote: >> All instances of type Register exhibit UB in the form of wild pointer (including null pointer) dereferences. This isn't very hard to fix: we should make Registers pointers to something rather than aliases of small integers. >> >> Here's an example of what was happening: >> >> ` rax->encoding();` >> >> Where rax is defined as `(Register *)0`. >> >> This patch changes things so that rax is now defined as a pointer to the start of a static array of RegisterImpl.
>> >> >> typedef const RegisterImpl* Register; >> extern RegisterImpl all_Registers[RegisterImpl::number_of_declared_registers + 1] ; >> inline constexpr Register RegisterImpl::first() { return all_Registers + 1; }; >> inline constexpr Register as_Register(int encoding) { return RegisterImpl::first() + encoding; } >> constexpr Register rax = as_Register(0); > > Andrew Haley has updated the pull request incrementally with one additional commit since the last revision: > > Update src/hotspot/cpu/x86/register_x86.hpp > > Co-authored-by: Aleksey Shipilëv I submitted our testing ------------- PR: https://git.openjdk.org/jdk/pull/9261 From kvn at openjdk.org Thu Jul 28 16:49:16 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Thu, 28 Jul 2022 16:49:16 GMT Subject: RFR: 8283232: x86: Improve vector broadcast operations [v12] In-Reply-To: References: Message-ID: On Wed, 27 Jul 2022 09:40:45 GMT, Quan Anh Mai wrote: >> Hi, >> >> This patch improves the generation of broadcasting a scalar in several ways: >> >> - As has been pointed out, dumping the whole vector into the constant table is costly in terms of code size; this patch minimises this overhead for vector replicate of constants. Also, options are available for constants to be generated with more alignment so that vector load can be made efficiently without crossing cache lines. >> - Vector broadcasting should prefer rematerialising to spilling when register pressure is high. >> - Load vectors using the same kind (integral vs floating point) of instructions as that of the results to avoid potential data bypass delay. >> >> With this patch, the result of the added benchmark, which performs some operations with a really high register pressure, on my machine with Intel i7-7700HQ (avx2) is as follows: >> >> Before After >> Benchmark Mode Cnt Score Error Score Error Units Gain >> SpiltReplicate.testDouble avgt 5 42.621 ± 0.598 38.771 ± 0.797 ns/op +9.03% >> SpiltReplicate.testFloat avgt 5 42.245 ± 1.464 38.603 ±
0.367 ns/op +8.62% >> SpiltReplicate.testInt avgt 5 20.581 ± 5.791 13.755 ± 0.375 ns/op +33.17% >> SpiltReplicate.testLong avgt 5 17.794 ± 4.781 13.663 ± 0.387 ns/op +23.22% >> >> As expected, the constant table sizes shrink significantly from 1024 bytes to 256 bytes for `long`/`double` and 128 bytes for `int`/`float` cases. >> >> This patch also removes some redundant code paths and renames some incorrectly named instructions. >> >> Thank you very much. > > Quan Anh Mai has updated the pull request incrementally with one additional commit since the last revision: > > unnecessary TEMP dst I verified that the latest failure I posted is not related to these changes. There were no other failures. Approved. ------------- Marked as reviewed by kvn (Reviewer). PR: https://git.openjdk.org/jdk/pull/7832 From jbhateja at openjdk.org Thu Jul 28 18:20:00 2022 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Thu, 28 Jul 2022 18:20:00 GMT Subject: RFR: 8283232: x86: Improve vector broadcast operations [v8] In-Reply-To: <0TH2Cv2t4pTvoEZ9c4MLAtMqTWF2_tHYwFq-Z_pmbbQ=.5ab14625-26c7-4cb8-914e-51b3059a69fb@github.com> References: <0TH2Cv2t4pTvoEZ9c4MLAtMqTWF2_tHYwFq-Z_pmbbQ=.5ab14625-26c7-4cb8-914e-51b3059a69fb@github.com> Message-ID: On Tue, 26 Jul 2022 12:48:16 GMT, Quan Anh Mai wrote: >> src/hotspot/cpu/x86/macroAssembler_x86.cpp line 4388: >> >>> 4386: >>> 4387: void MacroAssembler::vallones(XMMRegister dst, int vector_len) { >>> 4388: // vpcmpeqd has special dependency treatment so it should be preferred to vpternlogd >> >> Comment is not clear, adding relevant reference will add more value. > > I have remeasured the statement, it seems that only the non-vex encoding version receives the special dependency treatment, so I reverted this change and added a comment for clarification.
> > The optimisation is noted in [The microarchitecture of Intel, AMD and VIA CPUs: An optimization guide for assembly programmers and compiler makers](https://www.agner.org/optimize/) for several architectures, for example in section 9.8 (Register allocation and renaming in Sandy Bridge and Ivy Bridge pipeline). > > I have performed measurements on uica.uops.info. While this sequence gives 1.37 cycles/iteration on Skylake and Icelake > > pcmpeqd xmm0, xmm0 > paddd xmm0, xmm1 > paddd xmm0, xmm1 > paddd xmm0, xmm1 > > This version has a throughput of 4 cycles/iteration > > vpcmpeqd xmm0, xmm0, xmm0 > vpaddd xmm0, xmm1, xmm0 > vpaddd xmm0, xmm1, xmm0 > vpaddd xmm0, xmm1, xmm0 > > Which indicates that `vpcmpeqd` fails to break dependencies on `xmm0`, as opposed to the `pcmpeqd` instruction. > > Thanks. Both the above JIT sequences have a true dependency chain, so there is no scope for an additional architecture-imposed false dependency, of the kind we use dep-breaking idioms for, to cause any further perf degradation. ------------- PR: https://git.openjdk.org/jdk/pull/7832 From kvn at openjdk.org Thu Jul 28 19:05:41 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Thu, 28 Jul 2022 19:05:41 GMT Subject: RFR: 8289046: Undefined Behaviour in x86 class Assembler [v9] In-Reply-To: References: Message-ID: <_tdb45lcoUeA-2Tg-QkXrzveandieU1TKTJ9TTcMRPs=.e26cd548-9181-4aa3-8f27-34ed3afac45b@github.com> On Tue, 19 Jul 2022 10:10:10 GMT, Andrew Haley wrote: >> All instances of type Register exhibit UB in the form of wild pointer (including null pointer) dereferences. This isn't very hard to fix: we should make Registers pointers to something rather than aliases of small integers. >> >> Here's an example of what was happening: >> >> ` rax->encoding();` >> >> Where rax is defined as `(Register *)0`. >> >> This patch changes things so that rax is now defined as a pointer to the start of a static array of RegisterImpl.
>> >> >> typedef const RegisterImpl* Register; >> extern RegisterImpl all_Registers[RegisterImpl::number_of_declared_registers + 1] ; >> inline constexpr Register RegisterImpl::first() { return all_Registers + 1; }; >> inline constexpr Register as_Register(int encoding) { return RegisterImpl::first() + encoding; } >> constexpr Register rax = as_Register(0); > > Andrew Haley has updated the pull request incrementally with one additional commit since the last revision: > > Update src/hotspot/cpu/x86/register_x86.hpp > > Co-authored-by: Aleksey Shipilëv linux-x64-debug all testing failed on AVX512 machines: # Internal Error (/workspace/open/src/hotspot/cpu/x86/register_x86.hpp:70), pid=25824, tid=25825 # assert(is_valid()) failed: invalid register # # JRE version: (20.0) (fastdebug build ) # Java VM: Java HotSpot(TM) 64-Bit Server VM (fastdebug 20-internal-2022-07-28-1605371.vladimir.kozlov.jdkgit, mixed mode, sharing, tiered, compressed oops, compressed class ptrs, g1 gc, linux-amd64) # Problematic frame: # V [libjvm.so+0x71980d] Assembler::prefix(Address, RegisterImpl*, bool)+0x5d Stack: [0x00007f7c13f7a000,0x00007f7c1407b000], sp=0x00007f7c14076c10, free space=1011k Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code) V [libjvm.so+0x71980d] Assembler::prefix(Address, RegisterImpl*, bool)+0x5d V [libjvm.so+0x71b4ba] Assembler::movl(RegisterImpl*, Address)+0x8a V [libjvm.so+0x158b27f] MacroAssembler::aesctr_encrypt(RegisterImpl*, RegisterImpl*, RegisterImpl*, RegisterImpl*, RegisterImpl*, RegisterImpl*, RegisterImpl*, RegisterImpl*)+0x93f V [libjvm.so+0x19e7dbc] StubGenerator::generate_counterMode_VectorAESCrypt()+0x24c V [libjvm.so+0x19fcf75] StubGenerator::generate_all()+0x1eb5 V [libjvm.so+0x19c5554] StubGenerator_generate(CodeBuffer*, int)+0x54 V [libjvm.so+0x1a015bc] StubRoutines::initialize2()+0x8ac V [libjvm.so+0xfff7c7] init_globals()+0xd7 V [libjvm.so+0x1aef5ed] Threads::create_vm(JavaVMInitArgs*, bool*)+0x35d V
[libjvm.so+0x11c1998] JNI_CreateJavaVM+0x98 Easy to reproduce with just `java -XX:UseAVX=3 t`. It passed with low level of UseAVX. ------------- PR: https://git.openjdk.org/jdk/pull/9261 From bulasevich at openjdk.org Thu Jul 28 19:47:34 2022 From: bulasevich at openjdk.org (Boris Ulasevich) Date: Thu, 28 Jul 2022 19:47:34 GMT Subject: RFR: 8291003: ARM32: constant_table.size assertion [v2] In-Reply-To: References: Message-ID: > This change fixes assertion condition as per the recent [JDK-8287373](https://bugs.openjdk.org/browse/JDK-8287373) change: the size of constants section is aligned up according to the settings of the next section (instructions section). Boris Ulasevich has updated the pull request incrementally with one additional commit since the last revision: align_at_start api clarification, use explicit section index in the expression ------------- Changes: - all: https://git.openjdk.org/jdk/pull/9672/files - new: https://git.openjdk.org/jdk/pull/9672/files/eaa5cfb8..810a9433 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=9672&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=9672&range=00-01 Stats: 3 lines in 2 files changed: 0 ins; 0 del; 3 mod Patch: https://git.openjdk.org/jdk/pull/9672.diff Fetch: git fetch https://git.openjdk.org/jdk pull/9672/head:pull/9672 PR: https://git.openjdk.org/jdk/pull/9672 From kvn at openjdk.org Thu Jul 28 22:48:44 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Thu, 28 Jul 2022 22:48:44 GMT Subject: RFR: 8287385: Suppress superficial unstable_if traps In-Reply-To: References: Message-ID: On Thu, 21 Jul 2022 19:54:11 GMT, Xin Liu wrote: > An unstable if trap is **superficial** if it can NOT prune any code. Sometimes, the else-section of program is empty. The superficial unstable_if traps not only complicate code shape but also consume codecache. C2 has to generate debuginfo for them. If the condition changed, HotSpot has to destroy the established nmethod and compile it again. 
Our analysis shows that rough 20% unstable_if traps are superficial. > > The algorithm which can identify and suppress superficial unstable if traps derives from its definition. A non-superficial unstable_if trap must prune some code. Parser skips parsing dead basic blocks(BBs). A trap is superficial if and only if its target BB is not dead! Or, it will be skipped(contradict from definition). As a result, we can suppress an unstable_if trap when c2 parse the target BB. This algorithm leaves alone those uncommon_traps do prune code. > > For example, C2 generates an uncommon_trap for the else if cond is very likely true. > > public static int foo(boolean cond, int i) { > Value x = new Value(0); > Value y = new Value(1); > Value z = new Value(i); > > if (cond) { > i++; > } > return x._value + y._value + z._value + i; > } > > > If we suppress this superficial unstable_if, the nmethod reduces from 608 bytes to 520 bytes, or -14.5%. Most of them come from "scopes data/pcs". It's because superficial unstable_if generates a trap like this > > 037 call,static wrapper for: uncommon_trap(reason='unstable_if' action='reinterpret' debug_id='0') > # SuperficialIfTrap::foo @ bci:29 (line 32) L[0]=_ L[1]=rsp + #4 L[2]=#ScObj0 L[3]=#ScObj1 L[4]=#ScObj2 STK[0]=rsp + #0 > # ScObj0 SuperficialIfTrap$Value={ [_value :0]=#0 } > # ScObj1 SuperficialIfTrap$Value={ [_value :0]=#1 } > # ScObj2 SuperficialIfTrap$Value={ [_value :0]=rsp + #4 } > # OopMap {off=60/0x3c} > 03c stop # ShouldNotReachHere > > > Here is the breakdown of nmethod, generated by '-XX:+PrintAssembly' > > <-XX:-OptimizeUnstableIf> > Compiled method (c2) 346 17 4 SuperficialIfTrap::foo (53 bytes) > total in heap [0x00007f50f4970910,0x00007f50f4970b70] = 608 > relocation [0x00007f50f4970a70,0x00007f50f4970a80] = 16 > main code [0x00007f50f4970a80,0x00007f50f4970ad8] = 88 > stub code [0x00007f50f4970ad8,0x00007f50f4970af0] = 24 > oops [0x00007f50f4970af0,0x00007f50f4970b00] = 16 > metadata 
[0x00007f50f4970b00,0x00007f50f4970b08] = 8 > scopes data [0x00007f50f4970b08,0x00007f50f4970b38] = 48 > scopes pcs [0x00007f50f4970b38,0x00007f50f4970b68] = 48 > dependencies [0x00007f50f4970b68,0x00007f50f4970b70] = 8 > > <-XX:+OptimizeUnstableIf> > Compiled method (c2) 309 17 4 SuperficialIfTrap::foo (53 bytes) > total in heap [0x00007f4090970910,0x00007f4090970b18] = 520 > relocation [0x00007f4090970a70,0x00007f4090970a80] = 16 > main code [0x00007f4090970a80,0x00007f4090970ac8] = 72 > stub code [0x00007f4090970ac8,0x00007f4090970ae0] = 24 > oops [0x00007f4090970ae0,0x00007f4090970ae8] = 8 > scopes data [0x00007f4090970ae8,0x00007f4090970af0] = 8 > scopes pcs [0x00007f4090970af0,0x00007f4090970b10] = 32 > dependencies [0x00007f4090970b10,0x00007f4090970b18] = 8 I thought about this change more. You are trading performance lost with saving of some space in CodeCache. I don't think we should do this. C2's one of main optimization is class propagation based on profiling (checkcast). It allows significantly reduce following code if profiling shows only one class was observed. I am not sure if removal of "superficial" uncommon trap will not obstruct this and other similar optimizations. ------------- PR: https://git.openjdk.org/jdk/pull/9601 From duke at openjdk.org Fri Jul 29 03:47:34 2022 From: duke at openjdk.org (Quan Anh Mai) Date: Fri, 29 Jul 2022 03:47:34 GMT Subject: RFR: 8283232: x86: Improve vector broadcast operations [v8] In-Reply-To: References: <0TH2Cv2t4pTvoEZ9c4MLAtMqTWF2_tHYwFq-Z_pmbbQ=.5ab14625-26c7-4cb8-914e-51b3059a69fb@github.com> Message-ID: On Thu, 28 Jul 2022 18:17:27 GMT, Jatin Bhateja wrote: >> I have remeasured the statement, it seems that only the non-vex encoding version receives the special dependency treatment, so I reverted this change and added a comment for clarification. 
>> >> The optimisation is noted in [The microarchitecture of Intel, AMD and VIA CPUs: An optimization guide for assembly programmers and compiler makers](https://www.agner.org/optimize/) for several architectures, for example in section 9.8 (Register allocation and renaming in Sandy Bridge and Ivy Bridge pipeline). >> >> I have performed measurements on uica.uops.info. While this sequence gives 1.37 cycles/iteration on Skylake and Icelake >> >> pcmpeqd xmm0, xmm0 >> paddd xmm0, xmm1 >> paddd xmm0, xmm1 >> paddd xmm0, xmm1 >> >> This version has a throughput of 4 cycles/iteration >> >> vpcmpeqd xmm0, xmm0, xmm0 >> vpaddd xmm0, xmm1, xmm0 >> vpaddd xmm0, xmm1, xmm0 >> vpaddd xmm0, xmm1, xmm0 >> >> Which indicates that `vpcmpeqd` fails to break dependencies on `xmm0`, as opposed to the `pcmpeqd` instruction. >> >> Thanks. > > Both the above JIT sequences have a true dependency chain, so there is no scope for an additional architecture-imposed false dependency, of the kind we use dep-breaking idioms for, to cause any further perf degradation. I'm sorry, I don't quite understand what you mean here. What I meant is that while `pcmpeqd xmmk, xmmk` is a dep-breaking idiom, `vpcmpeqd xmmk, xmmk, xmmk` seems not to be. As a result, I reverted that change, and in this context the only change is that I added a branch for non-AVX machines. Please review this patch. Thank you very much.
------------- PR: https://git.openjdk.org/jdk/pull/7832 From jbhateja at openjdk.org Fri Jul 29 05:20:38 2022 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Fri, 29 Jul 2022 05:20:38 GMT Subject: RFR: 8283232: x86: Improve vector broadcast operations [v8] In-Reply-To: References: <0TH2Cv2t4pTvoEZ9c4MLAtMqTWF2_tHYwFq-Z_pmbbQ=.5ab14625-26c7-4cb8-914e-51b3059a69fb@github.com> Message-ID: On Fri, 29 Jul 2022 03:44:16 GMT, Quan Anh Mai wrote: >> Both the above JIT sequences have a true dependency chain, so there is no scope for an additional architecture-imposed false dependency, of the kind we use dep-breaking idioms for, to cause any further perf degradation. > > I'm sorry, I don't quite understand what you mean here. What I meant is that while `pcmpeqd xmmk, xmmk` is a dep-breaking idiom, `vpcmpeqd xmmk, xmmk, xmmk` seems not to be. As a result, I reverted that change, and in this context the only change is that I added a branch for non-AVX machines. Please review this patch. Thank you very much. Yes, it's a valid one-idiom and, as per section E.1.2 of the [X86 Optimization manual](https://cdrdv2.intel.com/v1/dl/getContent/671488), such idioms are resolved by the renamer and do not reach the execution ports. I faintly remember that there was a subtle difference between the handling of zeroing idioms and one-idioms on certain targets, where in some cases one-idioms still go beyond the renamer. But we can keep this change of yours, since even if the all-one idiom (vpcmpeqd) reaches an execution port, latency-wise it's the same as vpternlog over a 256-bit vector.
------------- PR: https://git.openjdk.org/jdk/pull/7832 From shade at openjdk.org Fri Jul 29 05:36:32 2022 From: shade at openjdk.org (Aleksey Shipilev) Date: Fri, 29 Jul 2022 05:36:32 GMT Subject: RFR: 8291003: ARM32: constant_table.size assertion [v2] In-Reply-To: References: Message-ID: On Thu, 28 Jul 2022 19:47:34 GMT, Boris Ulasevich wrote: >> This change fixes assertion condition as per the recent [JDK-8287373](https://bugs.openjdk.org/browse/JDK-8287373) change: the size of constants section is aligned up according to the settings of the next section (instructions section). > > Boris Ulasevich has updated the pull request incrementally with one additional commit since the last revision: > > align_at_start api clarification, use explicit section index in the expression Marked as reviewed by shade (Reviewer). ------------- PR: https://git.openjdk.org/jdk/pull/9672 From bulasevich at openjdk.org Fri Jul 29 06:22:26 2022 From: bulasevich at openjdk.org (Boris Ulasevich) Date: Fri, 29 Jul 2022 06:22:26 GMT Subject: RFR: 8291003: ARM32: constant_table.size assertion [v2] In-Reply-To: References: Message-ID: On Thu, 28 Jul 2022 19:47:34 GMT, Boris Ulasevich wrote: >> This change fixes assertion condition as per the recent [JDK-8287373](https://bugs.openjdk.org/browse/JDK-8287373) change: the size of constants section is aligned up according to the settings of the next section (instructions section). > > Boris Ulasevich has updated the pull request incrementally with one additional commit since the last revision: > > align_at_start api clarification, use explicit section index in the expression I applied the changes to make it a bit cleaner. Thank you! 
------------- PR: https://git.openjdk.org/jdk/pull/9672 From bulasevich at openjdk.org Fri Jul 29 06:27:30 2022 From: bulasevich at openjdk.org (Boris Ulasevich) Date: Fri, 29 Jul 2022 06:27:30 GMT Subject: Integrated: 8291003: ARM32: constant_table.size assertion In-Reply-To: References: Message-ID: On Thu, 28 Jul 2022 08:24:22 GMT, Boris Ulasevich wrote: > This change fixes assertion condition as per the recent [JDK-8287373](https://bugs.openjdk.org/browse/JDK-8287373) change: the size of constants section is aligned up according to the settings of the next section (instructions section). This pull request has now been integrated. Changeset: 18cd16d2 Author: Boris Ulasevich URL: https://git.openjdk.org/jdk/commit/18cd16d2eae2ee624827eb86621f3a4ffd98fe8c Stats: 4 lines in 2 files changed: 1 ins; 0 del; 3 mod 8291003: ARM32: constant_table.size assertion Reviewed-by: shade ------------- PR: https://git.openjdk.org/jdk/pull/9672 From shade at openjdk.org Fri Jul 29 08:22:42 2022 From: shade at openjdk.org (Aleksey Shipilev) Date: Fri, 29 Jul 2022 08:22:42 GMT Subject: RFR: 8289046: Undefined Behaviour in x86 class Assembler [v9] In-Reply-To: <_tdb45lcoUeA-2Tg-QkXrzveandieU1TKTJ9TTcMRPs=.e26cd548-9181-4aa3-8f27-34ed3afac45b@github.com> References: <_tdb45lcoUeA-2Tg-QkXrzveandieU1TKTJ9TTcMRPs=.e26cd548-9181-4aa3-8f27-34ed3afac45b@github.com> Message-ID: On Thu, 28 Jul 2022 19:01:45 GMT, Vladimir Kozlov wrote: > linux-x64-debug all testing failed on AVX512 machines: > > ``` > # Internal Error (/workspace/open/src/hotspot/cpu/x86/register_x86.hpp:70), pid=25824, tid=25825 > # assert(is_valid()) failed: invalid register > # > # JRE version: (20.0) (fastdebug build ) > # Java VM: Java HotSpot(TM) 64-Bit Server VM (fastdebug 20-internal-2022-07-28-1605371.vladimir.kozlov.jdkgit, mixed mode, sharing, tiered, compressed oops, compressed class ptrs, g1 gc, linux-amd64) > # Problematic frame: > # V [libjvm.so+0x71980d] Assembler::prefix(Address, RegisterImpl*, bool)+0x5d > > 
Stack: [0x00007f7c13f7a000,0x00007f7c1407b000], sp=0x00007f7c14076c10, free space=1011k > Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code) > V [libjvm.so+0x71980d] Assembler::prefix(Address, RegisterImpl*, bool)+0x5d > V [libjvm.so+0x71b4ba] Assembler::movl(RegisterImpl*, Address)+0x8a > V [libjvm.so+0x158b27f] MacroAssembler::aesctr_encrypt(RegisterImpl*, RegisterImpl*, RegisterImpl*, RegisterImpl*, RegisterImpl*, RegisterImpl*, RegisterImpl*, RegisterImpl*)+0x93f > V [libjvm.so+0x19e7dbc] StubGenerator::generate_counterMode_VectorAESCrypt()+0x24c > V [libjvm.so+0x19fcf75] StubGenerator::generate_all()+0x1eb5 > V [libjvm.so+0x19c5554] StubGenerator_generate(CodeBuffer*, int)+0x54 > V [libjvm.so+0x1a015bc] StubRoutines::initialize2()+0x8ac > V [libjvm.so+0xfff7c7] init_globals()+0xd7 > V [libjvm.so+0x1aef5ed] Threads::create_vm(JavaVMInitArgs*, bool*)+0x35d > V [libjvm.so+0x11c1998] JNI_CreateJavaVM+0x98 > ``` > > Easy to reproduce with just `java -XX:UseAVX=3 t`. It passed with low level of UseAVX. I believe it blew up because `MacroAssembler::aesctr_encrypt` implicitly cast `0` to `Register`. This seems to work: diff --git a/src/hotspot/cpu/x86/macroAssembler_x86_aes.cpp b/src/hotspot/cpu/x86/macroAssembler_x86_aes.cpp index 425087bdd31..c093174502c 100644 --- a/src/hotspot/cpu/x86/macroAssembler_x86_aes.cpp +++ b/src/hotspot/cpu/x86/macroAssembler_x86_aes.cpp @@ -783,7 +783,7 @@ void MacroAssembler::avx_ghash(Register input_state, Register htbl, void MacroAssembler::aesctr_encrypt(Register src_addr, Register dest_addr, Register key, Register counter, Register len_reg, Register used, Register used_addr, Register saved_encCounter_start) { - const Register rounds = 0; + const Register rounds = rax; const Register pos = r12; Label PRELOOP_START, EXIT_PRELOOP, REMAINDER, REMAINDER_16, LOOP, END, EXIT, END_LOOP, (I assumed `rax` was `0` previously).
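The failure mode described above can be reproduced in a stripped-down model. What follows is a sketch under assumptions, not the HotSpot code: `RegisterImpl` is reduced to an empty struct, `number_of_registers` is an arbitrary 16, and `first_register`, `is_valid` and `encoding_of` are hypothetical free functions standing in for the real member functions. It illustrates why `const Register rounds = 0;` still compiles once `Register` is a pointer type, yet no longer denotes the register with encoding 0.

```cpp
#include <cassert>
#include <cstddef>

struct RegisterImpl {};                      // simplified placeholder type
typedef const RegisterImpl* Register;

const int number_of_registers = 16;          // arbitrary size for this sketch

// Slot 0 is reserved for "no register"; encodings 0..15 live in slots 1..16.
static RegisterImpl all_Registers[number_of_registers + 1];

inline Register first_register() { return all_Registers + 1; }

inline Register as_Register(int encoding) {
  return first_register() + encoding;
}

// Hypothetical validity check: a null pointer and the reserved slot are
// both rejected. Only null or pointers into all_Registers are expected here.
inline bool is_valid(Register r) {
  if (r == nullptr) return false;
  std::ptrdiff_t idx = r - first_register();
  return idx >= 0 && idx < number_of_registers;
}

inline int encoding_of(Register r) {
  assert(is_valid(r) && "invalid register");
  return static_cast<int>(r - first_register());
}
```

In this model the literal `0` converts silently to a null `Register`, which `is_valid` rejects, whereas `as_Register(0)` is a valid pointer whose encoding is 0. That matches the reading that the old code relied on `0` and `rax` being the same value.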
-------------

PR: https://git.openjdk.org/jdk/pull/9261

From jbhateja at openjdk.org Fri Jul 29 08:27:47 2022
From: jbhateja at openjdk.org (Jatin Bhateja)
Date: Fri, 29 Jul 2022 08:27:47 GMT
Subject: RFR: 8283232: x86: Improve vector broadcast operations [v12]
In-Reply-To: References: Message-ID:

On Wed, 27 Jul 2022 09:40:45 GMT, Quan Anh Mai wrote:

>> Hi,
>>
>> This patch improves the generation of broadcasting a scalar in several ways:
>>
>> - As has been pointed out, dumping the whole vector into the constant table is costly in terms of code size; this patch minimises this overhead for vector replicate of constants. Also, options are available for constants to be generated with more alignment so that vector load can be made efficiently without crossing cache lines.
>> - Vector broadcasting should prefer rematerialising to spilling when register pressure is high.
>> - Load vectors using the same kind (integral vs floating point) of instructions as that of the results to avoid potential data bypass delay.
>>
>> With this patch, the result of the added benchmark, which performs some operations with a really high register pressure, on my machine with Intel i7-7700HQ (avx2) is as follows:
>>
>>                                          Before            After
>> Benchmark                  Mode  Cnt   Score   Error     Score   Error  Units     Gain
>> SpiltReplicate.testDouble  avgt    5  42.621 ± 0.598    38.771 ± 0.797  ns/op   +9.03%
>> SpiltReplicate.testFloat   avgt    5  42.245 ± 1.464    38.603 ± 0.367  ns/op   +8.62%
>> SpiltReplicate.testInt     avgt    5  20.581 ± 5.791    13.755 ± 0.375  ns/op  +33.17%
>> SpiltReplicate.testLong    avgt    5  17.794 ± 4.781    13.663 ± 0.387  ns/op  +23.22%
>>
>> As expected, the constant table sizes shrink significantly from 1024 bytes to 256 bytes for `long`/`double` and 128 bytes for `int`/`float` cases.
>>
>> This patch also removes some redundant code paths and renames some incorrectly named instructions.
>>
>> Thank you very much.
> > Quan Anh Mai has updated the pull request incrementally with one additional commit since the last revision: > > unnecessary TEMP dst src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 1651: > 1649: case 32: vmovdqu(dst, src); break; > 1650: case 64: evmovdqul(dst, src, Assembler::AVX_512bit); break; > 1651: default: ShouldNotReachHere(); No change in this file, may be you can remove it from change set. src/hotspot/cpu/x86/x86.ad line 4141: > 4139: instruct ReplB_mem(vec dst, memory mem) %{ > 4140: predicate(VM_Version::supports_avx2()); > 4141: match(Set dst (ReplicateB (LoadB mem))); Merge these rules and create a macro assembly routine for encoding block logic. src/hotspot/cpu/x86/x86.ad line 4159: > 4157: > 4158: instruct vReplS_reg(vec dst, rRegI src) %{ > 4159: predicate(UseAVX >= 2); Can be folded with below pattern, by pushing predicate into encoding block. src/hotspot/cpu/x86/x86.ad line 4188: > 4186: assert(vlen == 8, ""); > 4187: __ punpcklqdq($dst$$XMMRegister, $dst$$XMMRegister); > 4188: } Please move this into macro assembly routine, it will look cleaner that way, after merging with above rule. src/hotspot/cpu/x86/x86.ad line 4253: > 4251: int vlen_enc = vector_length_encoding(this); > 4252: if (VM_Version::supports_avx()) { > 4253: __ vbroadcastss($dst$$XMMRegister, addr, vlen_enc); Emitting vbroadcastss for all the vector sizes for Replicate[B/S/I] may result into domain switch over penalty, can be limited to only <=16 bytes replications and above that we can emit VPBROADCASTD. src/hotspot/cpu/x86/x86.ad line 4261: > 4259: __ punpcklqdq($dst$$XMMRegister, $dst$$XMMRegister); > 4260: } > 4261: } Please move into a new macro-assembly routine. src/hotspot/cpu/x86/x86.ad line 4407: > 4405: __ punpcklqdq($dst$$XMMRegister, $dst$$XMMRegister); > 4406: } > 4407: } Please move to a new macro assembly routine. src/hotspot/cpu/x86/x86.ad line 4497: > 4495: __ punpcklqdq($dst$$XMMRegister, $dst$$XMMRegister); > 4496: } > 4497: } Same as above. 
src/hotspot/cpu/x86/x86.ad line 4541: > 4539: instruct ReplD_reg(vec dst, vlRegD src) %{ > 4540: predicate(UseSSE < 3); > 4541: match(Set dst (ReplicateD src)); Pushing predicates into encoding can fold these patterns. src/hotspot/cpu/x86/x86.ad line 4579: > 4577: if (Matcher::vector_length_in_bytes(this) >= 16) { > 4578: __ punpcklqdq($dst$$XMMRegister, $dst$$XMMRegister); > 4579: } Macro-assembly routine. src/hotspot/share/opto/machnode.cpp line 478: > 476: // Stretching lots of inputs - don't do it. > 477: // A MachContant has the last input being the constant base > 478: if (req() > (is_MachConstant() ? 3U : 2U)) { Earlier some of the nodes like add/sub/mul/divF_imm which were carrying 3 inputs were not getting cloned, now with change we may see them getting rematerialized before uses which may increase code size but of course it will reduced interferences. With earlier cap of 2 only Replicates were passing this check. ------------- PR: https://git.openjdk.org/jdk/pull/7832 From jbhateja at openjdk.org Fri Jul 29 08:27:49 2022 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Fri, 29 Jul 2022 08:27:49 GMT Subject: RFR: 8283232: x86: Improve vector broadcast operations [v12] In-Reply-To: References: Message-ID: <-u0cr_joNl-C5Zu_27nHrdsDFqrUGKo0ygD32PhAwJU=.cd688fed-5606-4a92-b51e-d0cd99fffa6d@github.com> On Fri, 29 Jul 2022 07:51:04 GMT, Jatin Bhateja wrote: >> Quan Anh Mai has updated the pull request incrementally with one additional commit since the last revision: >> >> unnecessary TEMP dst > > src/hotspot/share/opto/machnode.cpp line 478: > >> 476: // Stretching lots of inputs - don't do it. >> 477: // A MachContant has the last input being the constant base >> 478: if (req() > (is_MachConstant() ? 3U : 2U)) { > > Earlier some of the nodes like add/sub/mul/divF_imm which were carrying 3 inputs were not getting cloned, now with change we may see them getting rematerialized before uses which may increase code size but of course it will reduced interferences. 
With earlier cap of 2 only Replicates were passing this check. Saving a spill at the cost of re-materialization using a comparatively cheaper instruction like add/sub/mul looks better for divD may be costly. ------------- PR: https://git.openjdk.org/jdk/pull/7832 From jbhateja at openjdk.org Fri Jul 29 08:27:49 2022 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Fri, 29 Jul 2022 08:27:49 GMT Subject: RFR: 8283232: x86: Improve vector broadcast operations [v12] In-Reply-To: <-u0cr_joNl-C5Zu_27nHrdsDFqrUGKo0ygD32PhAwJU=.cd688fed-5606-4a92-b51e-d0cd99fffa6d@github.com> References: <-u0cr_joNl-C5Zu_27nHrdsDFqrUGKo0ygD32PhAwJU=.cd688fed-5606-4a92-b51e-d0cd99fffa6d@github.com> Message-ID: On Fri, 29 Jul 2022 08:00:21 GMT, Jatin Bhateja wrote: >> src/hotspot/share/opto/machnode.cpp line 478: >> >>> 476: // Stretching lots of inputs - don't do it. >>> 477: // A MachContant has the last input being the constant base >>> 478: if (req() > (is_MachConstant() ? 3U : 2U)) { >> >> Earlier some of the nodes like add/sub/mul/divF_imm which were carrying 3 inputs were not getting cloned, now with change we may see them getting rematerialized before uses which may increase code size but of course it will reduced interferences. With earlier cap of 2 only Replicates were passing this check. > > Saving a spill at the cost of re-materialization using a comparatively cheaper instruction like add/sub/mul looks better for divD may be costly. There are other machine nodes which just accept constants as a mode, like vround* and vcompu* nodes which will now qualify for rematerlization leading to emitting high cost instructions. 
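
As a rough aid for following this sub-thread, the input-count test from the quoted `machnode.cpp` hunk can be modelled in isolation. This is only a sketch: the function names are hypothetical, the "old" rule is inferred from the remark that the earlier cap was 2, and the other conditions that decide whether a node is really rematerialised are not modelled.

```cpp
#include <cassert>

// Old rule as described in the review comment: more than two required
// inputs counts as "stretching lots of inputs".
inline bool too_many_inputs_old(unsigned req) {
  return req > 2U;
}

// New rule from the quoted hunk: a MachConstant node carries the constant
// base as its last input, so it is allowed one more input.
inline bool too_many_inputs_new(unsigned req, bool is_mach_constant) {
  return req > (is_mach_constant ? 3U : 2U);
}
```

Under this model a three-input node that is a MachConstant was rejected by the old test and is accepted by the new one, which is the case the comments above are concerned with.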
------------- PR: https://git.openjdk.org/jdk/pull/7832 From jbhateja at openjdk.org Fri Jul 29 08:27:50 2022 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Fri, 29 Jul 2022 08:27:50 GMT Subject: RFR: 8283232: x86: Improve vector broadcast operations [v12] In-Reply-To: References: <-u0cr_joNl-C5Zu_27nHrdsDFqrUGKo0ygD32PhAwJU=.cd688fed-5606-4a92-b51e-d0cd99fffa6d@github.com> Message-ID: On Fri, 29 Jul 2022 08:15:11 GMT, Jatin Bhateja wrote: >> Saving a spill at the cost of re-materialization using a comparatively cheaper instruction like add/sub/mul looks better for divD may be costly. > > There are other machine nodes which just accept constants as a mode, like vround* and vcompu* nodes which will now qualify for rematerlization leading to emitting high cost instructions. I think we should have a rough cost model here and not just basing it purely over connectivity of the node, or for the time being you can remove this change ? ------------- PR: https://git.openjdk.org/jdk/pull/7832 From aph at openjdk.org Fri Jul 29 09:04:56 2022 From: aph at openjdk.org (Andrew Haley) Date: Fri, 29 Jul 2022 09:04:56 GMT Subject: RFR: 8289046: Undefined Behaviour in x86 class Assembler [v9] In-Reply-To: References: <_tdb45lcoUeA-2Tg-QkXrzveandieU1TKTJ9TTcMRPs=.e26cd548-9181-4aa3-8f27-34ed3afac45b@github.com> Message-ID: <2mhvEN9U6DA1R5LlTKbFxgvNmspNUnzt0Ateu7P03W0=.d9becf40-d6e0-4a4d-8523-3602a8088801@github.com> On Fri, 29 Jul 2022 08:17:53 GMT, Aleksey Shipilev wrote: > > I believe it blew up because `MacroAssembler::aesctr_encrypt` implicitly cast `0` to `Register`. This seems to work: Thanks for finding that. By the way, Vladimir Ivanov is working on a patch that will catch errors like this at build time. 
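
For readers wondering what a build-time check could look like, below is one generic C++ approach, shown as a sketch only. It is not claimed to be what the patch mentioned above does; `CheckedRegister` and the other names are hypothetical. The idea is that a value type with an `explicit` constructor refuses the implicit `0` conversion that a pointer typedef accepts.

```cpp
#include <cassert>
#include <type_traits>

// Hypothetical value-type register handle; illustrative only.
class CheckedRegister {
 public:
  constexpr explicit CheckedRegister(int encoding) : _encoding(encoding) {}
  constexpr int encoding() const { return _encoding; }
 private:
  int _encoding;
};

constexpr CheckedRegister checked_rax(0);

// With an explicit constructor, `const CheckedRegister rounds = 0;` is
// rejected by the compiler. The trait below records the same fact in a
// form that can be checked without writing ill-formed code.
constexpr bool kIntConvertsImplicitly =
    std::is_convertible<int, CheckedRegister>::value;
static_assert(!kIntConvertsImplicitly,
              "a bare integer must not silently become a register");
```

Code that really wants a register by number then has to spell it out, for example `CheckedRegister(5)`, which makes accidental uses easier to spot in review.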
-------------

PR: https://git.openjdk.org/jdk/pull/9261

From aph at openjdk.org Fri Jul 29 09:52:38 2022
From: aph at openjdk.org (Andrew Haley)
Date: Fri, 29 Jul 2022 09:52:38 GMT
Subject: RFR: 8289046: Undefined Behaviour in x86 class Assembler [v10]
In-Reply-To:
References:
Message-ID:

> All instances of type Register exhibit UB in the form of wild pointer (including null pointer) dereferences. This isn't very hard to fix: we should make Registers pointers to something rather than aliases of small integers.
>
> Here's an example of what was happening:
>
> ` rax->encoding();`
>
> Where rax is defined as `(Register *)0`.
>
> This patch changes things so that rax is now defined as a pointer to the start of a static array of RegisterImpl.
>
>
> typedef const RegisterImpl* Register;
> extern RegisterImpl all_Registers[RegisterImpl::number_of_declared_registers + 1] ;
> inline constexpr Register RegisterImpl::first() { return all_Registers + 1; };
> inline constexpr Register as_Register(int encoding) { return RegisterImpl::first() + encoding; }
> constexpr Register rax = as_register(0);

Andrew Haley has updated the pull request incrementally with one additional commit since the last revision:

  Fix MacroAssembler::aesctr_encrypt

-------------

Changes:
  - all: https://git.openjdk.org/jdk/pull/9261/files
  - new: https://git.openjdk.org/jdk/pull/9261/files/3006d36a..54caad46

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=jdk&pr=9261&range=09
 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=9261&range=08-09

  Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod
  Patch: https://git.openjdk.org/jdk/pull/9261.diff
  Fetch: git fetch https://git.openjdk.org/jdk pull/9261/head:pull/9261

PR: https://git.openjdk.org/jdk/pull/9261

From duke at openjdk.org Fri Jul 29 13:48:07 2022
From: duke at openjdk.org (Quan Anh Mai)
Date: Fri, 29 Jul 2022 13:48:07 GMT
Subject: RFR: 8283232: x86: Improve vector broadcast operations [v13]
In-Reply-To:
References:
Message-ID:

> Hi,
>
> This patch improves
the generation of broadcasting a scalar in several ways:
>
> - As it has been pointed out, dumping the whole vector into the constant table is costly in terms of code size, this patch minimises this overhead for vector replicate of constants. Also, options are available for constants to be generated with more alignment so that vector load can be made efficiently without crossing cache lines.
> - Vector broadcasting should prefer rematerialising to spilling when register pressure is high.
> - Load vectors using the same kind (integral vs floating point) of instructions as that of the results to avoid potential data bypass delay
>
> With this patch, the result of the added benchmark, which performs some operations with a really high register pressure, on my machine with Intel i7-7700HQ (avx2) is as follow:
>
>                                        Before          After
> Benchmark                  Mode  Cnt   Score   Error   Score   Error  Units    Gain
> SpiltReplicate.testDouble  avgt    5  42.621 ± 0.598  38.771 ± 0.797  ns/op   +9.03%
> SpiltReplicate.testFloat   avgt    5  42.245 ± 1.464  38.603 ± 0.367  ns/op   +8.62%
> SpiltReplicate.testInt     avgt    5  20.581 ± 5.791  13.755 ± 0.375  ns/op  +33.17%
> SpiltReplicate.testLong    avgt    5  17.794 ± 4.781  13.663 ± 0.387  ns/op  +23.22%
>
> As expected, the constant table sizes shrink significantly from 1024 bytes to 256 bytes for `long`/`double` and 128 bytes for `int`/`float` cases.
>
> This patch also removes some redundant code paths and renames some incorrectly named instructions.
>
> Thank you very much.
Quan Anh Mai has updated the pull request incrementally with one additional commit since the last revision: add load_constant_vector ------------- Changes: - all: https://git.openjdk.org/jdk/pull/7832/files - new: https://git.openjdk.org/jdk/pull/7832/files/bc01c21b..e83ccaab Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=7832&range=12 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=7832&range=11-12 Stats: 80 lines in 3 files changed: 35 ins; 36 del; 9 mod Patch: https://git.openjdk.org/jdk/pull/7832.diff Fetch: git fetch https://git.openjdk.org/jdk pull/7832/head:pull/7832 PR: https://git.openjdk.org/jdk/pull/7832 From duke at openjdk.org Fri Jul 29 13:48:13 2022 From: duke at openjdk.org (Quan Anh Mai) Date: Fri, 29 Jul 2022 13:48:13 GMT Subject: RFR: 8283232: x86: Improve vector broadcast operations [v12] In-Reply-To: References: Message-ID: On Fri, 29 Jul 2022 05:24:19 GMT, Jatin Bhateja wrote: >> Quan Anh Mai has updated the pull request incrementally with one additional commit since the last revision: >> >> unnecessary TEMP dst > > src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 1651: > >> 1649: case 32: vmovdqu(dst, src); break; >> 1650: case 64: evmovdqul(dst, src, Assembler::AVX_512bit); break; >> 1651: default: ShouldNotReachHere(); > > No change in this file, may be you can remove it from change set. Since I added the method `C2_MacroAssembler::load_constant_vector` near here anyway I think this style change can be kept. > src/hotspot/cpu/x86/x86.ad line 4159: > >> 4157: >> 4158: instruct vReplS_reg(vec dst, rRegI src) %{ >> 4159: predicate(UseAVX >= 2); > > Can be folded with below pattern, by pushing predicate into encoding block. Aligning the predicate of the reg and the mem version allows the adlc parser to recognise their relationship and during register allocation can substitute a reg operation with a spilt operand with its corresponding mem node. 
You can see in the generated code the reg node has specific methods such as `cisc_operand` and `cisc_version` > src/hotspot/cpu/x86/x86.ad line 4253: > >> 4251: int vlen_enc = vector_length_encoding(this); >> 4252: if (VM_Version::supports_avx()) { >> 4253: __ vbroadcastss($dst$$XMMRegister, addr, vlen_enc); > > Emitting vbroadcastss for all the vector sizes for Replicate[B/S/I] may result into domain switch over penalty, can be limited to only <=16 bytes replications and above that we can emit VPBROADCASTD. Got it > src/hotspot/cpu/x86/x86.ad line 4261: > >> 4259: __ punpcklqdq($dst$$XMMRegister, $dst$$XMMRegister); >> 4260: } >> 4261: } > > Please move into a new macro-assembly routine. Done ------------- PR: https://git.openjdk.org/jdk/pull/7832 From duke at openjdk.org Fri Jul 29 14:00:28 2022 From: duke at openjdk.org (Quan Anh Mai) Date: Fri, 29 Jul 2022 14:00:28 GMT Subject: RFR: 8283232: x86: Improve vector broadcast operations [v2] In-Reply-To: References: <1FBk3MauXFxUsyHz9kuhqGI-CtLRgHYmHn1eyyaDLvs=.6d4d94b0-32a0-42dc-a181-87df8d8f3b65@github.com> Message-ID: On Wed, 16 Mar 2022 17:25:53 GMT, Jatin Bhateja wrote: >>> Hi, forwarding results within the same bypass domain does not result in delay, data bypass delay happens when the data crosses different domains, according to "Intel? 64 and IA-32 Architectures Optimization Reference Manual" >>> >>> > When a source of a micro-op executed in one stack comes from a micro-op executed in another stack, a delay can occur. The delay occurs also for transitions between Intel SSE integer and Intel SSE floating-point operations. In some of the cases, the data transition is done using a micro-op that is added to the instruction flow. >>> >>> The manual mentions the guideline at section 3.5.2.2 >>> >>> ![image](https://user-images.githubusercontent.com/49088128/158618209-c0674ba7-1c93-4014-a7e1-330f4e5846da.png) >>> >>> Thanks. >> >> Thanks meant to refer to above text. I have removed incorrect reference. 
> >> > Hi, forwarding results within the same bypass domain does not result in delay, data bypass delay happens when the data crosses different domains, according to "Intel? 64 and IA-32 Architectures Optimization Reference Manual" >> > > When a source of a micro-op executed in one stack comes from a micro-op executed in another stack, a delay can occur. The delay occurs also for transitions between Intel SSE integer and Intel SSE floating-point operations. In some of the cases, the data transition is done using a micro-op that is added to the instruction flow. >> > >> > >> > The manual mentions the guideline at section 3.5.2.2 >> > ![image](https://user-images.githubusercontent.com/49088128/158618209-c0674ba7-1c93-4014-a7e1-330f4e5846da.png) >> > Thanks. >> >> Thanks meant to refer to above text. I have removed incorrect reference. > > It will still be good if we can come up with a micro benchmark, that shows the gain with the patch. @jatin-bhateja Thanks a lot for your comments, I have addressed those in the last commit. @vnkozlov Thanks very much for the review and testing. ------------- PR: https://git.openjdk.org/jdk/pull/7832 From duke at openjdk.org Fri Jul 29 14:00:29 2022 From: duke at openjdk.org (Quan Anh Mai) Date: Fri, 29 Jul 2022 14:00:29 GMT Subject: RFR: 8283232: x86: Improve vector broadcast operations [v12] In-Reply-To: References: <-u0cr_joNl-C5Zu_27nHrdsDFqrUGKo0ygD32PhAwJU=.cd688fed-5606-4a92-b51e-d0cd99fffa6d@github.com> Message-ID: On Fri, 29 Jul 2022 08:17:23 GMT, Jatin Bhateja wrote: >> There are other machine nodes which just accept constants as a mode, like vround* and vcompu* nodes which will now qualify for rematerlization leading to emitting high cost instructions. > > I think we should have a rough cost model here and not just basing it purely over connectivity of the node, or for the time being you can remove this change ? 
A node being decided to prefer rematerialising to spilling has to satisfy that: - The node is not explicitly said to be expensive, `divD` and `divF` fails at this stage. - The node declaration only contains simple register rules (explicit or implicit DEF dst and USE src), `vround` fails this because it has temp register, `cmpF_imm` and `cmpD_imm` fail this because they kill flags. - This method we are at agrees with the rematerialising. I have looked at all instances where `constantaddress` is used and found no node where accidental rematerialisation is inefficient. ------------- PR: https://git.openjdk.org/jdk/pull/7832 From shade at openjdk.org Fri Jul 29 14:02:45 2022 From: shade at openjdk.org (Aleksey Shipilev) Date: Fri, 29 Jul 2022 14:02:45 GMT Subject: RFR: 8291559: x86: compiler/vectorization/TestReverseBitsVector.java fails Message-ID: See the bug report for reproducer. The test checks for `avx2`, but that is not enough for x86_32, as you can have `avx2` supported, but no `Reverse*` nodes emitted anyway. It seems the test is only viable on x86_64 anyway, so fix just gates the tests on that arch. 
Additional testing: - [x] Affected test on Linux x86_32, now skipped - [x] Affected test on Linux x86_64, still passes ------------- Commit messages: - Fix Changes: https://git.openjdk.org/jdk/pull/9685/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=9685&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8291559 Stats: 1 line in 1 file changed: 1 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/9685.diff Fetch: git fetch https://git.openjdk.org/jdk pull/9685/head:pull/9685 PR: https://git.openjdk.org/jdk/pull/9685 From shade at openjdk.org Fri Jul 29 14:57:54 2022 From: shade at openjdk.org (Aleksey Shipilev) Date: Fri, 29 Jul 2022 14:57:54 GMT Subject: RFR: 8289046: Undefined Behaviour in x86 class Assembler [v9] In-Reply-To: <2mhvEN9U6DA1R5LlTKbFxgvNmspNUnzt0Ateu7P03W0=.d9becf40-d6e0-4a4d-8523-3602a8088801@github.com> References: <_tdb45lcoUeA-2Tg-QkXrzveandieU1TKTJ9TTcMRPs=.e26cd548-9181-4aa3-8f27-34ed3afac45b@github.com> <2mhvEN9U6DA1R5LlTKbFxgvNmspNUnzt0Ateu7P03W0=.d9becf40-d6e0-4a4d-8523-3602a8088801@github.com> Message-ID: On Fri, 29 Jul 2022 09:02:22 GMT, Andrew Haley wrote: > > I believe it blew up because `MacroAssembler::aesctr_encrypt` implicitly cast `0` to `Register`. This seems to work: > > Thanks for finding that. By the way, Vladimir Ivanov is working on a patch that will catch errors like this at build time. Yes, I wondered if it is possible to make compiler barf on such implicit conversion. Anyway, current patch seems to pass `tier1` and `tier2` on Linux x86_64 fastdebug on AVX-512 machine (Rocket Lake). 
------------- PR: https://git.openjdk.org/jdk/pull/9261 From kvn at openjdk.org Fri Jul 29 17:31:45 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Fri, 29 Jul 2022 17:31:45 GMT Subject: RFR: 8289046: Undefined Behaviour in x86 class Assembler [v10] In-Reply-To: References: Message-ID: <3VV-Qho8_EcxPzIbx1KhYFRa4xkXbKpGFjfbiebvxFY=.e94757ef-2021-40ff-921d-9e14f3da27db@github.com> On Fri, 29 Jul 2022 09:52:38 GMT, Andrew Haley wrote: >> All instances of type Register exhibit UB in the form of wild pointer (including null pointer) dereferences. This isn't very hard to fix: we should make Registers pointers to something rather than aliases of small integers. >> >> Here's an example of what was happening: >> >> ` rax->encoding();` >> >> Where rax is defined as `(Register *)0`. >> >> This patch things so that rax is now defined as a pointer to the start of a static array of RegisterImpl. >> >> >> typedef const RegisterImpl* Register; >> extern RegisterImpl all_Registers[RegisterImpl::number_of_declared_registers + 1] ; >> inline constexpr Register RegisterImpl::first() { return all_Registers + 1; }; >> inline constexpr Register as_Register(int encoding) { return RegisterImpl::first() + encoding; } >> constexpr Register rax = as_register(0); > > Andrew Haley has updated the pull request incrementally with one additional commit since the last revision: > > Fix MacroAssembler::aesctr_encrypt I submitted new our testing. ------------- PR: https://git.openjdk.org/jdk/pull/9261 From kvn at openjdk.org Fri Jul 29 17:41:35 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Fri, 29 Jul 2022 17:41:35 GMT Subject: RFR: 8291559: x86: compiler/vectorization/TestReverseBitsVector.java fails In-Reply-To: References: Message-ID: <2aw1ORGNAgPmVHPXim40UJug441jYY3GjYGeRB2_ejw=.1c62338d-7560-4e72-87aa-622d3957322c@github.com> On Fri, 29 Jul 2022 13:49:59 GMT, Aleksey Shipilev wrote: > See the bug report for reproducer. 
> > The test checks for `avx2`, but that is not enough for x86_32, as you can have `avx2` supported, but no `Reverse*` nodes emitted anyway. It seems the test is only viable on x86_64 anyway, so fix just gates the tests on that arch. > > Additional testing: > - [x] Affected test on Linux x86_32, now skipped > - [x] Affected test on Linux x86_64, still passes Trivial. ------------- Marked as reviewed by kvn (Reviewer). PR: https://git.openjdk.org/jdk/pull/9685 From jbhateja at openjdk.org Fri Jul 29 19:03:41 2022 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Fri, 29 Jul 2022 19:03:41 GMT Subject: RFR: 8283232: x86: Improve vector broadcast operations [v13] In-Reply-To: References: Message-ID: On Fri, 29 Jul 2022 13:48:07 GMT, Quan Anh Mai wrote: >> Hi, >> >> This patch improves the generation of broadcasting a scalar in several ways: >> >> - As it has been pointed out, dumping the whole vector into the constant table is costly in terms of code size, this patch minimises this overhead for vector replicate of constants. Also, options are available for constants to be generated with more alignment so that vector load can be made efficiently without crossing cache lines. >> - Vector broadcasting should prefer rematerialising to spilling when register pressure is high. >> - Load vectors using the same kind (integral vs floating point) of instructions as that of the results to avoid potential data bypass delay >> >> With this patch, the result of the added benchmark, which performs some operations with a really high register pressure, on my machine with Intel i7-7700HQ (avx2) is as follow: >> >> Before After >> Benchmark Mode Cnt Score Error Score Error Units Gain >> SpiltReplicate.testDouble avgt 5 42.621 ? 0.598 38.771 ? 0.797 ns/op +9.03% >> SpiltReplicate.testFloat avgt 5 42.245 ? 1.464 38.603 ? 0.367 ns/op +8.62% >> SpiltReplicate.testInt avgt 5 20.581 ? 5.791 13.755 ? 0.375 ns/op +33.17% >> SpiltReplicate.testLong avgt 5 17.794 ? 4.781 13.663 ? 
0.387 ns/op +23.22% >> >> As expected, the constant table sizes shrink significantly from 1024 bytes to 256 bytes for `long`/`double` and 128 bytes for `int`/`float` cases. >> >> This patch also removes some redundant code paths and renames some incorrectly named instructions. >> >> Thank you very much. > > Quan Anh Mai has updated the pull request incrementally with one additional commit since the last revision: > > add load_constant_vector Marked as reviewed by jbhateja (Committer). ------------- PR: https://git.openjdk.org/jdk/pull/7832 From jbhateja at openjdk.org Fri Jul 29 19:03:44 2022 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Fri, 29 Jul 2022 19:03:44 GMT Subject: RFR: 8283232: x86: Improve vector broadcast operations [v12] In-Reply-To: References: Message-ID: <6iKZmw4KDQl0pDrDHsfwotJgb4I_wpVSPKVjXcn9eHI=.125bb12d-b8eb-43b1-9591-68b1bc445e70@github.com> On Fri, 29 Jul 2022 13:39:31 GMT, Quan Anh Mai wrote: >> src/hotspot/cpu/x86/x86.ad line 4159: >> >>> 4157: >>> 4158: instruct vReplS_reg(vec dst, rRegI src) %{ >>> 4159: predicate(UseAVX >= 2); >> >> Can be folded with below pattern, by pushing predicate into encoding block. > > Aligning the predicate of the reg and the mem version allows the adlc parser to recognise their relationship and during register allocation can substitute a reg operation with a spilt operand with its corresponding mem node. You can see in the generated code the reg node has specific methods such as `cisc_operand` and `cisc_version` May be a misplaced comment, what I meant was to collapse patterns if number and register class of operands comply. 
------------- PR: https://git.openjdk.org/jdk/pull/7832 From jbhateja at openjdk.org Fri Jul 29 19:03:46 2022 From: jbhateja at openjdk.org (Jatin Bhateja) Date: Fri, 29 Jul 2022 19:03:46 GMT Subject: RFR: 8283232: x86: Improve vector broadcast operations [v12] In-Reply-To: References: <-u0cr_joNl-C5Zu_27nHrdsDFqrUGKo0ygD32PhAwJU=.cd688fed-5606-4a92-b51e-d0cd99fffa6d@github.com> Message-ID: <4wrDjVMC9f4fR5Uue_RvEF8zQzhfZkswGVf67wXYTpM=.f93f9fc7-9ab4-4524-8287-8cb317fcf33e@github.com> On Fri, 29 Jul 2022 13:55:39 GMT, Quan Anh Mai wrote: >> I think we should have a rough cost model here and not just basing it purely over connectivity of the node, or for the time being you can remove this change ? > > A node being decided to prefer rematerialising to spilling has to satisfy that: > > - The node is not explicitly said to be expensive, `divD` and `divF` fails at this stage. > - The node declaration only contains simple register rules (explicit or implicit DEF dst and USE src), `vround` fails this because it has temp register, `cmpF_imm` and `cmpD_imm` fail this because they kill flags. > - This method we are at agrees with the rematerialising. > > I have looked at all instances where `constantaddress` is used and found no node where accidental rematerialisation is inefficient. Thanks for your explanations, I agree. ------------- PR: https://git.openjdk.org/jdk/pull/7832 From xliu at openjdk.org Fri Jul 29 19:56:41 2022 From: xliu at openjdk.org (Xin Liu) Date: Fri, 29 Jul 2022 19:56:41 GMT Subject: RFR: 8287385: Suppress superficial unstable_if traps In-Reply-To: References: Message-ID: On Thu, 28 Jul 2022 22:45:59 GMT, Vladimir Kozlov wrote: >> An unstable if trap is **superficial** if it can NOT prune any code. Sometimes, the else-section of program is empty. The superficial unstable_if traps not only complicate code shape but also consume codecache. C2 has to generate debuginfo for them. 
If the condition changed, HotSpot has to destroy the established nmethod and compile it again. Our analysis shows that rough 20% unstable_if traps are superficial. >> >> The algorithm which can identify and suppress superficial unstable if traps derives from its definition. A non-superficial unstable_if trap must prune some code. Parser skips parsing dead basic blocks(BBs). A trap is superficial if and only if its target BB is not dead! Or, it will be skipped(contradict from definition). As a result, we can suppress an unstable_if trap when c2 parse the target BB. This algorithm leaves alone those uncommon_traps do prune code. >> >> For example, C2 generates an uncommon_trap for the else if cond is very likely true. >> >> public static int foo(boolean cond, int i) { >> Value x = new Value(0); >> Value y = new Value(1); >> Value z = new Value(i); >> >> if (cond) { >> i++; >> } >> return x._value + y._value + z._value + i; >> } >> >> >> If we suppress this superficial unstable_if, the nmethod reduces from 608 bytes to 520 bytes, or -14.5%. Most of them come from "scopes data/pcs". 
It's because superficial unstable_if generates a trap like this >> >> 037 call,static wrapper for: uncommon_trap(reason='unstable_if' action='reinterpret' debug_id='0') >> # SuperficialIfTrap::foo @ bci:29 (line 32) L[0]=_ L[1]=rsp + #4 L[2]=#ScObj0 L[3]=#ScObj1 L[4]=#ScObj2 STK[0]=rsp + #0 >> # ScObj0 SuperficialIfTrap$Value={ [_value :0]=#0 } >> # ScObj1 SuperficialIfTrap$Value={ [_value :0]=#1 } >> # ScObj2 SuperficialIfTrap$Value={ [_value :0]=rsp + #4 } >> # OopMap {off=60/0x3c} >> 03c stop # ShouldNotReachHere >> >> >> Here is the breakdown of nmethod, generated by '-XX:+PrintAssembly' >> >> <-XX:-OptimizeUnstableIf> >> Compiled method (c2) 346 17 4 SuperficialIfTrap::foo (53 bytes) >> total in heap [0x00007f50f4970910,0x00007f50f4970b70] = 608 >> relocation [0x00007f50f4970a70,0x00007f50f4970a80] = 16 >> main code [0x00007f50f4970a80,0x00007f50f4970ad8] = 88 >> stub code [0x00007f50f4970ad8,0x00007f50f4970af0] = 24 >> oops [0x00007f50f4970af0,0x00007f50f4970b00] = 16 >> metadata [0x00007f50f4970b00,0x00007f50f4970b08] = 8 >> scopes data [0x00007f50f4970b08,0x00007f50f4970b38] = 48 >> scopes pcs [0x00007f50f4970b38,0x00007f50f4970b68] = 48 >> dependencies [0x00007f50f4970b68,0x00007f50f4970b70] = 8 >> >> <-XX:+OptimizeUnstableIf> >> Compiled method (c2) 309 17 4 SuperficialIfTrap::foo (53 bytes) >> total in heap [0x00007f4090970910,0x00007f4090970b18] = 520 >> relocation [0x00007f4090970a70,0x00007f4090970a80] = 16 >> main code [0x00007f4090970a80,0x00007f4090970ac8] = 72 >> stub code [0x00007f4090970ac8,0x00007f4090970ae0] = 24 >> oops [0x00007f4090970ae0,0x00007f4090970ae8] = 8 >> scopes data [0x00007f4090970ae8,0x00007f4090970af0] = 8 >> scopes pcs [0x00007f4090970af0,0x00007f4090970b10] = 32 >> dependencies [0x00007f4090970b10,0x00007f4090970b18] = 8 > > I thought about this change more. You are trading performance lost with saving of some space in CodeCache. I don't think we should do this. 
> C2's one of main optimization is class propagation based on profiling (checkcast). It allows significantly reduce following code if profiling shows only one class was observed. I am not sure if removal of "superficial" uncommon trap will not obstruct this and other similar optimizations.

hi, @vnkozlov

> I thought about this change more. You are trading performance lost with saving of some space in CodeCache. I don't think we should do this.

Thank you for taking a look at this. You're right. I can provide some data points from my experiments.

I do see that it can reduce some compilations. I ran Renaissance with `-Xlog:deoptimization=debug` and piped the logs to `grep "level=4.*unstable_if" | cut -c 29- | sort -h | uniq | wc -l`. This counts the deoptimization events due to unstable_if, for example:

`[debug][deoptimization] cid=1849 level=4 java.util.concurrent.ForkJoinPool.scan(Ljava/util/concurrent/ForkJoinPool$WorkQueue;II)I trap_bci=137 unstable_if reinterpret pc=0x00007fd6f0a5b790 relative_pc=0x0000000000000430`

I found that this reduces unstable_if deoptimizations by 11% (median). Unfortunately, those events are rare and they won't make much difference. Given that the JIT compilers are both multi-threaded and concurrent, the overhead of JIT compilation is very low. Secondly, I am surprised at how responsive HotSpot is. It quickly recompiles deopt'ed methods with the new information and replaces the old nmethod with a new revision which avoids the uncommon_trap. I hardly observe any codecache savings. Putting these together, I have to concede that this feature doesn't make sense.
| benchmark        | Before | After | diff     |
| ---------------- | ------ | ----- | -------- |
| scrabble         | 18     | 16    | -11.11%  |
| page-rank        | 193    | 181   | -6.22%   |
| future-genetic   | 38     | 38    | 0.00%    |
| akka-uct         | 118    | 99    | -16.10%  |
| movie-lens       | 198    | 185   | -6.57%   |
| scala-doku       | 44     | 44    | 0.00%    |
| chi-square       | 125    | 104   | -16.80%  |
| fj-kmeans        | 26     | 19    | -26.92%  |
| rx-scrabble      | 35     | 34    | -2.86%   |
| finagle-http     | 173    | 126   | -27.17%  |
| reactors         | 81     | 72    | -11.11%  |
| dec-tree         | 200    | 170   | -15.00%  |
| scala-stm-bench7 | 70     | 66    | -5.71%   |
| naive-bayes      | 171    | 144   | -15.79%  |
| als              | 214    | 186   | -13.08%  |
| par-mnemonics    | 19     | 19    | 0.00%    |
| scala-kmeans     | 15     | 13    | -13.33%  |
| philosophers     | 35     | 31    | -11.43%  |
| log-regression   | 184    | 148   | -19.57%  |
| gauss-mix        | 145    | 119   | -17.93%  |
| mnemonics        | 13     | 13    | 0.00%    |
| dotty            | 393    | 338   | -13.99%  |
| finagle-chirper  | 268    | 264   | -1.49%   |

Speaking of "class propagation", do you mean `UseTypeSpeculation`? I will take a closer look at this feature.

thanks,
--lx

-------------

PR: https://git.openjdk.org/jdk/pull/9601

From kvn at openjdk.org Fri Jul 29 22:43:47 2022
From: kvn at openjdk.org (Vladimir Kozlov)
Date: Fri, 29 Jul 2022 22:43:47 GMT
Subject: RFR: 8287385: Suppress superficial unstable_if traps
In-Reply-To:
References:
Message-ID:

On Thu, 21 Jul 2022 19:54:11 GMT, Xin Liu wrote:

> An unstable if trap is **superficial** if it can NOT prune any code. Sometimes, the else-section of program is empty. The superficial unstable_if traps not only complicate code shape but also consume codecache. C2 has to generate debuginfo for them. If the condition changed, HotSpot has to destroy the established nmethod and compile it again. Our analysis shows that rough 20% unstable_if traps are superficial.
>
> The algorithm which can identify and suppress superficial unstable if traps derives from its definition. A non-superficial unstable_if trap must prune some code. Parser skips parsing dead basic blocks(BBs).
A trap is superficial if and only if its target BB is not dead! Or, it will be skipped(contradict from definition). As a result, we can suppress an unstable_if trap when c2 parse the target BB. This algorithm leaves alone those uncommon_traps do prune code. > > For example, C2 generates an uncommon_trap for the else if cond is very likely true. > > public static int foo(boolean cond, int i) { > Value x = new Value(0); > Value y = new Value(1); > Value z = new Value(i); > > if (cond) { > i++; > } > return x._value + y._value + z._value + i; > } > > > If we suppress this superficial unstable_if, the nmethod reduces from 608 bytes to 520 bytes, or -14.5%. Most of them come from "scopes data/pcs". It's because superficial unstable_if generates a trap like this > > 037 call,static wrapper for: uncommon_trap(reason='unstable_if' action='reinterpret' debug_id='0') > # SuperficialIfTrap::foo @ bci:29 (line 32) L[0]=_ L[1]=rsp + #4 L[2]=#ScObj0 L[3]=#ScObj1 L[4]=#ScObj2 STK[0]=rsp + #0 > # ScObj0 SuperficialIfTrap$Value={ [_value :0]=#0 } > # ScObj1 SuperficialIfTrap$Value={ [_value :0]=#1 } > # ScObj2 SuperficialIfTrap$Value={ [_value :0]=rsp + #4 } > # OopMap {off=60/0x3c} > 03c stop # ShouldNotReachHere > > > Here is the breakdown of nmethod, generated by '-XX:+PrintAssembly' > > <-XX:-OptimizeUnstableIf> > Compiled method (c2) 346 17 4 SuperficialIfTrap::foo (53 bytes) > total in heap [0x00007f50f4970910,0x00007f50f4970b70] = 608 > relocation [0x00007f50f4970a70,0x00007f50f4970a80] = 16 > main code [0x00007f50f4970a80,0x00007f50f4970ad8] = 88 > stub code [0x00007f50f4970ad8,0x00007f50f4970af0] = 24 > oops [0x00007f50f4970af0,0x00007f50f4970b00] = 16 > metadata [0x00007f50f4970b00,0x00007f50f4970b08] = 8 > scopes data [0x00007f50f4970b08,0x00007f50f4970b38] = 48 > scopes pcs [0x00007f50f4970b38,0x00007f50f4970b68] = 48 > dependencies [0x00007f50f4970b68,0x00007f50f4970b70] = 8 > > <-XX:+OptimizeUnstableIf> > Compiled method (c2) 309 17 4 SuperficialIfTrap::foo (53 bytes) 
> total in heap [0x00007f4090970910,0x00007f4090970b18] = 520 > relocation [0x00007f4090970a70,0x00007f4090970a80] = 16 > main code [0x00007f4090970a80,0x00007f4090970ac8] = 72 > stub code [0x00007f4090970ac8,0x00007f4090970ae0] = 24 > oops [0x00007f4090970ae0,0x00007f4090970ae8] = 8 > scopes data [0x00007f4090970ae8,0x00007f4090970af0] = 8 > scopes pcs [0x00007f4090970af0,0x00007f4090970b10] = 32 > dependencies [0x00007f4090970b10,0x00007f4090970b18] = 8 When I said "performance" I meant performance of compiled code. Uncommon trap (vs Phi) allows to narrow a value/type/class after check in the following code. j = 0; if (i = 3) { j = 5; } return j; with uncommon trap we generate: if (i != 3) { uncommon_trap; } return 5; Witch class check (checkcast) in the following code we can inline/calls/access fields only for checked class without additional checks. I am talking about generating class check with uncommon trap with [GraphKit::gen_checkcast](https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/graphKit.cpp#L3175) Note, instead of just `return 5;` in my example there could be following code which use `j` extensively which could be replaced with constant `5`. ------------- PR: https://git.openjdk.org/jdk/pull/9601 From duke at openjdk.org Sat Jul 30 01:58:16 2022 From: duke at openjdk.org (SuperCoder79) Date: Sat, 30 Jul 2022 01:58:16 GMT Subject: RFR: 8291336: Add ideal rule to convert floating point multiply by 2 into addition Message-ID: Hello, I would like to propose an ideal transform that converts floating point multiply by 2 (`x * 2`) into an addition operation instead. This would allow for the elimination of the memory reference for the constant two, and keep the whole operation inside registers. 
My justifications for this optimization include:
* As per [Agner Fog's instruction tables](https://www.agner.org/optimize/instruction_tables.pdf), many older systems, such as the Sandy Bridge and Ivy Bridge architectures, have different latencies for addition and multiplication, meaning this change could have beneficial effects in hot code.
* The removal of the memory load would have a beneficial effect in cache-bound situations.
* Multiplication by 2 is a relatively common construct, so this change can apply to a wide range of Java code.

As this is my first time looking into the C2 codebase, I have a few lingering questions about my implementation and how certain parts of the compiler work. Mainly, is this patch getting the type of the operands correctly? I saw some cases where code used `bottom_type()` and other cases where it used `phase->type(value)`. Similarly, are nodes able to be reused as is being done in the AddNode constructors? I saw some places where the clone method was being used, but other places where it wasn't.

I have attached an IR test and a JMH benchmark. Tier 1 testing passes on my machine.
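The rewrite from `x * 2` to `x + x` is only safe if both expressions always produce the same value. Under IEEE 754 both are the correctly rounded result of exactly doubling `x`, so they should agree for every input, including zeros, subnormals, infinities and NaN. A minimal sketch that spot-checks this from plain Java follows; the class and method names are hypothetical and this is not the IR test attached to the PR.

```java
// Hypothetical spot check, not part of the patch: compares x * 2 with x + x
// bit for bit on special values and on a strided sample of float bit patterns.
class MulByTwoCheck {
    // floatToIntBits collapses all NaNs to one canonical pattern,
    // so two NaN results compare as equal.
    static boolean sameFloat(float x) {
        return Float.floatToIntBits(x * 2.0f) == Float.floatToIntBits(x + x);
    }

    static boolean sameDouble(double x) {
        return Double.doubleToLongBits(x * 2.0) == Double.doubleToLongBits(x + x);
    }

    static boolean allSampledValuesAgree() {
        float[] floats = {
            0.0f, -0.0f, 1.0f, -1.5f, Float.MIN_VALUE, Float.MIN_NORMAL,
            Float.MAX_VALUE, Float.POSITIVE_INFINITY, Float.NEGATIVE_INFINITY, Float.NaN
        };
        for (float f : floats) {
            if (!sameFloat(f)) return false;
        }
        double[] doubles = {
            0.0, -0.0, 1.0, -1.5, Double.MIN_VALUE, Double.MIN_NORMAL,
            Double.MAX_VALUE, Double.POSITIVE_INFINITY, Double.NEGATIVE_INFINITY, Double.NaN
        };
        for (double d : doubles) {
            if (!sameDouble(d)) return false;
        }
        // Walk the 32-bit pattern space with a stride to keep the run short.
        for (long bits = 0; bits <= 0xFFFF_FFFFL; bits += 65_537L) {
            if (!sameFloat(Float.intBitsToFloat((int) bits))) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println("x * 2 == x + x on all sampled inputs: " + allSampledValuesAgree());
    }
}
```

A sampled check like this does not prove the equivalence; it only illustrates the corner cases a reviewer would want covered.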
Thanks for your time,
Jasmine

-------------

Commit messages:
 - Fix whitespace in tests
 - Remove extra whitespace
 - Add ideal rule to convert floating point multiply by 2 into addition

Changes: https://git.openjdk.org/jdk/pull/9642/files
 Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=9642&range=00
  Issue: https://bugs.openjdk.org/browse/JDK-8291336
  Stats: 180 lines in 4 files changed: 180 ins; 0 del; 0 mod
  Patch: https://git.openjdk.org/jdk/pull/9642.diff
  Fetch: git fetch https://git.openjdk.org/jdk pull/9642/head:pull/9642

PR: https://git.openjdk.org/jdk/pull/9642

From duke at openjdk.org Sat Jul 30 01:58:17 2022
From: duke at openjdk.org (Quan Anh Mai)
Date: Sat, 30 Jul 2022 01:58:17 GMT
Subject: RFR: 8291336: Add ideal rule to convert floating point multiply by 2 into addition
In-Reply-To:
References:
Message-ID:

On Tue, 26 Jul 2022 15:39:42 GMT, SuperCoder79 wrote:

> Hello,
> I would like to propose an ideal transform that converts floating point multiply by 2 (`x * 2`) into an addition operation instead. This would allow for the elimination of the memory reference for the constant two, and keep the whole operation inside registers. My justifications for this optimization include:
> * As per [Agner Fog's instruction tables](https://www.agner.org/optimize/instruction_tables.pdf) many older systems, such as the sandy bridge and ivy bridge architectures, have different latencies for addition and multiplication meaning this change could have beneficial effects when in hot code.
> * The removal of the memory load would have a beneficial effect in cache bound situations.
> * Multiplication by 2 is relatively common construct so this change can apply to a wide range of Java code.
>
> As this is my first time looking into the c2 codebase, I have a few lingering questions about my implementation and how certain parts of the compiler work. Mainly, is this patch getting the type of the operands correctly?
> I saw some cases where code used `bottom_type()` and other cases where it used `phase->type(value)`. Similarly, are nodes able to be reused as is being done in the AddNode constructors? I saw some places where the clone method was being used, but other places where it wasn't.
>
> I have attached an IR test and a jmh benchmark. Tier 1 testing passes on my machine.
>
> Thanks for your time,
> Jasmine

I have left some review comments on the patch. Can you show the results of the added microbenchmark, please? Thanks.

There you go: https://bugs.openjdk.org/browse/JDK-8291336
Next time you could ask for help on the appropriate mailing list (this time it is hotspot-compiler-dev) or submit a bug through https://bugreport.java.com/bugreport/

Also, please enable GitHub Actions in your fork so that the patches get tested automatically at tier 1 on major platforms.

Hope this helps.

src/hotspot/share/opto/mulnode.cpp line 439:

> 437: // Check to see if we are multiplying by a constant 2 and convert to add, then try the regular MulNode::Ideal
> 438: Node *MulFNode::Ideal(PhaseGVN *phase, bool can_reshape) {
> 439:   const TypeF *t1 = in(1)->bottom_type()->isa_float_constant();

`phase->type(Node*)` refers to the type inferred by the GVN in this phase, while `Node::bottom_type()` refers to the loosest type the node can have. For example, the bottom type of an `AddINode` is always `TypeInt::INT` (every `int` value possible) while the GVN can ensure a stricter type if it knows both the inputs are integers between 0 and 10. In this case, you obtain the correct type nonetheless because `ConNode` extends `TypeNode`, a family of nodes which has their bottom types updated by the GVN. In general, in idealisation, it is more efficient to use `phase->type(Node*)`.
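The difference between a node's bottom type and the type inferred from its inputs can be pictured with a toy model. The sketch below is plain Java, not HotSpot code: `IntRange`, `addBottomType` and `addInferredType` are hypothetical names that only stand in for the roles of `TypeInt`, `Node::bottom_type()` and `phase->type(Node*)`, and the overflow handling is an assumption made to keep the model small.

```java
// Toy model only: illustrates "loosest possible type" versus
// "type inferred from the inputs" for an int addition.
class ToyTypeInference {
    // An inclusive integer range [lo, hi], held in longs so that
    // sums of int bounds cannot wrap around.
    static final class IntRange {
        final long lo;
        final long hi;

        IntRange(long lo, long hi) {
            this.lo = lo;
            this.hi = hi;
        }

        // The full int range, playing the role of TypeInt::INT.
        static final IntRange INT = new IntRange(Integer.MIN_VALUE, Integer.MAX_VALUE);

        @Override
        public String toString() {
            return "[" + lo + ", " + hi + "]";
        }
    }

    // Bottom type of an int add: independent of what the inputs are.
    static IntRange addBottomType() {
        return IntRange.INT;
    }

    // Inferred type of an int add: narrowed by the input ranges, falling
    // back to the full range when the sum could overflow an int.
    static IntRange addInferredType(IntRange a, IntRange b) {
        long lo = a.lo + b.lo;
        long hi = a.hi + b.hi;
        if (lo < Integer.MIN_VALUE || hi > Integer.MAX_VALUE) {
            return IntRange.INT;
        }
        return new IntRange(lo, hi);
    }

    public static void main(String[] args) {
        IntRange zeroToTen = new IntRange(0, 10);
        System.out.println("bottom type:   " + addBottomType());
        // Prints "inferred type: [0, 20]"
        System.out.println("inferred type: " + addInferredType(zeroToTen, zeroToTen));
    }
}
```

In this model an idealization that asks "is this input the constant 2?" succeeds more often against the inferred range than against the bottom range, which matches the reviewer's advice.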
src/hotspot/share/opto/mulnode.cpp line 442:

> 440:   const TypeF *t2 = in(2)->bottom_type()->isa_float_constant();
> 441:
> 442:   // x * 2 -> x + x

Since constants are always pushed to the right of the expression, you don't need to try both permutations of the pattern.

test/hotspot/jtreg/compiler/c2/irTests/TestMulBy2.java line 37:

> 35:  * @run driver compiler.c2.irTests.TestMulBy2
> 36:  */
> 37: public class TestMulBy2 {

Please use a more general name such as `MulFNodeIdealizationTests`

-------------

PR: https://git.openjdk.org/jdk/pull/9642

From duke at openjdk.org Sat Jul 30 01:58:17 2022
From: duke at openjdk.org (SuperCoder79)
Date: Sat, 30 Jul 2022 01:58:17 GMT
Subject: RFR: 8291336: Add ideal rule to convert floating point multiply by 2 into addition
In-Reply-To:
References:
Message-ID:

On Tue, 26 Jul 2022 15:39:42 GMT, SuperCoder79 wrote:

> Hello,
> I would like to propose an ideal transform that converts floating point multiply by 2 (`x * 2`) into an addition operation instead. This would allow for the elimination of the memory reference for the constant two, and keep the whole operation inside registers. My justifications for this optimization include:
> * As per [Agner Fog's instruction tables](https://www.agner.org/optimize/instruction_tables.pdf) many older systems, such as the sandy bridge and ivy bridge architectures, have different latencies for addition and multiplication meaning this change could have beneficial effects when in hot code.
> * The removal of the memory load would have a beneficial effect in cache bound situations.
> * Multiplication by 2 is relatively common construct so this change can apply to a wide range of Java code.
>
> As this is my first time looking into the c2 codebase, I have a few lingering questions about my implementation and how certain parts of the compiler work. Mainly, is this patch getting the type of the operands correctly? I saw some cases where code used `bottom_type()` and other cases where it used `phase->type(value)`. Similarly, are nodes able to be reused as is being done in the AddNode constructors? I saw some places where the clone method was being used, but other places where it wasn't.
>
> I have attached an IR test and a jmh benchmark. Tier 1 testing passes on my machine.
>
> Thanks for your time,
> Jasmine

Unfortunately it seems I can't open bugs on the JBS. Is there a way to do so, or will someone else have to do it for me?

-------------

PR: https://git.openjdk.org/jdk/pull/9642

From duke at openjdk.org Sat Jul 30 03:04:16 2022
From: duke at openjdk.org (SuperCoder79)
Date: Sat, 30 Jul 2022 03:04:16 GMT
Subject: RFR: 8291336: Add ideal rule to convert floating point multiply by 2 into addition [v2]
In-Reply-To:
References:
Message-ID:

> Hello,
> I would like to propose an ideal transform that converts floating point multiply by 2 (`x * 2`) into an addition operation instead. This would allow for the elimination of the memory reference for the constant two, and keep the whole operation inside registers. My justifications for this optimization include:
> * As per [Agner Fog's instruction tables](https://www.agner.org/optimize/instruction_tables.pdf) many older systems, such as the sandy bridge and ivy bridge architectures, have different latencies for addition and multiplication meaning this change could have beneficial effects when in hot code.
> * The removal of the memory load would have a beneficial effect in cache bound situations.
> * Multiplication by 2 is relatively common construct so this change can apply to a wide range of Java code.
>
> As this is my first time looking into the c2 codebase, I have a few lingering questions about my implementation and how certain parts of the compiler work. Mainly, is this patch getting the type of the operands correctly? I saw some cases where code used `bottom_type()` and other cases where it used `phase->type(value)`. Similarly, are nodes able to be reused as is being done in the AddNode constructors? I saw some places where the clone method was being used, but other places where it wasn't.
>
> I have attached an IR test and a jmh benchmark. Tier 1 testing passes on my machine.
>
> Thanks for your time,
> Jasmine

SuperCoder79 has updated the pull request incrementally with one additional commit since the last revision:

  Apply changes from code review and improve benchmark

-------------

Changes:
  - all: https://git.openjdk.org/jdk/pull/9642/files
  - new: https://git.openjdk.org/jdk/pull/9642/files/1448f25a..04706500

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=jdk&pr=9642&range=01
 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=9642&range=00-01

  Stats: 134 lines in 4 files changed: 51 ins; 79 del; 4 mod
  Patch: https://git.openjdk.org/jdk/pull/9642.diff
  Fetch: git fetch https://git.openjdk.org/jdk pull/9642/head:pull/9642

PR: https://git.openjdk.org/jdk/pull/9642

From duke at openjdk.org Sat Jul 30 03:04:16 2022
From: duke at openjdk.org (SuperCoder79)
Date: Sat, 30 Jul 2022 03:04:16 GMT
Subject: RFR: 8291336: Add ideal rule to convert floating point multiply by 2 into addition
In-Reply-To:
References:
Message-ID:

On Tue, 26 Jul 2022 15:39:42 GMT, SuperCoder79 wrote:

> Hello,
> I would like to propose an ideal transform that converts floating point multiply by 2 (`x * 2`) into an addition operation instead. This would allow for the elimination of the memory reference for the constant two, and keep the whole operation inside registers. My justifications for this optimization include:
> * As per [Agner Fog's instruction tables](https://www.agner.org/optimize/instruction_tables.pdf) many older systems, such as the sandy bridge and ivy bridge architectures, have different latencies for addition and multiplication meaning this change could have beneficial effects when in hot code.
> * The removal of the memory load would have a beneficial effect in cache bound situations.
> * Multiplication by 2 is relatively common construct so this change can apply to a wide range of Java code.
>
> As this is my first time looking into the c2 codebase, I have a few lingering questions about my implementation and how certain parts of the compiler work. Mainly, is this patch getting the type of the operands correctly? I saw some cases where code used `bottom_type()` and other cases where it used `phase->type(value)`. Similarly, are nodes able to be reused as is being done in the AddNode constructors? I saw some places where the clone method was being used, but other places where it wasn't.
>
> I have attached an IR test and a jmh benchmark. Tier 1 testing passes on my machine.
>
> Thanks for your time,
> Jasmine

Hi, thank you for your assistance with this. I have updated the PR title and have applied the changes from code review. I have also updated the benchmark and have attached the results below. I tested the benchmark on two systems, a new one and an old one. The new system has a Ryzen 5 4500U CPU, and the results are as shown:

                                      Baseline                      Patch
    Benchmark                Mode  Cnt    Score   Error  Units        Score   Error  Units
    TestMul2.testMul2Double  avgt   10  209.740 ± 1.454  ns/op  //  209.315 ± 1.116  ns/op  (+0.20%)
    TestMul2.testMul2Float   avgt   10  210.871 ± 6.179  ns/op  //  209.498 ± 0.777  ns/op  (+0.65%)

The benchmark showed very little change on the new system, which is expected, as the documentation states that both the `vaddsd` and `vmulsd` instructions have a latency of 3 cycles and a reciprocal throughput of 0.5. The slight gain could be from the elimination of the memory reference, or just from testing variance.

The older system has a Xeon X5690, and had these results:

                                      Baseline                      Patch
    Benchmark                Mode  Cnt    Score   Error  Units        Score   Error  Units
    TestMul2.testMul2Double  avgt   10  190.062 ± 9.695  ns/op  //  170.393 ± 1.193  ns/op  (+10.34%)
    TestMul2.testMul2Float   avgt   10  184.239 ± 1.983  ns/op  //  171.329 ± 4.261  ns/op  (+7.00%)

Because addition is faster than multiplication on the older system, especially for double precision operations, far more substantial gains were realized here.

-------------

PR: https://git.openjdk.org/jdk/pull/9642

From duke at openjdk.org Sat Jul 30 03:04:17 2022
From: duke at openjdk.org (SuperCoder79)
Date: Sat, 30 Jul 2022 03:04:17 GMT
Subject: RFR: 8291336: Add ideal rule to convert floating point multiply by 2 into addition [v2]
In-Reply-To:
References:
Message-ID:

On Wed, 27 Jul 2022 15:54:01 GMT, Quan Anh Mai wrote:

>> SuperCoder79 has updated the pull request incrementally with one additional commit since the last revision:
>>
>>   Apply changes from code review and improve benchmark
>
> src/hotspot/share/opto/mulnode.cpp line 439:
>
>> 437: // Check to see if we are multiplying by a constant 2 and convert to add, then try the regular MulNode::Ideal
>> 438: Node *MulFNode::Ideal(PhaseGVN *phase, bool can_reshape) {
>> 439:   const TypeF *t1 = in(1)->bottom_type()->isa_float_constant();
>
> `phase->type(Node*)` refers to the type inferred by the GVN in this phase, while `Node::bottom_type()` refers to the loosest type the node can have. For example, the bottom type of an `AddINode` is always `TypeInt::INT` (every `int` value possible) while the GVN can ensure a stricter type if it knows both the inputs are integers between 0 and 10. In this case, you obtain the correct type nonetheless because `ConNode` extends `TypeNode`, a family of nodes which has their bottom types updated by the GVN. In general, in idealisation, it is more efficient to use `phase->type(Node*)`.

Thank you for this perspective! I have updated the code to use `phase->type()`.

> src/hotspot/share/opto/mulnode.cpp line 442:
>
>> 440:   const TypeF *t2 = in(2)->bottom_type()->isa_float_constant();
>> 441:
>> 442:   // x * 2 -> x + x
>
> Since constants are always pushed to the right of the expression, you don't need to try both permutations of the pattern.

Done, thanks!
> test/hotspot/jtreg/compiler/c2/irTests/TestMulBy2.java line 37:
>
>> 35:  * @run driver compiler.c2.irTests.TestMulBy2
>> 36:  */
>> 37: public class TestMulBy2 {
>
> Please use a more general name such as `MulFNodeIdealizationTests`

Done

-------------

PR: https://git.openjdk.org/jdk/pull/9642

From duke at openjdk.org Sat Jul 30 14:08:42 2022
From: duke at openjdk.org (SuperCoder79)
Date: Sat, 30 Jul 2022 14:08:42 GMT
Subject: RFR: 8291336: Add ideal rule to convert floating point multiply by 2 into addition [v3]
In-Reply-To:
References:
Message-ID: <71KEXs5o-KZj0lV2nbIdUM7wHQ2meuZemigQohjls7I=.c0362ae2-3dae-4bd1-b8ef-755645421a4b@github.com>

> Hello,
> I would like to propose an ideal transform that converts floating point multiply by 2 (`x * 2`) into an addition operation instead. This would allow for the elimination of the memory reference for the constant two, and keep the whole operation inside registers. My justifications for this optimization include:
> * As per [Agner Fog's instruction tables](https://www.agner.org/optimize/instruction_tables.pdf) many older systems, such as the sandy bridge and ivy bridge architectures, have different latencies for addition and multiplication meaning this change could have beneficial effects when in hot code.
> * The removal of the memory load would have a beneficial effect in cache bound situations.
> * Multiplication by 2 is relatively common construct so this change can apply to a wide range of Java code.
>
> As this is my first time looking into the c2 codebase, I have a few lingering questions about my implementation and how certain parts of the compiler work. Mainly, is this patch getting the type of the operands correctly? I saw some cases where code used `bottom_type()` and other cases where it used `phase->type(value)`. Similarly, are nodes able to be reused as is being done in the AddNode constructors? I saw some places where the clone method was being used, but other places where it wasn't.
>
> I have attached an IR test and a jmh benchmark. Tier 1 testing passes on my machine.
>
> Thanks for your time,
> Jasmine

SuperCoder79 has updated the pull request incrementally with one additional commit since the last revision:

  Add bug tag to IR test

-------------

Changes:
  - all: https://git.openjdk.org/jdk/pull/9642/files
  - new: https://git.openjdk.org/jdk/pull/9642/files/04706500..bce4263c

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=jdk&pr=9642&range=02
 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=9642&range=01-02

  Stats: 1 line in 1 file changed: 1 ins; 0 del; 0 mod
  Patch: https://git.openjdk.org/jdk/pull/9642.diff
  Fetch: git fetch https://git.openjdk.org/jdk pull/9642/head:pull/9642

PR: https://git.openjdk.org/jdk/pull/9642

From kvn at openjdk.org Sat Jul 30 20:04:57 2022
From: kvn at openjdk.org (Vladimir Kozlov)
Date: Sat, 30 Jul 2022 20:04:57 GMT
Subject: RFR: 8289046: Undefined Behaviour in x86 class Assembler [v10]
In-Reply-To:
References:
Message-ID:

On Fri, 29 Jul 2022 09:52:38 GMT, Andrew Haley wrote:

>> All instances of type Register exhibit UB in the form of wild pointer (including null pointer) dereferences. This isn't very hard to fix: we should make Registers pointers to something rather than aliases of small integers.
>>
>> Here's an example of what was happening:
>>
>> ` rax->encoding();`
>>
>> Where rax is defined as `(Register *)0`.
>>
>> This patch changes things so that rax is now defined as a pointer to the start of a static array of RegisterImpl.
>>
>>
>>     typedef const RegisterImpl* Register;
>>     extern RegisterImpl all_Registers[RegisterImpl::number_of_declared_registers + 1];
>>     inline constexpr Register RegisterImpl::first() { return all_Registers + 1; }
>>     inline constexpr Register as_Register(int encoding) { return RegisterImpl::first() + encoding; }
>>     constexpr Register rax = as_Register(0);
>
> Andrew Haley has updated the pull request incrementally with one additional commit since the last revision:
>
>   Fix MacroAssembler::aesctr_encrypt

Testing for latest version (09) passed.

-------------

Marked as reviewed by kvn (Reviewer).

PR: https://git.openjdk.org/jdk/pull/9261

From duke at openjdk.org Sun Jul 31 15:30:23 2022
From: duke at openjdk.org (Quan Anh Mai)
Date: Sun, 31 Jul 2022 15:30:23 GMT
Subject: RFR: 8283232: x86: Improve vector broadcast operations [v13]
In-Reply-To:
References:
Message-ID:

On Fri, 29 Jul 2022 13:48:07 GMT, Quan Anh Mai wrote:

>> Hi,
>>
>> This patch improves the generation of broadcasting a scalar in several ways:
>>
>> - As it has been pointed out, dumping the whole vector into the constant table is costly in terms of code size, this patch minimises this overhead for vector replicate of constants. Also, options are available for constants to be generated with more alignment so that vector load can be made efficiently without crossing cache lines.
>> - Vector broadcasting should prefer rematerialising to spilling when register pressure is high.
>> - Load vectors using the same kind (integral vs floating point) of instructions as that of the results to avoid potential data bypass delay
>>
>> With this patch, the result of the added benchmark, which performs some operations with a really high register pressure, on my machine with Intel i7-7700HQ (avx2) is as follows:
>>
>>                                           Before           After
>>     Benchmark                  Mode  Cnt   Score   Error    Score   Error  Units    Gain
>>     SpiltReplicate.testDouble  avgt    5  42.621 ± 0.598   38.771 ± 0.797  ns/op   +9.03%
>>     SpiltReplicate.testFloat   avgt    5  42.245 ± 1.464   38.603 ± 0.367  ns/op   +8.62%
>>     SpiltReplicate.testInt     avgt    5  20.581 ± 5.791   13.755 ± 0.375  ns/op  +33.17%
>>     SpiltReplicate.testLong    avgt    5  17.794 ± 4.781   13.663 ± 0.387  ns/op  +23.22%
>>
>> As expected, the constant table sizes shrink significantly from 1024 bytes to 256 bytes for `long`/`double` and 128 bytes for `int`/`float` cases.
>>
>> This patch also removes some redundant code paths and renames some incorrectly named instructions.
>>
>> Thank you very much.
>
> Quan Anh Mai has updated the pull request incrementally with one additional commit since the last revision:
>
>   add load_constant_vector

Thanks for your reviews. Does this PR need another run through the tests?

-------------

PR: https://git.openjdk.org/jdk/pull/7832

From shade at openjdk.org Sun Jul 31 18:36:41 2022
From: shade at openjdk.org (Aleksey Shipilev)
Date: Sun, 31 Jul 2022 18:36:41 GMT
Subject: RFR: 8291559: x86: compiler/vectorization/TestReverseBitsVector.java fails
In-Reply-To:
References:
Message-ID: <5BMk0TTBGCTCVGG7d75SmF_Wvb3H2rmjwTZVEYOAh5Q=.eb2b2c83-0848-4db1-82b7-c26ba0830487@github.com>

On Fri, 29 Jul 2022 13:49:59 GMT, Aleksey Shipilev wrote:

> See the bug report for reproducer.
>
> The test checks for `avx2`, but that is not enough for x86_32, as you can have `avx2` supported, but no `Reverse*` nodes emitted anyway. It seems the test is only viable on x86_64 anyway, so fix just gates the tests on that arch.
>
> Additional testing:
>  - [x] Affected test on Linux x86_32, now skipped
>  - [x] Affected test on Linux x86_64, still passes

Thank you!

-------------

PR: https://git.openjdk.org/jdk/pull/9685

From shade at openjdk.org Sun Jul 31 18:53:53 2022
From: shade at openjdk.org (Aleksey Shipilev)
Date: Sun, 31 Jul 2022 18:53:53 GMT
Subject: Integrated: 8291559: x86: compiler/vectorization/TestReverseBitsVector.java fails
In-Reply-To:
References:
Message-ID:

On Fri, 29 Jul 2022 13:49:59 GMT, Aleksey Shipilev wrote:

> See the bug report for reproducer.
>
> The test checks for `avx2`, but that is not enough for x86_32, as you can have `avx2` supported, but no `Reverse*` nodes emitted anyway. It seems the test is only viable on x86_64 anyway, so fix just gates the tests on that arch.
>
> Additional testing:
>  - [x] Affected test on Linux x86_32, now skipped
>  - [x] Affected test on Linux x86_64, still passes

This pull request has now been integrated.

Changeset: acbe093a
Author:    Aleksey Shipilev
URL:       https://git.openjdk.org/jdk/commit/acbe093a66d86904266e390c9dc5da2da34d8982
Stats:     1 line in 1 file changed: 1 ins; 0 del; 0 mod

8291559: x86: compiler/vectorization/TestReverseBitsVector.java fails

Reviewed-by: kvn

-------------

PR: https://git.openjdk.org/jdk/pull/9685