From jiefu at openjdk.java.net Mon May 2 02:00:12 2022 From: jiefu at openjdk.java.net (Jie Fu) Date: Mon, 2 May 2022 02:00:12 GMT Subject: RFR: 8285980: Several tests in compiler/c2/irTests miss @requires vm.compiler2.enabled Message-ID: <_EgVjIjyFG1tQHsjX6smVcBCPqfwRhVSukyQngQcktc=.d2599deb-5ff9-4aca-a384-6d0e132024c7@github.com> Hi all, Several tests in compiler/c2/irTests fail if C2 is not available. The following 7 tests use C2-only VM flags. compiler/c2/irTests/TestSkeletonPredicates.java compiler/c2/irTests/TestFewIterationsCountedLoop.java compiler/c2/irTests/TestDuplicateBackedge.java compiler/c2/irTests/TestStripMiningDropsSafepoint.java compiler/c2/irTests/TestCountedLoopSafepoint.java compiler/c2/irTests/TestLongRangeChecks.java compiler/c2/irTests/TestSuperwordFailsUnrolling.java The following two tests assert that C2 must be available. compiler/c2/irTests/TestIRLShiftIdeal_XPlusX_LShiftC.java compiler/c2/irTests/TestIRAddIdealNotXPlusC.java It would be better to add `@requires vm.compiler2.enabled` to these tests. Thanks. Best regards, Jie ------------- Commit messages: - 8285980: Several tests in compiler/c2/irTests miss @requires vm.compiler2.enabled Changes: https://git.openjdk.java.net/jdk/pull/8495/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=8495&range=00 Issue: https://bugs.openjdk.java.net/browse/JDK-8285980 Stats: 9 lines in 9 files changed: 9 ins; 0 del; 0 mod Patch: https://git.openjdk.java.net/jdk/pull/8495.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8495/head:pull/8495 PR: https://git.openjdk.java.net/jdk/pull/8495 From rcastanedalo at openjdk.java.net Mon May 2 07:02:32 2022 From: rcastanedalo at openjdk.java.net (Roberto =?UTF-8?B?Q2FzdGHDsWVkYQ==?= Lozano) Date: Mon, 2 May 2022 07:02:32 GMT Subject: RFR: 8279622: C2: miscompilation of map pattern as a vector reduction In-Reply-To: References: Message-ID: On Fri, 29 Apr 2022 17:46:36 GMT, Vladimir Kozlov wrote: > Good. Thanks for reviewing, Vladimir!
------------- PR: https://git.openjdk.java.net/jdk/pull/8464 From duke at openjdk.java.net Mon May 2 07:07:21 2022 From: duke at openjdk.java.net (Emanuel Peter) Date: Mon, 2 May 2022 07:07:21 GMT Subject: RFR: 8283775: VM support for graph querying in debugger with BFS traversal and node filtering Message-ID: I implemented VM support for BFS traversal of IR nodes, where one can filter for input/output edges, and the node-type (control / data / memory). If one specifies the `target` node, we find a shortest path between `root` and `target`. With the `filter` string one can easily select which node types to traverse. `void print_bfs(Node* root, const uint max_distance, Node* target, const char* filter)` While a sufficient summary and description is in the comments above the function definition, I want to explain some useful use-cases here. Please let me know if you would find this helpful, or if you have any feedback to improve it. Thanks, Emanuel **1. Better dump()** The function is similar to `dump()`, we can also follow input / output edges up to a certain distance. There are a few improvements/added features: 1. Display the distance to the specified node. This way I can quickly assess how many nodes I have at a certain distance, and I can traverse edges from one distance to the next. The parent column shows the node one step closer to the BFS root. 2. Node filtering with types taken from `type->category()` help one limit traversal to control, data or memory flow. The categories are the same as in IGV. 3. Probably controversial: coloring of node types in terminal. By default it is off. May not work on all terminals. But I find it very useful to quickly separate memory, data and control - just as in IGV - but in my terminal! 4. After matching, we create Mach nodes from the IR nodes. For most Mach nodes, there is an associated old node. Often an assert is triggered for Mach nodes, but the graph already breaks in the IR graph. 
It is thus very helpful to have the old node idx displayed. 5. I also display the head node idx of the block a Mach node is scheduled for. This can be helpful if one gets bad dominance asserts, and needs to see why the blocks were scheduled in a bad way. Example: (rr) p print_bfs(find_node(7011), 2, 0, "cdmox+") No target: perform BFS. dis par dir dump ------------------------------------------- 0 7011 +d 7011 Bool === _ 7010 [[ 7012 ]] [gt] 1 7011 +d 7010 CmpI === _ 216 213 [[ 7011 ]] 2 7010 +d 216 AddI === _ 385 386 [[ 9066 7020 7010 385 ]] 2 7010 +d 213 Phi === 6987 378 6982 [[ 278 .... ]] Example with Mach nodes: (rr) p print_bfs(load, 2, 0, "cdmox+#OB") No target: perform BFS. dis head idom dep old par dir dump ------------------------------------------- 0 314 315 18 o260 144 +d 144 loadF === 314 136 143 [[ 138 270 ]] 1 314 315 18 _ 144 +c 314 Region === 314 97 [[ 314 95 144 ]] 1 316 317 16 o305 144 +m 136 MachProj === 105 [[ 135 137 143 144 139 145 190 221 240 258 ]] 1 316 317 16 o285 144 +d 143 loadN === _ 136 69 [[ 137 144 103 100 202 ]] 2 315 316 17 o311 314 +c 97 IfTrue === 98 [[ 314 ]] 2 316 317 16 o306 136 +x 105 membar_acquire === 316 0 147 0 0 [[ 136 104 ]] 2 49 1 2 o10 143 +d 69 MachProj === 49 [[ 67 132 134 143 145 146 151 169 50 4 247 ]] **2. Find loop body** When I find myself in a loop, I try to localize the loop head and end, and map out at least one path between them. `print_bfs(loop_end, 20, loop_head, "cox+")` This provides us with a shortest path, given this path has a distance of at most 20. Example: (rr) p print_bfs(find_node(357), 30, find_node(587), "cox+") Find shortest path: 357 -> 587. Backtrace target. disdir dump ------------------------------------- 21 +c 587 CountedLoop === 587 361 121 [[ 587 589 603 604 ]] 20 +x 589 MemBarAcquire === 587 1 603 1 1 [[ 588 590 ]] 19 +c 590 Proj === 589 [[ 591 ]] 18 +c 591 If === 590 574 [[ 592 605 ]] 17 +c 592 IfTrue === 591 [[ 576 593 ]] 16 +c 593 RangeCheck === 592 566 [[ 594 607 ]] ... 
2 +c 296 RangeCheck === 295 272 [[ 299 316 ]] 1 +c 299 IfTrue === 296 [[ 258 265 357 ]] 0 c 357 CountedLoopEnd === 299 356 [[ 358 121 ]] [lt] ------------- Commit messages: - some white spaces fixed - fixing up root in shortest path backtracking - 8283775: VM support for graph querying in debugger with BFS traversal and node filtering Changes: https://git.openjdk.java.net/jdk/pull/8468/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=8468&range=00 Issue: https://bugs.openjdk.java.net/browse/JDK-8283775 Stats: 295 lines in 1 file changed: 295 ins; 0 del; 0 mod Patch: https://git.openjdk.java.net/jdk/pull/8468.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8468/head:pull/8468 PR: https://git.openjdk.java.net/jdk/pull/8468 From rcastanedalo at openjdk.java.net Mon May 2 07:08:44 2022 From: rcastanedalo at openjdk.java.net (Roberto =?UTF-8?B?Q2FzdGHDsWVkYQ==?= Lozano) Date: Mon, 2 May 2022 07:08:44 GMT Subject: RFR: 8283684: IGV: speed up filter application In-Reply-To: References: Message-ID: On Fri, 1 Apr 2022 10:19:21 GMT, Roberto Castañeda Lozano wrote: > This change improves view creation time by creating a single JavaScript engine shared among all filters, rather than creating an engine every time a filter is applied. Since creating a JavaScript engine is a costly operation, this change speeds up view creation substantially for small and medium-sized graphs as soon as any filter is applied. This includes the default IGV configuration, where the "Color by category" filter is enabled. > > #### Testing > > ##### Functionality > > - Tested manually applying different filter subsets and differing on a small selection of graphs, for JDK 11 and 17 (which use different versions and ways of packaging the JavaScript engine).
> > - Tested automatically viewing thousands of graphs with different subsets of filters enabled (by instrumenting IGV to view graphs as they are loaded and running `java -Xcomp -XX:-TieredCompilation -XX:PrintIdealGraphLevel=4`). > > ##### Performance > > Measured the view creation time for the default sea-of-nodes view on a selection of 94 medium-sized graphs (200-493 nodes) for different subsets of filters. Before the change, the view creating time increases roughly linearly with the number of applied filters (since an engine is created for each filter application). After the change, the view creating time remains roughly constant (even slightly decreasing) as the number of applied filters increases, yielding an average speedup of 2.4x for the default IGV configuration, and up to 8.2x when five filters are applied. The speedup is expected to diminish for larger graphs where engine creation does not dominate view creation time. The complete results are [attached](https://github.com/openjdk/jdk/files/8396784/performance-evaluation.ods) (note that each measurement in the sheet corresponds to the median of ten runs). Anyone willing to review this? 
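
The idea behind the change can be sketched as follows. This is a minimal, self-contained illustration and not IGV's code: `SketchEngine`, `FilterSketch`, and the filter names are hypothetical, and a construction counter stands in for the real cost of creating a JavaScript engine.

```java
import java.util.List;
import java.util.function.UnaryOperator;

/** Hypothetical stand-in for a script engine that is costly to create. */
final class SketchEngine {
    /** Number of engines constructed so far; makes the creation cost visible. */
    static int created = 0;

    SketchEngine() {
        created++;
    }

    /** "Runs" a filter by applying it to a textual graph description. */
    String run(UnaryOperator<String> filter, String graph) {
        return filter.apply(graph);
    }
}

final class FilterSketch {
    /** Lazily created engine shared by all filter applications. */
    private static SketchEngine shared;

    private static SketchEngine sharedEngine() {
        if (shared == null) {
            shared = new SketchEngine();
        }
        return shared;
    }

    /** Before the change: a new engine for every filter application. */
    static String applyWithFreshEngines(List<UnaryOperator<String>> filters, String graph) {
        String result = graph;
        for (UnaryOperator<String> filter : filters) {
            result = new SketchEngine().run(filter, result);
        }
        return result;
    }

    /** After the change: every filter application reuses the same engine. */
    static String applyWithSharedEngine(List<UnaryOperator<String>> filters, String graph) {
        String result = graph;
        for (UnaryOperator<String> filter : filters) {
            result = sharedEngine().run(filter, result);
        }
        return result;
    }

    public static void main(String[] args) {
        // Two made-up filters; real IGV filters are scripts, not lambdas.
        UnaryOperator<String> colorByCategory = g -> g + "+color";
        UnaryOperator<String> secondFilter = g -> g + "+second";
        List<UnaryOperator<String>> filters = List.of(colorByCategory, secondFilter);

        int before = SketchEngine.created;
        applyWithFreshEngines(filters, "graph");
        System.out.println("fresh engines created:  " + (SketchEngine.created - before));

        before = SketchEngine.created;
        applyWithSharedEngine(filters, "graph");
        applyWithSharedEngine(filters, "graph");
        System.out.println("shared engines created: " + (SketchEngine.created - before));
    }
}
```

In the first variant the number of engines grows with the number of applied filters, which matches the roughly linear growth in view creation time measured before the change. In the second it stays constant no matter how many filters or views are processed.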
------------- PR: https://git.openjdk.java.net/jdk/pull/8073 From rcastanedalo at openjdk.java.net Mon May 2 07:09:45 2022 From: rcastanedalo at openjdk.java.net (Roberto =?UTF-8?B?Q2FzdGHDsWVkYQ==?= Lozano) Date: Mon, 2 May 2022 07:09:45 GMT Subject: RFR: 8280568: IGV: Phi inputs and pinned nodes are not scheduled correctly [v3] In-Reply-To: <513d1tjIppunmhLbocDLf5cInbEjISxbBT7x03i75vs=.fcba822b-9acc-4872-b7a3-68286e6c9515@github.com> References: <-iYiPRIR5iUEQyHWbTW0j2wwWjPS7YSfsp_ikHPAW54=.fa427507-e74d-4cfa-bd49-8dc7c935346c@github.com> <0WHRKqSaLfL8uFlQok8n_q0zjs_btVuFV3FBH3GHuaw=.2babc11b-ebdb-423d-8fb1-fbd971ab5cb3@github.com> <513d1tjIppunmhLbocDLf5cInbEjISxbBT7x03i75vs=.fcba822b-9acc-4872-b7a3-68286e6c9515@github.com> Message-ID: <36wtqVOK1kmn-XTteGWMJZWtTsCWJ0aD-DkCBtP1dZA=.89789094-010e-4031-a720-f2a7f3900d8c@github.com> On Fri, 29 Apr 2022 19:37:31 GMT, Vladimir Kozlov wrote: > Approved. Thanks, Vladimir! ------------- PR: https://git.openjdk.java.net/jdk/pull/7493 From thartmann at openjdk.java.net Mon May 2 07:24:37 2022 From: thartmann at openjdk.java.net (Tobias Hartmann) Date: Mon, 2 May 2022 07:24:37 GMT Subject: RFR: 8265360: several compiler/whitebox tests fail with "private compiler.whitebox.SimpleTestCaseHelper(int) must be compiled" In-Reply-To: References: Message-ID: On Fri, 29 Apr 2022 21:13:21 GMT, Igor Veresov wrote: > The compilation policy uses the length of the queues as a feedback mechanism that gives us information about the compilation speed. In some places it makes decisions based on the queue length alone, without looking at the invocation counters. That can cause a starvation effect. For example, when running in C2-only mode, it may delay profiling in the interpreter if the C2 queue is too long. The solution is to detect "old" methods (that is, methods that have been used a lot), force them into the queue, and let the queue prioritization deal with them.
> > I also did some cleanup for things that got in the way. > Testing looks clean. Looks good to me. Can [JDK-8282032](https://bugs.openjdk.java.net/browse/JDK-8282032) now be closed as duplicate or is it a different issue? ------------- Marked as reviewed by thartmann (Reviewer). PR: https://git.openjdk.java.net/jdk/pull/8473 From thartmann at openjdk.java.net Mon May 2 07:42:51 2022 From: thartmann at openjdk.java.net (Tobias Hartmann) Date: Mon, 2 May 2022 07:42:51 GMT Subject: RFR: 8280003: C1: Reconsider uses of logical_and immediates in LIRGenerator::do_getObjectSize [v7] In-Reply-To: <7BkHqnHc-p24_yzG7Q7gOt2fLpl9bh3mzSZZGb9yMAU=.f6a54439-1474-46b0-aaa1-670e65da0b54@github.com> References: <4wfmxqeneC0qL6x2cFaMVp-AWoQVbognQdKjV_nx4_U=.40d443e8-d900-472c-857d-841efabebc3d@github.com> <7BkHqnHc-p24_yzG7Q7gOt2fLpl9bh3mzSZZGb9yMAU=.f6a54439-1474-46b0-aaa1-670e65da0b54@github.com> Message-ID: On Thu, 28 Apr 2022 10:43:59 GMT, Aleksey Shipilev wrote: >> See the discussion in the bug. >> >> Additional testing: >> - [x] Linux x86_64 fastdebug `java/lang/instrument` >> - [x] Linux x86_32 fastdebug `java/lang/instrument` >> - [x] Linux AArch64 fastdebug `java/lang/instrument` >> - [x] Linux ARM32 fastdebug `java/lang/instrument` >> - [x] Linux PPC64 fastdebug `java/lang/instrument` >> - [x] Linux x86_64 fastdebug `tier1` >> - [x] Linux x86_32 fastdebug `tier1` >> - [x] Linux AArch64 fastdebug `tier1` >> - [x] Linux PPC64 fastdebug `tier1` > > Aleksey Shipilev has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. 
The pull request contains 11 additional commits since the last revision: > > - Fix RISC-V too > - Merge branch 'master' into JDK-8280003-c1-logical-and > - Merge branch 'master' into JDK-8280003-c1-logical-and > - Revert ARM32 checks > - Merge branch 'master' into JDK-8280003-c1-logical-and > - Fixing failures in ARM32 > - Merge branch 'master' into JDK-8280003-c1-logical-and > - Checking ARM32 code > - Use checked_cast > - Merge branch 'master' into JDK-8280003-c1-logical-and > - ... and 1 more: https://git.openjdk.java.net/jdk/compare/7f83a34a...66448a5e Sounds good, thanks for the clarifications. ------------- PR: https://git.openjdk.java.net/jdk/pull/7080 From thartmann at openjdk.java.net Mon May 2 07:51:42 2022 From: thartmann at openjdk.java.net (Tobias Hartmann) Date: Mon, 2 May 2022 07:51:42 GMT Subject: RFR: 8285980: Several tests in compiler/c2/irTests miss @requires vm.compiler2.enabled In-Reply-To: <_EgVjIjyFG1tQHsjX6smVcBCPqfwRhVSukyQngQcktc=.d2599deb-5ff9-4aca-a384-6d0e132024c7@github.com> References: <_EgVjIjyFG1tQHsjX6smVcBCPqfwRhVSukyQngQcktc=.d2599deb-5ff9-4aca-a384-6d0e132024c7@github.com> Message-ID: On Mon, 2 May 2022 01:53:57 GMT, Jie Fu wrote: > Hi all, > > Several tests in compiler/c2/irTests fail if C2 is not available. > > The following 7 tests use C2-only VM flags. > > compiler/c2/irTests/TestSkeletonPredicates.java > compiler/c2/irTests/TestFewIterationsCountedLoop.java > compiler/c2/irTests/TestDuplicateBackedge.java > compiler/c2/irTests/TestStripMiningDropsSafepoint.java > compiler/c2/irTests/TestCountedLoopSafepoint.java > compiler/c2/irTests/TestLongRangeChecks.java > compiler/c2/irTests/TestSuperwordFailsUnrolling.java > > > The following two tests assert that C2 must be available. > > compiler/c2/irTests/TestIRLShiftIdeal_XPlusX_LShiftC.java > compiler/c2/irTests/TestIRAddIdealNotXPlusC.java > > > It would be better to add `@requires vm.compiler2.enabled` to these tests. > > Thanks. > Best regards, > Jie Looks good and trivial.
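
For illustration, a minimal sketch of what a jtreg header with this tag might look like. The class name, summary, and choice of `-XX:LoopUnrollLimit=0` (assumed here to be a flag that only C2-enabled VMs accept) are assumptions made for the example and are not taken from the patch; the real tests in compiler/c2/irTests use the IR test framework instead of a plain `main`.

```java
/*
 * @test
 * @summary Illustrative only: a test whose command line uses a C2 flag.
 * @requires vm.compiler2.enabled
 * @run main/othervm -XX:LoopUnrollLimit=0 ExampleC2OnlyTest
 */

// Hypothetical test class. In the JDK tree it would normally be public and
// live in ExampleC2OnlyTest.java; it is package-private here so that the
// sketch compiles under any file name.
final class ExampleC2OnlyTest {
    /** Simple counted loop: sums the integers 0 .. n-1. */
    static int sumTo(int n) {
        int sum = 0;
        for (int i = 0; i < n; i++) {
            sum += i;
        }
        return sum;
    }

    public static void main(String[] args) {
        // Call the loop repeatedly so a JIT compiler has a chance to compile it.
        for (int i = 0; i < 20_000; i++) {
            if (sumTo(100) != 4950) {
                throw new RuntimeException("unexpected sum");
            }
        }
    }
}
```

With the `@requires` line, jtreg does not select the test on a VM where `vm.compiler2.enabled` is false, so the C2 flag on the `@run` line is never passed to a VM that cannot honor it.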
------------- Marked as reviewed by thartmann (Reviewer). PR: https://git.openjdk.java.net/jdk/pull/8495 From thartmann at openjdk.java.net Mon May 2 07:57:42 2022 From: thartmann at openjdk.java.net (Tobias Hartmann) Date: Mon, 2 May 2022 07:57:42 GMT Subject: RFR: 8279622: C2: miscompilation of map pattern as a vector reduction In-Reply-To: References: Message-ID: <0sfUbR39P0xl9MjYDX7lzlAiLtbKAOw_uWEp-apjUck=.6c8b1b21-8ca7-47e6-ad37-7c73d31983e5@github.com> On Fri, 29 Apr 2022 08:02:07 GMT, Roberto Castañeda Lozano wrote: > The node reduction flag (`Node::Flag_is_reduction`) is only valid as long as the node remains within the reduction loop in which it was originally marked. This changeset ensures that reduction nodes are unmarked as such if they are extracted out of their associated reduction loop by the peel/main/post loop transformation (`PhaseIdealLoop::insert_pre_post_loops()`). This prevents SLP from wrongly vectorizing, as parallel reductions, outer non-reduction loops to which reduction nodes have been extracted. A more detailed analysis of the failure is available in the [JBS bug report](https://bugs.openjdk.java.net/browse/JDK-8279622). > > The issue could be alternatively fixed at the IGVN level by unmarking reduction nodes as soon as they are decoupled from their corresponding phi and counted loop nodes, but the fix proposed here is simpler and less intrusive. > > The changeset also introduces an assertion at the use point (`SuperWord::transform_loop()`) to check that loops containing reduction nodes are marked as reductions. This invariant could be alternatively placed together with other assertions under `-XX:+VerifyLoopOptimizations`, but [this option is known to be broken](https://bugs.openjdk.java.net/browse/JDK-8173709).
> > IR verification using the IR test framework is not feasible for the proposed test case, since the failure is triggered on a OSR compilation, [for which IR verification does not seem to be supported](https://github.com/openjdk/jdk/blob/e7c3b9de649d4b28ba16844e042afcf3c89323e5/test/hotspot/jtreg/compiler/lib/ir_framework/driver/irmatching/parser/Line.java#L56-L58). The assertion described above compensates this limitation. > > #### Testing > > ##### Functionality > > - hs-tier1-3 (windows-x64, linux-x64, linux-aarch64, and macosx-x64; release and debug mode). > - hs-tier4-7 (linux-x64; debug mode). > > ##### Performance > > - No significant regression on a set of standard benchmark suites (DaCapo, SPECjbb2015, SPECjvm2008, ...) and on windows-x64, linux-x64, linux-aarch64, and macosx-x64. > - No significant difference in generated number of vector instructions when comparing the output of `compiler/vectorization` and `compiler/loopopts/superword` tests using `-XX:+TraceNewVectors` on linux-x64. Nice analysis, looks good to me! ------------- Marked as reviewed by thartmann (Reviewer). 
PR: https://git.openjdk.java.net/jdk/pull/8464 From thartmann at openjdk.java.net Mon May 2 08:04:36 2022 From: thartmann at openjdk.java.net (Tobias Hartmann) Date: Mon, 2 May 2022 08:04:36 GMT Subject: RFR: 8280568: IGV: Phi inputs and pinned nodes are not scheduled correctly [v3] In-Reply-To: <0WHRKqSaLfL8uFlQok8n_q0zjs_btVuFV3FBH3GHuaw=.2babc11b-ebdb-423d-8fb1-fbd971ab5cb3@github.com> References: <-iYiPRIR5iUEQyHWbTW0j2wwWjPS7YSfsp_ikHPAW54=.fa427507-e74d-4cfa-bd49-8dc7c935346c@github.com> <0WHRKqSaLfL8uFlQok8n_q0zjs_btVuFV3FBH3GHuaw=.2babc11b-ebdb-423d-8fb1-fbd971ab5cb3@github.com> Message-ID: On Fri, 29 Apr 2022 13:37:12 GMT, Roberto Castañeda Lozano wrote: >> This changeset improves the accuracy of IGV's schedule approximation algorithm by >> >> 1) scheduling pinned nodes in the same block as their corresponding control nodes (or in the immediate successor block for nodes pinned to block projections); and >> 2) scheduling phi input nodes above the phi block, in their corresponding control path. >> >> The combined effect of these scheduling improvements can be seen in the example below. In the current version of IGV **(before)**, `135 ClearArray` is wrongly scheduled in the same block as its output phi node (`91 Phi`). After this changeset **(after)**, `135 ClearArray` is correctly scheduled above the phi node, in its corresponding control path. Since `135 ClearArray` is pinned to the block projection `151 True`, a new block is created between `151 True` and `91 Phi` to accommodate it. >> >> ![fix](https://user-images.githubusercontent.com/8792647/165956029-8e8bae8c-d836-444c-8861-2c13f52c22c6.png) >> >> Additionally, the changeset introduces checks on graph invariants that are assumed by scheduling approximation (e.g. each block projection has a single control successor), warning the IGV user if these invariants are broken.
Warning and gracefully degrading the approximated schedule is preferred to just failing since one of IGV's main use cases is debugging graphs which might be ill-formed. The warnings are reported both textually in the IGV log and visually for each node, if the corresponding filter ("Show node warnings") is active: >> >> ![warning](https://user-images.githubusercontent.com/8792647/165957171-50c2bcb9-0247-45cc-b806-c4e811996ce4.png) >> >> Node warnings are implemented as a general filter and can be used in custom filters for other purposes, for example highlighting nodes that match a certain property of interest. >> >> #### Testing >> >> ##### Functionality >> >> - Tested manually that phi inputs and pinned nodes are scheduled correctly for a few selected graphs (including the reported one). >> >> - Tested automatically that scheduling tens of thousands of graphs (by instrumenting IGV to schedule parsed graphs eagerly and running `java -Xcomp -XX:-TieredCompilation -XX:PrintIdealGraphLevel=4`) does not trigger any assertion failure and does not warn with the message "Phi input that does not dominate the phi's input block". >> >> ##### Performance >> >> Measured that the scheduling time is not slowed down for a selection of 89 large graphs (2511-7329 nodes). The performance results are attached (note that each measurement in the sheet corresponds to the median of ten runs). > > Roberto Castañeda Lozano has updated the pull request with a new target base due to a merge or a rebase.
The pull request now contains 12 commits: > > - Build dummy blocks in a single pass, refactor scheduleLatest, add warnings > - Merge branch 'master' into JDK-8280568 > - Update copyright years > - Structure error reporting > - Recompute dominator info for final checks, as this is invalidated by block renaming > - Rename all blocks as a last step, to accomodate new blocks > - Schedule nodes pinned to critical-edge projections in edge-splitting blocks > - Make scheduling warning messages more readable > - Sink nodes pinned to block projections when possible > - Fix warning message > - ... and 2 more: https://git.openjdk.java.net/jdk/compare/e333cd33...35bb56fb Awesome, looks good to me. ------------- Marked as reviewed by thartmann (Reviewer). PR: https://git.openjdk.java.net/jdk/pull/7493 From thartmann at openjdk.java.net Mon May 2 08:09:41 2022 From: thartmann at openjdk.java.net (Tobias Hartmann) Date: Mon, 2 May 2022 08:09:41 GMT Subject: RFR: 8283684: IGV: speed up filter application In-Reply-To: References: Message-ID: On Fri, 1 Apr 2022 10:19:21 GMT, Roberto Castañeda Lozano wrote: > This change improves view creation time by creating a single JavaScript engine shared among all filters, rather than creating an engine every time a filter is applied. Since creating a JavaScript engine is a costly operation, this change speeds up view creation substantially for small and medium-sized graphs as soon as any filter is applied. This includes the default IGV configuration, where the "Color by category" filter is enabled. > > #### Testing > > ##### Functionality > > - Tested manually applying different filter subsets and differing on a small selection of graphs, for JDK 11 and 17 (which use different versions and ways of packaging the JavaScript engine).
> > - Tested automatically viewing thousands of graphs with different subsets of filters enabled (by instrumenting IGV to view graphs as they are loaded and running `java -Xcomp -XX:-TieredCompilation -XX:PrintIdealGraphLevel=4`). > > ##### Performance > > Measured the view creation time for the default sea-of-nodes view on a selection of 94 medium-sized graphs (200-493 nodes) for different subsets of filters. Before the change, the view creating time increases roughly linearly with the number of applied filters (since an engine is created for each filter application). After the change, the view creating time remains roughly constant (even slightly decreasing) as the number of applied filters increases, yielding an average speedup of 2.4x for the default IGV configuration, and up to 8.2x when five filters are applied. The speedup is expected to diminish for larger graphs where engine creation does not dominate view creation time. The complete results are [attached](https://github.com/openjdk/jdk/files/8396784/performance-evaluation.ods) (note that each measurement in the sheet corresponds to the median of ten runs). Looks good. ------------- Marked as reviewed by thartmann (Reviewer). PR: https://git.openjdk.java.net/jdk/pull/8073 From rcastanedalo at openjdk.java.net Mon May 2 08:26:52 2022 From: rcastanedalo at openjdk.java.net (Roberto =?UTF-8?B?Q2FzdGHDsWVkYQ==?= Lozano) Date: Mon, 2 May 2022 08:26:52 GMT Subject: RFR: 8283775: VM support for graph querying in debugger with BFS traversal and node filtering In-Reply-To: References: Message-ID: On Fri, 29 Apr 2022 13:04:55 GMT, Emanuel Peter wrote: > I implemented VM support for BFS traversal of IR nodes, where one can filter for input/output edges, and the node-type (control / data / memory). If one specifies the `target` node, we find a shortest path between `root` and `target`. With the `filter` string one can easily select which node types to traverse. 
> > `void print_bfs(Node* root, const uint max_distance, Node* target, const char* filter)` > > While a sufficient summary and description is in the comments above the function definition, I want to explain some useful use-cases here. > > Please let me know if you would find this helpful, or if you have any feedback to improve it. > Thanks, Emanuel > > **1. Better dump()** > The function is similar to `dump()`, we can also follow input / output edges up to a certain distance. There are a few improvements/added features: > > 1. Display the distance to the specified node. This way I can quickly assess how many nodes I have at a certain distance, and I can traverse edges from one distance to the next. The parent column shows the node one step closer to the BFS root. > 2. Node filtering with types taken from `type->category()` help one limit traversal to control, data or memory flow. The categories are the same as in IGV. > 3. Probably controversial: coloring of node types in terminal. By default it is off. May not work on all terminals. But I find it very useful to quickly separate memory, data and control - just as in IGV - but in my terminal! > 4. After matching, we create Mach nodes from the IR nodes. For most Mach nodes, there is an associated old node. Often an assert is triggered for Mach nodes, but the graph already breaks in the IR graph. It is thus very helpful to have the old node idx displayed. > 5. I also display the head node idx of the block a Mach node is scheduled for. This can be helpful if one gets bad dominance asserts, and needs to see why the blocks were scheduled in a bad way. > > Example: > > (rr) p print_bfs(find_node(7011), 2, 0, "cdmox+") > No target: perform BFS. > dis par dir dump > ------------------------------------------- > 0 7011 +d 7011 Bool === _ 7010 [[ 7012 ]] [gt] > 1 7011 +d 7010 CmpI === _ 216 213 [[ 7011 ]] > 2 7010 +d 216 AddI === _ 385 386 [[ 9066 7020 7010 385 ]] > 2 7010 +d 213 Phi === 6987 378 6982 [[ 278 .... 
]] > > > Example with Mach nodes: > > (rr) p print_bfs(load, 2, 0, "cdmox+#OB") > No target: perform BFS. > dis head idom dep old par dir dump > ------------------------------------------- > 0 314 315 18 o260 144 +d 144 loadF === 314 136 143 [[ 138 270 ]] > 1 314 315 18 _ 144 +c 314 Region === 314 97 [[ 314 95 144 ]] > 1 316 317 16 o305 144 +m 136 MachProj === 105 [[ 135 137 143 144 139 145 190 221 240 258 ]] > 1 316 317 16 o285 144 +d 143 loadN === _ 136 69 [[ 137 144 103 100 202 ]] > 2 315 316 17 o311 314 +c 97 IfTrue === 98 [[ 314 ]] > 2 316 317 16 o306 136 +x 105 membar_acquire === 316 0 147 0 0 [[ 136 104 ]] > 2 49 1 2 o10 143 +d 69 MachProj === 49 [[ 67 132 134 143 145 146 151 169 50 4 247 ]] > > > **2. Find loop body** > When I find myself in a loop, I try to localize the loop head and end, and map out at least one path between them. > `print_bfs(loop_end, 20, loop_head, "cox+")` > This provides us with a shortest path, given this path has a distance of at most 20. > > Example: > > (rr) p print_bfs(find_node(357), 30, find_node(587), "cox+") > Find shortest path: 357 -> 587. > > Backtrace target. > disdir dump > ------------------------------------- > 21 +c 587 CountedLoop === 587 361 121 [[ 587 589 603 604 ]] > 20 +x 589 MemBarAcquire === 587 1 603 1 1 [[ 588 590 ]] > 19 +c 590 Proj === 589 [[ 591 ]] > 18 +c 591 If === 590 574 [[ 592 605 ]] > 17 +c 592 IfTrue === 591 [[ 576 593 ]] > 16 +c 593 RangeCheck === 592 566 [[ 594 607 ]] > ... > 2 +c 296 RangeCheck === 295 272 [[ 299 316 ]] > 1 +c 299 IfTrue === 296 [[ 258 265 357 ]] > 0 c 357 CountedLoopEnd === 299 356 [[ 358 121 ]] [lt] I tried the functionality and found it useful, thanks! Just a few comments: - What does `dir` stand for in the table header? Whether the node is input/output to its parent? It would perhaps be more intuitive to decouple this information with the node category and put this in a separate column (`cat`?) 
- A whitespace is missing between the `dis` and `dir` columns when a target node is given. - If this functionality subsumes that provided by `Node::dump(int)` and `Node::dump_ctrl(int)`, could you replace them to avoid code duplication? (perhaps leaving both interfaces to avoid large changes across the code base). ------------- PR: https://git.openjdk.java.net/jdk/pull/8468 From rcastanedalo at openjdk.java.net Mon May 2 08:27:43 2022 From: rcastanedalo at openjdk.java.net (Roberto =?UTF-8?B?Q2FzdGHDsWVkYQ==?= Lozano) Date: Mon, 2 May 2022 08:27:43 GMT Subject: RFR: 8279622: C2: miscompilation of map pattern as a vector reduction In-Reply-To: <0sfUbR39P0xl9MjYDX7lzlAiLtbKAOw_uWEp-apjUck=.6c8b1b21-8ca7-47e6-ad37-7c73d31983e5@github.com> References: <0sfUbR39P0xl9MjYDX7lzlAiLtbKAOw_uWEp-apjUck=.6c8b1b21-8ca7-47e6-ad37-7c73d31983e5@github.com> Message-ID: On Mon, 2 May 2022 07:53:55 GMT, Tobias Hartmann wrote: > Nice analysis, looks good to me! Thanks, Tobias! ------------- PR: https://git.openjdk.java.net/jdk/pull/8464 From rcastanedalo at openjdk.java.net Mon May 2 08:27:46 2022 From: rcastanedalo at openjdk.java.net (Roberto =?UTF-8?B?Q2FzdGHDsWVkYQ==?= Lozano) Date: Mon, 2 May 2022 08:27:46 GMT Subject: RFR: 8283684: IGV: speed up filter application In-Reply-To: References: Message-ID: On Mon, 2 May 2022 08:06:09 GMT, Tobias Hartmann wrote: > Looks good. Thanks, Tobias! ------------- PR: https://git.openjdk.java.net/jdk/pull/8073 From jbhateja at openjdk.java.net Mon May 2 08:28:10 2022 From: jbhateja at openjdk.java.net (Jatin Bhateja) Date: Mon, 2 May 2022 08:28:10 GMT Subject: RFR: 8283894: Intrinsify compress and expand bits on x86 Message-ID: Summary of changes: - Patch intrinsifies following newly added Java SE APIs - Integer.compress - Integer.expand - Long.compress - Long.expand - Adds C2 IR nodes and corresponding ideal transformations for new operations. - We see around ~10x performance speedup due to intrinsification over X86 target. 
- Adds an IR framework based test to validate newly introduced IR transformations. Kindly review and share your feedback. Best Regards, Jatin ------------- Commit messages: - 8283894: Extending IR framework testcase with some functional test points. - 8283894: Intrinsify compress and expand bits on x86 Changes: https://git.openjdk.java.net/jdk/pull/8498/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=8498&range=00 Issue: https://bugs.openjdk.java.net/browse/JDK-8283894 Stats: 764 lines in 14 files changed: 752 ins; 1 del; 11 mod Patch: https://git.openjdk.java.net/jdk/pull/8498.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8498/head:pull/8498 PR: https://git.openjdk.java.net/jdk/pull/8498 From rcastanedalo at openjdk.java.net Mon May 2 08:28:43 2022 From: rcastanedalo at openjdk.java.net (Roberto =?UTF-8?B?Q2FzdGHDsWVkYQ==?= Lozano) Date: Mon, 2 May 2022 08:28:43 GMT Subject: RFR: 8280568: IGV: Phi inputs and pinned nodes are not scheduled correctly [v3] In-Reply-To: References: <-iYiPRIR5iUEQyHWbTW0j2wwWjPS7YSfsp_ikHPAW54=.fa427507-e74d-4cfa-bd49-8dc7c935346c@github.com> <0WHRKqSaLfL8uFlQok8n_q0zjs_btVuFV3FBH3GHuaw=.2babc11b-ebdb-423d-8fb1-fbd971ab5cb3@github.com> Message-ID: <3y5Sxhv_15dSoiTcr3v4JkYy5IbfvPVuDE9B4WsuL3U=.59c3bb4d-9b5b-4c1a-9061-4734fc08d3f0@github.com> On Mon, 2 May 2022 08:01:55 GMT, Tobias Hartmann wrote: > Awesome, looks good to me. Thanks, Tobias! ------------- PR: https://git.openjdk.java.net/jdk/pull/7493 From duke at openjdk.java.net Mon May 2 09:02:33 2022 From: duke at openjdk.java.net (Emanuel Peter) Date: Mon, 2 May 2022 09:02:33 GMT Subject: RFR: 8283775: VM support for graph querying in debugger with BFS traversal and node filtering [v2] In-Reply-To: References: Message-ID: > I implemented VM support for BFS traversal of IR nodes, where one can filter for input/output edges, and the node-type (control / data / memory). If one specifies the `target` node, we find a shortest path between `this` and `target`. 
With the `filter` string one can easily select which node types to traverse. > > `void Node::print_bfs(const uint max_distance, Node* target, const char* filter)` > > While a sufficient summary and description is in the comments above the function definition, I want to explain some useful use-cases here. > > Please let me know if you would find this helpful, or if you have any feedback to improve it. > Thanks, Emanuel > > **1. Better dump()** > The function is similar to `dump()`, we can also follow input / output edges up to a certain distance. There are a few improvements/added features: > > 1. Display the distance to the specified node. This way I can quickly assess how many nodes I have at a certain distance, and I can traverse edges from one distance to the next. The parent column shows the node one step closer to the BFS root (this). > 2. Node filtering with types taken from `type->category()` help one limit traversal to control, data or memory flow. The categories are the same as in IGV. > 3. Probably controversial: coloring of node types in terminal. By default it is off. May not work on all terminals. But I find it very useful to quickly separate memory, data and control - just as in IGV - but in my terminal! > 4. After matching, we create Mach nodes from the IR nodes. For most Mach nodes, there is an associated old node. Often an assert is triggered for Mach nodes, but the graph already breaks in the IR graph. It is thus very helpful to have the old node idx displayed. > 5. I also display the head node idx of the block a Mach node is scheduled for. This can be helpful if one gets bad dominance asserts, and needs to see why the blocks were scheduled in a bad way. > > Example: > > (rr) p find_node(7011)->print_bfs(2, 0, "cdmox+") > No target: perform BFS. 
> dis par dir dump > ------------------------------------------- > 0 7011 +d 7011 Bool === _ 7010 [[ 7012 ]] [gt] > 1 7011 +d 7010 CmpI === _ 216 213 [[ 7011 ]] > 2 7010 +d 216 AddI === _ 385 386 [[ 9066 7020 7010 385 ]] > 2 7010 +d 213 Phi === 6987 378 6982 [[ 278 .... ]] > > > Example with Mach nodes: > > (rr) p load->print_bfs(2, 0, "cdmox+#OB") > No target: perform BFS. > dis head idom dep old par dir dump > ------------------------------------------- > 0 314 315 18 o260 144 +d 144 loadF === 314 136 143 [[ 138 270 ]] > 1 314 315 18 _ 144 +c 314 Region === 314 97 [[ 314 95 144 ]] > 1 316 317 16 o305 144 +m 136 MachProj === 105 [[ 135 137 143 144 139 145 190 221 240 258 ]] > 1 316 317 16 o285 144 +d 143 loadN === _ 136 69 [[ 137 144 103 100 202 ]] > 2 315 316 17 o311 314 +c 97 IfTrue === 98 [[ 314 ]] > 2 316 317 16 o306 136 +x 105 membar_acquire === 316 0 147 0 0 [[ 136 104 ]] > 2 49 1 2 o10 143 +d 69 MachProj === 49 [[ 67 132 134 143 145 146 151 169 50 4 247 ]] > > > **2. Find loop body** > When I find myself in a loop, I try to localize the loop head and end, and map out at least one path between them. > `loop_end->print_bfs(20, loop_head, "cox+")` > This provides us with a shortest path, given this path has a distance of at most 20. > > Example: > > (rr) p find_node(357)->print_bfs(30, find_node(587), "cox+") > Find shortest path: 357 -> 587. > > Backtrace target. > disdir dump > ------------------------------------- > 21 +c 587 CountedLoop === 587 361 121 [[ 587 589 603 604 ]] > 20 +x 589 MemBarAcquire === 587 1 603 1 1 [[ 588 590 ]] > 19 +c 590 Proj === 589 [[ 591 ]] > 18 +c 591 If === 590 574 [[ 592 605 ]] > 17 +c 592 IfTrue === 591 [[ 576 593 ]] > 16 +c 593 RangeCheck === 592 566 [[ 594 607 ]] > ... 
> 2 +c 296 RangeCheck === 295 272 [[ 299 316 ]] > 1 +c 299 IfTrue === 296 [[ 258 265 357 ]] > 0 c 357 CountedLoopEnd === 299 356 [[ 358 121 ]] [lt] Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: refactored print_bfs to be member function of Node ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/8468/files - new: https://git.openjdk.java.net/jdk/pull/8468/files/8bae9d14..7405e3be Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=8468&range=01 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=8468&range=00-01 Stats: 23 lines in 2 files changed: 3 ins; 1 del; 19 mod Patch: https://git.openjdk.java.net/jdk/pull/8468.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8468/head:pull/8468 PR: https://git.openjdk.java.net/jdk/pull/8468 From duke at openjdk.java.net Mon May 2 09:02:34 2022 From: duke at openjdk.java.net (Emanuel Peter) Date: Mon, 2 May 2022 09:02:34 GMT Subject: RFR: 8283775: VM support for graph querying in debugger with BFS traversal and node filtering In-Reply-To: References: Message-ID: On Fri, 29 Apr 2022 13:04:55 GMT, Emanuel Peter wrote: > I implemented VM support for BFS traversal of IR nodes, where one can filter for input/output edges, and the node-type (control / data / memory). If one specifies the `target` node, we find a shortest path between `this` and `target`. With the `filter` string one can easily select which node types to traverse. > > `void Node::print_bfs(const uint max_distance, Node* target, const char* filter)` > > While a sufficient summary and description is in the comments above the function definition, I want to explain some useful use-cases here. > > Please let me know if you would find this helpful, or if you have any feedback to improve it. > Thanks, Emanuel > > **1. Better dump()** > The function is similar to `dump()`, we can also follow input / output edges up to a certain distance. 
There are a few improvements/added features: > > 1. Display the distance to the specified node. This way I can quickly assess how many nodes I have at a certain distance, and I can traverse edges from one distance to the next. The parent column shows the node one step closer to the BFS root (this). > 2. Node filtering with types taken from `type->category()` help one limit traversal to control, data or memory flow. The categories are the same as in IGV. > 3. Probably controversial: coloring of node types in terminal. By default it is off. May not work on all terminals. But I find it very useful to quickly separate memory, data and control - just as in IGV - but in my terminal! > 4. After matching, we create Mach nodes from the IR nodes. For most Mach nodes, there is an associated old node. Often an assert is triggered for Mach nodes, but the graph already breaks in the IR graph. It is thus very helpful to have the old node idx displayed. > 5. I also display the head node idx of the block a Mach node is scheduled for. This can be helpful if one gets bad dominance asserts, and needs to see why the blocks were scheduled in a bad way. > > Example: > > (rr) p find_node(7011)->print_bfs(2, 0, "cdmox+") > No target: perform BFS. > dis par dir dump > ------------------------------------------- > 0 7011 +d 7011 Bool === _ 7010 [[ 7012 ]] [gt] > 1 7011 +d 7010 CmpI === _ 216 213 [[ 7011 ]] > 2 7010 +d 216 AddI === _ 385 386 [[ 9066 7020 7010 385 ]] > 2 7010 +d 213 Phi === 6987 378 6982 [[ 278 .... ]] > > > Example with Mach nodes: > > (rr) p load->print_bfs(2, 0, "cdmox+#OB") > No target: perform BFS. 
> dis head idom dep old par dir dump > ------------------------------------------- > 0 314 315 18 o260 144 +d 144 loadF === 314 136 143 [[ 138 270 ]] > 1 314 315 18 _ 144 +c 314 Region === 314 97 [[ 314 95 144 ]] > 1 316 317 16 o305 144 +m 136 MachProj === 105 [[ 135 137 143 144 139 145 190 221 240 258 ]] > 1 316 317 16 o285 144 +d 143 loadN === _ 136 69 [[ 137 144 103 100 202 ]] > 2 315 316 17 o311 314 +c 97 IfTrue === 98 [[ 314 ]] > 2 316 317 16 o306 136 +x 105 membar_acquire === 316 0 147 0 0 [[ 136 104 ]] > 2 49 1 2 o10 143 +d 69 MachProj === 49 [[ 67 132 134 143 145 146 151 169 50 4 247 ]] > > > **2. Find loop body** > When I find myself in a loop, I try to localize the loop head and end, and map out at least one path between them. > `loop_end->print_bfs(20, loop_head, "cox+")` > This provides us with a shortest path, given this path has a distance of at most 20. > > Example: > > (rr) p find_node(357)->print_bfs(30, find_node(587), "cox+") > Find shortest path: 357 -> 587. > > Backtrace target. > disdir dump > ------------------------------------- > 21 +c 587 CountedLoop === 587 361 121 [[ 587 589 603 604 ]] > 20 +x 589 MemBarAcquire === 587 1 603 1 1 [[ 588 590 ]] > 19 +c 590 Proj === 589 [[ 591 ]] > 18 +c 591 If === 590 574 [[ 592 605 ]] > 17 +c 592 IfTrue === 591 [[ 576 593 ]] > 16 +c 593 RangeCheck === 592 566 [[ 594 607 ]] > ... > 2 +c 296 RangeCheck === 295 272 [[ 299 316 ]] > 1 +c 299 IfTrue === 296 [[ 258 265 357 ]] > 0 c 357 CountedLoopEnd === 299 356 [[ 358 121 ]] [lt] In conversation with @TobiHartmann we decided to make it a member function of `Node`, and also expose it in `node.hpp`, so that it can be used for print-debugging. 
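
For readers who have not looked at the patch, the shortest-path mode described above boils down to a breadth-first search that records a parent per visited node and then backtraces from the target. The following is only a conceptual sketch in Java over a plain adjacency map, not the HotSpot `Node::print_bfs` code; the class `GraphBfs`, the method `shortestPath`, and the way `maxDistance` cuts off the search are illustrative assumptions.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;

// Hypothetical helper, not part of HotSpot: BFS with parent tracking and backtrace.
final class GraphBfs {
    // Returns the node ids on a shortest path from start to target (both inclusive),
    // or an empty list if target is not reachable within maxDistance edges.
    static List<Integer> shortestPath(Map<Integer, List<Integer>> edges,
                                      int start, int target, int maxDistance) {
        Map<Integer, Integer> parent = new HashMap<>();   // node -> node one step closer to start
        Map<Integer, Integer> distance = new HashMap<>(); // node -> BFS distance from start
        Queue<Integer> worklist = new ArrayDeque<>();
        parent.put(start, start);
        distance.put(start, 0);
        worklist.add(start);
        while (!worklist.isEmpty()) {
            int n = worklist.remove();
            if (n == target) {
                break; // BFS order guarantees this is a shortest path
            }
            int d = distance.get(n);
            if (d == maxDistance) {
                continue; // do not expand beyond the distance limit
            }
            for (int next : edges.getOrDefault(n, List.of())) {
                if (!parent.containsKey(next)) {
                    parent.put(next, n);
                    distance.put(next, d + 1);
                    worklist.add(next);
                }
            }
        }
        if (!parent.containsKey(target)) {
            return List.of();
        }
        // Backtrace from target to start via the parent map, then reverse.
        List<Integer> path = new ArrayList<>();
        for (int n = target; n != start; n = parent.get(n)) {
            path.add(n);
        }
        path.add(start);
        Collections.reverse(path);
        return path;
    }

    public static void main(String[] args) {
        Map<Integer, List<Integer>> edges = Map.of(
            1, List.of(2, 3),
            2, List.of(4),
            3, List.of(4),
            4, List.of(5));
        System.out.println(shortestPath(edges, 1, 5, 10));
        System.out.println(shortestPath(edges, 1, 5, 2));
    }
}
```

In the sketch the distance limit plays the role of the first `print_bfs` argument: a target further away than the limit is simply reported as not found.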
-------------

PR: https://git.openjdk.java.net/jdk/pull/8468

From duke at openjdk.java.net Mon May 2 09:15:37 2022
From: duke at openjdk.java.net (Emanuel Peter)
Date: Mon, 2 May 2022 09:15:37 GMT
Subject: RFR: 8283775: VM support for graph querying in debugger with BFS traversal and node filtering
In-Reply-To:
References:
Message-ID:

On Mon, 2 May 2022 08:23:24 GMT, Roberto Castañeda Lozano wrote:

>> I implemented VM support for BFS traversal of IR nodes, where one can filter for input/output edges, and the node-type (control / data / memory). If one specifies the `target` node, we find a shortest path between `this` and `target`. With the `filter` string one can easily select which node types to traverse.
>>
>> `void Node::print_bfs(const uint max_distance, Node* target, const char* filter)`
>>
>> While a sufficient summary and description is in the comments above the function definition, I want to explain some useful use-cases here.
>>
>> Please let me know if you would find this helpful, or if you have any feedback to improve it.
>> Thanks, Emanuel
>>
>> **1. Better dump()**
>> The function is similar to `dump()`, we can also follow input / output edges up to a certain distance. There are a few improvements/added features:
>>
>> 1. Display the distance to the specified node. This way I can quickly assess how many nodes I have at a certain distance, and I can traverse edges from one distance to the next. The parent column shows the node one step closer to the BFS root (this).
>> 2. Node filtering with types taken from `type->category()` help one limit traversal to control, data or memory flow. The categories are the same as in IGV.
>> 3. Probably controversial: coloring of node types in terminal. By default it is off. May not work on all terminals. But I find it very useful to quickly separate memory, data and control - just as in IGV - but in my terminal!
>> 4. After matching, we create Mach nodes from the IR nodes.
For most Mach nodes, there is an associated old node. Often an assert is triggered for Mach nodes, but the graph already breaks in the IR graph. It is thus very helpful to have the old node idx displayed. >> 5. I also display the head node idx of the block a Mach node is scheduled for. This can be helpful if one gets bad dominance asserts, and needs to see why the blocks were scheduled in a bad way. >> >> Example: >> >> (rr) p find_node(7011)->print_bfs(2, 0, "cdmox+") >> No target: perform BFS. >> dis par dir dump >> ------------------------------------------- >> 0 7011 +d 7011 Bool === _ 7010 [[ 7012 ]] [gt] >> 1 7011 +d 7010 CmpI === _ 216 213 [[ 7011 ]] >> 2 7010 +d 216 AddI === _ 385 386 [[ 9066 7020 7010 385 ]] >> 2 7010 +d 213 Phi === 6987 378 6982 [[ 278 .... ]] >> >> >> Example with Mach nodes: >> >> (rr) p load->print_bfs(2, 0, "cdmox+#OB") >> No target: perform BFS. >> dis head idom dep old par dir dump >> ------------------------------------------- >> 0 314 315 18 o260 144 +d 144 loadF === 314 136 143 [[ 138 270 ]] >> 1 314 315 18 _ 144 +c 314 Region === 314 97 [[ 314 95 144 ]] >> 1 316 317 16 o305 144 +m 136 MachProj === 105 [[ 135 137 143 144 139 145 190 221 240 258 ]] >> 1 316 317 16 o285 144 +d 143 loadN === _ 136 69 [[ 137 144 103 100 202 ]] >> 2 315 316 17 o311 314 +c 97 IfTrue === 98 [[ 314 ]] >> 2 316 317 16 o306 136 +x 105 membar_acquire === 316 0 147 0 0 [[ 136 104 ]] >> 2 49 1 2 o10 143 +d 69 MachProj === 49 [[ 67 132 134 143 145 146 151 169 50 4 247 ]] >> >> >> **2. Find loop body** >> When I find myself in a loop, I try to localize the loop head and end, and map out at least one path between them. >> `loop_end->print_bfs(20, loop_head, "cox+")` >> This provides us with a shortest path, given this path has a distance of at most 20. >> >> Example: >> >> (rr) p find_node(357)->print_bfs(30, find_node(587), "cox+") >> Find shortest path: 357 -> 587. >> >> Backtrace target. 
>> disdir dump
>> -------------------------------------
>> 21 +c 587 CountedLoop === 587 361 121 [[ 587 589 603 604 ]]
>> 20 +x 589 MemBarAcquire === 587 1 603 1 1 [[ 588 590 ]]
>> 19 +c 590 Proj === 589 [[ 591 ]]
>> 18 +c 591 If === 590 574 [[ 592 605 ]]
>> 17 +c 592 IfTrue === 591 [[ 576 593 ]]
>> 16 +c 593 RangeCheck === 592 566 [[ 594 607 ]]
>> ...
>> 2 +c 296 RangeCheck === 295 272 [[ 299 316 ]]
>> 1 +c 299 IfTrue === 296 [[ 258 265 357 ]]
>> 0 c 357 CountedLoopEnd === 299 356 [[ 358 121 ]] [lt]
>
> I tried the functionality and found it useful, thanks! Just a few comments:
> - What does `dir` stand for in the table header? Whether the node is input/output to its parent? It would perhaps be more intuitive to decouple this information from the node category and put this in a separate column (`cat`?)
> - A whitespace is missing between the `dis` and `dir` columns when a target node is given.
> - If this functionality subsumes that provided by `Node::dump(int)` and `Node::dump_ctrl(int)`, could you replace them to avoid code duplication? (perhaps leaving both interfaces to avoid large changes across the code base).

Thanks for looking at it, @robcasloz! In reply to your comments:

> I tried the functionality and found it useful, thanks! Just a few comments:
>
> * What does `dir` stand for in the table header? Whether the node is input/output to its parent? It would perhaps be more intuitive to decouple this information from the node category and put this in a separate column (`cat`?)
> * A whitespace is missing between the `dis` and `dir` columns when a target node is given.

I like the idea of making a `cat` column. And I will have `+/-` in a separate column, but only displayed if both directions are enabled to save space.

> * If this functionality subsumes that provided by `Node::dump(int)` and `Node::dump_ctrl(int)`, could you replace them to avoid code duplication? (perhaps leaving both interfaces to avoid large changes across the code base).
Yes, it would be nice to avoid code duplication. The question is if people will be ok with having their tools replaced. Maybe we can do it as you say: leave the interfaces identical, and have them redirect to `print_bfs` with the relevant inputs.

-------------

PR: https://git.openjdk.java.net/jdk/pull/8468

From jiefu at openjdk.java.net Mon May 2 10:39:40 2022
From: jiefu at openjdk.java.net (Jie Fu)
Date: Mon, 2 May 2022 10:39:40 GMT
Subject: RFR: 8285980: Several tests in compiler/c2/irTests miss @requires vm.compiler2.enabled
In-Reply-To:
References: <_EgVjIjyFG1tQHsjX6smVcBCPqfwRhVSukyQngQcktc=.d2599deb-5ff9-4aca-a384-6d0e132024c7@github.com>
Message-ID: <2Mgbu_Y7ErTPU2mo-E4kDLbLN_O7PQird8FR71e8eDw=.8f628df3-26c1-4cbd-ae5c-d891827b2ab5@github.com>

On Mon, 2 May 2022 07:48:40 GMT, Tobias Hartmann wrote:

> Looks good and trivial.

Thanks @TobiHartmann for the review.

-------------

PR: https://git.openjdk.java.net/jdk/pull/8495

From jiefu at openjdk.java.net Mon May 2 10:44:45 2022
From: jiefu at openjdk.java.net (Jie Fu)
Date: Mon, 2 May 2022 10:44:45 GMT
Subject: Integrated: 8285980: Several tests in compiler/c2/irTests miss @requires vm.compiler2.enabled
In-Reply-To: <_EgVjIjyFG1tQHsjX6smVcBCPqfwRhVSukyQngQcktc=.d2599deb-5ff9-4aca-a384-6d0e132024c7@github.com>
References: <_EgVjIjyFG1tQHsjX6smVcBCPqfwRhVSukyQngQcktc=.d2599deb-5ff9-4aca-a384-6d0e132024c7@github.com>
Message-ID:

On Mon, 2 May 2022 01:53:57 GMT, Jie Fu wrote:

> Hi all,
>
> Several tests in compiler/c2/irTests fail if C2 is not available.
>
> The following 7 tests use C2-only VM flags.
> > compiler/c2/irTests/TestSkeletonPredicates.java > compiler/c2/irTests/TestFewIterationsCountedLoop.java > compiler/c2/irTests/TestDuplicateBackedge.java > compiler/c2/irTests/TestStripMiningDropsSafepoint.java > compiler/c2/irTests/TestCountedLoopSafepoint.java > compiler/c2/irTests/TestLongRangeChecks.java > compiler/c2/irTests/TestSuperwordFailsUnrolling.java > > > The following two tests assert that C2 must be available. > > compiler/c2/irTests/TestIRLShiftIdeal_XPlusX_LShiftC.java > compiler/c2/irTests/TestIRAddIdealNotXPlusC.java > > > It would be better to add `@requires vm.compiler2.enabled` to these tests. > > Thanks. > Best regards, > Jie This pull request has now been integrated. Changeset: 1f9f8738 Author: Jie Fu URL: https://git.openjdk.java.net/jdk/commit/1f9f8738f344ecbc0270608ee84eb92138f349a2 Stats: 9 lines in 9 files changed: 9 ins; 0 del; 0 mod 8285980: Several tests in compiler/c2/irTests miss @requires vm.compiler2.enabled Reviewed-by: thartmann ------------- PR: https://git.openjdk.java.net/jdk/pull/8495 From duke at openjdk.java.net Mon May 2 12:50:24 2022 From: duke at openjdk.java.net (Emanuel Peter) Date: Mon, 2 May 2022 12:50:24 GMT Subject: RFR: 8283775: VM support for graph querying in debugger with BFS traversal and node filtering [v3] In-Reply-To: References: Message-ID: > I implemented VM support for BFS traversal of IR nodes, where one can filter for input/output edges, and the node-type (control / data / memory). If one specifies the `target` node, we find a shortest path between `this` and `target`. With the `filter` string one can easily select which node types to traverse. > > `void Node::print_bfs(const uint max_distance, Node* target, const char* filter)` > > While a sufficient summary and description is in the comments above the function definition, I want to explain some useful use-cases here. > > Please let me know if you would find this helpful, or if you have any feedback to improve it. > Thanks, Emanuel > > **1. 
Better dump()** > The function is similar to `dump()`, we can also follow input / output edges up to a certain distance. There are a few improvements/added features: > > 1. Display the distance to the specified node. This way I can quickly assess how many nodes I have at a certain distance, and I can traverse edges from one distance to the next. The parent column shows the node one step closer to the BFS root (this). > 2. Node filtering with types taken from `type->category()` help one limit traversal to control, data or memory flow. The categories are the same as in IGV. > 3. Probably controversial: coloring of node types in terminal. By default it is off. May not work on all terminals. But I find it very useful to quickly separate memory, data and control - just as in IGV - but in my terminal! > 4. After matching, we create Mach nodes from the IR nodes. For most Mach nodes, there is an associated old node. Often an assert is triggered for Mach nodes, but the graph already breaks in the IR graph. It is thus very helpful to have the old node idx displayed. > 5. I also display the head node idx of the block a Mach node is scheduled for. This can be helpful if one gets bad dominance asserts, and needs to see why the blocks were scheduled in a bad way. > > Example: > > (rr) p find_node(7011)->print_bfs(2, 0, "cdmox+") > No target: perform BFS. > dis par dir dump > ------------------------------------------- > 0 7011 +d 7011 Bool === _ 7010 [[ 7012 ]] [gt] > 1 7011 +d 7010 CmpI === _ 216 213 [[ 7011 ]] > 2 7010 +d 216 AddI === _ 385 386 [[ 9066 7020 7010 385 ]] > 2 7010 +d 213 Phi === 6987 378 6982 [[ 278 .... ]] > > > Example with Mach nodes: > > (rr) p load->print_bfs(2, 0, "cdmox+#OB") > No target: perform BFS. 
> dis head idom dep old par dir dump > ------------------------------------------- > 0 314 315 18 o260 144 +d 144 loadF === 314 136 143 [[ 138 270 ]] > 1 314 315 18 _ 144 +c 314 Region === 314 97 [[ 314 95 144 ]] > 1 316 317 16 o305 144 +m 136 MachProj === 105 [[ 135 137 143 144 139 145 190 221 240 258 ]] > 1 316 317 16 o285 144 +d 143 loadN === _ 136 69 [[ 137 144 103 100 202 ]] > 2 315 316 17 o311 314 +c 97 IfTrue === 98 [[ 314 ]] > 2 316 317 16 o306 136 +x 105 membar_acquire === 316 0 147 0 0 [[ 136 104 ]] > 2 49 1 2 o10 143 +d 69 MachProj === 49 [[ 67 132 134 143 145 146 151 169 50 4 247 ]] > > > **2. Find loop body** > When I find myself in a loop, I try to localize the loop head and end, and map out at least one path between them. > `loop_end->print_bfs(20, loop_head, "cox+")` > This provides us with a shortest path, given this path has a distance of at most 20. > > Example: > > (rr) p find_node(357)->print_bfs(30, find_node(587), "cox+") > Find shortest path: 357 -> 587. > > Backtrace target. > disdir dump > ------------------------------------- > 21 +c 587 CountedLoop === 587 361 121 [[ 587 589 603 604 ]] > 20 +x 589 MemBarAcquire === 587 1 603 1 1 [[ 588 590 ]] > 19 +c 590 Proj === 589 [[ 591 ]] > 18 +c 591 If === 590 574 [[ 592 605 ]] > 17 +c 592 IfTrue === 591 [[ 576 593 ]] > 16 +c 593 RangeCheck === 592 566 [[ 594 607 ]] > ... 
> 2 +c 296 RangeCheck === 295 272 [[ 299 316 ]] > 1 +c 299 IfTrue === 296 [[ 258 265 357 ]] > 0 c 357 CountedLoopEnd === 299 356 [[ 358 121 ]] [lt] Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: small refactoring and some beautification/comments ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/8468/files - new: https://git.openjdk.java.net/jdk/pull/8468/files/7405e3be..da91b4ec Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=8468&range=02 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=8468&range=01-02 Stats: 64 lines in 2 files changed: 25 ins; 19 del; 20 mod Patch: https://git.openjdk.java.net/jdk/pull/8468.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8468/head:pull/8468 PR: https://git.openjdk.java.net/jdk/pull/8468 From jiefu at openjdk.java.net Mon May 2 14:16:03 2022 From: jiefu at openjdk.java.net (Jie Fu) Date: Mon, 2 May 2022 14:16:03 GMT Subject: RFR: 8286013: Incorrect test configurations for compiler/stable/TestStableShort.java Message-ID: <_Ac0IBXRquCHlS4ejEqh6zUWp8IzWLgLa9R3h334sVc=.29336c90-6772-48f2-aa98-21c44f10ea5f@github.com> Hi all, The four test configurations for `compiler/stable/TestStableShort.java` are the same. 
/* * @test TestStableShort * @summary tests on stable fields and arrays * @library /test/lib / * @modules java.base/jdk.internal.misc * @modules java.base/jdk.internal.vm.annotation * @build sun.hotspot.WhiteBox * * @run main/bootclasspath/othervm -XX:+UnlockDiagnosticVMOptions -XX:+WhiteBoxAPI -Xcomp * -XX:CompileOnly=::get,::get1,::get2,::get3,::get4 * -XX:-TieredCompilation * -XX:+FoldStableValues * compiler.stable.TestStableShort * @run main/bootclasspath/othervm -XX:+UnlockDiagnosticVMOptions -XX:+WhiteBoxAPI -Xcomp * -XX:CompileOnly=::get,::get1,::get2,::get3,::get4 * -XX:-TieredCompilation * -XX:+FoldStableValues * compiler.stable.TestStableShort * * @run main/bootclasspath/othervm -XX:+UnlockDiagnosticVMOptions -XX:+WhiteBoxAPI -Xcomp * -XX:CompileOnly=::get,::get1,::get2,::get3,::get4 * -XX:-TieredCompilation * -XX:+FoldStableValues * compiler.stable.TestStableShort * @run main/bootclasspath/othervm -XX:+UnlockDiagnosticVMOptions -XX:+WhiteBoxAPI -Xcomp * -XX:CompileOnly=::get,::get1,::get2,::get3,::get4 * -XX:-TieredCompilation * -XX:+FoldStableValues * compiler.stable.TestStableShort */ I believe this is a copy-paste mistake. Let's fix it. The patch just follows the test configurations in TestStable{Byte/Char/Int/Long/Float/Double/Object}.java Thanks. 
Best regards, Jie ------------- Commit messages: - 8286013: Incorrect test configurations for compiler/stable/TestStableShort.java Changes: https://git.openjdk.java.net/jdk/pull/8503/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=8503&range=00 Issue: https://bugs.openjdk.java.net/browse/JDK-8286013 Stats: 4 lines in 1 file changed: 0 ins; 0 del; 4 mod Patch: https://git.openjdk.java.net/jdk/pull/8503.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8503/head:pull/8503 PR: https://git.openjdk.java.net/jdk/pull/8503 From shade at openjdk.java.net Mon May 2 14:38:45 2022 From: shade at openjdk.java.net (Aleksey Shipilev) Date: Mon, 2 May 2022 14:38:45 GMT Subject: RFR: 8286013: Incorrect test configurations for compiler/stable/TestStableShort.java In-Reply-To: <_Ac0IBXRquCHlS4ejEqh6zUWp8IzWLgLa9R3h334sVc=.29336c90-6772-48f2-aa98-21c44f10ea5f@github.com> References: <_Ac0IBXRquCHlS4ejEqh6zUWp8IzWLgLa9R3h334sVc=.29336c90-6772-48f2-aa98-21c44f10ea5f@github.com> Message-ID: On Mon, 2 May 2022 14:08:57 GMT, Jie Fu wrote: > Hi all, > > The four test configurations for `compiler/stable/TestStableShort.java` are the same. 
> > /* > * @test TestStableShort > * @summary tests on stable fields and arrays > * @library /test/lib / > * @modules java.base/jdk.internal.misc > * @modules java.base/jdk.internal.vm.annotation > * @build sun.hotspot.WhiteBox > * > * @run main/bootclasspath/othervm -XX:+UnlockDiagnosticVMOptions -XX:+WhiteBoxAPI -Xcomp > * -XX:CompileOnly=::get,::get1,::get2,::get3,::get4 > * -XX:-TieredCompilation > * -XX:+FoldStableValues > * compiler.stable.TestStableShort > * @run main/bootclasspath/othervm -XX:+UnlockDiagnosticVMOptions -XX:+WhiteBoxAPI -Xcomp > * -XX:CompileOnly=::get,::get1,::get2,::get3,::get4 > * -XX:-TieredCompilation > * -XX:+FoldStableValues > * compiler.stable.TestStableShort > * > * @run main/bootclasspath/othervm -XX:+UnlockDiagnosticVMOptions -XX:+WhiteBoxAPI -Xcomp > * -XX:CompileOnly=::get,::get1,::get2,::get3,::get4 > * -XX:-TieredCompilation > * -XX:+FoldStableValues > * compiler.stable.TestStableShort > * @run main/bootclasspath/othervm -XX:+UnlockDiagnosticVMOptions -XX:+WhiteBoxAPI -Xcomp > * -XX:CompileOnly=::get,::get1,::get2,::get3,::get4 > * -XX:-TieredCompilation > * -XX:+FoldStableValues > * compiler.stable.TestStableShort > */ > > > I believe this is a copy-paste mistake. > Let's fix it. > > The patch just follows the test configurations in TestStable{Byte/Char/Int/Long/Float/Double/Object}.java > > Thanks. > Best regards, > Jie Yes, this looks like a copy-paste error. ------------- Marked as reviewed by shade (Reviewer). 
PR: https://git.openjdk.java.net/jdk/pull/8503 From thartmann at openjdk.java.net Mon May 2 14:46:58 2022 From: thartmann at openjdk.java.net (Tobias Hartmann) Date: Mon, 2 May 2022 14:46:58 GMT Subject: RFR: 8286013: Incorrect test configurations for compiler/stable/TestStableShort.java In-Reply-To: <_Ac0IBXRquCHlS4ejEqh6zUWp8IzWLgLa9R3h334sVc=.29336c90-6772-48f2-aa98-21c44f10ea5f@github.com> References: <_Ac0IBXRquCHlS4ejEqh6zUWp8IzWLgLa9R3h334sVc=.29336c90-6772-48f2-aa98-21c44f10ea5f@github.com> Message-ID: On Mon, 2 May 2022 14:08:57 GMT, Jie Fu wrote: > Hi all, > > The four test configurations for `compiler/stable/TestStableShort.java` are the same. > > /* > * @test TestStableShort > * @summary tests on stable fields and arrays > * @library /test/lib / > * @modules java.base/jdk.internal.misc > * @modules java.base/jdk.internal.vm.annotation > * @build sun.hotspot.WhiteBox > * > * @run main/bootclasspath/othervm -XX:+UnlockDiagnosticVMOptions -XX:+WhiteBoxAPI -Xcomp > * -XX:CompileOnly=::get,::get1,::get2,::get3,::get4 > * -XX:-TieredCompilation > * -XX:+FoldStableValues > * compiler.stable.TestStableShort > * @run main/bootclasspath/othervm -XX:+UnlockDiagnosticVMOptions -XX:+WhiteBoxAPI -Xcomp > * -XX:CompileOnly=::get,::get1,::get2,::get3,::get4 > * -XX:-TieredCompilation > * -XX:+FoldStableValues > * compiler.stable.TestStableShort > * > * @run main/bootclasspath/othervm -XX:+UnlockDiagnosticVMOptions -XX:+WhiteBoxAPI -Xcomp > * -XX:CompileOnly=::get,::get1,::get2,::get3,::get4 > * -XX:-TieredCompilation > * -XX:+FoldStableValues > * compiler.stable.TestStableShort > * @run main/bootclasspath/othervm -XX:+UnlockDiagnosticVMOptions -XX:+WhiteBoxAPI -Xcomp > * -XX:CompileOnly=::get,::get1,::get2,::get3,::get4 > * -XX:-TieredCompilation > * -XX:+FoldStableValues > * compiler.stable.TestStableShort > */ > > > I believe this is a copy-paste mistake. > Let's fix it. 
>
> The patch just follows the test configurations in TestStable{Byte/Char/Int/Long/Float/Double/Object}.java
>
> Thanks.
> Best regards,
> Jie

Good catch, looks good to me. This was already reported by [JDK-8203318](https://bugs.openjdk.java.net/browse/JDK-8203318) but then the fix missed this test.

-------------

Marked as reviewed by thartmann (Reviewer).

PR: https://git.openjdk.java.net/jdk/pull/8503

From jvernee at openjdk.java.net Mon May 2 14:47:00 2022
From: jvernee at openjdk.java.net (Jorn Vernee)
Date: Mon, 2 May 2022 14:47:00 GMT
Subject: RFR: 8286002: Add support for intel syntax to capstone hsdis
Message-ID:

This patch adds support for outputting assembly in intel syntax to capstone hsdis, through the `-XX:PrintAssemblyOptions=intel` flag.

Snippet of example output:

```
[Verified Entry Point]
  # {method} {0x0000021c8a4002d8} 'add' '(II)I' in 'Main'
  # parm0: rdx = int
  # parm1: r8 = int
  # [sp+0x20] (sp of caller)
  0x0000021cfa713780: sub rsp, 0x18
  0x0000021cfa713787: mov qword ptr [rsp + 0x10], rbp
  0x0000021cfa71378c: mov eax, edx
  0x0000021cfa71378e: add eax, r8d
  0x0000021cfa713791: add rsp, 0x10
  0x0000021cfa713795: pop rbp
  0x0000021cfa713796: cmp rsp, qword ptr [r15 + 0x338] ; {poll_return}
  0x0000021cfa71379d: ja 0x21cfa7137a4
  0x0000021cfa7137a3: ret
  0x0000021cfa7137a4: movabs r10, 0x21cfa713796 ; {internal_word}
  0x0000021cfa7137ae: mov qword ptr [r15 + 0x350], r10
  0x0000021cfa7137b5: jmp 0x21cfa6f3400 ; {runtime_call SafepointBlob}
```

Testing:
- Manual testing with and without `-XX:PrintAssemblyOptions=intel`, to make sure that both syntaxes work.
- Manual testing with several different invalid options such as `-XX:PrintAssemblyOptions=asdf,,` to make sure that invalid options are handled correctly.
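
The kind of option handling exercised by the tests above can be pictured with a small sketch. This is written in Java purely for illustration and is not the hsdis code (which lives in the native plugin); the class `DisasmOptions`, its fields, the error text, and the choice to skip empty tokens are all assumptions. Only the option name `intel` is taken from the patch description.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of parsing a comma-separated option string.
final class DisasmOptions {
    boolean intelSyntax = false;                      // assumed default: Intel syntax off
    final List<String> errors = new ArrayList<>();    // messages for unrecognized options

    static DisasmOptions parse(String options) {
        DisasmOptions result = new DisasmOptions();
        for (String token : options.split(",")) {
            String opt = token.trim();
            if (opt.isEmpty()) {
                continue; // assumption: stray commas are tolerated silently
            }
            if (opt.equals("intel")) {
                result.intelSyntax = true;
            } else {
                // assumption: unknown options are reported rather than aborting
                result.errors.add("unknown option: " + opt);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        DisasmOptions bad = parse("asdf,,");
        System.out.println(bad.intelSyntax + " " + bad.errors);
        DisasmOptions good = parse("intel");
        System.out.println(good.intelSyntax + " " + good.errors);
    }
}
```

Collecting errors in a list instead of printing them directly loosely mirrors the "print option errors through provided callback" commit, in that the caller decides where the messages go.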

Thanks,
Jorn

-------------

Commit messages:
 - Remove unneeded stdio include
 - Print option errors through provided callback
 - Add defaults
 - Add support for intel syntax to capstone hsdis

Changes: https://git.openjdk.java.net/jdk/pull/8502/files
Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=8502&range=00
Issue: https://bugs.openjdk.java.net/browse/JDK-8286002
Stats: 37 lines in 1 file changed: 33 ins; 0 del; 4 mod
Patch: https://git.openjdk.java.net/jdk/pull/8502.diff
Fetch: git fetch https://git.openjdk.java.net/jdk pull/8502/head:pull/8502

PR: https://git.openjdk.java.net/jdk/pull/8502

From roland at openjdk.java.net Mon May 2 14:57:43 2022
From: roland at openjdk.java.net (Roland Westrelin)
Date: Mon, 2 May 2022 14:57:43 GMT
Subject: RFR: 8281429: PhiNode::Value() is too conservative for tripcount of CountedLoop [v7]
In-Reply-To:
References:
Message-ID:

On Fri, 29 Apr 2022 23:55:06 GMT, Vladimir Kozlov wrote:

> I am concerned about the unsigned arithmetic used to calculate the new limit for the long indexing case. The test could simply fill up an array and you then check that values in it are correct (and no out of bounds references). You can choose a big `stride` to run the test fast.

The problem is this code:

https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/loopnode.cpp#L1762

It makes it impossible to construct a counted loop with limit Integer.MAX_VALUE with a big stride.

-------------

PR: https://git.openjdk.java.net/jdk/pull/7823

From kvn at openjdk.java.net Mon May 2 15:45:46 2022
From: kvn at openjdk.java.net (Vladimir Kozlov)
Date: Mon, 2 May 2022 15:45:46 GMT
Subject: RFR: 8283684: IGV: speed up filter application
In-Reply-To:
References:
Message-ID:

On Fri, 1 Apr 2022 10:19:21 GMT, Roberto Castañeda Lozano wrote:

> This change improves view creation time by creating a single JavaScript engine shared among all filters, rather than creating an engine every time a filter is applied.
Since creating a JavaScript engine is a costly operation, this change speeds up view creation substantially for small and medium-sized graphs as soon as any filter is applied. This includes the default IGV configuration, where the "Color by category" filter is enabled.
> 
> #### Testing
> 
> ##### Functionality
> 
> - Tested manually applying different filter subsets and differing on a small selection of graphs, for JDK 11 and 17 (which use different versions and ways of packaging the JavaScript engine).
> 
> - Tested automatically viewing thousands of graphs with different subsets of filters enabled (by instrumenting IGV to view graphs as they are loaded and running `java -Xcomp -XX:-TieredCompilation -XX:PrintIdealGraphLevel=4`).
> 
> ##### Performance
> 
> Measured the view creation time for the default sea-of-nodes view on a selection of 94 medium-sized graphs (200-493 nodes) for different subsets of filters. Before the change, the view creation time increases roughly linearly with the number of applied filters (since an engine is created for each filter application). After the change, the view creation time remains roughly constant (even slightly decreasing) as the number of applied filters increases, yielding an average speedup of 2.4x for the default IGV configuration, and up to 8.2x when five filters are applied. The speedup is expected to diminish for larger graphs where engine creation does not dominate view creation time. The complete results are [attached](https://github.com/openjdk/jdk/files/8396784/performance-evaluation.ods) (note that each measurement in the sheet corresponds to the median of ten runs).

good

-------------

Marked as reviewed by kvn (Reviewer).
PR: https://git.openjdk.java.net/jdk/pull/8073 From kvn at openjdk.java.net Mon May 2 16:10:06 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Mon, 2 May 2022 16:10:06 GMT Subject: RFR: 8281429: PhiNode::Value() is too conservative for tripcount of CountedLoop [v7] In-Reply-To: References: Message-ID: On Mon, 25 Apr 2022 09:29:38 GMT, Roland Westrelin wrote: >> The type for the iv phi of a counted loop is computed from the types >> of the phi on loop entry and the type of the limit from the exit >> test. Because the exit test is applied to the iv after increment, the >> type of the iv phi is at least one less than the limit (for a positive >> stride, one more for a negative stride). >> >> Also, for a stride whose absolute value is not 1 and constant init and >> limit values, it's possible to compute accurately the iv phi type. >> >> This change caused a few failures and I had to make a few adjustments >> to loop opts code as well. > > Roland Westrelin has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains 19 additional commits since the last revision: > > - undo unneeded change > - Merge branch 'master' into JDK-8281429 > - redo change removed by error > - review > - Merge branch 'master' into JDK-8281429 > - undo > - test fix > - more test > - test & fix > - other fix > - ... and 9 more: https://git.openjdk.java.net/jdk/compare/1ee3c42a...19b38997 I am fine with testing range [MIN_VALUE + stride, MAX_VALUE - stride] to exercise unsigned arithmetic. Whatever maximum loopopts allows. 
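
To make the quoted description of the iv phi type concrete, here is a small sketch of the value range for the simplest case: constant `init`, `limit` and positive `stride`. It is not the C2 implementation. `IvRange` and `iv_phi_range` are hypothetical names, and the sketch assumes a loop of the shape `for (int i = init; i < limit; i += stride)` with `init < limit`, computing in 64 bits so that the arithmetic cannot overflow.

```cpp
#include <cstdint>

// Hypothetical closed interval [lo, hi] of values the induction variable phi can hold.
struct IvRange {
  int64_t lo;
  int64_t hi;
};

// Assumes stride > 0 and init < limit, for a loop shaped like
//   for (int i = init; i < limit; i += stride) { ... }
// The exit test is applied after the increment, so the phi only ever sees
// values strictly below the limit that are reachable from init in whole strides.
inline IvRange iv_phi_range(int32_t init, int32_t limit, int32_t stride) {
  const int64_t span = static_cast<int64_t>(limit) - 1 - static_cast<int64_t>(init);
  const int64_t steps = span / stride;  // number of full strides that stay below limit
  return { init, static_cast<int64_t>(init) + steps * stride };
}
```

For `init = 0`, `limit = 10`, `stride = 4`, the phi only ever holds 0, 4 and 8, so the sketch yields [0, 8] rather than the looser [0, 9] obtained from "one less than the limit" alone.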
------------- PR: https://git.openjdk.java.net/jdk/pull/7823 From psandoz at openjdk.java.net Mon May 2 16:15:46 2022 From: psandoz at openjdk.java.net (Paul Sandoz) Date: Mon, 2 May 2022 16:15:46 GMT Subject: RFR: 8283894: Intrinsify compress and expand bits on x86 In-Reply-To: References: Message-ID: On Mon, 2 May 2022 08:19:53 GMT, Jatin Bhateja wrote: > Summary of changes: > > - Patch intrinsifies following newly added Java SE APIs > - Integer.compress > - Integer.expand > - Long.compress > - Long.expand > > - Adds C2 IR nodes and corresponding ideal transformations for new operations. > - We see around ~10x performance speedup due to intrinsification over X86 target. > - Adds an IR framework based test to validate newly introduced IR transformations. > > Kindly review and share your feedback. > > Best Regards, > Jatin Can you update the jtreg tests: 1. Modify `CompressExpandTest` to run with and without the intrinsic enabled 2. Disable (by default) `CompressExpandSanityTest` ? ------------- PR: https://git.openjdk.java.net/jdk/pull/8498 From duke at openjdk.java.net Mon May 2 18:44:40 2022 From: duke at openjdk.java.net (Emanuel Peter) Date: Mon, 2 May 2022 18:44:40 GMT Subject: RFR: 8282555: Missing memory edge when spilling MoveF2I, MoveD2L etc Message-ID: Update: After the inputs from @jatin-bhateja, and verifying with @vnkozlov and @TobiHartmann , I have implemented a much simpler fix: Whenever there is no pre-allocated space before the inputs for the memory edge, we simply add the memory edge after the inputs. This is a bit of an ad-hoc fix, but it is much simpler than the other two options. Changing the `.ad` files requires much more work. Adding `stackSlot` to `MatchNode::needs_ideal_memory_edge` would also be an ad-hoc fix. The added test still fails with other changes in mainline, and passes with my new fix. Ran it 50 times to verify. Ran larger test suite, all passed. 
--------- In `PhaseChaitin::fixup_spills` we decide if we need a memory edge when reading from a spilled register. Unfortunately, for `MoveF2I`, `MoveD2L` etc we do not add such memory edges. This can lead to reversed scheduling, where we read from a `stackSlot` before we wrote to it, leading to wrong results. (This happens intermittently, but the regression test did reproduce it at about a 10% rate) In `PhaseChaitin::fixup_spills` we decided if such a memory edge needs to be added by comparing `oper_input_base()` of the node before spilling and after spilling. If `oper_input_base()` of the `mach` node (before spilling) is 1, this means that node does not have a memory edge yet. And if `oper_input_base()` of the `cisc` node (after spilling) is 2, this means it needs a memory edge. In all spill cases I could find, the value is 1 and 2 respectively, except for MoveF2I etc, there it is 1 and 1 respectively, thus the memory edge was omitted. The values of `oper_input_base()` are determined in `InstructForm::oper_input_base`, where we query `MatchNode::needs_ideal_memory_edge`. This function checks if there is an `_opType` in the recursive match structure of this mach node, that matches one of a list of nodes (`StoreI, StoreF, ... LoadI, StoreF, ... etc.`). Unfortunately, MoveF2I etc do not have such a match. Instead of a `StoreF/LoadF`, they used `stackSlotF` (which is not recognized in `MatchNode::needs_ideal_memory_edge`). So it thinks there is no need for a memory edge. We saw 2 options to fix this issue: 1) add `stackSlotI/L/P/D/F` to `MatchNode::needs_ideal_memory_edge`. However, this seems to be an inconsistent solution. The other items in that list are nodes, `stackSlot` is not. And other operations (like `addI, testI, etc.`) all use `LoadI/StoreI`, which is more generic (for heap and stack). 2) Change the arguments and match rules to not use `stackSlot`, but `memory` arguments and `LoadI/StoreI` nodes. 
This is a consistent and more generic solution (the MoveF2I operation could now be used not just for stack spilling but also reading/writing from/to memory).

I picked option 2.
Further, I now assume that we can always add such a memory edge when reading from a spilled register. This assumption did not get violated in my more extensive testing.

While the regression test only failed about 10% of the time due to this bug, the assert I added verifying that we add these memory edges did trigger 100% of the time before I applied the fix in x86_64.ad. This means the memory edge was missing every time, just the scheduling varied and we were lucky most of the time.

There are a few open points for discussion:

- `loadSSI/L/P/F/D` still uses `stackSlot`. I have never observed that this operation gets its register spilled. But I still wonder if we should not have this operation use `memory/LoadI` instead of `stackSlotI`. I think we might even be able to simply remove `loadSSI` because it is already covered by what `loadI` does. (Update: Tests suggest `loadSSX` can be removed from `x86_64.ad`)
- So far I have only applied my fix to `x86_64.ad` -> we probably want to apply it to all platforms.
- Other platforms use `stackSlot` more often, for example in `x86_32.ad`. It may well be that some of these operations could also be spilled, which would probably also lead to missing memory edges. I wonder if we should maybe remove all occurrences of `stackSlot` in the `ad` files, or if we should still add `stackSlot` to `MatchNode::needs_ideal_memory_edge` to ensure we can always add the memory edges.
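
The `oper_input_base()` comparison described earlier in this message, and the simpler fix from the update at the top, can be modelled with a toy example. This is only a sketch under stated assumptions, not HotSpot code: `ToyNode`, `has_reserved_memory_slot` and `attach_spill_def` are hypothetical names, inputs are plain strings, and "add the memory edge after the inputs" is modelled as appending to the input list.

```cpp
#include <string>
#include <vector>

// Toy stand-in for a Mach node; models only what the discussion needs.
struct ToyNode {
  unsigned oper_input_base;         // 1: no memory slot, 2: slot 1 reserved for memory
  std::vector<std::string> inputs;  // inputs[0] plays the role of the control input
};

// The comparison discussed in the thread: the CISC variant has a memory slot
// that the original (pre-spill) node did not have.
inline bool has_reserved_memory_slot(const ToyNode& mach, const ToyNode& cisc) {
  return cisc.oper_input_base > 1 && mach.oper_input_base <= 1;
}

// Attach the spill definition `src` to the CISC node. With a reserved slot it
// goes in at index 1; otherwise (the MoveF2I-like 1-and-1 case) it is appended
// after the existing inputs instead of being dropped.
inline void attach_spill_def(const ToyNode& mach, ToyNode& cisc, const std::string& src) {
  if (has_reserved_memory_slot(mach, cisc) && !cisc.inputs.empty()) {
    cisc.inputs.insert(cisc.inputs.begin() + 1, src);
  } else {
    cisc.inputs.push_back(src);
  }
}
```

In the usual 1-and-2 case the spill definition lands in slot 1. In the 1-and-1 case the comparison is false, which is where the edge used to be omitted; the sketch still keeps `src` as a trailing input, so the consumer stays ordered after the spill definition.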
------------- Commit messages: - After reverting: the new bug-fix and same test as before - Revert "8282555: Missing memory edge when spilling MoveF2I, MoveD2L etc" - Revert "minor fixes: whitespace and print" - Merge branch 'master' into JDK-8282555 - minor fixes: whitespace and print - 8282555: Missing memory edge when spilling MoveF2I, MoveD2L etc Changes: https://git.openjdk.java.net/jdk/pull/7889/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=7889&range=00 Issue: https://bugs.openjdk.java.net/browse/JDK-8282555 Stats: 19 lines in 2 files changed: 19 ins; 0 del; 0 mod Patch: https://git.openjdk.java.net/jdk/pull/7889.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/7889/head:pull/7889 PR: https://git.openjdk.java.net/jdk/pull/7889 From dlong at openjdk.java.net Mon May 2 18:44:40 2022 From: dlong at openjdk.java.net (Dean Long) Date: Mon, 2 May 2022 18:44:40 GMT Subject: RFR: 8282555: Missing memory edge when spilling MoveF2I, MoveD2L etc In-Reply-To: References: Message-ID: <8wNbMgIFuyR08PXXPExCESDNUWfvV-vCSlSUuFjR-68=.620627ae-1419-4d55-b83f-94049655b21f@github.com> On Mon, 21 Mar 2022 11:02:35 GMT, Emanuel Peter wrote: > Update: > After the inputs from @jatin-bhateja, and verifying with @vnkozlov and @TobiHartmann , I have implemented a much simpler fix: > Whenever there is no pre-allocated space before the inputs for the memory edge, we simply add the memory edge after the inputs. > > This is a bit of an ad-hoc fix, but it is much simpler than the other two options. Changing the `.ad` files requires much more work. Adding `stackSlot` to `MatchNode::needs_ideal_memory_edge` would also be an ad-hoc fix. > > The added test still fails with other changes in mainline, and passes with my new fix. Ran it 50 times to verify. > Ran larger test suite, all passed. > > --------- > > In `PhaseChaitin::fixup_spills` we decide if we need a memory edge when reading from a spilled register. 
> Unfortunately, for `MoveF2I`, `MoveD2L` etc we do not add such memory edges. > This can lead to reversed scheduling, where we read from a `stackSlot` before we wrote to it, leading to wrong results. > (This happens intermittently, but the regression test did reproduce it at about a 10% rate) > > In `PhaseChaitin::fixup_spills` we decided if such a memory edge needs to be added by comparing `oper_input_base()` of the node before spilling and after spilling. If `oper_input_base()` of the `mach` node (before spilling) is 1, this means that node does not have a memory edge yet. And if `oper_input_base()` of the `cisc` node (after spilling) is 2, this means it needs a memory edge. In all spill cases I could find, the value is 1 and 2 respectively, except for MoveF2I etc, there it is 1 and 1 respectively, thus the memory edge was omitted. > > The values of `oper_input_base()` are determined in `InstructForm::oper_input_base`, where we query `MatchNode::needs_ideal_memory_edge`. This function checks if there is an `_opType` in the recursive match structure of this mach node, that matches one of a list of nodes (`StoreI, StoreF, ... LoadI, StoreF, ... etc.`). Unfortunately, MoveF2I etc do not have such a match. Instead of a `StoreF/LoadF`, they used `stackSlotF` (which is not recognized in `MatchNode::needs_ideal_memory_edge`). So it thinks there is no need for a memory edge. > > We saw 2 options to fix this issue: > 1) add `stackSlotI/L/P/D/F` to `MatchNode::needs_ideal_memory_edge`. However, this seems to be an inconsistent solution. The other items in that list are nodes, `stackSlot` is not. And other operations (like `addI, testI, etc.`) all use `LoadI/StoreI`, which is more generic (for heap and stack). > 2) Change the arguments and match rules to not use `stackSlot`, but `memory` arguments and `LoadI/StoreI` nodes. 
This is a consistent and more generic solution (the MoveF2I operation could now be used not just for stack spilling but also reading/writing from/to memory).
> 
> I picked option 2.
> Further, I now assume that we can always add such a memory edge when reading from a spilled register. This assumption did not get violated in my more extensive testing.
> 
> While the regression test only failed about 10% of the time due to this bug, the assert I added verifying that we add these memory edges did trigger 100% of the time before I applied the fix in x86_64.ad. This means the memory edge was missing every time, just the scheduling varied and we were lucky most of the time.
> 
> There are a few open points for discussion:
> 
> - `loadSSI/L/P/F/D` still uses `stackSlot`. I have never observed that this operation gets its register spilled. But I still wonder if we should not have this operation use `memory/LoadI` instead of `stackSlotI`. I think we might even be able to simply remove `loadSSI` because it is already covered by what `loadI` does. (Update: Tests suggest `loadSSX` can be removed from `x86_64.ad`)
> - So far I have only applied my fix to `x86_64.ad` -> we probably want to apply it to all platforms.
> - Other platforms use `stackSlot` more often, for example in `x86_32.ad`. It may well be that some of these operations could also be spilled, which would probably also lead to missing memory edges. I wonder if we should maybe remove all occurrences of `stackSlot` in the `ad` files, or if we should still add `stackSlot` to `MatchNode::needs_ideal_memory_edge` to ensure we can always add the memory edges.

I wish I better understood the CISC spilling feature, and why we have special treatment for stack slots. If you can rewrite the rules using Loads, does that mean we can get rid of all special stackSlot and sReg logic?

Does InstructForm::needs_anti_dependence_check() come into play at all?
-------------

PR: https://git.openjdk.java.net/jdk/pull/7889

From jbhateja at openjdk.java.net Mon May 2 18:44:41 2022
From: jbhateja at openjdk.java.net (Jatin Bhateja)
Date: Mon, 2 May 2022 18:44:41 GMT
Subject: RFR: 8282555: Missing memory edge when spilling MoveF2I, MoveD2L etc
In-Reply-To:
References:
Message-ID: <7Z79cOcf4xyRUV4wQR_X4ZVvCta5fFevAS6HPBqwo2k=.bd06617d-7fa7-499d-8083-ee552b4680ed@github.com>

On Mon, 21 Mar 2022 11:02:35 GMT, Emanuel Peter wrote:

> This can lead to reversed scheduling, where we read from a stackSlot before we wrote to it, leading to wrong results.

The scheduler is free to move instructions around when there are no explicit control/memory edges, but under all circumstances it should still honor the USE-DEF constraint. The following code[1] explicitly connects the stack-spilled definition to its user if a CISC variant of the user is available, and this input[2] is later replaced by the frame pointer[3] when the CISC instruction gets created during fixup_spill. This does not disturb the semantics, since the instruction is still able to read the correct value from the stack address emitted for stackOperands.

This seems to be the root cause of the problem, as it disconnects the original schedule-constraining definition from its user. I think having an extra edge for the frame pointer for CISC instructions, instead of replacing the spill definition edge, may guide the scheduler to emit a legal schedule.
[1] https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/reg_split.cpp#L254 [2] https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/chaitin.cpp#L1707 [3] https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/chaitin.cpp#L1726 ------------- PR: https://git.openjdk.java.net/jdk/pull/7889 From duke at openjdk.java.net Mon May 2 18:44:41 2022 From: duke at openjdk.java.net (Emanuel Peter) Date: Mon, 2 May 2022 18:44:41 GMT Subject: RFR: 8282555: Missing memory edge when spilling MoveF2I, MoveD2L etc In-Reply-To: <7Z79cOcf4xyRUV4wQR_X4ZVvCta5fFevAS6HPBqwo2k=.bd06617d-7fa7-499d-8083-ee552b4680ed@github.com> References: <7Z79cOcf4xyRUV4wQR_X4ZVvCta5fFevAS6HPBqwo2k=.bd06617d-7fa7-499d-8083-ee552b4680ed@github.com> Message-ID: <7s2BczN6Dk_Okr9BIXLkeh7-qJ-yEcbfsLxjujoiB58=.ef1625c4-5dec-4313-a95b-88ffc95ceceb@github.com> On Wed, 23 Mar 2022 10:19:41 GMT, Jatin Bhateja wrote: >> Update: >> After the inputs from @jatin-bhateja, and verifying with @vnkozlov and @TobiHartmann , I have implemented a much simpler fix: >> Whenever there is no pre-allocated space before the inputs for the memory edge, we simply add the memory edge after the inputs. >> >> This is a bit of an ad-hoc fix, but it is much simpler than the other two options. Changing the `.ad` files requires much more work. Adding `stackSlot` to `MatchNode::needs_ideal_memory_edge` would also be an ad-hoc fix. >> >> The added test still fails with other changes in mainline, and passes with my new fix. Ran it 50 times to verify. >> Ran larger test suite, all passed. >> >> --------- >> >> In `PhaseChaitin::fixup_spills` we decide if we need a memory edge when reading from a spilled register. >> Unfortunately, for `MoveF2I`, `MoveD2L` etc we do not add such memory edges. >> This can lead to reversed scheduling, where we read from a `stackSlot` before we wrote to it, leading to wrong results. 
>> (This happens intermittently, but the regression test did reproduce it at about a 10% rate) >> >> In `PhaseChaitin::fixup_spills` we decided if such a memory edge needs to be added by comparing `oper_input_base()` of the node before spilling and after spilling. If `oper_input_base()` of the `mach` node (before spilling) is 1, this means that node does not have a memory edge yet. And if `oper_input_base()` of the `cisc` node (after spilling) is 2, this means it needs a memory edge. In all spill cases I could find, the value is 1 and 2 respectively, except for MoveF2I etc, there it is 1 and 1 respectively, thus the memory edge was omitted. >> >> The values of `oper_input_base()` are determined in `InstructForm::oper_input_base`, where we query `MatchNode::needs_ideal_memory_edge`. This function checks if there is an `_opType` in the recursive match structure of this mach node, that matches one of a list of nodes (`StoreI, StoreF, ... LoadI, StoreF, ... etc.`). Unfortunately, MoveF2I etc do not have such a match. Instead of a `StoreF/LoadF`, they used `stackSlotF` (which is not recognized in `MatchNode::needs_ideal_memory_edge`). So it thinks there is no need for a memory edge. >> >> We saw 2 options to fix this issue: >> 1) add `stackSlotI/L/P/D/F` to `MatchNode::needs_ideal_memory_edge`. However, this seems to be an inconsistent solution. The other items in that list are nodes, `stackSlot` is not. And other operations (like `addI, testI, etc.`) all use `LoadI/StoreI`, which is more generic (for heap and stack). >> 2) Change the arguments and match rules to not use `stackSlot`, but `memory` arguments and `LoadI/StoreI` nodes. This is a consistent and more generic solution (the MoveF2I operation could now be used not just for stack spilling but also reading/writing from/to memory). >> >> I picked option 2. >> Further, I now assume that we can always add such a memory edge when reading from a spilled register. 
This assumption did not get violated in my more extensive testing.
>> 
>> While the regression test only failed about 10% of the time due to this bug, the assert I added verifying that we add these memory edges did trigger 100% of the time before I applied the fix in x86_64.ad. This means the memory edge was missing every time, just the scheduling varied and we were lucky most of the time.
>> 
>> There are a few open points for discussion:
>> 
>> - `loadSSI/L/P/F/D` still uses `stackSlot`. I have never observed that this operation gets its register spilled. But I still wonder if we should not have this operation use `memory/LoadI` instead of `stackSlotI`. I think we might even be able to simply remove `loadSSI` because it is already covered by what `loadI` does. (Update: Tests suggest `loadSSX` can be removed from `x86_64.ad`)
>> - So far I have only applied my fix to `x86_64.ad` -> we probably want to apply it to all platforms.
>> - Other platforms use `stackSlot` more often, for example in `x86_32.ad`. It may well be that some of these operations could also be spilled, which would probably also lead to missing memory edges. I wonder if we should maybe remove all occurrences of `stackSlot` in the `ad` files, or if we should still add `stackSlot` to `MatchNode::needs_ideal_memory_edge` to ensure we can always add the memory edges.
> 
>> This can lead to reversed scheduling, where we read from a stackSlot before we wrote to it, leading to wrong results.
> 
> The scheduler is free to move instructions around when there are no explicit control/memory edges, but under all circumstances it should still honor the USE-DEF constraint. The following code[1] explicitly connects the stack-spilled definition to its user if a CISC variant of the user is available, and this input[2] is later replaced by the frame pointer[3] when the CISC instruction gets created during fixup_spill. This does not disturb the semantics, since the instruction is still able to read the correct value from the stack address emitted for stackOperands.
> 
> This seems to be the root cause of the problem, as it disconnects the original schedule-constraining definition from its user. I think having an extra edge for the frame pointer for CISC instructions, instead of replacing the spill definition edge, may guide the scheduler to emit a legal schedule.
> 
> [1] https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/reg_split.cpp#L254
> [2] https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/chaitin.cpp#L1707
> [3] https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/chaitin.cpp#L1726

@jatin-bhateja Yes, the problem is that we overwrite the input edge to the spill-definition (`src`) with the frame-pointer (`fp`). Good to hear that you agree that we need that extra edge from the spill-definition.
We do add back in the spill-definition (`src`), but only if the node permits it:
[condition](https://github.com/openjdk/jdk/blob/6ed0ba2f8a2af58c45a6b7be684ef30d15af6ead/src/hotspot/share/opto/chaitin.cpp#L1727): `cisc->oper_input_base() > 1 && mach->oper_input_base() <= 1`
[cisc->ins_req(1,src);](https://github.com/openjdk/jdk/blob/6ed0ba2f8a2af58c45a6b7be684ef30d15af6ead/src/hotspot/share/opto/chaitin.cpp#L1729)

@jatin-bhateja would you agree that we always need such an edge to the spill-definition (`src`), and therefore instead of the `if` we should rather `assert` the condition (like my proposal)? If so, the question is what to do with the `stackSlotX` operations, as they do not lead to correct `oper_input_base` values: in the mach-matching they do not use nodes such as `LoadX`, only `stackSlotX` appears. This is checked in [MatchNode::needs_ideal_memory_edge](https://github.com/openjdk/jdk/blob/6ed0ba2f8a2af58c45a6b7be684ef30d15af6ead/src/hotspot/share/adlc/formssel.cpp#L3512). Would you agree that we need to make `MatchNode::needs_ideal_memory_edge` return 1 instead of 0? Should we simply add `stackSlotX` to this list, or should we make sure `stackSlotX` does not occur in the mach-matching and `LoadX` is used instead?

@jatin-bhateja A clarification question: do you want the additional spill-definition (`src`) edge to go to the CISC instruction (`cisc`), as it is done now, given the condition is fulfilled? Or do you want to have the additional edge go to the frame-pointer (`fp`)?

-------------

PR: https://git.openjdk.java.net/jdk/pull/7889

From duke at openjdk.java.net Mon May 2 18:44:42 2022
From: duke at openjdk.java.net (Emanuel Peter)
Date: Mon, 2 May 2022 18:44:42 GMT
Subject: RFR: 8282555: Missing memory edge when spilling MoveF2I, MoveD2L etc
In-Reply-To: <8wNbMgIFuyR08PXXPExCESDNUWfvV-vCSlSUuFjR-68=.620627ae-1419-4d55-b83f-94049655b21f@github.com>
References: <8wNbMgIFuyR08PXXPExCESDNUWfvV-vCSlSUuFjR-68=.620627ae-1419-4d55-b83f-94049655b21f@github.com>
Message-ID:

On Mon, 21 Mar 2022 22:38:39 GMT, Dean Long wrote:

>> Update:
>> After the inputs from @jatin-bhateja, and verifying with @vnkozlov and @TobiHartmann , I have implemented a much simpler fix:
>> Whenever there is no pre-allocated space before the inputs for the memory edge, we simply add the memory edge after the inputs.
>> 
>> This is a bit of an ad-hoc fix, but it is much simpler than the other two options. Changing the `.ad` files requires much more work. Adding `stackSlot` to `MatchNode::needs_ideal_memory_edge` would also be an ad-hoc fix.
>> 
>> The added test still fails with other changes in mainline, and passes with my new fix. Ran it 50 times to verify.
>> Ran larger test suite, all passed.
>> 
>> ---------
>> 
>> In `PhaseChaitin::fixup_spills` we decide if we need a memory edge when reading from a spilled register.
>> Unfortunately, for `MoveF2I`, `MoveD2L` etc we do not add such memory edges.
>> This can lead to reversed scheduling, where we read from a `stackSlot` before we wrote to it, leading to wrong results.
>> (This happens intermittently, but the regression test did reproduce it at about a 10% rate) >> >> In `PhaseChaitin::fixup_spills` we decided if such a memory edge needs to be added by comparing `oper_input_base()` of the node before spilling and after spilling. If `oper_input_base()` of the `mach` node (before spilling) is 1, this means that node does not have a memory edge yet. And if `oper_input_base()` of the `cisc` node (after spilling) is 2, this means it needs a memory edge. In all spill cases I could find, the value is 1 and 2 respectively, except for MoveF2I etc, there it is 1 and 1 respectively, thus the memory edge was omitted. >> >> The values of `oper_input_base()` are determined in `InstructForm::oper_input_base`, where we query `MatchNode::needs_ideal_memory_edge`. This function checks if there is an `_opType` in the recursive match structure of this mach node, that matches one of a list of nodes (`StoreI, StoreF, ... LoadI, StoreF, ... etc.`). Unfortunately, MoveF2I etc do not have such a match. Instead of a `StoreF/LoadF`, they used `stackSlotF` (which is not recognized in `MatchNode::needs_ideal_memory_edge`). So it thinks there is no need for a memory edge. >> >> We saw 2 options to fix this issue: >> 1) add `stackSlotI/L/P/D/F` to `MatchNode::needs_ideal_memory_edge`. However, this seems to be an inconsistent solution. The other items in that list are nodes, `stackSlot` is not. And other operations (like `addI, testI, etc.`) all use `LoadI/StoreI`, which is more generic (for heap and stack). >> 2) Change the arguments and match rules to not use `stackSlot`, but `memory` arguments and `LoadI/StoreI` nodes. This is a consistent and more generic solution (the MoveF2I operation could now be used not just for stack spilling but also reading/writing from/to memory). >> >> I picked option 2. >> Further, I now assume that we can always add such a memory edge when reading from a spilled register. 
This assumption did not get violated in my more extensive testing.
>> 
>> While the regression test only failed about 10% of the time due to this bug, the assert I added verifying that we add these memory edges did trigger 100% of the time before I applied the fix in x86_64.ad. This means the memory edge was missing every time, just the scheduling varied and we were lucky most of the time.
>> 
>> There are a few open points for discussion:
>> 
>> - `loadSSI/L/P/F/D` still uses `stackSlot`. I have never observed that this operation gets its register spilled. But I still wonder if we should not have this operation use `memory/LoadI` instead of `stackSlotI`. I think we might even be able to simply remove `loadSSI` because it is already covered by what `loadI` does. (Update: Tests suggest `loadSSX` can be removed from `x86_64.ad`)
>> - So far I have only applied my fix to `x86_64.ad` -> we probably want to apply it to all platforms.
>> - Other platforms use `stackSlot` more often, for example in `x86_32.ad`. It may well be that some of these operations could also be spilled, which would probably also lead to missing memory edges. I wonder if we should maybe remove all occurrences of `stackSlot` in the `ad` files, or if we should still add `stackSlot` to `MatchNode::needs_ideal_memory_edge` to ensure we can always add the memory edges.
> 
> I wish I better understood the CISC spilling feature, and why we have special treatment for stack slots. If you can rewrite the rules using Loads, does that mean we can get rid of all special stackSlot and sReg logic?
> 
> Does InstructForm::needs_anti_dependence_check() come into play at all?

@dean-long it seems like I can at least remove all `stackSlotX` occurrences from `x86_64.ad`, and the tests seem to still pass. That is of course no guarantee, I do not know if the tests sufficiently cover this area of the code. On other platforms this may be more involved, I don't know if special logic may be necessary.
@dean-long Investigating where `stackSlot` and `sReg` are named / relevant: of course in the ad files. I'm 99% sure they can be removed from `x86_64.ad`. With the other platforms I'm not sure if they need some special handling. But I think there is a good chance it could work to just use `memory` instead of `stackSlot`. And then there is the rest of the code. It would be a bit of work to remove it all, but if it does not do anything other than what we can handle with other code, then why keep it around?

`./src/hotspot/share/opto/machnode.hpp`

    // Parameters needed to support MEMORY_INTERFACE access to stackSlot
    virtual int disp (PhaseRegAlloc *ra_, const Node *node, int idx) const;

`MachOper::disp` --> `return 0x00`; only used in `ad_x86_peephole.cpp`, lines 130 (`loadLNode::peephole`) and 63 (`loadINode::peephole`). Ah, maybe this is for the memory offset calculation. The displacement?

`./src/hotspot/share/adlc/archDesc.cpp` in `ArchDesc::initBaseOpTypes`

    // !!!!! Update - when adding a new sReg/stackSlot type
    // Create operand types "sReg[IPFDL]" for stack slot registers
    opForm = constructOperand("sRegI", false);
    opForm->_constraint = new Constraint("ALLOC_IN_RC", "stack_slots");
    opForm = constructOperand("sRegP", false);
    opForm->_constraint = new Constraint("ALLOC_IN_RC", "stack_slots");
    opForm = constructOperand("sRegF", false);
    opForm->_constraint = new Constraint("ALLOC_IN_RC", "stack_slots");
    opForm = constructOperand("sRegD", false);
    opForm->_constraint = new Constraint("ALLOC_IN_RC", "stack_slots");
    opForm = constructOperand("sRegL", false);
    opForm->_constraint = new Constraint("ALLOC_IN_RC", "stack_slots");

This defines the sReg operand types.

`./src/hotspot/share/adlc/output_c.cpp` in `OutputReduceOp::map`

    // operand stackSlot does not have a match rule, but produces a stackSlot
    if( oper.is_user_name_for_sReg() != Form::none ) reduce = oper.reduce_result();

We could get rid of this; it is special-casing for sReg.

    else if( _operand->is_user_name_for_sReg() != Form::none ) {
      // The only non-constant allowed access to disp is an operand sRegX in a stackSlotX
      assert( op->ideal_to_sReg_type(type) != Form::none, "StackSlots access displacements using 'sRegs'");
      _may_reloc = false;
    } else {
      assert( false, "fatal(); Only stackSlots can access a non-constant using 'disp'");
    }

This also seems to be special-casing for sReg. The assert in the else case would have to say that only constants can be accessed using 'disp'.

`./src/hotspot/share/adlc/formssel.cpp`

`OperandForm::is_user_name_for_sReg` maps `_ident` from e.g. `stackSlotI` to ideal types, e.g. `Form::idealI`.

    int MatchRule::is_ideal_copy() const {
      if (is_chain_rule(_AD.globalNames()) && _lChild && strncmp(_lChild->_opType, "stackSlot", 9) == 0) {
        return 1;
      }
      return 0;
    }

This probably matches all `stackSlotX`. Hmm, let's see where this is used, it seems specific. But I think this is just a parallel case to `is_ideal_load`, which would apply for `LoadI` etc., whereas `is_ideal_copy` seems to be for stackSlot.

---- follow up search for `sRegX`:

`./src/hotspot/cpu/x86/x86_64.ad`: `operand stackSlotI(sRegI reg)` definition of stackSlotX in ad files (same in other ad files)
`./src/hotspot/share/adlc/archDesc.cpp`: `if (strcmp(rootOp,"sRegI")==0) continue;` in `ArchDesc::inspectOperands`: special-casing them out of some logic -> I don't understand what this does
`./src/hotspot/share/adlc/archDesc.cpp`: `opForm = constructOperand("sRegI", false);` defining the sReg operand types (see above already)
`./src/hotspot/share/adlc/forms.cpp`: `if (strcmp(name,"sRegI")==0) return Form::idealI;` in `Form::ideal_to_sReg_type` -> see below for usage

------ follow up `Form::ideal_to_sReg_type`

`./src/hotspot/share/adlc/output_c.cpp`: special case, can eliminate -> generates special code, probably the other path leads to the same result (with is_base_register)
`./src/hotspot/share/adlc/output_h.cpp`: same
`./src/hotspot/share/adlc/formssel.cpp`: same
`./src/hotspot/share/adlc/forms.cpp`: definition

------ follow up `is_user_name_for_sReg`

`./src/hotspot/share/adlc/output_c.cpp`: is all special-casing, I think it can be removed
`./src/hotspot/share/adlc/output_h.cpp`: same

(edited)

@dean-long Of course the question is if we should really remove all this logic and specification in ad files under this bug.
Maybe we can do something minimal now, and file separate RFEs to:
1) remove stackSlot from each platform's ad file -> RFE per platform
2) once stackSlot does not occur in any ad file, we should be able to remove the logic from the code
In doing this we may have to pay attention to whether there is a performance regression. Not sure if this could happen in this case, but we'd have to make sure anyway (thanks @TobiHartmann for bringing this up).
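
The two name-based checks quoted in the notes above (the `sRegX` lookup and the `"stackSlot"` prefix test) can be sketched in isolation. This is only a toy illustration, not ADLC code: `IdealType`, `toy_ideal_to_sReg_type` and `toy_is_stack_slot_operand` are hypothetical names, and the enum members beyond `idealI` are assumed by analogy with "sReg[IPFDL]".

```cpp
#include <cstring>

// Hypothetical stand-in for the ADLC type enum; only idealI and none are
// quoted in the notes above, the other members are assumed by analogy.
enum class IdealType { none, idealI, idealP, idealF, idealD, idealL };

// Toy version of the "sReg[IPFDL]" lookup: map an operand name to a type tag.
inline IdealType toy_ideal_to_sReg_type(const char* name) {
  if (name == nullptr) return IdealType::none;
  if (std::strcmp(name, "sRegI") == 0) return IdealType::idealI;
  if (std::strcmp(name, "sRegP") == 0) return IdealType::idealP;
  if (std::strcmp(name, "sRegF") == 0) return IdealType::idealF;
  if (std::strcmp(name, "sRegD") == 0) return IdealType::idealD;
  if (std::strcmp(name, "sRegL") == 0) return IdealType::idealL;
  return IdealType::none;
}

// Toy version of the prefix test seen in is_ideal_copy: any operand type
// whose first 9 characters are "stackSlot" counts as a stack slot operand.
inline bool toy_is_stack_slot_operand(const char* op_type) {
  return op_type != nullptr && std::strncmp(op_type, "stackSlot", 9) == 0;
}
```

Every place that calls one of these checks is a spot where stack slots get special treatment, which is why the search above follows both names through the ADLC sources.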
------------- PR: https://git.openjdk.java.net/jdk/pull/7889 From jbhateja at openjdk.java.net Mon May 2 18:44:43 2022 From: jbhateja at openjdk.java.net (Jatin Bhateja) Date: Mon, 2 May 2022 18:44:43 GMT Subject: RFR: 8282555: Missing memory edge when spilling MoveF2I, MoveD2L etc In-Reply-To: <7Z79cOcf4xyRUV4wQR_X4ZVvCta5fFevAS6HPBqwo2k=.bd06617d-7fa7-499d-8083-ee552b4680ed@github.com> References: <7Z79cOcf4xyRUV4wQR_X4ZVvCta5fFevAS6HPBqwo2k=.bd06617d-7fa7-499d-8083-ee552b4680ed@github.com> Message-ID: On Wed, 23 Mar 2022 10:19:41 GMT, Jatin Bhateja wrote: >> Update: >> After the inputs from @jatin-bhateja, and verifying with @vnkozlov and @TobiHartmann , I have implemented a much simpler fix: >> Whenever there is no pre-allocated space before the inputs for the memory edge, we simply add the memory edge after the inputs. >> >> This is a bit of an ad-hoc fix, but it is much simpler than the other two options. Changing the `.ad` files requires much more work. Adding `stackSlot` to `MatchNode::needs_ideal_memory_edge` would also be an ad-hoc fix. >> >> The added test still fails with other changes in mainline, and passes with my new fix. Ran it 50 times to verify. >> Ran larger test suite, all passed. >> >> --------- >> >> In `PhaseChaitin::fixup_spills` we decide if we need a memory edge when reading from a spilled register. >> Unfortunately, for `MoveF2I`, `MoveD2L` etc we do not add such memory edges. >> This can lead to reversed scheduling, where we read from a `stackSlot` before we wrote to it, leading to wrong results. >> (This happens intermittently, but the regression test did reproduce it at about a 10% rate) >> >> In `PhaseChaitin::fixup_spills` we decided if such a memory edge needs to be added by comparing `oper_input_base()` of the node before spilling and after spilling. If `oper_input_base()` of the `mach` node (before spilling) is 1, this means that node does not have a memory edge yet. 
And if `oper_input_base()` of the `cisc` node (after spilling) is 2, this means it needs a memory edge. In all spill cases I could find, the value is 1 and 2 respectively, except for MoveF2I etc, there it is 1 and 1 respectively, thus the memory edge was omitted. >> >> The values of `oper_input_base()` are determined in `InstructForm::oper_input_base`, where we query `MatchNode::needs_ideal_memory_edge`. This function checks if there is an `_opType` in the recursive match structure of this mach node, that matches one of a list of nodes (`StoreI, StoreF, ... LoadI, StoreF, ... etc.`). Unfortunately, MoveF2I etc do not have such a match. Instead of a `StoreF/LoadF`, they used `stackSlotF` (which is not recognized in `MatchNode::needs_ideal_memory_edge`). So it thinks there is no need for a memory edge. >> >> We saw 2 options to fix this issue: >> 1) add `stackSlotI/L/P/D/F` to `MatchNode::needs_ideal_memory_edge`. However, this seems to be an inconsistent solution. The other items in that list are nodes, `stackSlot` is not. And other operations (like `addI, testI, etc.`) all use `LoadI/StoreI`, which is more generic (for heap and stack). >> 2) Change the arguments and match rules to not use `stackSlot`, but `memory` arguments and `LoadI/StoreI` nodes. This is a consistent and more generic solution (the MoveF2I operation could now be used not just for stack spilling but also reading/writing from/to memory). >> >> I picked option 2. >> Further, I now assume that we can always add such a memory edge when reading from a spilled register. This assumption did not get violated in my more extensive testing. >> >> While the regression test only failed about 10% due to this bug, the assert I added verifying that we add these memory edges did trigger 100% before I applied the fix in x86_64.ad. This means the memory edge was missing every time, just the scheduling varied and we were lucky most of the time. 
>> >> There are a few open points for discussion: >> >> - `loadSSI/L/P/F/D` still uses `stackSlot`. I have never observed that this operation gets its register spilled. But I still wonder if we should not have this operation use `memory/LoadI` instead of `stackSlotI`. I think we might even be able to simply remove `loadSSI` because it is already covered by what `loadI` does. (Update: Tests suggest `loadSSX` can be removed from `x86_64.ad`) >> - So far I have only applied my fix to `x86_64.ad` -> we probably want to apply it to all platforms. >> - Other platforms use `stackSlot` more often, for example in `x86_32.ad`. It may well be that some of these operation could also be spilled, which would probably also lead to missing memory edges. I wonder if we should maybe remove all occurances of `stackSlot` in the `ad` files, or if we should still add `stackSlot` to `MatchNode::needs_ideal_memory_edge` to ensure we alway can add the memory edges. > >> This can lead to reversed scheduling, where we read from a stackSlot before we wrote to it, leading to wrong results. > > Scheduler is free to move around instructions without explicit control/memory edges but under all circumstances it should still honor USE-DEF constrain. Following code[1] explicitly connects stack spilled definition to its user if CISC variant of user is available and this input[2] is later replaced by frame pointer[3] when CISC instruction gets created during fixup_spill, this does not disturb the semantics since instruction is still able read correct value from stack address emitted for stackOperands. > > This seems to be the root cause of the problem as it disconnects original schedule constraining definition from its user. I think having an extra edge for frame pointer for CISC instructions instead of replacing spill definition edge may guide the scheduler to emit legal schedule. 
> > [1] https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/reg_split.cpp#L254 > [2] https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/chaitin.cpp#L1707 > [3] https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/chaitin.cpp#L1726 > @jatin-bhateja Yes, the problem is that we overwrite the input-edge to the spill-definition (`src`), with the frame-pointer (`fp`). Good to hear that you agree that we need that extra edge from the spill-definition. > > We do add back in the spill-definition (`src`), but only if the node permits it: [condition](https://github.com/openjdk/jdk/blob/6ed0ba2f8a2af58c45a6b7be684ef30d15af6ead/src/hotspot/share/opto/chaitin.cpp#L1727): `cisc->oper_input_base() > 1 && mach->oper_input_base() <= 1` [cisc->ins_req(1,src);](https://github.com/openjdk/jdk/blob/6ed0ba2f8a2af58c45a6b7be684ef30d15af6ead/src/hotspot/share/opto/chaitin.cpp#L1729) > > @jatin-bhateja would you agree that we always need such a edge to the spill-definition (`src`), and therefore instead of the `if` we should rather `assert` the condition (like my proposal)? If so, the question is what to do with the `stackSlotX` operations, as they do not lead to correct `oper_input_base` values, since in the mach-matching they do not use nodes such as `LoadX`, but only `stackSlotX` appears. This is checked in [MatchNode::needs_ideal_memory_edge](https://github.com/openjdk/jdk/blob/6ed0ba2f8a2af58c45a6b7be684ef30d15af6ead/src/hotspot/share/adlc/formssel.cpp#L3512). Would you agree that we need to make `MatchNode::needs_ideal_memory_edge` return 1 instead of 0? Should we simply add `stackSlotX` to this list, or should we make sure `stackSlotX` does not occur in the mach-matching, instead `LoadX`? > > @jatin-bhateja A clarification question: do you want the additional spill-definintion (`src`) edge go to the CISC instruction (`cisc`), as it is done now, given the condition is fulfilled? 
Or do you want to have the additional edge go to the frame-pointer (`fp`)?

@eme64, I think adding a precedence edge between SPILLED_DEF and the CISC instruction after replacing SPILLED_DEFs with FP should add the necessary data dependency constraint to enable generating a legal schedule.

-------------

PR: https://git.openjdk.java.net/jdk/pull/7889

From duke at openjdk.java.net Mon May 2 18:44:45 2022
From: duke at openjdk.java.net (Emanuel Peter)
Date: Mon, 2 May 2022 18:44:45 GMT
Subject: RFR: 8282555: Missing memory edge when spilling MoveF2I, MoveD2L etc
In-Reply-To:
References: <8wNbMgIFuyR08PXXPExCESDNUWfvV-vCSlSUuFjR-68=.620627ae-1419-4d55-b83f-94049655b21f@github.com>
Message-ID:

On Wed, 23 Mar 2022 14:57:16 GMT, Emanuel Peter wrote:

>> I wish I better understood the CISC spilling feature, and why we have special treatment for stack slots. If you can rewrite the rules using Loads, does that mean we can get rid of all special stackSlot and sReg logic?
>>
>> Does InstructForm::needs_anti_dependence_check() come into play at all?
>
> @dean-long
> Of course the question is if we should really remove all this logic and specification in ad files under this bug.
> Maybe we can do something minimal now, and file separate RFEs to:
> 1) remove stackSlot from each platform's ad file -> RFE per platform
> 2) once stackSlot does not occur in any ad file, we should be able to remove the logic from the code
> In this we may have to pay attention if there is a performance regression. Not sure if this could happen in this case, but we'd have to make sure anyway (thanks @TobiHartmann for bringing this up).

> @eme64 , I think adding a precedence edge between SPILLED_DEF and the CISC instruction after replacing SPILLED_DEFs with FP should add the necessary data dependency constraint to enable generating a legal schedule.

@jatin-bhateja Great, let's add such a memory edge! How should I do that?
I have two possible solutions, see above for more details: 1.
Add `stackSlotI/L/P/D/F` to `MatchNode::needs_ideal_memory_edge` match list. 2. Edit `x86_64.ad` (and other ad files): use `memory` instead of `stackSlot` for `MoveI2F`. ------------- PR: https://git.openjdk.java.net/jdk/pull/7889 From jbhateja at openjdk.java.net Mon May 2 18:44:46 2022 From: jbhateja at openjdk.java.net (Jatin Bhateja) Date: Mon, 2 May 2022 18:44:46 GMT Subject: RFR: 8282555: Missing memory edge when spilling MoveF2I, MoveD2L etc In-Reply-To: References: <7Z79cOcf4xyRUV4wQR_X4ZVvCta5fFevAS6HPBqwo2k=.bd06617d-7fa7-499d-8083-ee552b4680ed@github.com> Message-ID: On Mon, 28 Mar 2022 02:21:34 GMT, Jatin Bhateja wrote: >>> This can lead to reversed scheduling, where we read from a stackSlot before we wrote to it, leading to wrong results. >> >> Scheduler is free to move around instructions without explicit control/memory edges but under all circumstances it should still honor USE-DEF constrain. Following code[1] explicitly connects stack spilled definition to its user if CISC variant of user is available and this input[2] is later replaced by frame pointer[3] when CISC instruction gets created during fixup_spill, this does not disturb the semantics since instruction is still able read correct value from stack address emitted for stackOperands. >> >> This seems to be the root cause of the problem as it disconnects original schedule constraining definition from its user. I think having an extra edge for frame pointer for CISC instructions instead of replacing spill definition edge may guide the scheduler to emit legal schedule. >> >> [1] https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/reg_split.cpp#L254 >> [2] https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/chaitin.cpp#L1707 >> [3] https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/chaitin.cpp#L1726 > >> @jatin-bhateja Yes, the problem is that we overwrite the input-edge to the spill-definition (`src`), with the frame-pointer (`fp`). 
Good to hear that you agree that we need that extra edge from the spill-definition. >> >> We do add back in the spill-definition (`src`), but only if the node permits it: [condition](https://github.com/openjdk/jdk/blob/6ed0ba2f8a2af58c45a6b7be684ef30d15af6ead/src/hotspot/share/opto/chaitin.cpp#L1727): `cisc->oper_input_base() > 1 && mach->oper_input_base() <= 1` [cisc->ins_req(1,src);](https://github.com/openjdk/jdk/blob/6ed0ba2f8a2af58c45a6b7be684ef30d15af6ead/src/hotspot/share/opto/chaitin.cpp#L1729) >> >> @jatin-bhateja would you agree that we always need such a edge to the spill-definition (`src`), and therefore instead of the `if` we should rather `assert` the condition (like my proposal)? If so, the question is what to do with the `stackSlotX` operations, as they do not lead to correct `oper_input_base` values, since in the mach-matching they do not use nodes such as `LoadX`, but only `stackSlotX` appears. This is checked in [MatchNode::needs_ideal_memory_edge](https://github.com/openjdk/jdk/blob/6ed0ba2f8a2af58c45a6b7be684ef30d15af6ead/src/hotspot/share/adlc/formssel.cpp#L3512). Would you agree that we need to make `MatchNode::needs_ideal_memory_edge` return 1 instead of 0? Should we simply add `stackSlotX` to this list, or should we make sure `stackSlotX` does not occur in the mach-matching, instead `LoadX`? >> >> @jatin-bhateja A clarification question: do you want the additional spill-definintion (`src`) edge go to the CISC instruction (`cisc`), as it is done now, given the condition is fulfilled? Or do you want to have the additional edge go to the frame-pointer (`fp`)? > > @eme64 , I think adding a precedence edge b/w SPILLED_DEF and CISC instruction after replacing SPILLE_DEFs with FP should add necessary data dependency constraint to enable generating a legal schedule. > @jatin-bhateja Great, let's add such a memory edge! How should I do that? 
Please check [add_prec](https://github.com/openjdk/jdk/blob/cc598e03de39dd6e8d7e208a69d85b6a9cd0062f/src/hotspot/share/opto/node.hpp#L549) and its usage https://github.com/openjdk/jdk/blob/cc598e03de39dd6e8d7e208a69d85b6a9cd0062f/src/hotspot/share/opto/output.cpp#L3057 ------------- PR: https://git.openjdk.java.net/jdk/pull/7889 From duke at openjdk.java.net Mon May 2 18:44:47 2022 From: duke at openjdk.java.net (Emanuel Peter) Date: Mon, 2 May 2022 18:44:47 GMT Subject: RFR: 8282555: Missing memory edge when spilling MoveF2I, MoveD2L etc In-Reply-To: References: <7Z79cOcf4xyRUV4wQR_X4ZVvCta5fFevAS6HPBqwo2k=.bd06617d-7fa7-499d-8083-ee552b4680ed@github.com> Message-ID: On Tue, 29 Mar 2022 10:20:20 GMT, Jatin Bhateja wrote: >>> @jatin-bhateja Yes, the problem is that we overwrite the input-edge to the spill-definition (`src`), with the frame-pointer (`fp`). Good to hear that you agree that we need that extra edge from the spill-definition. >>> >>> We do add back in the spill-definition (`src`), but only if the node permits it: [condition](https://github.com/openjdk/jdk/blob/6ed0ba2f8a2af58c45a6b7be684ef30d15af6ead/src/hotspot/share/opto/chaitin.cpp#L1727): `cisc->oper_input_base() > 1 && mach->oper_input_base() <= 1` [cisc->ins_req(1,src);](https://github.com/openjdk/jdk/blob/6ed0ba2f8a2af58c45a6b7be684ef30d15af6ead/src/hotspot/share/opto/chaitin.cpp#L1729) >>> >>> @jatin-bhateja would you agree that we always need such a edge to the spill-definition (`src`), and therefore instead of the `if` we should rather `assert` the condition (like my proposal)? If so, the question is what to do with the `stackSlotX` operations, as they do not lead to correct `oper_input_base` values, since in the mach-matching they do not use nodes such as `LoadX`, but only `stackSlotX` appears. This is checked in [MatchNode::needs_ideal_memory_edge](https://github.com/openjdk/jdk/blob/6ed0ba2f8a2af58c45a6b7be684ef30d15af6ead/src/hotspot/share/adlc/formssel.cpp#L3512). 
Would you agree that we need to make `MatchNode::needs_ideal_memory_edge` return 1 instead of 0? Should we simply add `stackSlotX` to this list, or should we make sure `stackSlotX` does not occur in the mach-matching, instead `LoadX`? >>> >>> @jatin-bhateja A clarification question: do you want the additional spill-definintion (`src`) edge go to the CISC instruction (`cisc`), as it is done now, given the condition is fulfilled? Or do you want to have the additional edge go to the frame-pointer (`fp`)? >> >> @eme64 , I think adding a precedence edge b/w SPILLED_DEF and CISC instruction after replacing SPILLE_DEFs with FP should add necessary data dependency constraint to enable generating a legal schedule. > >> @jatin-bhateja Great, let's add such a memory edge! How should I do that? > > Please check [add_prec](https://github.com/openjdk/jdk/blob/cc598e03de39dd6e8d7e208a69d85b6a9cd0062f/src/hotspot/share/opto/node.hpp#L549) and its usage > https://github.com/openjdk/jdk/blob/cc598e03de39dd6e8d7e208a69d85b6a9cd0062f/src/hotspot/share/opto/output.cpp#L3057 @jatin-bhateja For all other cases (eg `addI`, `convI2L`, etc ) we are currently using https://github.com/openjdk/jdk/blob/cc598e03de39dd6e8d7e208a69d85b6a9cd0062f/src/hotspot/share/opto/chaitin.cpp#L1729 This seems to add the dependencies before the inputs. But that depends on there being space before the inputs. That is why we check `cisc->oper_input_base() > 1` https://github.com/openjdk/jdk/blob/cc598e03de39dd6e8d7e208a69d85b6a9cd0062f/src/hotspot/share/opto/chaitin.cpp#L1727 All other cases where we have spilling (eg `addI`, `convI2L`, etc ), we have `cisc->oper_input_base() == 2`. I think `_in[0]` is for control, and `_in[1]` for the memory edge (in the other cases not including `MoveF2I` etc). `cisc->oper_input_base() == 2` in all other cases, because in `InstructForm::oper_input_base` we ask for `MatchNode::needs_ideal_memory_edge`. In all cases except for `MoveF2I` etc, we say we need a memory edge there. 
For `MoveF2I` etc., however, we don't have space to set a memory edge before the inputs, so we simply do not set one.

So how do we work with this? Do I just remove `cisc->ins_req(1,src)` for all cases and always use `add_req` as you have suggested? Or do I leave the other cases, and just make an else case and use `add_prec` there for `MoveF2I` etc? I am wondering what a clean and consistent solution is here. Do you know why we add the memory edge before the inputs in the other cases? If I use `add_prec` then that adds the memory edge after the inputs, correct?

-------------

PR: https://git.openjdk.java.net/jdk/pull/7889

From jbhateja at openjdk.java.net Mon May 2 18:44:47 2022
From: jbhateja at openjdk.java.net (Jatin Bhateja)
Date: Mon, 2 May 2022 18:44:47 GMT
Subject: RFR: 8282555: Missing memory edge when spilling MoveF2I, MoveD2L etc
In-Reply-To:
References: <7Z79cOcf4xyRUV4wQR_X4ZVvCta5fFevAS6HPBqwo2k=.bd06617d-7fa7-499d-8083-ee552b4680ed@github.com>
Message-ID: <0SZNw-9p9QDyotJDq-E4piJ-U2Jxc3vRMozOeyGWqn8=.115cc0d3-2a36-40bb-ad45-c10032fd9a8b@github.com>

On Tue, 29 Mar 2022 10:37:21 GMT, Emanuel Peter wrote:

> Do you know why we add the memory edge before the inputs in the other cases? If I use `add_prec` then that adds the memory edge after the inputs, correct?

Currently memory edges are added for instructions which directly access memory; ADLC enforces this by scanning through the Ideal nodes of a matcher pattern in a top-down manner. Almost all the machine nodes decorated with the **Flag_is_cisc_alternate** flag access Load/Store IR in their selection patterns. The only exceptions [1][2][3], as you pointed out, are the ones which perform load/store from stack locations. Thus all I am suggesting is that for all such cases, without making many changes, we can add the DEF_Spill precedence edge, which gets added after all the inputs but will still constrain the scheduling order. Thus an instruction which has a CISC alternate but lacks a memory_operand can be handled by adding a precedence edge.
[1] MoveF2I_stack_regNode() { _num_opnds = 2; _opnds = _opnd_array; init_flags(Flag_is_cisc_alternate | Flag_needs_anti_dependence_check); } [2] MoveI2F_stack_regNode() { _num_opnds = 2; _opnds = _opnd_array; init_flags(Flag_is_cisc_alternate | Flag_needs_anti_dependence_check); } [3] MoveD2L_stack_regNode() { _num_opnds = 2; _opnds = _opnd_array; init_flags(Flag_is_cisc_alternate | Flag_needs_anti_dependence_check); } ------------- PR: https://git.openjdk.java.net/jdk/pull/7889 From duke at openjdk.java.net Mon May 2 18:44:48 2022 From: duke at openjdk.java.net (Emanuel Peter) Date: Mon, 2 May 2022 18:44:48 GMT Subject: RFR: 8282555: Missing memory edge when spilling MoveF2I, MoveD2L etc In-Reply-To: References: Message-ID: On Mon, 21 Mar 2022 11:02:35 GMT, Emanuel Peter wrote: > Update: > After the inputs from @jatin-bhateja, and verifying with @vnkozlov and @TobiHartmann , I have implemented a much simpler fix: > Whenever there is no pre-allocated space before the inputs for the memory edge, we simply add the memory edge after the inputs. > > This is a bit of an ad-hoc fix, but it is much simpler than the other two options. Changing the `.ad` files requires much more work. Adding `stackSlot` to `MatchNode::needs_ideal_memory_edge` would also be an ad-hoc fix. > > The added test still fails with other changes in mainline, and passes with my new fix. Ran it 50 times to verify. > Ran larger test suite, all passed. > > --------- > > In `PhaseChaitin::fixup_spills` we decide if we need a memory edge when reading from a spilled register. > Unfortunately, for `MoveF2I`, `MoveD2L` etc we do not add such memory edges. > This can lead to reversed scheduling, where we read from a `stackSlot` before we wrote to it, leading to wrong results. 
> (This happens intermittently, but the regression test did reproduce it at about a 10% rate) > > In `PhaseChaitin::fixup_spills` we decided if such a memory edge needs to be added by comparing `oper_input_base()` of the node before spilling and after spilling. If `oper_input_base()` of the `mach` node (before spilling) is 1, this means that node does not have a memory edge yet. And if `oper_input_base()` of the `cisc` node (after spilling) is 2, this means it needs a memory edge. In all spill cases I could find, the value is 1 and 2 respectively, except for MoveF2I etc, there it is 1 and 1 respectively, thus the memory edge was omitted. > > The values of `oper_input_base()` are determined in `InstructForm::oper_input_base`, where we query `MatchNode::needs_ideal_memory_edge`. This function checks if there is an `_opType` in the recursive match structure of this mach node, that matches one of a list of nodes (`StoreI, StoreF, ... LoadI, StoreF, ... etc.`). Unfortunately, MoveF2I etc do not have such a match. Instead of a `StoreF/LoadF`, they used `stackSlotF` (which is not recognized in `MatchNode::needs_ideal_memory_edge`). So it thinks there is no need for a memory edge. > > We saw 2 options to fix this issue: > 1) add `stackSlotI/L/P/D/F` to `MatchNode::needs_ideal_memory_edge`. However, this seems to be an inconsistent solution. The other items in that list are nodes, `stackSlot` is not. And other operations (like `addI, testI, etc.`) all use `LoadI/StoreI`, which is more generic (for heap and stack). > 2) Change the arguments and match rules to not use `stackSlot`, but `memory` arguments and `LoadI/StoreI` nodes. This is a consistent and more generic solution (the MoveF2I operation could now be used not just for stack spilling but also reading/writing from/to memory). > > I picked option 2. > Further, I now assume that we can always add such a memory edge when reading from a spilled register. This assumption did not get violated in my more extensive testing. 
> > While the regression test only failed about 10% due to this bug, the assert I added verifying that we add these memory edges did trigger 100% before I applied the fix in x86_64.ad. This means the memory edge was missing every time, just the scheduling varied and we were lucky most of the time. > > There are a few open points for discussion: > > - `loadSSI/L/P/F/D` still uses `stackSlot`. I have never observed that this operation gets its register spilled. But I still wonder if we should not have this operation use `memory/LoadI` instead of `stackSlotI`. I think we might even be able to simply remove `loadSSI` because it is already covered by what `loadI` does. (Update: Tests suggest `loadSSX` can be removed from `x86_64.ad`) > - So far I have only applied my fix to `x86_64.ad` -> we probably want to apply it to all platforms. > - Other platforms use `stackSlot` more often, for example in `x86_32.ad`. It may well be that some of these operation could also be spilled, which would probably also lead to missing memory edges. I wonder if we should maybe remove all occurances of `stackSlot` in the `ad` files, or if we should still add `stackSlot` to `MatchNode::needs_ideal_memory_edge` to ensure we alway can add the memory edges. I was out of the office, picking this back up again. I am not sure I understand what you (@jatin-bhateja) are concretely suggesting. The simplest solution that follows what you are saying is this: We add an `else` case after https://github.com/openjdk/jdk/blob/cc598e03de39dd6e8d7e208a69d85b6a9cd0062f/src/hotspot/share/opto/chaitin.cpp#L1727-L1730 with `cisc->add_prec(src);` and a comment saying: "In some rare cases, there is no space for a memory edge before the inputs. We always need a memory edge from src to cisc, else we might schedule cisc before src, loading from a spill location before storing the spill." 
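
To illustrate why the missing edge matters for scheduling, here is a small self-contained toy model. It is not the C2 scheduler: `ToyNode`, `toy_schedule` and `toy_spill_graph` are hypothetical names, and the "highest index first" rule is just an arbitrary stand-in for whatever heuristic picks among ready nodes. Node 0 plays the value definition, node 1 the spill store, node 2 the CISC instruction that reads the stack slot.

```cpp
#include <cstddef>
#include <vector>

// Toy node: only records which nodes must be scheduled before it.
struct ToyNode {
  std::vector<std::size_t> must_follow;
};

// Greedy list scheduler: repeatedly emit the ready node with the highest
// index. A node is ready once all nodes it must follow have been emitted.
inline std::vector<std::size_t> toy_schedule(const std::vector<ToyNode>& nodes) {
  std::vector<bool> done(nodes.size(), false);
  std::vector<std::size_t> order;
  while (order.size() < nodes.size()) {
    bool progressed = false;
    for (std::size_t i = nodes.size(); i-- > 0;) {
      if (done[i]) continue;
      bool ready = true;
      for (std::size_t dep : nodes[i].must_follow) {
        if (!done[dep]) { ready = false; break; }
      }
      if (ready) {
        done[i] = true;
        order.push_back(i);
        progressed = true;
        break;
      }
    }
    if (!progressed) break;  // dependency cycle: stop instead of looping forever
  }
  return order;
}

// 0 = value definition, 1 = spill store (uses 0), 2 = CISC load from the slot.
// The edge from 2 to 1 exists only if the precedence edge was added.
inline std::vector<ToyNode> toy_spill_graph(bool with_precedence_edge) {
  std::vector<ToyNode> nodes(3);
  nodes[1].must_follow.push_back(0);
  if (with_precedence_edge) nodes[2].must_follow.push_back(1);
  return nodes;
}
```

Without the edge, nothing stops this scheduler from emitting node 2 before node 1, which is the "load before store" situation described above; with the edge, the only legal order is definition, store, load.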
@TobiHartmann thought that maybe you agreed with my "Option 1" above, to explicitly add `stackSlotI/L/P/D/F` to `MatchNode::needs_ideal_memory_edge`. In the cases where we do not have a `Load/Store`, but only `stackSlot`, this would still match and add the required memory edge.

@jatin-bhateja @vnkozlov Both solutions seem ad-hoc to me; they are a bit ugly because they make a special case for a rare occasion. On the other hand, we avoid having to edit all the platform-specific code (which would be required if we were to eliminate `stackSlot` from `MoveF2I` etc, or remove `stackSlot` completely, if even possible).

-------------

PR: https://git.openjdk.java.net/jdk/pull/7889

From bulasevich at openjdk.java.net Mon May 2 20:22:23 2022
From: bulasevich at openjdk.java.net (Boris Ulasevich)
Date: Mon, 2 May 2022 20:22:23 GMT
Subject: RFR: 8285378: Remove unnecessary nop for C1 exception and deopt handler
In-Reply-To:
References: <8SyLV_zXQ5gz0T7LsxjDmRf8BHTbScsFxSZkc8krxpY=.ebc5d1a6-6f3c-4e1d-b1c2-f5d279fbefff@github.com>
Message-ID:

On Fri, 29 Apr 2022 20:54:43 GMT, Dean Long wrote:

> @bulasevich , are you sure the code pattern described in the comment is no longer a problem?
> call [...]
> [Exception Handler]
> (PC from call will be here, inside exception handler)

Yes, I am pretty sure. I followed the description provided for the nop. These nops (see the assembly listing above) are actually placed not before the [Exception Handler] as expected, but in the C1 code section, in between [[slow case stubs + optionally(exception adapters)]](https://github.com/openjdk/jdk/blob/master/src/hotspot/share/c1/c1_Compilation.cpp#L290) and the [unwind handler]. I did not find the assertion mentioned in the description, either by eye or by running tests on different platforms.

> Now that 8172844 has relaxed a related assert, it is probably safe to remove these NOPs now.

I believe JDK-8172844 was a C2 issue.
Looking at [that change](http://hg.openjdk.java.net/jdk9/jdk9/hotspot/rev/c576bd949a9d), I think it would be good to remove the unused methods:
- [CompiledMethod::insts_contains](https://github.com/openjdk/jdk/blob/master/src/hotspot/share/code/compiledMethod.hpp#L268)
- [nmethod::is_patchable_at](https://github.com/openjdk/jdk/blob/master/src/hotspot/share/code/nmethod.cpp#L2301)
- [CodeBuffer::insts_contains, CodeBuffer::insts_contains2](https://github.com/openjdk/jdk/blob/master/src/hotspot/share/asm/codeBuffer.hpp#L599)

though I feel it should be a separate change.

-------------

PR: https://git.openjdk.java.net/jdk/pull/8341

From duke at openjdk.java.net Mon May 2 20:28:22 2022
From: duke at openjdk.java.net (aamarsh)
Date: Mon, 2 May 2022 20:28:22 GMT
Subject: RFR: 8282024: add EscapeAnalysis statistics under PrintOptoStatistics [v7]
In-Reply-To:
References:
Message-ID: <3_nhaxzU2R-tZYRIUFq0qIxVpy0KX0ilkVpvCekM5zE=.2fb159c1-d2ed-4d31-98c8-e0a49fb59937@github.com>

> Escape Analysis and Scalar Replacement statistics were added when the -XX:+PrintOptoStatistics flag is set. All code is placed in an `#ifndef PRODUCT` block, so this code only runs in a debug build. Using the Renaissance benchmark I ran a few tests to confirm that the numbers were printing correctly. Below is an example run:
>
>
> No escape = 372, Arg escape = 74, Global escape = 1855 (EA executed in 10.49 seconds)
> Objects scalar replaced = 240, Monitor objects removed = 44, GC barriers removed = 37, Memory barriers removed = 284

aamarsh has refreshed the contents of this pull request, and previous commits have been removed. The incremental views will show differences compared to the previous content of the PR.
The pull request contains one new commit since the last revision: adding escape analysis and scalar replacement statistics ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/8019/files - new: https://git.openjdk.java.net/jdk/pull/8019/files/2b7edc42..18328bb3 Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=8019&range=06 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=8019&range=05-06 Stats: 67 lines in 6 files changed: 19 ins; 34 del; 14 mod Patch: https://git.openjdk.java.net/jdk/pull/8019.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8019/head:pull/8019 PR: https://git.openjdk.java.net/jdk/pull/8019 From kvn at openjdk.java.net Mon May 2 20:58:23 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Mon, 2 May 2022 20:58:23 GMT Subject: RFR: 8282555: Missing memory edge when spilling MoveF2I, MoveD2L etc In-Reply-To: References: Message-ID: On Mon, 21 Mar 2022 11:02:35 GMT, Emanuel Peter wrote: > Update: > After the inputs from @jatin-bhateja, and verifying with @vnkozlov and @TobiHartmann , I have implemented a much simpler fix: > Whenever there is no pre-allocated space before the inputs for the memory edge, we simply add the memory edge after the inputs. > > This is a bit of an ad-hoc fix, but it is much simpler than the other two options. Changing the `.ad` files requires much more work. Adding `stackSlot` to `MatchNode::needs_ideal_memory_edge` would also be an ad-hoc fix. > > The added test still fails with other changes in mainline, and passes with my new fix. Ran it 50 times to verify. > Ran larger test suite, all passed. > > --------- > > In `PhaseChaitin::fixup_spills` we decide if we need a memory edge when reading from a spilled register. > Unfortunately, for `MoveF2I`, `MoveD2L` etc we do not add such memory edges. > This can lead to reversed scheduling, where we read from a `stackSlot` before we wrote to it, leading to wrong results. 
> (This happens intermittently, but the regression test did reproduce it at about a 10% rate) > > In `PhaseChaitin::fixup_spills` we decided if such a memory edge needs to be added by comparing `oper_input_base()` of the node before spilling and after spilling. If `oper_input_base()` of the `mach` node (before spilling) is 1, this means that node does not have a memory edge yet. And if `oper_input_base()` of the `cisc` node (after spilling) is 2, this means it needs a memory edge. In all spill cases I could find, the value is 1 and 2 respectively, except for MoveF2I etc, there it is 1 and 1 respectively, thus the memory edge was omitted. > > The values of `oper_input_base()` are determined in `InstructForm::oper_input_base`, where we query `MatchNode::needs_ideal_memory_edge`. This function checks if there is an `_opType` in the recursive match structure of this mach node, that matches one of a list of nodes (`StoreI, StoreF, ... LoadI, StoreF, ... etc.`). Unfortunately, MoveF2I etc do not have such a match. Instead of a `StoreF/LoadF`, they used `stackSlotF` (which is not recognized in `MatchNode::needs_ideal_memory_edge`). So it thinks there is no need for a memory edge. > > We saw 2 options to fix this issue: > 1) add `stackSlotI/L/P/D/F` to `MatchNode::needs_ideal_memory_edge`. However, this seems to be an inconsistent solution. The other items in that list are nodes, `stackSlot` is not. And other operations (like `addI, testI, etc.`) all use `LoadI/StoreI`, which is more generic (for heap and stack). > 2) Change the arguments and match rules to not use `stackSlot`, but `memory` arguments and `LoadI/StoreI` nodes. This is a consistent and more generic solution (the MoveF2I operation could now be used not just for stack spilling but also reading/writing from/to memory). > > I picked option 2. > Further, I now assume that we can always add such a memory edge when reading from a spilled register. This assumption did not get violated in my more extensive testing. 
> > While the regression test only failed about 10% due to this bug, the assert I added verifying that we add these memory edges did trigger 100% before I applied the fix in x86_64.ad. This means the memory edge was missing every time, just the scheduling varied and we were lucky most of the time. > > There are a few open points for discussion: > > - `loadSSI/L/P/F/D` still uses `stackSlot`. I have never observed that this operation gets its register spilled. But I still wonder if we should not have this operation use `memory/LoadI` instead of `stackSlotI`. I think we might even be able to simply remove `loadSSI` because it is already covered by what `loadI` does. (Update: Tests suggest `loadSSX` can be removed from `x86_64.ad`) > - So far I have only applied my fix to `x86_64.ad` -> we probably want to apply it to all platforms. > - Other platforms use `stackSlot` more often, for example in `x86_32.ad`. It may well be that some of these operations could also be spilled, which would probably also lead to missing memory edges. I wonder if we should maybe remove all occurrences of `stackSlot` in the `ad` files, or if we should still add `stackSlot` to `MatchNode::needs_ideal_memory_edge` to ensure we can always add the memory edges. I agree with this solution as a bug fix. Your first suggestion to use `memory` parameters instead of `stackSlotX` sounds reasonable, but it would require a lot more changes for a bug fix. Other mach instructions (and on other platforms) may have the same issue. src/hotspot/share/opto/chaitin.cpp line 1732: > 1730: } else { > 1731: // In some rare cases: > 1732: // There is no space reserved for a memory edge before the inputs. It sounds like a random case, but you have a very specific case: Mach instructions which use `stackSlotX`. I think these 2 lines should say that. Something like: // There is no space reserved for a memory edge before the inputs for // instructions which have "stackSlotX" parameter instead of "memory".
// For example, "MoveF2I_stack_reg". ------------- PR: https://git.openjdk.java.net/jdk/pull/7889 From jrose at openjdk.java.net Mon May 2 21:38:36 2022 From: jrose at openjdk.java.net (John R Rose) Date: Mon, 2 May 2022 21:38:36 GMT Subject: RFR: 8283894: Intrinsify compress and expand bits on x86 In-Reply-To: References: Message-ID: On Mon, 2 May 2022 08:19:53 GMT, Jatin Bhateja wrote: > Summary of changes: > > - Patch intrinsifies following newly added Java SE APIs > - Integer.compress > - Integer.expand > - Long.compress > - Long.expand > > - Adds C2 IR nodes and corresponding ideal transformations for new operations. > - We see around ~10x performance speedup due to intrinsification over X86 target. > - Adds an IR framework based test to validate newly introduced IR transformations. > > Kindly review and share your feedback. > > Best Regards, > Jatin src/hotspot/share/opto/intrinsicnode.cpp line 159: > 157: } > 158: > 159: Node* compress_expand_identity(PhaseGVN* phase, Node* n) {
// note that these identities apply to both compress and expand
// compress(x, 0) == 0, expand(x, 0) == 0
// compress(x, -1) == x, expand(x, -1) == x
Also, from the javadoc:
// expand(-1, m) == m (but not for compress)
...
------------- PR: https://git.openjdk.java.net/jdk/pull/8498 From kvn at openjdk.java.net Mon May 2 21:57:20 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Mon, 2 May 2022 21:57:20 GMT Subject: RFR: 8285378: Remove unnecessary nop for C1 exception and deopt handler In-Reply-To: References: <8SyLV_zXQ5gz0T7LsxjDmRf8BHTbScsFxSZkc8krxpY=.ebc5d1a6-6f3c-4e1d-b1c2-f5d279fbefff@github.com> Message-ID: On Mon, 2 May 2022 20:18:31 GMT, Boris Ulasevich wrote: > though, I feel it should be a separate change. Yes, it should be a separate change.
------------- PR: https://git.openjdk.java.net/jdk/pull/8341 From dlong at openjdk.java.net Mon May 2 22:06:28 2022 From: dlong at openjdk.java.net (Dean Long) Date: Mon, 2 May 2022 22:06:28 GMT Subject: RFR: 8285378: Remove unnecessary nop for C1 exception and deopt handler In-Reply-To: <8SyLV_zXQ5gz0T7LsxjDmRf8BHTbScsFxSZkc8krxpY=.ebc5d1a6-6f3c-4e1d-b1c2-f5d279fbefff@github.com> References: <8SyLV_zXQ5gz0T7LsxjDmRf8BHTbScsFxSZkc8krxpY=.ebc5d1a6-6f3c-4e1d-b1c2-f5d279fbefff@github.com> Message-ID: On Thu, 21 Apr 2022 14:09:08 GMT, Boris Ulasevich wrote: > Each C1 method has two nops in the code body. They originally separated the exception/deopt handler block from the code body to fix a "bug 5/14/1999". Now Exception Handler and Deopt Handler are generated in a separate CodeSegment and these nops in the code body don't really help anyone. > > I checked jtreg tests on the following platforms: > - x86 > - ppc > - arm32 > - aarch64 > > I would be grateful if someone could check my changes on the riscv and s390 platforms.
> > > [Verified Entry Point] > 0x0000ffff7c749d40: nop > 0x0000ffff7c749d44: sub x9, sp, #0x20, lsl #12 > 0x0000ffff7c749d48: str xzr, [x9] > 0x0000ffff7c749d4c: sub sp, sp, #0x40 > 0x0000ffff7c749d50: stp x29, x30, [sp, #48] > 0x0000ffff7c749d54: and w0, w2, #0x1 > 0x0000ffff7c749d58: strb w0, [x1, #12] > 0x0000ffff7c749d5c: dmb ishst > 0x0000ffff7c749d60: ldp x29, x30, [sp, #48] > 0x0000ffff7c749d64: add sp, sp, #0x40 > 0x0000ffff7c749d68: ldr x8, [x28, #808] ; {poll_return} > 0x0000ffff7c749d6c: cmp sp, x8 > 0x0000ffff7c749d70: b.hi 0x0000ffff7c749d78 // b.pmore > 0x0000ffff7c749d74: ret > # emit_slow_case_stubs > 0x0000ffff7c749d78: adr x8, 0x0000ffff7c749d68 ; {internal_word} > 0x0000ffff7c749d7c: str x8, [x28, #832] > 0x0000ffff7c749d80: b 0x0000ffff7c697480 ; {runtime_call SafepointBlob} > # Excessive nops: Exception Handler and Deopt Handler prolog > 0x0000ffff7c749d84: nop <---------------------------------------------------------------- > 0x0000ffff7c749d88: nop <---------------------------------------------------------------- > # Unwind handler: the handler to remove the activation from the stack and dispatch to the caller. 
> 0x0000ffff7c749d8c: ldr x0, [x28, #968] > 0x0000ffff7c749d90: str xzr, [x28, #968] > 0x0000ffff7c749d94: str xzr, [x28, #976] > 0x0000ffff7c749d98: ldp x29, x30, [sp, #48] > 0x0000ffff7c749d9c: add sp, sp, #0x40 > 0x0000ffff7c749da0: b 0x0000ffff7c73e000 ; {runtime_call unwind_exception Runtime1 stub} > # Stubs alignment > 0x0000ffff7c749da4: .inst 0x00000000 ; undefined > 0x0000ffff7c749da8: .inst 0x00000000 ; undefined > 0x0000ffff7c749dac: .inst 0x00000000 ; undefined > 0x0000ffff7c749db0: .inst 0x00000000 ; undefined > 0x0000ffff7c749db4: .inst 0x00000000 ; undefined > 0x0000ffff7c749db8: .inst 0x00000000 ; undefined > 0x0000ffff7c749dbc: .inst 0x00000000 ; undefined > [Exception Handler] > 0x0000ffff7c749dc0: bl 0x0000ffff7c740d00 ; {no_reloc} > 0x0000ffff7c749dc4: dcps1 #0xdeae > 0x0000ffff7c749dc8: .inst 0x853828d8 ; undefined > 0x0000ffff7c749dcc: .inst 0x0000ffff ; undefined > [Deopt Handler Code] > 0x0000ffff7c749dd0: adr x30, 0x0000ffff7c749dd0 > 0x0000ffff7c749dd4: b 0x0000ffff7c6977c0 ; {runtime_call DeoptimizationBlob} Marked as reviewed by dlong (Reviewer). ------------- PR: https://git.openjdk.java.net/jdk/pull/8341 From dlong at openjdk.java.net Mon May 2 22:13:22 2022 From: dlong at openjdk.java.net (Dean Long) Date: Mon, 2 May 2022 22:13:22 GMT Subject: RFR: 8285378: Remove unnecessary nop for C1 exception and deopt handler In-Reply-To: <8SyLV_zXQ5gz0T7LsxjDmRf8BHTbScsFxSZkc8krxpY=.ebc5d1a6-6f3c-4e1d-b1c2-f5d279fbefff@github.com> References: <8SyLV_zXQ5gz0T7LsxjDmRf8BHTbScsFxSZkc8krxpY=.ebc5d1a6-6f3c-4e1d-b1c2-f5d279fbefff@github.com> Message-ID: On Thu, 21 Apr 2022 14:09:08 GMT, Boris Ulasevich wrote: > Each C1 method has two nops in the code body. They originally separated the exception/deopt handler block from the code body to fix a "bug 5/14/1999". Now Exception Handler and Deopt Handler are generated in a separate CodeSegment and these nops in the code body don't really help anyone.
> > I checked jtreg tests on the following platforms: > - x86 > - ppc > - arm32 > - aarch64 > > I would be grateful if someone could check my changes on the riscv and s390 platforms. > > > [Verified Entry Point] > 0x0000ffff7c749d40: nop > 0x0000ffff7c749d44: sub x9, sp, #0x20, lsl #12 > 0x0000ffff7c749d48: str xzr, [x9] > 0x0000ffff7c749d4c: sub sp, sp, #0x40 > 0x0000ffff7c749d50: stp x29, x30, [sp, #48] > 0x0000ffff7c749d54: and w0, w2, #0x1 > 0x0000ffff7c749d58: strb w0, [x1, #12] > 0x0000ffff7c749d5c: dmb ishst > 0x0000ffff7c749d60: ldp x29, x30, [sp, #48] > 0x0000ffff7c749d64: add sp, sp, #0x40 > 0x0000ffff7c749d68: ldr x8, [x28, #808] ; {poll_return} > 0x0000ffff7c749d6c: cmp sp, x8 > 0x0000ffff7c749d70: b.hi 0x0000ffff7c749d78 // b.pmore > 0x0000ffff7c749d74: ret > # emit_slow_case_stubs > 0x0000ffff7c749d78: adr x8, 0x0000ffff7c749d68 ; {internal_word} > 0x0000ffff7c749d7c: str x8, [x28, #832] > 0x0000ffff7c749d80: b 0x0000ffff7c697480 ; {runtime_call SafepointBlob} > # Excessive nops: Exception Handler and Deopt Handler prolog > 0x0000ffff7c749d84: nop <---------------------------------------------------------------- > 0x0000ffff7c749d88: nop <---------------------------------------------------------------- > # Unwind handler: the handler to remove the activation from the stack and dispatch to the caller. 
> 0x0000ffff7c749d8c: ldr x0, [x28, #968] > 0x0000ffff7c749d90: str xzr, [x28, #968] > 0x0000ffff7c749d94: str xzr, [x28, #976] > 0x0000ffff7c749d98: ldp x29, x30, [sp, #48] > 0x0000ffff7c749d9c: add sp, sp, #0x40 > 0x0000ffff7c749da0: b 0x0000ffff7c73e000 ; {runtime_call unwind_exception Runtime1 stub} > # Stubs alignment > 0x0000ffff7c749da4: .inst 0x00000000 ; undefined > 0x0000ffff7c749da8: .inst 0x00000000 ; undefined > 0x0000ffff7c749dac: .inst 0x00000000 ; undefined > 0x0000ffff7c749db0: .inst 0x00000000 ; undefined > 0x0000ffff7c749db4: .inst 0x00000000 ; undefined > 0x0000ffff7c749db8: .inst 0x00000000 ; undefined > 0x0000ffff7c749dbc: .inst 0x00000000 ; undefined > [Exception Handler] > 0x0000ffff7c749dc0: bl 0x0000ffff7c740d00 ; {no_reloc} > 0x0000ffff7c749dc4: dcps1 #0xdeae > 0x0000ffff7c749dc8: .inst 0x853828d8 ; undefined > 0x0000ffff7c749dcc: .inst 0x0000ffff ; undefined > [Deopt Handler Code] > 0x0000ffff7c749dd0: adr x30, 0x0000ffff7c749dd0 > 0x0000ffff7c749dd4: b 0x0000ffff7c6977c0 ; {runtime_call DeoptimizationBlob} I tried to find out the original bug that caused the nop to be added, but couldn't find much. I did find that the nop at one point was emitted in the caller `emit_code_epilog`, and that was before the introduction of emit_slow_case_stubs(). If there's still an issue that requires a nop, I can't find it. Approved. ------------- PR: https://git.openjdk.java.net/jdk/pull/8341 From kvn at openjdk.java.net Mon May 2 22:44:23 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Mon, 2 May 2022 22:44:23 GMT Subject: RFR: 8285378: Remove unnecessary nop for C1 exception and deopt handler In-Reply-To: <8SyLV_zXQ5gz0T7LsxjDmRf8BHTbScsFxSZkc8krxpY=.ebc5d1a6-6f3c-4e1d-b1c2-f5d279fbefff@github.com> References: <8SyLV_zXQ5gz0T7LsxjDmRf8BHTbScsFxSZkc8krxpY=.ebc5d1a6-6f3c-4e1d-b1c2-f5d279fbefff@github.com> Message-ID: On Thu, 21 Apr 2022 14:09:08 GMT, Boris Ulasevich wrote: > Each C1 method has two nops in the code body.
They originally separated the exception/deopt handler block from the code body to fix a "bug 5/14/1999". Now Exception Handler and Deopt Handler are generated in a separate CodeSegment and these nops in the code body don't really help anyone. > > I checked jtreg tests on the following platforms: > - x86 > - ppc > - arm32 > - aarch64 > > I would be grateful if someone could check my changes on the riscv and s390 platforms. > > > [Verified Entry Point] > 0x0000ffff7c749d40: nop > 0x0000ffff7c749d44: sub x9, sp, #0x20, lsl #12 > 0x0000ffff7c749d48: str xzr, [x9] > 0x0000ffff7c749d4c: sub sp, sp, #0x40 > 0x0000ffff7c749d50: stp x29, x30, [sp, #48] > 0x0000ffff7c749d54: and w0, w2, #0x1 > 0x0000ffff7c749d58: strb w0, [x1, #12] > 0x0000ffff7c749d5c: dmb ishst > 0x0000ffff7c749d60: ldp x29, x30, [sp, #48] > 0x0000ffff7c749d64: add sp, sp, #0x40 > 0x0000ffff7c749d68: ldr x8, [x28, #808] ; {poll_return} > 0x0000ffff7c749d6c: cmp sp, x8 > 0x0000ffff7c749d70: b.hi 0x0000ffff7c749d78 // b.pmore > 0x0000ffff7c749d74: ret > # emit_slow_case_stubs > 0x0000ffff7c749d78: adr x8, 0x0000ffff7c749d68 ; {internal_word} > 0x0000ffff7c749d7c: str x8, [x28, #832] > 0x0000ffff7c749d80: b 0x0000ffff7c697480 ; {runtime_call SafepointBlob} > # Excessive nops: Exception Handler and Deopt Handler prolog > 0x0000ffff7c749d84: nop <---------------------------------------------------------------- > 0x0000ffff7c749d88: nop <---------------------------------------------------------------- > # Unwind handler: the handler to remove the activation from the stack and dispatch to the caller. 
> 0x0000ffff7c749d8c: ldr x0, [x28, #968] > 0x0000ffff7c749d90: str xzr, [x28, #968] > 0x0000ffff7c749d94: str xzr, [x28, #976] > 0x0000ffff7c749d98: ldp x29, x30, [sp, #48] > 0x0000ffff7c749d9c: add sp, sp, #0x40 > 0x0000ffff7c749da0: b 0x0000ffff7c73e000 ; {runtime_call unwind_exception Runtime1 stub} > # Stubs alignment > 0x0000ffff7c749da4: .inst 0x00000000 ; undefined > 0x0000ffff7c749da8: .inst 0x00000000 ; undefined > 0x0000ffff7c749dac: .inst 0x00000000 ; undefined > 0x0000ffff7c749db0: .inst 0x00000000 ; undefined > 0x0000ffff7c749db4: .inst 0x00000000 ; undefined > 0x0000ffff7c749db8: .inst 0x00000000 ; undefined > 0x0000ffff7c749dbc: .inst 0x00000000 ; undefined > [Exception Handler] > 0x0000ffff7c749dc0: bl 0x0000ffff7c740d00 ; {no_reloc} > 0x0000ffff7c749dc4: dcps1 #0xdeae > 0x0000ffff7c749dc8: .inst 0x853828d8 ; undefined > 0x0000ffff7c749dcc: .inst 0x0000ffff ; undefined > [Deopt Handler Code] > 0x0000ffff7c749dd0: adr x30, 0x0000ffff7c749dd0 > 0x0000ffff7c749dd4: b 0x0000ffff7c6977c0 ; {runtime_call DeoptimizationBlob} The comment and `nop()` call existed from day one of creation `c1_LIRAssembler_x86.cpp` file. ------------- Marked as reviewed by kvn (Reviewer). PR: https://git.openjdk.java.net/jdk/pull/8341 From jiefu at openjdk.java.net Mon May 2 22:46:30 2022 From: jiefu at openjdk.java.net (Jie Fu) Date: Mon, 2 May 2022 22:46:30 GMT Subject: RFR: 8286013: Incorrect test configurations for compiler/stable/TestStableShort.java In-Reply-To: References: <_Ac0IBXRquCHlS4ejEqh6zUWp8IzWLgLa9R3h334sVc=.29336c90-6772-48f2-aa98-21c44f10ea5f@github.com> Message-ID: On Mon, 2 May 2022 14:35:35 GMT, Aleksey Shipilev wrote: >> Hi all, >> >> The four test configurations for `compiler/stable/TestStableShort.java` are the same. 
>> >> /* >> * @test TestStableShort >> * @summary tests on stable fields and arrays >> * @library /test/lib / >> * @modules java.base/jdk.internal.misc >> * @modules java.base/jdk.internal.vm.annotation >> * @build sun.hotspot.WhiteBox >> * >> * @run main/bootclasspath/othervm -XX:+UnlockDiagnosticVMOptions -XX:+WhiteBoxAPI -Xcomp >> * -XX:CompileOnly=::get,::get1,::get2,::get3,::get4 >> * -XX:-TieredCompilation >> * -XX:+FoldStableValues >> * compiler.stable.TestStableShort >> * @run main/bootclasspath/othervm -XX:+UnlockDiagnosticVMOptions -XX:+WhiteBoxAPI -Xcomp >> * -XX:CompileOnly=::get,::get1,::get2,::get3,::get4 >> * -XX:-TieredCompilation >> * -XX:+FoldStableValues >> * compiler.stable.TestStableShort >> * >> * @run main/bootclasspath/othervm -XX:+UnlockDiagnosticVMOptions -XX:+WhiteBoxAPI -Xcomp >> * -XX:CompileOnly=::get,::get1,::get2,::get3,::get4 >> * -XX:-TieredCompilation >> * -XX:+FoldStableValues >> * compiler.stable.TestStableShort >> * @run main/bootclasspath/othervm -XX:+UnlockDiagnosticVMOptions -XX:+WhiteBoxAPI -Xcomp >> * -XX:CompileOnly=::get,::get1,::get2,::get3,::get4 >> * -XX:-TieredCompilation >> * -XX:+FoldStableValues >> * compiler.stable.TestStableShort >> */ >> >> >> I believe this is a copy-paste mistake. >> Let's fix it. >> >> The patch just follows the test configurations in TestStable{Byte/Char/Int/Long/Float/Double/Object}.java >> >> Thanks. >> Best regards, >> Jie > > Yes, this looks like a copy-paste error. Thanks @shipilev and @TobiHartmann for the review. 
------------- PR: https://git.openjdk.java.net/jdk/pull/8503 From jiefu at openjdk.java.net Mon May 2 22:46:32 2022 From: jiefu at openjdk.java.net (Jie Fu) Date: Mon, 2 May 2022 22:46:32 GMT Subject: Integrated: 8286013: Incorrect test configurations for compiler/stable/TestStableShort.java In-Reply-To: <_Ac0IBXRquCHlS4ejEqh6zUWp8IzWLgLa9R3h334sVc=.29336c90-6772-48f2-aa98-21c44f10ea5f@github.com> References: <_Ac0IBXRquCHlS4ejEqh6zUWp8IzWLgLa9R3h334sVc=.29336c90-6772-48f2-aa98-21c44f10ea5f@github.com> Message-ID: On Mon, 2 May 2022 14:08:57 GMT, Jie Fu wrote: > Hi all, > > The four test configurations for `compiler/stable/TestStableShort.java` are the same. > > /* > * @test TestStableShort > * @summary tests on stable fields and arrays > * @library /test/lib / > * @modules java.base/jdk.internal.misc > * @modules java.base/jdk.internal.vm.annotation > * @build sun.hotspot.WhiteBox > * > * @run main/bootclasspath/othervm -XX:+UnlockDiagnosticVMOptions -XX:+WhiteBoxAPI -Xcomp > * -XX:CompileOnly=::get,::get1,::get2,::get3,::get4 > * -XX:-TieredCompilation > * -XX:+FoldStableValues > * compiler.stable.TestStableShort > * @run main/bootclasspath/othervm -XX:+UnlockDiagnosticVMOptions -XX:+WhiteBoxAPI -Xcomp > * -XX:CompileOnly=::get,::get1,::get2,::get3,::get4 > * -XX:-TieredCompilation > * -XX:+FoldStableValues > * compiler.stable.TestStableShort > * > * @run main/bootclasspath/othervm -XX:+UnlockDiagnosticVMOptions -XX:+WhiteBoxAPI -Xcomp > * -XX:CompileOnly=::get,::get1,::get2,::get3,::get4 > * -XX:-TieredCompilation > * -XX:+FoldStableValues > * compiler.stable.TestStableShort > * @run main/bootclasspath/othervm -XX:+UnlockDiagnosticVMOptions -XX:+WhiteBoxAPI -Xcomp > * -XX:CompileOnly=::get,::get1,::get2,::get3,::get4 > * -XX:-TieredCompilation > * -XX:+FoldStableValues > * compiler.stable.TestStableShort > */ > > > I believe this is a copy-paste mistake. > Let's fix it. 
> > The patch just follows the test configurations in TestStable{Byte/Char/Int/Long/Float/Double/Object}.java > > Thanks. > Best regards, > Jie This pull request has now been integrated. Changeset: 3420a1aa Author: Jie Fu URL: https://git.openjdk.java.net/jdk/commit/3420a1aa70c99b502368ef3f0edc6acea7a2bf1c Stats: 4 lines in 1 file changed: 0 ins; 0 del; 4 mod 8286013: Incorrect test configurations for compiler/stable/TestStableShort.java Reviewed-by: shade, thartmann ------------- PR: https://git.openjdk.java.net/jdk/pull/8503 From duke at openjdk.java.net Mon May 2 23:36:28 2022 From: duke at openjdk.java.net (aamarsh) Date: Mon, 2 May 2022 23:36:28 GMT Subject: RFR: 8282024: add EscapeAnalysis statistics under PrintOptoStatistics [v6] In-Reply-To: <0weQWowrvgwLelsSjB30Tn4DPy6SbJrVDzPhJLONzOQ=.7a9376dd-216f-4cce-86ff-240f88b5f143@github.com> References: <1AOS3uAS-1QSWGshTTP0QlQGRvcBD21J863mZj7G7AE=.0de7fae6-8a67-49ae-b4e5-1378a7b2908f@github.com> <0weQWowrvgwLelsSjB30Tn4DPy6SbJrVDzPhJLONzOQ=.7a9376dd-216f-4cce-86ff-240f88b5f143@github.com> Message-ID: On Sun, 10 Apr 2022 23:23:00 GMT, Xin Liu wrote: >> aamarsh has refreshed the contents of this pull request, and previous commits have been removed. The incremental views will show differences compared to the previous content of the PR. The pull request contains one new commit since the last revision: >> >> adding escape analysis and scalar replacement statistics > > src/hotspot/share/opto/compile.cpp line 2202: > >> 2200: >> 2201: #ifndef PRODUCT >> 2202: congraph()->update_escape_state(Atomic::load(&PhaseMacroExpand::_objs_scalar_replaced_counter) - _prev_scalar_replaced); > > I understand that you would like to add scalarized objects back to the snapshot, but this _objs_scalar_replaced_counter is a static counter as well. Two consecutive atomic loads are not helpful here because other C2 compiler threads can update it. 
> > I think we can add a member ConnectionGraph::_prev_scalar_replaced and increment it in mexp.eliminate_macro_nodes(); we keep _objs_scalar_replaced_counter but only atomically accumulate it at the end of the ME phase. @navyxliu I followed the general idea of your advice and created _local_scalar_replaced, which gets added in compiler.cpp. > src/hotspot/share/opto/escape.cpp line 116: > >> 114: invocation = C->congraph()->_invocation + 1; >> 115: #ifndef PRODUCT >> 116: // Reset counters when do_analysis is called again so objects are not double counted > > This does not look right either. 3 atomic words can't give you a safe transaction. > I understand that you want to dedup across Iterative EAs. Snapshot is certainly a general solution but more complex. > > In our case, I feel all we need is to compensate for those Java objects which have been scalarized in previous iterations. One member, _prev_scalar_replaced, which remembers the number of scalarized objects, seems good enough. You can carry over this variable from the old to the new ConnectionGraph and add it back when the EA iterations end. @navyxliu Again, to avoid this I added local counters to the compiler that reset with every do_analysis call and, once EA is completed and the while loop has executed, they are atomically added to the static counters. > src/hotspot/share/opto/escape.cpp line 3795: > >> 3793: >> 3794: void ConnectionGraph::print_statistics() { >> 3795: tty->print_cr("No escape: %d", Atomic::load(&_no_escape_counter)); > > This is just my suggestion. I would say that we keep each print_statistics() method a one-liner. > It will ease parsers. In Compile::print_statistics, most of them follow the pattern: phase: counter1, counter2, ... > ``` > tty->print_cr("Peephole: peephole rules applied: %d", _total_peepholes); @navyxliu Fixed! And I included a new example of how it prints above!
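The counter scheme described in the replies above (plain per-compilation tallies that are reset on every analysis pass and folded into the shared counters once at the end) might look roughly like the following. This is only a minimal standalone sketch, not the code in the PR: `CompilationSketch`, `EscapeCounts`, `publish` and the `g_*` globals are hypothetical names, and `std::atomic` stands in for HotSpot's own `Atomic` helpers.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical stand-ins for the shared (static) statistics counters
// that every compiler thread contributes to.
static std::atomic<int> g_no_escape{0};
static std::atomic<int> g_arg_escape{0};
static std::atomic<int> g_global_escape{0};

// Plain per-compilation tallies; only the owning thread touches them,
// so no atomics are needed while an analysis pass is running.
struct EscapeCounts {
  int no_escape;
  int arg_escape;
  int global_escape;
};

// Hypothetical model of a single compilation (not a HotSpot class).
class CompilationSketch {
 public:
  // Each analysis pass recounts from scratch, so the tallies of the
  // previous pass are overwritten instead of being counted twice.
  void do_analysis(const EscapeCounts& found_this_pass) {
    assert(found_this_pass.no_escape >= 0 &&
           found_this_pass.arg_escape >= 0 &&
           found_this_pass.global_escape >= 0);
    _local = found_this_pass;
  }

  // Called once, after the last pass: fold the final tallies into the
  // shared counters with one atomic add per counter, then clear them.
  void publish() {
    g_no_escape.fetch_add(_local.no_escape, std::memory_order_relaxed);
    g_arg_escape.fetch_add(_local.arg_escape, std::memory_order_relaxed);
    g_global_escape.fetch_add(_local.global_escape, std::memory_order_relaxed);
    _local = EscapeCounts{0, 0, 0};
  }

 private:
  EscapeCounts _local{0, 0, 0};
};
```

In this sketch a compilation that runs several passes contributes only the tallies of its last pass, and other threads never observe a half-updated intermediate state of a single compilation's local counts. The three shared counters are still updated by three separate atomic adds, so a reader may see them momentarily out of step with each other, which seems acceptable for statistics output.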
------------- PR: https://git.openjdk.java.net/jdk/pull/8019 From jrose at openjdk.java.net Mon May 2 23:50:22 2022 From: jrose at openjdk.java.net (John R Rose) Date: Mon, 2 May 2022 23:50:22 GMT Subject: RFR: 8283894: Intrinsify compress and expand bits on x86 In-Reply-To: References: Message-ID: On Mon, 2 May 2022 08:19:53 GMT, Jatin Bhateja wrote: > Summary of changes: > > - Patch intrinsifies following newly added Java SE APIs > - Integer.compress > - Integer.expand > - Long.compress > - Long.expand > > - Adds C2 IR nodes and corresponding ideal transformations for new operations. > - We see around ~10x performance speedup due to intrinsification over X86 target. > - Adds an IR framework based test to validate newly introduced IR transformations. > > Kindly review and share your feedback. > > Best Regards, > Jatin src/hotspot/share/opto/intrinsicnode.cpp line 213: > 211: } > 212: > 213: Node* ExpandBitsNode::Identity(PhaseGVN* phase) { I also suggest adding a boolean to `compress_expand_identity` if you add rules which don't apply to both equally. Here is a possible type-propagation logic for compress and expand:

    let SIGN_BIT = (((IntOrLong)-1)>>>1)+1 (bit 31 or 63)
    let MAX_POS = (((IntOrLong)-1)>>>1)
    let BITS = 1+bitCount(MAX_POS) (32 or 64)
    if (both x, m are con) {
      // maybe use these rules, by porting the Java code to C++
      compress(CON[x], CON[m]) = CON[portable_compress(x,m)]
      expand(CON[x], CON[m]) = CON[portable_expand(x,m)]
      // see also https://stackoverflow.com/questions/38938911/portable-efficient-alternative-to-pdep-without-using-bmi2
    } else if (m is CON[m] && m != -1) {
      //compress(x, -1) = x //identity handled elsewhere
      //expand(x, -1) = x //identity handled elsewhere
      let bitc = bitCount(m)
      LO[ compress(x, CON[m]) ] = 0 //sign bit is never set
      HI[ compress(x, CON[m]) ] = ((1L<<bitc)-1)
      LO[ expand(x, CON[m]) ] = (m >= 0) ? 0 : SIGN_BIT //sign bit might be set alone
      HI[ expand(x, CON[m]) ] = (m >= 0) ? m : m ^ SIGN_BIT
      // could improve a little by looking at TYPE[x], but do not bother
    } else {
      // estimate maximum possible weight of m (in 0..63)
      let maxbitc = BITS
      if (LO[m] < 0 && HI[m] >= -1) // could be -1
      else maxbitc = BITS-1
      if (LO[m] < 0 || HI[m] == MAX_POS) // <0 or maxint
      else maxbitc = BITS-1 - numberOfLeadingZeros(HI[m])
      LO[ compress(x, m) ] = (maxbitc == 64 && LO[x] < 0) ? SIGN_BIT : 0
      HI[ compress(x, m) ] = (maxbitc >= 63) ? HI[x] : MIN(HI[x], (1L<<maxbitc)-1)
      LO[ expand(x, m) ] = (LO[m] >= 0) ? 0 : SIGN_BIT
      HI[ expand(x, m) ] = (LO[m] >= 0) ? HI[m] : MAX_POS
    }

The operands of compress and expand are inherently unsigned bitmasks, so the signed type system of C2 gets in the way. In the future, a somewhat more thorough job could be done if we had bitwise types as well in C2. For what that would mean, see https://bugs.openjdk.java.net/browse/JDK-8001436 ------------- PR: https://git.openjdk.java.net/jdk/pull/8498 From dlong at openjdk.java.net Mon May 2 23:57:30 2022 From: dlong at openjdk.java.net (Dean Long) Date: Mon, 2 May 2022 23:57:30 GMT Subject: RFR: 8285378: Remove unnecessary nop for C1 exception and deopt handler In-Reply-To: References: <8SyLV_zXQ5gz0T7LsxjDmRf8BHTbScsFxSZkc8krxpY=.ebc5d1a6-6f3c-4e1d-b1c2-f5d279fbefff@github.com> Message-ID: On Mon, 2 May 2022 22:40:27 GMT, Vladimir Kozlov wrote: > The comment and nop() call existed from day one of creation c1_LIRAssembler_x86.cpp file. That's what I found too, looking back at ancient c1_LIRAssembler_i486.cpp in the CodeManager SCCS tree.
------------- PR: https://git.openjdk.java.net/jdk/pull/8341 From jrose at openjdk.java.net Tue May 3 00:08:19 2022 From: jrose at openjdk.java.net (John R Rose) Date: Tue, 3 May 2022 00:08:19 GMT Subject: RFR: 8283894: Intrinsify compress and expand bits on x86 In-Reply-To: References: Message-ID: On Mon, 2 May 2022 08:19:53 GMT, Jatin Bhateja wrote: > Summary of changes: > > - Patch intrinsifies following newly added Java SE APIs > - Integer.compress > - Integer.expand > - Long.compress > - Long.expand > > - Adds C2 IR nodes and corresponding ideal transformations for new operations. > - We see around ~10x performance speedup due to intrinsification over X86 target. > - Adds an IR framework based test to validate newly introduced IR transformations. > > Kindly review and share your feedback. > > Best Regards, > Jatin src/hotspot/share/opto/intrinsicnode.cpp line 155: > 153: return new AndLNode(compr, src->in(1)); > 154: } > 155: } I think a further rule for `compress(m, m)` could be in order. compress(m, m) = m==-1 ? m : (1L << PopCount[IL](m))-1 This should be its own path through `Ideal`, not special logic at this particular point. Don't use it unless `Matcher::match_rule_supported(Op_PopCount[IL])` is true. ------------- PR: https://git.openjdk.java.net/jdk/pull/8498 From xliu at openjdk.java.net Tue May 3 01:28:03 2022 From: xliu at openjdk.java.net (Xin Liu) Date: Tue, 3 May 2022 01:28:03 GMT Subject: RFR: 8285976: compiler/exceptions/OptimizeImplicitExceptions.java can't pass with -XX:+DeoptimizeALot Message-ID: Disable DeoptimizeALot and DeoptimizeRandom, two develop options, when running this test, for stability.
------------- Commit messages: - 8285976: compiler/exceptions/OptimizeImplicitExceptions.java can't pass with -XX:+DeoptimizeALot Changes: https://git.openjdk.java.net/jdk/pull/8513/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=8513&range=00 Issue: https://bugs.openjdk.java.net/browse/JDK-8285976 Stats: 10 lines in 1 file changed: 10 ins; 0 del; 0 mod Patch: https://git.openjdk.java.net/jdk/pull/8513.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8513/head:pull/8513 PR: https://git.openjdk.java.net/jdk/pull/8513 From dlong at openjdk.java.net Tue May 3 01:53:45 2022 From: dlong at openjdk.java.net (Dean Long) Date: Tue, 3 May 2022 01:53:45 GMT Subject: RFR: 8285885: Replay compilation fails with assert(is_valid()) failed: check invoke Message-ID: This change makes replay more tolerant so it will fail gracefully instead of assert if it can't find an invoke bytecode at the desired bci. ------------- Commit messages: - handle bytecode lookup failures Changes: https://git.openjdk.java.net/jdk/pull/8514/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=8514&range=00 Issue: https://bugs.openjdk.java.net/browse/JDK-8285885 Stats: 6 lines in 1 file changed: 5 ins; 0 del; 1 mod Patch: https://git.openjdk.java.net/jdk/pull/8514.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8514/head:pull/8514 PR: https://git.openjdk.java.net/jdk/pull/8514 From thartmann at openjdk.java.net Tue May 3 06:21:11 2022 From: thartmann at openjdk.java.net (Tobias Hartmann) Date: Tue, 3 May 2022 06:21:11 GMT Subject: RFR: 8285885: Replay compilation fails with assert(is_valid()) failed: check invoke In-Reply-To: References: Message-ID: On Tue, 3 May 2022 01:46:22 GMT, Dean Long wrote: > This change makes replay more tolerant so it will fail gracefully instead of assert if it can't find an invoke bytecode at the desired bci. Looks good. Thanks for fixing this! ------------- Marked as reviewed by thartmann (Reviewer). 
PR: https://git.openjdk.java.net/jdk/pull/8514

From duke at openjdk.java.net Tue May 3 07:00:53 2022
From: duke at openjdk.java.net (Emanuel Peter)
Date: Tue, 3 May 2022 07:00:53 GMT
Subject: RFR: 8282555: Missing memory edge when spilling MoveF2I, MoveD2L etc [v2]
In-Reply-To:
References:
Message-ID:

> Update:
> After the inputs from @jatin-bhateja, and verifying with @vnkozlov and @TobiHartmann, I have implemented a much simpler fix:
> Whenever there is no pre-allocated space before the inputs for the memory edge, we simply add the memory edge after the inputs.
> 
> This is a bit of an ad-hoc fix, but it is much simpler than the other two options. Changing the `.ad` files requires much more work. Adding `stackSlot` to `MatchNode::needs_ideal_memory_edge` would also be an ad-hoc fix.
> 
> The added test still fails with other changes in mainline, and passes with my new fix. Ran it 50 times to verify.
> Ran larger test suite, all passed.
> 
> ---------
> 
> In `PhaseChaitin::fixup_spills` we decide if we need a memory edge when reading from a spilled register.
> Unfortunately, for `MoveF2I`, `MoveD2L` etc we do not add such memory edges.
> This can lead to reversed scheduling, where we read from a `stackSlot` before we wrote to it, leading to wrong results.
> (This happens intermittently, but the regression test did reproduce it at about a 10% rate)
> 
> In `PhaseChaitin::fixup_spills` we decided if such a memory edge needs to be added by comparing `oper_input_base()` of the node before spilling and after spilling. If `oper_input_base()` of the `mach` node (before spilling) is 1, this means that node does not have a memory edge yet. And if `oper_input_base()` of the `cisc` node (after spilling) is 2, this means it needs a memory edge. In all spill cases I could find, the value is 1 and 2 respectively, except for MoveF2I etc, there it is 1 and 1 respectively, thus the memory edge was omitted.
> 
> The values of `oper_input_base()` are determined in `InstructForm::oper_input_base`, where we query `MatchNode::needs_ideal_memory_edge`. This function checks if there is an `_opType` in the recursive match structure of this mach node, that matches one of a list of nodes (`StoreI, StoreF, ... LoadI, StoreF, ... etc.`). Unfortunately, MoveF2I etc do not have such a match. Instead of a `StoreF/LoadF`, they used `stackSlotF` (which is not recognized in `MatchNode::needs_ideal_memory_edge`). So it thinks there is no need for a memory edge.
> 
> We saw 2 options to fix this issue:
> 1) add `stackSlotI/L/P/D/F` to `MatchNode::needs_ideal_memory_edge`. However, this seems to be an inconsistent solution. The other items in that list are nodes, `stackSlot` is not. And other operations (like `addI, testI, etc.`) all use `LoadI/StoreI`, which is more generic (for heap and stack).
> 2) Change the arguments and match rules to not use `stackSlot`, but `memory` arguments and `LoadI/StoreI` nodes. This is a consistent and more generic solution (the MoveF2I operation could now be used not just for stack spilling but also reading/writing from/to memory).
> 
> I picked option 2.
> Further, I now assume that we can always add such a memory edge when reading from a spilled register. This assumption did not get violated in my more extensive testing.
> 
> While the regression test only failed about 10% due to this bug, the assert I added verifying that we add these memory edges did trigger 100% before I applied the fix in x86_64.ad. This means the memory edge was missing every time, just the scheduling varied and we were lucky most of the time.
> 
> There are a few open points for discussion:
> 
> - `loadSSI/L/P/F/D` still uses `stackSlot`. I have never observed that this operation gets its register spilled. But I still wonder if we should not have this operation use `memory/LoadI` instead of `stackSlotI`. I think we might even be able to simply remove `loadSSI` because it is already covered by what `loadI` does. (Update: Tests suggest `loadSSX` can be removed from `x86_64.ad`)
> - So far I have only applied my fix to `x86_64.ad` -> we probably want to apply it to all platforms.
> - Other platforms use `stackSlot` more often, for example in `x86_32.ad`. It may well be that some of these operations could also be spilled, which would probably also lead to missing memory edges. I wonder if we should maybe remove all occurrences of `stackSlot` in the `ad` files, or if we should still add `stackSlot` to `MatchNode::needs_ideal_memory_edge` to ensure we always can add the memory edges.

Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision:

  Comments adapted to review

-------------

Changes:
  - all: https://git.openjdk.java.net/jdk/pull/7889/files
  - new: https://git.openjdk.java.net/jdk/pull/7889/files/333a18d8..8c7ef702

Webrevs:
 - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=7889&range=01
 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=7889&range=00-01

  Stats: 3 lines in 1 file changed: 1 ins; 0 del; 2 mod
  Patch: https://git.openjdk.java.net/jdk/pull/7889.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk pull/7889/head:pull/7889

PR: https://git.openjdk.java.net/jdk/pull/7889

From duke at openjdk.java.net Tue May 3 07:00:55 2022
From: duke at openjdk.java.net (Emanuel Peter)
Date: Tue, 3 May 2022 07:00:55 GMT
Subject: RFR: 8282555: Missing memory edge when spilling MoveF2I, MoveD2L etc [v2]
In-Reply-To:
References:
Message-ID:

On Mon, 2 May 2022 20:36:05 GMT, Vladimir Kozlov wrote:

>> Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision:
>> 
>>   Comments adapted to review
> 
> src/hotspot/share/opto/chaitin.cpp line 1732:
> 
>> 1730:       } else {
>> 1731:         // In some rare cases:
>> 1732:         // There is no space reserved for a memory edge before the inputs.
> 
> It sounds like random case but you have very specific case: Mach instructions which uses `stackSlotX`. I think these 2 lines should say that. Something like:
> 
>     // There is no space reserved for a memory edge before the inputs for
>     // instructions which have "stackSlotX" parameter instead of "memory".
>     // For example, "MoveF2I_stack_reg".

@vnkozlov thanks for the suggestion, I put it in.

-------------

PR: https://git.openjdk.java.net/jdk/pull/7889

From thartmann at openjdk.java.net Tue May 3 07:14:30 2022
From: thartmann at openjdk.java.net (Tobias Hartmann)
Date: Tue, 3 May 2022 07:14:30 GMT
Subject: RFR: 8282555: Missing memory edge when spilling MoveF2I, MoveD2L etc [v2]
In-Reply-To:
References:
Message-ID:

On Tue, 3 May 2022 07:00:53 GMT, Emanuel Peter wrote:

> Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision:
> 
>   Comments adapted to review

Nice analysis. Looks good to me.

test/hotspot/jtreg/compiler/intrinsics/unsafe/HeapByteBufferTest.java line 45:

> 43:  * @library /test/lib
> 44:  *
> 45:  * @run main/othervm -Djdk.test.lib.random.seed=42

Please add an additional run without a fixed seed.

-------------

Marked as reviewed by thartmann (Reviewer).

PR: https://git.openjdk.java.net/jdk/pull/7889

From rcastanedalo at openjdk.java.net Tue May 3 07:29:24 2022
From: rcastanedalo at openjdk.java.net (Roberto Castañeda Lozano)
Date: Tue, 3 May 2022 07:29:24 GMT
Subject: RFR: 8283684: IGV: speed up filter application
In-Reply-To:
References:
Message-ID:

On Mon, 2 May 2022 08:24:41 GMT, Roberto Castañeda Lozano wrote:

> good

Thanks for reviewing, Vladimir!

-------------

PR: https://git.openjdk.java.net/jdk/pull/8073

From rcastanedalo at openjdk.java.net Tue May 3 07:31:23 2022
From: rcastanedalo at openjdk.java.net (Roberto Castañeda Lozano)
Date: Tue, 3 May 2022 07:31:23 GMT
Subject: Integrated: 8280568: IGV: Phi inputs and pinned nodes are not scheduled correctly
In-Reply-To: <-iYiPRIR5iUEQyHWbTW0j2wwWjPS7YSfsp_ikHPAW54=.fa427507-e74d-4cfa-bd49-8dc7c935346c@github.com>
References: <-iYiPRIR5iUEQyHWbTW0j2wwWjPS7YSfsp_ikHPAW54=.fa427507-e74d-4cfa-bd49-8dc7c935346c@github.com>
Message-ID: <5tBpkn1zwxKVneOW4WWthY9su-h-Qxz_ZaKx0tSoUW0=.f52e9f92-2789-40c3-911a-faebc8301432@github.com>

On Wed, 16 Feb 2022 11:05:17 GMT, Roberto Castañeda Lozano wrote:

> This changeset improves the accuracy of IGV's schedule approximation algorithm by
> 
> 1) scheduling pinned nodes in the same block as their corresponding control nodes (or in the immediate successor block for nodes pinned to block projections); and
> 2) scheduling phi input nodes above the phi block, in their corresponding control path.
> 
> The combined effect of these scheduling improvements can be seen in the example below. In the current version of IGV **(before)**, `135 ClearArray` is wrongly scheduled in the same block as its output phi node (`91 Phi`). After this changeset **(after)**, `135 ClearArray` is correctly scheduled above the phi node, in its corresponding control path. Since `135 ClearArray` is pinned to the block projection `151 True`, a new block is created between `151 True` and `91 Phi` to accommodate it.
> 
> ![fix](https://user-images.githubusercontent.com/8792647/165956029-8e8bae8c-d836-444c-8861-2c13f52c22c6.png)
> 
> Additionally, the changeset introduces checks on graph invariants that are assumed by scheduling approximation (e.g. each block projection has a single control successor), warning the IGV user if these invariants are broken. Warning and gracefully degrading the approximated schedule is preferred to just failing since one of IGV's main use cases is debugging graphs which might be ill-formed. The warnings are reported both textually in the IGV log and visually for each node, if the corresponding filter ("Show node warnings") is active:
> 
> ![warning](https://user-images.githubusercontent.com/8792647/165957171-50c2bcb9-0247-45cc-b806-c4e811996ce4.png)
> 
> Node warnings are implemented as a general filter and can be used in custom filters for other purposes, for example highlighting nodes that match a certain property of interest.
> 
> #### Testing
> 
> ##### Functionality
> 
> - Tested manually that phi inputs and pinned nodes are scheduled correctly for a few selected graphs (including the reported one).
> 
> - Tested automatically that scheduling tens of thousands of graphs (by instrumenting IGV to schedule parsed graphs eagerly and running `java -Xcomp -XX:-TieredCompilation -XX:PrintIdealGraphLevel=4`) does not trigger any assertion failure and does not warn with the message "Phi input that does not dominate the phi's input block".
> 
> ##### Performance
> 
> Measured that the scheduling time is not slowed down for a selection of 89 large graphs (2511-7329 nodes). The performance results are attached (note that each measurement in the sheet corresponds to the median of ten runs).

This pull request has now been integrated.

Changeset: 7a483517
Author:    Roberto Castañeda Lozano
URL:       https://git.openjdk.java.net/jdk/commit/7a4835178d58b132773fec77b923095e36d1dcec
Stats:     431 lines in 8 files changed: 384 ins; 23 del; 24 mod

8280568: IGV: Phi inputs and pinned nodes are not scheduled correctly

Reviewed-by: kvn, thartmann

-------------

PR: https://git.openjdk.java.net/jdk/pull/7493

From duke at openjdk.java.net Tue May 3 07:33:13 2022
From: duke at openjdk.java.net (Emanuel Peter)
Date: Tue, 3 May 2022 07:33:13 GMT
Subject: RFR: 8282555: Missing memory edge when spilling MoveF2I, MoveD2L etc [v3]
In-Reply-To:
References:
Message-ID:

Emanuel Peter has updated the pull request incrementally with two additional commits since the last revision:

 - Adapted comments again
 - compacted comment, added test without -Djdk.test.lib.random.seed=42

-------------

Changes:
  - all: https://git.openjdk.java.net/jdk/pull/7889/files
  - new: https://git.openjdk.java.net/jdk/pull/7889/files/8c7ef702..15535e8e

Webrevs:
 - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=7889&range=02
 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=7889&range=01-02

  Stats: 8 lines in 2 files changed: 3 ins; 1 del; 4 mod
  Patch: https://git.openjdk.java.net/jdk/pull/7889.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk pull/7889/head:pull/7889

PR: https://git.openjdk.java.net/jdk/pull/7889

From thartmann at openjdk.java.net Tue May 3 07:33:14 2022
From: thartmann at openjdk.java.net (Tobias Hartmann)
Date: Tue, 3 May 2022 07:33:14 GMT
Subject: RFR: 8282555: Missing memory edge when spilling MoveF2I, MoveD2L etc [v3]
In-Reply-To:
References:
Message-ID:

On Tue, 3 May 2022 07:30:12 GMT, Emanuel Peter wrote:

> Emanuel Peter has updated the pull request incrementally with two additional commits since the last revision:
> 
>  - Adapted comments again
>  - compacted comment, added test without -Djdk.test.lib.random.seed=42

Marked as reviewed by thartmann (Reviewer).

-------------

PR: https://git.openjdk.java.net/jdk/pull/7889

From duke at openjdk.java.net Tue May 3 07:33:14 2022
From: duke at openjdk.java.net (Emanuel Peter)
Date: Tue, 3 May 2022 07:33:14 GMT
Subject: RFR: 8282555: Missing memory edge when spilling MoveF2I, MoveD2L etc [v2]
In-Reply-To:
References:
Message-ID:

On Tue, 3 May 2022 07:10:59 GMT, Tobias Hartmann wrote:

>> Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision:
>> 
>>   Comments adapted to review
> 
> test/hotspot/jtreg/compiler/intrinsics/unsafe/HeapByteBufferTest.java line 45:
> 
>> 43:  * @library /test/lib
>> 44:  *
>> 45:  * @run main/othervm -Djdk.test.lib.random.seed=42
> 
> Please add an additional run without a fixed seed.

@TobiHartmann Done and tested

-------------

PR: https://git.openjdk.java.net/jdk/pull/7889

From thartmann at openjdk.java.net Tue May 3 07:33:14 2022
From: thartmann at openjdk.java.net (Tobias Hartmann)
Date: Tue, 3 May 2022 07:33:14 GMT
Subject: RFR: 8282555: Missing memory edge when spilling MoveF2I, MoveD2L etc [v2]
In-Reply-To:
References:
Message-ID: <0NOcxppKqTxVh2AHGxR8-PJbXfxuhwHjqvDmKo2RI3Y=.7d8e208f-b052-45ae-9a11-e77a48f164b1@github.com>

On Tue, 3 May 2022 07:23:39 GMT, Emanuel Peter wrote:

>> test/hotspot/jtreg/compiler/intrinsics/unsafe/HeapByteBufferTest.java line 45:
>> 
>>> 43:  * @library /test/lib
>>> 44:  *
>>> 45:  * @run main/othervm -Djdk.test.lib.random.seed=42
>> 
>> Please add an additional run without a fixed seed.
> 
> @TobiHartmann Done and tested

Thanks!

-------------

PR: https://git.openjdk.java.net/jdk/pull/7889

From rcastanedalo at openjdk.java.net Tue May 3 07:46:24 2022
From: rcastanedalo at openjdk.java.net (Roberto Castañeda Lozano)
Date: Tue, 3 May 2022 07:46:24 GMT
Subject: Integrated: 8283684: IGV: speed up filter application
In-Reply-To:
References:
Message-ID:

On Fri, 1 Apr 2022 10:19:21 GMT, Roberto Castañeda Lozano wrote:

> This change improves view creation time by creating a single JavaScript engine shared among all filters, rather than creating an engine every time a filter is applied. Since creating a JavaScript engine is a costly operation, this change speeds up view creation substantially for small and medium-sized graphs as soon as any filter is applied. This includes the default IGV configuration, where the "Color by category" filter is enabled.
> 
> #### Testing
> 
> ##### Functionality
> 
> - Tested manually applying different filter subsets and differing on a small selection of graphs, for JDK 11 and 17 (which use different versions and ways of packaging the JavaScript engine).
> 
> - Tested automatically viewing thousands of graphs with different subsets of filters enabled (by instrumenting IGV to view graphs as they are loaded and running `java -Xcomp -XX:-TieredCompilation -XX:PrintIdealGraphLevel=4`).
> 
> ##### Performance
> 
> Measured the view creation time for the default sea-of-nodes view on a selection of 94 medium-sized graphs (200-493 nodes) for different subsets of filters. Before the change, the view creation time increases roughly linearly with the number of applied filters (since an engine is created for each filter application). After the change, the view creation time remains roughly constant (even slightly decreasing) as the number of applied filters increases, yielding an average speedup of 2.4x for the default IGV configuration, and up to 8.2x when five filters are applied. The speedup is expected to diminish for larger graphs where engine creation does not dominate view creation time. The complete results are [attached](https://github.com/openjdk/jdk/files/8396784/performance-evaluation.ods) (note that each measurement in the sheet corresponds to the median of ten runs).

This pull request has now been integrated.

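The engine-sharing idea in the quoted description can be illustrated with a small sketch. This is a hypothetical model, not IGV's actual code: `CostlyEngine`, `SharedEngineSketch`, and `applyAll` are made-up names, and the string-wrapping `apply` method only stands in for running a JavaScript filter. It assumes nothing more than that engine creation is the expensive step and that one engine can serve every filter.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for a script engine that is expensive to create.
final class CostlyEngine {
    // Counts constructions so the effect of sharing is observable.
    static final AtomicInteger CREATED = new AtomicInteger();

    CostlyEngine() {
        CREATED.incrementAndGet();
    }

    // Placeholder for "run this filter on this graph".
    String apply(String filter, String graph) {
        return filter + "(" + graph + ")";
    }
}

final class SharedEngineSketch {
    private static CostlyEngine shared;

    // Create the engine lazily, once, and hand out the same instance afterwards.
    static synchronized CostlyEngine engine() {
        if (shared == null) {
            shared = new CostlyEngine();
        }
        return shared;
    }

    // Apply the filters in order, reusing the shared engine for each one.
    static String applyAll(List<String> filters, String graph) {
        String result = graph;
        for (String filter : filters) {
            result = engine().apply(filter, result);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(applyAll(List.of("color", "hide"), "g"));
        System.out.println("engines created: " + CostlyEngine.CREATED.get());
    }
}
```

In this model the number of engine constructions stays at one regardless of how many filters are applied, which is the property the measurements in the description attribute the speedup to.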
Changeset: af1ee1cc Author: Roberto Casta?eda Lozano URL: https://git.openjdk.java.net/jdk/commit/af1ee1cc5576c0b247c543510ca8be7e23d805f1 Stats: 113 lines in 3 files changed: 57 ins; 43 del; 13 mod 8283684: IGV: speed up filter application Reviewed-by: thartmann, kvn ------------- PR: https://git.openjdk.java.net/jdk/pull/8073 From duke at openjdk.java.net Tue May 3 08:48:02 2022 From: duke at openjdk.java.net (Emanuel Peter) Date: Tue, 3 May 2022 08:48:02 GMT Subject: RFR: 8283775: VM support for graph querying in debugger with BFS traversal and node filtering [v4] In-Reply-To: References: Message-ID: <9qXXj4alHvQW7_ihat0HLYnKBV_ZhylC8r1Cahx_7tc=.88e369a2-a280-4723-9532-6f4d3bd0b65e@github.com> > I implemented VM support for BFS traversal of IR nodes, where one can filter for input/output edges, and the node-type (control / data / memory). If one specifies the `target` node, we find a shortest path between `this` and `target`. With the `options` string one can easily select which node types to traverse. > > `void Node::print_bfs(const uint max_distance, Node* target, const char* options)` > > While a sufficient summary and description is in the comments above the function definition, I want to explain some useful use-cases here. > > Please let me know if you would find this helpful, or if you have any feedback to improve it. > Thanks, Emanuel > > **1. Better dump()** > The function is similar to `dump()`, we can also follow input / output edges up to a certain distance. There are a few improvements/added features: > > 1. Display the distance to the specified node. This way I can quickly assess how many nodes I have at a certain distance, and I can traverse edges from one distance to the next. The parent column shows the node one step closer to the BFS root (this). > 2. Node filtering with types taken from `type->category()` help one limit traversal to control, data or memory flow. The categories are the same as in IGV. > 3. 
Probably controversial: coloring of node types in terminal. By default it is off. May not work on all terminals. But I find it very useful to quickly separate memory, data and control - just as in IGV - but in my terminal! > 4. After matching, we create Mach nodes from the IR nodes. For most Mach nodes, there is an associated old node. Often an assert is triggered for Mach nodes, but the graph already breaks in the IR graph. It is thus very helpful to have the old node idx displayed. > 5. I also display the head node idx of the block a Mach node is scheduled for. This can be helpful if one gets bad dominance asserts, and needs to see why the blocks were scheduled in a bad way. > > Example: > > (rr) p find_node(35)->print_bfs(2, 0, "cdmox+") > No target: perform BFS. > dis par c dump > --------------------------------------------- > 0 35 d 35 CmpP === _ 34 25 [[ 36 ]] > 1 35 d 34 LoadP === _ 31 33 [[ 35 ]] > 1 35 d 25 ConP === 0 [[ 26 27 31 35 41 ]] #NULL > 2 34 m 31 StoreP === 20 27 29 25 [[ 23 34 41 42 ]] > 2 34 d 33 AddP === _ 1 12 32 [[ 34 ]] > > > Example with Mach nodes: > > (rr) p ctrl->print_bfs(4, 0, "cdmox+OB") > No target: perform BFS. > dis [head idom d] old par c dump > --------------------------------------------- > 0 159 147 6 _ 159 c 159 Region === 159 57 [[ 159 158 59 ]] > 1 147 148 5 o183 159 c 57 IfTrue === 8 [[ 159 ]] > 2 147 148 5 o182 57 c 8 jmpConU === 147 9 [[ 7 57 ]] > 3 147 148 5 _ 8 c 147 Region === 147 14 [[ 147 8 ]] > 3 147 148 5 o180 8 d 9 compUL_rReg === _ 10 13 [[ 8 ]] > 4 148 149 4 o174 147 c 14 IfTrue === 15 [[ 147 ]] > 4 147 148 5 o203 9 d 10 decL_rReg === _ 11 [[ 12 9 ]] > 4 147 148 5 o179 9 d 13 convI2L_reg_reg === _ 28 [[ 9 ]] > > > **2. Find loop body** > When I find myself in a loop, I try to localize the loop head and end, and map out at least one path between them. > `loop_end->print_bfs(20, loop_head, "cox+")` > This provides us with a shortest path, given this path has a distance of at most 20. 
> > Example: > > (rr) p find_node(158)->print_bfs(20, find_node(160), "cox+") > Find shortest path: 158 -> 160. > > Backtrace target. > dis c dump > --------------------------------------------- > 9 c 160 OuterStripMinedLoop === 160 339 159 [[ 160 358 ]] > 8 c 358 CountedLoop === 358 160 143 [[ 358 362 363 ]] > 7 c 363 If === 358 351 [[ 364 367 ]] > 6 c 364 IfTrue === 363 [[ 128 ]] > 5 c 128 If === 364 127 [[ 129 130 ]] > 4 c 129 IfTrue === 128 [[ 155 ]] > 3 c 155 CountedLoopEnd === 129 154 [[ 157 143 ]] [lt] > 2 c 157 IfFalse === 155 [[ 162 163 ]] > 1 c 162 SafePoint === 157 1 7 1 1 163 100 1 1 13 27 133 [[ 158 ]] > 0 c 158 OuterStripMinedLoopEnd === 162 156 [[ 159 227 ]] > > Example with Mach nodes: > > (rr) p ctrl->print_bfs(10, val, "cdmox-+OB") > Find shortest path: 159 -> 27. > > Backtrace target. > dis [head idom d] old e c dump > --------------------------------------------- > 2 24 1 2 o10 + d 27 MachProj === 24 [[ 19 28 4 59 95 99 118 ]] > 1 56 159 7 o239 - d 59 loadB === 159 29 27 60 [[ 55 ]] > 0 159 147 6 _ c 159 Region === 159 57 [[ 159 158 59 ]] Emanuel Peter has updated the pull request incrementally with two additional commits since the last revision: - kill trailing whitespaces - refactoring, and making edge its own colunm ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/8468/files - new: https://git.openjdk.java.net/jdk/pull/8468/files/da91b4ec..b0cd3044 Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=8468&range=03 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=8468&range=02-03 Stats: 102 lines in 1 file changed: 41 ins; 46 del; 15 mod Patch: https://git.openjdk.java.net/jdk/pull/8468.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8468/head:pull/8468 PR: https://git.openjdk.java.net/jdk/pull/8468 From rcastanedalo at openjdk.java.net Tue May 3 11:12:14 2022 From: rcastanedalo at openjdk.java.net (Roberto =?UTF-8?B?Q2FzdGHDsWVkYQ==?= Lozano) Date: Tue, 3 May 2022 11:12:14 GMT Subject: Integrated: 
8279622: C2: miscompilation of map pattern as a vector reduction In-Reply-To: References: Message-ID: On Fri, 29 Apr 2022 08:02:07 GMT, Roberto Castañeda Lozano wrote: > The node reduction flag (`Node::Flag_is_reduction`) is only valid as long as the node remains within the reduction loop in which it was originally marked. This changeset ensures that reduction nodes are unmarked as such if they are extracted out of their associated reduction loop by the peel/main/post loop transformation (`PhaseIdealLoop::insert_pre_post_loops()`). This prevents SLP from wrongly vectorizing, as parallel reductions, outer non-reduction loops to which reduction nodes have been extracted. A more detailed analysis of the failure is available in the [JBS bug report](https://bugs.openjdk.java.net/browse/JDK-8279622). > > The issue could be alternatively fixed at the IGVN level by unmarking reduction nodes as soon as they are decoupled from their corresponding phi and counted loop nodes, but the fix proposed here is simpler and less intrusive. > > The changeset also introduces an assertion at the use point (`SuperWord::transform_loop()`) to check that loops containing reduction nodes are marked as reductions. This invariant could be alternatively placed together with other assertions under `-XX:+VerifyLoopOptimizations`, but [this option is known to be broken](https://bugs.openjdk.java.net/browse/JDK-8173709). > > IR verification using the IR test framework is not feasible for the proposed test case, since the failure is triggered on an OSR compilation, [for which IR verification does not seem to be supported](https://github.com/openjdk/jdk/blob/e7c3b9de649d4b28ba16844e042afcf3c89323e5/test/hotspot/jtreg/compiler/lib/ir_framework/driver/irmatching/parser/Line.java#L56-L58). The assertion described above compensates for this limitation. > > #### Testing > > ##### Functionality > > - hs-tier1-3 (windows-x64, linux-x64, linux-aarch64, and macosx-x64; release and debug mode). 
> - hs-tier4-7 (linux-x64; debug mode). > > ##### Performance > > - No significant regression on a set of standard benchmark suites (DaCapo, SPECjbb2015, SPECjvm2008, ...) and on windows-x64, linux-x64, linux-aarch64, and macosx-x64. > - No significant difference in generated number of vector instructions when comparing the output of `compiler/vectorization` and `compiler/loopopts/superword` tests using `-XX:+TraceNewVectors` on linux-x64. This pull request has now been integrated. Changeset: 6fcd3222 Author: Roberto Castañeda Lozano URL: https://git.openjdk.java.net/jdk/commit/6fcd322258e0cce3724a4a8dc18f7802018a7cc9 Stats: 95 lines in 5 files changed: 95 ins; 0 del; 0 mod 8279622: C2: miscompilation of map pattern as a vector reduction Reviewed-by: roland, kvn, thartmann ------------- PR: https://git.openjdk.java.net/jdk/pull/8464 From stuefe at openjdk.java.net Tue May 3 11:21:19 2022 From: stuefe at openjdk.java.net (Thomas Stuefe) Date: Tue, 3 May 2022 11:21:19 GMT Subject: RFR: 8278757: [s390] Implement AES Counter Mode Intrinsic [v9] In-Reply-To: References: Message-ID: On Mon, 25 Apr 2022 14:35:28 GMT, Lutz Schmidt wrote: >> Please review (and approve, if possible) this pull request. >> >> This is an s390-only enhancement. It introduces the implementation of an AES-CTR intrinsic, making use of the specific s390 instruction for AES counter-mode encryption. >> >> Testing: SAP no longer maintains a full build and test environment for s390. Testing is therefore limited to running some test suites (SPECjbb*, SPECjvm*) manually. But: identical code is contained in SAP's commercial product and thoroughly tested in that context. No issues were uncovered. >> >> @backwaterred Could you please conduct some "official" testing for this PR? >> >> Thank you all! >> >> Note: some performance figures can be found in the JBS ticket. 
> > Lutz Schmidt has updated the pull request incrementally with one additional commit since the last revision: > > 8278757: add clarifying comments Looks good to my non-mainframey eyes. Impressive work. Cheers, Thomas ------------- Marked as reviewed by stuefe (Reviewer). PR: https://git.openjdk.java.net/jdk/pull/8142 From jbhateja at openjdk.java.net Tue May 3 13:13:18 2022 From: jbhateja at openjdk.java.net (Jatin Bhateja) Date: Tue, 3 May 2022 13:13:18 GMT Subject: RFR: 8282555: Missing memory edge when spilling MoveF2I, MoveD2L etc [v3] In-Reply-To: References: Message-ID: On Tue, 3 May 2022 07:33:13 GMT, Emanuel Peter wrote: >> Update: >> After the inputs from @jatin-bhateja, and verifying with @vnkozlov and @TobiHartmann , I have implemented a much simpler fix: >> Whenever there is no pre-allocated space before the inputs for the memory edge, we simply add the memory edge after the inputs. >> >> This is a bit of an ad-hoc fix, but it is much simpler than the other two options. Changing the `.ad` files requires much more work. Adding `stackSlot` to `MatchNode::needs_ideal_memory_edge` would also be an ad-hoc fix. >> >> The added test still fails with other changes in mainline, and passes with my new fix. Ran it 50 times to verify. >> Ran larger test suite, all passed. >> >> --------- >> >> In `PhaseChaitin::fixup_spills` we decide if we need a memory edge when reading from a spilled register. >> Unfortunately, for `MoveF2I`, `MoveD2L` etc we do not add such memory edges. >> This can lead to reversed scheduling, where we read from a `stackSlot` before we wrote to it, leading to wrong results. >> (This happens intermittently, but the regression test did reproduce it at about a 10% rate) >> >> In `PhaseChaitin::fixup_spills` we decided if such a memory edge needs to be added by comparing `oper_input_base()` of the node before spilling and after spilling. 
If `oper_input_base()` of the `mach` node (before spilling) is 1, this means that node does not have a memory edge yet. And if `oper_input_base()` of the `cisc` node (after spilling) is 2, this means it needs a memory edge. In all spill cases I could find, the value is 1 and 2 respectively, except for MoveF2I etc, there it is 1 and 1 respectively, thus the memory edge was omitted. >> >> The values of `oper_input_base()` are determined in `InstructForm::oper_input_base`, where we query `MatchNode::needs_ideal_memory_edge`. This function checks if there is an `_opType` in the recursive match structure of this mach node, that matches one of a list of nodes (`StoreI, StoreF, ... LoadI, StoreF, ... etc.`). Unfortunately, MoveF2I etc do not have such a match. Instead of a `StoreF/LoadF`, they used `stackSlotF` (which is not recognized in `MatchNode::needs_ideal_memory_edge`). So it thinks there is no need for a memory edge. >> >> We saw 2 options to fix this issue: >> 1) add `stackSlotI/L/P/D/F` to `MatchNode::needs_ideal_memory_edge`. However, this seems to be an inconsistent solution. The other items in that list are nodes, `stackSlot` is not. And other operations (like `addI, testI, etc.`) all use `LoadI/StoreI`, which is more generic (for heap and stack). >> 2) Change the arguments and match rules to not use `stackSlot`, but `memory` arguments and `LoadI/StoreI` nodes. This is a consistent and more generic solution (the MoveF2I operation could now be used not just for stack spilling but also reading/writing from/to memory). >> >> I picked option 2. >> Further, I now assume that we can always add such a memory edge when reading from a spilled register. This assumption did not get violated in my more extensive testing. >> >> While the regression test only failed about 10% due to this bug, the assert I added verifying that we add these memory edges did trigger 100% before I applied the fix in x86_64.ad. 
This means the memory edge was missing every time, just the scheduling varied and we were lucky most of the time. >> >> There are a few open points for discussion: >> >> - `loadSSI/L/P/F/D` still uses `stackSlot`. I have never observed that this operation gets its register spilled. But I still wonder if we should not have this operation use `memory/LoadI` instead of `stackSlotI`. I think we might even be able to simply remove `loadSSI` because it is already covered by what `loadI` does. (Update: Tests suggest `loadSSX` can be removed from `x86_64.ad`) >> - So far I have only applied my fix to `x86_64.ad` -> we probably want to apply it to all platforms. >> - Other platforms use `stackSlot` more often, for example in `x86_32.ad`. It may well be that some of these operations could also be spilled, which would probably also lead to missing memory edges. I wonder if we should maybe remove all occurrences of `stackSlot` in the `ad` files, or if we should still add `stackSlot` to `MatchNode::needs_ideal_memory_edge` to ensure we can always add the memory edges. > > Emanuel Peter has updated the pull request incrementally with two additional commits since the last revision: > > - Adapted comments again > - compacted comment, added test without -Djdk.test.lib.random.seed=42 LGTM src/hotspot/share/opto/chaitin.cpp line 1736: > 1734: // src to cisc, else we might schedule cisc before src, loading from a > 1735: // spill location before storing the spill. > 1736: cisc->add_prec(src); An assertion for a null memory operand can be added before adding the precedence edge. ------------- Marked as reviewed by jbhateja (Committer). 
PR: https://git.openjdk.java.net/jdk/pull/7889 From dnsimon at openjdk.java.net Tue May 3 14:14:33 2022 From: dnsimon at openjdk.java.net (Doug Simon) Date: Tue, 3 May 2022 14:14:33 GMT Subject: RFR: 8286063: check compiler queue after calling AbstractCompiler::on_empty_queue Message-ID: [JDK-8242440](https://bugs.openjdk.java.net/browse/JDK-8242440) added support for a JIT compiler to be notified when a `CompilerThread` has an empty compilation queue. It's possible for an implementation of `AbstractCompiler::on_empty_queue` to temporarily release `MethodCompileQueue_lock` (e.g. [here](https://github.com/openjdk/jdk/blob/357b1b18c20233f16fba872b79237e9459f5ba43/src/hotspot/share/jvmci/jvmciCompiler.cpp#L174)). This means a non-CompilerThread has a chance to enqueue a new compilation task. As such, the `CompilerThread` should check for this after calling `AbstractCompiler::on_empty_queue`. ------------- Commit messages: - check compilation queue after calling on_empty_queue Changes: https://git.openjdk.java.net/jdk/pull/8517/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=8517&range=00 Issue: https://bugs.openjdk.java.net/browse/JDK-8286063 Stats: 5 lines in 1 file changed: 5 ins; 0 del; 0 mod Patch: https://git.openjdk.java.net/jdk/pull/8517.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8517/head:pull/8517 PR: https://git.openjdk.java.net/jdk/pull/8517 From kvn at openjdk.java.net Tue May 3 14:36:19 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Tue, 3 May 2022 14:36:19 GMT Subject: RFR: 8285976: compiler/exceptions/OptimizeImplicitExceptions.java can't pass with -XX:+DeoptimizeALot In-Reply-To: References: Message-ID: <38bSym85oG0FoDQw7DJvQpfGQX1NzRp5w1QMf3sLdqw=.2d29288a-37aa-4fb7-8c8e-35e878f988d9@github.com> On Tue, 3 May 2022 01:20:48 GMT, Xin Liu wrote: > Disable DeoptimizeALot and DeoptimizeRandom, 2 develop options when run this test for stability. Agree. ------------- Marked as reviewed by kvn (Reviewer). 
PR: https://git.openjdk.java.net/jdk/pull/8513 From kvn at openjdk.java.net Tue May 3 14:37:21 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Tue, 3 May 2022 14:37:21 GMT Subject: RFR: 8285885: Replay compilation fails with assert(is_valid()) failed: check invoke In-Reply-To: References: Message-ID: On Tue, 3 May 2022 01:46:22 GMT, Dean Long wrote: > This change makes replay more tolerant so it will fail gracefully instead of assert if it can't find an invoke bytecode at the desired bci. Good. ------------- Marked as reviewed by kvn (Reviewer). PR: https://git.openjdk.java.net/jdk/pull/8514 From kvn at openjdk.java.net Tue May 3 14:39:11 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Tue, 3 May 2022 14:39:11 GMT Subject: RFR: 8283775: VM support for graph querying in debugger with BFS traversal and node filtering [v4] In-Reply-To: <9qXXj4alHvQW7_ihat0HLYnKBV_ZhylC8r1Cahx_7tc=.88e369a2-a280-4723-9532-6f4d3bd0b65e@github.com> References: <9qXXj4alHvQW7_ihat0HLYnKBV_ZhylC8r1Cahx_7tc=.88e369a2-a280-4723-9532-6f4d3bd0b65e@github.com> Message-ID: <_1zszWG6IQGzd9ETGO1Vco5K3OPx5JRssiAw5qJ8MRQ=.46a710af-14fd-47e4-9701-df3eba371f20@github.com> On Tue, 3 May 2022 08:48:02 GMT, Emanuel Peter wrote: >> I implemented VM support for BFS traversal of IR nodes, where one can filter for input/output edges, and the node-type (control / data / memory). If one specifies the `target` node, we find a shortest path between `this` and `target`. With the `options` string one can easily select which node types to traverse. >> >> `void Node::print_bfs(const uint max_distance, Node* target, const char* options)` >> >> While a sufficient summary and description is in the comments above the function definition, I want to explain some useful use-cases here. >> >> Please let me know if you would find this helpful, or if you have any feedback to improve it. >> Thanks, Emanuel >> >> **1. 
Better dump()** >> The function is similar to `dump()`, we can also follow input / output edges up to a certain distance. There are a few improvements/added features: >> >> 1. Display the distance to the specified node. This way I can quickly assess how many nodes I have at a certain distance, and I can traverse edges from one distance to the next. The parent column shows the node one step closer to the BFS root (this). >> 2. Node filtering with types taken from `type->category()` help one limit traversal to control, data or memory flow. The categories are the same as in IGV. >> 3. Probably controversial: coloring of node types in terminal. By default it is off. May not work on all terminals. But I find it very useful to quickly separate memory, data and control - just as in IGV - but in my terminal! >> 4. After matching, we create Mach nodes from the IR nodes. For most Mach nodes, there is an associated old node. Often an assert is triggered for Mach nodes, but the graph already breaks in the IR graph. It is thus very helpful to have the old node idx displayed. >> 5. I also display the head node idx of the block a Mach node is scheduled for. This can be helpful if one gets bad dominance asserts, and needs to see why the blocks were scheduled in a bad way. >> >> Example: >> >> (rr) p find_node(35)->print_bfs(2, 0, "cdmox+") >> No target: perform BFS. >> dis par c dump >> --------------------------------------------- >> 0 35 d 35 CmpP === _ 34 25 [[ 36 ]] >> 1 35 d 34 LoadP === _ 31 33 [[ 35 ]] >> 1 35 d 25 ConP === 0 [[ 26 27 31 35 41 ]] #NULL >> 2 34 m 31 StoreP === 20 27 29 25 [[ 23 34 41 42 ]] >> 2 34 d 33 AddP === _ 1 12 32 [[ 34 ]] >> >> >> Example with Mach nodes: >> >> (rr) p ctrl->print_bfs(4, 0, "cdmox+OB") >> No target: perform BFS. 
>> dis [head idom d] old par c dump >> --------------------------------------------- >> 0 159 147 6 _ 159 c 159 Region === 159 57 [[ 159 158 59 ]] >> 1 147 148 5 o183 159 c 57 IfTrue === 8 [[ 159 ]] >> 2 147 148 5 o182 57 c 8 jmpConU === 147 9 [[ 7 57 ]] >> 3 147 148 5 _ 8 c 147 Region === 147 14 [[ 147 8 ]] >> 3 147 148 5 o180 8 d 9 compUL_rReg === _ 10 13 [[ 8 ]] >> 4 148 149 4 o174 147 c 14 IfTrue === 15 [[ 147 ]] >> 4 147 148 5 o203 9 d 10 decL_rReg === _ 11 [[ 12 9 ]] >> 4 147 148 5 o179 9 d 13 convI2L_reg_reg === _ 28 [[ 9 ]] >> >> >> **2. Find loop body** >> When I find myself in a loop, I try to localize the loop head and end, and map out at least one path between them. >> `loop_end->print_bfs(20, loop_head, "cox+")` >> This provides us with a shortest path, given this path has a distance of at most 20. >> >> Example: >> >> (rr) p find_node(158)->print_bfs(20, find_node(160), "cox+") >> Find shortest path: 158 -> 160. >> >> Backtrace target. >> dis c dump >> --------------------------------------------- >> 9 c 160 OuterStripMinedLoop === 160 339 159 [[ 160 358 ]] >> 8 c 358 CountedLoop === 358 160 143 [[ 358 362 363 ]] >> 7 c 363 If === 358 351 [[ 364 367 ]] >> 6 c 364 IfTrue === 363 [[ 128 ]] >> 5 c 128 If === 364 127 [[ 129 130 ]] >> 4 c 129 IfTrue === 128 [[ 155 ]] >> 3 c 155 CountedLoopEnd === 129 154 [[ 157 143 ]] [lt] >> 2 c 157 IfFalse === 155 [[ 162 163 ]] >> 1 c 162 SafePoint === 157 1 7 1 1 163 100 1 1 13 27 133 [[ 158 ]] >> 0 c 158 OuterStripMinedLoopEnd === 162 156 [[ 159 227 ]] >> >> Example with Mach nodes: >> >> (rr) p ctrl->print_bfs(10, val, "cdmox-+OB") >> Find shortest path: 159 -> 27. >> >> Backtrace target. 
>> dis [head idom d] old e c dump >> --------------------------------------------- >> 2 24 1 2 o10 + d 27 MachProj === 24 [[ 19 28 4 59 95 99 118 ]] >> 1 56 159 7 o239 - d 59 loadB === 159 29 27 60 [[ 55 ]] >> 0 159 147 6 _ c 159 Region === 159 57 [[ 159 158 59 ]] > > Emanuel Peter has updated the pull request incrementally with two additional commits since the last revision: > > - kill trailing whitespaces > - refactoring, and making edge its own colunm Nice. ------------- Marked as reviewed by kvn (Reviewer). PR: https://git.openjdk.java.net/jdk/pull/8468 From kvn at openjdk.java.net Tue May 3 14:46:21 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Tue, 3 May 2022 14:46:21 GMT Subject: RFR: 8286063: check compiler queue after calling AbstractCompiler::on_empty_queue In-Reply-To: References: Message-ID: On Tue, 3 May 2022 14:07:30 GMT, Doug Simon wrote: > [JDK-8242440](https://bugs.openjdk.java.net/browse/JDK-8242440) added support for a JIT compiler to be notified when a `CompilerThread` has an empty compilation queue. It's possible for an implementation of `AbstractCompiler::on_empty_queue` to temporarily release `MethodCompileQueue_lock` (e.g. [here](https://github.com/openjdk/jdk/blob/357b1b18c20233f16fba872b79237e9459f5ba43/src/hotspot/share/jvmci/jvmciCompiler.cpp#L174)). This means a non-CompilerThread has a chance to enqueue a new compilation task. As such, the `CompilerThread` should check for this after calling `AbstractCompiler::on_empty_queue`. src/hotspot/share/compiler/compileBroker.cpp line 450: > 448: // so check again whether any tasks were added to the queue. > 449: break; > 450: } So this is just optimization to avoid waiting 5 sec? 
------------- PR: https://git.openjdk.java.net/jdk/pull/8517 From thartmann at openjdk.java.net Tue May 3 15:13:21 2022 From: thartmann at openjdk.java.net (Tobias Hartmann) Date: Tue, 3 May 2022 15:13:21 GMT Subject: RFR: 8285976: compiler/exceptions/OptimizeImplicitExceptions.java can't pass with -XX:+DeoptimizeALot In-Reply-To: References: Message-ID: On Tue, 3 May 2022 01:20:48 GMT, Xin Liu wrote: > Disable DeoptimizeALot and DeoptimizeRandom, 2 develop options when run this test for stability. Looks good to me too. ------------- Marked as reviewed by thartmann (Reviewer). PR: https://git.openjdk.java.net/jdk/pull/8513 From simonis at openjdk.java.net Tue May 3 15:30:22 2022 From: simonis at openjdk.java.net (Volker Simonis) Date: Tue, 3 May 2022 15:30:22 GMT Subject: RFR: 8285976: compiler/exceptions/OptimizeImplicitExceptions.java can't pass with -XX:+DeoptimizeALot In-Reply-To: References: Message-ID: On Tue, 3 May 2022 01:20:48 GMT, Xin Liu wrote: > Disable DeoptimizeALot and DeoptimizeRandom, 2 develop options when run this test for stability. Looks good. Thanks for fixing this, Xin. ------------- PR: https://git.openjdk.java.net/jdk/pull/8513 From simonis at openjdk.java.net Tue May 3 15:44:19 2022 From: simonis at openjdk.java.net (Volker Simonis) Date: Tue, 3 May 2022 15:44:19 GMT Subject: RFR: 8285976: compiler/exceptions/OptimizeImplicitExceptions.java can't pass with -XX:+DeoptimizeALot In-Reply-To: References: Message-ID: On Tue, 3 May 2022 01:20:48 GMT, Xin Liu wrote: > Disable DeoptimizeALot and DeoptimizeRandom, 2 develop options when run this test for stability. test/hotspot/jtreg/compiler/exceptions/OptimizeImplicitExceptions.java line 200: > 198: // The following options are both develop, getBooleanVMFlag() returns NULL in product build. > 199: // If they are set in debug build, disable them for more test stability. 
> 200: if (WB.getBooleanVMFlag("DeoptimizeALot")) { If this really returns NULL in a product build, won't this cause an exception then? Wouldn't it be easier to just unconditionally set them both to false: WB.setBooleanVMFlag("DeoptimizeALot", false); WB.setBooleanVMFlag("DeoptimizeRandom", false); ------------- PR: https://git.openjdk.java.net/jdk/pull/8513 From thartmann at openjdk.java.net Tue May 3 15:54:20 2022 From: thartmann at openjdk.java.net (Tobias Hartmann) Date: Tue, 3 May 2022 15:54:20 GMT Subject: RFR: 8285976: compiler/exceptions/OptimizeImplicitExceptions.java can't pass with -XX:+DeoptimizeALot In-Reply-To: References: Message-ID: On Tue, 3 May 2022 15:36:36 GMT, Volker Simonis wrote: >> Disable DeoptimizeALot and DeoptimizeRandom, 2 develop options when run this test for stability. > > test/hotspot/jtreg/compiler/exceptions/OptimizeImplicitExceptions.java line 200: > >> 198: // The following options are both develop, getBooleanVMFlag() returns NULL in product build. >> 199: // If they are set in debug build, disable them for more test stability. >> 200: if (WB.getBooleanVMFlag("DeoptimizeALot")) { > > If this really returns NULL in a product build, won't this cause an exception then? > > Wouldn't it be easier to just unconditionally set them both to false: > > WB.setBooleanVMFlag("DeoptimizeALot", false); > WB.setBooleanVMFlag("DeoptimizeRandom", false); If so, I think that should be done in the `@run` statement. 
------------- PR: https://git.openjdk.java.net/jdk/pull/8513 From iveresov at openjdk.java.net Tue May 3 16:06:17 2022 From: iveresov at openjdk.java.net (Igor Veresov) Date: Tue, 3 May 2022 16:06:17 GMT Subject: RFR: 8265360: several compiler/whitebox tests fail with "private compiler.whitebox.SimpleTestCaseHelper(int) must be compiled" In-Reply-To: References: Message-ID: <3n4_Y3F_aNTg1GL14aFLmb9fK2kcf6E9FTJQX_-iy0s=.ee9c3b4c-2ac8-40e7-8bf7-839186c78c52@github.com> On Fri, 29 Apr 2022 21:13:21 GMT, Igor Veresov wrote: > The compilation policy uses the length of the queues as a feedback mechanism that gives us information about the compilation speed. In some places it makes decisions based on the queue length alone without looking at the invocation counters. That can cause a starvation effect. For example, when running in a C2-only mode it may delay profiling in the interpreter if the C2 queue is too long. The solution is to detect "old" methods (that is, methods that have been used a lot) and force putting them into the queue and let the queue prioritization deal with it. > > I also did some cleanup for things that got in the way. > Testing looks clean. Yes, this fixes both. Thanks guys for the reviews! ------------- PR: https://git.openjdk.java.net/jdk/pull/8473 From iveresov at openjdk.java.net Tue May 3 16:06:17 2022 From: iveresov at openjdk.java.net (Igor Veresov) Date: Tue, 3 May 2022 16:06:17 GMT Subject: Integrated: 8265360: several compiler/whitebox tests fail with "private compiler.whitebox.SimpleTestCaseHelper(int) must be compiled" In-Reply-To: References: Message-ID: <62tcQNe1yffGBV9CP21bVMV2196Iv4SuXv6MaA9ER2Q=.fe5a9df3-a14f-4826-9e3e-c4c117cb8c19@github.com> On Fri, 29 Apr 2022 21:13:21 GMT, Igor Veresov wrote: > The compilation policy uses the length of the queues as a feedback mechanism that gives us information about the compilation speed. 
In some places it makes decisions based on the queue length alone without looking at the invocation counters. That can cause a starvation effect. For example, when running in a C2-only mode it may delay profiling in the interpreter if the C2 queue is too long. The solution is to detect "old" methods (that is, methods that have been used a lot) and force putting them into the queue and let the queue prioritization deal with it. > > I also did some cleanup for things that got in the way. > Testing looks clean. This pull request has now been integrated. Changeset: 4434c7df Author: Igor Veresov URL: https://git.openjdk.java.net/jdk/commit/4434c7df036a2b2ffff54b8b19943de3c23a4e52 Stats: 90 lines in 5 files changed: 32 ins; 8 del; 50 mod 8265360: several compiler/whitebox tests fail with "private compiler.whitebox.SimpleTestCaseHelper(int) must be compiled" Reviewed-by: kvn, thartmann ------------- PR: https://git.openjdk.java.net/jdk/pull/8473 From xliu at openjdk.java.net Tue May 3 16:51:22 2022 From: xliu at openjdk.java.net (Xin Liu) Date: Tue, 3 May 2022 16:51:22 GMT Subject: RFR: 8282024: add EscapeAnalysis statistics under PrintOptoStatistics [v7] In-Reply-To: <3_nhaxzU2R-tZYRIUFq0qIxVpy0KX0ilkVpvCekM5zE=.2fb159c1-d2ed-4d31-98c8-e0a49fb59937@github.com> References: <3_nhaxzU2R-tZYRIUFq0qIxVpy0KX0ilkVpvCekM5zE=.2fb159c1-d2ed-4d31-98c8-e0a49fb59937@github.com> Message-ID: On Mon, 2 May 2022 20:28:22 GMT, aamarsh wrote: >> Escape Analysis and Scalar Replacement statistics were added when the -XX:+PrintOptoStatistics flag is set. All code is placed in `#ifndef Product` block, so this code is only run when creating a debug build. Using renaissance benchmark I ran a few tests to confirm that numbers were printing correctly. 
Below is an example run: >> >> >> No escape = 372, Arg escape = 74, Global escape = 1855 (EA executed in 10.49 seconds) >> Objects scalar replaced = 240, Monitor objects removed = 44, GC barriers removed = 37, Memory barriers removed = 284 > > aamarsh has refreshed the contents of this pull request, and previous commits have been removed. The incremental views will show differences compared to the previous content of the PR. The pull request contains one new commit since the last revision: > > adding escape analysis and scalar replacement statistics src/hotspot/share/opto/macro.cpp line 2608: > 2606: } > 2607: > 2608: int PhaseMacroExpand::count_MemBar() { I am not sure about this procedure. Even though you use Unique_Node_List, is it still possible to count the same membar multiple times? Or could a backedge cause an infinite loop? I think you can use Compile::identify_useful_nodes to collect all useful nodes and then count MemBar nodes. ------------- PR: https://git.openjdk.java.net/jdk/pull/8019 From kvn at openjdk.java.net Tue May 3 16:51:24 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Tue, 3 May 2022 16:51:24 GMT Subject: RFR: 8284813: x86 Code cleanup related to move instructions. [v2] In-Reply-To: References: Message-ID: On Fri, 29 Apr 2022 05:10:44 GMT, Jatin Bhateja wrote: >> Summary of changes: >> >> - Correct feature checks in some assembler move instruction. >> - Explicitly pass opmask register in routines accepting merge argument. >> - Code re-organization related to move instruction, pull out the merge argument up to instruction pattern or top level caller. >> - Add missing encoding based move elision checks in some macro assembly routines. >> >> Kindly review and share your feedback. >> >> Regards, >> Jatin > > Jatin Bhateja has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. 
The pull request contains two additional commits since the last revision: > > - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8284813 > - 8284813: x86 Code cleanup related to move instructions. Good. Let me test it. ------------- PR: https://git.openjdk.java.net/jdk/pull/8230 From dnsimon at openjdk.java.net Tue May 3 17:06:32 2022 From: dnsimon at openjdk.java.net (Doug Simon) Date: Tue, 3 May 2022 17:06:32 GMT Subject: RFR: 8286063: check compiler queue after calling AbstractCompiler::on_empty_queue In-Reply-To: References: Message-ID: On Tue, 3 May 2022 14:42:28 GMT, Vladimir Kozlov wrote: >> [JDK-8242440](https://bugs.openjdk.java.net/browse/JDK-8242440) added support for a JIT compiler to be notified when a `CompilerThread` has an empty compilation queue. It's possible for an implementation of `AbstractCompiler::on_empty_queue` to temporarily release `MethodCompileQueue_lock` (e.g. [here](https://github.com/openjdk/jdk/blob/357b1b18c20233f16fba872b79237e9459f5ba43/src/hotspot/share/jvmci/jvmciCompiler.cpp#L174)). This means a non-CompilerThread has a chance to enqueue a new compilation task. As such, the `CompilerThread` should check for this after calling `AbstractCompiler::on_empty_queue`. > > src/hotspot/share/compiler/compileBroker.cpp line 450: > >> 448: // so check again whether any tasks were added to the queue. >> 449: break; >> 450: } > > So this is just optimization to avoid waiting 5 sec? Yes. Normally it does not matter but in `-Xbatch` mode it means one compilation every 5 seconds and tests like `TestTrichotomyExpressions` time out. 
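To illustrate the re-check discussed in this thread, here is a minimal, single-threaded Java sketch. It is not the HotSpot code: the class `EmptyQueueRecheck`, the method `nextTask`, and the `recheck` switch are hypothetical names introduced only for illustration, and the `Runnable` hook merely stands in for `AbstractCompiler::on_empty_queue` temporarily releasing the queue lock so that another thread can enqueue a task.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical model for illustration only; none of these names exist in HotSpot.
public class EmptyQueueRecheck {
    /**
     * Returns the next task, or null if the caller would now enter a timed wait.
     * onEmptyQueue models a hook that may let other threads enqueue work
     * (for example because it temporarily drops the queue lock).
     */
    public static String nextTask(Deque<String> queue, Runnable onEmptyQueue, boolean recheck) {
        if (queue.isEmpty()) {
            onEmptyQueue.run();
            if (recheck && !queue.isEmpty()) {
                // A task arrived while the hook ran: take it instead of waiting.
                return queue.poll();
            }
            return null; // would block until notified or until the timeout expires
        }
        return queue.poll();
    }

    public static void main(String[] args) {
        Deque<String> q = new ArrayDeque<>();
        Runnable hook = () -> q.add("task-1"); // simulates an enqueue during the hook
        System.out.println(nextTask(q, hook, false)); // null: the task sits in the queue
        q.clear();
        System.out.println(nextTask(q, hook, true));  // task-1
    }
}
```

Without the re-check, the task enqueued during the hook is only picked up after the wait ends, which matches the 5 second delay per compilation described above for `-Xbatch`. 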
------------- PR: https://git.openjdk.java.net/jdk/pull/8517 From kvn at openjdk.java.net Tue May 3 17:27:10 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Tue, 3 May 2022 17:27:10 GMT Subject: RFR: 8286063: check compiler queue after calling AbstractCompiler::on_empty_queue In-Reply-To: References: Message-ID: On Tue, 3 May 2022 14:07:30 GMT, Doug Simon wrote: > [JDK-8242440](https://bugs.openjdk.java.net/browse/JDK-8242440) added support for a JIT compiler to be notified when a `CompilerThread` has an empty compilation queue. It's possible for an implementation of `AbstractCompiler::on_empty_queue` to temporarily release `MethodCompileQueue_lock` (e.g. [here](https://github.com/openjdk/jdk/blob/357b1b18c20233f16fba872b79237e9459f5ba43/src/hotspot/share/jvmci/jvmciCompiler.cpp#L174)). This means a non-CompilerThread has a chance to enqueue a new compilation task. As such, the `CompilerThread` should check for this after calling `AbstractCompiler::on_empty_queue`. Good. ------------- Marked as reviewed by kvn (Reviewer). PR: https://git.openjdk.java.net/jdk/pull/8517 From kvn at openjdk.java.net Tue May 3 17:47:35 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Tue, 3 May 2022 17:47:35 GMT Subject: RFR: 8284813: x86 Code cleanup related to move instructions. [v2] In-Reply-To: References: Message-ID: <4_Xq6akDYMECPMDCCJ_7x_HRKUBYRS9HhMvjTcyZ2Dc=.c8442261-7651-40f9-abda-125e65a8fc68@github.com> On Fri, 29 Apr 2022 05:10:44 GMT, Jatin Bhateja wrote: >> Summary of changes: >> >> - Correct feature checks in some assembler move instruction. >> - Explicitly pass opmask register in routines accepting merge argument. >> - Code re-organization related to move instruction, pull out the merge argument up to instruction pattern or top level caller. >> - Add missing encoding based move elision checks in some macro assembly routines. >> >> Kindly review and share your feedback. 
>> >> Regards, >> Jatin > > Jatin Bhateja has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains two additional commits since the last revision: > > - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8284813 > - 8284813: x86 Code cleanup related to move instructions. I found that Tobias already ran tier1-4 testing and it passed. I am running tier1 with latest JDK. You need second review. Could be from Intel. ------------- PR: https://git.openjdk.java.net/jdk/pull/8230 From xliu at openjdk.java.net Tue May 3 19:02:17 2022 From: xliu at openjdk.java.net (Xin Liu) Date: Tue, 3 May 2022 19:02:17 GMT Subject: RFR: 8282024: add EscapeAnalysis statistics under PrintOptoStatistics [v7] In-Reply-To: <3_nhaxzU2R-tZYRIUFq0qIxVpy0KX0ilkVpvCekM5zE=.2fb159c1-d2ed-4d31-98c8-e0a49fb59937@github.com> References: <3_nhaxzU2R-tZYRIUFq0qIxVpy0KX0ilkVpvCekM5zE=.2fb159c1-d2ed-4d31-98c8-e0a49fb59937@github.com> Message-ID: On Mon, 2 May 2022 20:28:22 GMT, aamarsh wrote: >> Escape Analysis and Scalar Replacement statistics were added when the -XX:+PrintOptoStatistics flag is set. All code is placed in `#ifndef Product` block, so this code is only run when creating a debug build. Using renaissance benchmark I ran a few tests to confirm that numbers were printing correctly. Below is an example run: >> >> >> No escape = 372, Arg escape = 74, Global escape = 1855 (EA executed in 10.49 seconds) >> Objects scalar replaced = 240, Monitor objects removed = 44, GC barriers removed = 37, Memory barriers removed = 284 > > aamarsh has refreshed the contents of this pull request, and previous commits have been removed. The incremental views will show differences compared to the previous content of the PR. 
The pull request contains one new commit since the last revision: > > adding escape analysis and scalar replacement statistics Hi, thank you for working on this. In general, I think it looks good. Only small suggestions are left. src/hotspot/share/opto/compile.cpp line 2215: > 2213: #ifndef PRODUCT > 2214: Atomic::add(&ConnectionGraph::_no_escape_counter, total_scalar_replaced); > 2215: Atomic::add(&ConnectionGraph::_no_escape_counter, _local_no_escape_ctr); You can merge the two atomic operations into one: `Atomic::add(&ConnectionGraph::_no_escape_counter, _local_no_escape_ctr + total_scalar_replaced);` This is smart. last revision is the final revision. ------------- PR: https://git.openjdk.java.net/jdk/pull/8019 From kvn at openjdk.java.net Tue May 3 19:48:27 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Tue, 3 May 2022 19:48:27 GMT Subject: RFR: 8284813: x86 Code cleanup related to move instructions. [v2] In-Reply-To: References: Message-ID: On Fri, 29 Apr 2022 05:10:44 GMT, Jatin Bhateja wrote: >> Summary of changes: >> >> - Correct feature checks in some assembler move instruction. >> - Explicitly pass opmask register in routines accepting merge argument. >> - Code re-organization related to move instruction, pull out the merge argument up to instruction pattern or top level caller. >> - Add missing encoding based move elision checks in some macro assembly routines. >> >> Kindly review and share your feedback. >> >> Regards, >> Jatin > > Jatin Bhateja has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains two additional commits since the last revision: > > - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8284813 > - 8284813: x86 Code cleanup related to move instructions. My tests passed. ------------- Marked as reviewed by kvn (Reviewer).
PR: https://git.openjdk.java.net/jdk/pull/8230 From dlong at openjdk.java.net Tue May 3 21:03:20 2022 From: dlong at openjdk.java.net (Dean Long) Date: Tue, 3 May 2022 21:03:20 GMT Subject: RFR: 8285885: Replay compilation fails with assert(is_valid()) failed: check invoke In-Reply-To: References: Message-ID: <1YuEJfBFWaQlCJ0jGMXrdjmLWOF7j9QaJbHFTvCprB8=.1821589e-a2c9-456a-8df0-16b8fccf4a84@github.com> On Tue, 3 May 2022 01:46:22 GMT, Dean Long wrote: > This change makes replay more tolerant so it will fail gracefully instead of assert if it can't find an invoke bytecode at the desired bci. Thanks Tobias and Vladimir. ------------- PR: https://git.openjdk.java.net/jdk/pull/8514 From dlong at openjdk.java.net Tue May 3 21:03:20 2022 From: dlong at openjdk.java.net (Dean Long) Date: Tue, 3 May 2022 21:03:20 GMT Subject: Integrated: 8285885: Replay compilation fails with assert(is_valid()) failed: check invoke In-Reply-To: References: Message-ID: On Tue, 3 May 2022 01:46:22 GMT, Dean Long wrote: > This change makes replay more tolerant so it will fail gracefully instead of assert if it can't find an invoke bytecode at the desired bci. This pull request has now been integrated. Changeset: f82dd766 Author: Dean Long URL: https://git.openjdk.java.net/jdk/commit/f82dd76614013afdbc69853f5a1943fcdcd3b55b Stats: 6 lines in 1 file changed: 5 ins; 0 del; 1 mod 8285885: Replay compilation fails with assert(is_valid()) failed: check invoke Reviewed-by: thartmann, kvn ------------- PR: https://git.openjdk.java.net/jdk/pull/8514 From sviswanathan at openjdk.java.net Tue May 3 22:39:52 2022 From: sviswanathan at openjdk.java.net (Sandhya Viswanathan) Date: Tue, 3 May 2022 22:39:52 GMT Subject: RFR: 8284813: x86 Code cleanup related to move instructions. 
[v2] In-Reply-To: References: Message-ID: <03wdHrmXfp_Q6ZnotXUTjWoDqFx_TB6giFrymCGJP9I=.a6ae20f7-6041-4d8a-a0ea-2dbb729b5963@github.com> On Fri, 29 Apr 2022 05:10:44 GMT, Jatin Bhateja wrote: >> Summary of changes: >> >> - Correct feature checks in some assembler move instruction. >> - Explicitly pass opmask register in routines accepting merge argument. >> - Code re-organization related to move instruction, pull out the merge argument up to instruction pattern or top level caller. >> - Add missing encoding based move elision checks in some macro assembly routines. >> >> Kindly review and share your feedback. >> >> Regards, >> Jatin > > Jatin Bhateja has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains two additional commits since the last revision: > > - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8284813 > - 8284813: x86 Code cleanup related to move instructions. src/hotspot/cpu/x86/assembler_x86.cpp line 3032: > 3030: attributes.reset_is_clear_context(); > 3031: } > 3032: int encode = vex_prefix_and_encode(dst->encoding(), 0, src->encoding(), VEX_SIMD_F2, VEX_OPCODE_0F, &attributes); The existing version (with no mask) was using VEX_SIMD_F2 or VEX_SIMD_F3 based on avx512bw supported or not. With this change now the calling place need to be fixed. One place I see this being used is loadIotaIndices(). Please fix loadIotaIndices to use appropriate instruction for the platform. Is there any other place in array copy/clear? src/hotspot/cpu/x86/macroAssembler_x86_arrayCopy_avx3.cpp line 202: > 200: bzhiq(temp, temp, length); > 201: kmovql(mask, temp); > 202: evmovdqu(type[shift], mask, xmm, Address(src, index, scale, offset), true, Assembler::AVX_512bit); Should the merge parameter be set to false for load here? 
src/hotspot/cpu/x86/macroAssembler_x86_arrayCopy_avx3.cpp line 217: > 215: bzhiq(temp, temp, length); > 216: kmovql(mask, temp); > 217: evmovdqu(type[shift], mask, xmm, Address(src, index, scale, offset), true, Assembler::AVX_256bit); Should the merge parameter be set to false for load here? ------------- PR: https://git.openjdk.java.net/jdk/pull/8230 From duke at openjdk.java.net Wed May 4 01:19:28 2022 From: duke at openjdk.java.net (aamarsh) Date: Wed, 4 May 2022 01:19:28 GMT Subject: RFR: 8282024: add EscapeAnalysis statistics under PrintOptoStatistics [v7] In-Reply-To: References: <3_nhaxzU2R-tZYRIUFq0qIxVpy0KX0ilkVpvCekM5zE=.2fb159c1-d2ed-4d31-98c8-e0a49fb59937@github.com> Message-ID: On Tue, 3 May 2022 18:50:50 GMT, Xin Liu wrote: >> aamarsh has refreshed the contents of this pull request, and previous commits have been removed. The incremental views will show differences compared to the previous content of the PR. The pull request contains one new commit since the last revision: >> >> adding escape analysis and scalar replacement statistics > > src/hotspot/share/opto/compile.cpp line 2215: > >> 2213: #ifndef PRODUCT >> 2214: Atomic::add(&ConnectionGraph::_no_escape_counter, total_scalar_replaced); >> 2215: Atomic::add(&ConnectionGraph::_no_escape_counter, _local_no_escape_ctr); > > You can merge the two atomic operations into one: `Atomic::add(&ConnectionGraph::_no_escape_counter, _local_no_escape_ctr + total_scalar_replaced);` > > This is smart. last revision is the final revision. @navyxliu Thank you so much for your feedback, it is helpful for me :). As per this issue, since no nodes are ever removed from the list, I think this issue is avoided. If we find a loop back to a node we have already searched and attempt to push this node at line 2620, it will not be added to the list because it already exists.
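The termination argument above can be illustrated with a small stand-alone sketch. It is not HotSpot code: `UniqueWorklist` and `countReachable` are hypothetical Java stand-ins that only assume the property described here, namely that pushing an element already on the list is ignored and that nothing is ever removed.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for a "unique node list": append-only, no duplicates.
final class UniqueWorklist {
    private final List<Integer> items = new ArrayList<>();
    private final Set<Integer> seen = new HashSet<>();

    // Appends the node only if it was never pushed before.
    void push(int node) {
        if (seen.add(node)) {
            items.add(node);
        }
    }

    int size() { return items.size(); }

    int at(int i) { return items.get(i); }

    // Visits everything reachable from start. The index-based loop also
    // picks up nodes that are appended while the walk is in progress.
    static int countReachable(Map<Integer, List<Integer>> edges, int start) {
        UniqueWorklist wl = new UniqueWorklist();
        wl.push(start);
        for (int i = 0; i < wl.size(); i++) {
            for (int next : edges.getOrDefault(wl.at(i), List.of())) {
                wl.push(next); // a backedge to an already listed node is a no-op
            }
        }
        return wl.size();
    }

    public static void main(String[] args) {
        // 1 -> 2 -> 3 -> 1 forms a cycle; 3 -> 4 leaves it.
        Map<Integer, List<Integer>> edges = Map.of(
            1, List.of(2),
            2, List.of(3),
            3, List.of(1, 4));
        System.out.println(countReachable(edges, 1));
    }
}
```

Because the list only grows and never admits a duplicate, the loop index eventually catches up with the size, so the walk ends even when the graph contains a backedge, and each node is counted once.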
------------- PR: https://git.openjdk.java.net/jdk/pull/8019 From duke at openjdk.java.net Wed May 4 01:26:28 2022 From: duke at openjdk.java.net (aamarsh) Date: Wed, 4 May 2022 01:26:28 GMT Subject: RFR: 8282024: add EscapeAnalysis statistics under PrintOptoStatistics [v7] In-Reply-To: References: <3_nhaxzU2R-tZYRIUFq0qIxVpy0KX0ilkVpvCekM5zE=.2fb159c1-d2ed-4d31-98c8-e0a49fb59937@github.com> Message-ID: On Tue, 3 May 2022 16:46:20 GMT, Xin Liu wrote: >> aamarsh has refreshed the contents of this pull request, and previous commits have been removed. The incremental views will show differences compared to the previous content of the PR. The pull request contains one new commit since the last revision: >> >> adding escape analysis and scalar replacement statistics > > src/hotspot/share/opto/macro.cpp line 2608: > >> 2606: } >> 2607: >> 2608: int PhaseMacroExpand::count_MemBar() { > > I am not sure about this procedural. Even though you use Unique_Node_List, is it still possible to count the same membar multiple times? or maybe a backedge cause an infinitely loop? > > I think you can use Compile::identify_useful_nodes to collect all useful nodes and then count MemBar nodes. @navyxliu Since no nodes are ever removed from the list, I think this issue is avoided. If we find a loop back to a node we have already searched and attempt to push this node at line 2620, it will not be added to the list because it already exists. ------------- PR: https://git.openjdk.java.net/jdk/pull/8019 From duke at openjdk.java.net Wed May 4 01:36:13 2022 From: duke at openjdk.java.net (aamarsh) Date: Wed, 4 May 2022 01:36:13 GMT Subject: RFR: 8282024: add EscapeAnalysis statistics under PrintOptoStatistics [v8] In-Reply-To: References: Message-ID: > Escape Analysis and Scalar Replacement statistics were added when the -XX:+PrintOptoStatistics flag is set. All code is placed in `#ifndef Product` block, so this code is only run when creating a debug build. 
Using renaissance benchmark I ran a few tests to confirm that numbers were printing correctly. Below is an example run: > > > No escape = 372, Arg escape = 74, Global escape = 1855 (EA executed in 10.49 seconds) > Objects scalar replaced = 240, Monitor objects removed = 44, GC barriers removed = 37, Memory barriers removed = 284 aamarsh has refreshed the contents of this pull request, and previous commits have been removed. The incremental views will show differences compared to the previous content of the PR. The pull request contains one new commit since the last revision: adding escape analysis and scalar replacement statistics ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/8019/files - new: https://git.openjdk.java.net/jdk/pull/8019/files/18328bb3..a2811a8f Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=8019&range=07 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=8019&range=06-07 Stats: 2 lines in 1 file changed: 0 ins; 1 del; 1 mod Patch: https://git.openjdk.java.net/jdk/pull/8019.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8019/head:pull/8019 PR: https://git.openjdk.java.net/jdk/pull/8019 From duke at openjdk.java.net Wed May 4 02:06:01 2022 From: duke at openjdk.java.net (Quan Anh Mai) Date: Wed, 4 May 2022 02:06:01 GMT Subject: RFR: 8285973: x86_64: Improve fp comparison and cmove for eq/ne Message-ID:

Hi,

This patch optimises the matching rules for floating-point comparison with respect to eq/ne on x86-64.

1, When the inputs of a comparison are the same (i.e. `isNaN` patterns), `ZF` is always set, so we don't need `cmpOpUCF2` for the eq/ne cases, which improves the sequence of `If (CmpF x x) (Bool ne)` from

    ucomiss xmm0, xmm0
    jp label
    jne label

into

    ucomiss xmm0, xmm0
    jp label

2, The move rules for `cmpOpUCF2` are missing, which makes patterns such as `x == y ?
1 : 0` to fall back to `cmpOpU`, which has a really high cost of fixing the flags, such as

    xorl ecx, ecx
    ucomiss xmm0, xmm1
    jnp done
    pushf
    andq [rsp], 0xffffff2b
    popf
    done:
    movl eax, 1
    cmovnel eax, ecx

The patch changes this sequence into

    xorl ecx, ecx
    ucomiss xmm0, xmm1
    movl eax, 1
    cmovpl eax, ecx
    cmovnel eax, ecx

3, The patch also changes the pattern of `isInfinite` to be more optimised by using `Math.abs` to reduce 1 comparison and compares the result with `MAX_VALUE` since `>` is more optimised than `==` for floating-point types.

The benchmark results are as follows:

Before:

    Benchmark                      Mode  Cnt     Score     Error  Units
    FPComparison.equalDouble       avgt    5  2876.242 ±  58.875  ns/op
    FPComparison.equalFloat        avgt    5  3062.430 ±  31.371  ns/op
    FPComparison.isFiniteDouble    avgt    5   475.749 ±  19.027  ns/op
    FPComparison.isFiniteFloat     avgt    5   506.525 ±  14.417  ns/op
    FPComparison.isInfiniteDouble  avgt    5  1232.800 ±  31.677  ns/op
    FPComparison.isInfiniteFloat   avgt    5  1234.708 ±  70.239  ns/op
    FPComparison.isNanDouble       avgt    5  2255.847 ±   7.238  ns/op
    FPComparison.isNanFloat        avgt    5  2567.044 ±  36.078  ns/op

After:

    Benchmark                      Mode  Cnt     Score     Error  Units
    FPComparison.equalDouble       avgt    5   594.636 ±   8.922  ns/op
    FPComparison.equalFloat        avgt    5   663.849 ±   3.656  ns/op
    FPComparison.isFiniteDouble    avgt    5   518.309 ± 107.352  ns/op
    FPComparison.isFiniteFloat     avgt    5   515.576 ±  14.669  ns/op
    FPComparison.isInfiniteDouble  avgt    5   621.185 ±  11.935  ns/op
    FPComparison.isInfiniteFloat   avgt    5   623.566 ±  15.206  ns/op
    FPComparison.isNanDouble       avgt    5   400.124 ±   0.762  ns/op
    FPComparison.isNanFloat        avgt    5   546.486 ±   1.509  ns/op

Thank you very much.
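For readers without the diff at hand, the Java-level shapes that items 1 and 3 refer to can be sketched roughly as below. This is only an illustration under assumptions: the class and method names (`FpPatterns`, `isNaNPattern`, `isInfinitePattern`) are made up for this note and are not taken from the PR, and whether C2 selects the shorter instruction sequences for them depends on the matching rules in the patch.

```java
// Hypothetical illustration of the source patterns discussed above.
final class FpPatterns {
    // Self-comparison shape `x != x`: true only when x is NaN.
    static boolean isNaNPattern(float x) {
        return x != x;
    }

    // One ordered comparison against MAX_VALUE after Math.abs, instead of
    // two equality checks against +Infinity and -Infinity.
    static boolean isInfinitePattern(double x) {
        return Math.abs(x) > Double.MAX_VALUE;
    }

    public static void main(String[] args) {
        System.out.println(isNaNPattern(Float.NaN));                      // true
        System.out.println(isNaNPattern(1.0f));                           // false
        System.out.println(isInfinitePattern(Double.NEGATIVE_INFINITY));  // true
        System.out.println(isInfinitePattern(Double.MAX_VALUE));          // false
        System.out.println(isInfinitePattern(Double.NaN));                // false
    }
}
```

The `Math.abs` form stays correct for NaN because any ordered comparison with NaN is false, so it agrees with `Double.isInfinite` on all inputs.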
------------- Commit messages: - fix tests - test - improve infinity - remove expensive rules - improve fp comparison Changes: https://git.openjdk.java.net/jdk/pull/8525/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=8525&range=00 Issue: https://bugs.openjdk.java.net/browse/JDK-8285973 Stats: 657 lines in 8 files changed: 569 ins; 70 del; 18 mod Patch: https://git.openjdk.java.net/jdk/pull/8525.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8525/head:pull/8525 PR: https://git.openjdk.java.net/jdk/pull/8525 From dnsimon at openjdk.java.net Wed May 4 04:59:32 2022 From: dnsimon at openjdk.java.net (Doug Simon) Date: Wed, 4 May 2022 04:59:32 GMT Subject: Integrated: 8286063: check compiler queue after calling AbstractCompiler::on_empty_queue In-Reply-To: References: Message-ID: <3HodYcugwGrwtmjlbDWKUeqXD1P4aMoWKDZK6o_r1rY=.f758bde2-5509-4d6b-89db-abd85b96fef5@github.com> On Tue, 3 May 2022 14:07:30 GMT, Doug Simon wrote: > [JDK-8242440](https://bugs.openjdk.java.net/browse/JDK-8242440) added support for a JIT compiler to be notified when a `CompilerThread` has an empty compilation queue. It's possible for an implementation of `AbstractCompiler::on_empty_queue` to temporarily release `MethodCompileQueue_lock` (e.g. [here](https://github.com/openjdk/jdk/blob/357b1b18c20233f16fba872b79237e9459f5ba43/src/hotspot/share/jvmci/jvmciCompiler.cpp#L174)). This means a non-CompilerThread has a chance to enqueue a new compilation task. As such, the `CompilerThread` should check for this after calling `AbstractCompiler::on_empty_queue`. This pull request has now been integrated. 
Changeset: 4282fb2b Author: Doug Simon URL: https://git.openjdk.java.net/jdk/commit/4282fb2b0d0e517d255be7c882c141722e9c9b46 Stats: 5 lines in 1 file changed: 5 ins; 0 del; 0 mod 8286063: check compiler queue after calling AbstractCompiler::on_empty_queue Reviewed-by: kvn ------------- PR: https://git.openjdk.java.net/jdk/pull/8517 From duke at openjdk.java.net Wed May 4 05:17:01 2022 From: duke at openjdk.java.net (Emanuel Peter) Date: Wed, 4 May 2022 05:17:01 GMT Subject: RFR: 8282555: Missing memory edge when spilling MoveF2I, MoveD2L etc [v4] In-Reply-To: References: Message-ID: <9ibzoV61P9Nb7vPcFdT77THapiUSVG-Q8f77cYz4nnI=.75f5bf19-aff0-4c89-b8d6-ed4641122d4c@github.com> > Update: > After the inputs from @jatin-bhateja, and verifying with @vnkozlov and @TobiHartmann , I have implemented a much simpler fix: > Whenever there is no pre-allocated space before the inputs for the memory edge, we simply add the memory edge after the inputs. > > This is a bit of an ad-hoc fix, but it is much simpler than the other two options. Changing the `.ad` files requires much more work. Adding `stackSlot` to `MatchNode::needs_ideal_memory_edge` would also be an ad-hoc fix. > > The added test still fails with other changes in mainline, and passes with my new fix. Ran it 50 times to verify. > Ran larger test suite, all passed. > > --------- > > In `PhaseChaitin::fixup_spills` we decide if we need a memory edge when reading from a spilled register. > Unfortunately, for `MoveF2I`, `MoveD2L` etc we do not add such memory edges. > This can lead to reversed scheduling, where we read from a `stackSlot` before we wrote to it, leading to wrong results. > (This happens intermittently, but the regression test did reproduce it at about a 10% rate) > > In `PhaseChaitin::fixup_spills` we decided if such a memory edge needs to be added by comparing `oper_input_base()` of the node before spilling and after spilling. 
If `oper_input_base()` of the `mach` node (before spilling) is 1, this means that node does not have a memory edge yet. And if `oper_input_base()` of the `cisc` node (after spilling) is 2, this means it needs a memory edge. In all spill cases I could find, the value is 1 and 2 respectively, except for MoveF2I etc, there it is 1 and 1 respectively, thus the memory edge was omitted. > > The values of `oper_input_base()` are determined in `InstructForm::oper_input_base`, where we query `MatchNode::needs_ideal_memory_edge`. This function checks if there is an `_opType` in the recursive match structure of this mach node, that matches one of a list of nodes (`StoreI, StoreF, ... LoadI, StoreF, ... etc.`). Unfortunately, MoveF2I etc do not have such a match. Instead of a `StoreF/LoadF`, they used `stackSlotF` (which is not recognized in `MatchNode::needs_ideal_memory_edge`). So it thinks there is no need for a memory edge. > > We saw 2 options to fix this issue: > 1) add `stackSlotI/L/P/D/F` to `MatchNode::needs_ideal_memory_edge`. However, this seems to be an inconsistent solution. The other items in that list are nodes, `stackSlot` is not. And other operations (like `addI, testI, etc.`) all use `LoadI/StoreI`, which is more generic (for heap and stack). > 2) Change the arguments and match rules to not use `stackSlot`, but `memory` arguments and `LoadI/StoreI` nodes. This is a consistent and more generic solution (the MoveF2I operation could now be used not just for stack spilling but also reading/writing from/to memory). > > I picked option 2. > Further, I now assume that we can always add such a memory edge when reading from a spilled register. This assumption did not get violated in my more extensive testing. > > While the regression test only failed about 10% due to this bug, the assert I added verifying that we add these memory edges did trigger 100% before I applied the fix in x86_64.ad. 
This means the memory edge was missing every time, just the scheduling varied and we were lucky most of the time. > > There are a few open points for discussion: > > - `loadSSI/L/P/F/D` still uses `stackSlot`. I have never observed that this operation gets its register spilled. But I still wonder if we should not have this operation use `memory/LoadI` instead of `stackSlotI`. I think we might even be able to simply remove `loadSSI` because it is already covered by what `loadI` does. (Update: Tests suggest `loadSSX` can be removed from `x86_64.ad`) > - So far I have only applied my fix to `x86_64.ad` -> we probably want to apply it to all platforms. > - Other platforms use `stackSlot` more often, for example in `x86_32.ad`. It may well be that some of these operations could also be spilled, which would probably also lead to missing memory edges. I wonder if we should maybe remove all occurrences of `stackSlot` in the `ad` files, or if we should still add `stackSlot` to `MatchNode::needs_ideal_memory_edge` to ensure we can always add the memory edges.
Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: add assert that jatin asked for ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/7889/files - new: https://git.openjdk.java.net/jdk/pull/7889/files/15535e8e..664b63f1 Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=7889&range=03 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=7889&range=02-03 Stats: 1 line in 1 file changed: 1 ins; 0 del; 0 mod Patch: https://git.openjdk.java.net/jdk/pull/7889.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/7889/head:pull/7889 PR: https://git.openjdk.java.net/jdk/pull/7889 From duke at openjdk.java.net Wed May 4 05:17:02 2022 From: duke at openjdk.java.net (Emanuel Peter) Date: Wed, 4 May 2022 05:17:02 GMT Subject: RFR: 8282555: Missing memory edge when spilling MoveF2I, MoveD2L etc [v3] In-Reply-To: References: Message-ID: On Tue, 3 May 2022 13:09:50 GMT, Jatin Bhateja wrote: >> Emanuel Peter has updated the pull request incrementally with two additional commits since the last revision: >> >> - Adapted comments again >> - compacted comment, added test without -Djdk.test.lib.random.seed=42 > > src/hotspot/share/opto/chaitin.cpp line 1736: > >> 1734: // src to cisc, else we might schedule cisc before src, loading from a >> 1735: // spill location before storing the spill. >> 1736: cisc->add_prec(src); > > An assertion for a null memory operand can be added before adding the precedence edge.
@jatin-bhateja Thanks, I added it :) ------------- PR: https://git.openjdk.java.net/jdk/pull/7889 From duke at openjdk.java.net Wed May 4 05:21:16 2022 From: duke at openjdk.java.net (Emanuel Peter) Date: Wed, 4 May 2022 05:21:16 GMT Subject: RFR: 8282555: Missing memory edge when spilling MoveF2I, MoveD2L etc In-Reply-To: <0SZNw-9p9QDyotJDq-E4piJ-U2Jxc3vRMozOeyGWqn8=.115cc0d3-2a36-40bb-ad45-c10032fd9a8b@github.com> References: <7Z79cOcf4xyRUV4wQR_X4ZVvCta5fFevAS6HPBqwo2k=.bd06617d-7fa7-499d-8083-ee552b4680ed@github.com> <0SZNw-9p9QDyotJDq-E4piJ-U2Jxc3vRMozOeyGWqn8=.115cc0d3-2a36-40bb-ad45-c10032fd9a8b@github.com> Message-ID: On Thu, 31 Mar 2022 05:49:47 GMT, Jatin Bhateja wrote: >> @jatin-bhateja >> For all other cases (eg `addI`, `convI2L`, etc ) we are currently using >> https://github.com/openjdk/jdk/blob/cc598e03de39dd6e8d7e208a69d85b6a9cd0062f/src/hotspot/share/opto/chaitin.cpp#L1729 >> This seems to add the dependencies before the inputs. But that depends on there being space before the inputs. >> That is why we check `cisc->oper_input_base() > 1` >> https://github.com/openjdk/jdk/blob/cc598e03de39dd6e8d7e208a69d85b6a9cd0062f/src/hotspot/share/opto/chaitin.cpp#L1727 >> >> All other cases where we have spilling (eg `addI`, `convI2L`, etc ), we have `cisc->oper_input_base() == 2`. >> I think `_in[0]` is for control, and `_in[1]` for the memory edge (in the other cases not including `MoveF2I` etc). >> `cisc->oper_input_base() == 2` in all other cases, because in `InstructForm::oper_input_base` we ask for `MatchNode::needs_ideal_memory_edge`. >> In all cases except for `MoveF2I` etc, we say we need a memory edge there. Now we don't have space to set a memory edge before the inputs, and we simply do not set one. >> >> So how do we work with this? Do I just remove `cisc->ins_req(1,src)` for all cases and alway use `add_req` as you have suggested? Or do I leave the other cases, and just make an else case and use `add_prec` there for `MoveF2I` etc? 
I am wondering what is a clean and consistent solution here. >> >> Do you know why we add the memory edge before the inputs in the other cases? If I use `add_prec` then that adds the memory edge after the inputs, correct? > >> Do you know why we add the memory edge before the inputs in the other cases? If I use `add_prec` then that adds the memory edge after the inputs, correct? > > Currently memory edges are being added for instructions which directly access memory; ADLC enforces this by scanning through the Ideal nodes of a matcher pattern in a top-down manner. > Almost all the machine nodes decorated with the **Flag_is_cisc_alternate** flag access Load/Store IR in their selection patterns. The only exceptions[1][2][3], as you pointed out, are the ones which perform load/store from stack locations. Thus all I am suggesting is that for all such cases, without making many changes, we can add the DEF_Spill precedence edge, which gets added after all the inputs but will still constrain the scheduling order. > > Thus an instruction which has a CISC alternate but lacks a memory_operand can be handled by adding a precedence edge. > > [1] MoveF2I_stack_regNode() { _num_opnds = 2; _opnds = _opnd_array; init_flags(Flag_is_cisc_alternate | Flag_needs_anti_dependence_check); } > [2] MoveI2F_stack_regNode() { _num_opnds = 2; _opnds = _opnd_array; init_flags(Flag_is_cisc_alternate | Flag_needs_anti_dependence_check); } > [3] MoveD2L_stack_regNode() { _num_opnds = 2; _opnds = _opnd_array; init_flags(Flag_is_cisc_alternate | Flag_needs_anti_dependence_check); } @jatin-bhateja thanks for bringing up the assertion, I implemented it and ran many tests. @vnkozlov @TobiHartmann Is this also ok with you?
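To make the scheduling argument concrete, here is a toy model, not the C2 scheduler: `ToyScheduler`, `addNode` and `addPrec` are hypothetical names that merely borrow the terminology used in this thread. It only shows that an ordering-only edge is enough to keep the spill store ahead of the node that reads the stack slot.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical toy scheduler: nodes are names, edges are "must come before".
final class ToyScheduler {
    // For each node, the nodes that have to be scheduled before it.
    private final Map<String, List<String>> preds = new LinkedHashMap<>();

    void addNode(String n) {
        preds.putIfAbsent(n, new ArrayList<>());
    }

    // Ordering-only edge: no value flows, but 'before' must be placed first.
    void addPrec(String after, String before) {
        addNode(after);
        addNode(before);
        preds.get(after).add(before);
    }

    // Repeatedly picks the first registered node whose predecessors are all
    // scheduled, so unconstrained nodes keep their registration order.
    List<String> schedule() {
        List<String> order = new ArrayList<>();
        while (order.size() < preds.size()) {
            boolean progressed = false;
            for (Map.Entry<String, List<String>> e : preds.entrySet()) {
                if (!order.contains(e.getKey()) && order.containsAll(e.getValue())) {
                    order.add(e.getKey());
                    progressed = true;
                    break;
                }
            }
            if (!progressed) {
                throw new IllegalStateException("cycle in ordering constraints");
            }
        }
        return order;
    }

    public static void main(String[] args) {
        // Without an edge, nothing stops the stack read from going first.
        ToyScheduler unconstrained = new ToyScheduler();
        unconstrained.addNode("MoveF2I_stack");
        unconstrained.addNode("SpillStore");
        System.out.println(unconstrained.schedule());

        // With the ordering edge, the store is forced ahead of the read.
        ToyScheduler constrained = new ToyScheduler();
        constrained.addNode("MoveF2I_stack");
        constrained.addNode("SpillStore");
        constrained.addPrec("MoveF2I_stack", "SpillStore");
        System.out.println(constrained.schedule());
    }
}
```

In the unconstrained case the read is placed first simply because it was registered first, which mirrors the "lucky most of the time" behaviour described for the missing edge.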
------------- PR: https://git.openjdk.java.net/jdk/pull/7889 From xliu at openjdk.java.net Wed May 4 05:27:25 2022 From: xliu at openjdk.java.net (Xin Liu) Date: Wed, 4 May 2022 05:27:25 GMT Subject: RFR: 8282024: add EscapeAnalysis statistics under PrintOptoStatistics [v8] In-Reply-To: References: Message-ID: On Wed, 4 May 2022 01:36:13 GMT, aamarsh wrote: >> Escape Analysis and Scalar Replacement statistics were added when the -XX:+PrintOptoStatistics flag is set. All code is placed in `#ifndef Product` block, so this code is only run when creating a debug build. Using renaissance benchmark I ran a few tests to confirm that numbers were printing correctly. Below is an example run: >> >> >> No escape = 372, Arg escape = 74, Global escape = 1855 (EA executed in 10.49 seconds) >> Objects scalar replaced = 240, Monitor objects removed = 44, GC barriers removed = 37, Memory barriers removed = 284 > > aamarsh has refreshed the contents of this pull request, and previous commits have been removed. The incremental views will show differences compared to the previous content of the PR. The pull request contains one new commit since the last revision: > > adding escape analysis and scalar replacement statistics It seems that you used git reset and force-pushed. That's why only one commit is displayed in this PR. The good news is that Skara can still generate revisions. I think it's easier for people to keep track of your changes incrementally. It's totally okay to keep committing to your branch. GitHub will squash the commits when it merges a PR. Please refrain from force-pushing.
------------- PR: https://git.openjdk.java.net/jdk/pull/8019 From xliu at openjdk.java.net Wed May 4 05:56:26 2022 From: xliu at openjdk.java.net (Xin Liu) Date: Wed, 4 May 2022 05:56:26 GMT Subject: RFR: 8285976: compiler/exceptions/OptimizeImplicitExceptions.java can't pass with -XX:+DeoptimizeALot In-Reply-To: References: Message-ID: On Tue, 3 May 2022 15:51:00 GMT, Tobias Hartmann wrote: >> test/hotspot/jtreg/compiler/exceptions/OptimizeImplicitExceptions.java line 200: >> >>> 198: // The following options are both develop, getBooleanVMFlag() returns NULL in product build. >>> 199: // If they are set in debug build, disable them for more test stability. >>> 200: if (WB.getBooleanVMFlag("DeoptimizeALot")) { >> >> If this really returns NULL in a product build, won't this cause an exception then? >> >> Wouldn't it be easier to just unconditionally set them both to false: >> >> WB.setBooleanVMFlag("DeoptimizeALot", false); >> WB.setBooleanVMFlag("DeoptimizeRandom", false); > > If so, I think that should be done in the `@run` statement. I prefer Volker's style over the `@run` annotation. Here are my arguments. 1. The `@run` approach needs an extra `-XX:+IgnoreUnrecognizedVMOptions` because those 2 options are not valid in product builds. 2. The current makefile puts JTREG="VM_OPTIONS=-XX:+DeoptimizeALot" first, so `@run` options will overwrite them. We are better off not relying on the makefile's logic. 3. Writing them in code gives us an opportunity to write a comment.
------------- PR: https://git.openjdk.java.net/jdk/pull/8513 From thartmann at openjdk.java.net Wed May 4 06:04:14 2022 From: thartmann at openjdk.java.net (Tobias Hartmann) Date: Wed, 4 May 2022 06:04:14 GMT Subject: RFR: 8282555: Missing memory edge when spilling MoveF2I, MoveD2L etc [v4] In-Reply-To: <9ibzoV61P9Nb7vPcFdT77THapiUSVG-Q8f77cYz4nnI=.75f5bf19-aff0-4c89-b8d6-ed4641122d4c@github.com> References: <9ibzoV61P9Nb7vPcFdT77THapiUSVG-Q8f77cYz4nnI=.75f5bf19-aff0-4c89-b8d6-ed4641122d4c@github.com> Message-ID: On Wed, 4 May 2022 05:17:01 GMT, Emanuel Peter wrote: >> Update: >> After the inputs from @jatin-bhateja, and verifying with @vnkozlov and @TobiHartmann , I have implemented a much simpler fix: >> Whenever there is no pre-allocated space before the inputs for the memory edge, we simply add the memory edge after the inputs. >> >> This is a bit of an ad-hoc fix, but it is much simpler than the other two options. Changing the `.ad` files requires much more work. Adding `stackSlot` to `MatchNode::needs_ideal_memory_edge` would also be an ad-hoc fix. >> >> The added test still fails with other changes in mainline, and passes with my new fix. Ran it 50 times to verify. >> Ran larger test suite, all passed. >> >> --------- >> >> In `PhaseChaitin::fixup_spills` we decide if we need a memory edge when reading from a spilled register. >> Unfortunately, for `MoveF2I`, `MoveD2L` etc we do not add such memory edges. >> This can lead to reversed scheduling, where we read from a `stackSlot` before we wrote to it, leading to wrong results. >> (This happens intermittently, but the regression test did reproduce it at about a 10% rate) >> >> In `PhaseChaitin::fixup_spills` we decided if such a memory edge needs to be added by comparing `oper_input_base()` of the node before spilling and after spilling. If `oper_input_base()` of the `mach` node (before spilling) is 1, this means that node does not have a memory edge yet. 
And if `oper_input_base()` of the `cisc` node (after spilling) is 2, this means it needs a memory edge. In all spill cases I could find, the value is 1 and 2 respectively, except for MoveF2I etc, there it is 1 and 1 respectively, thus the memory edge was omitted. >> >> The values of `oper_input_base()` are determined in `InstructForm::oper_input_base`, where we query `MatchNode::needs_ideal_memory_edge`. This function checks if there is an `_opType` in the recursive match structure of this mach node, that matches one of a list of nodes (`StoreI, StoreF, ... LoadI, StoreF, ... etc.`). Unfortunately, MoveF2I etc do not have such a match. Instead of a `StoreF/LoadF`, they used `stackSlotF` (which is not recognized in `MatchNode::needs_ideal_memory_edge`). So it thinks there is no need for a memory edge. >> >> We saw 2 options to fix this issue: >> 1) add `stackSlotI/L/P/D/F` to `MatchNode::needs_ideal_memory_edge`. However, this seems to be an inconsistent solution. The other items in that list are nodes, `stackSlot` is not. And other operations (like `addI, testI, etc.`) all use `LoadI/StoreI`, which is more generic (for heap and stack). >> 2) Change the arguments and match rules to not use `stackSlot`, but `memory` arguments and `LoadI/StoreI` nodes. This is a consistent and more generic solution (the MoveF2I operation could now be used not just for stack spilling but also reading/writing from/to memory). >> >> I picked option 2. >> Further, I now assume that we can always add such a memory edge when reading from a spilled register. This assumption did not get violated in my more extensive testing. >> >> While the regression test only failed about 10% due to this bug, the assert I added verifying that we add these memory edges did trigger 100% before I applied the fix in x86_64.ad. This means the memory edge was missing every time, just the scheduling varied and we were lucky most of the time. 
>> >> There are a few open points for discussion: >> >> - `loadSSI/L/P/F/D` still uses `stackSlot`. I have never observed that this operation gets its register spilled. But I still wonder if we should not have this operation use `memory/LoadI` instead of `stackSlotI`. I think we might even be able to simply remove `loadSSI` because it is already covered by what `loadI` does. (Update: Tests suggest `loadSSX` can be removed from `x86_64.ad`) >> - So far I have only applied my fix to `x86_64.ad` -> we probably want to apply it to all platforms. >> - Other platforms use `stackSlot` more often, for example in `x86_32.ad`. It may well be that some of these operation could also be spilled, which would probably also lead to missing memory edges. I wonder if we should maybe remove all occurances of `stackSlot` in the `ad` files, or if we should still add `stackSlot` to `MatchNode::needs_ideal_memory_edge` to ensure we alway can add the memory edges. > > Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: > > add assert that jatin asked for Changes requested by thartmann (Reviewer). ------------- PR: https://git.openjdk.java.net/jdk/pull/7889 From thartmann at openjdk.java.net Wed May 4 06:04:15 2022 From: thartmann at openjdk.java.net (Tobias Hartmann) Date: Wed, 4 May 2022 06:04:15 GMT Subject: RFR: 8282555: Missing memory edge when spilling MoveF2I, MoveD2L etc [v3] In-Reply-To: References: Message-ID: <0w4EqTYEcJpbf6N4WwHJca2dsVEhaaob14o4x8lLKr0=.b5e9d1a1-e4f4-4004-8d90-16cc713978cf@github.com> On Wed, 4 May 2022 05:12:24 GMT, Emanuel Peter wrote: >> src/hotspot/share/opto/chaitin.cpp line 1736: >> >>> 1734: // src to cisc, else we might schedule cisc before src, loading from a >>> 1735: // spill location before storing the spill. >>> 1736: cisc->add_prec(src); >> >> An assertion for null memory operand can be added before adding presidence edge. 
> > @jatin-bhateja Thanks, I added it :) `memory_operand()` returns a `MachOper*`, we should not compare it to `0` but to `nullptr` (see https://github.com/openjdk/jdk/blob/master/doc/hotspot-style.md). ------------- PR: https://git.openjdk.java.net/jdk/pull/7889 From xliu at openjdk.java.net Wed May 4 06:05:10 2022 From: xliu at openjdk.java.net (Xin Liu) Date: Wed, 4 May 2022 06:05:10 GMT Subject: RFR: 8285976: compiler/exceptions/OptimizeImplicitExceptions.java can't pass with -XX:+DeoptimizeALot [v2] In-Reply-To: References: Message-ID: <5SuVJWQQHFLGofmw50ZG5jr-xcGY8280fkW1O0vnfyE=.a5a69434-b41f-49d1-82ad-1d39ff328cb0@github.com> > Disable DeoptimizeALot and DeoptimizeRandom, 2 develop options when run this test for stability. Xin Liu has updated the pull request incrementally with one additional commit since the last revision: directly set them to false. this is easiser. ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/8513/files - new: https://git.openjdk.java.net/jdk/pull/8513/files/131a4ad1..b1c20571 Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=8513&range=01 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=8513&range=00-01 Stats: 9 lines in 1 file changed: 0 ins; 5 del; 4 mod Patch: https://git.openjdk.java.net/jdk/pull/8513.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8513/head:pull/8513 PR: https://git.openjdk.java.net/jdk/pull/8513 From thartmann at openjdk.java.net Wed May 4 06:35:16 2022 From: thartmann at openjdk.java.net (Tobias Hartmann) Date: Wed, 4 May 2022 06:35:16 GMT Subject: RFR: 8285976: compiler/exceptions/OptimizeImplicitExceptions.java can't pass with -XX:+DeoptimizeALot [v2] In-Reply-To: <5SuVJWQQHFLGofmw50ZG5jr-xcGY8280fkW1O0vnfyE=.a5a69434-b41f-49d1-82ad-1d39ff328cb0@github.com> References: <5SuVJWQQHFLGofmw50ZG5jr-xcGY8280fkW1O0vnfyE=.a5a69434-b41f-49d1-82ad-1d39ff328cb0@github.com> Message-ID: On Wed, 4 May 2022 06:05:10 GMT, Xin Liu wrote: >> Disable DeoptimizeALot and 
DeoptimizeRandom, 2 develop options when run this test for stability. > > Xin Liu has updated the pull request incrementally with one additional commit since the last revision: > > directly set them to false. this is easiser. Marked as reviewed by thartmann (Reviewer). ------------- PR: https://git.openjdk.java.net/jdk/pull/8513 From thartmann at openjdk.java.net Wed May 4 06:35:16 2022 From: thartmann at openjdk.java.net (Tobias Hartmann) Date: Wed, 4 May 2022 06:35:16 GMT Subject: RFR: 8285976: compiler/exceptions/OptimizeImplicitExceptions.java can't pass with -XX:+DeoptimizeALot [v2] In-Reply-To: References: Message-ID: On Wed, 4 May 2022 05:52:59 GMT, Xin Liu wrote: >> If so, I think that should be done in the `@run` statement. > > I prefer Volker's style over the `@run` annotation. Here are my arguments. > > 1. They need an extra `-XX:+IgnoreUnrecognizedVMOptions` because those 2 develop options are not valid in product builds. > 2. The current makefile puts JTREG="VM_OPTIONS=-XX:+DeoptimizeALot" first, and `@run` options will overwrite them. We are better off not relying on the makefile's logic. > 3. Writing them in code gives us an opportunity to write a comment. Okay, fine with me. ------------- PR: https://git.openjdk.java.net/jdk/pull/8513 From simonis at openjdk.java.net Wed May 4 06:40:20 2022 From: simonis at openjdk.java.net (Volker Simonis) Date: Wed, 4 May 2022 06:40:20 GMT Subject: RFR: 8285976: compiler/exceptions/OptimizeImplicitExceptions.java can't pass with -XX:+DeoptimizeALot [v2] In-Reply-To: <5SuVJWQQHFLGofmw50ZG5jr-xcGY8280fkW1O0vnfyE=.a5a69434-b41f-49d1-82ad-1d39ff328cb0@github.com> References: <5SuVJWQQHFLGofmw50ZG5jr-xcGY8280fkW1O0vnfyE=.a5a69434-b41f-49d1-82ad-1d39ff328cb0@github.com> Message-ID: On Wed, 4 May 2022 06:05:10 GMT, Xin Liu wrote: >> Disable DeoptimizeALot and DeoptimizeRandom, 2 develop options when run this test for stability.
> > Xin Liu has updated the pull request incrementally with one additional commit since the last revision: > > directly set them to false. this is easiser. Looks good now. Thanks. ------------- Marked as reviewed by simonis (Reviewer). PR: https://git.openjdk.java.net/jdk/pull/8513 From duke at openjdk.java.net Wed May 4 07:24:15 2022 From: duke at openjdk.java.net (Emanuel Peter) Date: Wed, 4 May 2022 07:24:15 GMT Subject: RFR: 8282555: Missing memory edge when spilling MoveF2I, MoveD2L etc [v3] In-Reply-To: <0w4EqTYEcJpbf6N4WwHJca2dsVEhaaob14o4x8lLKr0=.b5e9d1a1-e4f4-4004-8d90-16cc713978cf@github.com> References: <0w4EqTYEcJpbf6N4WwHJca2dsVEhaaob14o4x8lLKr0=.b5e9d1a1-e4f4-4004-8d90-16cc713978cf@github.com> Message-ID: On Wed, 4 May 2022 06:00:23 GMT, Tobias Hartmann wrote: >> @jatin-bhateja Thanks, I added it :) > > `memory_operand()` returns a `MachOper*`, we should not compare it to `0` but to `nullptr` (see https://github.com/openjdk/jdk/blob/master/doc/hotspot-style.md). @TobiHartmann thanks for the feedback, I had assumed it had the same return as `int InstructForm::memory_operand`, but that is not the case. Will change it. ------------- PR: https://git.openjdk.java.net/jdk/pull/7889 From duke at openjdk.java.net Wed May 4 07:34:02 2022 From: duke at openjdk.java.net (Emanuel Peter) Date: Wed, 4 May 2022 07:34:02 GMT Subject: RFR: 8282555: Missing memory edge when spilling MoveF2I, MoveD2L etc [v5] In-Reply-To: References: Message-ID: > Update: > After the inputs from @jatin-bhateja, and verifying with @vnkozlov and @TobiHartmann , I have implemented a much simpler fix: > Whenever there is no pre-allocated space before the inputs for the memory edge, we simply add the memory edge after the inputs. > > This is a bit of an ad-hoc fix, but it is much simpler than the other two options. Changing the `.ad` files requires much more work. Adding `stackSlot` to `MatchNode::needs_ideal_memory_edge` would also be an ad-hoc fix. 
> > The added test still fails with other changes in mainline, and passes with my new fix. Ran it 50 times to verify. > Ran larger test suite, all passed. > > --------- > > In `PhaseChaitin::fixup_spills` we decide if we need a memory edge when reading from a spilled register. > Unfortunately, for `MoveF2I`, `MoveD2L` etc we do not add such memory edges. > This can lead to reversed scheduling, where we read from a `stackSlot` before we wrote to it, leading to wrong results. > (This happens intermittently, but the regression test did reproduce it at about a 10% rate) > > In `PhaseChaitin::fixup_spills` we decided if such a memory edge needs to be added by comparing `oper_input_base()` of the node before spilling and after spilling. If `oper_input_base()` of the `mach` node (before spilling) is 1, this means that node does not have a memory edge yet. And if `oper_input_base()` of the `cisc` node (after spilling) is 2, this means it needs a memory edge. In all spill cases I could find, the value is 1 and 2 respectively, except for MoveF2I etc, there it is 1 and 1 respectively, thus the memory edge was omitted. > > The values of `oper_input_base()` are determined in `InstructForm::oper_input_base`, where we query `MatchNode::needs_ideal_memory_edge`. This function checks if there is an `_opType` in the recursive match structure of this mach node, that matches one of a list of nodes (`StoreI, StoreF, ... LoadI, StoreF, ... etc.`). Unfortunately, MoveF2I etc do not have such a match. Instead of a `StoreF/LoadF`, they used `stackSlotF` (which is not recognized in `MatchNode::needs_ideal_memory_edge`). So it thinks there is no need for a memory edge. > > We saw 2 options to fix this issue: > 1) add `stackSlotI/L/P/D/F` to `MatchNode::needs_ideal_memory_edge`. However, this seems to be an inconsistent solution. The other items in that list are nodes, `stackSlot` is not. 
And other operations (like `addI, testI, etc.`) all use `LoadI/StoreI`, which is more generic (for heap and stack). > 2) Change the arguments and match rules to not use `stackSlot`, but `memory` arguments and `LoadI/StoreI` nodes. This is a consistent and more generic solution (the MoveF2I operation could now be used not just for stack spilling but also reading/writing from/to memory). > > I picked option 2. > Further, I now assume that we can always add such a memory edge when reading from a spilled register. This assumption was not violated in my more extensive testing. > > While the regression test only failed about 10% of the time due to this bug, the assert I added verifying that we add these memory edges triggered 100% of the time before I applied the fix in x86_64.ad. This means the memory edge was missing every time, just the scheduling varied and we were lucky most of the time. > > There are a few open points for discussion: > > - `loadSSI/L/P/F/D` still uses `stackSlot`. I have never observed that this operation gets its register spilled. But I still wonder if we should not have this operation use `memory/LoadI` instead of `stackSlotI`. I think we might even be able to simply remove `loadSSI` because it is already covered by what `loadI` does. (Update: Tests suggest `loadSSX` can be removed from `x86_64.ad`) > - So far I have only applied my fix to `x86_64.ad` -> we probably want to apply it to all platforms. > - Other platforms use `stackSlot` more often, for example in `x86_32.ad`. It may well be that some of these operations could also be spilled, which would probably also lead to missing memory edges. I wonder if we should maybe remove all occurrences of `stackSlot` in the `ad` files, or if we should still add `stackSlot` to `MatchNode::needs_ideal_memory_edge` to ensure we can always add the memory edges.
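The decision described above, together with the simpler fix from the update, can be sketched as a toy model. This is only an illustration under stated assumptions: `ToyMachNode`, `MemEdge` and `attach_memory_edge` are hypothetical names, not HotSpot's `MachNode` API, and the rule that an `oper_input_base` of 2 before spilling means "memory edge already present" is inferred from the description rather than taken from the actual `PhaseChaitin::fixup_spills` code.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Toy stand-in for a machine node. NOT HotSpot's MachNode: the fields and
// names are hypothetical and exist only to illustrate the description.
struct ToyMachNode {
  int oper_input_base;              // index of the first operand input (1 or 2)
  std::vector<std::string> inputs;  // inputs[0] is control, operands follow
};

enum class MemEdge { AlreadyPresent, ReservedSlot, AppendedAfterInputs };

// Attach the spill store `src` as a memory edge of `cisc`, the node that
// reads its operand from the stack. `mach` is the node before spilling.
MemEdge attach_memory_edge(const ToyMachNode& mach, ToyMachNode& cisc,
                           const std::string& src) {
  assert(!cisc.inputs.empty());  // the toy model expects at least control
  if (mach.oper_input_base == 2) {
    // Assumption: base 2 before spilling means a memory edge already exists.
    return MemEdge::AlreadyPresent;
  }
  if (cisc.oper_input_base == 2) {
    // Usual case (1 -> 2): slot 1 is reserved for memory, ahead of operands.
    cisc.inputs.insert(cisc.inputs.begin() + 1, src);
    return MemEdge::ReservedSlot;
  }
  // MoveF2I-like case (1 -> 1): no reserved slot, so the edge is appended
  // after the inputs instead of being omitted.
  cisc.inputs.push_back(src);
  return MemEdge::AppendedAfterInputs;
}
```

In this model the buggy behaviour corresponds to returning from the last branch without adding anything, which leaves the scheduler free to place the load before the spill store.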
Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: compare ptr with nullptr, not zero ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/7889/files - new: https://git.openjdk.java.net/jdk/pull/7889/files/664b63f1..2a7ebce6 Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=7889&range=04 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=7889&range=03-04 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.java.net/jdk/pull/7889.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/7889/head:pull/7889 PR: https://git.openjdk.java.net/jdk/pull/7889 From thartmann at openjdk.java.net Wed May 4 07:34:02 2022 From: thartmann at openjdk.java.net (Tobias Hartmann) Date: Wed, 4 May 2022 07:34:02 GMT Subject: RFR: 8282555: Missing memory edge when spilling MoveF2I, MoveD2L etc [v5] In-Reply-To: References: Message-ID: On Wed, 4 May 2022 07:29:48 GMT, Emanuel Peter wrote: > Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: > > compare ptr with nullptr, not zero Looks good. ------------- Marked as reviewed by thartmann (Reviewer).
PR: https://git.openjdk.java.net/jdk/pull/7889 From lucy at openjdk.java.net Wed May 4 09:42:23 2022 From: lucy at openjdk.java.net (Lutz Schmidt) Date: Wed, 4 May 2022 09:42:23 GMT Subject: RFR: 8278757: [s390] Implement AES Counter Mode Intrinsic [v9] In-Reply-To: References: Message-ID: On Mon, 25 Apr 2022 21:42:39 GMT, Tyler Steele wrote: >> Lutz Schmidt has updated the pull request incrementally with one additional commit since the last revision: >> >> 8278757: add clarifying comments > > I see I have missed a request or two to re-run these tests. Sorry to keep you waiting! The much anticipated s390x Tier1 tests are running now. Updates will appear below. > > --- > > > # newfailures.txt > compiler/c2/irTests/TestAutoVectorization2DArray.java > > > The results completed overnight. It's just our old friend that failed. Looking good @RealLucy. A big thank you to the tester (@backwaterred) and the Reviewers (@TheRealMDoerr, @tstuefe)! ------------- PR: https://git.openjdk.java.net/jdk/pull/8142 From lucy at openjdk.java.net Wed May 4 09:42:24 2022 From: lucy at openjdk.java.net (Lutz Schmidt) Date: Wed, 4 May 2022 09:42:24 GMT Subject: Integrated: 8278757: [s390] Implement AES Counter Mode Intrinsic In-Reply-To: References: Message-ID: On Thu, 7 Apr 2022 10:02:07 GMT, Lutz Schmidt wrote: > Please review (and approve, if possible) this pull request. > > This is an s390-only enhancement. It introduces the implementation of an AES-CTR intrinsic, making use of the specific s390 instruction for AES counter-mode encryption. > > Testing: SAP no longer maintains a full build and test environment for s390. Testing is therefore limited to running some test suites (SPECjbb*, SPECjvm*) manually. But: identical code is contained in SAP's commercial product and thoroughly tested in that context. No issues were uncovered. > > @backwaterred Could you please conduct some "official" testing for this PR? > > Thank you all!
> > Note: some performance figures can be found in the JBS ticket. This pull request has now been integrated. Changeset: 4e1e76ac Author: Lutz Schmidt URL: https://git.openjdk.java.net/jdk/commit/4e1e76acfb2ac6131297fcea282bb7f2ca556f0e Stats: 716 lines in 7 files changed: 683 ins; 5 del; 28 mod 8278757: [s390] Implement AES Counter Mode Intrinsic Reviewed-by: mdoerr, stuefe ------------- PR: https://git.openjdk.java.net/jdk/pull/8142 From shade at openjdk.java.net Wed May 4 10:23:23 2022 From: shade at openjdk.java.net (Aleksey Shipilev) Date: Wed, 4 May 2022 10:23:23 GMT Subject: RFR: 8280003: C1: Reconsider uses of logical_and immediates in LIRGenerator::do_getObjectSize [v7] In-Reply-To: <7BkHqnHc-p24_yzG7Q7gOt2fLpl9bh3mzSZZGb9yMAU=.f6a54439-1474-46b0-aaa1-670e65da0b54@github.com> References: <4wfmxqeneC0qL6x2cFaMVp-AWoQVbognQdKjV_nx4_U=.40d443e8-d900-472c-857d-841efabebc3d@github.com> <7BkHqnHc-p24_yzG7Q7gOt2fLpl9bh3mzSZZGb9yMAU=.f6a54439-1474-46b0-aaa1-670e65da0b54@github.com> Message-ID: On Thu, 28 Apr 2022 10:43:59 GMT, Aleksey Shipilev wrote: >> See the discussion in the bug. >> >> Additional testing: >> - [x] Linux x86_64 fastdebug `java/lang/instrument` >> - [x] Linux x86_32 fastdebug `java/lang/instrument` >> - [x] Linux AArch64 fastdebug `java/lang/instrument` >> - [x] Linux ARM32 fastdebug `java/lang/instrument` >> - [x] Linux PPC64 fastdebug `java/lang/instrument` >> - [x] Linux x86_64 fastdebug `tier1` >> - [x] Linux x86_32 fastdebug `tier1` >> - [x] Linux AArch64 fastdebug `tier1` >> - [x] Linux PPC64 fastdebug `tier1` > > Aleksey Shipilev has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. 
The pull request contains 11 additional commits since the last revision: > > - Fix RISC-V too > - Merge branch 'master' into JDK-8280003-c1-logical-and > - Merge branch 'master' into JDK-8280003-c1-logical-and > - Revert ARM32 checks > - Merge branch 'master' into JDK-8280003-c1-logical-and > - Fixing failures in ARM32 > - Merge branch 'master' into JDK-8280003-c1-logical-and > - Checking ARM32 code > - Use checked_cast > - Merge branch 'master' into JDK-8280003-c1-logical-and > - ... and 1 more: https://git.openjdk.java.net/jdk/compare/9adc97d8...66448a5e I made some final checks before integration, and I cannot reproduce the ARM32 bug in current mainline anymore. @snazarkin, do you remember which test configuration failed? ------------- PR: https://git.openjdk.java.net/jdk/pull/7080 From snazarki at openjdk.java.net Wed May 4 11:10:17 2022 From: snazarki at openjdk.java.net (Sergey Nazarkin) Date: Wed, 4 May 2022 11:10:17 GMT Subject: RFR: 8280003: C1: Reconsider uses of logical_and immediates in LIRGenerator::do_getObjectSize [v5] In-Reply-To: References: <4wfmxqeneC0qL6x2cFaMVp-AWoQVbognQdKjV_nx4_U=.40d443e8-d900-472c-857d-841efabebc3d@github.com> Message-ID: On Fri, 28 Jan 2022 11:12:41 GMT, Sergey Nazarkin wrote: >> Aleksey Shipilev has updated the pull request incrementally with one additional commit since the last revision: >> >> Revert ARM32 checks > > LGTM, but I'm not reviewer > I made some final checks before integration, and I cannot reproduce the ARM32 bug in current mainline anymore. @snazarkin, do you remember which test configuration failed? 
it was ARM32 with explicitly enabled TieredCompilation ------------- PR: https://git.openjdk.java.net/jdk/pull/7080 From duke at openjdk.java.net Wed May 4 14:14:24 2022 From: duke at openjdk.java.net (Emanuel Peter) Date: Wed, 4 May 2022 14:14:24 GMT Subject: RFR: 8283775: VM support for graph querying in debugger with BFS traversal and node filtering [v5] In-Reply-To: References: Message-ID: > I implemented VM support for BFS traversal of IR nodes, where one can filter for input/output edges, and the node-type (control / data / memory). If one specifies the `target` node, we find a shortest path between `this` and `target`. With the `options` string one can easily select which node types to traverse. > > `void Node::print_bfs(const uint max_distance, Node* target, const char* options)` > > While a sufficient summary and description is in the comments above the function definition, I want to explain some useful use-cases here. > > Please let me know if you would find this helpful, or if you have any feedback to improve it. > Thanks, Emanuel > > **1. Better dump()** > The function is similar to `dump()`, we can also follow input / output edges up to a certain distance. There are a few improvements/added features: > > 1. Display the distance to the specified node. This way I can quickly assess how many nodes I have at a certain distance, and I can traverse edges from one distance to the next. The parent column shows the node one step closer to the BFS root (this). > 2. Node filtering with types taken from `type->category()` help one limit traversal to control, data or memory flow. The categories are the same as in IGV. > 3. Probably controversial: coloring of node types in terminal. By default it is off. May not work on all terminals. But I find it very useful to quickly separate memory, data and control - just as in IGV - but in my terminal! > 4. After matching, we create Mach nodes from the IR nodes. For most Mach nodes, there is an associated old node. 
Often an assert is triggered for Mach nodes, but the graph already breaks in the IR graph. It is thus very helpful to have the old node idx displayed. > 5. I also display the head node idx of the block a Mach node is scheduled for. This can be helpful if one gets bad dominance asserts, and needs to see why the blocks were scheduled in a bad way. > > Example: > > (rr) p find_node(35)->print_bfs(2, 0, "cdmox+") > No target: perform BFS. > dis par c dump > --------------------------------------------- > 0 35 d 35 CmpP === _ 34 25 [[ 36 ]] > 1 35 d 34 LoadP === _ 31 33 [[ 35 ]] > 1 35 d 25 ConP === 0 [[ 26 27 31 35 41 ]] #NULL > 2 34 m 31 StoreP === 20 27 29 25 [[ 23 34 41 42 ]] > 2 34 d 33 AddP === _ 1 12 32 [[ 34 ]] > > > Example with Mach nodes: > > (rr) p ctrl->print_bfs(4, 0, "cdmox+OB") > No target: perform BFS. > dis [head idom d] old par c dump > --------------------------------------------- > 0 159 147 6 _ 159 c 159 Region === 159 57 [[ 159 158 59 ]] > 1 147 148 5 o183 159 c 57 IfTrue === 8 [[ 159 ]] > 2 147 148 5 o182 57 c 8 jmpConU === 147 9 [[ 7 57 ]] > 3 147 148 5 _ 8 c 147 Region === 147 14 [[ 147 8 ]] > 3 147 148 5 o180 8 d 9 compUL_rReg === _ 10 13 [[ 8 ]] > 4 148 149 4 o174 147 c 14 IfTrue === 15 [[ 147 ]] > 4 147 148 5 o203 9 d 10 decL_rReg === _ 11 [[ 12 9 ]] > 4 147 148 5 o179 9 d 13 convI2L_reg_reg === _ 28 [[ 9 ]] > > > **2. Find loop body** > When I find myself in a loop, I try to localize the loop head and end, and map out at least one path between them. > `loop_end->print_bfs(20, loop_head, "cox+")` > This provides us with a shortest path, given this path has a distance of at most 20. > > Example: > > (rr) p find_node(158)->print_bfs(20, find_node(160), "cox+") > Find shortest path: 158 -> 160. > > Backtrace target. 
> dis c dump > --------------------------------------------- > 9 c 160 OuterStripMinedLoop === 160 339 159 [[ 160 358 ]] > 8 c 358 CountedLoop === 358 160 143 [[ 358 362 363 ]] > 7 c 363 If === 358 351 [[ 364 367 ]] > 6 c 364 IfTrue === 363 [[ 128 ]] > 5 c 128 If === 364 127 [[ 129 130 ]] > 4 c 129 IfTrue === 128 [[ 155 ]] > 3 c 155 CountedLoopEnd === 129 154 [[ 157 143 ]] [lt] > 2 c 157 IfFalse === 155 [[ 162 163 ]] > 1 c 162 SafePoint === 157 1 7 1 1 163 100 1 1 13 27 133 [[ 158 ]] > 0 c 158 OuterStripMinedLoopEnd === 162 156 [[ 159 227 ]] > > Example with Mach nodes: > > (rr) p ctrl->print_bfs(10, val, "cdmox-+OB") > Find shortest path: 159 -> 27. > > Backtrace target. > dis [head idom d] old e c dump > --------------------------------------------- > 2 24 1 2 o10 + d 27 MachProj === 24 [[ 19 28 4 59 95 99 118 ]] > 1 56 159 7 o239 - d 59 loadB === 159 29 27 60 [[ 55 ]] > 0 159 147 6 _ c 159 Region === 159 57 [[ 159 158 59 ]] Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: fix int / long issue, and improve idx printing for paths ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/8468/files - new: https://git.openjdk.java.net/jdk/pull/8468/files/b0cd3044..0393074e Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=8468&range=04 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=8468&range=03-04 Stats: 23 lines in 1 file changed: 8 ins; 6 del; 9 mod Patch: https://git.openjdk.java.net/jdk/pull/8468.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8468/head:pull/8468 PR: https://git.openjdk.java.net/jdk/pull/8468 From lucy at openjdk.java.net Wed May 4 14:51:07 2022 From: lucy at openjdk.java.net (Lutz Schmidt) Date: Wed, 4 May 2022 14:51:07 GMT Subject: RFR: 8285733: [s390] Vector Instruction Emitters for element-wise access are broken Message-ID: Please review this rather simple pull request. It fixes some vector instruction emitters. 
The bugs had gone unnoticed so far because the emitters had not been used. Therefore, the fix bears no risk. Testing was performed with new code currently under development. ------------- Commit messages: - 8285733: [s390] Vector Instruction Emitters for element-wise access are broken Changes: https://git.openjdk.java.net/jdk/pull/8537/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=8537&range=00 Issue: https://bugs.openjdk.java.net/browse/JDK-8285733 Stats: 20 lines in 1 file changed: 1 ins; 0 del; 19 mod Patch: https://git.openjdk.java.net/jdk/pull/8537.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8537/head:pull/8537 PR: https://git.openjdk.java.net/jdk/pull/8537 From kvn at openjdk.java.net Wed May 4 15:19:24 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Wed, 4 May 2022 15:19:24 GMT Subject: RFR: 8282555: Missing memory edge when spilling MoveF2I, MoveD2L etc [v5] In-Reply-To: References: Message-ID: On Wed, 4 May 2022 07:34:02 GMT, Emanuel Peter wrote: > Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: > > compare ptr with nullptr, not zero I agree with assert. ------------- Marked as reviewed by kvn (Reviewer).
PR: https://git.openjdk.java.net/jdk/pull/7889 From shade at openjdk.java.net Wed May 4 15:40:37 2022 From: shade at openjdk.java.net (Aleksey Shipilev) Date: Wed, 4 May 2022 15:40:37 GMT Subject: RFR: 8280003: C1: Reconsider uses of logical_and immediates in LIRGenerator::do_getObjectSize [v7] In-Reply-To: <7BkHqnHc-p24_yzG7Q7gOt2fLpl9bh3mzSZZGb9yMAU=.f6a54439-1474-46b0-aaa1-670e65da0b54@github.com> References: <4wfmxqeneC0qL6x2cFaMVp-AWoQVbognQdKjV_nx4_U=.40d443e8-d900-472c-857d-841efabebc3d@github.com> <7BkHqnHc-p24_yzG7Q7gOt2fLpl9bh3mzSZZGb9yMAU=.f6a54439-1474-46b0-aaa1-670e65da0b54@github.com> Message-ID: On Thu, 28 Apr 2022 10:43:59 GMT, Aleksey Shipilev wrote: >> See the discussion in the bug. >> >> Additional testing: >> - [x] Linux x86_64 fastdebug `java/lang/instrument` >> - [x] Linux x86_32 fastdebug `java/lang/instrument` >> - [x] Linux AArch64 fastdebug `java/lang/instrument` >> - [x] Linux ARM32 fastdebug `java/lang/instrument` >> - [x] Linux PPC64 fastdebug `java/lang/instrument` >> - [x] Linux x86_64 fastdebug `tier1` >> - [x] Linux x86_32 fastdebug `tier1` >> - [x] Linux AArch64 fastdebug `tier1` >> - [x] Linux PPC64 fastdebug `tier1` > > Aleksey Shipilev has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains 11 additional commits since the last revision: > > - Fix RISC-V too > - Merge branch 'master' into JDK-8280003-c1-logical-and > - Merge branch 'master' into JDK-8280003-c1-logical-and > - Revert ARM32 checks > - Merge branch 'master' into JDK-8280003-c1-logical-and > - Fixing failures in ARM32 > - Merge branch 'master' into JDK-8280003-c1-logical-and > - Checking ARM32 code > - Use checked_cast > - Merge branch 'master' into JDK-8280003-c1-logical-and > - ... 
and 1 more: https://git.openjdk.java.net/jdk/compare/bf48758f...66448a5e All right, this reproduces on ARM32: $ make run-test TEST=java/lang/instrument/GetObjectSizeIntrinsicsTest.java TEST_VM_OPTS="-XX:+TieredCompilation" This PR still resolves the issue. Therefore, I am integrating. ------------- PR: https://git.openjdk.java.net/jdk/pull/7080 From shade at openjdk.java.net Wed May 4 15:42:42 2022 From: shade at openjdk.java.net (Aleksey Shipilev) Date: Wed, 4 May 2022 15:42:42 GMT Subject: Integrated: 8280003: C1: Reconsider uses of logical_and immediates in LIRGenerator::do_getObjectSize In-Reply-To: <4wfmxqeneC0qL6x2cFaMVp-AWoQVbognQdKjV_nx4_U=.40d443e8-d900-472c-857d-841efabebc3d@github.com> References: <4wfmxqeneC0qL6x2cFaMVp-AWoQVbognQdKjV_nx4_U=.40d443e8-d900-472c-857d-841efabebc3d@github.com> Message-ID: On Fri, 14 Jan 2022 11:05:45 GMT, Aleksey Shipilev wrote: > See the discussion in the bug. > > Additional testing: > - [x] Linux x86_64 fastdebug `java/lang/instrument` > - [x] Linux x86_32 fastdebug `java/lang/instrument` > - [x] Linux AArch64 fastdebug `java/lang/instrument` > - [x] Linux ARM32 fastdebug `java/lang/instrument` > - [x] Linux PPC64 fastdebug `java/lang/instrument` > - [x] Linux x86_64 fastdebug `tier1` > - [x] Linux x86_32 fastdebug `tier1` > - [x] Linux AArch64 fastdebug `tier1` > - [x] Linux PPC64 fastdebug `tier1` This pull request has now been integrated. 
Changeset: 7b7207a4 Author: Aleksey Shipilev URL: https://git.openjdk.java.net/jdk/commit/7b7207a45a2838823b42c9c7cb0a45a97996018a Stats: 25 lines in 8 files changed: 4 ins; 1 del; 20 mod 8280003: C1: Reconsider uses of logical_and immediates in LIRGenerator::do_getObjectSize Co-authored-by: Sergey Nazarkin Reviewed-by: snazarki, dlong, iveresov ------------- PR: https://git.openjdk.java.net/jdk/pull/7080 From duke at openjdk.java.net Wed May 4 16:56:36 2022 From: duke at openjdk.java.net (Emanuel Peter) Date: Wed, 4 May 2022 16:56:36 GMT Subject: RFR: 8283775: VM support for graph querying in debugger with BFS traversal and node filtering [v6] In-Reply-To: References: Message-ID: > I implemented VM support for BFS traversal of IR nodes, where one can filter for input/output edges, and the node-type (control / data / memory). If one specifies the `target` node, we find a shortest path between `this` and `target`. With the `options` string one can easily select which node types to traverse. > > `void Node::print_bfs(const uint max_distance, Node* target, const char* options)` > > While a sufficient summary and description is in the comments above the function definition, I want to explain some useful use-cases here. > > Please let me know if you would find this helpful, or if you have any feedback to improve it. > Thanks, Emanuel > > **1. Better dump()** > The function is similar to `dump()`, we can also follow input / output edges up to a certain distance. There are a few improvements/added features: > > 1. Display the distance to the specified node. This way I can quickly assess how many nodes I have at a certain distance, and I can traverse edges from one distance to the next. The parent column shows the node one step closer to the BFS root (this). > 2. Node filtering with types taken from `type->category()` help one limit traversal to control, data or memory flow. The categories are the same as in IGV. > 3. 
Probably controversial: coloring of node types in terminal. By default it is off. May not work on all terminals. But I find it very useful to quickly separate memory, data and control - just as in IGV - but in my terminal! > 4. After matching, we create Mach nodes from the IR nodes. For most Mach nodes, there is an associated old node. Often an assert is triggered for Mach nodes, but the graph already breaks in the IR graph. It is thus very helpful to have the old node idx displayed. > 5. I also display the head node idx of the block a Mach node is scheduled for. This can be helpful if one gets bad dominance asserts, and needs to see why the blocks were scheduled in a bad way. > > Example: > > (rr) p find_node(35)->print_bfs(2, 0, "cdmox+") > No target: perform BFS. > dis par c dump > --------------------------------------------- > 0 35 d 35 CmpP === _ 34 25 [[ 36 ]] > 1 35 d 34 LoadP === _ 31 33 [[ 35 ]] > 1 35 d 25 ConP === 0 [[ 26 27 31 35 41 ]] #NULL > 2 34 m 31 StoreP === 20 27 29 25 [[ 23 34 41 42 ]] > 2 34 d 33 AddP === _ 1 12 32 [[ 34 ]] > > > Example with Mach nodes: > > (rr) p ctrl->print_bfs(4, 0, "cdmox+OB") > No target: perform BFS. > dis [head idom d] old par c dump > --------------------------------------------- > 0 159 147 6 _ 159 c 159 Region === 159 57 [[ 159 158 59 ]] > 1 147 148 5 o183 159 c 57 IfTrue === 8 [[ 159 ]] > 2 147 148 5 o182 57 c 8 jmpConU === 147 9 [[ 7 57 ]] > 3 147 148 5 _ 8 c 147 Region === 147 14 [[ 147 8 ]] > 3 147 148 5 o180 8 d 9 compUL_rReg === _ 10 13 [[ 8 ]] > 4 148 149 4 o174 147 c 14 IfTrue === 15 [[ 147 ]] > 4 147 148 5 o203 9 d 10 decL_rReg === _ 11 [[ 12 9 ]] > 4 147 148 5 o179 9 d 13 convI2L_reg_reg === _ 28 [[ 9 ]] > > > **2. Find loop body** > When I find myself in a loop, I try to localize the loop head and end, and map out at least one path between them. > `loop_end->print_bfs(20, loop_head, "cox+")` > This provides us with a shortest path, given this path has a distance of at most 20. 
> > Example: > > (rr) p find_node(158)->print_bfs(20, find_node(160), "cox+") > Find shortest path: 158 -> 160. > > Backtrace target. > dis c dump > --------------------------------------------- > 9 c 160 OuterStripMinedLoop === 160 339 159 [[ 160 358 ]] > 8 c 358 CountedLoop === 358 160 143 [[ 358 362 363 ]] > 7 c 363 If === 358 351 [[ 364 367 ]] > 6 c 364 IfTrue === 363 [[ 128 ]] > 5 c 128 If === 364 127 [[ 129 130 ]] > 4 c 129 IfTrue === 128 [[ 155 ]] > 3 c 155 CountedLoopEnd === 129 154 [[ 157 143 ]] [lt] > 2 c 157 IfFalse === 155 [[ 162 163 ]] > 1 c 162 SafePoint === 157 1 7 1 1 163 100 1 1 13 27 133 [[ 158 ]] > 0 c 158 OuterStripMinedLoopEnd === 162 156 [[ 159 227 ]] > > Example with Mach nodes: > > (rr) p ctrl->print_bfs(10, val, "cdmox-+OB") > Find shortest path: 159 -> 27. > > Backtrace target. > dis [head idom d] old e c dump > --------------------------------------------- > 2 24 1 2 o10 + d 27 MachProj === 24 [[ 19 28 4 59 95 99 118 ]] > 1 56 159 7 o239 - d 59 loadB === 159 29 27 60 [[ 55 ]] > 0 159 147 6 _ c 159 Region === 159 57 [[ 159 158 59 ]] Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: missed in last commit ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/8468/files - new: https://git.openjdk.java.net/jdk/pull/8468/files/0393074e..4b7e3b1d Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=8468&range=05 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=8468&range=04-05 Stats: 2 lines in 2 files changed: 0 ins; 0 del; 2 mod Patch: https://git.openjdk.java.net/jdk/pull/8468.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8468/head:pull/8468 PR: https://git.openjdk.java.net/jdk/pull/8468 From xliu at openjdk.java.net Wed May 4 18:04:35 2022 From: xliu at openjdk.java.net (Xin Liu) Date: Wed, 4 May 2022 18:04:35 GMT Subject: Integrated: 8285976: compiler/exceptions/OptimizeImplicitExceptions.java can't pass with -XX:+DeoptimizeALot 
In-Reply-To: References: Message-ID: On Tue, 3 May 2022 01:20:48 GMT, Xin Liu wrote: > Disable DeoptimizeALot and DeoptimizeRandom, 2 develop options when run this test for stability. This pull request has now been integrated. Changeset: c5a0687f Author: Xin Liu URL: https://git.openjdk.java.net/jdk/commit/c5a0687f80367a3a284dfd56781c371826264d3b Stats: 5 lines in 1 file changed: 5 ins; 0 del; 0 mod 8285976: compiler/exceptions/OptimizeImplicitExceptions.java can't pass with -XX:+DeoptimizeALot Reviewed-by: kvn, thartmann, simonis ------------- PR: https://git.openjdk.java.net/jdk/pull/8513 From jbhateja at openjdk.java.net Wed May 4 19:04:06 2022 From: jbhateja at openjdk.java.net (Jatin Bhateja) Date: Wed, 4 May 2022 19:04:06 GMT Subject: RFR: 8284813: x86 Code cleanup related to move instructions. [v3] In-Reply-To: References: Message-ID: <934g3GF5YaRahiZ55c79D7MeIZRNgHnW02LFgy1bN88=.08209581-4395-4eec-a821-53b4b72feef8@github.com> > Summary of changes: > > - Correct feature checks in some assembler move instruction. > - Explicitly pass opmask register in routines accepting merge argument. > - Code re-organization related to move instruction, pull out the merge argument up to instruction pattern or top level caller. > - Add missing encoding based move elision checks in some macro assembly routines. > > Kindly review and share your feedback. > > Regards, > Jatin Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: 8284813: Review comments resolutions. 
------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/8230/files - new: https://git.openjdk.java.net/jdk/pull/8230/files/0792195e..939c0317 Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=8230&range=02 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=8230&range=01-02 Stats: 3 lines in 1 file changed: 0 ins; 0 del; 3 mod Patch: https://git.openjdk.java.net/jdk/pull/8230.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8230/head:pull/8230 PR: https://git.openjdk.java.net/jdk/pull/8230 From jbhateja at openjdk.java.net Wed May 4 19:04:10 2022 From: jbhateja at openjdk.java.net (Jatin Bhateja) Date: Wed, 4 May 2022 19:04:10 GMT Subject: RFR: 8284813: x86 Code cleanup related to move instructions. [v2] In-Reply-To: References: Message-ID: On Fri, 29 Apr 2022 05:10:44 GMT, Jatin Bhateja wrote: >> Summary of changes: >> >> - Correct feature checks in some assembler move instruction. >> - Explicitly pass opmask register in routines accepting merge argument. >> - Code re-organization related to move instruction, pull out the merge argument up to instruction pattern or top level caller. >> - Add missing encoding based move elision checks in some macro assembly routines. >> >> Kindly review and share your feedback. >> >> Regards, >> Jatin > > Jatin Bhateja has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains two additional commits since the last revision: > > - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8284813 > - 8284813: x86 Code cleanup related to move instructions. Hi @sviswa7 , your comments have been addressed. 
------------- PR: https://git.openjdk.java.net/jdk/pull/8230 From jbhateja at openjdk.java.net Wed May 4 19:04:12 2022 From: jbhateja at openjdk.java.net (Jatin Bhateja) Date: Wed, 4 May 2022 19:04:12 GMT Subject: RFR: 8284813: x86 Code cleanup related to move instructions. [v2] In-Reply-To: <03wdHrmXfp_Q6ZnotXUTjWoDqFx_TB6giFrymCGJP9I=.a6ae20f7-6041-4d8a-a0ea-2dbb729b5963@github.com> References: <03wdHrmXfp_Q6ZnotXUTjWoDqFx_TB6giFrymCGJP9I=.a6ae20f7-6041-4d8a-a0ea-2dbb729b5963@github.com> Message-ID: On Tue, 3 May 2022 22:36:07 GMT, Sandhya Viswanathan wrote: >> Jatin Bhateja has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains two additional commits since the last revision: >> >> - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8284813 >> - 8284813: x86 Code cleanup related to move instructions. > > src/hotspot/cpu/x86/assembler_x86.cpp line 3032: > >> 3030: attributes.reset_is_clear_context(); >> 3031: } >> 3032: int encode = vex_prefix_and_encode(dst->encoding(), 0, src->encoding(), VEX_SIMD_F2, VEX_OPCODE_0F, &attributes); > > The existing version (with no mask) was using VEX_SIMD_F2 or VEX_SIMD_F3 based on avx512bw supported or not. With this change now the calling place need to be fixed. One place I see this being used is loadIotaIndices(). Please fix loadIotaIndices to use appropriate instruction for the platform. Is there any other place in array copy/clear? vector load operations in load_iota_indices are sensitive to vector length, a 64 byte iota values are loaded only for ByteVector.SPECIES_512 which necessitates existence of AVX512BW feature, I re-checked that copy/fill kernels use appropriate instructions. 
> src/hotspot/cpu/x86/macroAssembler_x86_arrayCopy_avx3.cpp line 217: > >> 215: bzhiq(temp, temp, length); >> 216: kmovql(mask, temp); >> 217: evmovdqu(type[shift], mask, xmm, Address(src, index, scale, offset), true, Assembler::AVX_256bit); > > Should the merge parameter be set to false for load here? DONE. ------------- PR: https://git.openjdk.java.net/jdk/pull/8230 From psandoz at openjdk.java.net Wed May 4 19:13:27 2022 From: psandoz at openjdk.java.net (Paul Sandoz) Date: Wed, 4 May 2022 19:13:27 GMT Subject: RFR: 8284960: Integration of JEP 426: Vector API (Fourth Incubator) In-Reply-To: References: Message-ID: On Wed, 27 Apr 2022 11:03:48 GMT, Jatin Bhateja wrote: > Hi All, > > Patch adds the planned support for new vector operations and APIs targeted for [JEP 426: Vector API (Fourth Incubator).](https://bugs.openjdk.java.net/browse/JDK-8280173) > > Following is the brief summary of changes:- > > 1) Extends the scope of existing lanewise API for following new vector operations. > - VectorOperations.BIT_COUNT: counts the number of one-bits > - VectorOperations.LEADING_ZEROS_COUNT: counts the number of leading zero bits > - VectorOperations.TRAILING_ZEROS_COUNT: counts the number of trailing zero bits > - VectorOperations.REVERSE: reversing the order of bits > - VectorOperations.REVERSE_BYTES: reversing the order of bytes > - compress and expand bits: Semantics are based on Hacker's Delight section 7-4 Compress, or Generalized Extract. > > 2) Adds following new APIs to perform cross lane vector compress and expansion operations under the influence of a mask. > - Vector.compress > - Vector.expand > - VectorMask.compress > > 3) Adds predicated and non-predicated versions of following new APIs to load and store the contents of vector from foreign MemorySegments. > - Vector.fromMemorySegment > - Vector.intoMemorySegment > > 4) C2 Compiler IR enhancements and optimized X86 and AARCH64 backend support for each newly added operation. 
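To make the compress/expand-bits semantics from item 1 of the list above concrete, here is a naive bit-by-bit sketch for `int` values. It is a reference loop written purely for illustration, not the intrinsic and not the optimized Hacker's Delight algorithm; the class and method names are hypothetical.

```java
// Illustrative sketch only: BitCompressSketch and its methods are
// hypothetical names, not part of the JDK or of the patch under review.
class BitCompressSketch {

    // Gather the bits of 'value' selected by 'mask' into the low-order
    // bits of the result, preserving their relative order.
    static int compress(int value, int mask) {
        int result = 0;
        int outBit = 0;                       // next free low-order position
        for (int i = 0; i < Integer.SIZE; i++) {
            if (((mask >>> i) & 1) != 0) {
                result |= ((value >>> i) & 1) << outBit;
                outBit++;
            }
        }
        return result;
    }

    // Inverse direction: scatter the low-order bits of 'value' to the
    // positions selected by 'mask'; all other result bits are zero.
    static int expand(int value, int mask) {
        int result = 0;
        int inBit = 0;                        // next low-order bit to consume
        for (int i = 0; i < Integer.SIZE; i++) {
            if (((mask >>> i) & 1) != 0) {
                result |= ((value >>> inBit) & 1) << i;
                inBit++;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        int mask = 0xFF00FFF0;
        int packed = compress(0xCAFEBABE, mask);
        System.out.println(Integer.toHexString(packed));               // prints "cabab"
        System.out.println(Integer.toHexString(expand(packed, mask))); // prints "ca00bab0"
    }
}
```

Note that `expand(compress(x, m), m)` equals `x & m`, which is a handy property to check when validating a faster implementation against this loop.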
> > > Patch has been regressed over AARCH64 and X86 targets different AVX levels. > > Kindly review and share your feedback. > > Best Regards, > Jatin src/hotspot/cpu/aarch64/c2_MacroAssembler_aarch64.cpp line 1340: > 1338: assert_different_registers(dst, src, vtmp1, vtmp2, vtmp3, vtmp4); > 1339: assert_different_registers(mask, ptmp, pgtmp); > 1340: // Example input: src = 88 77 66 45 44 33 22 11 Suggestion: // Example input: src = 88 77 66 55 44 33 22 11 ------------- PR: https://git.openjdk.java.net/jdk/pull/8425 From kvn at openjdk.java.net Wed May 4 19:59:18 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Wed, 4 May 2022 19:59:18 GMT Subject: RFR: 8285973: x86_64: Improve fp comparison and cmove for eq/ne In-Reply-To: References: Message-ID: On Wed, 4 May 2022 01:59:17 GMT, Quan Anh Mai wrote: > Hi, > > This patch optimises the matching rules for floating-point comparison with respects to eq/ne on x86-64 > > 1, When the inputs of a comparison is the same (i.e `isNaN` patterns), `ZF` is always set, so we don't need `cmpOpUCF2` for the eq/ne cases, which improves the sequence of `If (CmpF x x) (Bool ne)` from > > ucomiss xmm0, xmm0 > jp label > jne label > > into > > ucomiss xmm0, xmm0 > jp label > > 2, The move rules for `cmpOpUCF2` is missing, which makes patterns such as `x == y ? 1 : 0` to fall back to `cmpOpU`, which have a really high cost of fixing the flags, such as > > xorl ecx, ecx > ucomiss xmm0, xmm1 > jnp done > pushf > andq [rsp], 0xffffff2b > popf > done: > movl eax, 1 > cmovel eax, ecx > > The patch changes this sequence into > > xorl ecx, ecx > ucomiss xmm0, xmm1 > movl eax, 1 > cmovpl eax, ecx > cmovnel eax, ecx > > 3, The patch also changes the pattern of `isInfinite` to be more optimised by using `Math.abs` to reduce 1 comparison and compares the result with `MAX_VALUE` since `>` is more optimised than `==` for floating-point types. 
>
> The benchmark results are as follows:
>
> Before:
> Benchmark                      Mode  Cnt     Score     Error  Units
> FPComparison.equalDouble       avgt    5  2876.242 ±  58.875  ns/op
> FPComparison.equalFloat        avgt    5  3062.430 ±  31.371  ns/op
> FPComparison.isFiniteDouble    avgt    5   475.749 ±  19.027  ns/op
> FPComparison.isFiniteFloat     avgt    5   506.525 ±  14.417  ns/op
> FPComparison.isInfiniteDouble  avgt    5  1232.800 ±  31.677  ns/op
> FPComparison.isInfiniteFloat   avgt    5  1234.708 ±  70.239  ns/op
> FPComparison.isNanDouble       avgt    5  2255.847 ±   7.238  ns/op
> FPComparison.isNanFloat        avgt    5  2567.044 ±  36.078  ns/op
>
> After:
> Benchmark                      Mode  Cnt     Score     Error  Units
> FPComparison.equalDouble       avgt    5   594.636 ±   8.922  ns/op
> FPComparison.equalFloat        avgt    5   663.849 ±   3.656  ns/op
> FPComparison.isFiniteDouble    avgt    5   518.309 ± 107.352  ns/op
> FPComparison.isFiniteFloat     avgt    5   515.576 ±  14.669  ns/op
> FPComparison.isInfiniteDouble  avgt    5   621.185 ±  11.935  ns/op
> FPComparison.isInfiniteFloat   avgt    5   623.566 ±  15.206  ns/op
> FPComparison.isNanDouble       avgt    5   400.124 ±   0.762  ns/op
> FPComparison.isNanFloat        avgt    5   546.486 ±   1.509  ns/op
>
> Thank you very much.

Thank you for including tests. But you need additional tests for the other float comparisons (not just `eq` and `ne`), since you removed some of the `Cmp` instructions.

You need approval from core libs for the Java method changes (they affect all platforms). And we will get intrinsics for them soon (I think): #8459.

Please add comments to the `cmpOpU*` operands explaining the changes in them.

You did not explain the removal of some float compare instructions.

src/hotspot/cpu/x86/x86_64.ad line 6998:

> 6996: ins_encode %{
> 6997: __ cmovl(Assembler::parity, $dst$$Register, $src$$Register);
> 6998: __ cmovl(Assembler::notEqual, $dst$$Register, $src$$Register);

Should this be `equal`?

-------------

Changes requested by kvn (Reviewer).
PR: https://git.openjdk.java.net/jdk/pull/8525 From sviswanathan at openjdk.java.net Wed May 4 20:23:14 2022 From: sviswanathan at openjdk.java.net (Sandhya Viswanathan) Date: Wed, 4 May 2022 20:23:14 GMT Subject: RFR: 8284813: x86 Code cleanup related to move instructions. [v3] In-Reply-To: <934g3GF5YaRahiZ55c79D7MeIZRNgHnW02LFgy1bN88=.08209581-4395-4eec-a821-53b4b72feef8@github.com> References: <934g3GF5YaRahiZ55c79D7MeIZRNgHnW02LFgy1bN88=.08209581-4395-4eec-a821-53b4b72feef8@github.com> Message-ID: On Wed, 4 May 2022 19:04:06 GMT, Jatin Bhateja wrote: >> Summary of changes: >> >> - Correct feature checks in some assembler move instruction. >> - Explicitly pass opmask register in routines accepting merge argument. >> - Code re-organization related to move instruction, pull out the merge argument up to instruction pattern or top level caller. >> - Add missing encoding based move elision checks in some macro assembly routines. >> >> Kindly review and share your feedback. >> >> Regards, >> Jatin > > Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: > > 8284813: Review comments resolutions. Marked as reviewed by sviswanathan (Reviewer). ------------- PR: https://git.openjdk.java.net/jdk/pull/8230 From sviswanathan at openjdk.java.net Wed May 4 20:23:16 2022 From: sviswanathan at openjdk.java.net (Sandhya Viswanathan) Date: Wed, 4 May 2022 20:23:16 GMT Subject: RFR: 8284813: x86 Code cleanup related to move instructions. 
[v2] In-Reply-To: References: <03wdHrmXfp_Q6ZnotXUTjWoDqFx_TB6giFrymCGJP9I=.a6ae20f7-6041-4d8a-a0ea-2dbb729b5963@github.com> Message-ID: On Wed, 4 May 2022 18:57:07 GMT, Jatin Bhateja wrote: >> src/hotspot/cpu/x86/assembler_x86.cpp line 3032: >> >>> 3030: attributes.reset_is_clear_context(); >>> 3031: } >>> 3032: int encode = vex_prefix_and_encode(dst->encoding(), 0, src->encoding(), VEX_SIMD_F2, VEX_OPCODE_0F, &attributes); >> >> The existing version (with no mask) was using VEX_SIMD_F2 or VEX_SIMD_F3 based on avx512bw supported or not. With this change now the calling place need to be fixed. One place I see this being used is loadIotaIndices(). Please fix loadIotaIndices to use appropriate instruction for the platform. Is there any other place in array copy/clear? > > vector load operations in load_iota_indices are sensitive to vector length, a 64 byte iota values are loaded only for ByteVector.SPECIES_512 which necessitates existence of AVX512BW feature, I re-checked that copy/fill kernels use appropriate instructions. Thanks for checking. ------------- PR: https://git.openjdk.java.net/jdk/pull/8230 From sviswanathan at openjdk.java.net Wed May 4 22:27:18 2022 From: sviswanathan at openjdk.java.net (Sandhya Viswanathan) Date: Wed, 4 May 2022 22:27:18 GMT Subject: RFR: 8283894: Intrinsify compress and expand bits on x86 In-Reply-To: References: Message-ID: On Mon, 2 May 2022 08:19:53 GMT, Jatin Bhateja wrote: > Summary of changes: > > - Patch intrinsifies following newly added Java SE APIs > - Integer.compress > - Integer.expand > - Long.compress > - Long.expand > > - Adds C2 IR nodes and corresponding ideal transformations for new operations. > - We see around ~10x performance speedup due to intrinsification over X86 target. > - Adds an IR framework based test to validate newly introduced IR transformations. > > Kindly review and share your feedback. 
> > Best Regards, > Jatin src/hotspot/cpu/x86/x86.ad line 6191: > 6189: %} > 6190: > 6191: instruct compressBitsL_reg(rRegL dst, rRegL src, rRegL mask) %{ All the compress/expand rules could be moved to x86_64.ad. src/hotspot/share/opto/intrinsicnode.cpp line 160: > 158: > 159: Node* compress_expand_identity(PhaseGVN* phase, Node* n) { > 160: BasicType bt = n->bottom_type()->array_element_basic_type(); Why use of array_element_basic_type() here? These are not arrays. ------------- PR: https://git.openjdk.java.net/jdk/pull/8498 From psandoz at openjdk.java.net Wed May 4 22:31:25 2022 From: psandoz at openjdk.java.net (Paul Sandoz) Date: Wed, 4 May 2022 22:31:25 GMT Subject: RFR: 8285973: x86_64: Improve fp comparison and cmove for eq/ne In-Reply-To: References: Message-ID: On Wed, 4 May 2022 01:59:17 GMT, Quan Anh Mai wrote: > Hi, > > This patch optimises the matching rules for floating-point comparison with respects to eq/ne on x86-64 > > 1, When the inputs of a comparison is the same (i.e `isNaN` patterns), `ZF` is always set, so we don't need `cmpOpUCF2` for the eq/ne cases, which improves the sequence of `If (CmpF x x) (Bool ne)` from > > ucomiss xmm0, xmm0 > jp label > jne label > > into > > ucomiss xmm0, xmm0 > jp label > > 2, The move rules for `cmpOpUCF2` is missing, which makes patterns such as `x == y ? 1 : 0` to fall back to `cmpOpU`, which have a really high cost of fixing the flags, such as > > xorl ecx, ecx > ucomiss xmm0, xmm1 > jnp done > pushf > andq [rsp], 0xffffff2b > popf > done: > movl eax, 1 > cmovel eax, ecx > > The patch changes this sequence into > > xorl ecx, ecx > ucomiss xmm0, xmm1 > movl eax, 1 > cmovpl eax, ecx > cmovnel eax, ecx > > 3, The patch also changes the pattern of `isInfinite` to be more optimised by using `Math.abs` to reduce 1 comparison and compares the result with `MAX_VALUE` since `>` is more optimised than `==` for floating-point types. 
>
> The benchmark results are as follows:
>
> Before:
> Benchmark                      Mode  Cnt     Score     Error  Units
> FPComparison.equalDouble       avgt    5  2876.242 ±  58.875  ns/op
> FPComparison.equalFloat        avgt    5  3062.430 ±  31.371  ns/op
> FPComparison.isFiniteDouble    avgt    5   475.749 ±  19.027  ns/op
> FPComparison.isFiniteFloat     avgt    5   506.525 ±  14.417  ns/op
> FPComparison.isInfiniteDouble  avgt    5  1232.800 ±  31.677  ns/op
> FPComparison.isInfiniteFloat   avgt    5  1234.708 ±  70.239  ns/op
> FPComparison.isNanDouble       avgt    5  2255.847 ±   7.238  ns/op
> FPComparison.isNanFloat        avgt    5  2567.044 ±  36.078  ns/op
>
> After:
> Benchmark                      Mode  Cnt     Score     Error  Units
> FPComparison.equalDouble       avgt    5   594.636 ±   8.922  ns/op
> FPComparison.equalFloat        avgt    5   663.849 ±   3.656  ns/op
> FPComparison.isFiniteDouble    avgt    5   518.309 ± 107.352  ns/op
> FPComparison.isFiniteFloat     avgt    5   515.576 ±  14.669  ns/op
> FPComparison.isInfiniteDouble  avgt    5   621.185 ±  11.935  ns/op
> FPComparison.isInfiniteFloat   avgt    5   623.566 ±  15.206  ns/op
> FPComparison.isNanDouble       avgt    5   400.124 ±   0.762  ns/op
> FPComparison.isNanFloat        avgt    5   546.486 ±   1.509  ns/op
>
> Thank you very much.

The changes to `Float` and `Double` look good. I don't think we need additional tests, see test/jdk/java/lang/Math/IeeeRecommendedTests.java.

At first I thought we no longer needed PR #8459, but it seems both PRs are complementary, albeit PR #8459 has more modest performance gains for the intrinsics.
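For readers following the `Float`/`Double` discussion, here is a minimal sketch of the single-comparison classification described in the quoted PR text (`Math.abs` plus a `>` comparison against `MAX_VALUE`, and the self-comparison `isNaN` pattern). It is my reading of the description, not the actual patch; the class and method names are hypothetical, and the `main` method only cross-checks the results against `java.lang.Double`.

```java
// Illustrative sketch only: FpCompareSketch is a hypothetical name and
// this is not the code from the pull request.
class FpCompareSketch {

    // NaN is the only value that compares unequal to itself.
    static boolean isNaN(double v) {
        return v != v;
    }

    // One comparison instead of two equality checks: only +Infinity and
    // -Infinity have a magnitude above MAX_VALUE. NaN compares false
    // against everything, so it is not reported as infinite.
    static boolean isInfinite(double v) {
        return Math.abs(v) > Double.MAX_VALUE;
    }

    public static void main(String[] args) {
        double[] samples = {
            0.0, -0.0, 1.5, Double.MIN_VALUE,
            Double.MAX_VALUE, -Double.MAX_VALUE,
            Double.POSITIVE_INFINITY, Double.NEGATIVE_INFINITY, Double.NaN
        };
        for (double v : samples) {
            if (isNaN(v) != Double.isNaN(v) || isInfinite(v) != Double.isInfinite(v)) {
                throw new AssertionError("mismatch for " + v);
            }
        }
        System.out.println("all samples agree with java.lang.Double");
    }
}
```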
-------------

PR: https://git.openjdk.java.net/jdk/pull/8525

From kvn at openjdk.java.net Wed May 4 23:20:18 2022
From: kvn at openjdk.java.net (Vladimir Kozlov)
Date: Wed, 4 May 2022 23:20:18 GMT
Subject: RFR: 8285973: x86_64: Improve fp comparison and cmove for eq/ne
In-Reply-To:
References:
Message-ID: <6hRgLEWJfB8OHOYNJReUaMac469hvDuemoURK-aMy4Y=.963b7561-33e0-4069-8d89-8b447d4e0f0f@github.com>

On Wed, 4 May 2022 19:32:41 GMT, Vladimir Kozlov wrote:

>> Hi,
>>
>> This patch optimises the matching rules for floating-point comparison with respects to eq/ne on x86-64
>>
>> 1, When the inputs of a comparison is the same (i.e `isNaN` patterns), `ZF` is always set, so we don't need `cmpOpUCF2` for the eq/ne cases, which improves the sequence of `If (CmpF x x) (Bool ne)` from
>>
>>     ucomiss xmm0, xmm0
>>     jp label
>>     jne label
>>
>> into
>>
>>     ucomiss xmm0, xmm0
>>     jp label
>>
>> 2, The move rules for `cmpOpUCF2` is missing, which makes patterns such as `x == y ? 1 : 0` to fall back to `cmpOpU`, which have a really high cost of fixing the flags, such as
>>
>>     xorl ecx, ecx
>>     ucomiss xmm0, xmm1
>>     jnp done
>>     pushf
>>     andq [rsp], 0xffffff2b
>>     popf
>>     done:
>>     movl eax, 1
>>     cmovel eax, ecx
>>
>> The patch changes this sequence into
>>
>>     xorl ecx, ecx
>>     ucomiss xmm0, xmm1
>>     movl eax, 1
>>     cmovpl eax, ecx
>>     cmovnel eax, ecx
>>
>> 3, The patch also changes the pattern of `isInfinite` to be more optimised by using `Math.abs` to reduce 1 comparison and compares the result with `MAX_VALUE` since `>` is more optimised than `==` for floating-point types.
>>
>> The benchmark results are as follows:
>>
>> Before:
>> Benchmark                      Mode  Cnt     Score     Error  Units
>> FPComparison.equalDouble       avgt    5  2876.242 ±  58.875  ns/op
>> FPComparison.equalFloat        avgt    5  3062.430 ±  31.371  ns/op
>> FPComparison.isFiniteDouble    avgt    5   475.749 ±  19.027  ns/op
>> FPComparison.isFiniteFloat     avgt    5   506.525 ±  14.417  ns/op
>> FPComparison.isInfiniteDouble  avgt    5  1232.800 ±  31.677  ns/op
>> FPComparison.isInfiniteFloat   avgt    5  1234.708 ±  70.239  ns/op
>> FPComparison.isNanDouble       avgt    5  2255.847 ±   7.238  ns/op
>> FPComparison.isNanFloat        avgt    5  2567.044 ±  36.078  ns/op
>>
>> After:
>> Benchmark                      Mode  Cnt     Score     Error  Units
>> FPComparison.equalDouble       avgt    5   594.636 ±   8.922  ns/op
>> FPComparison.equalFloat        avgt    5   663.849 ±   3.656  ns/op
>> FPComparison.isFiniteDouble    avgt    5   518.309 ± 107.352  ns/op
>> FPComparison.isFiniteFloat     avgt    5   515.576 ±  14.669  ns/op
>> FPComparison.isInfiniteDouble  avgt    5   621.185 ±  11.935  ns/op
>> FPComparison.isInfiniteFloat   avgt    5   623.566 ±  15.206  ns/op
>> FPComparison.isNanDouble       avgt    5   400.124 ±   0.762  ns/op
>> FPComparison.isNanFloat        avgt    5   546.486 ±   1.509  ns/op
>>
>> Thank you very much.
>
> src/hotspot/cpu/x86/x86_64.ad line 6998:
>
>> 6996: ins_encode %{
>> 6997: __ cmovl(Assembler::parity, $dst$$Register, $src$$Register);
>> 6998: __ cmovl(Assembler::notEqual, $dst$$Register, $src$$Register);
>
> Should this be `equal`?

I see that you swapped `src, dst` in `match()` but `format` is still incorrect and the code is confusing.

-------------

PR: https://git.openjdk.java.net/jdk/pull/8525

From kvn at openjdk.java.net Wed May 4 23:31:19 2022
From: kvn at openjdk.java.net (Vladimir Kozlov)
Date: Wed, 4 May 2022 23:31:19 GMT
Subject: RFR: 8285973: x86_64: Improve fp comparison and cmove for eq/ne
In-Reply-To:
References:
Message-ID:

On Wed, 4 May 2022 22:27:48 GMT, Paul Sandoz wrote:

> The changes to `Float` and `Double` look good. I don't think we need additional tests, see test/jdk/java/lang/Math/IeeeRecommendedTests.java.

Thank you, Paul, for pointing out the test. It means we need to run tier4 (which runs these tests with -Xcomp) to make sure the methods are compiled by C2.
------------- PR: https://git.openjdk.java.net/jdk/pull/8525 From xgong at openjdk.java.net Thu May 5 01:15:30 2022 From: xgong at openjdk.java.net (Xiaohong Gong) Date: Thu, 5 May 2022 01:15:30 GMT Subject: RFR: 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature [v2] In-Reply-To: References: Message-ID: On Fri, 29 Apr 2022 21:34:13 GMT, Paul Sandoz wrote: >> Xiaohong Gong has updated the pull request incrementally with one additional commit since the last revision: >> >> Rename the "usePred" to "offsetInRange" > > IIUC when the hardware does not support predicated loads then any false `offsetIntRange` value causes the load intrinsic to fail resulting in the fallback, so it would not be materially any different to the current behavior, just more uniformly implemented. > > Why can't the intrinsic support the passing a boolean directly? Is it something to do with constants? If that is not possible I recommend creating named constant values and pass those all the way through rather than converting a boolean to an integer value. Then there is no need for a branch checking `offsetInRange`. > > Might be better to hold off until the JEP is integrated and then update, since this will conflict (`byte[]` and `ByteBuffer` load methods are removed and `MemorySegment` load methods are added). You could prepare for that now by branching off `vectorIntrinsics`. Thanks for your comments @PaulSandoz ! > IIUC when the hardware does not support predicated loads then any false `offsetIntRange` value causes the load intrinsic to fail resulting in the fallback, so it would not be materially any different to the current behavior, just more uniformly implemented. Yes, it's true that this patch doesn't have any different to the hardware that does not support the predicated loads. It only benefits the predicated feature supported systems like ARM SVE and X86 AVX-512. > Why can't the intrinsic support the passing a boolean directly? 
Is it something to do with constants? If that is not possible I recommend creating named constant values and pass those all the way through rather than converting a boolean to an integer value. Then there is no need for a branch checking offsetInRange.

Yeah, I agree that adding a branch that checks `offsetInRange` is not good. But I actually ran into a constant issue: passing the values all the way through cannot guarantee that the argument is a constant in the compiler at compile time. Do you have any better idea for fixing this?

> Might be better to hold off until the JEP is integrated and then update, since this will conflict (byte[] and ByteBuffer load methods are removed and MemorySegment load methods are added). You could prepare for that now by branching off vectorIntrinsics.

Agree. Thanks!

-------------

PR: https://git.openjdk.java.net/jdk/pull/8035

From fjiang at openjdk.java.net Thu May 5 01:17:26 2022
From: fjiang at openjdk.java.net (Feilong Jiang)
Date: Thu, 5 May 2022 01:17:26 GMT
Subject: RFR: 8285378: Remove unnecessary nop for C1 exception and deopt handler
In-Reply-To: <8SyLV_zXQ5gz0T7LsxjDmRf8BHTbScsFxSZkc8krxpY=.ebc5d1a6-6f3c-4e1d-b1c2-f5d279fbefff@github.com>
References: <8SyLV_zXQ5gz0T7LsxjDmRf8BHTbScsFxSZkc8krxpY=.ebc5d1a6-6f3c-4e1d-b1c2-f5d279fbefff@github.com>
Message-ID:

On Thu, 21 Apr 2022 14:09:08 GMT, Boris Ulasevich wrote:

> Each C1 method has two nops in the code body. They originally separated the exception/deopt handler block from the code body to fix a "bug 5/14/1999". Now Exception Handler and Deopt Handler are generated in a separate CodeSegment and these nops in the code body don't really help anyone.
>
> I checked jtreg tests on the following platforms:
> - x86
> - ppc
> - arm32
> - aarch64
>
> I would be grateful if someone could check my changes on the riscv and s390 platforms.
>
>
> [Verified Entry Point]
>   0x0000ffff7c749d40: nop
>   0x0000ffff7c749d44: sub x9, sp, #0x20, lsl #12
>   0x0000ffff7c749d48: str xzr, [x9]
>   0x0000ffff7c749d4c: sub sp, sp, #0x40
>   0x0000ffff7c749d50: stp x29, x30, [sp, #48]
>   0x0000ffff7c749d54: and w0, w2, #0x1
>   0x0000ffff7c749d58: strb w0, [x1, #12]
>   0x0000ffff7c749d5c: dmb ishst
>   0x0000ffff7c749d60: ldp x29, x30, [sp, #48]
>   0x0000ffff7c749d64: add sp, sp, #0x40
>   0x0000ffff7c749d68: ldr x8, [x28, #808] ; {poll_return}
>   0x0000ffff7c749d6c: cmp sp, x8
>   0x0000ffff7c749d70: b.hi 0x0000ffff7c749d78 // b.pmore
>   0x0000ffff7c749d74: ret
> # emit_slow_case_stubs
>   0x0000ffff7c749d78: adr x8, 0x0000ffff7c749d68 ; {internal_word}
>   0x0000ffff7c749d7c: str x8, [x28, #832]
>   0x0000ffff7c749d80: b 0x0000ffff7c697480 ; {runtime_call SafepointBlob}
> # Excessive nops: Exception Handler and Deopt Handler prolog
>   0x0000ffff7c749d84: nop <----------------------------------------------------------------
>   0x0000ffff7c749d88: nop <----------------------------------------------------------------
> # Unwind handler: the handler to remove the activation from the stack and dispatch to the caller.
>   0x0000ffff7c749d8c: ldr x0, [x28, #968]
>   0x0000ffff7c749d90: str xzr, [x28, #968]
>   0x0000ffff7c749d94: str xzr, [x28, #976]
>   0x0000ffff7c749d98: ldp x29, x30, [sp, #48]
>   0x0000ffff7c749d9c: add sp, sp, #0x40
>   0x0000ffff7c749da0: b 0x0000ffff7c73e000 ; {runtime_call unwind_exception Runtime1 stub}
> # Stubs alignment
>   0x0000ffff7c749da4: .inst 0x00000000 ; undefined
>   0x0000ffff7c749da8: .inst 0x00000000 ; undefined
>   0x0000ffff7c749dac: .inst 0x00000000 ; undefined
>   0x0000ffff7c749db0: .inst 0x00000000 ; undefined
>   0x0000ffff7c749db4: .inst 0x00000000 ; undefined
>   0x0000ffff7c749db8: .inst 0x00000000 ; undefined
>   0x0000ffff7c749dbc: .inst 0x00000000 ; undefined
> [Exception Handler]
>   0x0000ffff7c749dc0: bl 0x0000ffff7c740d00 ; {no_reloc}
>   0x0000ffff7c749dc4: dcps1 #0xdeae
>   0x0000ffff7c749dc8: .inst 0x853828d8 ; undefined
>   0x0000ffff7c749dcc: .inst 0x0000ffff ; undefined
> [Deopt Handler Code]
>   0x0000ffff7c749dd0: adr x30, 0x0000ffff7c749dd0
>   0x0000ffff7c749dd4: b 0x0000ffff7c6977c0 ; {runtime_call DeoptimizationBlob}

I will test these changes on riscv, the results will be available in about one day.

-------------

PR: https://git.openjdk.java.net/jdk/pull/8341

From psandoz at openjdk.java.net Thu May 5 01:25:29 2022
From: psandoz at openjdk.java.net (Paul Sandoz)
Date: Thu, 5 May 2022 01:25:29 GMT
Subject: RFR: 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature [v2]
In-Reply-To:
References:
Message-ID:

On Thu, 5 May 2022 01:13:23 GMT, Xiaohong Gong wrote:

> Yeah, I agree that it's not good by adding a branch checking for `offsetInRange`. But actually I met the constant issue that passing the values all the way cannot guarantee the argument a constant in compiler at the compile time. Do you have any better idea to fixing this?

That's odd, `boolean` constants are passed that are then converted to `int` constants. Did you try passing integer constants all the way through?
------------- PR: https://git.openjdk.java.net/jdk/pull/8035 From eliu at openjdk.java.net Thu May 5 01:31:52 2022 From: eliu at openjdk.java.net (Eric Liu) Date: Thu, 5 May 2022 01:31:52 GMT Subject: RFR: 8282966: AArch64: Optimize VectorMask.toLong with SVE2 [v2] In-Reply-To: References: Message-ID: > This patch optimizes the backend implementation of VectorMaskToLong for > AArch64, given a more efficient approach to mov value bits from > predicate register to general purpose register as x86 PMOVMSK[1] does, > by using BEXT[2] which is available in SVE2. > > With this patch, the final code (input mask is byte type with > SPECIESE_512, generated on an SVE vector reg size of 512-bit QEMU > emulator) changes as below: > > Before: > > mov z16.b, p0/z, #1 > fmov x0, d16 > orr x0, x0, x0, lsr #7 > orr x0, x0, x0, lsr #14 > orr x0, x0, x0, lsr #28 > and x0, x0, #0xff > fmov x8, v16.d[1] > orr x8, x8, x8, lsr #7 > orr x8, x8, x8, lsr #14 > orr x8, x8, x8, lsr #28 > and x8, x8, #0xff > orr x0, x0, x8, lsl #8 > > orr x8, xzr, #0x2 > whilele p1.d, xzr, x8 > lastb x8, p1, z16.d > orr x8, x8, x8, lsr #7 > orr x8, x8, x8, lsr #14 > orr x8, x8, x8, lsr #28 > and x8, x8, #0xff > orr x0, x0, x8, lsl #16 > > orr x8, xzr, #0x3 > whilele p1.d, xzr, x8 > lastb x8, p1, z16.d > orr x8, x8, x8, lsr #7 > orr x8, x8, x8, lsr #14 > orr x8, x8, x8, lsr #28 > and x8, x8, #0xff > orr x0, x0, x8, lsl #24 > > orr x8, xzr, #0x4 > whilele p1.d, xzr, x8 > lastb x8, p1, z16.d > orr x8, x8, x8, lsr #7 > orr x8, x8, x8, lsr #14 > orr x8, x8, x8, lsr #28 > and x8, x8, #0xff > orr x0, x0, x8, lsl #32 > > mov x8, #0x5 > whilele p1.d, xzr, x8 > lastb x8, p1, z16.d > orr x8, x8, x8, lsr #7 > orr x8, x8, x8, lsr #14 > orr x8, x8, x8, lsr #28 > and x8, x8, #0xff > orr x0, x0, x8, lsl #40 > > orr x8, xzr, #0x6 > whilele p1.d, xzr, x8 > lastb x8, p1, z16.d > orr x8, x8, x8, lsr #7 > orr x8, x8, x8, lsr #14 > orr x8, x8, x8, lsr #28 > and x8, x8, #0xff > orr x0, x0, x8, lsl #48 > > orr x8, xzr, #0x7 > whilele 
p1.d, xzr, x8 > lastb x8, p1, z16.d > orr x8, x8, x8, lsr #7 > orr x8, x8, x8, lsr #14 > orr x8, x8, x8, lsr #28 > and x8, x8, #0xff > orr x0, x0, x8, lsl #56 > > After: > > mov z16.b, p0/z, #1 > mov z17.b, #1 > bext z16.d, z16.d, z17.d > mov z17.d, #0 > uzp1 z16.s, z16.s, z17.s > uzp1 z16.h, z16.h, z17.h > uzp1 z16.b, z16.b, z17.b > mov x0, v16.d[0] > > [1] https://www.felixcloutier.com/x86/pmovmskb > [2] https://developer.arm.com/documentation/ddi0602/2020-12/SVE-Instructions/BEXT--Gather-lower-bits-from-positions-selected-by-bitmask- Eric Liu has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains two commits: - Merge jdk:master Change-Id: Ifa60f3b79513c22dbf932f1da623289687bc1070 - 8282966: AArch64: Optimize VectorMask.toLong with SVE2 This patch optimizes the backend implementation of VectorMaskToLong for AArch64, given a more efficient approach to mov value bits from predicate register to general purpose register as x86 PMOVMSK[1] does, by using BEXT[2] which is available in SVE2. 
Change-Id: Ia983a20c89f76403e557ac21328f2f2e05dd08e0 ------------- Changes: https://git.openjdk.java.net/jdk/pull/8337/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=8337&range=01
Stats: 118 lines in 7 files changed: 66 ins; 2 del; 50 mod Patch: https://git.openjdk.java.net/jdk/pull/8337.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8337/head:pull/8337 PR: https://git.openjdk.java.net/jdk/pull/8337 From xgong at openjdk.java.net Thu May 5 01:46:19 2022 From: xgong at openjdk.java.net (Xiaohong Gong) Date: Thu, 5 May 2022 01:46:19 GMT Subject: RFR: 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature [v2] In-Reply-To: References: Message-ID: On Thu, 5 May 2022 01:21:40 GMT, Paul Sandoz wrote: > > Yeah, I agree that it's not good by adding a branch checking for `offsetInRange`. But actually I met the constant issue that passing the values all the way cannot guarantee the argument a constant in compiler at the compile time. Do you have any better idea to fixing this? > > That's odd, `boolean` constants are passed that are then converted to `int` constants. Did you try passing integer constants all the way through? I will try again. I remember the main cause is the calling of `fromArray0` from `fromArray`, it is not annotated with `ForceInline`. The arguments might not be compiled to a constant for cases that the offset is not in the array range like tail loop. ------------- PR: https://git.openjdk.java.net/jdk/pull/8035 From xgong at openjdk.java.net Thu May 5 01:55:19 2022 From: xgong at openjdk.java.net (Xiaohong Gong) Date: Thu, 5 May 2022 01:55:19 GMT Subject: RFR: 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature [v2] In-Reply-To: References: Message-ID: On Thu, 5 May 2022 01:42:48 GMT, Xiaohong Gong wrote: > > > Yeah, I agree that it's not good by adding a branch checking for `offsetInRange`. But actually I met the constant issue that passing the values all the way cannot guarantee the argument a constant in compiler at the compile time. Do you have any better idea to fixing this? 
> >
> >
> > That's odd, `boolean` constants are passed that are then converted to `int` constants. Did you try passing integer constants all the way through?
>
> I will try again. I remember the main cause is the calling of `fromArray0` from `fromArray`, it is not annotated with `ForceInline`. The arguments might not be compiled to a constant for cases that the offset is not in the array range like tail loop.

I tried to pass the integer constant all the way, and unfortunately the `offsetInRange` is not compiled to a constant. The following assertion in `vectorIntrinsics.cpp` will fail:

--- a/src/hotspot/share/opto/vectorIntrinsics.cpp
+++ b/src/hotspot/share/opto/vectorIntrinsics.cpp
@@ -1236,6 +1236,7 @@ bool LibraryCallKit::inline_vector_mem_masked_operation(bool is_store) {
     } else {
       // Masked vector load with IOOBE always uses the predicated load.
       const TypeInt* offset_in_range = gvn().type(argument(8))->isa_int();
+      assert(offset_in_range->is_con(), "not a constant");
       if (!offset_in_range->is_con()) {
         if (C->print_intrinsics()) {
           tty->print_cr(" ** missing constant: offsetInRange=%s",

-------------

PR: https://git.openjdk.java.net/jdk/pull/8035

From xgong at openjdk.java.net Thu May 5 02:17:20 2022
From: xgong at openjdk.java.net (Xiaohong Gong)
Date: Thu, 5 May 2022 02:17:20 GMT
Subject: RFR: 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature [v2]
In-Reply-To:
References:
Message-ID:

On Thu, 28 Apr 2022 00:13:49 GMT, Sandhya Viswanathan wrote:

>> Xiaohong Gong has updated the pull request incrementally with one additional commit since the last revision:
>>
>> Rename the "usePred" to "offsetInRange"
>
> src/hotspot/share/opto/vectorIntrinsics.cpp line 1232:
>
>> 1230: // out when current case uses the predicate feature.
>> 1231: if (!supports_predicate) {
>> 1232: bool use_predicate = false;
>
> If we rename this to needs_predicate it will be easier to understand.

Thanks for the comment!
This local variable will be removed after adding the similar intrinsify for store masked. Please help to see the PR https://github.com/openjdk/jdk/pull/8544. Thanks so much! ------------- PR: https://git.openjdk.java.net/jdk/pull/8035 From xgong at openjdk.java.net Thu May 5 02:17:20 2022 From: xgong at openjdk.java.net (Xiaohong Gong) Date: Thu, 5 May 2022 02:17:20 GMT Subject: RFR: 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature [v2] In-Reply-To: <8Yu4J-PCYFJtBXrfgWoCbaR-7QZTXH4IzmXOf_lk164=.66071c45-1f1a-4931-a414-778f353c7e83@github.com> References: <35S4J_r9jBw_-SAow2oMYaSsTvubhSmZFVPb_VM6KEg=.7feff8fa-6e20-453e-aed6-e53c7d9beaad@github.com> <8Yu4J-PCYFJtBXrfgWoCbaR-7QZTXH4IzmXOf_lk164=.66071c45-1f1a-4931-a414-778f353c7e83@github.com> Message-ID: On Thu, 31 Mar 2022 02:15:26 GMT, Quan Anh Mai wrote: >> I'm afraid not. "Load + Blend" makes the elements of unmasked lanes to be `0`. Then a full store may change the values in the unmasked memory to be 0, which is different with the mask store API definition. > > The blend should be with the intended-to-store vector, so that masked lanes contain the need-to-store elements and unmasked lanes contain the loaded elements, which would be stored back, which results in unchanged values. Hi @merykitty @jatin-bhateja , could you please help to take a review at the similar store masked PR https://github.com/openjdk/jdk/pull/8544 ? Any feedback is welcome! Thanks so much! ------------- PR: https://git.openjdk.java.net/jdk/pull/8035 From yyang at openjdk.java.net Thu May 5 02:21:20 2022 From: yyang at openjdk.java.net (Yi Yang) Date: Thu, 5 May 2022 02:21:20 GMT Subject: RFR: 8282638: [JVMCI] Export array fill stubs to JVMCI compiler In-Reply-To: References: Message-ID: <8rc0zutOmrphZucsrQPOjJAvlwskDKvVx1ROVvgqz1Y=.6056678a-344f-444a-b0cc-3f26c6562311@github.com> On Fri, 29 Apr 2022 08:36:21 GMT, Doug Simon wrote: > I'm curious - what is the background for this change? 
To implement array filling for graal in PR-4343. ------------- PR: https://git.openjdk.java.net/jdk/pull/7685 From jrose at openjdk.java.net Thu May 5 02:45:17 2022 From: jrose at openjdk.java.net (John R Rose) Date: Thu, 5 May 2022 02:45:17 GMT Subject: RFR: 8284050: [vectorapi] Optimize masked store for non-predicated architectures [v2] In-Reply-To: References: Message-ID: On Thu, 5 May 2022 02:09:39 GMT, Xiaohong Gong wrote: >> Currently the vectorization of masked vector store is implemented by the masked store instruction only on architectures that support the predicate feature. The compiler will fall back to the java scalar code for non-predicate supported architectures like ARM NEON. However, for these systems, the masked store can be vectorized with the non-masked vector `"load + blend + store"`. For example, storing a vector` "v"` controlled by a mask` "m"` into a memory with address` "addr" (i.e. "store(addr, v, m)")` can be implemented with: >> >> >> 1) mem_v = load(addr) ; non-masked load from the same memory >> 2) v = blend(mem_v, v, m) ; blend with the src vector with the mask >> 3) store(addr, v) ; non-masked store into the memory >> >> >> Since the first full loading needs the array offset must be inside of the valid array bounds, we make the compiler do the vectorization only when the offset is in range of the array boundary. And the compiler will still fall back to the java scalar code if not all offsets are valid. Besides, the original offset check for masked lanes are only applied when the offset is not always inside of the array range. This also improves the performance for masked store when the offset is always valid. The whole process is similar to the masked load API. 
>> >> Here is the performance data for the masked vector store benchmarks on a X86 non avx-512 system, which improves about `20x ~ 50x`: >> >> Benchmark before after Units >> StoreMaskedBenchmark.byteStoreArrayMask 221.733 11094.126 ops/ms >> StoreMaskedBenchmark.doubleStoreArrayMask 41.086 1034.408 ops/ms >> StoreMaskedBenchmark.floatStoreArrayMask 73.820 1985.015 ops/ms >> StoreMaskedBenchmark.intStoreArrayMask 75.028 2027.557 ops/ms >> StoreMaskedBenchmark.longStoreArrayMask 40.929 1032.928 ops/ms >> StoreMaskedBenchmark.shortStoreArrayMask 135.794 5307.567 ops/ms >> >> Similar performance gain can also be observed on ARM NEON system. >> >> And here is the performance data on X86 avx-512 system, which improves about `1.88x - 2.81x`: >> >> Benchmark before after Units >> StoreMaskedBenchmark.byteStoreArrayMask 11185.956 21012.824 ops/ms >> StoreMaskedBenchmark.doubleStoreArrayMask 1480.644 3911.720 ops/ms >> StoreMaskedBenchmark.floatStoreArrayMask 2738.352 7708.365 ops/ms >> StoreMaskedBenchmark.intStoreArrayMask 4191.904 9300.428 ops/ms >> StoreMaskedBenchmark.longStoreArrayMask 2025.031 4604.504 ops/ms >> StoreMaskedBenchmark.shortStoreArrayMask 8339.389 17817.128 ops/ms >> >> Similar performance gain can also be observed on ARM SVE system. > > Xiaohong Gong has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains one commit: > > 8284050: [vectorapi] Optimize masked store for non-predicated architectures The JIT (in all other circumstances AFAIK) never produces "phantom stores", stores into Java variables which are not specified as the target of a JVM store instruction (putfield, dastore, etc.). The fact that a previously-read value is used by the phantom store does not make it any better. Yes, the memory states may be correct after the blend and store is done, but the effect on the Java Memory Model is to issue the extra phantom stores of the unselected array elements. 
Under certain circumstances, this will create race conditions after the optimization where there were no race conditions before the optimization. Other threads could (under Java Memory Model rules) witness the effects of the phantom stores. If the Java program is properly synchronized, the introduction of an illegitimate race condition can cause another thread, now in an illegal race, to see an old value in a variable (the recopied unselected array element) which the JMM says is impossible. Yes, this only shows up in multi-threaded programs, and ones where two threads step on one array, but Java is a multi-threaded language, and it must conform to its own specification as such. This blend technique would be very reasonable if there is no race condition. (Except at the very beginning or end of arrays.) And the JMM leaves room for many optimizations. And yet I think this is a step too far. I'd like to be wrong about this, but I don't think I am. So, I think you need to use a different technique, other than blend-and-unpredicated-store, for masked stores on non-predicated architectures. For example, you could simulate a masked store instruction on an architecture that supports scatter (scattering values of the array type). Do this by setting up two vectors of machine pointers. One vector points to each potentially-affected element of the array (some kind of index computation plus a scaled iota vector). The other vector is set up similarly, but points into a fixed-sized, thread-local buffer, what I would call the "bit bucket". Blend the addresses, and then scatter, so that selected array lanes are updated, and unselected values are sent to the "bit bucket". This is complex enough (and platform-dependent enough) that you probably need to write a hand-coded assembly language subroutine, to call from the JIT code. Sort of like the arraycopy stubs. It's even more work than the proposed patch here, but it's the right thing, I'm afraid. 
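
The scatter idea described above can be modelled in plain scalar Java. This is only a rough sketch under stated assumptions, not the stub John has in mind: `BitBucketScatterSketch`, `maskedStore` and the `bitBucket` array are hypothetical names, the "vector" is an ordinary `int[]`, and the per-lane destination choice stands in for blending two vectors of machine pointers before a hardware scatter. The point it illustrates is that every lane performs exactly one write, and unselected lanes never write into the Java array.

```java
import java.util.Arrays;

class BitBucketScatterSketch {
    // Hypothetical scalar model of a scatter-based masked store.
    // Selected lanes write into the array; unselected lanes are redirected
    // to a private scratch buffer (the "bit bucket"), so no unselected
    // array element is ever stored to.
    static void maskedStore(int[] array, int offset, int[] lanes,
                            boolean[] mask, int[] bitBucket) {
        for (int i = 0; i < lanes.length; i++) {
            // "Blend the addresses": pick the destination for this lane.
            int[] dest = mask[i] ? array : bitBucket;
            int index = mask[i] ? offset + i : i;
            // "Scatter": one store per lane, to the blended address.
            dest[index] = lanes[i];
        }
    }

    public static void main(String[] args) {
        int[] array = {10, 11, 12, 13, 14, 15};
        int[] lanes = {1, 2, 3, 4};
        boolean[] mask = {true, false, true, false};
        int[] bitBucket = new int[lanes.length]; // assumed thread-local in a real stub
        maskedStore(array, 1, lanes, mask, bitBucket);
        System.out.println(Arrays.toString(array)); // prints [10, 1, 12, 3, 14, 15]
    }
}
```

Because the unselected values land in the scratch buffer rather than being copied back over their old array slots, another thread writing those slots concurrently cannot observe or lose anything as a result of this store.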
src/hotspot/share/opto/vectorIntrinsics.cpp line 1363: > 1361: // Use the vector blend to implement the masked store. The biased elements are the original > 1362: // values in the memory. > 1363: Node* mem_val = gvn().transform(LoadVectorNode::make(0, control(), memory(addr), addr, addr_type, mem_num_elem, mem_elem_bt)); I'm sorry to say it, but I am pretty sure this is an invalid optimization. See top-level comment for more details. ------------- Changes requested by jrose (Reviewer). PR: https://git.openjdk.java.net/jdk/pull/8544 From xgong at openjdk.java.net Thu May 5 03:21:25 2022 From: xgong at openjdk.java.net (Xiaohong Gong) Date: Thu, 5 May 2022 03:21:25 GMT Subject: RFR: 8284050: [vectorapi] Optimize masked store for non-predicated architectures [v2] In-Reply-To: References: Message-ID: <2nHlMA_m35Al9nbcBILE63XcB63XuSWI_RbcUC96-As=.4fe30c48-52ab-4995-b0a1-f327009aa8a8@github.com> On Thu, 5 May 2022 02:27:03 GMT, John R Rose wrote: >> Xiaohong Gong has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains one commit: >> >> 8284050: [vectorapi] Optimize masked store for non-predicated architectures > > src/hotspot/share/opto/vectorIntrinsics.cpp line 1363: > >> 1361: // Use the vector blend to implement the masked store. The biased elements are the original >> 1362: // values in the memory. >> 1363: Node* mem_val = gvn().transform(LoadVectorNode::make(0, control(), memory(addr), addr, addr_type, mem_num_elem, mem_elem_bt)); > > I'm sorry to say it, but I am pretty sure this is an invalid optimization. > See top-level comment for more details. Thanks for your comments! Yeah, this actually influences something due to the Java Memory Model rules which I missed to consider more. I will try the scatter ways instead. Thanks so much! 
------------- PR: https://git.openjdk.java.net/jdk/pull/8544 From jbhateja at openjdk.java.net Thu May 5 03:23:23 2022 From: jbhateja at openjdk.java.net (Jatin Bhateja) Date: Thu, 5 May 2022 03:23:23 GMT Subject: Integrated: 8284813: x86 Code cleanup related to move instructions. In-Reply-To: References: Message-ID: On Wed, 13 Apr 2022 19:11:50 GMT, Jatin Bhateja wrote: > Summary of changes: > > - Correct feature checks in some assembler move instruction. > - Explicitly pass opmask register in routines accepting merge argument. > - Code re-organization related to move instruction, pull out the merge argument up to instruction pattern or top level caller. > - Add missing encoding based move elision checks in some macro assembly routines. > > Kindly review and share your feedback. > > Regards, > Jatin This pull request has now been integrated. Changeset: 3092b561 Author: Jatin Bhateja URL: https://git.openjdk.java.net/jdk/commit/3092b5615d4d24c7b38a8e7e5759dfa2ef8616ca Stats: 189 lines in 8 files changed: 37 ins; 66 del; 86 mod 8284813: x86 Code cleanup related to move instructions. Reviewed-by: kvn, sviswanathan ------------- PR: https://git.openjdk.java.net/jdk/pull/8230 From xgong at openjdk.java.net Thu May 5 03:29:16 2022 From: xgong at openjdk.java.net (Xiaohong Gong) Date: Thu, 5 May 2022 03:29:16 GMT Subject: RFR: 8284050: [vectorapi] Optimize masked store for non-predicated architectures [v2] In-Reply-To: References: Message-ID: <4k-ZTE8Uax5vrH9GMAc2MJClnaIc4zB9vteafm24S44=.e1f50cc8-3020-4091-831f-e5fa2b623f7a@github.com> On Thu, 5 May 2022 02:09:39 GMT, Xiaohong Gong wrote: >> Currently the vectorization of masked vector store is implemented by the masked store instruction only on architectures that support the predicate feature. The compiler will fall back to the java scalar code for non-predicate supported architectures like ARM NEON. However, for these systems, the masked store can be vectorized with the non-masked vector `"load + blend + store"`. 
For example, storing a vector` "v"` controlled by a mask` "m"` into a memory with address` "addr" (i.e. "store(addr, v, m)")` can be implemented with: >> >> >> 1) mem_v = load(addr) ; non-masked load from the same memory >> 2) v = blend(mem_v, v, m) ; blend with the src vector with the mask >> 3) store(addr, v) ; non-masked store into the memory >> >> >> Since the first full loading needs the array offset must be inside of the valid array bounds, we make the compiler do the vectorization only when the offset is in range of the array boundary. And the compiler will still fall back to the java scalar code if not all offsets are valid. Besides, the original offset check for masked lanes are only applied when the offset is not always inside of the array range. This also improves the performance for masked store when the offset is always valid. The whole process is similar to the masked load API. >> >> Here is the performance data for the masked vector store benchmarks on a X86 non avx-512 system, which improves about `20x ~ 50x`: >> >> Benchmark before after Units >> StoreMaskedBenchmark.byteStoreArrayMask 221.733 11094.126 ops/ms >> StoreMaskedBenchmark.doubleStoreArrayMask 41.086 1034.408 ops/ms >> StoreMaskedBenchmark.floatStoreArrayMask 73.820 1985.015 ops/ms >> StoreMaskedBenchmark.intStoreArrayMask 75.028 2027.557 ops/ms >> StoreMaskedBenchmark.longStoreArrayMask 40.929 1032.928 ops/ms >> StoreMaskedBenchmark.shortStoreArrayMask 135.794 5307.567 ops/ms >> >> Similar performance gain can also be observed on ARM NEON system. 
>>
>> And here is the performance data on X86 avx-512 system, which improves about `1.88x - 2.81x`:
>>
>> Benchmark                                   before      after      Units
>> StoreMaskedBenchmark.byteStoreArrayMask     11185.956   21012.824  ops/ms
>> StoreMaskedBenchmark.doubleStoreArrayMask    1480.644    3911.720  ops/ms
>> StoreMaskedBenchmark.floatStoreArrayMask     2738.352    7708.365  ops/ms
>> StoreMaskedBenchmark.intStoreArrayMask       4191.904    9300.428  ops/ms
>> StoreMaskedBenchmark.longStoreArrayMask      2025.031    4604.504  ops/ms
>> StoreMaskedBenchmark.shortStoreArrayMask     8339.389   17817.128  ops/ms
>>
>> Similar performance gain can also be observed on ARM SVE system.
>
> Xiaohong Gong has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains one commit:
>
> 8284050: [vectorapi] Optimize masked store for non-predicated architectures

> _Mailing list message from [Hans Boehm](mailto:hboehm at google.com) on [hotspot-dev](mailto:hotspot-dev at mail.openjdk.java.net):_
>
> Naive question: What happens if one of the vector elements that should not have been updated is concurrently being written by another thread? Aren't you generating writes to vector elements that should not have been written?
>
> Hans
>
> On Wed, May 4, 2022 at 7:08 PM Xiaohong Gong wrote:

Yeah, this is the similar concern with what @rose00 mentioned above. The current solution cannot work well for multi-threaded programs. I will consider other better solutions. Thanks for the comments!

src/jdk.incubator.vector/share/classes/jdk/incubator/vector/ByteVector.java line 3483:

> 3481: ByteSpecies vsp = vspecies();
> 3482: if (offset >= 0 && offset <= (a.length - vsp.length())) {
> 3483: intoBooleanArray0(a, offset, m, /* offsetInRange */ true);

The offset check could save the `checkMaskFromIndexSize` for cases that offset are in the valid array bounds, which also improves the performance. @rose00 , do you think this part of change is ok at least? Thanks!
-------------

PR: https://git.openjdk.java.net/jdk/pull/8544

From jbhateja at openjdk.java.net Thu May 5 04:38:24 2022
From: jbhateja at openjdk.java.net (Jatin Bhateja)
Date: Thu, 5 May 2022 04:38:24 GMT
Subject: RFR: 8284050: [vectorapi] Optimize masked store for non-predicated architectures [v2]
In-Reply-To: <2nHlMA_m35Al9nbcBILE63XcB63XuSWI_RbcUC96-As=.4fe30c48-52ab-4995-b0a1-f327009aa8a8@github.com>
References: <2nHlMA_m35Al9nbcBILE63XcB63XuSWI_RbcUC96-As=.4fe30c48-52ab-4995-b0a1-f327009aa8a8@github.com>
Message-ID: <039lZ4RUKsmDUJAZEitlkbrvCE7p9w37KIc-F7Qr7jA=.f3e24088-bba3-448a-8720-649928de23f2@github.com>

On Thu, 5 May 2022 03:17:35 GMT, Xiaohong Gong wrote:

>> src/hotspot/share/opto/vectorIntrinsics.cpp line 1363:
>>
>>> 1361: // Use the vector blend to implement the masked store. The biased elements are the original
>>> 1362: // values in the memory.
>>> 1363: Node* mem_val = gvn().transform(LoadVectorNode::make(0, control(), memory(addr), addr, addr_type, mem_num_elem, mem_elem_bt));
>>
>> I'm sorry to say it, but I am pretty sure this is an invalid optimization.
>> See top-level comment for more details.
>
> Thanks for your comments! Yeah, this actually influences something due to the Java Memory Model rules which I missed to consider more. I will try the scatter ways instead. Thanks so much!

Yes, a phantom store can write back a stale, unintended value and may create problems in multithreaded applications, since the blending is done with an older loaded value.
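
The stale-value problem can be made concrete with a small, deterministic Java sketch. It is an illustration only, under the following assumptions: `PhantomStoreSketch` and its helpers are hypothetical names, plain `int[]` arrays stand in for vectors, and the two "threads" are interleaved by hand in a single thread so the outcome is reproducible. The write labelled "thread B" targets an element that the mask of "thread A" does not select.

```java
import java.util.Arrays;

class PhantomStoreSketch {
    // Step 1 of load + blend + store: unmasked load of the whole window.
    static int[] load(int[] array, int offset, int length) {
        return Arrays.copyOfRange(array, offset, offset + length);
    }

    // Steps 2 and 3: blend the new lanes with the previously loaded values,
    // then store the whole window back without a mask.
    static void blendAndStore(int[] array, int offset, int[] loaded,
                              int[] lanes, boolean[] mask) {
        for (int i = 0; i < lanes.length; i++) {
            array[offset + i] = mask[i] ? lanes[i] : loaded[i];
        }
    }

    public static void main(String[] args) {
        int[] array = {0, 0, 0, 0};
        int[] lanes = {7, 7, 7, 7};
        boolean[] mask = {true, false, true, false};

        int[] loaded = load(array, 0, 4);             // "thread A": step 1
        array[1] = 42;                                // "thread B": writes an unselected element
        blendAndStore(array, 0, loaded, lanes, mask); // "thread A": steps 2 and 3

        // The write of 42 is overwritten by the stale 0 that A loaded earlier.
        System.out.println(Arrays.toString(array));   // prints [7, 0, 7, 0]
    }
}
```

With a true masked store, element 1 would still hold 42 after this interleaving, because thread A never writes it.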
------------- PR: https://git.openjdk.java.net/jdk/pull/8544 From xliu at openjdk.java.net Thu May 5 05:39:58 2022 From: xliu at openjdk.java.net (Xin Liu) Date: Thu, 5 May 2022 05:39:58 GMT Subject: RFR: 8286104: use aggressive liveness for unstable_if traps Message-ID: 8286104: use aggressive liveness for unstable_if traps ------------- Commit messages: - 8286104: use aggressive liveness for unstable_if traps Changes: https://git.openjdk.java.net/jdk/pull/8545/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=8545&range=00 Issue: https://bugs.openjdk.java.net/browse/JDK-8286104 Stats: 119 lines in 6 files changed: 103 ins; 3 del; 13 mod Patch: https://git.openjdk.java.net/jdk/pull/8545.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8545/head:pull/8545 PR: https://git.openjdk.java.net/jdk/pull/8545 From jbhateja at openjdk.java.net Thu May 5 05:47:47 2022 From: jbhateja at openjdk.java.net (Jatin Bhateja) Date: Thu, 5 May 2022 05:47:47 GMT Subject: RFR: 8284960: Integration of JEP 426: Vector API (Fourth Incubator) [v2] In-Reply-To: References: Message-ID: > Hi All, > > Patch adds the planned support for new vector operations and APIs targeted for [JEP 426: Vector API (Fourth Incubator).](https://bugs.openjdk.java.net/browse/JDK-8280173) > > Following is the brief summary of changes:- > > 1) Extends the scope of existing lanewise API for following new vector operations. > - VectorOperations.BIT_COUNT: counts the number of one-bits > - VectorOperations.LEADING_ZEROS_COUNT: counts the number of leading zero bits > - VectorOperations.TRAILING_ZEROS_COUNT: counts the number of trailing zero bits > - VectorOperations.REVERSE: reversing the order of bits > - VectorOperations.REVERSE_BYTES: reversing the order of bytes > - compress and expand bits: Semantics are based on Hacker's Delight section 7-4 Compress, or Generalized Extract. 
> > 2) Adds following new APIs to perform cross lane vector compress and expansion operations under the influence of a mask. > - Vector.compress > - Vector.expand > - VectorMask.compress > > 3) Adds predicated and non-predicated versions of following new APIs to load and store the contents of vector from foreign MemorySegments. > - Vector.fromMemorySegment > - Vector.intoMemorySegment > > 4) C2 Compiler IR enhancements and optimized X86 and AARCH64 backend support for each newly added operation. > > > Patch has been regressed over AARCH64 and X86 targets different AVX levels. > > Kindly review and share your feedback. > > Best Regards, > Jatin Jatin Bhateja has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 10 commits: - 8284960: Correcting a typo. - 8284960: Integrating changes from panama-vector (Add @since 19 tags). - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8284960 - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8284960 - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8284960 - 8284960: AARCH64 backend changes. 
 - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8284960
 - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8284960
 - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8284960
 - 8284960: Integration of JEP 426: Vector API (Fourth Incubator)

-------------

Changes: https://git.openjdk.java.net/jdk/pull/8425/files
Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=8425&range=01
Stats: 37900 lines in 214 files changed: 16527 ins; 16923 del; 4450 mod
Patch: https://git.openjdk.java.net/jdk/pull/8425.diff
Fetch: git fetch https://git.openjdk.java.net/jdk pull/8425/head:pull/8425

PR: https://git.openjdk.java.net/jdk/pull/8425

From john.r.rose at oracle.com Thu May 5 06:16:07 2022
From: john.r.rose at oracle.com (John Rose)
Date: Thu, 5 May 2022 06:16:07 +0000
Subject: RFR: 8284050: [vectorapi] Optimize masked store for non-predicated architectures [v2]
In-Reply-To: <4k-ZTE8Uax5vrH9GMAc2MJClnaIc4zB9vteafm24S44=.e1f50cc8-3020-4091-831f-e5fa2b623f7a@github.com>
References: <4k-ZTE8Uax5vrH9GMAc2MJClnaIc4zB9vteafm24S44=.e1f50cc8-3020-4091-831f-e5fa2b623f7a@github.com>
Message-ID:

> On May 4, 2022, at 8:29 PM, Xiaohong Gong wrote:
>
> The offset check could save the `checkMaskFromIndexSize` for cases that offset are in the valid array bounds, which also improves the performance. @rose00 , do you think this part of change is ok at least?

That part is ok, yes. I wish we could get the same effect with loop optimizations but I don't know an easy way. The explicit check in the source code gives the JIT a crutch but I hope we can figure out a way in the future to integrate mask logic into range check elimination logic, making the crutches unnecessary. For now it's fine.
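
The explicit offset check discussed here can be sketched in scalar Java as follows. This is a hedged model, not the Vector API source: `OffsetInRangeSketch`, `VLEN`, `offsetInRange` and `checkSelectedLanes` are hypothetical names, and `checkSelectedLanes` only approximates the role that `checkMaskFromIndexSize` plays in the thread. It shows why the fast path can skip the per-lane mask check: if the whole vector-sized window fits in the array, every lane is in bounds regardless of the mask.

```java
import java.util.Arrays;

class OffsetInRangeSketch {
    static final int VLEN = 4; // hypothetical vector length in lanes

    // Fast-path test: the whole window [offset, offset + VLEN) is in bounds.
    static boolean offsetInRange(int offset, int arrayLength) {
        return offset >= 0 && offset <= arrayLength - VLEN;
    }

    // Slow-path test: only the lanes selected by the mask must be in bounds.
    static void checkSelectedLanes(int offset, boolean[] mask, int arrayLength) {
        for (int i = 0; i < VLEN; i++) {
            int index = offset + i;
            if (mask[i] && (index < 0 || index >= arrayLength)) {
                throw new IndexOutOfBoundsException("lane " + i + " maps to index " + index);
            }
        }
    }

    static void maskedStore(int[] a, int offset, int[] lanes, boolean[] mask) {
        if (!offsetInRange(offset, a.length)) {
            // Tail of the array: fall back to the per-lane mask check.
            checkSelectedLanes(offset, mask, a.length);
        }
        for (int i = 0; i < VLEN; i++) {
            if (mask[i]) {
                a[offset + i] = lanes[i];
            }
        }
    }

    public static void main(String[] args) {
        int[] a = new int[6];
        int[] lanes = {1, 2, 3, 4};
        maskedStore(a, 1, lanes, new boolean[] {true, true, true, true});   // fast path
        maskedStore(a, 4, lanes, new boolean[] {true, true, false, false}); // tail, slow path
        System.out.println(Arrays.toString(a)); // prints [0, 1, 2, 3, 1, 2]
    }
}
```

In a loop over an array, only the final tail iteration normally takes the slow path, which is the case the explicit check is meant to separate out.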
From duke at openjdk.java.net Thu May 5 08:16:29 2022 From: duke at openjdk.java.net (Emanuel Peter) Date: Thu, 5 May 2022 08:16:29 GMT Subject: RFR: 8282555: Missing memory edge when spilling MoveF2I, MoveD2L etc In-Reply-To: <0SZNw-9p9QDyotJDq-E4piJ-U2Jxc3vRMozOeyGWqn8=.115cc0d3-2a36-40bb-ad45-c10032fd9a8b@github.com> References: <7Z79cOcf4xyRUV4wQR_X4ZVvCta5fFevAS6HPBqwo2k=.bd06617d-7fa7-499d-8083-ee552b4680ed@github.com> <0SZNw-9p9QDyotJDq-E4piJ-U2Jxc3vRMozOeyGWqn8=.115cc0d3-2a36-40bb-ad45-c10032fd9a8b@github.com> Message-ID: On Thu, 31 Mar 2022 05:49:47 GMT, Jatin Bhateja wrote: >> @jatin-bhateja >> For all other cases (eg `addI`, `convI2L`, etc ) we are currently using >> https://github.com/openjdk/jdk/blob/cc598e03de39dd6e8d7e208a69d85b6a9cd0062f/src/hotspot/share/opto/chaitin.cpp#L1729 >> This seems to add the dependencies before the inputs. But that depends on there being space before the inputs. >> That is why we check `cisc->oper_input_base() > 1` >> https://github.com/openjdk/jdk/blob/cc598e03de39dd6e8d7e208a69d85b6a9cd0062f/src/hotspot/share/opto/chaitin.cpp#L1727 >> >> All other cases where we have spilling (eg `addI`, `convI2L`, etc ), we have `cisc->oper_input_base() == 2`. >> I think `_in[0]` is for control, and `_in[1]` for the memory edge (in the other cases not including `MoveF2I` etc). >> `cisc->oper_input_base() == 2` in all other cases, because in `InstructForm::oper_input_base` we ask for `MatchNode::needs_ideal_memory_edge`. >> In all cases except for `MoveF2I` etc, we say we need a memory edge there. Now we don't have space to set a memory edge before the inputs, and we simply do not set one. >> >> So how do we work with this? Do I just remove `cisc->ins_req(1,src)` for all cases and always use `add_req` as you have suggested? Or do I leave the other cases, and just make an else case and use `add_prec` there for `MoveF2I` etc? I am wondering what is a clean and consistent solution here. 
>> >> Do you know why we add the memory edge before the inputs in the other cases? If I use `add_prec` then that adds the memory edge after the inputs, correct? > >> Do you know why we add the memory edge before the inputs in the other cases? If I use `add_prec` then that adds the memory edge after the inputs, correct? > > Currently memory edges are being added for instructions which directly access memory, ADLC enforces this by scanning through Ideal nodes of a matcher pattern in top-down manner. > Almost all the machine nodes decorated with **Flag_is_cisc_alternate** flag access Load/Store IR in their selection patterns. Only exceptions[1][2][3] as you pointed out are the ones which perform load/store from stack locations. Thus all I am suggesting is for all such cases without doing many changes we can add the DEF_Spill precedence edge which gets added after all the inputs but will still constrain the scheduling order. > > Thus an instruction that has a CISC alternate but lacks memory_operand can be handled by adding a precedence edge. > > [1] MoveF2I_stack_regNode() { _num_opnds = 2; _opnds = _opnd_array; init_flags(Flag_is_cisc_alternate | Flag_needs_anti_dependence_check); } > [2] MoveI2F_stack_regNode() { _num_opnds = 2; _opnds = _opnd_array; init_flags(Flag_is_cisc_alternate | Flag_needs_anti_dependence_check); } > [3] MoveD2L_stack_regNode() { _num_opnds = 2; _opnds = _opnd_array; init_flags(Flag_is_cisc_alternate | Flag_needs_anti_dependence_check); } Thanks @jatin-bhateja @vnkozlov @TobiHartmann for the help, suggestions and reviews! 
------------- PR: https://git.openjdk.java.net/jdk/pull/7889 From duke at openjdk.java.net Thu May 5 08:19:27 2022 From: duke at openjdk.java.net (Emanuel Peter) Date: Thu, 5 May 2022 08:19:27 GMT Subject: Integrated: 8282555: Missing memory edge when spilling MoveF2I, MoveD2L etc In-Reply-To: References: Message-ID: <4iuJw21H_0iqZuVk0x9clyIvx6KdTunpJeTOTRDyS2U=.d8154e9e-b665-46a3-b92d-27b8497b967e@github.com> On Mon, 21 Mar 2022 11:02:35 GMT, Emanuel Peter wrote: > Update: > After the inputs from @jatin-bhateja, and verifying with @vnkozlov and @TobiHartmann , I have implemented a much simpler fix: > Whenever there is no pre-allocated space before the inputs for the memory edge, we simply add the memory edge after the inputs. > > This is a bit of an ad-hoc fix, but it is much simpler than the other two options. Changing the `.ad` files requires much more work. Adding `stackSlot` to `MatchNode::needs_ideal_memory_edge` would also be an ad-hoc fix. > > The added test still fails with other changes in mainline, and passes with my new fix. Ran it 50 times to verify. > Ran larger test suite, all passed. > > --------- > > In `PhaseChaitin::fixup_spills` we decide if we need a memory edge when reading from a spilled register. > Unfortunately, for `MoveF2I`, `MoveD2L` etc we do not add such memory edges. > This can lead to reversed scheduling, where we read from a `stackSlot` before we wrote to it, leading to wrong results. > (This happens intermittently, but the regression test did reproduce it at about a 10% rate) > > In `PhaseChaitin::fixup_spills` we decided if such a memory edge needs to be added by comparing `oper_input_base()` of the node before spilling and after spilling. If `oper_input_base()` of the `mach` node (before spilling) is 1, this means that node does not have a memory edge yet. And if `oper_input_base()` of the `cisc` node (after spilling) is 2, this means it needs a memory edge. 
In all spill cases I could find, the value is 1 and 2 respectively, except for MoveF2I etc, there it is 1 and 1 respectively, thus the memory edge was omitted. > > The values of `oper_input_base()` are determined in `InstructForm::oper_input_base`, where we query `MatchNode::needs_ideal_memory_edge`. This function checks if there is an `_opType` in the recursive match structure of this mach node, that matches one of a list of nodes (`StoreI, StoreF, ... LoadI, StoreF, ... etc.`). Unfortunately, MoveF2I etc do not have such a match. Instead of a `StoreF/LoadF`, they used `stackSlotF` (which is not recognized in `MatchNode::needs_ideal_memory_edge`). So it thinks there is no need for a memory edge. > > We saw 2 options to fix this issue: > 1) add `stackSlotI/L/P/D/F` to `MatchNode::needs_ideal_memory_edge`. However, this seems to be an inconsistent solution. The other items in that list are nodes, `stackSlot` is not. And other operations (like `addI, testI, etc.`) all use `LoadI/StoreI`, which is more generic (for heap and stack). > 2) Change the arguments and match rules to not use `stackSlot`, but `memory` arguments and `LoadI/StoreI` nodes. This is a consistent and more generic solution (the MoveF2I operation could now be used not just for stack spilling but also reading/writing from/to memory). > > I picked option 2. > Further, I now assume that we can always add such a memory edge when reading from a spilled register. This assumption did not get violated in my more extensive testing. > > While the regression test only failed about 10% due to this bug, the assert I added verifying that we add these memory edges did trigger 100% before I applied the fix in x86_64.ad. This means the memory edge was missing every time, just the scheduling varied and we were lucky most of the time. > > There are a few open points for discussion: > > - `loadSSI/L/P/F/D` still uses `stackSlot`. I have never observed that this operation gets its register spilled. 
But I still wonder if we should not have this operation use `memory/LoadI` instead of `stackSlotI`. I think we might even be able to simply remove `loadSSI` because it is already covered by what `loadI` does. (Update: Tests suggest `loadSSX` can be removed from `x86_64.ad`) > - So far I have only applied my fix to `x86_64.ad` -> we probably want to apply it to all platforms. > - Other platforms use `stackSlot` more often, for example in `x86_32.ad`. It may well be that some of these operations could also be spilled, which would probably also lead to missing memory edges. I wonder if we should maybe remove all occurrences of `stackSlot` in the `ad` files, or if we should still add `stackSlot` to `MatchNode::needs_ideal_memory_edge` to ensure we can always add the memory edges. This pull request has now been integrated. Changeset: 4a5e7a1a Author: Emanuel Peter Committer: Tobias Hartmann URL: https://git.openjdk.java.net/jdk/commit/4a5e7a1ada611cfdefdc3b9a6fada05494e07390 Stats: 23 lines in 2 files changed: 23 ins; 0 del; 0 mod 8282555: Missing memory edge when spilling MoveF2I, MoveD2L etc Reviewed-by: kvn, thartmann, jbhateja ------------- PR: https://git.openjdk.java.net/jdk/pull/7889 From roland at openjdk.java.net Thu May 5 08:35:58 2022 From: roland at openjdk.java.net (Roland Westrelin) Date: Thu, 5 May 2022 08:35:58 GMT Subject: RFR: 8281429: PhiNode::Value() is too conservative for tripcount of CountedLoop [v8] In-Reply-To: References: Message-ID: <9HLAsZAIDVUTM8SIqgJZiyqHucB9S6qKwYGkLqgjq_I=.c0c00b50-f5b9-4539-9a74-f4abeed3928b@github.com> > The type for the iv phi of a counted loop is computed from the types > of the phi on loop entry and the type of the limit from the exit > test. Because the exit test is applied to the iv after increment, the > type of the iv phi is at least one less than the limit (for a positive > stride, one more for a negative stride). 
> > Also, for a stride whose absolute value is not 1 and constant init and > limit values, it's possible to compute accurately the iv phi type. > > This change caused a few failures and I had to make a few adjustments > to loop opts code as well. Roland Westrelin has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 23 commits: - whitespaces - more review - Merge branch 'master' into JDK-8281429 - review - undo unneeded change - Merge branch 'master' into JDK-8281429 - redo change removed by error - review - Merge branch 'master' into JDK-8281429 - undo - ... and 13 more: https://git.openjdk.java.net/jdk/compare/29c2e54c...6d145597 ------------- Changes: https://git.openjdk.java.net/jdk/pull/7823/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=7823&range=07 Stats: 404 lines in 7 files changed: 385 ins; 1 del; 18 mod Patch: https://git.openjdk.java.net/jdk/pull/7823.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/7823/head:pull/7823 PR: https://git.openjdk.java.net/jdk/pull/7823 From roland at openjdk.java.net Thu May 5 08:35:59 2022 From: roland at openjdk.java.net (Roland Westrelin) Date: Thu, 5 May 2022 08:35:59 GMT Subject: RFR: 8281429: PhiNode::Value() is too conservative for tripcount of CountedLoop [v7] In-Reply-To: References: Message-ID: On Mon, 2 May 2022 16:06:42 GMT, Vladimir Kozlov wrote: >> Roland Westrelin has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 19 commits: >> >> - undo unneeded change >> - Merge branch 'master' into JDK-8281429 >> - redo change removed by error >> - review >> - Merge branch 'master' into JDK-8281429 >> - undo >> - test fix >> - more test >> - test & fix >> - other fix >> - ... and 9 more: https://git.openjdk.java.net/jdk/compare/dc635844...19b38997 > > I am fine with testing range [MIN_VALUE + stride, MAX_VALUE - stride] to exercise unsigned arithmetic. 
Whatever maximum loopopts allows. @vnkozlov the new commits should address your comments. Let me know if the new tests cover what you asked for. ------------- PR: https://git.openjdk.java.net/jdk/pull/7823 From xgong at openjdk.java.net Thu May 5 08:41:11 2022 From: xgong at openjdk.java.net (Xiaohong Gong) Date: Thu, 5 May 2022 08:41:11 GMT Subject: RFR: 8284050: [vectorapi] Optimize masked store for non-predicated architectures [v2] In-Reply-To: References: Message-ID: On Thu, 5 May 2022 02:09:39 GMT, Xiaohong Gong wrote: >> Currently the vectorization of masked vector store is implemented by the masked store instruction only on architectures that support the predicate feature. The compiler will fall back to the java scalar code for non-predicate supported architectures like ARM NEON. However, for these systems, the masked store can be vectorized with the non-masked vector `"load + blend + store"`. For example, storing a vector` "v"` controlled by a mask` "m"` into a memory with address` "addr" (i.e. "store(addr, v, m)")` can be implemented with: >> >> >> 1) mem_v = load(addr) ; non-masked load from the same memory >> 2) v = blend(mem_v, v, m) ; blend with the src vector with the mask >> 3) store(addr, v) ; non-masked store into the memory >> >> >> Since the first full loading needs the array offset must be inside of the valid array bounds, we make the compiler do the vectorization only when the offset is in range of the array boundary. And the compiler will still fall back to the java scalar code if not all offsets are valid. Besides, the original offset check for masked lanes are only applied when the offset is not always inside of the array range. This also improves the performance for masked store when the offset is always valid. The whole process is similar to the masked load API. 
>> >> Here is the performance data for the masked vector store benchmarks on a X86 non avx-512 system, which improves about `20x ~ 50x`: >> >> Benchmark before after Units >> StoreMaskedBenchmark.byteStoreArrayMask 221.733 11094.126 ops/ms >> StoreMaskedBenchmark.doubleStoreArrayMask 41.086 1034.408 ops/ms >> StoreMaskedBenchmark.floatStoreArrayMask 73.820 1985.015 ops/ms >> StoreMaskedBenchmark.intStoreArrayMask 75.028 2027.557 ops/ms >> StoreMaskedBenchmark.longStoreArrayMask 40.929 1032.928 ops/ms >> StoreMaskedBenchmark.shortStoreArrayMask 135.794 5307.567 ops/ms >> >> Similar performance gain can also be observed on ARM NEON system. >> >> And here is the performance data on X86 avx-512 system, which improves about `1.88x - 2.81x`: >> >> Benchmark before after Units >> StoreMaskedBenchmark.byteStoreArrayMask 11185.956 21012.824 ops/ms >> StoreMaskedBenchmark.doubleStoreArrayMask 1480.644 3911.720 ops/ms >> StoreMaskedBenchmark.floatStoreArrayMask 2738.352 7708.365 ops/ms >> StoreMaskedBenchmark.intStoreArrayMask 4191.904 9300.428 ops/ms >> StoreMaskedBenchmark.longStoreArrayMask 2025.031 4604.504 ops/ms >> StoreMaskedBenchmark.shortStoreArrayMask 8339.389 17817.128 ops/ms >> >> Similar performance gain can also be observed on ARM SVE system. > > Xiaohong Gong has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains one commit: > > 8284050: [vectorapi] Optimize masked store for non-predicated architectures > _Mailing list message from [John Rose](mailto:john.r.rose at oracle.com) on [hotspot-dev](mailto:hotspot-dev at mail.openjdk.java.net):_ > > > On May 4, 2022, at 8:29 PM, Xiaohong Gong wrote: > > The offset check could save the `checkMaskFromIndexSize` for cases that offset are in the valid array bounds, which also improves the performance. @rose00 , do you think this part of change is ok at least? > > That part is ok, yes. 
I wish we could get the same effect with loop optimizations but I don't know an easy way. The explicit check in the source code gives the JIT a crutch but I hope we can figure out a way in the future to integrate mask logic into range check elimination logic, making the crutches unnecessary. For now it's fine. Thanks! So I will separate this part out and fix it in another PR first. For the store masked vectorization with scatter or other ideas, I'm not quite sure whether they can always benefit cross architectures and need more investigation. I prefer to close this PR now. Thanks for all your comments! ------------- PR: https://git.openjdk.java.net/jdk/pull/8544 From xgong at openjdk.java.net Thu May 5 08:41:12 2022 From: xgong at openjdk.java.net (Xiaohong Gong) Date: Thu, 5 May 2022 08:41:12 GMT Subject: Withdrawn: 8284050: [vectorapi] Optimize masked store for non-predicated architectures In-Reply-To: References: Message-ID: <_XofoM1n91tFSRAE0q4CCkysHFK4Wha8a4IYaoj2xsU=.3df8d210-cf52-4bf0-81ba-a4cf3491ad20@github.com> On Thu, 5 May 2022 02:00:04 GMT, Xiaohong Gong wrote: > Currently the vectorization of masked vector store is implemented by the masked store instruction only on architectures that support the predicate feature. The compiler will fall back to the java scalar code for non-predicate supported architectures like ARM NEON. However, for these systems, the masked store can be vectorized with the non-masked vector `"load + blend + store"`. For example, storing a vector` "v"` controlled by a mask` "m"` into a memory with address` "addr" (i.e. 
"store(addr, v, m)")` can be implemented with: > > > 1) mem_v = load(addr) ; non-masked load from the same memory > 2) v = blend(mem_v, v, m) ; blend with the src vector with the mask > 3) store(addr, v) ; non-masked store into the memory > > > Since the first full loading needs the array offset must be inside of the valid array bounds, we make the compiler do the vectorization only when the offset is in range of the array boundary. And the compiler will still fall back to the java scalar code if not all offsets are valid. Besides, the original offset check for masked lanes are only applied when the offset is not always inside of the array range. This also improves the performance for masked store when the offset is always valid. The whole process is similar to the masked load API. > > Here is the performance data for the masked vector store benchmarks on a X86 non avx-512 system, which improves about `20x ~ 50x`: > > Benchmark before after Units > StoreMaskedBenchmark.byteStoreArrayMask 221.733 11094.126 ops/ms > StoreMaskedBenchmark.doubleStoreArrayMask 41.086 1034.408 ops/ms > StoreMaskedBenchmark.floatStoreArrayMask 73.820 1985.015 ops/ms > StoreMaskedBenchmark.intStoreArrayMask 75.028 2027.557 ops/ms > StoreMaskedBenchmark.longStoreArrayMask 40.929 1032.928 ops/ms > StoreMaskedBenchmark.shortStoreArrayMask 135.794 5307.567 ops/ms > > Similar performance gain can also be observed on ARM NEON system. 
> > And here is the performance data on X86 avx-512 system, which improves about `1.88x - 2.81x`: > > Benchmark before after Units > StoreMaskedBenchmark.byteStoreArrayMask 11185.956 21012.824 ops/ms > StoreMaskedBenchmark.doubleStoreArrayMask 1480.644 3911.720 ops/ms > StoreMaskedBenchmark.floatStoreArrayMask 2738.352 7708.365 ops/ms > StoreMaskedBenchmark.intStoreArrayMask 4191.904 9300.428 ops/ms > StoreMaskedBenchmark.longStoreArrayMask 2025.031 4604.504 ops/ms > StoreMaskedBenchmark.shortStoreArrayMask 8339.389 17817.128 ops/ms > > Similar performance gain can also be observed on ARM SVE system. This pull request has been closed without being integrated. ------------- PR: https://git.openjdk.java.net/jdk/pull/8544 From xgong at openjdk.java.net Thu May 5 08:56:07 2022 From: xgong at openjdk.java.net (Xiaohong Gong) Date: Thu, 5 May 2022 08:56:07 GMT Subject: RFR: 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature [v3] In-Reply-To: References: Message-ID: > Currently the vector load with mask when the given index happens out of the array boundary is implemented with pure java scalar code to avoid the IOOBE (IndexOutOfBoundsException). This is necessary for architectures that do not support the predicate feature. Because the masked load is implemented with a full vector load and a vector blend applied on it. And a full vector load will definitely cause the IOOBE which is not valid. However, for architectures that support the predicate feature like SVE/AVX-512/RVV, it can be vectorized with the predicated load instruction as long as the indexes of the masked lanes are within the bounds of the array. For these architectures, loading with unmasked lanes does not raise exception. > > This patch adds the vectorization support for the masked load with IOOBE part. 
Please see the original java implementation (FIXME: optimize): > > > @ForceInline > public static > ByteVector fromArray(VectorSpecies<Byte> species, > byte[] a, int offset, > VectorMask<Byte> m) { > ByteSpecies vsp = (ByteSpecies) species; > if (offset >= 0 && offset <= (a.length - species.length())) { > return vsp.dummyVector().fromArray0(a, offset, m); > } > > // FIXME: optimize > checkMaskFromIndexSize(offset, vsp, m, 1, a.length); > return vsp.vOp(m, i -> a[offset + i]); > } > > Since it can only be vectorized with the predicate load, the hotspot must check whether the current backend supports it and fall back to the java scalar version if not. This is different from the normal masked vector load that the compiler will generate a full vector load and a vector blend if the predicate load is not supported. So to let the compiler make the expected action, an additional flag (i.e. `usePred`) is added to the existing "loadMasked" intrinsic, with the value "true" for the IOOBE part while "false" for the normal load. And the compiler will fail to intrinsify if the flag is "true" and the predicate load is not supported by the backend, which means that normal java path will be executed. > > Also adds the same vectorization support for masked: > - fromByteArray/fromByteBuffer > - fromBooleanArray > - fromCharArray > > The performance of the newly added benchmarks improves by about `1.88x ~ 30.26x` on the x86 AVX-512 system: > > Benchmark before After Units > LoadMaskedIOOBEBenchmark.byteLoadArrayMaskIOOBE 737.542 1387.069 ops/ms > LoadMaskedIOOBEBenchmark.doubleLoadArrayMaskIOOBE 118.366 330.776 ops/ms > LoadMaskedIOOBEBenchmark.floatLoadArrayMaskIOOBE 233.832 6125.026 ops/ms > LoadMaskedIOOBEBenchmark.intLoadArrayMaskIOOBE 233.816 7075.923 ops/ms > LoadMaskedIOOBEBenchmark.longLoadArrayMaskIOOBE 119.771 330.587 ops/ms > LoadMaskedIOOBEBenchmark.shortLoadArrayMaskIOOBE 431.961 939.301 ops/ms > > Similar performance gain can also be observed on 512-bit SVE system. 
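The difference between the two lowering strategies described above can be modeled in scalar Java. This is a minimal sketch under stated assumptions: `MaskedLoadSketch`, `loadThenBlend` and `predicatedLoad` are hypothetical names, arrays stand in for vectors and masks, and unset lanes are modeled as zero. It only illustrates why a full load plus blend cannot serve the out-of-range case while a predicated load can.

```java
public final class MaskedLoadSketch {
    /**
     * Models "full vector load + blend": every lane is read from memory
     * first, so an out-of-bounds lane throws even when its mask bit is unset.
     */
    static int[] loadThenBlend(int[] a, int offset, boolean[] m) {
        int lanes = m.length;
        int[] full = new int[lanes];
        for (int i = 0; i < lanes; i++) {
            full[i] = a[offset + i]; // unconditional access to every lane
        }
        int[] result = new int[lanes];
        for (int i = 0; i < lanes; i++) {
            result[i] = m[i] ? full[i] : 0; // blend with zero for unset lanes
        }
        return result;
    }

    /**
     * Models a predicated load: only lanes whose mask bit is set touch memory,
     * so unset lanes may lie outside the array without raising an exception.
     */
    static int[] predicatedLoad(int[] a, int offset, boolean[] m) {
        int lanes = m.length;
        int[] result = new int[lanes];
        for (int i = 0; i < lanes; i++) {
            if (m[i]) {
                result[i] = a[offset + i];
            }
        }
        return result;
    }
}
```

For an offset where the whole vector is in bounds, both methods return the same lanes; they differ only when some unset lane falls outside the array, which is the case the patch targets.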
Xiaohong Gong has updated the pull request incrementally with one additional commit since the last revision: Rename "use_predicate" to "needs_predicate" ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/8035/files - new: https://git.openjdk.java.net/jdk/pull/8035/files/9b2d2f19..9c69206e Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=8035&range=02 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=8035&range=01-02 Stats: 4 lines in 1 file changed: 0 ins; 0 del; 4 mod Patch: https://git.openjdk.java.net/jdk/pull/8035.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8035/head:pull/8035 PR: https://git.openjdk.java.net/jdk/pull/8035 From xgong at openjdk.java.net Thu May 5 08:56:08 2022 From: xgong at openjdk.java.net (Xiaohong Gong) Date: Thu, 5 May 2022 08:56:08 GMT Subject: RFR: 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature [v2] In-Reply-To: References: Message-ID: On Thu, 5 May 2022 02:14:08 GMT, Xiaohong Gong wrote: >> src/hotspot/share/opto/vectorIntrinsics.cpp line 1232: >> >>> 1230: // out when current case uses the predicate feature. >>> 1231: if (!supports_predicate) { >>> 1232: bool use_predicate = false; >> >> If we rename this to needs_predicate it will be easier to understand. > > Thanks for the comment! This local variable will be removed after adding the similar intrinsify for store masked. Please help to see the PR https://github.com/openjdk/jdk/pull/8544. Thanks so much! Renamed to "needs_predicate". Thanks! ------------- PR: https://git.openjdk.java.net/jdk/pull/8035 From emanuel.peter at oracle.com Thu May 5 09:25:44 2022 From: emanuel.peter at oracle.com (Emanuel Peter) Date: Thu, 5 May 2022 09:25:44 +0000 Subject: Node::find(int) should not traverse from new to old nodes Message-ID: Hi, I have been bothered by find_node(idx) for a while. When I am looking at the Mach graph, and search for a node with an idx, I sometimes get old, sometimes new nodes. 
The reason is that Node::find does not just traverse input/output edges, but also debug_orig (if ASSERT is enabled). Via debug_orig, most Mach nodes link to their old node from the IR of previous phases. This way we can find multiple nodes for an idx. Node::find returns the last one it finds - sometimes the new one is last, sometimes the old one is last. At least it prints both pointers to the terminal. Here, I have a detailed writeup, and some proposed solutions: https://bugs.openjdk.java.net/browse/JDK-8286179 Do you agree that we should fix this? Would you pick one of my solutions, or propose a new one? Since this is a tool that probably many people are using for debugging, I do not want to break it for you. Best Regards, Emanuel Peter From dnsimon at openjdk.java.net Thu May 5 12:17:27 2022 From: dnsimon at openjdk.java.net (Doug Simon) Date: Thu, 5 May 2022 12:17:27 GMT Subject: RFR: 8282638: [JVMCI] Export array fill stubs to JVMCI compiler In-Reply-To: References: Message-ID: On Fri, 4 Mar 2022 03:13:38 GMT, Yi Yang wrote: > Export array _jint_fill,_jshort_fill,jbyte_fill,_arrayof_jshort_fill,_arrayof_jbyte_fill,_arrayof_jint_fill to JVMCI compiler Thanks for the explanation. Link for reference: https://github.com/oracle/graal/pull/4343 ------------- PR: https://git.openjdk.java.net/jdk/pull/7685 From shade at openjdk.java.net Thu May 5 13:42:42 2022 From: shade at openjdk.java.net (Aleksey Shipilev) Date: Thu, 5 May 2022 13:42:42 GMT Subject: RFR: 8286190: Add test to verify constant folding for Enum fields Message-ID: <477ODXUIboMjkT3uatiKuJZtriZMiGHpfrgfsIaTpfc=.de823dc7-5d3a-4ecc-8569-226e76522f3f@github.com> There is the [JDK-8161245](https://bugs.openjdk.java.net/browse/JDK-8161245) to make compilers trust Enum final fields. 
It was implicitly implemented by [JDK-8234049](https://bugs.openjdk.java.net/browse/JDK-8234049), which added the wildcard trust for everything in java/lang: https://github.com/openjdk/jdk/blob/c5a0687f80367a3a284dfd56781c371826264d3b/src/hotspot/share/ci/ciField.cpp#L230 It would be better to have the explicit test that verifies the constant folding of Enum fields indeed happens. Additional testing: - [x] Linux x86_64 fastdebug, new test passes - [x] Linux x86_64 release, new test passes - [x] Linux x86_32 fastdebug, new test passes ------------- Commit messages: - Fix Changes: https://git.openjdk.java.net/jdk/pull/8551/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=8551&range=00 Issue: https://bugs.openjdk.java.net/browse/JDK-8286190 Stats: 70 lines in 1 file changed: 70 ins; 0 del; 0 mod Patch: https://git.openjdk.java.net/jdk/pull/8551.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8551/head:pull/8551 PR: https://git.openjdk.java.net/jdk/pull/8551 From roland at openjdk.java.net Thu May 5 15:05:48 2022 From: roland at openjdk.java.net (Roland Westrelin) Date: Thu, 5 May 2022 15:05:48 GMT Subject: RFR: 8286197: C2: Optimize MemorySegment shape in int loop Message-ID: This is another small enhancement for a code shape that showed up in a MemorySegment micro benchmark. The shape to optimize is the one from test1: for (int i = 0; i < size; i++) { long j = i * UNSAFE.ARRAY_INT_INDEX_SCALE; j = Objects.checkIndex(j, size * 4); if (((base + j) & 3) != 0) { throw new RuntimeException(); } v += UNSAFE.getInt(base + j); } In that code shape, the loop iv is first scaled, the result is then cast to long, range checked, and finally the address of the memory location is computed. The alignment check is transformed so the loop body has no check. In order to eliminate the range check, that loop is transformed into: for (int i1 = ..) { for (int i2 = ..) 
{ long j = (i1 + i2) * UNSAFE.ARRAY_INT_INDEX_SCALE; j = Objects.checkIndex(j, size * 4); v += UNSAFE.getInt(base + j); } } The address shape is (AddP base (CastLL (ConvI2L (LShiftI (AddI ... In this case, the type of the ConvI2L is [min_jint, max_jint] and type of CastLL is [0, max_jint] (the CastLL has a narrower type). I propose transforming (CastLL (ConvI2L into (ConvI2L (CastII in that case. The ConvI2L and CastII types can be set to [0, max_jint]. The new address shape is then: (AddP base (ConvI2L (CastII (LShiftI (AddI ... which optimizes well. (LShiftI (AddI ... is transformed into (AddI (LShiftI ... because one of the AddI inputs is loop invariant (i2) and we have: (AddP base (ConvI2L (CastII (AddI (LShiftI ... Then because the ConvI2L and CastII types are [0, max_jint], the AddI is pushed through the ConvI2L and CastII: (AddP base (AddL (ConvI2L (CastII (LShiftI ... base and one of the inputs of the AddL are loop invariant so this is transformed into: (AddP (AddP ...) (ConvI2L (CastII (LShiftI ... The (AddP ...) is loop invariant so computed before entry. The (ConvI2L ...) only depends on the loop iv. The resulting address is a shift + an add. The address before transformation requires 2 adds + a shift. Also after unrolling, the address of the second access in the loop is cheaper to compute as it can be derived from the address of the first access. For all of this to work: 1) I added a CastLL::Ideal transformation: (CastLL (ConvI2L into (ConvI2L (CastII 2) I also had to prevent split-if from transforming (LShiftI (Phi for the iv Phi of a counted loop. test2 and test3 test 1) and 2) separately. 
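The address rewrite described above can be checked with plain integer arithmetic. The sketch below is an illustration under stated assumptions: `AddressShapeSketch`, `addressBefore`, `addressAfter` and the parameter names `inv` (the loop-invariant AddI input) and `iv` (the input that depends on the loop iv) are hypothetical, and the scale is taken to be 4 bytes, so the shift is 2. It shows that the two shapes compute the same address when the scaled index stays in [0, max_jint], and that they can differ once the int arithmetic overflows, which is why the narrowed types matter.

```java
public final class AddressShapeSketch {
    // Assumes a 4-byte int scale, so scaling is a left shift by 2.
    static final int SCALE_SHIFT = 2;

    /** Original shape: add in int, shift in int, widen to long, add to base. */
    static long addressBefore(long base, int inv, int iv) {
        return base + (long) ((inv + iv) << SCALE_SHIFT);
    }

    /** Rewritten shape: the invariant part is folded into the base up front. */
    static long addressAfter(long base, int inv, int iv) {
        long hoisted = base + (long) (inv << SCALE_SHIFT); // loop invariant
        return hoisted + (long) (iv << SCALE_SHIFT);       // shift + add per iteration
    }
}
```

With `inv = 1024` and `iv = 7` both methods return `base + 4124`. With `inv = iv = 1 << 29` the int shift wraps and the two results differ, so the rewrite is only valid under the range guarantee that the message attributes to the ConvI2L and CastII types.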
------------- Commit messages: - whitespaces - extra test comment - test & fix Changes: https://git.openjdk.java.net/jdk/pull/8555/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=8555&range=00 Issue: https://bugs.openjdk.java.net/browse/JDK-8286197 Stats: 162 lines in 5 files changed: 161 ins; 0 del; 1 mod Patch: https://git.openjdk.java.net/jdk/pull/8555.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8555/head:pull/8555 PR: https://git.openjdk.java.net/jdk/pull/8555 From kvn at openjdk.java.net Thu May 5 15:32:30 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Thu, 5 May 2022 15:32:30 GMT Subject: RFR: 8281429: PhiNode::Value() is too conservative for tripcount of CountedLoop [v8] In-Reply-To: <9HLAsZAIDVUTM8SIqgJZiyqHucB9S6qKwYGkLqgjq_I=.c0c00b50-f5b9-4539-9a74-f4abeed3928b@github.com> References: <9HLAsZAIDVUTM8SIqgJZiyqHucB9S6qKwYGkLqgjq_I=.c0c00b50-f5b9-4539-9a74-f4abeed3928b@github.com> Message-ID: On Thu, 5 May 2022 08:35:58 GMT, Roland Westrelin wrote: >> The type for the iv phi of a counted loop is computed from the types >> of the phi on loop entry and the type of the limit from the exit >> test. Because the exit test is applied to the iv after increment, the >> type of the iv phi is at least one less than the limit (for a positive >> stride, one more for a negative stride). >> >> Also, for a stride whose absolute value is not 1 and constant init and >> limit values, it's possible to compute accurately the iv phi type. >> >> This change caused a few failures and I had to make a few adjustments >> to loop opts code as well. > > Roland Westrelin has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 23 commits: > > - whitespaces > - more review > - Merge branch 'master' into JDK-8281429 > - review > - undo unneeded change > - Merge branch 'master' into JDK-8281429 > - redo change removed by error > - review > - Merge branch 'master' into JDK-8281429 > - undo > - ... 
and 13 more: https://git.openjdk.java.net/jdk/compare/29c2e54c...6d145597 This looks good. Thank you. ------------- Marked as reviewed by kvn (Reviewer). PR: https://git.openjdk.java.net/jdk/pull/7823 From kvn at openjdk.java.net Thu May 5 15:45:21 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Thu, 5 May 2022 15:45:21 GMT Subject: RFR: 8286190: Add test to verify constant folding for Enum fields In-Reply-To: <477ODXUIboMjkT3uatiKuJZtriZMiGHpfrgfsIaTpfc=.de823dc7-5d3a-4ecc-8569-226e76522f3f@github.com> References: <477ODXUIboMjkT3uatiKuJZtriZMiGHpfrgfsIaTpfc=.de823dc7-5d3a-4ecc-8569-226e76522f3f@github.com> Message-ID: On Thu, 5 May 2022 13:35:37 GMT, Aleksey Shipilev wrote: > There is the [JDK-8161245](https://bugs.openjdk.java.net/browse/JDK-8161245) to make compilers trust Enum final fields. It was implicitly implemented by [JDK-8234049](https://bugs.openjdk.java.net/browse/JDK-8234049), which added the wildcard trust for everything in java/lang: > https://github.com/openjdk/jdk/blob/c5a0687f80367a3a284dfd56781c371826264d3b/src/hotspot/share/ci/ciField.cpp#L230 > > It would be better to have the explicit test that verifies the constant folding of Enum fields indeed happens. > > Additional testing: > - [x] Linux x86_64 fastdebug, new test passes > - [x] Linux x86_64 release, new test passes > - [x] Linux x86_32 fastdebug, new test passes Nice find. So we are doing ENUM constant folding since JDK 14 and did not notice it ;) Test is good. Thanks! ------------- Marked as reviewed by kvn (Reviewer). PR: https://git.openjdk.java.net/jdk/pull/8551 From kvn at openjdk.java.net Thu May 5 16:01:18 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Thu, 5 May 2022 16:01:18 GMT Subject: RFR: 8286197: C2: Optimize MemorySegment shape in int loop In-Reply-To: References: Message-ID: On Thu, 5 May 2022 14:57:11 GMT, Roland Westrelin wrote: > This is another small enhancement for a code shape that showed up in a > MemorySegment micro benchmark. 
The shape to optimize is the one from test1: > > > for (int i = 0; i < size; i++) { > long j = i * UNSAFE.ARRAY_INT_INDEX_SCALE; > > j = Objects.checkIndex(j, size * 4); > > if (((base + j) & 3) != 0) { > throw new RuntimeException(); > } > > v += UNSAFE.getInt(base + j); > } > > > In that code shape, the loop iv is first scaled, the result is then cast > to long, range checked and finally the address of the memory location is > computed. > > The alignment check is transformed so the loop body has no check. In > order to eliminate the range check, that loop is transformed into: > > > for (int i1 = ..) { > for (int i2 = ..) { > long j = (i1 + i2) * UNSAFE.ARRAY_INT_INDEX_SCALE; > > j = Objects.checkIndex(j, size * 4); > > v += UNSAFE.getInt(base + j); > } > } > > > The address shape is (AddP base (CastLL (ConvI2L (LShiftI (AddI ... > > In this case, the type of the ConvI2L is [min_jint, max_jint] and the type > of the CastLL is [0, max_jint] (the CastLL has a narrower type). > > I propose transforming (CastLL (ConvI2L into (ConvI2L (CastII in that > case. The ConvI2L and CastII types can be set to [0, max_jint]. The > new address shape is then: > > (AddP base (ConvI2L (CastII (LShiftI (AddI ... > > which optimizes well. > > (LShiftI (AddI ... > is transformed into > (AddI (LShiftI ... > because one of the AddI inputs is loop invariant (i2) and we have: > > (AddP base (ConvI2L (CastII (AddI (LShiftI ... > > Then because the ConvI2L and CastII types are [0, max_jint], the AddI > is pushed through the ConvI2L and CastII: > > (AddP base (AddL (ConvI2L (CastII (LShiftI ... > > base and one of the inputs of the AddL are loop invariant so this is > transformed into: > > (AddP (AddP ...) (ConvI2L (CastII (LShiftI ... > > The (AddP ...) is loop invariant so computed before entry. The > (ConvI2L ...) only depends on the loop iv. > > The resulting address is a shift + an add. The address before > transformation requires 2 adds + a shift. 
Also after unrolling, the > address of the second access in the loop is cheaper to compute as it > can be derived from the address of the first access. > > For all of this to work: > 1) I added a CastLL::Ideal transformation: > (CastLL (ConvI2L into (ConvI2L (CastII > > 2) I also had to prevent split if to transform (LShiftI (Phi for the > iv Phi of a counted loop. > > > test2 and test3 test 1) and 2) separately. Good suggestion. I have comments. src/hotspot/share/opto/castnode.cpp line 385: > 383: const Type* t = Value(phase); > 384: const Type* t_in = phase->type(in1); > 385: if (t != Type::TOP && t_in != Type::TOP && t != t_in) { `t != t_in` does not mean that type is narrower in general case. I think we need to check ranges (types meet?). src/hotspot/share/opto/loopopts.cpp line 1084: > 1082: } > 1083: > 1084: // Check for having no control input; not pinned. Allow Wrong removed space. ------------- PR: https://git.openjdk.java.net/jdk/pull/8555 From roland at openjdk.java.net Thu May 5 16:08:13 2022 From: roland at openjdk.java.net (Roland Westrelin) Date: Thu, 5 May 2022 16:08:13 GMT Subject: RFR: 8286197: C2: Optimize MemorySegment shape in int loop In-Reply-To: References: Message-ID: On Thu, 5 May 2022 15:57:55 GMT, Vladimir Kozlov wrote: > Good suggestion. I have comments. Thanks for reviewing this. > src/hotspot/share/opto/castnode.cpp line 385: > >> 383: const Type* t = Value(phase); >> 384: const Type* t_in = phase->type(in1); >> 385: if (t != Type::TOP && t_in != Type::TOP && t != t_in) { > > `t != t_in` does not mean that type is narrower in general case. I think we need to check ranges (types meet?). Thanks for looking at this. t is the result of Value() which takes the type of its input into account so, AFAICT, there's no way t can be wider than t_in. Am I missing something? If not I could add an assert. What do you think? 
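The argument that `t` cannot be wider than `t_in` can be illustrated with a small range model. This is only a sketch in Java, not HotSpot code: `IntRange`, `join` and `isSubrangeOf` are hypothetical names, and it assumes the cast's type is obtained by intersecting the input's range with the cast's own range. Under that assumption the result is always contained in the input range, which is the property an assert would check.

```java
// Hypothetical model of an integer range type; this is NOT HotSpot's TypeInt/TypeLong.
public final class IntRange {
    public final long lo;
    public final long hi;

    public IntRange(long lo, long hi) {
        if (lo > hi) {
            throw new IllegalArgumentException("empty range");
        }
        this.lo = lo;
        this.hi = hi;
    }

    // Intersection of two ranges, or null when they do not overlap
    // (null stands in for an empty result in this sketch).
    public static IntRange join(IntRange a, IntRange b) {
        long lo = Math.max(a.lo, b.lo);
        long hi = Math.min(a.hi, b.hi);
        return lo <= hi ? new IntRange(lo, hi) : null;
    }

    // True when every value of this range is also contained in other.
    public boolean isSubrangeOf(IntRange other) {
        return other.lo <= lo && hi <= other.hi;
    }

    public static void main(String[] args) {
        // Ranges taken from the RFR: ConvI2L is [min_jint, max_jint], CastLL is [0, max_jint].
        IntRange input = new IntRange(Integer.MIN_VALUE, Integer.MAX_VALUE);
        IntRange cast = new IntRange(0, Integer.MAX_VALUE);
        IntRange t = join(input, cast);
        System.out.println(t.lo + ".." + t.hi);    // prints 0..2147483647
        System.out.println(t.isSubrangeOf(input)); // prints true
    }
}
```

In this model, `t != t_in` together with `t.isSubrangeOf(t_in)` implies that `t` is strictly narrower; whether the same holds for every type the real `Value()` can return is the open question in the review.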
------------- PR: https://git.openjdk.java.net/jdk/pull/8555 From jbhateja at openjdk.java.net Thu May 5 16:09:20 2022 From: jbhateja at openjdk.java.net (Jatin Bhateja) Date: Thu, 5 May 2022 16:09:20 GMT Subject: RFR: 8284960: Integration of JEP 426: Vector API (Fourth Incubator) [v2] In-Reply-To: References: Message-ID: On Thu, 5 May 2022 05:47:47 GMT, Jatin Bhateja wrote: >> Hi All, >> >> Patch adds the planned support for new vector operations and APIs targeted for [JEP 426: Vector API (Fourth Incubator).](https://bugs.openjdk.java.net/browse/JDK-8280173) >> >> Following is the brief summary of changes:- >> >> 1) Extends the scope of existing lanewise API for following new vector operations. >> - VectorOperations.BIT_COUNT: counts the number of one-bits >> - VectorOperations.LEADING_ZEROS_COUNT: counts the number of leading zero bits >> - VectorOperations.TRAILING_ZEROS_COUNT: counts the number of trailing zero bits >> - VectorOperations.REVERSE: reversing the order of bits >> - VectorOperations.REVERSE_BYTES: reversing the order of bytes >> - compress and expand bits: Semantics are based on Hacker's Delight section 7-4 Compress, or Generalized Extract. >> >> 2) Adds following new APIs to perform cross lane vector compress and expansion operations under the influence of a mask. >> - Vector.compress >> - Vector.expand >> - VectorMask.compress >> >> 3) Adds predicated and non-predicated versions of following new APIs to load and store the contents of vector from foreign MemorySegments. >> - Vector.fromMemorySegment >> - Vector.intoMemorySegment >> >> 4) C2 Compiler IR enhancements and optimized X86 and AARCH64 backend support for each newly added operation. >> >> >> Patch has been regressed over AARCH64 and X86 targets different AVX levels. >> >> Kindly review and share your feedback. >> >> Best Regards, >> Jatin > > Jatin Bhateja has updated the pull request with a new target base due to a merge or a rebase. 
The pull request now contains 10 commits: > > - 8284960: Correcting a typo. > - 8284960: Integrating changes from panama-vector (Add @since 19 tags). > - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8284960 > - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8284960 > - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8284960 > - 8284960: AARCH64 backend changes. > - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8284960 > - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8284960 > - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8284960 > - 8284960: Integration of JEP 426: Vector API (Fourth Incubator) Hi @vnkozlov , It will be helpful if you can kindly review the changes. ------------- PR: https://git.openjdk.java.net/jdk/pull/8425 From vladimir.kozlov at oracle.com Thu May 5 16:16:45 2022 From: vladimir.kozlov at oracle.com (vladimir.kozlov at oracle.com) Date: Thu, 5 May 2022 09:16:45 -0700 Subject: Node::find(int) should not traverse from new to old nodes In-Reply-To: References: Message-ID: <9ec32785-9b68-66cc-349b-0a8ef9cb75e6@oracle.com> I think the original intention to search in old IR is to find any node with specified index. But you are right about inconsistency of implementation. I prefer to have a separate search of old nodes in separate find_old_node() method and remove such search from default find_node() (your Solution 1). It is still useful to look through old IR when debug. Thanks, Vladimir K On 5/5/22 2:25 AM, Emanuel Peter wrote: > Hi, > > I have been bothered by find_node(idx) for a while. When I am looking at the Mach graph, and search for a node with an idx, I sometimes get old, sometimes new nodes. The reason is that Node::find does not just traverse input/output edges, but also debug_orig (if ASSERT is enabled). Via debug_orig, most Mach nodes link to their old node from the IR of previous phases. 
This way we can find multiple nodes for an idx. Node::find returns the last one it finds - sometimes the new one is last, sometimes the old one is last. At least it prints both pointers to the terminal. > > Here, I have a detailed writeup, and some proposed solutions: > https://bugs.openjdk.java.net/browse/JDK-8286179 > > Do you agree that we should fix this? Would you pick one of my solutions, or propose a new one? > Since this is a tool that probably many people are using for debugging, I do not want to break it for you. > > Best Regards, > Emanuel Peter From kvn at openjdk.java.net Thu May 5 16:50:17 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Thu, 5 May 2022 16:50:17 GMT Subject: RFR: 8286197: C2: Optimize MemorySegment shape in int loop In-Reply-To: References: Message-ID: On Thu, 5 May 2022 16:04:39 GMT, Roland Westrelin wrote: >> src/hotspot/share/opto/castnode.cpp line 385: >> >>> 383: const Type* t = Value(phase); >>> 384: const Type* t_in = phase->type(in1); >>> 385: if (t != Type::TOP && t_in != Type::TOP && t != t_in) { >> >> `t != t_in` does not mean that type is narrower in general case. I think we need to check ranges (types meet?). > > Thanks for looking at this. t is the result of Value() which takes the type of its input into account so, AFAICT, there's no way t can be wider than t_in. Am I missing something? If not I could add an assert. What do you think? There is no specialized `CastLLNode::Value()` and `ConstraintCastNode` only calls `filter_speculative()` which do call `join()`. May be it is indeed enough. Yes, would be nice to have an assert to make sure we got it right. 
------------- PR: https://git.openjdk.java.net/jdk/pull/8555 From kvn at openjdk.java.net Thu May 5 18:26:18 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Thu, 5 May 2022 18:26:18 GMT Subject: RFR: 8286104: use aggressive liveness for unstable_if traps In-Reply-To: References: Message-ID: On Thu, 5 May 2022 05:30:06 GMT, Xin Liu wrote: > I found that some phi nodes are useful because their uses are uncommon_trap nodes. In the worst case, it would hinder boxing-object elimination and scalar replacement. Test.java of JDK-8286104 reveals this issue. This patch allows c2 parser to collect liveness based on next bci for unstable_if traps. In most cases, next bci is closer to exits, so live locals are diminishing. It helps to reduce the number of inputs of unstable_if traps. > > This is not a REDO of Optimization of Box nodes in uncommon_trap(JDK-8261137). The two patches are orthogonal. I adapted the test from [TestEliminateBoxInDebugInfo.java](https://github.com/openjdk/jdk/pull/2401/files#diff-49b2e38825aa4c28ca196bdc70c3cbecc2e835c2899f4f393527df4796b177ea), so part of credit goes to the original author. I found that Scalar replacement can take care of the object `Integer ii = Integer.valueOf(value)` in original test even if it can't be removed by later inliner. I tweak the profiling data of Integer.valueOf() to hinder scalar replacement. > > This patch can cover the problem discovered by JDK-8261137. I ran the microbench and got 9x speedup on x86_64. It's almost the same as JDK-8261137. > > Before: > > Benchmark Mode Cnt Score Error Units > MyBenchmark.testMethod avgt 10 32.776 ± 0.075 us/op > > After: > > Benchmark Mode Cnt Score Error Units > MyBenchmark.testMethod avgt 10 3.656 ± 0.006 us/op > ``` > > Testing > I ran tier1 test with and without `-XX:+DeoptimizeALot`. No regression has been found yet. Are you doing BCI and SP manipulation only to affect the result of `liveness_at_bci()` call in `kill_dead_locals()`? Maybe it can be done in a less disruptive way. 
I am concerned that a different `bci` and `sp` may affect the correctness of uncommon trap call generation (its debug info). It affects the result of `compute_stack_effects()`, `too_many_recompiles()` and `should_reexecute_implied_by_bytecode()`. In addition, logs about such an uncommon trap will be different. ------------- PR: https://git.openjdk.java.net/jdk/pull/8545 From xliu at openjdk.java.net Thu May 5 19:13:22 2022 From: xliu at openjdk.java.net (Xin Liu) Date: Thu, 5 May 2022 19:13:22 GMT Subject: RFR: 8282024: add EscapeAnalysis statistics under PrintOptoStatistics [v8] In-Reply-To: References: Message-ID: On Wed, 4 May 2022 01:36:13 GMT, aamarsh wrote: >> Escape Analysis and Scalar Replacement statistics were added when the -XX:+PrintOptoStatistics flag is set. All code is placed in `#ifndef Product` block, so this code is only run when creating a debug build. Using renaissance benchmark I ran a few tests to confirm that numbers were printing correctly. Below is an example run: >> >> >> No escape = 372, Arg escape = 74, Global escape = 1855 (EA executed in 10.49 seconds) >> Objects scalar replaced = 240, Monitor objects removed = 44, GC barriers removed = 37, Memory barriers removed = 284 > > aamarsh has refreshed the contents of this pull request, and previous commits have been removed. The incremental views will show differences compared to the previous content of the PR. The pull request contains one new commit since the last revision: > > adding escape analysis and scalar replacement statistics LGTM. I am not a reviewer. We need other reviewers to approve it. src/hotspot/share/opto/escape.cpp line 3794: > 3792: _compile->_local_arg_escape_ctr++; > 3793: } > 3794: else if (ptn->escape_state() == PointsToNode::GlobalEscape) { "else if" style is not consistent with others. ------------- Marked as reviewed by xliu (Committer). 
PR: https://git.openjdk.java.net/jdk/pull/8019 From xliu at openjdk.java.net Thu May 5 19:13:27 2022 From: xliu at openjdk.java.net (Xin Liu) Date: Thu, 5 May 2022 19:13:27 GMT Subject: RFR: 8282024: add EscapeAnalysis statistics under PrintOptoStatistics [v7] In-Reply-To: References: <3_nhaxzU2R-tZYRIUFq0qIxVpy0KX0ilkVpvCekM5zE=.2fb159c1-d2ed-4d31-98c8-e0a49fb59937@github.com> Message-ID: On Wed, 4 May 2022 01:23:09 GMT, aamarsh wrote: >> src/hotspot/share/opto/macro.cpp line 2608: >> >>> 2606: } >>> 2607: >>> 2608: int PhaseMacroExpand::count_MemBar() { >> >> I am not sure about this procedural. Even though you use Unique_Node_List, is it still possible to count the same membar multiple times? or maybe a backedge cause an infinitely loop? >> >> I think you can use Compile::identify_useful_nodes to collect all useful nodes and then count MemBar nodes. > > @navyxliu Since no nodes are ever removed from the list, I think this issue is avoided. If we find a loop back to a node we have already searched and attempt to push this node at line 2620, it will not be added to the list because it already exists. okay. you're right. This is BFS but no element is pop. I also ran twice jtreg with JTREG="VM_OPTIONS=-XX:+PrintOptoStatistics", it's safe. ------------- PR: https://git.openjdk.java.net/jdk/pull/8019 From xliu at openjdk.java.net Thu May 5 19:13:30 2022 From: xliu at openjdk.java.net (Xin Liu) Date: Thu, 5 May 2022 19:13:30 GMT Subject: RFR: 8282024: add EscapeAnalysis statistics under PrintOptoStatistics [v7] In-Reply-To: References: <3_nhaxzU2R-tZYRIUFq0qIxVpy0KX0ilkVpvCekM5zE=.2fb159c1-d2ed-4d31-98c8-e0a49fb59937@github.com> Message-ID: On Thu, 5 May 2022 19:05:41 GMT, Xin Liu wrote: >> @navyxliu Since no nodes are ever removed from the list, I think this issue is avoided. If we find a loop back to a node we have already searched and attempt to push this node at line 2620, it will not be added to the list because it already exists. > > okay. you're right. 
This is BFS but no element is pop. > I also ran twice jtreg with JTREG="VM_OPTIONS=-XX:+PrintOptoStatistics", it's safe. nits: This function seems not to be part of 'PhaseMacroExpand', at least it could be a static member function. ------------- PR: https://git.openjdk.java.net/jdk/pull/8019 From psandoz at openjdk.java.net Thu May 5 19:30:59 2022 From: psandoz at openjdk.java.net (Paul Sandoz) Date: Thu, 5 May 2022 19:30:59 GMT Subject: RFR: 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature [v3] In-Reply-To: References: Message-ID: On Thu, 5 May 2022 08:56:07 GMT, Xiaohong Gong wrote: >> Currently the vector load with mask when the given index happens out of the array boundary is implemented with pure java scalar code to avoid the IOOBE (IndexOutOfBoundaryException). This is necessary for architectures that do not support the predicate feature. Because the masked load is implemented with a full vector load and a vector blend applied on it. And a full vector load will definitely cause the IOOBE which is not valid. However, for architectures that support the predicate feature like SVE/AVX-512/RVV, it can be vectorized with the predicated load instruction as long as the indexes of the masked lanes are within the bounds of the array. For these architectures, loading with unmasked lanes does not raise exception. >> >> This patch adds the vectorization support for the masked load with IOOBE part. 
Please see the original java implementation (FIXME: optimize): >> >> >> @ForceInline >> public static >> ByteVector fromArray(VectorSpecies species, >> byte[] a, int offset, >> VectorMask m) { >> ByteSpecies vsp = (ByteSpecies) species; >> if (offset >= 0 && offset <= (a.length - species.length())) { >> return vsp.dummyVector().fromArray0(a, offset, m); >> } >> >> // FIXME: optimize >> checkMaskFromIndexSize(offset, vsp, m, 1, a.length); >> return vsp.vOp(m, i -> a[offset + i]); >> } >> >> Since it can only be vectorized with the predicate load, the hotspot must check whether the current backend supports it and falls back to the java scalar version if not. This is different from the normal masked vector load that the compiler will generate a full vector load and a vector blend if the predicate load is not supported. So to let the compiler make the expected action, an additional flag (i.e. `usePred`) is added to the existing "loadMasked" intrinsic, with the value "true" for the IOOBE part while "false" for the normal load. And the compiler will fail to intrinsify if the flag is "true" and the predicate load is not supported by the backend, which means that normal java path will be executed. 
>> >> Also adds the same vectorization support for masked: >> - fromByteArray/fromByteBuffer >> - fromBooleanArray >> - fromCharArray >> >> The performance for the new added benchmarks improve about `1.88x ~ 30.26x` on the x86 AVX-512 system: >> >> Benchmark before After Units >> LoadMaskedIOOBEBenchmark.byteLoadArrayMaskIOOBE 737.542 1387.069 ops/ms >> LoadMaskedIOOBEBenchmark.doubleLoadArrayMaskIOOBE 118.366 330.776 ops/ms >> LoadMaskedIOOBEBenchmark.floatLoadArrayMaskIOOBE 233.832 6125.026 ops/ms >> LoadMaskedIOOBEBenchmark.intLoadArrayMaskIOOBE 233.816 7075.923 ops/ms >> LoadMaskedIOOBEBenchmark.longLoadArrayMaskIOOBE 119.771 330.587 ops/ms >> LoadMaskedIOOBEBenchmark.shortLoadArrayMaskIOOBE 431.961 939.301 ops/ms >> >> Similar performance gain can also be observed on 512-bit SVE system. > > Xiaohong Gong has updated the pull request incrementally with one additional commit since the last revision: > > Rename "use_predicate" to "needs_predicate" src/hotspot/share/opto/vectorIntrinsics.cpp line 1238: > 1236: } else { > 1237: // Masked vector load with IOOBE always uses the predicated load. > 1238: const TypeInt* offset_in_range = gvn().type(argument(8))->isa_int(); Should it be `argument(7)`? (and adjustments later to access the container). ------------- PR: https://git.openjdk.java.net/jdk/pull/8035 From john.r.rose at oracle.com Thu May 5 20:35:33 2022 From: john.r.rose at oracle.com (John Rose) Date: Thu, 05 May 2022 13:35:33 -0700 Subject: RFR: 8286190: Add test to verify constant folding for Enum fields In-Reply-To: References: <477ODXUIboMjkT3uatiKuJZtriZMiGHpfrgfsIaTpfc=.de823dc7-5d3a-4ecc-8569-226e76522f3f@github.com> Message-ID: <62832F36-60AA-4C6B-9CDD-3CBD53DDFC87@oracle.com> This reminds me that our enum switches do not allow constant folding, because of our translation strategy. The root cause is that the T.S. open-codes the discrimination logic for switches, and it happens to use Java arrays, which cannot be constant-folded. 
There is a way to "trust" an array element, called `@Stable`, but it can only be used inside a JDK runtime support method, not in a random classfile generated by javac. So the T.S. should encapsulate switch discrimination logic so that a runtime support routine can use appropriate technology, such as stable arrays and/or perfect hashes and/or binary searches. It's not really javac's job to build such things, and when it tries it just gets in the way of doing a better job in the runtime. https://bugs.openjdk.java.net/browse/JDK-8161250 tracks this general issue. These points relate to this bug only in that, if we had our ducks in a row with translation of switch, we could test also for constant-folding of switch logic. On 5 May 2022, at 8:45, Vladimir Kozlov wrote: > On Thu, 5 May 2022 13:35:37 GMT, Aleksey Shipilev > wrote: > >> There is the >> [JDK-8161245](https://bugs.openjdk.java.net/browse/JDK-8161245) to >> make compilers trust Enum final fields. It was implicitly implemented >> by [JDK-8234049](https://bugs.openjdk.java.net/browse/JDK-8234049), >> which added the wildcard trust for everything in java/lang: >> https://github.com/openjdk/jdk/blob/c5a0687f80367a3a284dfd56781c371826264d3b/src/hotspot/share/ci/ciField.cpp#L230 >> >> It would be better to have the explicit test that verifies the >> constant folding of Enum fields indeed happens. >> >> Additional testing: >> - [x] Linux x86_64 fastdebug, new test passes >> - [x] Linux x86_64 release, new test passes >> - [x] Linux x86_32 fastdebug, new test passes > > Nice find. So we are doing ENUM constant folding since JDK 14 and did > not notice it ;) > > Test is good. Thanks! > > ------------- > > Marked as reviewed by kvn (Reviewer). 
> > PR: https://git.openjdk.java.net/jdk/pull/8551 From fjiang at openjdk.java.net Fri May 6 01:08:54 2022 From: fjiang at openjdk.java.net (Feilong Jiang) Date: Fri, 6 May 2022 01:08:54 GMT Subject: RFR: 8285378: Remove unnecessary nop for C1 exception and deopt handler In-Reply-To: References: <8SyLV_zXQ5gz0T7LsxjDmRf8BHTbScsFxSZkc8krxpY=.ebc5d1a6-6f3c-4e1d-b1c2-f5d279fbefff@github.com> Message-ID: On Thu, 5 May 2022 01:13:27 GMT, Feilong Jiang wrote: > I will test these changes on riscv, the results will be available in about one day. Test looks good. ------------- PR: https://git.openjdk.java.net/jdk/pull/8341 From xgong at openjdk.java.net Fri May 6 03:51:01 2022 From: xgong at openjdk.java.net (Xiaohong Gong) Date: Fri, 6 May 2022 03:51:01 GMT Subject: RFR: 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature [v3] In-Reply-To: References: Message-ID: On Thu, 5 May 2022 19:27:47 GMT, Paul Sandoz wrote: >> Xiaohong Gong has updated the pull request incrementally with one additional commit since the last revision: >> >> Rename "use_predicate" to "needs_predicate" > > src/hotspot/share/opto/vectorIntrinsics.cpp line 1238: > >> 1236: } else { >> 1237: // Masked vector load with IOOBE always uses the predicated load. >> 1238: const TypeInt* offset_in_range = gvn().type(argument(8))->isa_int(); > > Should it be `argument(7)`? (and adjustments later to access the container). I'm afraid it's `argument(8)` for the load operation since the `argument(7)` is the mask input. It seems the argument number is not right begin from the mask input which is expected to be `6`. But the it's not. Actually I don't quite understand why. 
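One possible reading of the numbering, sketched in Java under the assumption that intrinsic arguments are counted in JVM slots, where a `long` or `double` occupies two slots and every other type one. The parameter list below is a placeholder and not the real `loadMasked` signature: five single-slot parameters, then a `long` offset, a mask and a `boolean` flag. `ArgumentSlots`, `slotSize` and `firstSlots` are hypothetical names.

```java
import java.util.List;

// Illustrative slot counting only; the parameter list in main() is made up.
public final class ArgumentSlots {

    // JVM convention: long and double take two slots, all other types one.
    public static int slotSize(Class<?> type) {
        return (type == long.class || type == double.class) ? 2 : 1;
    }

    // Returns, for each parameter, the index of the first slot it occupies.
    public static int[] firstSlots(List<Class<?>> params) {
        int[] slots = new int[params.size()];
        int next = 0;
        for (int i = 0; i < params.size(); i++) {
            slots[i] = next;
            next += slotSize(params.get(i));
        }
        return slots;
    }

    public static void main(String[] args) {
        List<Class<?>> params = List.of(
                Object.class, Object.class, Object.class, Object.class, Object.class,
                long.class,     // offset: occupies slots 5 and 6
                Object.class,   // mask
                boolean.class); // in-range flag
        int[] slots = firstSlots(params);
        // prints offset=5 mask=7 flag=8
        System.out.println("offset=" + slots[5] + " mask=" + slots[6] + " flag=" + slots[7]);
    }
}
```

With this layout the mask lands in slot 7 rather than 6, and the parameter after it in slot 8.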
------------- PR: https://git.openjdk.java.net/jdk/pull/8035 From sviswanathan at openjdk.java.net Fri May 6 04:25:49 2022 From: sviswanathan at openjdk.java.net (Sandhya Viswanathan) Date: Fri, 6 May 2022 04:25:49 GMT Subject: RFR: 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature [v3] In-Reply-To: References: Message-ID: On Fri, 6 May 2022 03:47:47 GMT, Xiaohong Gong wrote: >> src/hotspot/share/opto/vectorIntrinsics.cpp line 1238: >> >>> 1236: } else { >>> 1237: // Masked vector load with IOOBE always uses the predicated load. >>> 1238: const TypeInt* offset_in_range = gvn().type(argument(8))->isa_int(); >> >> Should it be `argument(7)`? (and adjustments later to access the container). > > I'm afraid it's `argument(8)` for the load operation since the `argument(7)` is the mask input. It seems the argument number is not right begin from the mask input which is expected to be `6`. But the it's not. Actually I don't quite understand why. offset is long so uses two argument slots (5 and 6). mask is argument (7). offsetInRange is argument(8). ------------- PR: https://git.openjdk.java.net/jdk/pull/8035 From dlong at openjdk.java.net Fri May 6 04:50:20 2022 From: dlong at openjdk.java.net (Dean Long) Date: Fri, 6 May 2022 04:50:20 GMT Subject: RFR: 8286263: compiler/c1/TestPinnedIntrinsics.java failed with "RuntimeException: testCurrentTimeMillis failed with -3" Message-ID: This test incorrectly assumes calls to System.currentTimeMillis() are monotonic. The only fix I can think of is to remove that test and leave the test for System.nanoTime(). 
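The difference between the two clocks can be sketched as follows. This is a minimal Java illustration, not code from the test: `ElapsedTime` and `elapsedNanos` are hypothetical names. It relies on the documented intent that `System.nanoTime()` is meant for measuring elapsed time, while `System.currentTimeMillis()` reports wall-clock time and may step backwards when the system clock is adjusted, which is one plausible cause of a negative difference such as -3.

```java
// Measures elapsed time with System.nanoTime() instead of wall-clock time.
public final class ElapsedTime {

    // Returns the number of nanoseconds the task took to run.
    public static long elapsedNanos(Runnable task) {
        long start = System.nanoTime();
        task.run();
        // Always compare nanoTime values by subtraction, as its javadoc recommends.
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        long nanos = elapsedNanos(() -> {
            long sum = 0;
            for (int i = 0; i < 1_000_000; i++) {
                sum += i;
            }
            if (sum < 0) {
                throw new IllegalStateException("unreachable");
            }
        });
        System.out.println("elapsed ns: " + nanos);
    }
}
```

A test that asserts `currentTimeMillis()` never decreases between two calls depends on the host clock not being adjusted during the run, so keeping only the `nanoTime()` check avoids that dependency.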
------------- Commit messages: - remove test for System.currentTimeMillis() because it is not monotonic Changes: https://git.openjdk.java.net/jdk/pull/8566/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=8566&range=00 Issue: https://bugs.openjdk.java.net/browse/JDK-8286263 Stats: 15 lines in 1 file changed: 0 ins; 14 del; 1 mod Patch: https://git.openjdk.java.net/jdk/pull/8566.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8566/head:pull/8566 PR: https://git.openjdk.java.net/jdk/pull/8566 From xgong at openjdk.java.net Fri May 6 04:52:56 2022 From: xgong at openjdk.java.net (Xiaohong Gong) Date: Fri, 6 May 2022 04:52:56 GMT Subject: RFR: 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature [v3] In-Reply-To: References: Message-ID: On Fri, 6 May 2022 04:22:30 GMT, Sandhya Viswanathan wrote: >> I'm afraid it's `argument(8)` for the load operation since the `argument(7)` is the mask input. It seems the argument number is not right begin from the mask input which is expected to be `6`. But the it's not. Actually I don't quite understand why. > > offset is long so uses two argument slots (5 and 6). > mask is argument (7). > offsetInRange is argument(8). Make sense! Thanks for the explanation! ------------- PR: https://git.openjdk.java.net/jdk/pull/8035 From thartmann at openjdk.java.net Fri May 6 06:14:47 2022 From: thartmann at openjdk.java.net (Tobias Hartmann) Date: Fri, 6 May 2022 06:14:47 GMT Subject: RFR: 8286263: compiler/c1/TestPinnedIntrinsics.java failed with "RuntimeException: testCurrentTimeMillis failed with -3" In-Reply-To: References: Message-ID: On Fri, 6 May 2022 04:42:50 GMT, Dean Long wrote: > This test incorrectly assumes calls to System.currentTimeMillis() are monotonic. The only fix I can think of is to remove that test and leave the test for System.nanoTime(). Looks good to me, thanks for fixing. ------------- Marked as reviewed by thartmann (Reviewer). 
PR: https://git.openjdk.java.net/jdk/pull/8566 From tobias.hartmann at oracle.com Fri May 6 06:25:36 2022 From: tobias.hartmann at oracle.com (Tobias Hartmann) Date: Fri, 6 May 2022 08:25:36 +0200 Subject: Node::find(int) should not traverse from new to old nodes In-Reply-To: <9ec32785-9b68-66cc-349b-0a8ef9cb75e6@oracle.com> References: <9ec32785-9b68-66cc-349b-0a8ef9cb75e6@oracle.com> Message-ID: <62bf7cda-5b07-fe62-ca5c-29489d5241ad@oracle.com> Thanks for the write-up. I agree with Vladimir, let's go with two separate find methods. Best regards, Tobias On 05.05.22 18:16, vladimir.kozlov at oracle.com wrote: > I think the original intention to search in old IR is to find any node with specified index. But you > are right about inconsistency of implementation. > > I prefer to have a separate search of old nodes in separate find_old_node() method and remove such > search from default find_node() (your Solution 1). It is still useful to look through old IR when > debug. > > Thanks, > Vladimir K > > On 5/5/22 2:25 AM, Emanuel Peter wrote: >> Hi, >> >> I have been bothered by find_node(idx) for a while. When I am looking at the Mach graph, and >> search for a node with an idx, I sometimes get old, sometimes new nodes. The reason is that >> Node::find does not just traverse input/output edges, but also debug_orig (if ASSERT is enabled). >> Via debug_orig, most Mach nodes link to their old node from the IR of previous phases. This way we >> can find multiple nodes for an idx. Node::find returns the last one it finds - sometimes the new >> one is last, sometimes the old one is last. At least it prints both pointers to the terminal. >> >> Here, I have a detailed writeup, and some proposed solutions: >> https://bugs.openjdk.java.net/browse/JDK-8286179 >> >> Do you agree that we should fix this? Would you pick one of my solutions, or propose a new one? >> Since this is a tool that probably many people are using for debugging, I do not want to break it >> for you. 
>> >> Best Regards, >> Emanuel Peter From kvn at openjdk.java.net Fri May 6 06:41:58 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Fri, 6 May 2022 06:41:58 GMT Subject: RFR: 8286263: compiler/c1/TestPinnedIntrinsics.java failed with "RuntimeException: testCurrentTimeMillis failed with -3" In-Reply-To: References: Message-ID: On Fri, 6 May 2022 04:42:50 GMT, Dean Long wrote: > This test incorrectly assumes calls to System.currentTimeMillis() are monotonic. The only fix I can think of is to remove that test and leave the test for System.nanoTime(). good ------------- Marked as reviewed by kvn (Reviewer). PR: https://git.openjdk.java.net/jdk/pull/8566 From thartmann at openjdk.java.net Fri May 6 06:52:35 2022 From: thartmann at openjdk.java.net (Tobias Hartmann) Date: Fri, 6 May 2022 06:52:35 GMT Subject: RFR: 8286190: Add test to verify constant folding for Enum fields In-Reply-To: <477ODXUIboMjkT3uatiKuJZtriZMiGHpfrgfsIaTpfc=.de823dc7-5d3a-4ecc-8569-226e76522f3f@github.com> References: <477ODXUIboMjkT3uatiKuJZtriZMiGHpfrgfsIaTpfc=.de823dc7-5d3a-4ecc-8569-226e76522f3f@github.com> Message-ID: On Thu, 5 May 2022 13:35:37 GMT, Aleksey Shipilev wrote: > There is the [JDK-8161245](https://bugs.openjdk.java.net/browse/JDK-8161245) to make compilers trust Enum final fields. It was implicitly implemented by [JDK-8234049](https://bugs.openjdk.java.net/browse/JDK-8234049), which added the wildcard trust for everything in java/lang: > https://github.com/openjdk/jdk/blob/c5a0687f80367a3a284dfd56781c371826264d3b/src/hotspot/share/ci/ciField.cpp#L230 > > It would be better to have the explicit test that verifies the constant folding of Enum fields indeed happens. > > Additional testing: > - [x] Linux x86_64 fastdebug, new test passes > - [x] Linux x86_64 release, new test passes > - [x] Linux x86_32 fastdebug, new test passes Looks good. ------------- Marked as reviewed by thartmann (Reviewer). 
PR: https://git.openjdk.java.net/jdk/pull/8551 From thartmann at openjdk.java.net Fri May 6 06:54:06 2022 From: thartmann at openjdk.java.net (Tobias Hartmann) Date: Fri, 6 May 2022 06:54:06 GMT Subject: RFR: 8286104: use aggressive liveness for unstable_if traps In-Reply-To: References: Message-ID: On Thu, 5 May 2022 05:30:06 GMT, Xin Liu wrote: > I found that some phi nodes are useful because their uses are uncommon_trap nodes. In the worst case, it would hinder boxing-object elimination and scalar replacement. Test.java of JDK-8286104 reveals this issue. This patch allows c2 parser to collect liveness based on next bci for unstable_if traps. In most cases, next bci is closer to exits, so live locals are diminishing. It helps to reduce the number of inputs of unstable_if traps. > > This is not a REDO of Optimization of Box nodes in uncommon_trap(JDK-8261137). The two patches are orthogonal. I adapted the test from [TestEliminateBoxInDebugInfo.java](https://github.com/openjdk/jdk/pull/2401/files#diff-49b2e38825aa4c28ca196bdc70c3cbecc2e835c2899f4f393527df4796b177ea), so part of credit goes to the original author. I found that Scalar replacement can take care of the object `Integer ii = Integer.valueOf(value)` in original test even if it can't be removed by later inliner. I tweak the profiling data of Integer.valueOf() to hinder scalar replacement. > > This patch can cover the problem discovered by JDK-8261137. I ran the microbench and got 9x speedup on x86_64. It's almost the same as JDK-8261137. > > Before: > > Benchmark Mode Cnt Score Error Units > MyBenchmark.testMethod avgt 10 32.776 ± 0.075 us/op > > After: > > Benchmark Mode Cnt Score Error Units > MyBenchmark.testMethod avgt 10 3.656 ± 0.006 us/op > ``` > > Testing > I ran tier1 test with and without `-XX:+DeoptimizeALot`. No regression has been found yet. 
`TestAggressiveLivenessForUnstableIf.java` fails with `-ea -esa -XX:CompileThreshold=100 -XX:+UnlockExperimentalVMOptions -server -XX:-TieredCompilation`: PrintIdeal: 41 ConI === 0 [[ 42 ]] #int:999999 42 CmpI === _ 10 41 [[ 43 ]] !jvms: TestAggressiveLivenessForUnstableIf::boxing_object @ bci:10 (line 51) 40 ConI === 0 [[ 81 ]] #int:0 10 Parm === 3 [[ 81 42 ]] Parm0: int !jvms: TestAggressiveLivenessForUnstableIf::boxing_object @ bci:-1 (line 48) 43 Bool === _ 42 [[ 81 ]] [le] !jvms: TestAggressiveLivenessForUnstableIf::boxing_object @ bci:10 (line 51) 3 Start === 3 0 [[ 3 5 6 7 8 9 10 ]] #{0:control, 1:abIO, 2:memory, 3:rawptr:BotPTR, 4:return_address, 5:int} 81 CMoveI === _ 43 10 40 [[ 79 ]] #int !orig=[78] !jvms: TestAggressiveLivenessForUnstableIf::boxing_object @ bci:20 (line 55) 9 Parm === 3 [[ 79 ]] ReturnAdr !jvms: TestAggressiveLivenessForUnstableIf::boxing_object @ bci:-1 (line 48) 8 Parm === 3 [[ 79 ]] FramePtr !jvms: TestAggressiveLivenessForUnstableIf::boxing_object @ bci:-1 (line 48) 7 Parm === 3 [[ 79 ]] Memory Memory: @BotPTR *+bot, idx=Bot; !jvms: TestAggressiveLivenessForUnstableIf::boxing_object @ bci:-1 (line 48) 6 Parm === 3 [[ 79 ]] I_O !jvms: TestAggressiveLivenessForUnstableIf::boxing_object @ bci:-1 (line 48) 5 Parm === 3 [[ 79 ]] Control !jvms: TestAggressiveLivenessForUnstableIf::boxing_object @ bci:-1 (line 48) 79 Return === 5 6 7 8 9 returns 81 [[ 0 ]] 0 Root === 0 79 [[ 0 1 3 40 41 ]] inner Failed IR Rules (1) of Methods (1) ---------------------------------- 1) Method "public static int compiler.c2.irTests.TestAggressiveLivenessForUnstableIf.boxing_object(int)" - [Failed IR rules: 1]: * @IR rule 1: "@compiler.lib.ir_framework.IR(applyIf={}, applyIfAnd={}, failOn={}, applyIfOr={}, counts={"(\\\\d+(\\\\s){2}(CallStaticJava.*)+(\\\\s){2}===.*uncommon_trap.*unstable_if)", "1"}, applyIfNot={})" - counts: Graph contains wrong number of nodes: * Regex 1: (\\d+(\\s){2}(CallStaticJava.*)+(\\s){2}===.*uncommon_trap.*unstable_if) - Failed 
comparison: [found] 0 = 1 [given] - No nodes matched! ------------- PR: https://git.openjdk.java.net/jdk/pull/8545 From roland at openjdk.java.net Fri May 6 08:29:09 2022 From: roland at openjdk.java.net (Roland Westrelin) Date: Fri, 6 May 2022 08:29:09 GMT Subject: RFR: 8281429: PhiNode::Value() is too conservative for tripcount of CountedLoop [v7] In-Reply-To: References: Message-ID: On Mon, 2 May 2022 16:06:42 GMT, Vladimir Kozlov wrote: >> Roland Westrelin has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains 19 additional commits since the last revision: >> >> - undo unneeded change >> - Merge branch 'master' into JDK-8281429 >> - redo change removed by error >> - review >> - Merge branch 'master' into JDK-8281429 >> - undo >> - test fix >> - more test >> - test & fix >> - other fix >> - ... and 9 more: https://git.openjdk.java.net/jdk/compare/c82626de...19b38997 > > I am fine with testing range [MIN_VALUE + stride, MAX_VALUE - stride] to exercise unsigned arithmetic. Whatever maximum loopopts allows. @vnkozlov thanks for the review ------------- PR: https://git.openjdk.java.net/jdk/pull/7823 From roland at openjdk.java.net Fri May 6 08:29:12 2022 From: roland at openjdk.java.net (Roland Westrelin) Date: Fri, 6 May 2022 08:29:12 GMT Subject: Integrated: 8281429: PhiNode::Value() is too conservative for tripcount of CountedLoop In-Reply-To: References: Message-ID: On Tue, 15 Mar 2022 16:02:54 GMT, Roland Westrelin wrote: > The type for the iv phi of a counted loop is computed from the types > of the phi on loop entry and the type of the limit from the exit > test. Because the exit test is applied to the iv after increment, the > type of the iv phi is at least one less than the limit (for a positive > stride, one more for a negative stride). 
> > Also, for a stride whose absolute value is not 1 and constant init and > limit values, it's possible to compute the iv phi type accurately. > > This change caused a few failures and I had to make a few adjustments > to loop opts code as well. This pull request has now been integrated. Changeset: fa1ca98f Author: Roland Westrelin URL: https://git.openjdk.java.net/jdk/commit/fa1ca98fff66fb91cfd5b00404645e0574d03101 Stats: 404 lines in 7 files changed: 385 ins; 1 del; 18 mod 8281429: PhiNode::Value() is too conservative for tripcount of CountedLoop Reviewed-by: thartmann, kvn ------------- PR: https://git.openjdk.java.net/jdk/pull/7823 From duke at openjdk.java.net Fri May 6 08:32:20 2022 From: duke at openjdk.java.net (Emanuel Peter) Date: Fri, 6 May 2022 08:32:20 GMT Subject: RFR: 8286179: Node::find(int) should not traverse from new to old nodes Message-ID: **Problem:** `Node::find` traverses input and output edges of nodes during its BFS, and searches for nodes with a specific `idx`. However, if `ASSERT` is on, it also traverses `debug_orig`. This not only seems unnecessary, but Mach nodes (after matching) also point back to the old IR nodes. This means we traverse from the new graph to the old graph, and potentially find multiple nodes matching the `idx`. Only the last one found is returned; sometimes this happens to be the new node, sometimes the old node. This is inconsistent and can be quite annoying during debugging. **Implemented Solution:** 1. Remove traversing `debug_orig`. 2. Instead, add a debug-only function `old_root`, which finds the old root if it exists. Question: I now print a warning if the `old_root` cannot be found. I think this is helpful in the debugger. I could make it an assert if that is preferred. 3. `find_node` and `find_ctrl` only search in new nodes now (start BFS at new root). 4. Added `find_old_node` and `find_old_ctrl`, which search in old nodes (start BFS at old root). I hope this improves your debugging experience.
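The inconsistency described in the problem statement can be reproduced outside HotSpot. The following is only a minimal sketch under stated assumptions, not the `Node::find` implementation: `ToyNode`, `find_all` and the `follow_orig` switch are hypothetical names, and input and output edges are merged into one list for brevity. It shows that as soon as the search follows a back-pointer from a new node to its old counterpart, two distinct nodes can match the same `idx`.

```cpp
#include <queue>
#include <unordered_set>
#include <vector>

// Hypothetical stand-in for a C2 node; this is not the HotSpot Node class.
struct ToyNode {
  int idx = 0;
  std::vector<ToyNode*> edges;  // input and output edges, merged for brevity
  ToyNode* orig = nullptr;      // plays the role of debug_orig (new -> old)
};

// Breadth-first search that collects every reachable node whose idx matches.
// follow_orig == true mimics a search that also walks the orig pointer;
// follow_orig == false stays inside the graph reachable through edges only.
std::vector<ToyNode*> find_all(ToyNode* root, int idx, bool follow_orig) {
  std::vector<ToyNode*> matches;
  if (root == nullptr) return matches;
  std::unordered_set<ToyNode*> visited{root};
  std::queue<ToyNode*> worklist;
  worklist.push(root);
  while (!worklist.empty()) {
    ToyNode* n = worklist.front();
    worklist.pop();
    if (n->idx == idx) matches.push_back(n);
    std::vector<ToyNode*> next = n->edges;  // copy, so orig can be appended
    if (follow_orig && n->orig != nullptr) next.push_back(n->orig);
    for (ToyNode* m : next) {
      if (visited.insert(m).second) worklist.push(m);  // enqueue unseen nodes
    }
  }
  return matches;
}
```

With a new node and an old node that share an index, the search that follows `orig` reports both of them, while the search restricted to `edges` reports only the new one. Which of the two a "return the last match" policy picks then depends on traversal order.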
[running sanity tests to see it doesn't break something] ------------- Commit messages: - 8286179: Node::find(int) should not traverse from new to old nodes Changes: https://git.openjdk.java.net/jdk/pull/8567/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=8567&range=00 Issue: https://bugs.openjdk.java.net/browse/JDK-8286179 Stats: 34 lines in 1 file changed: 24 ins; 7 del; 3 mod Patch: https://git.openjdk.java.net/jdk/pull/8567.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8567/head:pull/8567 PR: https://git.openjdk.java.net/jdk/pull/8567 From roland at openjdk.java.net Fri May 6 09:18:27 2022 From: roland at openjdk.java.net (Roland Westrelin) Date: Fri, 6 May 2022 09:18:27 GMT Subject: RFR: 8286197: C2: Optimize MemorySegment shape in int loop [v2] In-Reply-To: References: Message-ID: > This is another small enhancement for a code shape that showed up in a > MemorySegment micro benchmark. The shape to optimize is the one from test1: > > > for (int i = 0; i < size; i++) { > long j = i * UNSAFE.ARRAY_INT_INDEX_SCALE; > > j = Objects.checkIndex(j, size * 4); > > if (((base + j) & 3) != 0) { > throw new RuntimeException(); > } > > v += UNSAFE.getInt(base + j); > } > > > In that code shape, the loop iv is first scaled, result is then casted > to long, range checked and finally address of memory location is > computed. > > The alignment check is transformed so the loop body has no check In > order to eliminate the range check, that loop is transformed into: > > > for (int i1 = ..) { > for (int i2 = ..) { > long j = (i1 + i2) * UNSAFE.ARRAY_INT_INDEX_SCALE; > > j = Objects.checkIndex(j, size * 4); > > v += UNSAFE.getInt(base + j); > } > } > > > The address shape is (AddP base (CastLL (ConvI2L (LShiftI (AddI ... > > In this case, the type of the ConvI2L is [min_jint, max_jint] and type > of CastLL is [0, max_jint] (the CastLL has a narrower type). > > I propose transforming (CastLL (ConvI2L into (ConvI2L (CastII in that > case. 
The ConvI2L and CastII types can be set to [0, max_jint]. The > new address shape is then: > > (AddP base (ConvI2L (CastII (LShiftI (AddI ... > > which optimizes well. > > (LShiftI (AddI ... > is transformed into > (AddI (LShiftI ... > because one of the AddI inputs is loop invariant (i2) and we have: > > (AddP base (ConvI2L (CastII (AddI (LShiftI ... > > Then because the ConvI2L and CastII types are [0, max_jint], the AddI > is pushed through the ConvI2L and CastII: > > (AddP base (AddL (ConvI2L (CastII (LShiftI ... > > base and one of the inputs of the AddL are loop invariant so this is > transformed into: > > (AddP (AddP ...) (ConvI2L (CastII (LShiftI ... > > The (AddP ...) is loop invariant so computed before entry. The > (ConvI2L ...) only depends on the loop iv. > > The resulting address is a shift + an add. The address before > transformation requires 2 adds + a shift. Also after unrolling, the > address of the second access in the loop is cheaper to compute as it > can be derived from the address of the first access. > > For all of this to work: > 1) I added a CastLL::Ideal transformation: > (CastLL (ConvI2L into (ConvI2L (CastII > > 2) I also had to prevent split-if from transforming (LShiftI (Phi for the > iv Phi of a counted loop. > > > test2 and test3 test 1) and 2) separately.
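The arithmetic behind the reassociation described above can be checked with a small standalone sketch. This is plain C++ rather than C2 code, and `addr_before`, `addr_after`, `inv` and `iv` are hypothetical names chosen for illustration. It assumes, as the [0, max_jint] types in the description do, that the 32-bit index arithmetic is non-negative and does not overflow.

```cpp
#include <cstdint>

// Address shape before the transformation:
//   base + (long)((inv + iv) << 2)
// Every evaluation needs a 32-bit add, a shift, a widening and a 64-bit add.
int64_t addr_before(int64_t base, int32_t inv, int32_t iv) {
  return base + static_cast<int64_t>((inv + iv) << 2);
}

// Address shape after the transformation:
//   (base + (long)(inv << 2)) + (long)(iv << 2)
// The first term depends only on loop-invariant values, so a compiler can
// compute it once before the loop; the second term depends only on iv.
int64_t addr_after(int64_t base, int32_t inv, int32_t iv) {
  int64_t invariant_part = base + static_cast<int64_t>(inv << 2);
  return invariant_part + static_cast<int64_t>(iv << 2);
}
```

Both functions return the same value whenever the stated range assumption holds. Without that assumption the two shapes can differ, because the 32-bit add in `addr_before` may wrap before the widening; this is why the description ties the transformation to the narrowed ConvI2L and CastII types.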
Roland Westrelin has updated the pull request incrementally with one additional commit since the last revision: review ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/8555/files - new: https://git.openjdk.java.net/jdk/pull/8555/files/6493429c..a122f0cf Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=8555&range=01 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=8555&range=00-01 Stats: 12 lines in 2 files changed: 5 ins; 0 del; 7 mod Patch: https://git.openjdk.java.net/jdk/pull/8555.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8555/head:pull/8555 PR: https://git.openjdk.java.net/jdk/pull/8555 From roland at openjdk.java.net Fri May 6 09:18:27 2022 From: roland at openjdk.java.net (Roland Westrelin) Date: Fri, 6 May 2022 09:18:27 GMT Subject: RFR: 8286197: C2: Optimize MemorySegment shape in int loop [v2] In-Reply-To: References: Message-ID: On Thu, 5 May 2022 16:47:13 GMT, Vladimir Kozlov wrote: >> Thanks for looking at this. t is the result of Value() which takes the type of its input into account so, AFAICT, there's no way t can be wider than t_in. Am I missing something? If not I could add an assert. What do you think? > > There is no specialized `CastLLNode::Value()` and `ConstraintCastNode` only calls `filter_speculative()` which do call `join()`. May be it is indeed enough. Yes, would be nice to have an assert to make sure we got it right. 
The new commit adds an assert ------------- PR: https://git.openjdk.java.net/jdk/pull/8555 From roland at openjdk.java.net Fri May 6 09:18:28 2022 From: roland at openjdk.java.net (Roland Westrelin) Date: Fri, 6 May 2022 09:18:28 GMT Subject: RFR: 8286197: C2: Optimize MemorySegment shape in int loop [v2] In-Reply-To: References: Message-ID: On Thu, 5 May 2022 15:48:57 GMT, Vladimir Kozlov wrote: >> Roland Westrelin has updated the pull request incrementally with one additional commit since the last revision: >> >> review > > src/hotspot/share/opto/loopopts.cpp line 1084: > >> 1082: } >> 1083: >> 1084: // Check for having no control input; not pinned. Allow > > Wrong removed space. fixed in new commit ------------- PR: https://git.openjdk.java.net/jdk/pull/8555 From bulasevich at openjdk.java.net Fri May 6 09:29:47 2022 From: bulasevich at openjdk.java.net (Boris Ulasevich) Date: Fri, 6 May 2022 09:29:47 GMT Subject: RFR: 8285378: Remove unnecessary nop for C1 exception and deopt handler In-Reply-To: References: <8SyLV_zXQ5gz0T7LsxjDmRf8BHTbScsFxSZkc8krxpY=.ebc5d1a6-6f3c-4e1d-b1c2-f5d279fbefff@github.com> Message-ID: On Fri, 6 May 2022 01:05:37 GMT, Feilong Jiang wrote: > > I will test these changes on riscv, the results will be available in about one day. > Test looks good Thank you!! ------------- PR: https://git.openjdk.java.net/jdk/pull/8341 From bulasevich at openjdk.java.net Fri May 6 09:33:48 2022 From: bulasevich at openjdk.java.net (Boris Ulasevich) Date: Fri, 6 May 2022 09:33:48 GMT Subject: Integrated: 8285378: Remove unnecessary nop for C1 exception and deopt handler In-Reply-To: <8SyLV_zXQ5gz0T7LsxjDmRf8BHTbScsFxSZkc8krxpY=.ebc5d1a6-6f3c-4e1d-b1c2-f5d279fbefff@github.com> References: <8SyLV_zXQ5gz0T7LsxjDmRf8BHTbScsFxSZkc8krxpY=.ebc5d1a6-6f3c-4e1d-b1c2-f5d279fbefff@github.com> Message-ID: On Thu, 21 Apr 2022 14:09:08 GMT, Boris Ulasevich wrote: > Each C1 method have two nops in the code body. 
They originally separated the exception/deopt handler block from the code body to fix a "bug 5/14/1999". Now Exception Handler and Deopt Handler are generated in a separate CodeSegment and these nops in the code body don't really help anyone. > > I checked jtreg tests on the following platforms: > - x86 > - ppc > - arm32 > - aarch64 > > I would be grateful if someone could check my changes on the riscv and s390 platforms. > > > [Verified Entry Point] > 0x0000ffff7c749d40: nop > 0x0000ffff7c749d44: sub x9, sp, #0x20, lsl #12 > 0x0000ffff7c749d48: str xzr, [x9] > 0x0000ffff7c749d4c: sub sp, sp, #0x40 > 0x0000ffff7c749d50: stp x29, x30, [sp, #48] > 0x0000ffff7c749d54: and w0, w2, #0x1 > 0x0000ffff7c749d58: strb w0, [x1, #12] > 0x0000ffff7c749d5c: dmb ishst > 0x0000ffff7c749d60: ldp x29, x30, [sp, #48] > 0x0000ffff7c749d64: add sp, sp, #0x40 > 0x0000ffff7c749d68: ldr x8, [x28, #808] ; {poll_return} > 0x0000ffff7c749d6c: cmp sp, x8 > 0x0000ffff7c749d70: b.hi 0x0000ffff7c749d78 // b.pmore > 0x0000ffff7c749d74: ret > # emit_slow_case_stubs > 0x0000ffff7c749d78: adr x8, 0x0000ffff7c749d68 ; {internal_word} > 0x0000ffff7c749d7c: str x8, [x28, #832] > 0x0000ffff7c749d80: b 0x0000ffff7c697480 ; {runtime_call SafepointBlob} > # Excessive nops: Exception Handler and Deopt Handler prolog > 0x0000ffff7c749d84: nop <---------------------------------------------------------------- > 0x0000ffff7c749d88: nop <---------------------------------------------------------------- > # Unwind handler: the handler to remove the activation from the stack and dispatch to the caller. 
> 0x0000ffff7c749d8c: ldr x0, [x28, #968] > 0x0000ffff7c749d90: str xzr, [x28, #968] > 0x0000ffff7c749d94: str xzr, [x28, #976] > 0x0000ffff7c749d98: ldp x29, x30, [sp, #48] > 0x0000ffff7c749d9c: add sp, sp, #0x40 > 0x0000ffff7c749da0: b 0x0000ffff7c73e000 ; {runtime_call unwind_exception Runtime1 stub} > # Stubs alignment > 0x0000ffff7c749da4: .inst 0x00000000 ; undefined > 0x0000ffff7c749da8: .inst 0x00000000 ; undefined > 0x0000ffff7c749dac: .inst 0x00000000 ; undefined > 0x0000ffff7c749db0: .inst 0x00000000 ; undefined > 0x0000ffff7c749db4: .inst 0x00000000 ; undefined > 0x0000ffff7c749db8: .inst 0x00000000 ; undefined > 0x0000ffff7c749dbc: .inst 0x00000000 ; undefined > [Exception Handler] > 0x0000ffff7c749dc0: bl 0x0000ffff7c740d00 ; {no_reloc} > 0x0000ffff7c749dc4: dcps1 #0xdeae > 0x0000ffff7c749dc8: .inst 0x853828d8 ; undefined > 0x0000ffff7c749dcc: .inst 0x0000ffff ; undefined > [Deopt Handler Code] > 0x0000ffff7c749dd0: adr x30, 0x0000ffff7c749dd0 > 0x0000ffff7c749dd4: b 0x0000ffff7c6977c0 ; {runtime_call DeoptimizationBlob} This pull request has now been integrated. 
Changeset: c6eab989 Author: Boris Ulasevich URL: https://git.openjdk.java.net/jdk/commit/c6eab989b7df6fb322aa7f0bd509918633594804 Stats: 75 lines in 6 files changed: 1 ins; 73 del; 1 mod 8285378: Remove unnecessary nop for C1 exception and deopt handler Reviewed-by: kvn, dlong ------------- PR: https://git.openjdk.java.net/jdk/pull/8341 From ngasson at openjdk.java.net Fri May 6 10:24:41 2022 From: ngasson at openjdk.java.net (Nick Gasson) Date: Fri, 6 May 2022 10:24:41 GMT Subject: RFR: 8284960: Integration of JEP 426: Vector API (Fourth Incubator) [v2] In-Reply-To: References: Message-ID: On Thu, 5 May 2022 05:47:47 GMT, Jatin Bhateja wrote: >> Hi All, >> >> Patch adds the planned support for new vector operations and APIs targeted for [JEP 426: Vector API (Fourth Incubator).](https://bugs.openjdk.java.net/browse/JDK-8280173) >> >> Following is the brief summary of changes:- >> >> 1) Extends the scope of existing lanewise API for following new vector operations. >> - VectorOperations.BIT_COUNT: counts the number of one-bits >> - VectorOperations.LEADING_ZEROS_COUNT: counts the number of leading zero bits >> - VectorOperations.TRAILING_ZEROS_COUNT: counts the number of trailing zero bits >> - VectorOperations.REVERSE: reversing the order of bits >> - VectorOperations.REVERSE_BYTES: reversing the order of bytes >> - compress and expand bits: Semantics are based on Hacker's Delight section 7-4 Compress, or Generalized Extract. >> >> 2) Adds following new APIs to perform cross lane vector compress and expansion operations under the influence of a mask. >> - Vector.compress >> - Vector.expand >> - VectorMask.compress >> >> 3) Adds predicated and non-predicated versions of following new APIs to load and store the contents of vector from foreign MemorySegments. >> - Vector.fromMemorySegment >> - Vector.intoMemorySegment >> >> 4) C2 Compiler IR enhancements and optimized X86 and AARCH64 backend support for each newly added operation. 
>> >> >> The patch has been regression-tested on AARCH64 and X86 targets at different AVX levels. >> >> Kindly review and share your feedback. >> >> Best Regards, >> Jatin > > Jatin Bhateja has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 10 commits: > > - 8284960: Correcting a typo. > - 8284960: Integrating changes from panama-vector (Add @since 19 tags). > - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8284960 > - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8284960 > - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8284960 > - 8284960: AARCH64 backend changes. > - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8284960 > - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8284960 > - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8284960 > - 8284960: Integration of JEP 426: Vector API (Fourth Incubator) `cpu/aarch64` changes look good. ------------- Marked as reviewed by ngasson (Reviewer). PR: https://git.openjdk.java.net/jdk/pull/8425 From duke at openjdk.java.net Fri May 6 11:18:48 2022 From: duke at openjdk.java.net (duke) Date: Fri, 6 May 2022 11:18:48 GMT Subject: Withdrawn: 8281518: New optimization: convert "(x|y)-(x^y)" into "x&y" In-Reply-To: References: Message-ID: On Wed, 9 Feb 2022 00:44:08 GMT, Zhiqiang Zang wrote: > Convert `(x|y)-(x^y)` into `x&y`, in `SubINode::Ideal` and `SubLNode::Ideal`. > > The results of the microbenchmark are as follows: > > Baseline: > Benchmark Mode Cnt Score Error Units > SubIdeal_XOrY_Minus_XXorY_.baselineInt avgt 60 0.481 ± 0.003 ns/op > SubIdeal_XOrY_Minus_XXorY_.baselineLong avgt 60 0.482 ± 0.004 ns/op > SubIdeal_XOrY_Minus_XXorY_.testInt avgt 60 0.901 ± 0.007 ns/op > SubIdeal_XOrY_Minus_XXorY_.testLong avgt 60 0.894 ± 0.004 ns/op > > Patch: > Benchmark Mode Cnt Score Error Units > SubIdeal_XOrY_Minus_XXorY_.baselineInt avgt 60 0.480 ±
0.003 ns/op > SubIdeal_XOrY_Minus_XXorY_.baselineLong avgt 60 0.483 ± 0.005 ns/op > SubIdeal_XOrY_Minus_XXorY_.testInt avgt 60 0.600 ± 0.004 ns/op > SubIdeal_XOrY_Minus_XXorY_.testLong avgt 60 0.602 ± 0.004 ns/op This pull request has been closed without being integrated. ------------- PR: https://git.openjdk.java.net/jdk/pull/7395 From emanuel.peter at oracle.com Fri May 6 12:45:51 2022 From: emanuel.peter at oracle.com (Emanuel Peter) Date: Fri, 6 May 2022 12:45:51 +0000 Subject: AW: Node::find(int) should not traverse from new to old nodes In-Reply-To: <62bf7cda-5b07-fe62-ca5c-29489d5241ad@oracle.com> References: <9ec32785-9b68-66cc-349b-0a8ef9cb75e6@oracle.com> <62bf7cda-5b07-fe62-ca5c-29489d5241ad@oracle.com> Message-ID: Hi Vladimir and Tobias, Thanks for the feedback. Exactly, it is helpful to search old and new nodes, but we want better control - something more consistent. You can find my suggested implementation here: https://github.com/openjdk/jdk/pull/8567 I am still very open to suggestions, reviews and feedback :) Thanks, Emanuel ________________________________ From: Tobias Hartmann Sent: Friday, 6 May 2022 08:25 To: Vladimir Kozlov ; Emanuel Peter ; hotspot-compiler-dev at openjdk.java.net Subject: Re: Node::find(int) should not traverse from new to old nodes Thanks for the write-up. I agree with Vladimir, let's go with two separate find methods. Best regards, Tobias On 05.05.22 18:16, vladimir.kozlov at oracle.com wrote: > I think the original intention to search in old IR is to find any node with the specified index. But you > are right about inconsistency of implementation. > > I prefer to have a separate search of old nodes in a separate find_old_node() method and remove such > search from default find_node() (your Solution 1). It is still useful to look through old IR when > debugging. > > Thanks, > Vladimir K > > On 5/5/22 2:25 AM, Emanuel Peter wrote: >> Hi, >> >> I have been bothered by find_node(idx) for a while.
When I am looking at the Mach graph, and >> search for a node with an idx, I sometimes get old, sometimes new nodes. The reason is that >> Node::find does not just traverse input/output edges, but also debug_orig (if ASSERT is enabled). >> Via debug_orig, most Mach nodes link to their old node from the IR of previous phases. This way we >> can find multiple nodes for an idx. Node::find returns the last one it finds - sometimes the new >> one is last, sometimes the old one is last. At least it prints both pointers to the terminal. >> >> Here, I have a detailed writeup, and some proposed solutions: >> https://bugs.openjdk.java.net/browse/JDK-8286179 >> >> Do you agree that we should fix this? Would you pick one of my solutions, or propose a new one? >> Since this is a tool that probably many people are using for debugging, I do not want to break it >> for you. >> >> Best Regards, >> Emanuel Peter From psandoz at openjdk.java.net Fri May 6 15:02:53 2022 From: psandoz at openjdk.java.net (Paul Sandoz) Date: Fri, 6 May 2022 15:02:53 GMT Subject: RFR: 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature [v3] In-Reply-To: References: Message-ID: On Fri, 6 May 2022 04:49:39 GMT, Xiaohong Gong wrote: >> offset is long so uses two argument slots (5 and 6). >> mask is argument (7). >> offsetInRange is argument(8). > > Make sense! Thanks for the explanation! Doh! of course. This is not the first and will not be the last time i get caught out by the 2-slot requirement. It may be useful to do this: Node* mask_arg = is_store ? 
argument(8) : argument(7); ------------- PR: https://git.openjdk.java.net/jdk/pull/8035 From kvn at openjdk.java.net Fri May 6 16:08:49 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Fri, 6 May 2022 16:08:49 GMT Subject: RFR: 8286179: Node::find(int) should not traverse from new to old nodes In-Reply-To: References: Message-ID: On Fri, 6 May 2022 08:24:26 GMT, Emanuel Peter wrote: > **Problem:** > `Node::find` traverses input and output edges of nodes during its BFS, and searches for nodes with a specific `idx`. > However, if `ASSERT` is on, it also traverses `debug_orig`. This not only seems unnecessary, but Mach nodes (after matching) also point back to the old IR nodes. This means we traverse from the new graph to the old graph, and potentially find multiple nodes matching the `idx`. Only the last one found is returned; sometimes this happens to be the new node, sometimes the old node. This is inconsistent and can be quite annoying during debugging. > > **Implemented Solution:** > 1. Remove traversing `debug_orig`. > 2. Instead, add a debug-only function `old_root`, which finds the old root if it exists. Question: I now print a warning if the `old_root` cannot be found. I think this is helpful in the debugger. I could make it an assert if that is preferred. > 3. `find_node` and `find_ctrl` only search in new nodes now (start BFS at new root). > 4. Added `find_old_node` and `find_old_ctrl`, which search in old nodes (start BFS at old root). > > I hope this improves your debugging experience. > [running sanity tests to see it doesn't break something] Good. I have 2 comments. src/hotspot/share/opto/node.cpp line 1616: > 1614: // Call this from debugger, search in old nodes: > 1615: Node* find_old_node(const int idx) { > 1616: return old_root()->find(idx); Need a `nullptr` check for the `old_root()` call, and maybe do nothing in that case since we will already get the warning message.
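One way to add the requested guard, shown as a minimal sketch under stated assumptions rather than as the code that was eventually committed: `ToyNode`, `g_old_root` and the wording of the warning are hypothetical stand-ins, and only the names `old_root` and `find_old_node` come from the review thread.

```cpp
#include <cstdio>

// Hypothetical stand-in for a C2 node; the real Node::find walks a graph,
// this toy version only compares its own index.
struct ToyNode {
  int idx = 0;
  ToyNode* find(int wanted) { return idx == wanted ? this : nullptr; }
};

// Hypothetical storage for the old root; nullptr means "no old graph".
static ToyNode* g_old_root = nullptr;

// Returns the old root, printing a warning when it is unavailable.
ToyNode* old_root() {
  if (g_old_root == nullptr) {
    std::fprintf(stderr, "old_root: nullptr (no old graph available)\n");
  }
  return g_old_root;
}

// Debugger entry point: search the old graph, tolerating a missing old root.
ToyNode* find_old_node(int idx) {
  ToyNode* root = old_root();
  if (root == nullptr) {
    return nullptr;  // old_root() has already printed the warning
  }
  return root->find(idx);
}
```

Returning `nullptr` instead of dereferencing keeps a debugger session alive when the helper is called before matching, when no old graph exists yet.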
src/hotspot/share/opto/node.cpp line 1631: > 1629: // Call this from debugger, search in old nodes: > 1630: Node* find_old_ctrl(const int idx) { > 1631: return old_root()->find_ctrl(idx); Need `nullptr` check. ------------- PR: https://git.openjdk.java.net/jdk/pull/8567 From kvn at openjdk.java.net Fri May 6 16:16:05 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Fri, 6 May 2022 16:16:05 GMT Subject: RFR: 8286197: C2: Optimize MemorySegment shape in int loop [v2] In-Reply-To: References: Message-ID: <3RYNOfmyJaNfCpZPhzj0BqofE9WHaAtjXXQqw8Ly2mw=.2c1104af-82ea-4025-a624-f2fc41c6141d@github.com> On Fri, 6 May 2022 09:18:27 GMT, Roland Westrelin wrote: >> This is another small enhancement for a code shape that showed up in a >> MemorySegment micro benchmark. The shape to optimize is the one from test1: >> >> >> for (int i = 0; i < size; i++) { >> long j = i * UNSAFE.ARRAY_INT_INDEX_SCALE; >> >> j = Objects.checkIndex(j, size * 4); >> >> if (((base + j) & 3) != 0) { >> throw new RuntimeException(); >> } >> >> v += UNSAFE.getInt(base + j); >> } >> >> >> In that code shape, the loop iv is first scaled, result is then casted >> to long, range checked and finally address of memory location is >> computed. >> >> The alignment check is transformed so the loop body has no check In >> order to eliminate the range check, that loop is transformed into: >> >> >> for (int i1 = ..) { >> for (int i2 = ..) { >> long j = (i1 + i2) * UNSAFE.ARRAY_INT_INDEX_SCALE; >> >> j = Objects.checkIndex(j, size * 4); >> >> v += UNSAFE.getInt(base + j); >> } >> } >> >> >> The address shape is (AddP base (CastLL (ConvI2L (LShiftI (AddI ... >> >> In this case, the type of the ConvI2L is [min_jint, max_jint] and type >> of CastLL is [0, max_jint] (the CastLL has a narrower type). >> >> I propose transforming (CastLL (ConvI2L into (ConvI2L (CastII in that >> case. The convI2L and CastII types can be set to [0, max_jint]. 
The >> new address shape is then: >> >> (AddP base (ConvI2L (CastII (LShiftI (AddI ... >> >> which optimizes well. >> >> (LShiftI (AddI ... >> is transformed into >> (AddI (LShiftI ... >> because one of the AddI inputs is loop invariant (i2) and we have: >> >> (AddP base (ConvI2L (CastII (AddI (LShiftI ... >> >> Then because the ConvI2L and CastII types are [0, max_jint], the AddI >> is pushed through the ConvI2L and CastII: >> >> (AddP base (AddL (ConvI2L (CastII (LShiftI ... >> >> base and one of the inputs of the AddL are loop invariant so this is >> transformed into: >> >> (AddP (AddP ...) (ConvI2L (CastII (LShiftI ... >> >> The (AddP ...) is loop invariant so computed before entry. The >> (ConvI2L ...) only depends on the loop iv. >> >> The resulting address is a shift + an add. The address before >> transformation requires 2 adds + a shift. Also after unrolling, the >> address of the second access in the loop is cheaper to compute as it >> can be derived from the address of the first access. >> >> For all of this to work: >> 1) I added a CastLL::Ideal transformation: >> (CastLL (ConvI2L into (ConvI2L (CastII >> >> 2) I also had to prevent split-if from transforming (LShiftI (Phi for the >> iv Phi of a counted loop. >> >> >> test2 and test3 test 1) and 2) separately. > > Roland Westrelin has updated the pull request incrementally with one additional commit since the last revision: > > review Good. I will start testing. ------------- Marked as reviewed by kvn (Reviewer).
PR: https://git.openjdk.java.net/jdk/pull/8555 From shade at openjdk.java.net Fri May 6 16:34:55 2022 From: shade at openjdk.java.net (Aleksey Shipilev) Date: Fri, 6 May 2022 16:34:55 GMT Subject: RFR: 8286190: Add test to verify constant folding for Enum fields In-Reply-To: <477ODXUIboMjkT3uatiKuJZtriZMiGHpfrgfsIaTpfc=.de823dc7-5d3a-4ecc-8569-226e76522f3f@github.com> References: <477ODXUIboMjkT3uatiKuJZtriZMiGHpfrgfsIaTpfc=.de823dc7-5d3a-4ecc-8569-226e76522f3f@github.com> Message-ID: On Thu, 5 May 2022 13:35:37 GMT, Aleksey Shipilev wrote: > There is the [JDK-8161245](https://bugs.openjdk.java.net/browse/JDK-8161245) to make compilers trust Enum final fields. It was implicitly implemented by [JDK-8234049](https://bugs.openjdk.java.net/browse/JDK-8234049), which added the wildcard trust for everything in java/lang: > https://github.com/openjdk/jdk/blob/c5a0687f80367a3a284dfd56781c371826264d3b/src/hotspot/share/ci/ciField.cpp#L230 > > It would be better to have the explicit test that verifies the constant folding of Enum fields indeed happens. > > Additional testing: > - [x] Linux x86_64 fastdebug, new test passes > - [x] Linux x86_64 release, new test passes > - [x] Linux x86_32 fastdebug, new test passes Thanks! ------------- PR: https://git.openjdk.java.net/jdk/pull/8551 From shade at openjdk.java.net Fri May 6 16:34:56 2022 From: shade at openjdk.java.net (Aleksey Shipilev) Date: Fri, 6 May 2022 16:34:56 GMT Subject: Integrated: 8286190: Add test to verify constant folding for Enum fields In-Reply-To: <477ODXUIboMjkT3uatiKuJZtriZMiGHpfrgfsIaTpfc=.de823dc7-5d3a-4ecc-8569-226e76522f3f@github.com> References: <477ODXUIboMjkT3uatiKuJZtriZMiGHpfrgfsIaTpfc=.de823dc7-5d3a-4ecc-8569-226e76522f3f@github.com> Message-ID: On Thu, 5 May 2022 13:35:37 GMT, Aleksey Shipilev wrote: > There is the [JDK-8161245](https://bugs.openjdk.java.net/browse/JDK-8161245) to make compilers trust Enum final fields. 
It was implicitly implemented by [JDK-8234049](https://bugs.openjdk.java.net/browse/JDK-8234049), which added the wildcard trust for everything in java/lang: > https://github.com/openjdk/jdk/blob/c5a0687f80367a3a284dfd56781c371826264d3b/src/hotspot/share/ci/ciField.cpp#L230 > > It would be better to have the explicit test that verifies the constant folding of Enum fields indeed happens. > > Additional testing: > - [x] Linux x86_64 fastdebug, new test passes > - [x] Linux x86_64 release, new test passes > - [x] Linux x86_32 fastdebug, new test passes This pull request has now been integrated. Changeset: 080f3c5d Author: Aleksey Shipilev URL: https://git.openjdk.java.net/jdk/commit/080f3c5d8a2f7b2d13baf98c594d4ace67608fc4 Stats: 70 lines in 1 file changed: 70 ins; 0 del; 0 mod 8286190: Add test to verify constant folding for Enum fields Reviewed-by: kvn, thartmann ------------- PR: https://git.openjdk.java.net/jdk/pull/8551 From vlivanov at openjdk.java.net Fri May 6 18:30:48 2022 From: vlivanov at openjdk.java.net (Vladimir Ivanov) Date: Fri, 6 May 2022 18:30:48 GMT Subject: RFR: 8275201: C2: hide klass() accessor from TypeOopPtr and typeKlassPtr subclasses [v2] In-Reply-To: References: Message-ID: On Mon, 28 Feb 2022 14:28:40 GMT, Roland Westrelin wrote: >> Outside the type system code itself, c2 usually assumes that a >> TypeOopPtr or a TypeKlassPtr's java type is fully represented by its >> klass(). To have proper support for interfaces, that can't be true as >> a type needs to be represented by an instance class and a set of >> interfaces. This patch hides the klass() accessor of >> TypeOopPtr/TypeKlassPtr and reworks c2 code that relies on it in a way >> that makes that code suitable for proper interface support in a >> subsequent change. This patch doesn't add proper interface support yet >> and is mostly refactoring. 
"Mostly" because there are cases where the >> previous logic would use a ciKlass but the new one works with a >> TypeKlassPtr/TypeInstPtr which carries the ciKlass and whether the >> klass is exact or not. That extra bit of information can sometimes >> help and so could result in slightly different decisions. >> >> To remove the klass() accessors, the new logic either relies on: >> >> - new methods of TypeKlassPtr/TypeInstPtr. For instance, instead of: >> toop->klass()->is_subtype_of(other_toop->klass()) >> the new code is: >> toop->is_java_subtype_of(other_toop) >> >> - variants of the klass() accessors for narrower cases like >> TypeInstPtr::instance_klass() (returns _klass except if _klass is an >> interface in which case it returns Object), >> TypeOopPtr::unloaded_klass() (returns _klass but only when the klass >> is unloaed), TypeOopPtr::exact_klass() (returns _klass but only when >> the type is exact). >> >> When I tested this patch, for most changes in this patch, I had the >> previous logic, the new logic and a check that verified that they >> return the same result. I ran as much testing as I could that way. > > Roland Westrelin has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains seven additional commits since the last revision: > > - review > - Merge branch 'master' into JDK-8275201 > - Merge branch 'master' into JDK-8275201 > - build fix > - Merge branch 'master' into JDK-8275201 > - whitespaces > - remove klass accessor Looks very good! Sorry for the delay with the review. ------------- Marked as reviewed by vlivanov (Reviewer). 
PR: https://git.openjdk.java.net/jdk/pull/6717 From xliu at openjdk.java.net Fri May 6 19:06:50 2022 From: xliu at openjdk.java.net (Xin Liu) Date: Fri, 6 May 2022 19:06:50 GMT Subject: RFR: 8286104: use aggressive liveness for unstable_if traps In-Reply-To: References: Message-ID: On Thu, 5 May 2022 05:30:06 GMT, Xin Liu wrote: > I found that some phi nodes are useful because its uses are uncommon_trap nodes. In worse case, it would hinder boxing object eliminating and scalar replacement. Test.java of JDK-8286104 reveals this issue. This patch allows c2 parser to collect liveness based on next bci for unstable_if traps. In most cases, next bci is closer to exits, so live locals are diminishing. It helps to reduce the number of inputs of unstable_if traps. > > This is not a REDO of Optimization of Box nodes in uncommon_trap(JDK-8261137). Two patches are orthogonal. I adapt test from [TestEliminateBoxInDebugInfo.java](https://github.com/openjdk/jdk/pull/2401/files#diff-49b2e38825aa4c28ca196bdc70c3cbecc2e835c2899f4f393527df4796b177ea), so part of credit goes to the original author. I found that Scalar replacement can take care of the object `Integer ii = Integer.valueOf(value)` in original test even it can't be removed by later inliner. I tweak the profiling data of Integer.valueOf() to hinder scalar replacement. > > This patch can cover the problem discovered by JDK-8261137. I ran the microbench and got 9x speedup on x86_64. It's almost same as JDK-8261137. > > Before: > > Benchmark Mode Cnt Score Error Units > MyBenchmark.testMethod avgt 10 32.776 ? 0.075 us/op > > After: > > Benchmark Mode Cnt Score Error Units > MyBenchmark.testMethod avgt 10 3.656 ? 0.006 us/op > ``` > > Testing > I ran tier1 test with and without `-XX:+DeoptimizeALot`. No regression has been found yet. hi, Vladimir and Tobias, Thank you for reviewing this patch. > Are you doing BCI and SP manipulation only to affect result of liveness_at_bci() call in kill_dead_locals()? 
May it can be done less disruptive way.

Yes, that is my goal. I tried the less disruptive way. One corner case is that the if bytecode may reference an object which happens to be dead at the next bci. It would be wrong if we avoid restoring it in deoptimization. As a result, I have to parse the if-bytecode.

Then I came up with the idea that we can redefine the semantics of `unstable_if trap`, which is part of neither the JVM nor the Java language specification. I change 'bci' of unstable_if trap from the if branch to the next bci. My assumptions are as follows. I think they are both true.

1. the original bc has no side-effect.
2. reexecution in interpreter takes the other path, or next bci.

> I am concern that different bci and sp may affect correctness of uncommon trap call generation (its debug info). It affects result of compute_stack_effects(), too_many_recompiles() and should_reexecute_implied_by_bytecode(). In addition logs about such uncommon trap will be different.

I want to change debuginfo. This [example](https://bugs.openjdk.java.net/browse/JDK-8276998?focusedCommentId=14492303&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-14492303) shows that we don't need to save the scalarized object in debuginfo. We don't need to repush arguments to stack(sp) for comparison either.

Both `compute_stack_effects()` and `should_reexecute_implied_by_bytecode()` are called by `GraphKit::add_safepoint_edges()`. We only call it when current bc() is associated with a function call or an allocation. I don't think if-family bytecodes need to call it.

There are 2 cases in Parse::do_if(). We do use `PreserveJVMState` to save both bci and sp for the taken branch, but we don't use PreserveJVMState for the non-taken branch. I am truly concerned about it. So far, my explanation is that we only change bci for unstable_if, so stopped() has been true after that. Besides that, it's the last bytecode of a basic block, so the parser will reset bci when it processes the next basic block.
What do you think about it? ------------- PR: https://git.openjdk.java.net/jdk/pull/8545 From dlong at openjdk.java.net Fri May 6 19:50:47 2022 From: dlong at openjdk.java.net (Dean Long) Date: Fri, 6 May 2022 19:50:47 GMT Subject: RFR: 8286263: compiler/c1/TestPinnedIntrinsics.java failed with "RuntimeException: testCurrentTimeMillis failed with -3" In-Reply-To: References: Message-ID: On Fri, 6 May 2022 04:42:50 GMT, Dean Long wrote: > This test incorrectly assumes calls to System.currentTimeMillis() are monotonic. The only fix I can think of is to remove that test and leave the test for System.nanoTime(). Thanks Tobias and Vladimir. ------------- PR: https://git.openjdk.java.net/jdk/pull/8566 From dlong at openjdk.java.net Fri May 6 19:50:47 2022 From: dlong at openjdk.java.net (Dean Long) Date: Fri, 6 May 2022 19:50:47 GMT Subject: Integrated: 8286263: compiler/c1/TestPinnedIntrinsics.java failed with "RuntimeException: testCurrentTimeMillis failed with -3" In-Reply-To: References: Message-ID: <8a83aLXx6_rMAy2GVVgkwkdQCdX1wuBK-XHbDdQJUQo=.f1c6bc90-2377-4014-adad-811966885248@github.com> On Fri, 6 May 2022 04:42:50 GMT, Dean Long wrote: > This test incorrectly assumes calls to System.currentTimeMillis() are monotonic. The only fix I can think of is to remove that test and leave the test for System.nanoTime(). This pull request has now been integrated. 
Changeset: bb52ea68 Author: Dean Long URL: https://git.openjdk.java.net/jdk/commit/bb52ea6820ee749b1ac07485cf1ef65c40048f13 Stats: 15 lines in 1 file changed: 0 ins; 14 del; 1 mod 8286263: compiler/c1/TestPinnedIntrinsics.java failed with "RuntimeException: testCurrentTimeMillis failed with -3" Reviewed-by: thartmann, kvn ------------- PR: https://git.openjdk.java.net/jdk/pull/8566 From dcubed at openjdk.java.net Fri May 6 19:58:15 2022 From: dcubed at openjdk.java.net (Daniel D.Daugherty) Date: Fri, 6 May 2022 19:58:15 GMT Subject: Integrated: 8286342: ProblemList compiler/c2/irTests/TestEnumFinalFold.java Message-ID: A trivial fix to ProblemList compiler/c2/irTests/TestEnumFinalFold.java. ------------- Commit messages: - 8286342: ProblemList compiler/c2/irTests/TestEnumFinalFold.java Changes: https://git.openjdk.java.net/jdk/pull/8582/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=8582&range=00 Issue: https://bugs.openjdk.java.net/browse/JDK-8286342 Stats: 1 line in 1 file changed: 1 ins; 0 del; 0 mod Patch: https://git.openjdk.java.net/jdk/pull/8582.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8582/head:pull/8582 PR: https://git.openjdk.java.net/jdk/pull/8582 From mikael at openjdk.java.net Fri May 6 19:58:15 2022 From: mikael at openjdk.java.net (Mikael Vidstedt) Date: Fri, 6 May 2022 19:58:15 GMT Subject: Integrated: 8286342: ProblemList compiler/c2/irTests/TestEnumFinalFold.java In-Reply-To: References: Message-ID: On Fri, 6 May 2022 19:47:50 GMT, Daniel D. Daugherty wrote: > A trivial fix to ProblemList compiler/c2/irTests/TestEnumFinalFold.java. Marked as reviewed by mikael (Reviewer). 
------------- PR: https://git.openjdk.java.net/jdk/pull/8582 From dcubed at openjdk.java.net Fri May 6 19:58:16 2022 From: dcubed at openjdk.java.net (Daniel D.Daugherty) Date: Fri, 6 May 2022 19:58:16 GMT Subject: Integrated: 8286342: ProblemList compiler/c2/irTests/TestEnumFinalFold.java In-Reply-To: References: Message-ID: On Fri, 6 May 2022 19:49:59 GMT, Mikael Vidstedt wrote: >> A trivial fix to ProblemList compiler/c2/irTests/TestEnumFinalFold.java. > > Marked as reviewed by mikael (Reviewer). @vidmik - Thanks for the lightning fast review! ------------- PR: https://git.openjdk.java.net/jdk/pull/8582 From dcubed at openjdk.java.net Fri May 6 19:58:17 2022 From: dcubed at openjdk.java.net (Daniel D.Daugherty) Date: Fri, 6 May 2022 19:58:17 GMT Subject: Integrated: 8286342: ProblemList compiler/c2/irTests/TestEnumFinalFold.java In-Reply-To: References: Message-ID: On Fri, 6 May 2022 19:47:50 GMT, Daniel D. Daugherty wrote: > A trivial fix to ProblemList compiler/c2/irTests/TestEnumFinalFold.java. This pull request has now been integrated. Changeset: d8f9686b Author: Daniel D. Daugherty URL: https://git.openjdk.java.net/jdk/commit/d8f9686b123bc9f0521da0cd286726c3b4327abd Stats: 1 line in 1 file changed: 1 ins; 0 del; 0 mod 8286342: ProblemList compiler/c2/irTests/TestEnumFinalFold.java Reviewed-by: mikael ------------- PR: https://git.openjdk.java.net/jdk/pull/8582 From vlivanov at openjdk.java.net Fri May 6 20:23:48 2022 From: vlivanov at openjdk.java.net (Vladimir Ivanov) Date: Fri, 6 May 2022 20:23:48 GMT Subject: RFR: 8282218: C1: Missing side effects of dynamic class loading during constant linkage [v2] In-Reply-To: References: Message-ID: > (The problem is similar to JDK-8282194, but with class loading this time.) > > C1 handles unresolved constants by performing constant resolution at runtime and then putting the constant value into the generated code by patching it. But it treats the not-yet-resolved value as a pure constant without any side effects. 
> > It's not the case for constants which trigger class loading using custom class loaders. (All non-String constants do that.) > > There are no guarantees that there are no side effects during class loading, so C1 has to be conservative. > > Proposed fix kills memory after accessing not-yet-loaded constant in the context of any non-trusted class loader. > > Testing: hs-tier1 - hs-tier4 Vladimir Ivanov has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains three additional commits since the last revision: - helper method to access constant pool tag - Merge branch 'master' into 8282218.c1.class_loading - 8282218: C1: Missing side effects of dynamic class loading during constant linkage ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/7612/files - new: https://git.openjdk.java.net/jdk/pull/7612/files/bf442f0f..6f2cf416 Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=7612&range=01 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=7612&range=00-01 Stats: 379198 lines in 5868 files changed: 272514 ins; 51118 del; 55566 mod Patch: https://git.openjdk.java.net/jdk/pull/7612.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/7612/head:pull/7612 PR: https://git.openjdk.java.net/jdk/pull/7612 From vlivanov at openjdk.java.net Fri May 6 20:23:48 2022 From: vlivanov at openjdk.java.net (Vladimir Ivanov) Date: Fri, 6 May 2022 20:23:48 GMT Subject: RFR: 8282218: C1: Missing side effects of dynamic class loading during constant linkage [v2] In-Reply-To: References: Message-ID: On Thu, 24 Feb 2022 19:52:57 GMT, Vladimir Kozlov wrote: >> Vladimir Ivanov has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. 
The pull request contains three additional commits since the last revision: >> >> - helper method to access constant pool tag >> - Merge branch 'master' into 8282218.c1.class_loading >> - 8282218: C1: Missing side effects of dynamic class loading during constant linkage > > src/hotspot/share/ci/ciStreams.hpp line 258: > >> 256: >> 257: int index = get_constant_pool_index(); >> 258: constantTag tag = get_raw_pool_tag(index); > > Looks like these lines are the same as in `is_dynamic_constant()`. Can you move them in separate method to avoid duplication? Ok, I introduced a no-arg `get_raw_pool_tag()` helper method. Does it look better? ------------- PR: https://git.openjdk.java.net/jdk/pull/7612 From xliu at openjdk.java.net Fri May 6 21:14:39 2022 From: xliu at openjdk.java.net (Xin Liu) Date: Fri, 6 May 2022 21:14:39 GMT Subject: RFR: 8286104: use aggressive liveness for unstable_if traps In-Reply-To: References: Message-ID: On Thu, 5 May 2022 18:22:32 GMT, Vladimir Kozlov wrote: >> I found that some phi nodes are useful because its uses are uncommon_trap nodes. In worse case, it would hinder boxing object eliminating and scalar replacement. Test.java of JDK-8286104 reveals this issue. This patch allows c2 parser to collect liveness based on next bci for unstable_if traps. In most cases, next bci is closer to exits, so live locals are diminishing. It helps to reduce the number of inputs of unstable_if traps. >> >> This is not a REDO of Optimization of Box nodes in uncommon_trap(JDK-8261137). Two patches are orthogonal. I adapt test from [TestEliminateBoxInDebugInfo.java](https://github.com/openjdk/jdk/pull/2401/files#diff-49b2e38825aa4c28ca196bdc70c3cbecc2e835c2899f4f393527df4796b177ea), so part of credit goes to the original author. I found that Scalar replacement can take care of the object `Integer ii = Integer.valueOf(value)` in original test even it can't be removed by later inliner. I tweak the profiling data of Integer.valueOf() to hinder scalar replacement. 
>>
>> This patch can cover the problem discovered by JDK-8261137. I ran the microbench and got 9x speedup on x86_64. It's almost same as JDK-8261137.
>>
>> Before:
>>
>> Benchmark               Mode  Cnt   Score   Error  Units
>> MyBenchmark.testMethod  avgt   10  32.776 ± 0.075  us/op
>>
>> After:
>>
>> Benchmark               Mode  Cnt   Score   Error  Units
>> MyBenchmark.testMethod  avgt   10   3.656 ± 0.006  us/op
>> ```
>>
>> Testing
>> I ran tier1 test with and without `-XX:+DeoptimizeALot`. No regression has been found yet.
>
> Are you doing BCI and SP manipulation only to affect result of `liveness_at_bci()` call in `kill_dead_locals()`? May it can be done less disruptive way.
> I am concern that different `bci` and `sp` may affect correctness of uncommon trap call generation (its debug info). It affects result of `compute_stack_effects()`, `too_many_recompiles()` and `should_reexecute_implied_by_bytecode()`.
> In addition logs about such uncommon trap will be different.

hi, @vnkozlov
I agree with you that too_many_recompiles() is wrong in seems_stable_comparison. I will correct this.

    bool Parse::seems_stable_comparison() const {
      if (C->too_many_traps(method(), bci(), Deoptimization::Reason_unstable_if)) {
        return false;
      }
      return true;
    }

-------------

PR: https://git.openjdk.java.net/jdk/pull/8545

From kvn at openjdk.java.net Fri May 6 21:48:41 2022
From: kvn at openjdk.java.net (Vladimir Kozlov)
Date: Fri, 6 May 2022 21:48:41 GMT
Subject: RFR: 8286104: use aggressive liveness for unstable_if traps
In-Reply-To:
References:
Message-ID:

On Fri, 6 May 2022 19:03:08 GMT, Xin Liu wrote:

> > Are you doing BCI and SP manipulation only to affect result of liveness_at_bci() call in kill_dead_locals()? May it can be done less disruptive way.
>
> yes, that is my goal. I tried the less disruptive way. One corner case is that the if bytecode may reference an object which happens to be dead at the next bci. It would be wrong if we avoid restoring it in deoptimization. As a result, I have to parse the if-bytecode.

Got it.
So the idea is that we don't need to re-materialize Object in case the path which will be taken after deoptimization in Interpreter will not use it. This seems reasonable optimization. But I am still not sure that current implementation (shift uncommon trap to next bc) is valid. > > Then I came up the idea that we can redefine the semantic of `unstable_if trap`, which is not part of JVM nor Java language specification. I change 'bci' of unstable_if trap from the if branch to the next bci. My assumption is as follows. I think they are both true. > > 1. the original bc has no side-effect. > 2. reexecution in interpreter takes the other path, or next bci. First, for your information we re-execute `if` to update its profiling information. For example, if it was some klass check we want to record new value in MDO. That is why we push arguments of `if` back on stack. New value can happen rarely and we may loose it if not recorded during such deoptimization. I am not sure how your changes verify your assumptions. And I am not clear what you mean in 2. Re-execution in Interpreter will be done according to information in uncommon trap (bci). > > > I am concern that different bci and sp may affect correctness of uncommon trap call generation (its debug info). It affects result of compute_stack_effects(), too_many_recompiles() and should_reexecute_implied_by_bytecode(). > > In addition logs about such uncommon trap will be different. > > I want to change debuginfo. This [example](https://bugs.openjdk.java.net/browse/JDK-8276998?focusedCommentId=14492303&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-14492303) shows that we don't need to save the scalarized object in debuginfo. We don't need to repush arguments to stack(sp) for comparison either. If you re-execute `if` you need its arguments on stack. I am not sure why you said we don't. 
> > Both `compute_stack_effects()` and `should_reexecute_implied_by_bytecode()` are called by `GraphKit::add_safepoint_edges()`. we only call it when current bc() is associated with a function call or an allocation. I don't think if-family bytecodes need to call it. `add_safepoint_edges()` is called for `uncommon_trap` which is generated as runtime call. I agree that `compute_stack_effects()` may not be called because `must_throw` is false in this case. But `should_reexecute_implied_by_bytecode()` could be called I think and you may get incorrect answer from `Interpreter::bytecode_should_reexecute(code)` because you changed BCI. > > There are 2 cases in Parse::do_if(). We do use `PreserveJVMState` to save both bci and sp for the taken branch, but we don't use PreserveJVMState for non-taken branch. I am truly concerned about it. so far, my explanation is that we only change bci for unstable_if, so stopped() has been true after that. Besides that, it's the last bytecode of a basic block, parser will reset bci when it processes the next basic block. What do you think about it? PreserveJVMState is used there only to provide correct starting JVM state for other branch processing. After merge JVM states from both branches are merged. ------------- PR: https://git.openjdk.java.net/jdk/pull/8545 From kvn at openjdk.java.net Fri May 6 21:51:01 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Fri, 6 May 2022 21:51:01 GMT Subject: RFR: 8282218: C1: Missing side effects of dynamic class loading during constant linkage [v2] In-Reply-To: References: Message-ID: On Fri, 6 May 2022 20:23:48 GMT, Vladimir Ivanov wrote: >> (The problem is similar to JDK-8282194, but with class loading this time.) >> >> C1 handles unresolved constants by performing constant resolution at runtime and then putting the constant value into the generated code by patching it. But it treats the not-yet-resolved value as a pure constant without any side effects. 
>> >> It's not the case for constants which trigger class loading using custom class loaders. (All non-String constants do that.) >> >> There are no guarantees that there are no side effects during class loading, so C1 has to be conservative. >> >> Proposed fix kills memory after accessing not-yet-loaded constant in the context of any non-trusted class loader. >> >> Testing: hs-tier1 - hs-tier4 > > Vladimir Ivanov has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains three additional commits since the last revision: > > - helper method to access constant pool tag > - Merge branch 'master' into 8282218.c1.class_loading > - 8282218: C1: Missing side effects of dynamic class loading during constant linkage Update looks good. Thanks! ------------- Marked as reviewed by kvn (Reviewer). PR: https://git.openjdk.java.net/jdk/pull/7612 From kvn at openjdk.java.net Fri May 6 22:15:00 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Fri, 6 May 2022 22:15:00 GMT Subject: RFR: 8286104: use aggressive liveness for unstable_if traps In-Reply-To: References: Message-ID: On Thu, 5 May 2022 05:30:06 GMT, Xin Liu wrote: > I found that some phi nodes are useful because its uses are uncommon_trap nodes. In worse case, it would hinder boxing object eliminating and scalar replacement. Test.java of JDK-8286104 reveals this issue. This patch allows c2 parser to collect liveness based on next bci for unstable_if traps. In most cases, next bci is closer to exits, so live locals are diminishing. It helps to reduce the number of inputs of unstable_if traps. > > This is not a REDO of Optimization of Box nodes in uncommon_trap(JDK-8261137). Two patches are orthogonal. 
I adapt test from [TestEliminateBoxInDebugInfo.java](https://github.com/openjdk/jdk/pull/2401/files#diff-49b2e38825aa4c28ca196bdc70c3cbecc2e835c2899f4f393527df4796b177ea), so part of credit goes to the original author. I found that Scalar replacement can take care of the object `Integer ii = Integer.valueOf(value)` in original test even it can't be removed by later inliner. I tweak the profiling data of Integer.valueOf() to hinder scalar replacement. > > This patch can cover the problem discovered by JDK-8261137. I ran the microbench and got 9x speedup on x86_64. It's almost same as JDK-8261137. > > Before: > > Benchmark Mode Cnt Score Error Units > MyBenchmark.testMethod avgt 10 32.776 ? 0.075 us/op > > After: > > Benchmark Mode Cnt Score Error Units > MyBenchmark.testMethod avgt 10 3.656 ? 0.006 us/op > ``` > > Testing > I ran tier1 test with and without `-XX:+DeoptimizeALot`. No regression has been found yet. Also in the test you pointed new uncommon trap is placed at `bci` 21 which is after merge point of branches. Such uncommon trap should have information about merged values in debug info. And your changes do not provide that. I think we need to limit this optimization to case when targeted BC is inside branch (`if else` code). ------------- PR: https://git.openjdk.java.net/jdk/pull/8545 From kvn at openjdk.java.net Fri May 6 22:36:55 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Fri, 6 May 2022 22:36:55 GMT Subject: RFR: 8286104: use aggressive liveness for unstable_if traps In-Reply-To: References: Message-ID: On Thu, 5 May 2022 05:30:06 GMT, Xin Liu wrote: > I found that some phi nodes are useful because its uses are uncommon_trap nodes. In worse case, it would hinder boxing object eliminating and scalar replacement. Test.java of JDK-8286104 reveals this issue. This patch allows c2 parser to collect liveness based on next bci for unstable_if traps. In most cases, next bci is closer to exits, so live locals are diminishing. 
It helps to reduce the number of inputs of unstable_if traps. > > This is not a REDO of Optimization of Box nodes in uncommon_trap(JDK-8261137). Two patches are orthogonal. I adapt test from [TestEliminateBoxInDebugInfo.java](https://github.com/openjdk/jdk/pull/2401/files#diff-49b2e38825aa4c28ca196bdc70c3cbecc2e835c2899f4f393527df4796b177ea), so part of credit goes to the original author. I found that Scalar replacement can take care of the object `Integer ii = Integer.valueOf(value)` in original test even it can't be removed by later inliner. I tweak the profiling data of Integer.valueOf() to hinder scalar replacement. > > This patch can cover the problem discovered by JDK-8261137. I ran the microbench and got 9x speedup on x86_64. It's almost same as JDK-8261137. > > Before: > > Benchmark Mode Cnt Score Error Units > MyBenchmark.testMethod avgt 10 32.776 ? 0.075 us/op > > After: > > Benchmark Mode Cnt Score Error Units > MyBenchmark.testMethod avgt 10 3.656 ? 0.006 us/op > ``` > > Testing > I ran tier1 test with and without `-XX:+DeoptimizeALot`. No regression has been found yet. Or not move uncommon trap - keep it at `if` bci. Which I prefer but it could be more complex changes as you said. ------------- PR: https://git.openjdk.java.net/jdk/pull/8545 From xliu at openjdk.java.net Fri May 6 22:36:57 2022 From: xliu at openjdk.java.net (Xin Liu) Date: Fri, 6 May 2022 22:36:57 GMT Subject: RFR: 8286104: use aggressive liveness for unstable_if traps In-Reply-To: References: Message-ID: On Fri, 6 May 2022 21:44:55 GMT, Vladimir Kozlov wrote: > And I am not clear what you mean in 2. Re-execution in Interpreter will be done according to information in uncommon trap (bci). first, I have to admit that I overlook the fact that reexecution will update MDO of if bytecode. For item 2: reexecution in interpreter takes the other path, or next bci. 
Here is my thought: when HotSpot reinterprets the original bc, it must take the unstable path, otherwise this deoptimization shouldn't happen in the first place. In this patch, I just shift bci of uncommon_trap to next_bci, not only in IR, but also in ScopeDesc! Interpreter will start over with the shifted bci. Here is what I generated for the unstable_if of [Test::foo ](https://bugs.openjdk.java.net/browse/JDK-8276998?focusedCommentId=14492303&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-14492303) using this patch. please note that the bci changed from 11 to 21. The message pasted in [JDK-8276998](https://bugs.openjdk.java.net/browse/JDK-8276998) was from my initial patch. I retained bci for uncommon_trap initially. I think this change is simpler. 02f call,static wrapper for: uncommon_trap(reason='unstable_if' action='reinterpret' debug_id='0') # Test::foo @ bci:21 (line 67) L[0]=_ L[1]=_ L[2]=#0 # OopMap {off=52/0x34} 034 stop # ShouldNotReachHere I admit my change misses to update MDO in deoptimization. I think it's fine because HotSpot won't recompile this method again until interpreter evaluates it thousands of times. The MDO still gets updated. ------------- PR: https://git.openjdk.java.net/jdk/pull/8545 From kvn at openjdk.java.net Fri May 6 22:43:48 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Fri, 6 May 2022 22:43:48 GMT Subject: RFR: 8286104: use aggressive liveness for unstable_if traps In-Reply-To: References: Message-ID: <6j5-fkcqNLNpblA4P2i4A2EGN40O3_mprSKLAxt2wSY=.e35166cf-2249-4d48-90e6-8e4dfda941a9@github.com> On Thu, 5 May 2022 05:30:06 GMT, Xin Liu wrote: > I found that some phi nodes are useful because its uses are uncommon_trap nodes. In worse case, it would hinder boxing object eliminating and scalar replacement. Test.java of JDK-8286104 reveals this issue. This patch allows c2 parser to collect liveness based on next bci for unstable_if traps. 
In most cases, next bci is closer to exits, so live locals are diminishing. It helps to reduce the number of inputs of unstable_if traps. > > This is not a REDO of Optimization of Box nodes in uncommon_trap(JDK-8261137). Two patches are orthogonal. I adapt test from [TestEliminateBoxInDebugInfo.java](https://github.com/openjdk/jdk/pull/2401/files#diff-49b2e38825aa4c28ca196bdc70c3cbecc2e835c2899f4f393527df4796b177ea), so part of credit goes to the original author. I found that Scalar replacement can take care of the object `Integer ii = Integer.valueOf(value)` in original test even it can't be removed by later inliner. I tweak the profiling data of Integer.valueOf() to hinder scalar replacement. > > This patch can cover the problem discovered by JDK-8261137. I ran the microbench and got 9x speedup on x86_64. It's almost same as JDK-8261137. > > Before: > > Benchmark Mode Cnt Score Error Units > MyBenchmark.testMethod avgt 10 32.776 ? 0.075 us/op > > After: > > Benchmark Mode Cnt Score Error Units > MyBenchmark.testMethod avgt 10 3.656 ? 0.006 us/op > ``` > > Testing > I ran tier1 test with and without `-XX:+DeoptimizeALot`. No regression has been found yet. Item 2. Yes, that is what will happened and that is why we may do this optimization. Your original words were confusing. Again, MDO may not be update for rare case even after running in Interpreter for some time. As result recompiled code will be the same and we again hit unc trap. In my additional comment I stated that placing uncommon trap to BC after merge point is wrong. You may not have all info in general cases (several branches merging to the same BC). 
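
To make the merge-point concern concrete, here is a minimal Java sketch of the two source shapes being contrasted. It is a hypothetical illustration, not the Test.java attached to JDK-8286104: the class and method names (`UnstableIfShape`, `rareBranchHasOwnBlock`, `emptyElseFallsToMerge`) are invented for this note, and whether C2 actually emits an `unstable_if` trap or eliminates the box depends on profiling, inlining and the VM build.

```java
// Hypothetical illustration only; all names are invented for this sketch.
class UnstableIfShape {

    // Shape 1: the rarely taken side has its own block, so the bytecode the
    // interpreter would continue at lies inside the branch, and 'boxed' is
    // not used on that path.
    static int rareBranchHasOwnBlock(int value, boolean rare) {
        Integer boxed = Integer.valueOf(value);
        if (rare) {
            return -1;
        }
        return boxed + 1; // only use of 'boxed'
    }

    // Shape 2: there is no else block, so when 'common' is false control
    // falls straight through to the merge point, where 'result' may come
    // from either side of the 'if'.
    static int emptyElseFallsToMerge(int value, boolean common) {
        int result = 0;
        Integer boxed = Integer.valueOf(value);
        if (common) {
            result = boxed + 1;
        }
        return result; // merge point
    }

    public static void main(String[] args) {
        // Warm up with the common outcome only, so the other side of each
        // 'if' is never seen by the profiler.
        long sum = 0;
        for (int i = 0; i < 20_000; i++) {
            sum += rareBranchHasOwnBlock(i & 7, false);
            sum += emptyElseFallsToMerge(i & 7, true);
        }
        System.out.println(sum);
        // Now take the previously unseen side once.
        System.out.println(rareBranchHasOwnBlock(3, true));
        System.out.println(emptyElseFallsToMerge(3, false));
    }
}
```

In the first shape the continuation point after the `if` belongs to a single path, which is the case the suggested restriction would still allow. In the second shape the continuation point is shared by both paths, which is the situation described above where the debug info would need the merged values.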
------------- PR: https://git.openjdk.java.net/jdk/pull/8545 From xliu at openjdk.java.net Fri May 6 23:05:40 2022 From: xliu at openjdk.java.net (Xin Liu) Date: Fri, 6 May 2022 23:05:40 GMT Subject: RFR: 8286104: use aggressive liveness for unstable_if traps In-Reply-To: References: Message-ID: <4jZ9dcs41GvrxkGqCbdYu_m3ahDRC0E1LyfXtOAx_Gw=.763af08d-7450-417f-95d3-8951c3b9045e@github.com> On Thu, 5 May 2022 05:30:06 GMT, Xin Liu wrote: > I found that some phi nodes are useful because its uses are uncommon_trap nodes. In worse case, it would hinder boxing object eliminating and scalar replacement. Test.java of JDK-8286104 reveals this issue. This patch allows c2 parser to collect liveness based on next bci for unstable_if traps. In most cases, next bci is closer to exits, so live locals are diminishing. It helps to reduce the number of inputs of unstable_if traps. > > This is not a REDO of Optimization of Box nodes in uncommon_trap(JDK-8261137). Two patches are orthogonal. I adapt test from [TestEliminateBoxInDebugInfo.java](https://github.com/openjdk/jdk/pull/2401/files#diff-49b2e38825aa4c28ca196bdc70c3cbecc2e835c2899f4f393527df4796b177ea), so part of credit goes to the original author. I found that Scalar replacement can take care of the object `Integer ii = Integer.valueOf(value)` in original test even it can't be removed by later inliner. I tweak the profiling data of Integer.valueOf() to hinder scalar replacement. > > This patch can cover the problem discovered by JDK-8261137. I ran the microbench and got 9x speedup on x86_64. It's almost same as JDK-8261137. > > Before: > > Benchmark Mode Cnt Score Error Units > MyBenchmark.testMethod avgt 10 32.776 ? 0.075 us/op > > After: > > Benchmark Mode Cnt Score Error Units > MyBenchmark.testMethod avgt 10 3.656 ? 0.006 us/op > ``` > > Testing > I ran tier1 test with and without `-XX:+DeoptimizeALot`. No regression has been found yet. hi, Vladimir, I really appreciate you look into this issue. 
Your inputs are super helpful. I found that we have multiple options to attack Huawei's problem. I have yet another idea... C2's speculative compilation is fascinating. The motive for pruning an infrequent branch should be that we can simplify IR and shrink code size. Back to Test::foo(), the else block is blank... Pruning a trivial basic block actually makes IR more complex. We should not replace a trivial basic block with unstable_if. What Tobias reported is a great example. C2 compiled foo() before it became mature. Therefore, C2 refrained from its "heroic optimization". It turns out that c2 selects CMOVE and gets simpler, faster and smaller code...

thanks,
--lx

-------------

PR: https://git.openjdk.java.net/jdk/pull/8545

From vlivanov at openjdk.java.net Fri May 6 23:13:45 2022
From: vlivanov at openjdk.java.net (Vladimir Ivanov)
Date: Fri, 6 May 2022 23:13:45 GMT
Subject: RFR: 8282218: C1: Missing side effects of dynamic class loading during constant linkage [v2]
In-Reply-To:
References:
Message-ID:

On Fri, 6 May 2022 20:23:48 GMT, Vladimir Ivanov wrote:

>> (The problem is similar to JDK-8282194, but with class loading this time.)
>>
>> C1 handles unresolved constants by performing constant resolution at runtime and then putting the constant value into the generated code by patching it. But it treats the not-yet-resolved value as a pure constant without any side effects.
>>
>> It's not the case for constants which trigger class loading using custom class loaders. (All non-String constants do that.)
>>
>> There are no guarantees that there are no side effects during class loading, so C1 has to be conservative.
>>
>> Proposed fix kills memory after accessing not-yet-loaded constant in the context of any non-trusted class loader.
>>
>> Testing: hs-tier1 - hs-tier4
>
> Vladimir Ivanov has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase.
The pull request contains three additional commits since the last revision: > > - helper method to access constant pool tag > - Merge branch 'master' into 8282218.c1.class_loading > - 8282218: C1: Missing side effects of dynamic class loading during constant linkage Thanks for the reviews, Vladimir & Tobias. ------------- PR: https://git.openjdk.java.net/jdk/pull/7612 From vlivanov at openjdk.java.net Fri May 6 23:13:47 2022 From: vlivanov at openjdk.java.net (Vladimir Ivanov) Date: Fri, 6 May 2022 23:13:47 GMT Subject: Integrated: 8282218: C1: Missing side effects of dynamic class loading during constant linkage In-Reply-To: References: Message-ID: On Thu, 24 Feb 2022 13:51:18 GMT, Vladimir Ivanov wrote: > (The problem is similar to JDK-8282194, but with class loading this time.) > > C1 handles unresolved constants by performing constant resolution at runtime and then putting the constant value into the generated code by patching it. But it treats the not-yet-resolved value as a pure constant without any side effects. > > It's not the case for constants which trigger class loading using custom class loaders. (All non-String constants do that.) > > There are no guarantees that there are no side effects during class loading, so C1 has to be conservative. > > Proposed fix kills memory after accessing not-yet-loaded constant in the context of any non-trusted class loader. > > Testing: hs-tier1 - hs-tier4 This pull request has now been integrated. 
Changeset: 5212535a Author: Vladimir Ivanov URL: https://git.openjdk.java.net/jdk/commit/5212535a276a92d96ca20bdcfccfbce956febdb1 Stats: 137 lines in 6 files changed: 130 ins; 1 del; 6 mod 8282218: C1: Missing side effects of dynamic class loading during constant linkage Reviewed-by: thartmann, kvn ------------- PR: https://git.openjdk.java.net/jdk/pull/7612 From kvn at openjdk.java.net Sat May 7 00:21:48 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Sat, 7 May 2022 00:21:48 GMT Subject: RFR: 8286197: C2: Optimize MemorySegment shape in int loop [v2] In-Reply-To: References: Message-ID: <0Yyl6iZROrT8oNzAh7TMf2Lx2wU_XNkikkUGpxaQPvM=.f0bba3ec-1792-4422-9c60-a464188970c6@github.com> On Fri, 6 May 2022 09:18:27 GMT, Roland Westrelin wrote: >> This is another small enhancement for a code shape that showed up in a >> MemorySegment micro benchmark. The shape to optimize is the one from test1: >> >> >> for (int i = 0; i < size; i++) { >> long j = i * UNSAFE.ARRAY_INT_INDEX_SCALE; >> >> j = Objects.checkIndex(j, size * 4); >> >> if (((base + j) & 3) != 0) { >> throw new RuntimeException(); >> } >> >> v += UNSAFE.getInt(base + j); >> } >> >> >> In that code shape, the loop iv is first scaled, result is then casted >> to long, range checked and finally address of memory location is >> computed. >> >> The alignment check is transformed so the loop body has no check In >> order to eliminate the range check, that loop is transformed into: >> >> >> for (int i1 = ..) { >> for (int i2 = ..) { >> long j = (i1 + i2) * UNSAFE.ARRAY_INT_INDEX_SCALE; >> >> j = Objects.checkIndex(j, size * 4); >> >> v += UNSAFE.getInt(base + j); >> } >> } >> >> >> The address shape is (AddP base (CastLL (ConvI2L (LShiftI (AddI ... >> >> In this case, the type of the ConvI2L is [min_jint, max_jint] and type >> of CastLL is [0, max_jint] (the CastLL has a narrower type). >> >> I propose transforming (CastLL (ConvI2L into (ConvI2L (CastII in that >> case. 
The convI2L and CastII types can be set to [0, max_jint]. The >> new address shape is then: >> >> (AddP base (ConvI2L (CastII (LShiftI (AddI ... >> >> which optimize well. >> >> (LShiftI (AddI ... >> is transformed into >> (AddI (LShiftI ... >> because one of the AddI input is loop invariant (i2) and we have: >> >> (AddP base (ConvI2L (CastII (AddI (LShiftI ... >> >> Then because the ConvI2L and CastII types are [0, max_jint], the AddI >> is pushed through the ConvI2L and CastII: >> >> (AddP base (AddL (ConvI2L (CastII (LShiftI ... >> >> base and one of the inputs of the AddL are loop invariant so this >> transformed into: >> >> (AddP (AddP ...) (ConvI2L (CastII (LShiftI ... >> >> The (AddP ...) is loop invariant so computed before entry. The >> (ConvI2L ...) only depends on the loop iv. >> >> The resulting address is a shift + an add. The address before >> transformation requires 2 adds + a shift. Also after unrolling, the >> adress of the second access in the loop is cheaper to compute as it >> can be derived from the address of the first access. >> >> For all of this to work: >> 1) I added a CastLL::Ideal transformation: >> (CastLL (ConvI2L into (ConvI2l (CastII >> >> 2) I also had to prevent split if to transform (LShiftI (Phi for the >> iv Phi of a counted loop. >> >> >> test2 and test3 test 1) and 2) separately. > > Roland Westrelin has updated the pull request incrementally with one additional commit since the last revision: > > review Testing passed. It needs second review. ------------- PR: https://git.openjdk.java.net/jdk/pull/8555 From xgong at openjdk.java.net Sat May 7 01:49:40 2022 From: xgong at openjdk.java.net (Xiaohong Gong) Date: Sat, 7 May 2022 01:49:40 GMT Subject: RFR: 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature [v3] In-Reply-To: References: Message-ID: On Fri, 6 May 2022 14:59:26 GMT, Paul Sandoz wrote: >> Make sense! Thanks for the explanation! > > Doh! of course. 
This is not the first and will not be the last time i get caught out by the 2-slot requirement. > It may be useful to do this: > > Node* mask_arg = is_store ? argument(8) : argument(7); Yes, the mask argument is got like: Node* mask = unbox_vector(is_store ? argument(8) : argument(7), mbox_type, elem_bt, num_elem); ------------- PR: https://git.openjdk.java.net/jdk/pull/8035 From xgong at openjdk.java.net Sat May 7 02:03:36 2022 From: xgong at openjdk.java.net (Xiaohong Gong) Date: Sat, 7 May 2022 02:03:36 GMT Subject: RFR: 8282966: AArch64: Optimize VectorMask.toLong with SVE2 [v2] In-Reply-To: References: Message-ID: On Thu, 5 May 2022 01:31:52 GMT, Eric Liu wrote: >> This patch optimizes the backend implementation of VectorMaskToLong for >> AArch64, given a more efficient approach to mov value bits from >> predicate register to general purpose register as x86 PMOVMSK[1] does, >> by using BEXT[2] which is available in SVE2. >> >> With this patch, the final code (input mask is byte type with >> SPECIESE_512, generated on an SVE vector reg size of 512-bit QEMU >> emulator) changes as below: >> >> Before: >> >> mov z16.b, p0/z, #1 >> fmov x0, d16 >> orr x0, x0, x0, lsr #7 >> orr x0, x0, x0, lsr #14 >> orr x0, x0, x0, lsr #28 >> and x0, x0, #0xff >> fmov x8, v16.d[1] >> orr x8, x8, x8, lsr #7 >> orr x8, x8, x8, lsr #14 >> orr x8, x8, x8, lsr #28 >> and x8, x8, #0xff >> orr x0, x0, x8, lsl #8 >> >> orr x8, xzr, #0x2 >> whilele p1.d, xzr, x8 >> lastb x8, p1, z16.d >> orr x8, x8, x8, lsr #7 >> orr x8, x8, x8, lsr #14 >> orr x8, x8, x8, lsr #28 >> and x8, x8, #0xff >> orr x0, x0, x8, lsl #16 >> >> orr x8, xzr, #0x3 >> whilele p1.d, xzr, x8 >> lastb x8, p1, z16.d >> orr x8, x8, x8, lsr #7 >> orr x8, x8, x8, lsr #14 >> orr x8, x8, x8, lsr #28 >> and x8, x8, #0xff >> orr x0, x0, x8, lsl #24 >> >> orr x8, xzr, #0x4 >> whilele p1.d, xzr, x8 >> lastb x8, p1, z16.d >> orr x8, x8, x8, lsr #7 >> orr x8, x8, x8, lsr #14 >> orr x8, x8, x8, lsr #28 >> and x8, x8, #0xff >> orr x0, 
x0, x8, lsl #32 >> >> mov x8, #0x5 >> whilele p1.d, xzr, x8 >> lastb x8, p1, z16.d >> orr x8, x8, x8, lsr #7 >> orr x8, x8, x8, lsr #14 >> orr x8, x8, x8, lsr #28 >> and x8, x8, #0xff >> orr x0, x0, x8, lsl #40 >> >> orr x8, xzr, #0x6 >> whilele p1.d, xzr, x8 >> lastb x8, p1, z16.d >> orr x8, x8, x8, lsr #7 >> orr x8, x8, x8, lsr #14 >> orr x8, x8, x8, lsr #28 >> and x8, x8, #0xff >> orr x0, x0, x8, lsl #48 >> >> orr x8, xzr, #0x7 >> whilele p1.d, xzr, x8 >> lastb x8, p1, z16.d >> orr x8, x8, x8, lsr #7 >> orr x8, x8, x8, lsr #14 >> orr x8, x8, x8, lsr #28 >> and x8, x8, #0xff >> orr x0, x0, x8, lsl #56 >> >> After: >> >> mov z16.b, p0/z, #1 >> mov z17.b, #1 >> bext z16.d, z16.d, z17.d >> mov z17.d, #0 >> uzp1 z16.s, z16.s, z17.s >> uzp1 z16.h, z16.h, z17.h >> uzp1 z16.b, z16.b, z17.b >> mov x0, v16.d[0] >> >> [1] https://www.felixcloutier.com/x86/pmovmskb >> [2] https://developer.arm.com/documentation/ddi0602/2020-12/SVE-Instructions/BEXT--Gather-lower-bits-from-positions-selected-by-bitmask- > > Eric Liu has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains two commits: > > - Merge jdk:master > > Change-Id: Ifa60f3b79513c22dbf932f1da623289687bc1070 > - 8282966: AArch64: Optimize VectorMask.toLong with SVE2 > > This patch optimizes the backend implementation of VectorMaskToLong for > AArch64, given a more efficient approach to mov value bits from > predicate register to general purpose register as x86 PMOVMSK[1] does, > by using BEXT[2] which is available in SVE2. 
> > With this patch, the final code (input mask is byte type with > SPECIESE_512, generated on an SVE vector reg size of 512-bit QEMU > emulator) changes as below: > > Before: > > mov z16.b, p0/z, #1 > fmov x0, d16 > orr x0, x0, x0, lsr #7 > orr x0, x0, x0, lsr #14 > orr x0, x0, x0, lsr #28 > and x0, x0, #0xff > fmov x8, v16.d[1] > orr x8, x8, x8, lsr #7 > orr x8, x8, x8, lsr #14 > orr x8, x8, x8, lsr #28 > and x8, x8, #0xff > orr x0, x0, x8, lsl #8 > > orr x8, xzr, #0x2 > whilele p1.d, xzr, x8 > lastb x8, p1, z16.d > orr x8, x8, x8, lsr #7 > orr x8, x8, x8, lsr #14 > orr x8, x8, x8, lsr #28 > and x8, x8, #0xff > orr x0, x0, x8, lsl #16 > > orr x8, xzr, #0x3 > whilele p1.d, xzr, x8 > lastb x8, p1, z16.d > orr x8, x8, x8, lsr #7 > orr x8, x8, x8, lsr #14 > orr x8, x8, x8, lsr #28 > and x8, x8, #0xff > orr x0, x0, x8, lsl #24 > > orr x8, xzr, #0x4 > whilele p1.d, xzr, x8 > lastb x8, p1, z16.d > orr x8, x8, x8, lsr #7 > orr x8, x8, x8, lsr #14 > orr x8, x8, x8, lsr #28 > and x8, x8, #0xff > orr x0, x0, x8, lsl #32 > > mov x8, #0x5 > whilele p1.d, xzr, x8 > lastb x8, p1, z16.d > orr x8, x8, x8, lsr #7 > orr x8, x8, x8, lsr #14 > orr x8, x8, x8, lsr #28 > and x8, x8, #0xff > orr x0, x0, x8, lsl #40 > > orr x8, xzr, #0x6 > whilele p1.d, xzr, x8 > lastb x8, p1, z16.d > orr x8, x8, x8, lsr #7 > orr x8, x8, x8, lsr #14 > orr x8, x8, x8, lsr #28 > and x8, x8, #0xff > orr x0, x0, x8, lsl #48 > > orr x8, xzr, #0x7 > whilele p1.d, xzr, x8 > lastb x8, p1, z16.d > orr x8, x8, x8, lsr #7 > orr x8, x8, x8, lsr #14 > orr x8, x8, x8, lsr #28 > and x8, x8, #0xff > orr x0, x0, x8, lsl #56 > > After: > > mov z16.b, p0/z, #1 > mov z17.b, #1 > bext z16.d, z16.d, z17.d > mov z17.d, #0 > uzp1 z16.s, z16.s, z17.s > uzp1 z16.h, z16.h, z17.h > uzp1 z16.b, z16.b, z17.b > mov x0, v16.d[0] > > [1] https://www.felixcloutier.com/x86/pmovmskb > [2] https://developer.arm.com/documentation/ddi0602/2020-12/SVE-Instructions/BEXT--Gather-lower-bits-from-positions-selected-by-bitmask- > > Change-Id: 
Ia983a20c89f76403e557ac21328f2f2e05dd08e0 Marked as reviewed by xgong (Committer). Looks good to me. Thanks! ------------- PR: https://git.openjdk.java.net/jdk/pull/8337 From jvernee at openjdk.java.net Sat May 7 12:51:12 2022 From: jvernee at openjdk.java.net (Jorn Vernee) Date: Sat, 7 May 2022 12:51:12 GMT Subject: RFR: 8283689: Update the foreign linker VM implementation [v4] In-Reply-To: References: Message-ID: > Hi, > > This PR updates the VM implementation of the foreign linker, by bringing over commits from the panama-foreign repo. > > This is split off from the main JEP integration for 19, since we have limited resources to handle this. As such, this PR might fall over to 20. > > I've written up an overview of the Linker architecture here: http://cr.openjdk.java.net/~jvernee/docs/FL_Overview.html it might be useful to read that first. > > This patch moves from the "legacy" implementation, to what is currently implemented in the panama-foreign repo, except for replacing the use of method handle combinators with ASM. That will come in a later path. To recap. This PR contains the following changes: > > 1. VM stubs for downcalls are now generated up front, instead of lazily by C2 [1]. > 2. the VM support for upcalls/downcalls now support all possible call shapes. And VM stubs and Java code implementing the buffered invocation strategy has been removed [2], [3], [4], [5]. > 3. The existing C2 intrinsification support for the `linkToNative` method handle linker was no longer needed and has been removed [6] (support might be re-added in another form later). > 4. Some other cleanups, such as: OptimizedEntryBlob (for upcalls) now implements RuntimeBlob directly. Binding to java classes has been rewritten to use javaClasses.h/cpp (this wasn't previously possible due to these java classes being in an incubator module) [7], [8], [9]. > > While the patch mostly consists of VM changes, there are also some Java changes to support (2). 
> > The original commit structure has been mostly retained, so it might be useful to look at a specific commit, or the corresponding patch in the [panama-foreign](https://github.com/openjdk/panama-foreign/pulls?q=is%3Apr) repo as well. I've also left some inline comments to explain some of the changes, which will hopefully make reviewing easier. > > Testing: Tier1-4 > > Thanks, > Jorn > > [1]: https://github.com/openjdk/jdk/pull/7959/commits/048b88156814579dca1f70742061ad24942fd358 > [2]: https://github.com/openjdk/jdk/pull/7959/commits/2fbbef472b4c2b4fee5ede2f18cd81ab61e88f49 > [3]: https://github.com/openjdk/jdk/pull/7959/commits/8a957a4ed9cc8d1f708ea8777212eb51ab403dc3 > [4]: https://github.com/openjdk/jdk/pull/7959/commits/35ba1d964f1de4a77345dc58debe0565db4b0ff3 > [5]: https://github.com/openjdk/jdk/pull/7959/commits/4e72aae22920300c5ffa16fed805b62ed9092120 > [6]: https://github.com/openjdk/jdk/pull/7959/commits/08e22e1b468c5c8f0cfd7135c72849944068aa7a > [7]: https://github.com/openjdk/jdk/pull/7959/commits/451cd9edf54016c182dab21a8b26bd8b609fc062 > [8]: https://github.com/openjdk/jdk/pull/7959/commits/4c851d2795afafec3a3ab17f4142ee098692068f > [9]: https://github.com/openjdk/jdk/pull/7959/commits/d025377799424f31512dca2ffe95491cd5ae22f9 Jorn Vernee has updated the pull request with a new target base due to a merge or a rebase. 
The pull request now contains 90 commits: - Merge branch 'foreign-preview-m' into JEP-19-VM-IMPL2 - Merge branch 'master' into JEP-19-VM-IMPL2 - 8284161: Implementation of Virtual Threads (Preview) Co-authored-by: Ron Pressler Co-authored-by: Alan Bateman Co-authored-by: Erik Österlund Co-authored-by: Andrew Haley Co-authored-by: Rickard Bäckman Co-authored-by: Markus Grönlund Co-authored-by: Leonid Mesnik Co-authored-by: Serguei Spitsyn Co-authored-by: Chris Plummer Co-authored-by: Coleen Phillimore Co-authored-by: Robbin Ehn Co-authored-by: Stefan Karlsson Co-authored-by: Thomas Schatzl Co-authored-by: Sergey Kuksenko Reviewed-by: lancea, eosterlund, rehn, sspitsyn, stefank, tschatzl, dfuchs, lmesnik, dcubed, kevinw, amenkov, dlong, mchung, psandoz, bpb, coleenp, smarks, egahlin, mseledtsov, coffeys, darcy - 8282218: C1: Missing side effects of dynamic class loading during constant linkage Reviewed-by: thartmann, kvn - 8286342: ProblemList compiler/c2/irTests/TestEnumFinalFold.java Reviewed-by: mikael - 8286263: compiler/c1/TestPinnedIntrinsics.java failed with "RuntimeException: testCurrentTimeMillis failed with -3" Reviewed-by: thartmann, kvn - 8285295: Need better testing for IdentityHashMap Reviewed-by: jpai, lancea - 8286190: Add test to verify constant folding for Enum fields Reviewed-by: kvn, thartmann - 8286154: Fix 3rd party notices in test files Reviewed-by: darcy, joehw, iris - 8286291: G1: Remove unused segment allocator printouts Reviewed-by: ayang, iwalulya - ...
and 80 more: https://git.openjdk.java.net/jdk/compare/f823bf84...5cef96f7 ------------- Changes: https://git.openjdk.java.net/jdk/pull/7959/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=7959&range=03 Stats: 117182 lines in 1482 files changed: 100895 ins; 8432 del; 7855 mod Patch: https://git.openjdk.java.net/jdk/pull/7959.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/7959/head:pull/7959 PR: https://git.openjdk.java.net/jdk/pull/7959 From jvernee at openjdk.java.net Sat May 7 12:59:05 2022 From: jvernee at openjdk.java.net (Jorn Vernee) Date: Sat, 7 May 2022 12:59:05 GMT Subject: RFR: 8283689: Update the foreign linker VM implementation [v5] In-Reply-To: References: Message-ID: <0jKvCItLYrueCki_LnvoP5uRXjLF-a2M5qW6l1Mjpo4=.be3b10da-1c4d-4c41-95db-252ab28ee897@github.com> > Hi, > > This PR updates the VM implementation of the foreign linker, by bringing over commits from the panama-foreign repo. > > This is split off from the main JEP integration for 19, since we have limited resources to handle this. As such, this PR might fall over to 20. > > I've written up an overview of the Linker architecture here: http://cr.openjdk.java.net/~jvernee/docs/FL_Overview.html it might be useful to read that first. > > This patch moves from the "legacy" implementation, to what is currently implemented in the panama-foreign repo, except for replacing the use of method handle combinators with ASM. That will come in a later path. To recap. This PR contains the following changes: > > 1. VM stubs for downcalls are now generated up front, instead of lazily by C2 [1]. > 2. the VM support for upcalls/downcalls now support all possible call shapes. And VM stubs and Java code implementing the buffered invocation strategy has been removed [2], [3], [4], [5]. > 3. The existing C2 intrinsification support for the `linkToNative` method handle linker was no longer needed and has been removed [6] (support might be re-added in another form later). > 4. 
Some other cleanups, such as: OptimizedEntryBlob (for upcalls) now implements RuntimeBlob directly. Binding to java classes has been rewritten to use javaClasses.h/cpp (this wasn't previously possible due to these java classes being in an incubator module) [7], [8], [9]. > > While the patch mostly consists of VM changes, there are also some Java changes to support (2). > > The original commit structure has been mostly retained, so it might be useful to look at a specific commit, or the corresponding patch in the [panama-foreign](https://github.com/openjdk/panama-foreign/pulls?q=is%3Apr) repo as well. I've also left some inline comments to explain some of the changes, which will hopefully make reviewing easier. > > Testing: Tier1-4 > > Thanks, > Jorn > > [1]: https://github.com/openjdk/jdk/pull/7959/commits/048b88156814579dca1f70742061ad24942fd358 > [2]: https://github.com/openjdk/jdk/pull/7959/commits/2fbbef472b4c2b4fee5ede2f18cd81ab61e88f49 > [3]: https://github.com/openjdk/jdk/pull/7959/commits/8a957a4ed9cc8d1f708ea8777212eb51ab403dc3 > [4]: https://github.com/openjdk/jdk/pull/7959/commits/35ba1d964f1de4a77345dc58debe0565db4b0ff3 > [5]: https://github.com/openjdk/jdk/pull/7959/commits/4e72aae22920300c5ffa16fed805b62ed9092120 > [6]: https://github.com/openjdk/jdk/pull/7959/commits/08e22e1b468c5c8f0cfd7135c72849944068aa7a > [7]: https://github.com/openjdk/jdk/pull/7959/commits/451cd9edf54016c182dab21a8b26bd8b609fc062 > [8]: https://github.com/openjdk/jdk/pull/7959/commits/4c851d2795afafec3a3ab17f4142ee098692068f > [9]: https://github.com/openjdk/jdk/pull/7959/commits/d025377799424f31512dca2ffe95491cd5ae22f9 Jorn Vernee has updated the pull request incrementally with one additional commit since the last revision: Revert "Merge branch 'master' into JEP-19-VM-IMPL2" This reverts commit 98864b62749f3a482dbb0516a987f38904142042, reversing changes made to a7b9f131c4cc5fbec81811941e5c3e164838a88d. 
------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/7959/files - new: https://git.openjdk.java.net/jdk/pull/7959/files/5cef96f7..f195789f Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=7959&range=04 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=7959&range=03-04 Stats: 332953 lines in 4896 files changed: 22818 ins; 256179 del; 53956 mod Patch: https://git.openjdk.java.net/jdk/pull/7959.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/7959/head:pull/7959 PR: https://git.openjdk.java.net/jdk/pull/7959 From jvernee at openjdk.java.net Sat May 7 13:05:38 2022 From: jvernee at openjdk.java.net (Jorn Vernee) Date: Sat, 7 May 2022 13:05:38 GMT Subject: RFR: 8283689: Update the foreign linker VM implementation [v6] In-Reply-To: References: Message-ID: > Hi, > > This PR updates the VM implementation of the foreign linker, by bringing over commits from the panama-foreign repo. > > This is split off from the main JEP integration for 19, since we have limited resources to handle this. As such, this PR might fall over to 20. > > I've written up an overview of the Linker architecture here: http://cr.openjdk.java.net/~jvernee/docs/FL_Overview.html it might be useful to read that first. > > This patch moves from the "legacy" implementation, to what is currently implemented in the panama-foreign repo, except for replacing the use of method handle combinators with ASM. That will come in a later path. To recap. This PR contains the following changes: > > 1. VM stubs for downcalls are now generated up front, instead of lazily by C2 [1]. > 2. the VM support for upcalls/downcalls now support all possible call shapes. And VM stubs and Java code implementing the buffered invocation strategy has been removed [2], [3], [4], [5]. > 3. The existing C2 intrinsification support for the `linkToNative` method handle linker was no longer needed and has been removed [6] (support might be re-added in another form later). > 4. 
Some other cleanups, such as: OptimizedEntryBlob (for upcalls) now implements RuntimeBlob directly. Binding to java classes has been rewritten to use javaClasses.h/cpp (this wasn't previously possible due to these java classes being in an incubator module) [7], [8], [9]. > > While the patch mostly consists of VM changes, there are also some Java changes to support (2). > > The original commit structure has been mostly retained, so it might be useful to look at a specific commit, or the corresponding patch in the [panama-foreign](https://github.com/openjdk/panama-foreign/pulls?q=is%3Apr) repo as well. I've also left some inline comments to explain some of the changes, which will hopefully make reviewing easier. > > Testing: Tier1-4 > > Thanks, > Jorn > > [1]: https://github.com/openjdk/jdk/pull/7959/commits/048b88156814579dca1f70742061ad24942fd358 > [2]: https://github.com/openjdk/jdk/pull/7959/commits/2fbbef472b4c2b4fee5ede2f18cd81ab61e88f49 > [3]: https://github.com/openjdk/jdk/pull/7959/commits/8a957a4ed9cc8d1f708ea8777212eb51ab403dc3 > [4]: https://github.com/openjdk/jdk/pull/7959/commits/35ba1d964f1de4a77345dc58debe0565db4b0ff3 > [5]: https://github.com/openjdk/jdk/pull/7959/commits/4e72aae22920300c5ffa16fed805b62ed9092120 > [6]: https://github.com/openjdk/jdk/pull/7959/commits/08e22e1b468c5c8f0cfd7135c72849944068aa7a > [7]: https://github.com/openjdk/jdk/pull/7959/commits/451cd9edf54016c182dab21a8b26bd8b609fc062 > [8]: https://github.com/openjdk/jdk/pull/7959/commits/4c851d2795afafec3a3ab17f4142ee098692068f > [9]: https://github.com/openjdk/jdk/pull/7959/commits/d025377799424f31512dca2ffe95491cd5ae22f9 Jorn Vernee has refreshed the contents of this pull request, and previous commits have been removed. The incremental views will show differences compared to the previous content of the PR. 
The pull request contains 20 new commits since the last revision: - Remove unneeded ComputeMoveOrder - Remove comment about native calls in lcm.cpp - 8284072: foreign/StdLibTest.java randomly crashes on MacOS/AArch64 Reviewed-by: jvernee, mcimadamore - Update riscv and arm stubs - Remove spurious ProblemList change - Pass pointer to LogStream - Polish - Replace TraceNativeInvokers flag with unified logging - Fix other platforms, take 2 - Fix other platforms - ... and 10 more: https://git.openjdk.java.net/jdk/compare/f195789f...e84e3379 ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/7959/files - new: https://git.openjdk.java.net/jdk/pull/7959/files/f195789f..e84e3379 Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=7959&range=05 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=7959&range=04-05 Stats: 222764 lines in 3783 files changed: 157991 ins; 17628 del; 47145 mod Patch: https://git.openjdk.java.net/jdk/pull/7959.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/7959/head:pull/7959 PR: https://git.openjdk.java.net/jdk/pull/7959 From jvernee at openjdk.java.net Sat May 7 13:05:45 2022 From: jvernee at openjdk.java.net (Jorn Vernee) Date: Sat, 7 May 2022 13:05:45 GMT Subject: RFR: 8283689: Update the foreign linker VM implementation [v4] In-Reply-To: References: Message-ID: On Sat, 7 May 2022 12:51:12 GMT, Jorn Vernee wrote: >> Hi, >> >> This PR updates the VM implementation of the foreign linker, by bringing over commits from the panama-foreign repo. >> >> This is split off from the main JEP integration for 19, since we have limited resources to handle this. As such, this PR might fall over to 20. >> >> I've written up an overview of the Linker architecture here: http://cr.openjdk.java.net/~jvernee/docs/FL_Overview.html it might be useful to read that first. 
>> >> This patch moves from the "legacy" implementation, to what is currently implemented in the panama-foreign repo, except for replacing the use of method handle combinators with ASM. That will come in a later path. To recap. This PR contains the following changes: >> >> 1. VM stubs for downcalls are now generated up front, instead of lazily by C2 [1]. >> 2. the VM support for upcalls/downcalls now support all possible call shapes. And VM stubs and Java code implementing the buffered invocation strategy has been removed [2], [3], [4], [5]. >> 3. The existing C2 intrinsification support for the `linkToNative` method handle linker was no longer needed and has been removed [6] (support might be re-added in another form later). >> 4. Some other cleanups, such as: OptimizedEntryBlob (for upcalls) now implements RuntimeBlob directly. Binding to java classes has been rewritten to use javaClasses.h/cpp (this wasn't previously possible due to these java classes being in an incubator module) [7], [8], [9]. >> >> While the patch mostly consists of VM changes, there are also some Java changes to support (2). >> >> The original commit structure has been mostly retained, so it might be useful to look at a specific commit, or the corresponding patch in the [panama-foreign](https://github.com/openjdk/panama-foreign/pulls?q=is%3Apr) repo as well. I've also left some inline comments to explain some of the changes, which will hopefully make reviewing easier. 
>> >> Testing: Tier1-4 >> >> Thanks, >> Jorn >> >> [1]: https://github.com/openjdk/jdk/pull/7959/commits/048b88156814579dca1f70742061ad24942fd358 >> [2]: https://github.com/openjdk/jdk/pull/7959/commits/2fbbef472b4c2b4fee5ede2f18cd81ab61e88f49 >> [3]: https://github.com/openjdk/jdk/pull/7959/commits/8a957a4ed9cc8d1f708ea8777212eb51ab403dc3 >> [4]: https://github.com/openjdk/jdk/pull/7959/commits/35ba1d964f1de4a77345dc58debe0565db4b0ff3 >> [5]: https://github.com/openjdk/jdk/pull/7959/commits/4e72aae22920300c5ffa16fed805b62ed9092120 >> [6]: https://github.com/openjdk/jdk/pull/7959/commits/08e22e1b468c5c8f0cfd7135c72849944068aa7a >> [7]: https://github.com/openjdk/jdk/pull/7959/commits/451cd9edf54016c182dab21a8b26bd8b609fc062 >> [8]: https://github.com/openjdk/jdk/pull/7959/commits/4c851d2795afafec3a3ab17f4142ee098692068f >> [9]: https://github.com/openjdk/jdk/pull/7959/commits/d025377799424f31512dca2ffe95491cd5ae22f9 > > Jorn Vernee has updated the pull request with a new target base due to a merge or a rebase. 
The pull request now contains 90 commits: > > - Merge branch 'foreign-preview-m' into JEP-19-VM-IMPL2 > - Merge branch 'master' into JEP-19-VM-IMPL2 > - 8284161: Implementation of Virtual Threads (Preview) > > Co-authored-by: Ron Pressler > Co-authored-by: Alan Bateman > Co-authored-by: Erik Österlund > Co-authored-by: Andrew Haley > Co-authored-by: Rickard Bäckman > Co-authored-by: Markus Grönlund > Co-authored-by: Leonid Mesnik > Co-authored-by: Serguei Spitsyn > Co-authored-by: Chris Plummer > Co-authored-by: Coleen Phillimore > Co-authored-by: Robbin Ehn > Co-authored-by: Stefan Karlsson > Co-authored-by: Thomas Schatzl > Co-authored-by: Sergey Kuksenko > Reviewed-by: lancea, eosterlund, rehn, sspitsyn, stefank, tschatzl, dfuchs, lmesnik, dcubed, kevinw, amenkov, dlong, mchung, psandoz, bpb, coleenp, smarks, egahlin, mseledtsov, coffeys, darcy > - 8282218: C1: Missing side effects of dynamic class loading during constant linkage > > Reviewed-by: thartmann, kvn > - 8286342: ProblemList compiler/c2/irTests/TestEnumFinalFold.java > > Reviewed-by: mikael > - 8286263: compiler/c1/TestPinnedIntrinsics.java failed with "RuntimeException: testCurrentTimeMillis failed with -3" > > Reviewed-by: thartmann, kvn > - 8285295: Need better testing for IdentityHashMap > > Reviewed-by: jpai, lancea > - 8286190: Add test to verify constant folding for Enum fields > > Reviewed-by: kvn, thartmann > - 8286154: Fix 3rd party notices in test files > > Reviewed-by: darcy, joehw, iris > - 8286291: G1: Remove unused segment allocator printouts > > Reviewed-by: ayang, iwalulya > - ... and 80 more: https://git.openjdk.java.net/jdk/compare/f823bf84...5cef96f7 I brought in the changes from master after the Virtual Threads integration, but because the PR branch I'm basing on doesn't have those changes, they showed up in the diff.
I've undone this mistake by rebasing onto the target branch, which gives a clean diff that should be unchanged from before (but shuffles the commit history to the end of the convo tab). ------------- PR: https://git.openjdk.java.net/jdk/pull/7959 From pli at openjdk.java.net Sat May 7 13:31:20 2022 From: pli at openjdk.java.net (Pengfei Li) Date: Sat, 7 May 2022 13:31:20 GMT Subject: RFR: 8286125: C2: "bad AD file" with PopulateIndex on x86_64 Message-ID: A fuzzer test reports an assertion failure issue with PopulateIndexNode on x86_64. It can be reproduced by the new jtreg case inside this patch. Root cause is that C2 superword creates a PopulateIndexNode by mistake while vectorizing below loop. for (int i = 304; i > 15; i -= 3) { int c = 16; do { for (int t = 1; t < 1; t++) {} arr[c + 1] >>= i; } while (--c > 0); } This is a corner loop case with redundant code inside. After several C2 optimizations, the do-while loop inside is unrolled and then isomorphic right shift statements can be combined in the superword optimization. Since all shift counts are the same loop IV value `i`, superword should generate a RShiftCntVNode to create a vector of scalar replications of the loop IV. But after JDK-8280510, a PopulateIndexNode is generated by mistake because of the `opd == iv()` condition. To fix this, we add a `have_same_inputs` condition here checking if all inputs at position `opd_idx` of nodes in the pack are the same. If true, C2 code should NOT run into this block to generate a PopulateIndexNode. Instead, it should run into the next block for scalar replications. Additionally, only adding this condition here is still not good enough because it breaks the experimental post loop vectorization. As in post loops, all packs are singleton, i.e., `have_same_inputs` is always true. Hence, we also add a pack size check here to make post loop logic run into this block. 
It's safe to let it go because post loop never needs scalar replications of the loop IV - it never combines nodes in packs. We also add two more assertions in the code. Jtreg hotspot::hotspot_all_no_apps, jdk::tier1~3 and langtools::tier1 are tested and no issue is found. ------------- Commit messages: - 8286125: C2: "bad AD file" with PopulateIndex on x86_64 Changes: https://git.openjdk.java.net/jdk/pull/8587/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=8587&range=00 Issue: https://bugs.openjdk.java.net/browse/JDK-8286125 Stats: 67 lines in 2 files changed: 64 ins; 0 del; 3 mod Patch: https://git.openjdk.java.net/jdk/pull/8587.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8587/head:pull/8587 PR: https://git.openjdk.java.net/jdk/pull/8587 From duke at openjdk.java.net Mon May 9 10:11:38 2022 From: duke at openjdk.java.net (Emanuel Peter) Date: Mon, 9 May 2022 10:11:38 GMT Subject: RFR: 8286179: Node::find(int) should not traverse from new to old nodes [v2] In-Reply-To: References: Message-ID: > **Problem:** > `Node::find` traverses input and output edges of nodes during its BFS, and searches for nodes with a specific `idx`. > However, if `ASSERT` is on, it also traverses `debug_orig`. This not only seems unnecessary. But Mach nodes (after matching) point back to the old IR nodes. This means we traverse from the new graph to the old graph, and potentially find multiple nodes matching the `idx`. Only the last found will be returned, sometimes this happens to be the new node, sometimes the old node. This is inconsistent and can be quite annoying during debugging. > > **Implemented Solution:** > 1. Remove traversing `debug_orig`. > 2. Instead, add debug only functions `old_root`, which finds the old root if it exists. Question: I now put a warning in if the `old_root` cannot be found. I think this is helpful for in the debugger. I could make it an assert if that is preferred. > 3. 
`find_node` and `find_ctrl` only search in new nodes now (start BFS at new root). > 4. Added `find_old_node` and `find_old_ctrl`, which search in old nodes (start BFS at old root). > > I hope this improves your debugging experience. > [running sanity tests to see it doesn't break something] Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: added nullprt check for old_root ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/8567/files - new: https://git.openjdk.java.net/jdk/pull/8567/files/4e422297..de749d57 Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=8567&range=01 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=8567&range=00-01 Stats: 6 lines in 1 file changed: 4 ins; 0 del; 2 mod Patch: https://git.openjdk.java.net/jdk/pull/8567.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8567/head:pull/8567 PR: https://git.openjdk.java.net/jdk/pull/8567 From duke at openjdk.java.net Mon May 9 10:11:39 2022 From: duke at openjdk.java.net (Emanuel Peter) Date: Mon, 9 May 2022 10:11:39 GMT Subject: RFR: 8286179: Node::find(int) should not traverse from new to old nodes [v2] In-Reply-To: References: Message-ID: <3vYyPNKW0dtVovNzt9fnWSKJDSaAbewl0xJfAF_oBIU=.ff55bf1a-95cc-4565-90d8-a69f47c88d1d@github.com> On Fri, 6 May 2022 15:59:17 GMT, Vladimir Kozlov wrote: >> Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: >> >> added nullprt check for old_root > > src/hotspot/share/opto/node.cpp line 1616: > >> 1614: // Call this from debugger, search in old nodes: >> 1615: Node* find_old_node(const int idx) { >> 1616: return old_root()->find(idx); > > Need a `nullptr` check for the `old_root()` call, and maybe do nothing since we will already get a message.
Thanks @vnkozlov, I added an assert. > src/hotspot/share/opto/node.cpp line 1631: > >> 1629: // Call this from debugger, search in old nodes: >> 1630: Node* find_old_ctrl(const int idx) { >> 1631: return old_root()->find_ctrl(idx); > > Need `nullptr` check. Done. ------------- PR: https://git.openjdk.java.net/jdk/pull/8567 From jvernee at openjdk.java.net Mon May 9 10:28:27 2022 From: jvernee at openjdk.java.net (Jorn Vernee) Date: Mon, 9 May 2022 10:28:27 GMT Subject: RFR: 8283689: Update the foreign linker VM implementation [v7] In-Reply-To: References: Message-ID: > Hi, > > This PR updates the VM implementation of the foreign linker, by bringing over commits from the panama-foreign repo. > > This is split off from the main JEP integration for 19, since we have limited resources to handle this. As such, this PR might fall over to 20. > > I've written up an overview of the Linker architecture here: http://cr.openjdk.java.net/~jvernee/docs/FL_Overview.html it might be useful to read that first. > > This patch moves from the "legacy" implementation to what is currently implemented in the panama-foreign repo, except for replacing the use of method handle combinators with ASM. That will come in a later patch. To recap, this PR contains the following changes: > > 1. VM stubs for downcalls are now generated up front, instead of lazily by C2 [1]. > 2. The VM support for upcalls/downcalls now supports all possible call shapes. The VM stubs and Java code implementing the buffered invocation strategy have been removed [2], [3], [4], [5]. > 3. The existing C2 intrinsification support for the `linkToNative` method handle linker was no longer needed and has been removed [6] (support might be re-added in another form later). > 4. Some other cleanups, such as: OptimizedEntryBlob (for upcalls) now implements RuntimeBlob directly.
Binding to java classes has been rewritten to use javaClasses.h/cpp (this wasn't previously possible due to these java classes being in an incubator module) [7], [8], [9]. > > While the patch mostly consists of VM changes, there are also some Java changes to support (2). > > The original commit structure has been mostly retained, so it might be useful to look at a specific commit, or the corresponding patch in the [panama-foreign](https://github.com/openjdk/panama-foreign/pulls?q=is%3Apr) repo as well. I've also left some inline comments to explain some of the changes, which will hopefully make reviewing easier. > > Testing: Tier1-4 > > Thanks, > Jorn > > [1]: https://github.com/openjdk/jdk/pull/7959/commits/048b88156814579dca1f70742061ad24942fd358 > [2]: https://github.com/openjdk/jdk/pull/7959/commits/2fbbef472b4c2b4fee5ede2f18cd81ab61e88f49 > [3]: https://github.com/openjdk/jdk/pull/7959/commits/8a957a4ed9cc8d1f708ea8777212eb51ab403dc3 > [4]: https://github.com/openjdk/jdk/pull/7959/commits/35ba1d964f1de4a77345dc58debe0565db4b0ff3 > [5]: https://github.com/openjdk/jdk/pull/7959/commits/4e72aae22920300c5ffa16fed805b62ed9092120 > [6]: https://github.com/openjdk/jdk/pull/7959/commits/08e22e1b468c5c8f0cfd7135c72849944068aa7a > [7]: https://github.com/openjdk/jdk/pull/7959/commits/451cd9edf54016c182dab21a8b26bd8b609fc062 > [8]: https://github.com/openjdk/jdk/pull/7959/commits/4c851d2795afafec3a3ab17f4142ee098692068f > [9]: https://github.com/openjdk/jdk/pull/7959/commits/d025377799424f31512dca2ffe95491cd5ae22f9 Jorn Vernee has updated the pull request with a new target base due to a merge or a rebase. 
The pull request now contains 21 commits: - Merge branch 'foreign-preview-m' into JEP-19-VM-IMPL2 - Remove unneeded ComputeMoveOrder - Remove comment about native calls in lcm.cpp - 8284072: foreign/StdLibTest.java randomly crashes on MacOS/AArch64 Reviewed-by: jvernee, mcimadamore - Update riscv and arm stubs - Remove spurious ProblemList change - Pass pointer to LogStream - Polish - Replace TraceNativeInvokers flag with unified logging - Fix other platforms, take 2 - ... and 11 more: https://git.openjdk.java.net/jdk/compare/3c88a2ef...43fd1b91 ------------- Changes: https://git.openjdk.java.net/jdk/pull/7959/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=7959&range=06 Stats: 6934 lines in 157 files changed: 2678 ins; 3218 del; 1038 mod Patch: https://git.openjdk.java.net/jdk/pull/7959.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/7959/head:pull/7959 PR: https://git.openjdk.java.net/jdk/pull/7959 From chagedorn at openjdk.java.net Mon May 9 13:14:08 2022 From: chagedorn at openjdk.java.net (Christian Hagedorn) Date: Mon, 9 May 2022 13:14:08 GMT Subject: RFR: 8286179: Node::find(int) should not traverse from new to old nodes [v2] In-Reply-To: References: Message-ID: On Mon, 9 May 2022 10:11:38 GMT, Emanuel Peter wrote: >> **Problem:** >> `Node::find` traverses input and output edges of nodes during its BFS, and searches for nodes with a specific `idx`. >> However, if `ASSERT` is on, it also traverses `debug_orig`. This not only seems unnecessary. But Mach nodes (after matching) point back to the old IR nodes. This means we traverse from the new graph to the old graph, and potentially find multiple nodes matching the `idx`. Only the last found will be returned, sometimes this happens to be the new node, sometimes the old node. This is inconsistent and can be quite annoying during debugging. >> >> **Implemented Solution:** >> 1. Remove traversing `debug_orig`. >> 2. 
Instead, add debug only functions `old_root`, which finds the old root if it exists. Question: I now put a warning in if the `old_root` cannot be found. I think this is helpful for in the debugger. I could make it an assert if that is preferred. >> 3. `find_node` and `find_ctrl` only search in new nodes now (start BFS at new root). >> 4. Added `find_old_node` and `find_old_ctrl`, which search in new nodes (start BFS at old root). >> >> I hope this improves your debugging experience. >> [running sanity tests to see it doesn't break something] > > Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: > > added nullprt check for old_root Otherwise, it looks good to me. src/hotspot/share/opto/node.cpp line 1617: > 1615: Node* find_old_node(const int idx) { > 1616: Node* root = old_root(); > 1617: assert(root != nullptr, "must have old_root() to find old nodes"); I think it's better to avoid assertions here and below and do nothing instead (you already print a warning in `old_root()` which is fine I think) since it would crash and stop the current debugging session. ------------- Changes requested by chagedorn (Reviewer). 
PR: https://git.openjdk.java.net/jdk/pull/8567 From duke at openjdk.java.net Mon May 9 13:58:00 2022 From: duke at openjdk.java.net (Emanuel Peter) Date: Mon, 9 May 2022 13:58:00 GMT Subject: RFR: 8286179: Node::find(int) should not traverse from new to old nodes [v2] In-Reply-To: References: Message-ID: On Mon, 9 May 2022 13:04:03 GMT, Christian Hagedorn wrote: >> Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: >> >> added nullprt check for old_root > > src/hotspot/share/opto/node.cpp line 1617: > >> 1615: Node* find_old_node(const int idx) { >> 1616: Node* root = old_root(); >> 1617: assert(root != nullptr, "must have old_root() to find old nodes"); > > I think it's better to avoid assertions here and below and do nothing instead (you already print a warning in `old_root()` which is fine I think) since it would crash and stop the current debugging session. @chhagedorn I added this `nullptr` assert because @vnkozlov asked for a `nullptr` check. https://github.com/openjdk/jdk/pull/8567#discussion_r866966586 Is there an alternative to an assert? I realize that in `rr` the debugging session is not stopped for me - it just unwinds - and I can continue debugging as if nothing happened. However, in `gdb` this does crash the debugging session - at least by default. You can change that behavior with `set unwindonsignal on`. There are other asserts that can be triggered.
For example, if we call `find_node(1)` from a non-compiler thread, we get an assert when internally `Compile::current()` comes across this https://github.com/openjdk/jdk/blob/b849efdf154552903faaddd69cac1fe5f1ddf18a/src/hotspot/share/compiler/compilerThread.hpp#L63-L64 ------------- PR: https://git.openjdk.java.net/jdk/pull/8567 From duke at openjdk.java.net Mon May 9 14:01:50 2022 From: duke at openjdk.java.net (Emanuel Peter) Date: Mon, 9 May 2022 14:01:50 GMT Subject: RFR: 8286179: Node::find(int) should not traverse from new to old nodes [v2] In-Reply-To: References: Message-ID: On Mon, 9 May 2022 13:54:46 GMT, Emanuel Peter wrote: >> src/hotspot/share/opto/node.cpp line 1617: >> >>> 1615: Node* find_old_node(const int idx) { >>> 1616: Node* root = old_root(); >>> 1617: assert(root != nullptr, "must have old_root() to find old nodes"); >> >> I think it's better to avoid assertions here and below and do nothing instead (you already print a warning in `old_root()` which is fine I think) since it would crash and stop the current debugging session. > > @chhagedorn I added this `nullptr` assert because @vnkozlov asked for a `nullptr` check. https://github.com/openjdk/jdk/pull/8567#discussion_r866966586 > Is there an alternative to an assert? > I realize that in `rr` the debugging session is not stopped for me - it just unwinds - and I can continue debugging as if nothing happened. However, in `gdb` this does crash the debugging session - at least by default. You can change that behavior with `set unwindonsignal on`. > > There are other asserts that can be triggered. For example, if we call `find_node(1)` from a non-compiler thread, we get an assert when internally `Compile::current()` comes across this > > https://github.com/openjdk/jdk/blob/b849efdf154552903faaddd69cac1fe5f1ddf18a/src/hotspot/share/compiler/compilerThread.hpp#L63-L64 Aha. A possible alternative would be to simply return `nullptr`.
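
The `nullptr`-return alternative could look roughly like the sketch below. It is a simplified, self-contained stand-in and not the actual HotSpot code: the `Node` struct, the `g_old_root` global, the depth-first `find`, and `demo_lookup` are assumptions made for illustration; only the names `old_root` and `find_old_node` come from the patch under review.

```cpp
#include <cassert>
#include <cstdio>
#include <vector>

// Simplified stand-in for an IR node; NOT the real HotSpot Node class.
struct Node {
  int idx;
  std::vector<Node*> inputs;

  // Depth-first search over input edges for a node with the given index.
  // Assumes an acyclic graph, which keeps the sketch short.
  Node* find(int wanted) {
    if (idx == wanted) return this;
    for (Node* in : inputs) {
      if (in == nullptr) continue;
      Node* hit = in->find(wanted);
      if (hit != nullptr) return hit;
    }
    return nullptr;
  }
};

// Hypothetical global that models "the old root, if one exists".
static Node* g_old_root = nullptr;

// Prints a warning instead of asserting when there is no old root.
Node* old_root() {
  if (g_old_root == nullptr) {
    std::fprintf(stderr, "old_root: no old root found\n");
  }
  return g_old_root;
}

// Debugger-friendly lookup: a missing old root yields nullptr, so the
// call cannot abort the process the way a failed assert would.
Node* find_old_node(int idx) {
  Node* root = old_root();
  if (root == nullptr) {
    return nullptr;
  }
  return root->find(idx);
}

// Small usage example for the sketch above.
void demo_lookup() {
  Node leaf{2, {}};
  Node root{1, {&leaf}};
  g_old_root = &root;
  assert(find_old_node(2) == &leaf);
  g_old_root = nullptr;
  assert(find_old_node(2) == nullptr);  // warning printed, no crash
}
```

The design point is that a function meant to be called interactively from a debugger reports the problem and returns a harmless value, so the session survives a call made at the wrong time.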
------------- PR: https://git.openjdk.java.net/jdk/pull/8567 From duke at openjdk.java.net Mon May 9 14:07:48 2022 From: duke at openjdk.java.net (Emanuel Peter) Date: Mon, 9 May 2022 14:07:48 GMT Subject: RFR: 8286179: Node::find(int) should not traverse from new to old nodes [v2] In-Reply-To: References: Message-ID: On Mon, 9 May 2022 13:04:03 GMT, Christian Hagedorn wrote: >> Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: >> >> added nullprt check for old_root > > src/hotspot/share/opto/node.cpp line 1617: > >> 1615: Node* find_old_node(const int idx) { >> 1616: Node* root = old_root(); >> 1617: assert(root != nullptr, "must have old_root() to find old nodes"); > > I think it's better to avoid assertions here and below and do nothing instead (you already print a warning in `old_root()` which is fine I think) since it would crash and stop the current debugging session. Ok, personally verified with @chhagedorn , his idea was to return `nullptr`, will do that now. ------------- PR: https://git.openjdk.java.net/jdk/pull/8567 From duke at openjdk.java.net Mon May 9 14:11:33 2022 From: duke at openjdk.java.net (Emanuel Peter) Date: Mon, 9 May 2022 14:11:33 GMT Subject: RFR: 8286179: Node::find(int) should not traverse from new to old nodes [v3] In-Reply-To: References: Message-ID: > **Problem:** > `Node::find` traverses input and output edges of nodes during its BFS, and searches for nodes with a specific `idx`. > However, if `ASSERT` is on, it also traverses `debug_orig`. This not only seems unnecessary. But Mach nodes (after matching) point back to the old IR nodes. This means we traverse from the new graph to the old graph, and potentially find multiple nodes matching the `idx`. Only the last found will be returned, sometimes this happens to be the new node, sometimes the old node. This is inconsistent and can be quite annoying during debugging. > > **Implemented Solution:** > 1. 
Remove traversing `debug_orig`. > 2. Instead, add debug only functions `old_root`, which finds the old root if it exists. Question: I now put a warning in if the `old_root` cannot be found. I think this is helpful for in the debugger. I could make it an assert if that is preferred. > 3. `find_node` and `find_ctrl` only search in new nodes now (start BFS at new root). > 4. Added `find_old_node` and `find_old_ctrl`, which search in new nodes (start BFS at old root). > > I hope this improves your debugging experience. > [running sanity tests to see it doesn't break something] Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: converted nullptr asserts in nullptr returns ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/8567/files - new: https://git.openjdk.java.net/jdk/pull/8567/files/de749d57..67cc9d1d Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=8567&range=02 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=8567&range=01-02 Stats: 4 lines in 1 file changed: 0 ins; 2 del; 2 mod Patch: https://git.openjdk.java.net/jdk/pull/8567.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8567/head:pull/8567 PR: https://git.openjdk.java.net/jdk/pull/8567 From chagedorn at openjdk.java.net Mon May 9 14:20:25 2022 From: chagedorn at openjdk.java.net (Christian Hagedorn) Date: Mon, 9 May 2022 14:20:25 GMT Subject: RFR: 8286179: Node::find(int) should not traverse from new to old nodes [v2] In-Reply-To: References: Message-ID: <9oe24LNRirlZSaY6hkDGMvD6kcPGjDHEvz_V7Jh-iMc=.3a3f1667-66b1-422c-aa82-5d2a0ca722bd@github.com> On Mon, 9 May 2022 14:04:08 GMT, Emanuel Peter wrote: >> src/hotspot/share/opto/node.cpp line 1617: >> >>> 1615: Node* find_old_node(const int idx) { >>> 1616: Node* root = old_root(); >>> 1617: assert(root != nullptr, "must have old_root() to find old nodes"); >> >> I think it's better to avoid assertions here and below and do nothing instead (you already 
print a warning in `old_root()` which is fine I think) since it would crash and stop the current debugging session. > > Ok, personally verified with @chhagedorn , his idea was to return `nullptr`, will do that now. I did not know about `set unwindonsignal on`, that's good trick! I agree that you can still trigger assertions while debugging but I think we should try to reduce that chance as much as we can with sane input values. ------------- PR: https://git.openjdk.java.net/jdk/pull/8567 From chagedorn at openjdk.java.net Mon May 9 14:20:22 2022 From: chagedorn at openjdk.java.net (Christian Hagedorn) Date: Mon, 9 May 2022 14:20:22 GMT Subject: RFR: 8286179: Node::find(int) should not traverse from new to old nodes [v3] In-Reply-To: References: Message-ID: <3t_cy89LqexcRTeGmB3o3LBJFjNsNSpuLVaw0IEMBWE=.cc7c1eb2-4169-4abf-ad61-b39dda01f69a@github.com> On Mon, 9 May 2022 14:11:33 GMT, Emanuel Peter wrote: >> **Problem:** >> `Node::find` traverses input and output edges of nodes during its BFS, and searches for nodes with a specific `idx`. >> However, if `ASSERT` is on, it also traverses `debug_orig`. This not only seems unnecessary. But Mach nodes (after matching) point back to the old IR nodes. This means we traverse from the new graph to the old graph, and potentially find multiple nodes matching the `idx`. Only the last found will be returned, sometimes this happens to be the new node, sometimes the old node. This is inconsistent and can be quite annoying during debugging. >> >> **Implemented Solution:** >> 1. Remove traversing `debug_orig`. >> 2. Instead, add debug only functions `old_root`, which finds the old root if it exists. Question: I now put a warning in if the `old_root` cannot be found. I think this is helpful for in the debugger. I could make it an assert if that is preferred. >> 3. `find_node` and `find_ctrl` only search in new nodes now (start BFS at new root). >> 4. 
Added `find_old_node` and `find_old_ctrl`, which search in new nodes (start BFS at old root). >> >> I hope this improves your debugging experience. >> [running sanity tests to see it doesn't break something] > > Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: > > converted nullptr asserts in nullptr returns That looks good, thanks for doing the updates! ------------- Marked as reviewed by chagedorn (Reviewer). PR: https://git.openjdk.java.net/jdk/pull/8567 From kvn at openjdk.java.net Mon May 9 14:20:22 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Mon, 9 May 2022 14:20:22 GMT Subject: RFR: 8286179: Node::find(int) should not traverse from new to old nodes [v3] In-Reply-To: References: Message-ID: On Mon, 9 May 2022 14:11:33 GMT, Emanuel Peter wrote: >> **Problem:** >> `Node::find` traverses input and output edges of nodes during its BFS, and searches for nodes with a specific `idx`. >> However, if `ASSERT` is on, it also traverses `debug_orig`. This not only seems unnecessary. But Mach nodes (after matching) point back to the old IR nodes. This means we traverse from the new graph to the old graph, and potentially find multiple nodes matching the `idx`. Only the last found will be returned, sometimes this happens to be the new node, sometimes the old node. This is inconsistent and can be quite annoying during debugging. >> >> **Implemented Solution:** >> 1. Remove traversing `debug_orig`. >> 2. Instead, add debug only functions `old_root`, which finds the old root if it exists. Question: I now put a warning in if the `old_root` cannot be found. I think this is helpful for in the debugger. I could make it an assert if that is preferred. >> 3. `find_node` and `find_ctrl` only search in new nodes now (start BFS at new root). >> 4. Added `find_old_node` and `find_old_ctrl`, which search in new nodes (start BFS at old root). >> >> I hope this improves your debugging experience. 
>> [running sanity tests to see it doesn't break something] > > Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: > > converted nullptr asserts in nullptr returns Sorry I was not clear. I asked for nullptr check you have in final version and not assert. It is wrong to have an assert in interactive function used in debugger. ------------- Marked as reviewed by kvn (Reviewer). PR: https://git.openjdk.java.net/jdk/pull/8567 From duke at openjdk.java.net Mon May 9 14:40:02 2022 From: duke at openjdk.java.net (limck599) Date: Mon, 9 May 2022 14:40:02 GMT Subject: RFR: 8286179: Node::find(int) should not traverse from new to old nodes [v3] In-Reply-To: References: Message-ID: On Mon, 9 May 2022 14:11:33 GMT, Emanuel Peter wrote: >> **Problem:** >> `Node::find` traverses input and output edges of nodes during its BFS, and searches for nodes with a specific `idx`. >> However, if `ASSERT` is on, it also traverses `debug_orig`. This not only seems unnecessary. But Mach nodes (after matching) point back to the old IR nodes. This means we traverse from the new graph to the old graph, and potentially find multiple nodes matching the `idx`. Only the last found will be returned, sometimes this happens to be the new node, sometimes the old node. This is inconsistent and can be quite annoying during debugging. >> >> **Implemented Solution:** >> 1. Remove traversing `debug_orig`. >> 2. Instead, add debug only functions `old_root`, which finds the old root if it exists. Question: I now put a warning in if the `old_root` cannot be found. I think this is helpful for in the debugger. I could make it an assert if that is preferred. >> 3. `find_node` and `find_ctrl` only search in new nodes now (start BFS at new root). >> 4. Added `find_old_node` and `find_old_ctrl`, which search in new nodes (start BFS at old root). >> >> I hope this improves your debugging experience. 
>> [running sanity tests to see it doesn't break something] > > Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: > > converted nullptr asserts in nullptr returns Marked as reviewed by limck599 at github.com (no known OpenJDK username). ------------- PR: https://git.openjdk.java.net/jdk/pull/8567 From duke at openjdk.java.net Mon May 9 15:28:50 2022 From: duke at openjdk.java.net (Emanuel Peter) Date: Mon, 9 May 2022 15:28:50 GMT Subject: RFR: 8286179: Node::find(int) should not traverse from new to old nodes [v3] In-Reply-To: References: Message-ID: On Mon, 9 May 2022 14:17:11 GMT, Vladimir Kozlov wrote: >> Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: >> >> converted nullptr asserts in nullptr returns > > Sorry I was not clear. I asked for nullptr check you have in final version and not assert. It is wrong to have an assert in interactive function used in debugger. Thanks @vnkozlov @chhagedorn for the reviews and comments. Thanks @TobiHartmann for the conversations leading to this solution. ------------- PR: https://git.openjdk.java.net/jdk/pull/8567 From mdoerr at openjdk.java.net Mon May 9 16:15:50 2022 From: mdoerr at openjdk.java.net (Martin Doerr) Date: Mon, 9 May 2022 16:15:50 GMT Subject: RFR: 8285733: [s390] Vector Instruction Emitters for element-wise access are broken In-Reply-To: References: Message-ID: On Wed, 4 May 2022 14:42:27 GMT, Lutz Schmidt wrote: > Please review this rather simple pull request. It fixes some vector instruction emitters. The bugs had gone unnoticed so far because the emitters had not been used. Therefore, the fix bears no risk. > > Testing was performed with new code currently under development. Fix LGTM. I guess this platform is no longer usable after Loom integration. We may need to test it in an updates release. ------------- Marked as reviewed by mdoerr (Reviewer). 
PR: https://git.openjdk.java.net/jdk/pull/8537 From kvn at openjdk.java.net Mon May 9 17:17:48 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Mon, 9 May 2022 17:17:48 GMT Subject: RFR: 8286125: C2: "bad AD file" with PopulateIndex on x86_64 In-Reply-To: References: Message-ID: On Sat, 7 May 2022 13:23:54 GMT, Pengfei Li wrote: > A fuzzer test reports an assertion failure issue with PopulateIndexNode > on x86_64. It can be reproduced by the new jtreg case inside this patch. > Root cause is that C2 superword creates a PopulateIndexNode by mistake > while vectorizing below loop. > > for (int i = 304; i > 15; i -= 3) { > int c = 16; > do { > for (int t = 1; t < 1; t++) {} > arr[c + 1] >>= i; > } while (--c > 0); > } > > This is a corner loop case with redundant code inside. After several C2 > optimizations, the do-while loop inside is unrolled and then isomorphic > right shift statements can be combined in the superword optimization. > Since all shift counts are the same loop IV value `i`, superword should > generate a RShiftCntVNode to create a vector of scalar replications of > the loop IV. But after JDK-8280510, a PopulateIndexNode is generated by > mistake because of the `opd == iv()` condition. > > To fix this, we add a `have_same_inputs` condition here checking if all > inputs at position `opd_idx` of nodes in the pack are the same. If true, > C2 code should NOT run into this block to generate a PopulateIndexNode. > Instead, it should run into the next block for scalar replications. > > Additionally, only adding this condition here is still not good enough > because it breaks the experimental post loop vectorization. As in post > loops, all packs are singleton, i.e., `have_same_inputs` is always true. > Hence, we also add a pack size check here to make post loop logic run > into this block. It's safe to let it go because post loop never needs > scalar replications of the loop IV - it never combines nodes in packs. 
> > We also add two more assertions in the code. > > Jtreg hotspot::hotspot_all_no_apps, jdk::tier1~3 and langtools::tier1 > are tested and no issue is found. Seems reasonable. Let me test it. Meanwhile, please, update it to latest JDK. ------------- PR: https://git.openjdk.java.net/jdk/pull/8587 From dcubed at openjdk.java.net Mon May 9 21:30:10 2022 From: dcubed at openjdk.java.net (Daniel D.Daugherty) Date: Mon, 9 May 2022 21:30:10 GMT Subject: RFR: 8286442: ProblemList compiler/c2/irTests/TestSkeletonPredicates.java in -Xcomp mode Message-ID: A trivial fix to ProblemList compiler/c2/irTests/TestSkeletonPredicates.java in -Xcomp mode. ------------- Commit messages: - 8286442: ProblemList compiler/c2/irTests/TestSkeletonPredicates.java in -Xcomp mode Changes: https://git.openjdk.java.net/jdk/pull/8612/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=8612&range=00 Issue: https://bugs.openjdk.java.net/browse/JDK-8286442 Stats: 2 lines in 1 file changed: 2 ins; 0 del; 0 mod Patch: https://git.openjdk.java.net/jdk/pull/8612.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8612/head:pull/8612 PR: https://git.openjdk.java.net/jdk/pull/8612 From ctornqvi at openjdk.java.net Mon May 9 21:32:16 2022 From: ctornqvi at openjdk.java.net (Christian Tornqvist) Date: Mon, 9 May 2022 21:32:16 GMT Subject: RFR: 8286442: ProblemList compiler/c2/irTests/TestSkeletonPredicates.java in -Xcomp mode In-Reply-To: References: Message-ID: <9tq4e18r_G__6IPrgDpvtzEAosFI77JuqeIUddo5tqQ=.ee72c7cd-67e1-4944-beb8-06df1f5d8964@github.com> On Mon, 9 May 2022 21:23:12 GMT, Daniel D. Daugherty wrote: > A trivial fix to ProblemList compiler/c2/irTests/TestSkeletonPredicates.java in -Xcomp mode. Marked as reviewed by ctornqvi (Reviewer). 
------------- PR: https://git.openjdk.java.net/jdk/pull/8612 From dcubed at openjdk.java.net Mon May 9 21:37:54 2022 From: dcubed at openjdk.java.net (Daniel D.Daugherty) Date: Mon, 9 May 2022 21:37:54 GMT Subject: RFR: 8286442: ProblemList compiler/c2/irTests/TestSkeletonPredicates.java in -Xcomp mode In-Reply-To: <9tq4e18r_G__6IPrgDpvtzEAosFI77JuqeIUddo5tqQ=.ee72c7cd-67e1-4944-beb8-06df1f5d8964@github.com> References: <9tq4e18r_G__6IPrgDpvtzEAosFI77JuqeIUddo5tqQ=.ee72c7cd-67e1-4944-beb8-06df1f5d8964@github.com> Message-ID: On Mon, 9 May 2022 21:29:59 GMT, Christian Tornqvist wrote: >> A trivial fix to ProblemList compiler/c2/irTests/TestSkeletonPredicates.java in -Xcomp mode. > > Marked as reviewed by ctornqvi (Reviewer). @ctornqvi - Thanks for the fast review! ------------- PR: https://git.openjdk.java.net/jdk/pull/8612 From dcubed at openjdk.java.net Mon May 9 21:37:56 2022 From: dcubed at openjdk.java.net (Daniel D.Daugherty) Date: Mon, 9 May 2022 21:37:56 GMT Subject: Integrated: 8286442: ProblemList compiler/c2/irTests/TestSkeletonPredicates.java in -Xcomp mode In-Reply-To: References: Message-ID: On Mon, 9 May 2022 21:23:12 GMT, Daniel D. Daugherty wrote: > A trivial fix to ProblemList compiler/c2/irTests/TestSkeletonPredicates.java in -Xcomp mode. This pull request has now been integrated. Changeset: c28a6361 Author: Daniel D. 
Daugherty URL: https://git.openjdk.java.net/jdk/commit/c28a63617dd64e009df8b548d58d2dd72579a3ad Stats: 2 lines in 1 file changed: 2 ins; 0 del; 0 mod 8286442: ProblemList compiler/c2/irTests/TestSkeletonPredicates.java in -Xcomp mode Reviewed-by: ctornqvi ------------- PR: https://git.openjdk.java.net/jdk/pull/8612 From psandoz at openjdk.java.net Mon May 9 21:58:54 2022 From: psandoz at openjdk.java.net (Paul Sandoz) Date: Mon, 9 May 2022 21:58:54 GMT Subject: RFR: 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature [v3] In-Reply-To: References: Message-ID: On Thu, 5 May 2022 08:56:07 GMT, Xiaohong Gong wrote: >> Currently the vector load with mask when the given index happens out of the array boundary is implemented with pure java scalar code to avoid the IOOBE (IndexOutOfBoundaryException). This is necessary for architectures that do not support the predicate feature. Because the masked load is implemented with a full vector load and a vector blend applied on it. And a full vector load will definitely cause the IOOBE which is not valid. However, for architectures that support the predicate feature like SVE/AVX-512/RVV, it can be vectorized with the predicated load instruction as long as the indexes of the masked lanes are within the bounds of the array. For these architectures, loading with unmasked lanes does not raise exception. >> >> This patch adds the vectorization support for the masked load with IOOBE part. 
Please see the original Java implementation (FIXME: optimize):
>>
>>     @ForceInline
>>     public static
>>     ByteVector fromArray(VectorSpecies<Byte> species,
>>                          byte[] a, int offset,
>>                          VectorMask<Byte> m) {
>>         ByteSpecies vsp = (ByteSpecies) species;
>>         if (offset >= 0 && offset <= (a.length - species.length())) {
>>             return vsp.dummyVector().fromArray0(a, offset, m);
>>         }
>>
>>         // FIXME: optimize
>>         checkMaskFromIndexSize(offset, vsp, m, 1, a.length);
>>         return vsp.vOp(m, i -> a[offset + i]);
>>     }
>>
>> Since it can only be vectorized with the predicated load, HotSpot must check whether the current backend supports it and fall back to the Java scalar version if not. This differs from the normal masked vector load, for which the compiler generates a full vector load and a vector blend if the predicated load is not supported. So, to let the compiler take the expected action, an additional flag (i.e. `usePred`) is added to the existing "loadMasked" intrinsic, with the value "true" for the IOOBE part and "false" for the normal load. The compiler will fail to intrinsify if the flag is "true" and the predicated load is not supported by the backend, which means that the normal Java path will be executed.
>>
>> Also adds the same vectorization support for masked:
>>  - fromByteArray/fromByteBuffer
>>  - fromBooleanArray
>>  - fromCharArray
>>
>> The performance of the newly added benchmarks improves by about `1.88x ~ 30.26x` on the x86 AVX-512 system:
>>
>> Benchmark                                           Before     After      Units
>> LoadMaskedIOOBEBenchmark.byteLoadArrayMaskIOOBE     737.542    1387.069   ops/ms
>> LoadMaskedIOOBEBenchmark.doubleLoadArrayMaskIOOBE   118.366    330.776    ops/ms
>> LoadMaskedIOOBEBenchmark.floatLoadArrayMaskIOOBE    233.832    6125.026   ops/ms
>> LoadMaskedIOOBEBenchmark.intLoadArrayMaskIOOBE      233.816    7075.923   ops/ms
>> LoadMaskedIOOBEBenchmark.longLoadArrayMaskIOOBE     119.771    330.587    ops/ms
>> LoadMaskedIOOBEBenchmark.shortLoadArrayMaskIOOBE    431.961    939.301    ops/ms
>>
>> A similar performance gain can also be observed on a 512-bit SVE system.
>
> Xiaohong Gong has updated the pull request incrementally with one additional commit since the last revision:
>
>   Rename "use_predicate" to "needs_predicate"

I modified the code of this PR to avoid the conversion of `boolean` to `int`, and the masked load is made intrinsic from the method at which the constants are passed as arguments, i.e. the public `fromArray` method that accepts a mask.

-------------

PR: https://git.openjdk.java.net/jdk/pull/8035

From kvn at openjdk.java.net Mon May 9 23:16:54 2022
From: kvn at openjdk.java.net (Vladimir Kozlov)
Date: Mon, 9 May 2022 23:16:54 GMT
Subject: RFR: 8286125: C2: "bad AD file" with PopulateIndex on x86_64
In-Reply-To: References: Message-ID:

On Sat, 7 May 2022 13:23:54 GMT, Pengfei Li wrote:

> A fuzzer test reports an assertion failure issue with PopulateIndexNode
> on x86_64. It can be reproduced by the new jtreg case inside this patch.
> Root cause is that C2 superword creates a PopulateIndexNode by mistake
> while vectorizing below loop.
> > for (int i = 304; i > 15; i -= 3) { > int c = 16; > do { > for (int t = 1; t < 1; t++) {} > arr[c + 1] >>= i; > } while (--c > 0); > } > > This is a corner loop case with redundant code inside. After several C2 > optimizations, the do-while loop inside is unrolled and then isomorphic > right shift statements can be combined in the superword optimization. > Since all shift counts are the same loop IV value `i`, superword should > generate a RShiftCntVNode to create a vector of scalar replications of > the loop IV. But after JDK-8280510, a PopulateIndexNode is generated by > mistake because of the `opd == iv()` condition. > > To fix this, we add a `have_same_inputs` condition here checking if all > inputs at position `opd_idx` of nodes in the pack are the same. If true, > C2 code should NOT run into this block to generate a PopulateIndexNode. > Instead, it should run into the next block for scalar replications. > > Additionally, only adding this condition here is still not good enough > because it breaks the experimental post loop vectorization. As in post > loops, all packs are singleton, i.e., `have_same_inputs` is always true. > Hence, we also add a pack size check here to make post loop logic run > into this block. It's safe to let it go because post loop never needs > scalar replications of the loop IV - it never combines nodes in packs. > > We also add two more assertions in the code. > > Jtreg hotspot::hotspot_all_no_apps, jdk::tier1~3 and langtools::tier1 > are tested and no issue is found. My tier1-4 testing passed clean. ------------- Marked as reviewed by kvn (Reviewer). 
PR: https://git.openjdk.java.net/jdk/pull/8587

From kvn at openjdk.java.net Tue May 10 00:52:36 2022
From: kvn at openjdk.java.net (Vladimir Kozlov)
Date: Tue, 10 May 2022 00:52:36 GMT
Subject: RFR: 8282024: add EscapeAnalysis statistics under PrintOptoStatistics [v8]
In-Reply-To:
References:
Message-ID:

On Wed, 4 May 2022 01:36:13 GMT, aamarsh wrote:

>> Escape Analysis and Scalar Replacement statistics were added when the -XX:+PrintOptoStatistics flag is set. All code is placed in an `#ifndef PRODUCT` block, so this code is only present in a debug build. Using the Renaissance benchmark I ran a few tests to confirm that the numbers were printing correctly. Below is an example run:
>>
>>
>> No escape = 372, Arg escape = 74, Global escape = 1855 (EA executed in 10.49 seconds)
>> Objects scalar replaced = 240, Monitor objects removed = 44, GC barriers removed = 37, Memory barriers removed = 284
>
> aamarsh has refreshed the contents of this pull request, and previous commits have been removed. The incremental views will show differences compared to the previous content of the PR. The pull request contains one new commit since the last revision:
>
>   adding escape analysis and scalar replacement statistics

I have a few comments.

src/hotspot/share/opto/compile.cpp line 2217:

> 2215:     Atomic::add(&ConnectionGraph::_arg_escape_counter, _local_arg_escape_ctr);
> 2216:     Atomic::add(&ConnectionGraph::_global_escape_counter, _local_global_escape_ctr);
> 2217: #endif

These should be done inside `ConnectionGraph` - don't expose EA counters to another class. You can use a static method in `ConnectionGraph` to do that. `_no_escape_counter, _local_no_escape_ctr + total_scalar_replaced` is wrong. You are doubling the number because `total_scalar_replaced` is part of `_local_no_escape_ctr`. Keep these numbers separate. Also `mexp._local_scalar_replaced` could be updated later during the `PhaseMacroExpand::expand_macro_nodes()` call after loop optimizations.
And such collection is not accurate (over-counted) due to EA iterations - each iteration may add the same numbers. That could be fine if you say so in comments so people know.

src/hotspot/share/opto/escape.cpp line 128:

> 126:   // casting jlong to long since Atomic needs Integral type
> 127:   Atomic::add(&ConnectionGraph::_time_elapsed, (long)et.milliseconds());
> 128: #endif

You should check the `PrintOptoStatistics` flag.

src/hotspot/share/opto/escape.cpp line 248:

> 246: #ifndef PRODUCT
> 247:   escape_state_statistics(java_objects_worklist);
> 248: #endif

You can use the `NOT_PRODUCT()` macro for one-line cases, of which you have a lot in these changes.

src/hotspot/share/opto/escape.cpp line 3781:

> 3779: }
> 3780:
> 3781: void ConnectionGraph::escape_state_statistics(GrowableArray& java_objects_worklist) {

You should check the `PrintOptoStatistics` flag to avoid useless work.

src/hotspot/share/opto/escape.cpp line 3784:

> 3782:   _compile->_local_no_escape_ctr = 0;
> 3783:   _compile->_local_arg_escape_ctr = 0;
> 3784:   _compile->_local_global_escape_ctr = 0;

I don't think you need these `Compile` counters - make them local and update the corresponding `ConnectionGraph` counters after the loop. As I said, these counters are not accurate anyway. Unless you want to track which counter was recorded for which allocation, which I think would be overkill because most EA is done in one iteration.

src/hotspot/share/opto/macro.hpp line 216:

> 214:   static int _GC_barriers_removed_counter;
> 215:   static int _memory_barriers_removed_counter;
> 216:   int _local_scalar_replaced;

You don't need the `_local_scalar_replaced` value, as I said in another comment.

-------------

Changes requested by kvn (Reviewer).
PR: https://git.openjdk.java.net/jdk/pull/8019

From kvn at openjdk.java.net Tue May 10 00:52:37 2022
From: kvn at openjdk.java.net (Vladimir Kozlov)
Date: Tue, 10 May 2022 00:52:37 GMT
Subject: RFR: 8282024: add EscapeAnalysis statistics under PrintOptoStatistics [v7]
In-Reply-To:
References: <3_nhaxzU2R-tZYRIUFq0qIxVpy0KX0ilkVpvCekM5zE=.2fb159c1-d2ed-4d31-98c8-e0a49fb59937@github.com>
Message-ID:

On Thu, 5 May 2022 19:10:56 GMT, Xin Liu wrote:

>> okay. you're right. This is BFS but no element is popped.
>> I also ran jtreg twice with JTREG="VM_OPTIONS=-XX:+PrintOptoStatistics", it's safe.
>
> nits:
> This function seems not to be part of 'PhaseMacroExpand', at least it could be a static member function.

You should check the `PrintOptoStatistics` flag.

-------------

PR: https://git.openjdk.java.net/jdk/pull/8019

From xgong at openjdk.java.net Tue May 10 01:20:40 2022
From: xgong at openjdk.java.net (Xiaohong Gong)
Date: Tue, 10 May 2022 01:20:40 GMT
Subject: RFR: 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature [v3]
In-Reply-To:
References:
Message-ID: <_QEcTANm1mniixGLtt_oJ7O97TbPRriNKizi6MCViiM=.70b10358-a8ae-440a-a3a2-ff0fefd3d0b3@github.com>

On Mon, 9 May 2022 21:55:27 GMT, Paul Sandoz wrote:

> I modified the code of this PR to avoid the conversion of `boolean` to `int`, so a constant integer value is passed all the way through, and the masked load is made intrinsic from the method at which the constants are passed as arguments i.e. the public `fromArray` mask accepting method.

Great and thanks! Could you please show me the changes or an example? I can push the changes to this PR. Thanks so much!
------------- PR: https://git.openjdk.java.net/jdk/pull/8035 From pli at openjdk.java.net Tue May 10 05:09:44 2022 From: pli at openjdk.java.net (Pengfei Li) Date: Tue, 10 May 2022 05:09:44 GMT Subject: RFR: 8286125: C2: "bad AD file" with PopulateIndex on x86_64 [v2] In-Reply-To: References: Message-ID: <1AlUgFjqB8hrqZTsbpIhxcsnx-pt2i7z0nw9pkbPK8k=.325ab5ed-dec3-4771-ab51-7285e162a9d0@github.com> > A fuzzer test reports an assertion failure issue with PopulateIndexNode > on x86_64. It can be reproduced by the new jtreg case inside this patch. > Root cause is that C2 superword creates a PopulateIndexNode by mistake > while vectorizing below loop. > > for (int i = 304; i > 15; i -= 3) { > int c = 16; > do { > for (int t = 1; t < 1; t++) {} > arr[c + 1] >>= i; > } while (--c > 0); > } > > This is a corner loop case with redundant code inside. After several C2 > optimizations, the do-while loop inside is unrolled and then isomorphic > right shift statements can be combined in the superword optimization. > Since all shift counts are the same loop IV value `i`, superword should > generate a RShiftCntVNode to create a vector of scalar replications of > the loop IV. But after JDK-8280510, a PopulateIndexNode is generated by > mistake because of the `opd == iv()` condition. > > To fix this, we add a `have_same_inputs` condition here checking if all > inputs at position `opd_idx` of nodes in the pack are the same. If true, > C2 code should NOT run into this block to generate a PopulateIndexNode. > Instead, it should run into the next block for scalar replications. > > Additionally, only adding this condition here is still not good enough > because it breaks the experimental post loop vectorization. As in post > loops, all packs are singleton, i.e., `have_same_inputs` is always true. > Hence, we also add a pack size check here to make post loop logic run > into this block. 
It's safe to let it go because post loop never needs > scalar replications of the loop IV - it never combines nodes in packs. > > We also add two more assertions in the code. > > Jtreg hotspot::hotspot_all_no_apps, jdk::tier1~3 and langtools::tier1 > are tested and no issue is found. Pengfei Li has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains two additional commits since the last revision: - Merge branch 'master' into slpfix - 8286125: C2: "bad AD file" with PopulateIndex on x86_64 A fuzzer test reports an assertion failure issue with PopulateIndexNode on x86_64. It can be reproduced by the new jtreg case inside this patch. Root cause is that C2 superword creates a PopulateIndexNode by mistake while vectorizing below loop. for (int i = 304; i > 15; i -= 3) { int c = 16; do { for (int t = 1; t < 1; t++) {} arr[c + 1] >>= i; } while (--c > 0); } This is a corner loop case with redundant code inside. After several C2 optimizations, the do-while loop inside is unrolled and then isomorphic right shift statements can be combined in the superword optimization. Since all shift counts are the same loop IV value `i`, superword should generate a RShiftCntVNode to create a vector of scalar replications of the loop IV. But after JDK-8280510, a PopulateIndexNode is generated by mistake because of the `opd == iv()` condition. To fix this, we add a `have_same_inputs` condition here checking if all inputs at position `opd_idx` of nodes in the pack are the same. If true, C2 code should NOT run into this block to generate a PopulateIndexNode. Instead, it should run into the next block for scalar replications. Additionally, only adding this condition here is still not good enough because it breaks the experimental post loop vectorization. As in post loops, all packs are singleton, i.e., `have_same_inputs` is always true. 
Hence, we also add a pack size check here to make post loop logic run into this block. It's safe to let it go because post loop never needs scalar replications of the loop IV - it never combines nodes in packs. We also add two more assertions in the code. Jtreg hotspot::hotspot_all_no_apps, jdk::tier1~3 and langtools::tier1 are tested and no issue is found. ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/8587/files - new: https://git.openjdk.java.net/jdk/pull/8587/files/e91eb0bf..3af03a80 Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=8587&range=01 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=8587&range=00-01 Stats: 101523 lines in 1186 files changed: 92290 ins; 4348 del; 4885 mod Patch: https://git.openjdk.java.net/jdk/pull/8587.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8587/head:pull/8587 PR: https://git.openjdk.java.net/jdk/pull/8587 From pli at openjdk.java.net Tue May 10 05:09:44 2022 From: pli at openjdk.java.net (Pengfei Li) Date: Tue, 10 May 2022 05:09:44 GMT Subject: RFR: 8286125: C2: "bad AD file" with PopulateIndex on x86_64 In-Reply-To: References: Message-ID: On Mon, 9 May 2022 17:13:57 GMT, Vladimir Kozlov wrote: > Meanwhile, please, update it to latest JDK. Thanks @vnkozlov for looking at this. Patch is rebased to JDK master. ------------- PR: https://git.openjdk.java.net/jdk/pull/8587 From kvn at openjdk.java.net Tue May 10 05:45:43 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Tue, 10 May 2022 05:45:43 GMT Subject: RFR: 8286125: C2: "bad AD file" with PopulateIndex on x86_64 [v2] In-Reply-To: <1AlUgFjqB8hrqZTsbpIhxcsnx-pt2i7z0nw9pkbPK8k=.325ab5ed-dec3-4771-ab51-7285e162a9d0@github.com> References: <1AlUgFjqB8hrqZTsbpIhxcsnx-pt2i7z0nw9pkbPK8k=.325ab5ed-dec3-4771-ab51-7285e162a9d0@github.com> Message-ID: On Tue, 10 May 2022 05:09:44 GMT, Pengfei Li wrote: >> A fuzzer test reports an assertion failure issue with PopulateIndexNode >> on x86_64. 
It can be reproduced by the new jtreg case inside this patch. >> Root cause is that C2 superword creates a PopulateIndexNode by mistake >> while vectorizing below loop. >> >> for (int i = 304; i > 15; i -= 3) { >> int c = 16; >> do { >> for (int t = 1; t < 1; t++) {} >> arr[c + 1] >>= i; >> } while (--c > 0); >> } >> >> This is a corner loop case with redundant code inside. After several C2 >> optimizations, the do-while loop inside is unrolled and then isomorphic >> right shift statements can be combined in the superword optimization. >> Since all shift counts are the same loop IV value `i`, superword should >> generate a RShiftCntVNode to create a vector of scalar replications of >> the loop IV. But after JDK-8280510, a PopulateIndexNode is generated by >> mistake because of the `opd == iv()` condition. >> >> To fix this, we add a `have_same_inputs` condition here checking if all >> inputs at position `opd_idx` of nodes in the pack are the same. If true, >> C2 code should NOT run into this block to generate a PopulateIndexNode. >> Instead, it should run into the next block for scalar replications. >> >> Additionally, only adding this condition here is still not good enough >> because it breaks the experimental post loop vectorization. As in post >> loops, all packs are singleton, i.e., `have_same_inputs` is always true. >> Hence, we also add a pack size check here to make post loop logic run >> into this block. It's safe to let it go because post loop never needs >> scalar replications of the loop IV - it never combines nodes in packs. >> >> We also add two more assertions in the code. >> >> Jtreg hotspot::hotspot_all_no_apps, jdk::tier1~3 and langtools::tier1 >> are tested and no issue is found. > > Pengfei Li has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. 
The pull request contains two additional commits since the last revision: > > - Merge branch 'master' into slpfix > - 8286125: C2: "bad AD file" with PopulateIndex on x86_64 > > A fuzzer test reports an assertion failure issue with PopulateIndexNode > on x86_64. It can be reproduced by the new jtreg case inside this patch. > Root cause is that C2 superword creates a PopulateIndexNode by mistake > while vectorizing below loop. > > for (int i = 304; i > 15; i -= 3) { > int c = 16; > do { > for (int t = 1; t < 1; t++) {} > arr[c + 1] >>= i; > } while (--c > 0); > } > > This is a corner loop case with redundant code inside. After several C2 > optimizations, the do-while loop inside is unrolled and then isomorphic > right shift statements can be combined in the superword optimization. > Since all shift counts are the same loop IV value `i`, superword should > generate a RShiftCntVNode to create a vector of scalar replications of > the loop IV. But after JDK-8280510, a PopulateIndexNode is generated by > mistake because of the `opd == iv()` condition. > > To fix this, we add a `have_same_inputs` condition here checking if all > inputs at position `opd_idx` of nodes in the pack are the same. If true, > C2 code should NOT run into this block to generate a PopulateIndexNode. > Instead, it should run into the next block for scalar replications. > > Additionally, only adding this condition here is still not good enough > because it breaks the experimental post loop vectorization. As in post > loops, all packs are singleton, i.e., `have_same_inputs` is always true. > Hence, we also add a pack size check here to make post loop logic run > into this block. It's safe to let it go because post loop never needs > scalar replications of the loop IV - it never combines nodes in packs. > > We also add two more assertions in the code. > > Jtreg hotspot::hotspot_all_no_apps, jdk::tier1~3 and langtools::tier1 > are tested and no issue is found. You need second review. It is not trivial. 
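
The condition described in the quoted text can be modeled as a small standalone sketch. This is a toy under stated assumptions, not the C2 code: nodes are represented as `int[]` arrays of input ids, `PackOperandSketch`, `haveSameInputs` and `choose` are hypothetical names, and the exact boolean expression is one reading of the description (`opd == iv()` combined with "inputs differ, or the pack is a singleton").

```java
import java.util.List;

class PackOperandSketch {
    enum Choice { POPULATE_INDEX, REPLICATE, PACK }

    // True if every node in the pack has the same input id at position opdIdx.
    static boolean haveSameInputs(List<int[]> pack, int opdIdx) {
        int first = pack.get(0)[opdIdx];
        for (int[] node : pack) {
            if (node[opdIdx] != first) {
                return false;
            }
        }
        return true;
    }

    // Decide how the vector operand at opdIdx would be built for this pack.
    // ivId is the (hypothetical) id of the loop induction variable.
    static Choice choose(List<int[]> pack, int opdIdx, int ivId) {
        int opd = pack.get(0)[opdIdx];
        boolean same = haveSameInputs(pack, opdIdx);

        // Index population only when the first operand is the IV and either
        // the operands differ across the pack (iv, iv+1, ...) or the pack is
        // a singleton, as in post loops.
        if (opd == ivId && (!same || pack.size() == 1)) {
            return Choice.POPULATE_INDEX;
        }
        // All nodes use the same scalar, e.g. a shift count equal to the IV:
        // replicate that scalar across the vector lanes.
        if (same) {
            return Choice.REPLICATE;
        }
        // Otherwise the distinct scalars have to be packed lane by lane.
        return Choice.PACK;
    }
}
```

With the loop from the bug report, every unrolled `>>= i` has the IV as its shift count, so the pack has more than one node with identical inputs and this model selects replication rather than index population.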
-------------

PR: https://git.openjdk.java.net/jdk/pull/8587

From rcastanedalo at openjdk.java.net Tue May 10 07:27:18 2022
From: rcastanedalo at openjdk.java.net (Roberto Castañeda Lozano)
Date: Tue, 10 May 2022 07:27:18 GMT
Subject: RFR: 8285820: C2: LCM prioritizes locally dependent CreateEx nodes over projections after 8270090
Message-ID:

This changeset lowers the priority of locally-dependent CreateEx nodes, that is CreateEx nodes that are not initially ready for scheduling in LCM. The proposed scheme assigns them the same priority as projection nodes when selecting the next node to be scheduled, restoring the relative prioritization between projections and CreateEx nodes to the state it was before [JDK-8270090](https://bugs.openjdk.java.net/browse/JDK-8270090). JDK-8270090 wrongly gave all CreateEx nodes the highest priority, which leads to failures whenever projection nodes are expected to get higher priority than locally-dependent CreateEx nodes. See the [JBS issue report](https://bugs.openjdk.java.net/browse/JDK-8285820) for further detail.

More specifically, the current ranking to select the next node to be scheduled in `PhaseCFG::select()` is:

1. CreateEx nodes (initially ready or not)
2. Projections
3. Constants and CheckCastPP nodes (tie)
4. ...

After this changeset, the ranking becomes:

1. Initially ready CreateEx nodes
2. Projections and other CreateEx nodes (tie)
3. Constants and CheckCastPP nodes (tie)
4. ...

which still addresses the issue handled by JDK-8270090 but in a form that is closer to the original ranking before JDK-8270090:

1. Initially ready CreateEx nodes
2. Projections, other CreateEx nodes, constants and CheckCastPP nodes (tie)
3. ...

This changeset implements the minimal changes to restore the relative prioritization between CreateEx nodes and projections to the state it was before JDK-8270090, for risk minimization and ease of backporting.
I will file a separate RFE proposing a more robust alternative than altering the order of the LCM worklist for ensuring that initially ready CreateEx nodes are scheduled at the block start.

#### Testing

##### Functionality

- Original failure on x86_32 using `-XX:+UseShenandoahGC` (thanks to Aleksey Shipilev for testing).
- Original failure of JDK-8270090 on arm32 (thanks to Marc Hoffmann for testing).
- hs-tier1-5 (windows-x64, linux-x64, linux-aarch64, and macosx-x64; release and debug mode).
- hs-tier1-3 (windows-x64, linux-x64, linux-aarch64, and macosx-x64; debug mode) with StressLCM and StressGCM (5 different seeds).

##### Performance

Tested performance on a set of standard benchmark suites (DaCapo, SPECjbb2015, SPECjvm2008, ...) and on linux-x64, linux-aarch64, windows-x64, and macosx-x64. No significant regression was observed.

-------------

Commit messages:
 - Give projections the same priority as locally-dependent CreateEx nodes

Changes: https://git.openjdk.java.net/jdk/pull/8568/files
 Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=8568&range=00
  Issue: https://bugs.openjdk.java.net/browse/JDK-8285820
  Stats: 30 lines in 1 file changed: 11 ins; 9 del; 10 mod
  Patch: https://git.openjdk.java.net/jdk/pull/8568.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk pull/8568/head:pull/8568

PR: https://git.openjdk.java.net/jdk/pull/8568

From shade at openjdk.java.net Tue May 10 07:27:18 2022
From: shade at openjdk.java.net (Aleksey Shipilev)
Date: Tue, 10 May 2022 07:27:18 GMT
Subject: RFR: 8285820: C2: LCM prioritizes locally dependent CreateEx nodes over projections after 8270090
In-Reply-To:
References:
Message-ID:

On Fri, 6 May 2022 10:42:52 GMT, Roberto Castañeda Lozano wrote:

> This changeset lowers the priority of locally-dependent CreateEx nodes, that is CreateEx nodes that are not initially ready for scheduling in LCM.
The proposed scheme assigns them the same priority as projection nodes when selecting the next node to be scheduled, restoring the relative prioritization between projections and CreateEx nodes to the state it was before [JDK-8270090](https://bugs.openjdk.java.net/browse/JDK-8270090). JDK-8270090 wrongly gave all CreateEx nodes the highest priority, which leads to failures whenever projection nodes are expected to get higher priority than locally-dependent CreateEx nodes. See the [JBS issue report](https://bugs.openjdk.java.net/browse/JDK-8285820) for further detail. > > More specifically, the current ranking to select the next node to be scheduled in `PhaseCFG::select()` is: > > 1. CreateEx nodes (initially ready or not) > 2. Projections > 3. Constants and CheckCastPP nodes (tie) > 4. ... > > After this changeset, the ranking becomes: > > 1. Initially ready CreateEx nodes > 2. Projections and other CreateEx nodes (tie) > 3. Constants and CheckCastPP nodes (tie) > 4. ... > > which still addresses the issue handled by JDK-8270090 but in a form that is closer to the original ranking before JDK-8270090: > > 1. Initially ready CreateEx nodes > 2. Projections, other CreateEx nodes, constants and CheckCastPP nodes (tie) > 3. ... > > This changeset implements the minimal changes to restore the relative prioritization between CreateEx nodes and projections to the state it was before JDK-8270090, for risk minimization and ease of backporting. I will file a separate RFE proposing a more robust alternative than altering the order of the LCM worklist for ensuring that initially ready CreateEx nodes are scheduled at the block start. > > #### Testing > > ##### Functionality > > - Original failure on x86_32 using `-XX:+UseShenandoahGC` (thanks to Aleksey Shipilev for testing). > - Original failure of JDK-8270090 on arm32 (thanks to Marc Hoffmann for testing). > - hs-tier1-5 (windows-x64, linux-x64, linux-aarch64, and macosx-x64; release and debug mode). 
> - hs-tier1-3 (windows-x64, linux-x64, linux-aarch64, and macosx-x64; debug mode) with StressLCM and StressGCM (5 different seeds).
>
> ##### Performance
>
> Tested performance on a set of standard benchmark suites (DaCapo, SPECjbb2015, SPECjvm2008, ...) and on linux-x64, linux-aarch64, windows-x64, and macosx-x64. No significant regression was observed.

This seems to pass x86_32 `tier{1,2,3}` with Shenandoah.

-------------

PR: https://git.openjdk.java.net/jdk/pull/8568

From rcastanedalo at openjdk.java.net Tue May 10 07:27:18 2022
From: rcastanedalo at openjdk.java.net (Roberto Castañeda Lozano)
Date: Tue, 10 May 2022 07:27:18 GMT
Subject: RFR: 8285820: C2: LCM prioritizes locally dependent CreateEx nodes over projections after 8270090
In-Reply-To:
References:
Message-ID:

On Fri, 6 May 2022 16:07:13 GMT, Aleksey Shipilev wrote:

> This seems to pass x86_32 `tier{1,2,3}` with Shenandoah.

Thanks for testing, Aleksey!

-------------

PR: https://git.openjdk.java.net/jdk/pull/8568

From thartmann at openjdk.java.net Tue May 10 08:02:55 2022
From: thartmann at openjdk.java.net (Tobias Hartmann)
Date: Tue, 10 May 2022 08:02:55 GMT
Subject: RFR: 8286125: C2: "bad AD file" with PopulateIndex on x86_64 [v2]
In-Reply-To: <1AlUgFjqB8hrqZTsbpIhxcsnx-pt2i7z0nw9pkbPK8k=.325ab5ed-dec3-4771-ab51-7285e162a9d0@github.com>
References: <1AlUgFjqB8hrqZTsbpIhxcsnx-pt2i7z0nw9pkbPK8k=.325ab5ed-dec3-4771-ab51-7285e162a9d0@github.com>
Message-ID: <5HDiBrtxv_DVcCqZIHrJYXb6c-CcguThh5cjgag0quU=.e3588e0a-352d-4d55-ac3a-f8a2281e199a@github.com>

On Tue, 10 May 2022 05:09:44 GMT, Pengfei Li wrote:

> As in post loops, all packs are singleton, i.e., have_same_inputs is always true. Hence, we also add a pack size check here to make post loop logic run into this block. It's safe to let it go because post loop never needs scalar replications of the loop IV - it never combines nodes in packs.

I don't understand why we need to execute that code for post loops.
Could you elaborate on why the `p->size() == 1` check is needed?

test/hotspot/jtreg/compiler/vectorization/TestReplicateLoopIV.java line 1:

> 1:

Suggestion:

-------------

PR: https://git.openjdk.java.net/jdk/pull/8587

From thartmann at openjdk.java.net Tue May 10 08:05:00 2022
From: thartmann at openjdk.java.net (Tobias Hartmann)
Date: Tue, 10 May 2022 08:05:00 GMT
Subject: RFR: 8286179: Node::find(int) should not traverse from new to old nodes [v3]
In-Reply-To:
References:
Message-ID:

On Mon, 9 May 2022 14:11:33 GMT, Emanuel Peter wrote:

>> **Problem:**
>> `Node::find` traverses input and output edges of nodes during its BFS, and searches for nodes with a specific `idx`.
>> However, if `ASSERT` is on, it also traverses `debug_orig`. This not only seems unnecessary; Mach nodes (after matching) also point back to the old IR nodes. This means we traverse from the new graph to the old graph, and potentially find multiple nodes matching the `idx`. Only the last one found will be returned; sometimes this happens to be the new node, sometimes the old node. This is inconsistent and can be quite annoying during debugging.
>>
>> **Implemented Solution:**
>> 1. Remove traversing `debug_orig`.
>> 2. Instead, add a debug-only function `old_root`, which finds the old root if it exists. Question: I now put a warning in if the `old_root` cannot be found. I think this is helpful in the debugger. I could make it an assert if that is preferred.
>> 3. `find_node` and `find_ctrl` only search in new nodes now (start BFS at new root).
>> 4. Added `find_old_node` and `find_old_ctrl`, which search in old nodes (start BFS at old root).
>>
>> I hope this improves your debugging experience.
>> Ran larger tests to see that `find_node` did not break anything, reran sanity test after implementing suggestions from reviewers.
>
> Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision:
>
>   converted nullptr asserts in nullptr returns

Marked as reviewed by thartmann (Reviewer).

-------------

PR: https://git.openjdk.java.net/jdk/pull/8567

From duke at openjdk.java.net Tue May 10 08:07:56 2022
From: duke at openjdk.java.net (Emanuel Peter)
Date: Tue, 10 May 2022 08:07:56 GMT
Subject: Integrated: 8286179: Node::find(int) should not traverse from new to old nodes
In-Reply-To:
References:
Message-ID: <5KmtEnWHctG9iYvi3uja_A8AzqyKblyLRjWqpjBFgX8=.1b9446de-1143-452f-a313-bbaf948a6094@github.com>

On Fri, 6 May 2022 08:24:26 GMT, Emanuel Peter wrote:

> **Problem:**
> `Node::find` traverses input and output edges of nodes during its BFS, and searches for nodes with a specific `idx`.
> However, if `ASSERT` is on, it also traverses `debug_orig`. This not only seems unnecessary; Mach nodes (after matching) also point back to the old IR nodes. This means we traverse from the new graph to the old graph, and potentially find multiple nodes matching the `idx`. Only the last one found will be returned; sometimes this happens to be the new node, sometimes the old node. This is inconsistent and can be quite annoying during debugging.
>
> **Implemented Solution:**
> 1. Remove traversing `debug_orig`.
> 2. Instead, add a debug-only function `old_root`, which finds the old root if it exists. Question: I now put a warning in if the `old_root` cannot be found. I think this is helpful in the debugger. I could make it an assert if that is preferred.
> 3. `find_node` and `find_ctrl` only search in new nodes now (start BFS at new root).
> 4. Added `find_old_node` and `find_old_ctrl`, which search in old nodes (start BFS at old root).
>
> I hope this improves your debugging experience.
> Ran larger tests to see that `find_node` did not break anything, reran sanity test after implementing suggestions from reviewers.

This pull request has now been integrated.
Changeset: d478958e Author: Emanuel Peter Committer: Christian Hagedorn URL: https://git.openjdk.java.net/jdk/commit/d478958eb2153199800689232d1d72e7f1ad7354 Stats: 36 lines in 1 file changed: 26 ins; 7 del; 3 mod 8286179: Node::find(int) should not traverse from new to old nodes Reviewed-by: kvn, chagedorn, thartmann ------------- PR: https://git.openjdk.java.net/jdk/pull/8567 From shade at openjdk.java.net Tue May 10 08:17:14 2022 From: shade at openjdk.java.net (Aleksey Shipilev) Date: Tue, 10 May 2022 08:17:14 GMT Subject: RFR: 8286339: compiler/c2/irTests/TestEnumFinalFold.java fails if Enum/String methods are not inlined Message-ID: As shown in the bug, the test relies on test methods in `Enum` and `String` to be inlined for constant folding to happen. In some testing modes, this does not happen. The fix is to force inlining for methods that test uses. It is kinda odd to ask force inlining of system class methods, but it does not seem problematic. Additional testing: - [x] Test now passes in unusual test modes reported - [x] Test still passes in default test modes on x86_32 and x86_64 ------------- Commit messages: - Fix Changes: https://git.openjdk.java.net/jdk/pull/8625/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=8625&range=00 Issue: https://bugs.openjdk.java.net/browse/JDK-8286339 Stats: 7 lines in 2 files changed: 5 ins; 1 del; 1 mod Patch: https://git.openjdk.java.net/jdk/pull/8625.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8625/head:pull/8625 PR: https://git.openjdk.java.net/jdk/pull/8625 From thartmann at openjdk.java.net Tue May 10 08:22:54 2022 From: thartmann at openjdk.java.net (Tobias Hartmann) Date: Tue, 10 May 2022 08:22:54 GMT Subject: RFR: 8286002: Add support for intel syntax to capstone hsdis In-Reply-To: References: Message-ID: On Mon, 2 May 2022 14:01:53 GMT, Jorn Vernee wrote: > This patch adds support for outputting assembly in intel syntax to capstone hsdis, through the `-XX:PrintAssemblyOptions=intel` 
flag.
>
> Snippet of example output:
>
> ```
> [Verified Entry Point]
> # {method} {0x0000021c8a4002d8} 'add' '(II)I' in 'Main'
> # parm0: rdx = int
> # parm1: r8 = int
> # [sp+0x20] (sp of caller)
> 0x0000021cfa713780: sub rsp, 0x18
> 0x0000021cfa713787: mov qword ptr [rsp + 0x10], rbp
> 0x0000021cfa71378c: mov eax, edx
> 0x0000021cfa71378e: add eax, r8d
> 0x0000021cfa713791: add rsp, 0x10
> 0x0000021cfa713795: pop rbp
> 0x0000021cfa713796: cmp rsp, qword ptr [r15 + 0x338]
> ; {poll_return}
> 0x0000021cfa71379d: ja 0x21cfa7137a4
> 0x0000021cfa7137a3: ret
> 0x0000021cfa7137a4: movabs r10, 0x21cfa713796 ; {internal_word}
> 0x0000021cfa7137ae: mov qword ptr [r15 + 0x350], r10
> 0x0000021cfa7137b5: jmp 0x21cfa6f3400 ; {runtime_call SafepointBlob}
> ```
>
> Testing:
> - Manual testing with and without `-XX:PrintAssemblyOptions=intel`, to make sure that both syntaxes work.
> - Manual testing with several different invalid options such as `-XX:PrintAssemblyOptions=asdf,,` to make sure that invalid options are handled correctly.
>
> Thanks,
> Jorn

Looks reasonable.

-------------

Marked as reviewed by thartmann (Reviewer).

PR: https://git.openjdk.java.net/jdk/pull/8502

From thartmann at openjdk.java.net Tue May 10 08:27:48 2022
From: thartmann at openjdk.java.net (Tobias Hartmann)
Date: Tue, 10 May 2022 08:27:48 GMT
Subject: RFR: 8286339: compiler/c2/irTests/TestEnumFinalFold.java fails if Enum/String methods are not inlined
In-Reply-To:
References:
Message-ID:

On Tue, 10 May 2022 08:11:12 GMT, Aleksey Shipilev wrote:

> As shown in the bug, the test relies on test methods in `Enum` and `String` to be inlined for constant folding to happen. In some testing modes, this does not happen. The fix is to force inlining for methods that test uses. It is kinda odd to ask force inlining of system class methods, but it does not seem problematic.
> > Additional testing: > - [x] Test now passes in unusual test modes reported > - [x] Test still passes in default test modes on x86_32 and x86_64 Looks good and trivial. ------------- Marked as reviewed by thartmann (Reviewer). PR: https://git.openjdk.java.net/jdk/pull/8625 From jiefu at openjdk.java.net Tue May 10 08:27:48 2022 From: jiefu at openjdk.java.net (Jie Fu) Date: Tue, 10 May 2022 08:27:48 GMT Subject: RFR: 8286339: compiler/c2/irTests/TestEnumFinalFold.java fails if Enum/String methods are not inlined In-Reply-To: References: Message-ID: <6HtymYN22pxHGc7h2FaAH3vyk2ovrz1DsagOolj8PVk=.a32975a3-c2a5-485a-818c-cbe32bc780e5@github.com> On Tue, 10 May 2022 08:11:12 GMT, Aleksey Shipilev wrote: > As shown in the bug, the test relies on test methods in `Enum` and `String` to be inlined for constant folding to happen. In some testing modes, this does not happen. The fix is to force inlining for methods that test uses. It is kinda odd to ask force inlining of system class methods, but it does not seem problematic. > > Additional testing: > - [x] Test now passes in unusual test modes reported > - [x] Test still passes in default test modes on x86_32 and x86_64 Shall we add `* @requires vm.flagless` for this test? ------------- PR: https://git.openjdk.java.net/jdk/pull/8625 From shade at openjdk.java.net Tue May 10 08:27:48 2022 From: shade at openjdk.java.net (Aleksey Shipilev) Date: Tue, 10 May 2022 08:27:48 GMT Subject: RFR: 8286339: compiler/c2/irTests/TestEnumFinalFold.java fails if Enum/String methods are not inlined In-Reply-To: <6HtymYN22pxHGc7h2FaAH3vyk2ovrz1DsagOolj8PVk=.a32975a3-c2a5-485a-818c-cbe32bc780e5@github.com> References: <6HtymYN22pxHGc7h2FaAH3vyk2ovrz1DsagOolj8PVk=.a32975a3-c2a5-485a-818c-cbe32bc780e5@github.com> Message-ID: On Tue, 10 May 2022 08:22:32 GMT, Jie Fu wrote: > Shall we add `* @requires vm.flagless` for this test? I don't think we should. 
The `CompileCommand`-s are just additional test configuration, not the test VM config per se. ------------- PR: https://git.openjdk.java.net/jdk/pull/8625 From roland at openjdk.java.net Tue May 10 08:28:15 2022 From: roland at openjdk.java.net (Roland Westrelin) Date: Tue, 10 May 2022 08:28:15 GMT Subject: RFR: 8275201: C2: hide klass() accessor from TypeOopPtr and typeKlassPtr subclasses [v3] In-Reply-To: References: Message-ID: > Outside the type system code itself, c2 usually assumes that a > TypeOopPtr or a TypeKlassPtr's java type is fully represented by its > klass(). To have proper support for interfaces, that can't be true as > a type needs to be represented by an instance class and a set of > interfaces. This patch hides the klass() accessor of > TypeOopPtr/TypeKlassPtr and reworks c2 code that relies on it in a way > that makes that code suitable for proper interface support in a > subsequent change. This patch doesn't add proper interface support yet > and is mostly refactoring. "Mostly" because there are cases where the > previous logic would use a ciKlass but the new one works with a > TypeKlassPtr/TypeInstPtr which carries the ciKlass and whether the > klass is exact or not. That extra bit of information can sometimes > help and so could result in slightly different decisions. > > To remove the klass() accessors, the new logic either relies on: > > - new methods of TypeKlassPtr/TypeInstPtr. For instance, instead of: > toop->klass()->is_subtype_of(other_toop->klass()) > the new code is: > toop->is_java_subtype_of(other_toop) > > - variants of the klass() accessors for narrower cases like > TypeInstPtr::instance_klass() (returns _klass except if _klass is an > interface in which case it returns Object), > TypeOopPtr::unloaded_klass() (returns _klass but only when the klass > is unloaded), TypeOopPtr::exact_klass() (returns _klass but only when > the type is exact).
> > When I tested this patch, for most changes in this patch, I had the > previous logic, the new logic and a check that verified that they > return the same result. I ran as much testing as I could that way. Roland Westrelin has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains eight commits: - Merge branch 'master' into JDK-8275201 - review - Merge branch 'master' into JDK-8275201 - Merge branch 'master' into JDK-8275201 - build fix - Merge branch 'master' into JDK-8275201 - whitespaces - remove klass accessor ------------- Changes: https://git.openjdk.java.net/jdk/pull/6717/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=6717&range=02 Stats: 1221 lines in 33 files changed: 560 ins; 180 del; 481 mod Patch: https://git.openjdk.java.net/jdk/pull/6717.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/6717/head:pull/6717 PR: https://git.openjdk.java.net/jdk/pull/6717 From jiefu at openjdk.java.net Tue May 10 08:40:47 2022 From: jiefu at openjdk.java.net (Jie Fu) Date: Tue, 10 May 2022 08:40:47 GMT Subject: RFR: 8286339: compiler/c2/irTests/TestEnumFinalFold.java fails if Enum/String methods are not inlined In-Reply-To: References: <6HtymYN22pxHGc7h2FaAH3vyk2ovrz1DsagOolj8PVk=.a32975a3-c2a5-485a-818c-cbe32bc780e5@github.com> Message-ID: On Tue, 10 May 2022 08:24:37 GMT, Aleksey Shipilev wrote: > > Shall we add `* @requires vm.flagless` for this test? > > I don't think we should. The `CompileCommand`-s are just additional test configuration, not the test VM config per se. Okay. I am just wondering if the test would still pass with vm args like "-XX:-Inline". 
------------- PR: https://git.openjdk.java.net/jdk/pull/8625 From pli at openjdk.java.net Tue May 10 08:44:49 2022 From: pli at openjdk.java.net (Pengfei Li) Date: Tue, 10 May 2022 08:44:49 GMT Subject: RFR: 8286125: C2: "bad AD file" with PopulateIndex on x86_64 [v2] In-Reply-To: <5HDiBrtxv_DVcCqZIHrJYXb6c-CcguThh5cjgag0quU=.e3588e0a-352d-4d55-ac3a-f8a2281e199a@github.com> References: <1AlUgFjqB8hrqZTsbpIhxcsnx-pt2i7z0nw9pkbPK8k=.325ab5ed-dec3-4771-ab51-7285e162a9d0@github.com> <5HDiBrtxv_DVcCqZIHrJYXb6c-CcguThh5cjgag0quU=.e3588e0a-352d-4d55-ac3a-f8a2281e199a@github.com> Message-ID: On Tue, 10 May 2022 07:59:12 GMT, Tobias Hartmann wrote: >> Pengfei Li has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains two additional commits since the last revision: >> >> - Merge branch 'master' into slpfix >> - 8286125: C2: "bad AD file" with PopulateIndex on x86_64 >> >> A fuzzer test reports an assertion failure issue with PopulateIndexNode >> on x86_64. It can be reproduced by the new jtreg case inside this patch. >> Root cause is that C2 superword creates a PopulateIndexNode by mistake >> while vectorizing below loop. >> >> for (int i = 304; i > 15; i -= 3) { >> int c = 16; >> do { >> for (int t = 1; t < 1; t++) {} >> arr[c + 1] >>= i; >> } while (--c > 0); >> } >> >> This is a corner loop case with redundant code inside. After several C2 >> optimizations, the do-while loop inside is unrolled and then isomorphic >> right shift statements can be combined in the superword optimization. >> Since all shift counts are the same loop IV value `i`, superword should >> generate a RShiftCntVNode to create a vector of scalar replications of >> the loop IV. But after JDK-8280510, a PopulateIndexNode is generated by >> mistake because of the `opd == iv()` condition. 
>> >> To fix this, we add a `have_same_inputs` condition here checking if all >> inputs at position `opd_idx` of nodes in the pack are the same. If true, >> C2 code should NOT run into this block to generate a PopulateIndexNode. >> Instead, it should run into the next block for scalar replications. >> >> Additionally, only adding this condition here is still not good enough >> because it breaks the experimental post loop vectorization. As in post >> loops, all packs are singleton, i.e., `have_same_inputs` is always true. >> Hence, we also add a pack size check here to make post loop logic run >> into this block. It's safe to let it go because post loop never needs >> scalar replications of the loop IV - it never combines nodes in packs. >> >> We also add two more assertions in the code. >> >> Jtreg hotspot::hotspot_all_no_apps, jdk::tier1~3 and langtools::tier1 >> are tested and no issue is found. > >> As in post loops, all packs are singleton, i.e., have_same_inputs is always true. > Hence, we also add a pack size check here to make post loop logic run > into this block. It's safe to let it go because post loop never needs > scalar replications of the loop IV - it never combines nodes in packs. > > I don't understand why we need to execute that code for post loops. Could you elaborate why the `p->size() == 1` check is needed? Hi @TobiHartmann , Thanks for your review. > I don't understand why we need to execute that code for post loops. Could you elaborate why the `p->size() == 1` check is needed? In post loop vectorization, superword doesn't combine multiple scalar nodes to one vector node. Instead, each scalar node is transformed to a vector node with vector mask. If a loop body statement has the induction variable involved, such as for (int i = 0; i < length; i++) { a[i] = i; } then superword should run into this `if (opd == iv() && ...) {...}` block to create an index vector for vectorizing post loop. 
But packs in post loops are special (singletons) - each has only one scalar node inside. So `same_inputs(...)` always returns true. If we just add the `have_same_inputs` condition, post loop vectorization will generate incorrect result because index vectors cannot be created. Hence we add another `p->size() == 1` check here. ------------- PR: https://git.openjdk.java.net/jdk/pull/8587 From thartmann at openjdk.java.net Tue May 10 08:49:49 2022 From: thartmann at openjdk.java.net (Tobias Hartmann) Date: Tue, 10 May 2022 08:49:49 GMT Subject: RFR: 8286339: compiler/c2/irTests/TestEnumFinalFold.java fails if Enum/String methods are not inlined In-Reply-To: References: Message-ID: On Tue, 10 May 2022 08:11:12 GMT, Aleksey Shipilev wrote: > As shown in the bug, the test relies on test methods in `Enum` and `String` to be inlined for constant folding to happen. In some testing modes, this does not happen. The fix is to force inlining for methods that test uses. It is kinda odd to ask force inlining of system class methods, but it does not seem problematic. > > Additional testing: > - [x] Test now passes in unusual test modes reported > - [x] Test still passes in default test modes on x86_32 and x86_64 I think many of our tests would fail with such invasive flag values. I don't think we can/should guard against all possible combinations of VM flags. ------------- PR: https://git.openjdk.java.net/jdk/pull/8625 From jiefu at openjdk.java.net Tue May 10 09:02:56 2022 From: jiefu at openjdk.java.net (Jie Fu) Date: Tue, 10 May 2022 09:02:56 GMT Subject: RFR: 8286339: compiler/c2/irTests/TestEnumFinalFold.java fails if Enum/String methods are not inlined In-Reply-To: References: Message-ID: <23TcyT_GQspGb89TU3rl8aTxK3bXvHuZe1WJQU3yl58=.71c00722-3b9a-4ec4-b542-89ad32220682@github.com> On Tue, 10 May 2022 08:46:05 GMT, Tobias Hartmann wrote: > I think many of our tests would fail with such invasive flag values. 
I don't think we can/should guard against all possible combinations of VM flags. There are more than 100 tests under `hotspot/jtreg/compiler`, which `* @requires vm.flagless`. Many of them seem to have been added recently. So is there a rule for when we should add it? Thanks. ------------- PR: https://git.openjdk.java.net/jdk/pull/8625 From thartmann at openjdk.java.net Tue May 10 09:05:56 2022 From: thartmann at openjdk.java.net (Tobias Hartmann) Date: Tue, 10 May 2022 09:05:56 GMT Subject: RFR: 8286125: C2: "bad AD file" with PopulateIndex on x86_64 [v2] In-Reply-To: <1AlUgFjqB8hrqZTsbpIhxcsnx-pt2i7z0nw9pkbPK8k=.325ab5ed-dec3-4771-ab51-7285e162a9d0@github.com> References: <1AlUgFjqB8hrqZTsbpIhxcsnx-pt2i7z0nw9pkbPK8k=.325ab5ed-dec3-4771-ab51-7285e162a9d0@github.com> Message-ID: On Tue, 10 May 2022 05:09:44 GMT, Pengfei Li wrote: >> A fuzzer test reports an assertion failure issue with PopulateIndexNode >> on x86_64. It can be reproduced by the new jtreg case inside this patch. >> Root cause is that C2 superword creates a PopulateIndexNode by mistake >> while vectorizing below loop. >> >> for (int i = 304; i > 15; i -= 3) { >> int c = 16; >> do { >> for (int t = 1; t < 1; t++) {} >> arr[c + 1] >>= i; >> } while (--c > 0); >> } >> >> This is a corner loop case with redundant code inside. After several C2 >> optimizations, the do-while loop inside is unrolled and then isomorphic >> right shift statements can be combined in the superword optimization. >> Since all shift counts are the same loop IV value `i`, superword should >> generate a RShiftCntVNode to create a vector of scalar replications of >> the loop IV. But after JDK-8280510, a PopulateIndexNode is generated by >> mistake because of the `opd == iv()` condition. >> >> To fix this, we add a `have_same_inputs` condition here checking if all >> inputs at position `opd_idx` of nodes in the pack are the same. If true, >> C2 code should NOT run into this block to generate a PopulateIndexNode.
>> Instead, it should run into the next block for scalar replications. >> >> Additionally, only adding this condition here is still not good enough >> because it breaks the experimental post loop vectorization. As in post >> loops, all packs are singleton, i.e., `have_same_inputs` is always true. >> Hence, we also add a pack size check here to make post loop logic run >> into this block. It's safe to let it go because post loop never needs >> scalar replications of the loop IV - it never combines nodes in packs. >> >> We also add two more assertions in the code. >> >> Jtreg hotspot::hotspot_all_no_apps, jdk::tier1~3 and langtools::tier1 >> are tested and no issue is found. > > Pengfei Li has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains two additional commits since the last revision: > > - Merge branch 'master' into slpfix > - 8286125: C2: "bad AD file" with PopulateIndex on x86_64 > > A fuzzer test reports an assertion failure issue with PopulateIndexNode > on x86_64. It can be reproduced by the new jtreg case inside this patch. > Root cause is that C2 superword creates a PopulateIndexNode by mistake > while vectorizing below loop. > > for (int i = 304; i > 15; i -= 3) { > int c = 16; > do { > for (int t = 1; t < 1; t++) {} > arr[c + 1] >>= i; > } while (--c > 0); > } > > This is a corner loop case with redundant code inside. After several C2 > optimizations, the do-while loop inside is unrolled and then isomorphic > right shift statements can be combined in the superword optimization. > Since all shift counts are the same loop IV value `i`, superword should > generate a RShiftCntVNode to create a vector of scalar replications of > the loop IV. But after JDK-8280510, a PopulateIndexNode is generated by > mistake because of the `opd == iv()` condition. 
> > To fix this, we add a `have_same_inputs` condition here checking if all > inputs at position `opd_idx` of nodes in the pack are the same. If true, > C2 code should NOT run into this block to generate a PopulateIndexNode. > Instead, it should run into the next block for scalar replications. > > Additionally, only adding this condition here is still not good enough > because it breaks the experimental post loop vectorization. As in post > loops, all packs are singleton, i.e., `have_same_inputs` is always true. > Hence, we also add a pack size check here to make post loop logic run > into this block. It's safe to let it go because post loop never needs > scalar replications of the loop IV - it never combines nodes in packs. > > We also add two more assertions in the code. > > Jtreg hotspot::hotspot_all_no_apps, jdk::tier1~3 and langtools::tier1 > are tested and no issue is found. Please remove the newline from the beginning of the test before integrating. Thanks for the clarification. Makes sense. ------------- Marked as reviewed by thartmann (Reviewer). PR: https://git.openjdk.java.net/jdk/pull/8587 From duke at openjdk.java.net Tue May 10 09:18:50 2022 From: duke at openjdk.java.net (Emanuel Peter) Date: Tue, 10 May 2022 09:18:50 GMT Subject: RFR: 8283775: VM support for graph querying in debugger with BFS traversal and node filtering [v6] In-Reply-To: References: Message-ID: On Wed, 4 May 2022 16:56:36 GMT, Emanuel Peter wrote: >> I implemented VM support for BFS traversal of IR nodes, where one can filter for input/output edges, and the node-type (control / data / memory). If one specifies the `target` node, we find a shortest path between `this` and `target`. With the `options` string one can easily select which node types to traverse. 
>> >> `void Node::print_bfs(const uint max_distance, Node* target, const char* options)` >> >> While a sufficient summary and description is in the comments above the function definition, I want to explain some useful use-cases here. >> >> Please let me know if you would find this helpful, or if you have any feedback to improve it. >> Thanks, Emanuel >> >> **1. Better dump()** >> The function is similar to `dump()`, we can also follow input / output edges up to a certain distance. There are a few improvements/added features: >> >> 1. Display the distance to the specified node. This way I can quickly assess how many nodes I have at a certain distance, and I can traverse edges from one distance to the next. The parent column shows the node one step closer to the BFS root (this). >> 2. Node filtering with types taken from `type->category()` help one limit traversal to control, data or memory flow. The categories are the same as in IGV. >> 3. Probably controversial: coloring of node types in terminal. By default it is off. May not work on all terminals. But I find it very useful to quickly separate memory, data and control - just as in IGV - but in my terminal! >> 4. After matching, we create Mach nodes from the IR nodes. For most Mach nodes, there is an associated old node. Often an assert is triggered for Mach nodes, but the graph already breaks in the IR graph. It is thus very helpful to have the old node idx displayed. >> 5. I also display the head node idx of the block a Mach node is scheduled for. This can be helpful if one gets bad dominance asserts, and needs to see why the blocks were scheduled in a bad way. >> >> Example: >> >> (rr) p find_node(35)->print_bfs(2, 0, "cdmox+") >> No target: perform BFS. 
>> dis par c dump >> --------------------------------------------- >> 0 35 d 35 CmpP === _ 34 25 [[ 36 ]] >> 1 35 d 34 LoadP === _ 31 33 [[ 35 ]] >> 1 35 d 25 ConP === 0 [[ 26 27 31 35 41 ]] #NULL >> 2 34 m 31 StoreP === 20 27 29 25 [[ 23 34 41 42 ]] >> 2 34 d 33 AddP === _ 1 12 32 [[ 34 ]] >> >> >> Example with Mach nodes: >> >> (rr) p ctrl->print_bfs(4, 0, "cdmox+OB") >> No target: perform BFS. >> dis [head idom d] old par c dump >> --------------------------------------------- >> 0 159 147 6 _ 159 c 159 Region === 159 57 [[ 159 158 59 ]] >> 1 147 148 5 o183 159 c 57 IfTrue === 8 [[ 159 ]] >> 2 147 148 5 o182 57 c 8 jmpConU === 147 9 [[ 7 57 ]] >> 3 147 148 5 _ 8 c 147 Region === 147 14 [[ 147 8 ]] >> 3 147 148 5 o180 8 d 9 compUL_rReg === _ 10 13 [[ 8 ]] >> 4 148 149 4 o174 147 c 14 IfTrue === 15 [[ 147 ]] >> 4 147 148 5 o203 9 d 10 decL_rReg === _ 11 [[ 12 9 ]] >> 4 147 148 5 o179 9 d 13 convI2L_reg_reg === _ 28 [[ 9 ]] >> >> >> **2. Find loop body** >> When I find myself in a loop, I try to localize the loop head and end, and map out at least one path between them. >> `loop_end->print_bfs(20, loop_head, "cox+")` >> This provides us with a shortest path, given this path has a distance of at most 20. >> >> Example: >> >> (rr) p find_node(158)->print_bfs(20, find_node(160), "cox+") >> Find shortest path: 158 -> 160. >> >> Backtrace target. 
>> dis c dump >> --------------------------------------------- >> 9 c 160 OuterStripMinedLoop === 160 339 159 [[ 160 358 ]] >> 8 c 358 CountedLoop === 358 160 143 [[ 358 362 363 ]] >> 7 c 363 If === 358 351 [[ 364 367 ]] >> 6 c 364 IfTrue === 363 [[ 128 ]] >> 5 c 128 If === 364 127 [[ 129 130 ]] >> 4 c 129 IfTrue === 128 [[ 155 ]] >> 3 c 155 CountedLoopEnd === 129 154 [[ 157 143 ]] [lt] >> 2 c 157 IfFalse === 155 [[ 162 163 ]] >> 1 c 162 SafePoint === 157 1 7 1 1 163 100 1 1 13 27 133 [[ 158 ]] >> 0 c 158 OuterStripMinedLoopEnd === 162 156 [[ 159 227 ]] >> >> Example with Mach nodes: >> >> (rr) p ctrl->print_bfs(10, val, "cdmox-+OB") >> Find shortest path: 159 -> 27. >> >> Backtrace target. >> dis [head idom d] old e c dump >> --------------------------------------------- >> 2 24 1 2 o10 + d 27 MachProj === 24 [[ 19 28 4 59 95 99 118 ]] >> 1 56 159 7 o239 - d 59 loadB === 159 29 27 60 [[ 55 ]] >> 0 159 147 6 _ c 159 Region === 159 57 [[ 159 158 59 ]] > > Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: > > missed in last commit Feedback from @chhagedorn : refactor `print_bfs` into a class with methods, rather than one large function with lambdas. I agree with this idea, will do it soon. ------------- PR: https://git.openjdk.java.net/jdk/pull/8468 From thartmann at openjdk.java.net Tue May 10 09:26:49 2022 From: thartmann at openjdk.java.net (Tobias Hartmann) Date: Tue, 10 May 2022 09:26:49 GMT Subject: RFR: 8285820: C2: LCM prioritizes locally dependent CreateEx nodes over projections after 8270090 In-Reply-To: References: Message-ID: On Fri, 6 May 2022 10:42:52 GMT, Roberto Castañeda Lozano wrote: > This changeset lowers the priority of locally-dependent CreateEx nodes, that is CreateEx nodes that are not initially ready for scheduling in LCM.
The proposed scheme assigns them the same priority as projection nodes when selecting the next node to be scheduled, restoring the relative prioritization between projections and CreateEx nodes to the state it was before [JDK-8270090](https://bugs.openjdk.java.net/browse/JDK-8270090). JDK-8270090 wrongly gave all CreateEx nodes the highest priority, which leads to failures whenever projection nodes are expected to get higher priority than locally-dependent CreateEx nodes. See the [JBS issue report](https://bugs.openjdk.java.net/browse/JDK-8285820) for further detail. > > More specifically, the current ranking to select the next node to be scheduled in `PhaseCFG::select()` is: > > 1. CreateEx nodes (initially ready or not) > 2. Projections > 3. Constants and CheckCastPP nodes (tie) > 4. ... > > After this changeset, the ranking becomes: > > 1. Initially ready CreateEx nodes > 2. Projections and other CreateEx nodes (tie) > 3. Constants and CheckCastPP nodes (tie) > 4. ... > > which still addresses the issue handled by JDK-8270090 but in a form that is closer to the original ranking before JDK-8270090: > > 1. Initially ready CreateEx nodes > 2. Projections, other CreateEx nodes, constants and CheckCastPP nodes (tie) > 3. ... > > This changeset implements the minimal changes to restore the relative prioritization between CreateEx nodes and projections to the state it was before JDK-8270090, for risk minimization and ease of backporting. I will file a separate RFE proposing a more robust alternative than altering the order of the LCM worklist for ensuring that initially ready CreateEx nodes are scheduled at the block start. > > #### Testing > > ##### Functionality > > - Original failure on x86_32 using `-XX:+UseShenandoahGC` (thanks to Aleksey Shipilev for testing). > - Original failure of JDK-8270090 on arm32 (thanks to Marc Hoffmann for testing). > - hs-tier1-5 (windows-x64, linux-x64, linux-aarch64, and macosx-x64; release and debug mode). 
> - hs-tier1-3 (windows-x64, linux-x64, linux-aarch64, and macosx-x64; debug mode) with StressLCM and StressGCM (5 different seeds). > > ##### Performance > > Tested performance on a set of standard benchmark suites (DaCapo, SPECjbb2015, SPECjvm2008, ...) and on linux-x64, linux-aarch64, windows-x64, and macosx-x64. No significant regression was observed. Looks reasonable to me. ------------- Marked as reviewed by thartmann (Reviewer). PR: https://git.openjdk.java.net/jdk/pull/8568 From rcastanedalo at openjdk.java.net Tue May 10 09:31:41 2022 From: rcastanedalo at openjdk.java.net (Roberto Castañeda Lozano) Date: Tue, 10 May 2022 09:31:41 GMT Subject: RFR: 8285820: C2: LCM prioritizes locally dependent CreateEx nodes over projections after 8270090 In-Reply-To: References: Message-ID: On Tue, 10 May 2022 09:23:16 GMT, Tobias Hartmann wrote: > Looks reasonable to me. Thanks for reviewing, Tobias! ------------- PR: https://git.openjdk.java.net/jdk/pull/8568 From pli at openjdk.java.net Tue May 10 09:51:43 2022 From: pli at openjdk.java.net (Pengfei Li) Date: Tue, 10 May 2022 09:51:43 GMT Subject: RFR: 8286125: C2: "bad AD file" with PopulateIndex on x86_64 [v3] In-Reply-To: References: Message-ID: > A fuzzer test reports an assertion failure issue with PopulateIndexNode > on x86_64. It can be reproduced by the new jtreg case inside this patch. > Root cause is that C2 superword creates a PopulateIndexNode by mistake > while vectorizing below loop. > > for (int i = 304; i > 15; i -= 3) { > int c = 16; > do { > for (int t = 1; t < 1; t++) {} > arr[c + 1] >>= i; > } while (--c > 0); > } > > This is a corner loop case with redundant code inside. After several C2 > optimizations, the do-while loop inside is unrolled and then isomorphic > right shift statements can be combined in the superword optimization.
> Since all shift counts are the same loop IV value `i`, superword should > generate a RShiftCntVNode to create a vector of scalar replications of > the loop IV. But after JDK-8280510, a PopulateIndexNode is generated by > mistake because of the `opd == iv()` condition. > > To fix this, we add a `have_same_inputs` condition here checking if all > inputs at position `opd_idx` of nodes in the pack are the same. If true, > C2 code should NOT run into this block to generate a PopulateIndexNode. > Instead, it should run into the next block for scalar replications. > > Additionally, only adding this condition here is still not good enough > because it breaks the experimental post loop vectorization. As in post > loops, all packs are singleton, i.e., `have_same_inputs` is always true. > Hence, we also add a pack size check here to make post loop logic run > into this block. It's safe to let it go because post loop never needs > scalar replications of the loop IV - it never combines nodes in packs. > > We also add two more assertions in the code. > > Jtreg hotspot::hotspot_all_no_apps, jdk::tier1~3 and langtools::tier1 > are tested and no issue is found. 
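For readers who want to experiment with the loop quoted above, here is a minimal standalone sketch. It is not the jtreg test from the patch: the class name `ShiftByIvLoop`, the array length, the initial values, and the repetition count are all illustrative assumptions, and running it may or may not exercise the superword path on a given build. It only shows that every shift in the inner do-while uses the same count, the outer loop IV `i`, and that repeated runs must agree with the first one.

```java
import java.util.Arrays;

// Hypothetical standalone wrapper around the loop from the description.
public class ShiftByIvLoop {

    // Runs the quoted loop on a fresh array. Length 18 is the smallest size
    // that covers the accessed indices arr[2..17] (assumption for this sketch).
    static int[] run() {
        int[] arr = new int[18];
        Arrays.fill(arr, Integer.MAX_VALUE);
        for (int i = 304; i > 15; i -= 3) {
            int c = 16;
            do {
                for (int t = 1; t < 1; t++) {}  // redundant loop, body never runs
                arr[c + 1] >>= i;               // shift count is the outer loop IV
            } while (--c > 0);
        }
        return arr;
    }

    public static void main(String[] args) {
        int[] first = run();
        // Repeat so the JIT has a chance to compile run(); each later result
        // must match the first one.
        for (int n = 0; n < 20_000; n++) {
            if (!Arrays.equals(first, run())) {
                throw new AssertionError("result changed at iteration " + n);
            }
        }
        System.out.println(Arrays.toString(first));
    }
}
```

Java masks an `int` shift count to its low five bits, so the count `i` is applied as `i & 31` in each outer iteration.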
Pengfei Li has updated the pull request incrementally with one additional commit since the last revision: Remove an empty line ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/8587/files - new: https://git.openjdk.java.net/jdk/pull/8587/files/3af03a80..61846d45 Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=8587&range=02 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=8587&range=01-02 Stats: 1 line in 1 file changed: 0 ins; 1 del; 0 mod Patch: https://git.openjdk.java.net/jdk/pull/8587.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8587/head:pull/8587 PR: https://git.openjdk.java.net/jdk/pull/8587 From aph-open at littlepinkcloud.com Tue May 10 11:59:57 2022 From: aph-open at littlepinkcloud.com (Andrew Haley) Date: Tue, 10 May 2022 12:59:57 +0100 Subject: C2: Did something just happen to unrolling? In-Reply-To: References: <871qy2e277.fsf@redhat.com> <14141d7e-f83c-db68-1061-9523e72ad13e@oracle.com> Message-ID: <47286d25-29e5-b3ab-cb25-4be8e731e288@littlepinkcloud.com> On 4/27/22 06:01, Tobias Hartmann wrote: > Andrew, do you have a reproducer so we can file a bug? Just for the record: the problem was caused by code in loopTransform.cpp that went in as part of 8279508: Auto-vectorize Math.round API. I was auto-vectorizing Math.round, but nothing was working. My problem was that the code added by 8279508 to IdealLoopTree::policy_unroll assigns a very heavy weight to all of the Op_Round operations, so they don't unroll much. This is ok for x86 avx512 but probably not for anything else. It isn't appropriate for the weighting to be in shared code, so I moved it to cpu-specific code as part of the commit for 8282541, AArch64: Auto-vectorize Math.round API https://github.com/openjdk/jdk/commit/a7b5157375f3691a7425f15a78cd5411776b9331 Incidentally, policy_unroll also gives a heavy weight for Op_ModL, DivL, MulL.
I think that's a hangover from x86-32, and now that we have 64-bit division instructions perhaps it can be usefully removed. -- Andrew Haley (he/him) Java Platform Lead Engineer Red Hat UK Ltd. https://keybase.io/andrewhaley EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671 From pli at openjdk.java.net Tue May 10 12:03:55 2022 From: pli at openjdk.java.net (Pengfei Li) Date: Tue, 10 May 2022 12:03:55 GMT Subject: RFR: 8286125: C2: "bad AD file" with PopulateIndex on x86_64 [v2] In-Reply-To: References: <1AlUgFjqB8hrqZTsbpIhxcsnx-pt2i7z0nw9pkbPK8k=.325ab5ed-dec3-4771-ab51-7285e162a9d0@github.com> Message-ID: On Tue, 10 May 2022 09:02:15 GMT, Tobias Hartmann wrote: > Please remove the newline from the beginning of the test before integrating. Thanks for reminding me. I've done that. ------------- PR: https://git.openjdk.java.net/jdk/pull/8587 From shade at openjdk.java.net Tue May 10 12:22:49 2022 From: shade at openjdk.java.net (Aleksey Shipilev) Date: Tue, 10 May 2022 12:22:49 GMT Subject: RFR: 8286339: compiler/c2/irTests/TestEnumFinalFold.java fails if Enum/String methods are not inlined In-Reply-To: <23TcyT_GQspGb89TU3rl8aTxK3bXvHuZe1WJQU3yl58=.71c00722-3b9a-4ec4-b542-89ad32220682@github.com> References: <23TcyT_GQspGb89TU3rl8aTxK3bXvHuZe1WJQU3yl58=.71c00722-3b9a-4ec4-b542-89ad32220682@github.com> Message-ID: On Tue, 10 May 2022 08:59:43 GMT, Jie Fu wrote: > There are more than 100 tests under `hotspot/jtreg/compiler`, which `* @requires vm.flagless`. Many of them seem to have been added recently. There are no flagless tests in `compiler/c2/irTests`, though. AFAIU, the intent for `vm.flagless` is to skip non-default run configurations for the tests which are driven internally with _complete overwrite_ of passed test options. In other words, `vm.flagless` is there to mark the tests where putting different VM options on the driver code does not really affect the test. IR tests are not like this, so `vm.flagless` makes little sense there.
(One can imagine putting more compiler stress options to TEST_VM_OPTS...) ------------- PR: https://git.openjdk.java.net/jdk/pull/8625 From jbhateja at openjdk.java.net Tue May 10 12:48:25 2022 From: jbhateja at openjdk.java.net (Jatin Bhateja) Date: Tue, 10 May 2022 12:48:25 GMT Subject: RFR: 8284960: Integration of JEP 426: Vector API (Fourth Incubator) [v3] In-Reply-To: References: Message-ID: <15GChtdthFmu9Cup-Ykj5NBvAanOC8QOJsnhH9g20KY=.f35eba31-15f9-40e8-95ce-a54049792840@github.com> > Hi All, > > Patch adds the planned support for new vector operations and APIs targeted for [JEP 426: Vector API (Fourth Incubator).](https://bugs.openjdk.java.net/browse/JDK-8280173) > > Following is the brief summary of changes:- > > 1) Extends the scope of existing lanewise API for following new vector operations. > - VectorOperations.BIT_COUNT: counts the number of one-bits > - VectorOperations.LEADING_ZEROS_COUNT: counts the number of leading zero bits > - VectorOperations.TRAILING_ZEROS_COUNT: counts the number of trailing zero bits > - VectorOperations.REVERSE: reversing the order of bits > - VectorOperations.REVERSE_BYTES: reversing the order of bytes > - compress and expand bits: Semantics are based on Hacker's Delight section 7-4 Compress, or Generalized Extract. > > 2) Adds following new APIs to perform cross lane vector compress and expansion operations under the influence of a mask. > - Vector.compress > - Vector.expand > - VectorMask.compress > > 3) Adds predicated and non-predicated versions of following new APIs to load and store the contents of vector from foreign MemorySegments. > - Vector.fromMemorySegment > - Vector.intoMemorySegment > > 4) C2 Compiler IR enhancements and optimized X86 and AARCH64 backend support for each newly added operation. > > > Patch has been regressed over AARCH64 and X86 targets different AVX levels. > > Kindly review and share your feedback. 
> > Best Regards, > Jatin Jatin Bhateja has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 11 commits: - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8284960 - 8284960: Correcting a typo. - 8284960: Integrating changes from panama-vector (Add @since 19 tags). - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8284960 - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8284960 - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8284960 - 8284960: AARCH64 backend changes. - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8284960 - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8284960 - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8284960 - ... and 1 more: https://git.openjdk.java.net/jdk/compare/3fa1c404...b021e082 ------------- Changes: https://git.openjdk.java.net/jdk/pull/8425/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=8425&range=02 Stats: 37901 lines in 214 files changed: 16527 ins; 16924 del; 4450 mod Patch: https://git.openjdk.java.net/jdk/pull/8425.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8425/head:pull/8425 PR: https://git.openjdk.java.net/jdk/pull/8425 From pli at openjdk.java.net Tue May 10 13:40:55 2022 From: pli at openjdk.java.net (Pengfei Li) Date: Tue, 10 May 2022 13:40:55 GMT Subject: Integrated: 8286125: C2: "bad AD file" with PopulateIndex on x86_64 In-Reply-To: References: Message-ID: On Sat, 7 May 2022 13:23:54 GMT, Pengfei Li wrote: > A fuzzer test reports an assertion failure issue with PopulateIndexNode > on x86_64. It can be reproduced by the new jtreg case inside this patch. > Root cause is that C2 superword creates a PopulateIndexNode by mistake > while vectorizing below loop. 
> > for (int i = 304; i > 15; i -= 3) { > int c = 16; > do { > for (int t = 1; t < 1; t++) {} > arr[c + 1] >>= i; > } while (--c > 0); > } > > This is a corner loop case with redundant code inside. After several C2 > optimizations, the do-while loop inside is unrolled and then isomorphic > right shift statements can be combined in the superword optimization. > Since all shift counts are the same loop IV value `i`, superword should > generate a RShiftCntVNode to create a vector of scalar replications of > the loop IV. But after JDK-8280510, a PopulateIndexNode is generated by > mistake because of the `opd == iv()` condition. > > To fix this, we add a `have_same_inputs` condition here checking if all > inputs at position `opd_idx` of nodes in the pack are the same. If true, > C2 code should NOT run into this block to generate a PopulateIndexNode. > Instead, it should run into the next block for scalar replications. > > Additionally, only adding this condition here is still not good enough > because it breaks the experimental post loop vectorization. As in post > loops, all packs are singleton, i.e., `have_same_inputs` is always true. > Hence, we also add a pack size check here to make post loop logic run > into this block. It's safe to let it go because post loop never needs > scalar replications of the loop IV - it never combines nodes in packs. > > We also add two more assertions in the code. > > Jtreg hotspot::hotspot_all_no_apps, jdk::tier1~3 and langtools::tier1 > are tested and no issue is found. This pull request has now been integrated. 
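As a rough illustration of the root cause quoted above, the two candidate shift-count vectors can be modelled in plain Java. This is only a sketch of the semantics, not HotSpot code: `populateIndex`, `replicate` and `shiftRight` are hypothetical helper names, a 4-lane pack is assumed, and the index vector is assumed to hold `start + k` in lane `k`.

```java
import java.util.Arrays;

class ShiftCountSketch {
    // Hypothetical model of an index vector: lane k holds start + k.
    static int[] populateIndex(int start, int lanes) {
        int[] v = new int[lanes];
        for (int k = 0; k < lanes; k++) {
            v[k] = start + k;
        }
        return v;
    }

    // Hypothetical model of scalar replication: every lane holds the same value.
    static int[] replicate(int value, int lanes) {
        int[] v = new int[lanes];
        Arrays.fill(v, value);
        return v;
    }

    // Lane-wise arithmetic right shift, one shift count per lane.
    static int[] shiftRight(int[] data, int[] counts) {
        int[] out = new int[data.length];
        for (int k = 0; k < data.length; k++) {
            out[k] = data[k] >> counts[k];
        }
        return out;
    }

    public static void main(String[] args) {
        int[] pack = {1024, 512, 256, 128}; // four unrolled "arr[...] >>= i" statements
        int i = 3;                          // the loop IV, identical for every statement
        // Expected semantics: every lane is shifted by the same count i.
        System.out.println(Arrays.toString(shiftRight(pack, replicate(i, 4))));
        // Mistaken operand: lane k is shifted by i + k instead.
        System.out.println(Arrays.toString(shiftRight(pack, populateIndex(i, 4))));
    }
}
```

Under these assumptions the two operands give different results for every lane but the first, which is why the pack needs the replicated count when all inputs at `opd_idx` are the same.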
Changeset: 1ca54046 Author: Pengfei Li URL: https://git.openjdk.java.net/jdk/commit/1ca540460cb3ca9de92ba6d9dd417526e333f91e Stats: 66 lines in 2 files changed: 63 ins; 0 del; 3 mod 8286125: C2: "bad AD file" with PopulateIndex on x86_64 Reviewed-by: kvn, thartmann ------------- PR: https://git.openjdk.java.net/jdk/pull/8587 From jiefu at openjdk.java.net Tue May 10 14:57:06 2022 From: jiefu at openjdk.java.net (Jie Fu) Date: Tue, 10 May 2022 14:57:06 GMT Subject: RFR: 8286339: compiler/c2/irTests/TestEnumFinalFold.java fails if Enum/String methods are not inlined In-Reply-To: References: <23TcyT_GQspGb89TU3rl8aTxK3bXvHuZe1WJQU3yl58=.71c00722-3b9a-4ec4-b542-89ad32220682@github.com> Message-ID: <82ll94nDgwOmv-u9XHBC3DfuL0CYvSb-NHmFvJrKkgk=.c36cf26f-4831-403d-87ea-ac90b764bb34@github.com> On Tue, 10 May 2022 12:16:41 GMT, Aleksey Shipilev wrote: > > There are more than 100 tests under `hotspot/jtreg/compiler`, which `* @requires vm.flagless`. Many of them seems to be added recently. > > There are no flagless tests in `compiler/c2/irTests`, though. AFAIU, the intent for `vm.flagless` is to skip non-default run configurations for the tests which are driven internally with _complete overwrite_ of passed test options. In other words, `vm.flagless` is there to mark the tests where putting different VM options on the driver code does not really affect the test. IR tests are not like this, so `vm.flagless` makes little sense there. (One can imagine putting more compiler stress options to TEST_VM_OPTS...) Thanks @shipilev for your explanation. I didn't try if the test would fail with something like `-XX:-Inline`. If it may fail, why not make it to be more robust by just adding `vm.flagless`? 
------------- PR: https://git.openjdk.java.net/jdk/pull/8625 From kvn at openjdk.java.net Tue May 10 14:58:49 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Tue, 10 May 2022 14:58:49 GMT Subject: RFR: 8285820: C2: LCM prioritizes locally dependent CreateEx nodes over projections after 8270090 In-Reply-To: References: Message-ID: On Fri, 6 May 2022 10:42:52 GMT, Roberto Castañeda Lozano wrote: > This changeset lowers the priority of locally-dependent CreateEx nodes, that is CreateEx nodes that are not initially ready for scheduling in LCM. The proposed scheme assigns them the same priority as projection nodes when selecting the next node to be scheduled, restoring the relative prioritization between projections and CreateEx nodes to the state it was before [JDK-8270090](https://bugs.openjdk.java.net/browse/JDK-8270090). JDK-8270090 wrongly gave all CreateEx nodes the highest priority, which leads to failures whenever projection nodes are expected to get higher priority than locally-dependent CreateEx nodes. See the [JBS issue report](https://bugs.openjdk.java.net/browse/JDK-8285820) for further detail. > > More specifically, the current ranking to select the next node to be scheduled in `PhaseCFG::select()` is: > > 1. CreateEx nodes (initially ready or not) > 2. Projections > 3. Constants and CheckCastPP nodes (tie) > 4. ... > > After this changeset, the ranking becomes: > > 1. Initially ready CreateEx nodes > 2. Projections and other CreateEx nodes (tie) > 3. Constants and CheckCastPP nodes (tie) > 4. ... > > which still addresses the issue handled by JDK-8270090 but in a form that is closer to the original ranking before JDK-8270090: > > 1. Initially ready CreateEx nodes > 2. Projections, other CreateEx nodes, constants and CheckCastPP nodes (tie) > 3. ...
> > This changeset implements the minimal changes to restore the relative prioritization between CreateEx nodes and projections to the state it was before JDK-8270090, for risk minimization and ease of backporting. I will file a separate RFE proposing a more robust alternative than altering the order of the LCM worklist for ensuring that initially ready CreateEx nodes are scheduled at the block start. > > #### Testing > > ##### Functionality > > - Original failure on x86_32 using `-XX:+UseShenandoahGC` (thanks to Aleksey Shipilev for testing). > - Original failure of JDK-8270090 on arm32 (thanks to Marc Hoffmann for testing). > - hs-tier1-5 (windows-x64, linux-x64, linux-aarch64, and macosx-x64; release and debug mode). > - hs-tier1-3 (windows-x64, linux-x64, linux-aarch64, and macosx-x64; debug mode) with StressLCM and StressGCM (5 different seeds). > > ##### Performance > > Tested performance on a set of standard benchmark suites (DaCapo, SPECjbb2015, SPECjvm2008, ...) and on linux-x64, linux-aarch64, windows-x64, and macosx-x64. No significant regression was observed. Looks good. ------------- Marked as reviewed by kvn (Reviewer). PR: https://git.openjdk.java.net/jdk/pull/8568 From rcastanedalo at openjdk.java.net Tue May 10 15:04:48 2022 From: rcastanedalo at openjdk.java.net (Roberto =?UTF-8?B?Q2FzdGHDsWVkYQ==?= Lozano) Date: Tue, 10 May 2022 15:04:48 GMT Subject: RFR: 8285820: C2: LCM prioritizes locally dependent CreateEx nodes over projections after 8270090 In-Reply-To: References: Message-ID: <1MTMnAJAzkqlA0dC0iGJutgfvnAc3yS9iF1EHolwlSY=.f63ba341-eb59-4789-80fd-cffe4212de38@github.com> On Tue, 10 May 2022 14:55:22 GMT, Vladimir Kozlov wrote: > Looks good. Thanks for the review, Vladimir! 
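The ranking proposed in the changeset description can be sketched as a small ordering function. This is a simplified model under stated assumptions, not the logic of `PhaseCFG::select()`: the `Kind` enum, the `Candidate` class and the numeric ranks are hypothetical, and only the top of the ranking is modelled.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class LcmPrioritySketch {
    enum Kind { CREATE_EX, PROJECTION, CONSTANT, CHECK_CAST_PP, OTHER }

    // Hypothetical stand-in for a node waiting on the scheduling worklist.
    static final class Candidate {
        final String name;
        final Kind kind;
        final boolean initiallyReady;

        Candidate(String name, Kind kind, boolean initiallyReady) {
            this.name = name;
            this.kind = kind;
            this.initiallyReady = initiallyReady;
        }
    }

    // Rank after the changeset; a lower rank is selected first.
    static int rank(Candidate c) {
        switch (c.kind) {
            case CREATE_EX:
                return c.initiallyReady ? 0 : 1; // only initially ready CreateEx stay on top
            case PROJECTION:
                return 1;                        // ties with locally dependent CreateEx
            case CONSTANT:
            case CHECK_CAST_PP:
                return 2;
            default:
                return 3;
        }
    }

    // Orders the worklist by rank; List.sort is stable, so ties keep worklist order.
    static List<String> order(List<Candidate> worklist) {
        List<Candidate> sorted = new ArrayList<>(worklist);
        sorted.sort(Comparator.comparingInt(LcmPrioritySketch::rank));
        List<String> names = new ArrayList<>();
        for (Candidate c : sorted) {
            names.add(c.name);
        }
        return names;
    }

    public static void main(String[] args) {
        List<Candidate> worklist = new ArrayList<>();
        worklist.add(new Candidate("proj", Kind.PROJECTION, false));
        worklist.add(new Candidate("exLocal", Kind.CREATE_EX, false));
        worklist.add(new Candidate("con", Kind.CONSTANT, false));
        worklist.add(new Candidate("exReady", Kind.CREATE_EX, true));
        System.out.println(order(worklist));
    }
}
```

In this model the projection is no longer pushed behind the locally dependent CreateEx node, while the initially ready CreateEx node still goes first.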
------------- PR: https://git.openjdk.java.net/jdk/pull/8568 From jiefu at openjdk.java.net Tue May 10 15:07:11 2022 From: jiefu at openjdk.java.net (Jie Fu) Date: Tue, 10 May 2022 15:07:11 GMT Subject: RFR: 8286339: compiler/c2/irTests/TestEnumFinalFold.java fails if Enum/String methods are not inlined In-Reply-To: References: Message-ID: On Tue, 10 May 2022 08:11:12 GMT, Aleksey Shipilev wrote: > As shown in the bug, the test relies on test methods in `Enum` and `String` to be inlined for constant folding to happen. In some testing modes, this does not happen. The fix is to force inlining for methods that test uses. It is kinda odd to ask force inlining of system class methods, but it does not seem problematic. > > Additional testing: > - [x] Test now passes in unusual test modes reported > - [x] Test still passes in default test modes on x86_32 and x86_64 Marked as reviewed by jiefu (Reviewer). > > I didn't try if the test would fail with something like `-XX:-Inline`. If it may fail, why not make it to be more robust by just adding `vm.flagless`? > > The test passes with `-XX:-Inline`, I suspect because IR Testing Framework is smart about it: > > ``` > IR verification disabled either due to no @IR annotations, through explicitly setting -DVerify=false, > due to not running a debug build, using a non-whitelisted JTreg VM or Javaopts flag like -Xint, or > running the test VM with other VM flags added by user code that make the IR verification impossible > (e.g. -XX:-UseCompile, -XX:TieredStopAtLevel=[1,2,3], etc.)." > ``` OK. Thanks for your clarification.
------------- PR: https://git.openjdk.java.net/jdk/pull/8625 From shade at openjdk.java.net Tue May 10 15:07:12 2022 From: shade at openjdk.java.net (Aleksey Shipilev) Date: Tue, 10 May 2022 15:07:12 GMT Subject: RFR: 8286339: compiler/c2/irTests/TestEnumFinalFold.java fails if Enum/String methods are not inlined In-Reply-To: <82ll94nDgwOmv-u9XHBC3DfuL0CYvSb-NHmFvJrKkgk=.c36cf26f-4831-403d-87ea-ac90b764bb34@github.com> References: <23TcyT_GQspGb89TU3rl8aTxK3bXvHuZe1WJQU3yl58=.71c00722-3b9a-4ec4-b542-89ad32220682@github.com> <82ll94nDgwOmv-u9XHBC3DfuL0CYvSb-NHmFvJrKkgk=.c36cf26f-4831-403d-87ea-ac90b764bb34@github.com> Message-ID: On Tue, 10 May 2022 14:53:36 GMT, Jie Fu wrote: > I didn't try if the test would fail with something like `-XX:-Inline`. If it may fail, why not make it to be more robust by just adding `vm.flagless`? The test passes with `-XX:-Inline`, I suspect because IR Testing Framework is smart about it: IR verification disabled either due to no @IR annotations, through explicitly setting -DVerify=false, due to not running a debug build, using a non-whitelisted JTreg VM or Javaopts flag like -Xint, or running the test VM with other VM flags added by user code that make the IR verification impossible (e.g. -XX:-UseCompile, -XX:TieredStopAtLevel=[1,2,3], etc.)."
------------- PR: https://git.openjdk.java.net/jdk/pull/8625 From thartmann at openjdk.java.net Tue May 10 15:07:13 2022 From: thartmann at openjdk.java.net (Tobias Hartmann) Date: Tue, 10 May 2022 15:07:13 GMT Subject: RFR: 8286339: compiler/c2/irTests/TestEnumFinalFold.java fails if Enum/String methods are not inlined In-Reply-To: References: <23TcyT_GQspGb89TU3rl8aTxK3bXvHuZe1WJQU3yl58=.71c00722-3b9a-4ec4-b542-89ad32220682@github.com> <82ll94nDgwOmv-u9XHBC3DfuL0CYvSb-NHmFvJrKkgk=.c36cf26f-4831-403d-87ea-ac90b764bb34@github.com> Message-ID: <2yM-iRML-C2-BVZeEFnuHxJ4QA5ejf0I2Kod3q6tbPg=.525538c4-565b-41c8-aed5-dcc38cb588fb@github.com> On Tue, 10 May 2022 15:00:35 GMT, Aleksey Shipilev wrote: > If it may fail, why not make it to be more robust by just adding vm.flagless? Because that will prevent the test from being executed with any other VM flag and we don't want that. The IR verification framework specifically whitelists some flag combinations and disables IR verification with others. ------------- PR: https://git.openjdk.java.net/jdk/pull/8625 From jiefu at openjdk.java.net Tue May 10 15:09:27 2022 From: jiefu at openjdk.java.net (Jie Fu) Date: Tue, 10 May 2022 15:09:27 GMT Subject: RFR: 8286339: compiler/c2/irTests/TestEnumFinalFold.java fails if Enum/String methods are not inlined In-Reply-To: <2yM-iRML-C2-BVZeEFnuHxJ4QA5ejf0I2Kod3q6tbPg=.525538c4-565b-41c8-aed5-dcc38cb588fb@github.com> References: <23TcyT_GQspGb89TU3rl8aTxK3bXvHuZe1WJQU3yl58=.71c00722-3b9a-4ec4-b542-89ad32220682@github.com> <82ll94nDgwOmv-u9XHBC3DfuL0CYvSb-NHmFvJrKkgk=.c36cf26f-4831-403d-87ea-ac90b764bb34@github.com> <2yM-iRML-C2-BVZeEFnuHxJ4QA5ejf0I2Kod3q6tbPg=.525538c4-565b-41c8-aed5-dcc38cb588fb@github.com> Message-ID: On Tue, 10 May 2022 15:03:00 GMT, Tobias Hartmann wrote: > > If it may fail, why not make it to be more robust by just adding vm.flagless? > > Because that will prevent the test from being executed with any other VM flag and we don't want that. 
The IR verification framework specifically whitelists some flag combinations and disables IR verification with others. Since it still passes even with `-XX:-Inline`, so I think it's fine too. Thanks. ------------- PR: https://git.openjdk.java.net/jdk/pull/8625 From lucy at openjdk.java.net Tue May 10 18:20:47 2022 From: lucy at openjdk.java.net (Lutz Schmidt) Date: Tue, 10 May 2022 18:20:47 GMT Subject: RFR: 8285733: [s390] Vector Instruction Emitters for element-wise access are broken In-Reply-To: References: Message-ID: <3A3p1gqdg7IW68EVcbWZKjpfgQFCYOjAlq8FbW0f5QI=.a7ba3ae0-198b-46b9-9895-52c85bbde095@github.com> On Mon, 9 May 2022 16:12:16 GMT, Martin Doerr wrote: > Fix LGTM. I guess this platform is no longer usable after Loom integration. We may need to test it in an updates release. Thanks for reviewing! Backporting the fix to (at least) jdk17 would be beneficial, just for correctness. I do not expect enthusiastic acceptance, though. The existing code in jdk17 is not at risk. To test the fix, you would need code which exploits the fixed instructions. ------------- PR: https://git.openjdk.java.net/jdk/pull/8537 From xliu at openjdk.java.net Tue May 10 18:26:55 2022 From: xliu at openjdk.java.net (Xin Liu) Date: Tue, 10 May 2022 18:26:55 GMT Subject: RFR: 8282024: add EscapeAnalysis statistics under PrintOptoStatistics [v8] In-Reply-To: References: Message-ID: <0QZMnregHu9Wg4zmfYzdHGWOUNSUQdqRJK_UrWJtSRo=.ce01a24f-2661-4ae9-9bc8-a8c49729e901@github.com> On Mon, 9 May 2022 23:48:58 GMT, Vladimir Kozlov wrote: >> aamarsh has refreshed the contents of this pull request, and previous commits have been removed. The incremental views will show differences compared to the previous content of the PR. 
The pull request contains one new commit since the last revision: >> >> adding escape analysis and scalar replacement statistics > > src/hotspot/share/opto/compile.cpp line 2217: > >> 2215: Atomic::add(&ConnectionGraph::_arg_escape_counter, _local_arg_escape_ctr); >> 2216: Atomic::add(&ConnectionGraph::_global_escape_counter, _local_global_escape_ctr); >> 2217: #endif > > These should be done inside `ConnectionGraph` - don't expose EA counters to an other class. You can use a static method in `ConnectionGraph` to do that. > > `_no_escape_counter, _local_no_escape_ctr + total_scalar_replaced` is wrong. You are doubling number because `total_scalar_replaced` is part of `_local_no_escape_ctr`. Keep these numbers separate. Also `mexp._local_scalar_replaced` could be update later during `PhaseMacroExpand::expand_macro_nodes()` call after loop optimizations. > > And such collection is not accurate (over-counted) due to EA iterations - each iteration may add the same numbers. Which could be fine if you say that in comments so people know. Hi @vnkozlov, my understanding is that `total_scalar_replaced` counts all objects scalarized in prior EA iterations. It's a kind of adjustment. Iterative EA eliminates some Java objects. The reported data are the last snapshot plus this adjustment, so they account for all Java objects. ------------- PR: https://git.openjdk.java.net/jdk/pull/8019 From duke at openjdk.java.net Tue May 10 19:33:35 2022 From: duke at openjdk.java.net (Cesar Soares) Date: Tue, 10 May 2022 19:33:35 GMT Subject: RFR: 8282024: add EscapeAnalysis statistics under PrintOptoStatistics [v8] In-Reply-To: References: Message-ID: <59GZTeLJbHSb1akqX94aN11VrSWnMuAghynnF0ciPM4=.3d82a0a4-3968-4284-b252-e4d866075870@github.com> On Mon, 9 May 2022 23:48:58 GMT, Vladimir Kozlov wrote: >> aamarsh has refreshed the contents of this pull request, and previous commits have been removed. The incremental views will show differences compared to the previous content of the PR.
The pull request contains one new commit since the last revision: >> >> adding escape analysis and scalar replacement statistics > > src/hotspot/share/opto/compile.cpp line 2217: > >> 2215: Atomic::add(&ConnectionGraph::_arg_escape_counter, _local_arg_escape_ctr); >> 2216: Atomic::add(&ConnectionGraph::_global_escape_counter, _local_global_escape_ctr); >> 2217: #endif > > These should be done inside `ConnectionGraph` - don't expose EA counters to an other class. You can use a static method in `ConnectionGraph` to do that. > > `_no_escape_counter, _local_no_escape_ctr + total_scalar_replaced` is wrong. You are doubling number because `total_scalar_replaced` is part of `_local_no_escape_ctr`. Keep these numbers separate. Also `mexp._local_scalar_replaced` could be update later during `PhaseMacroExpand::expand_macro_nodes()` call after loop optimizations. > > And such collection is not accurate (over-counted) due to EA iterations - each iteration may add the same numbers. Which could be fine if you say that in comments so people know. Thanks for the review @vnkozlov . > These should be done inside ConnectionGraph - don't expose EA counters to an other class. You can use a static method in ConnectionGraph to do that. I also didn't like this. The iterative EA and not having the last `congraph` available after the loop complicated a bit the logic here. We'll work to improve this code. > _no_escape_counter, _local_no_escape_ctr + total_scalar_replaced is wrong. You are doubling number because total_scalar_replaced is part of _local_no_escape_ctr I see your point here. As @navyxliu mentioned, the "local_scalar_replaced" is used as an adjustment to keep record of the objects scalar replaced in **previous iterations**. However, this approach incorrectly handles numbers in the **last iteration**. In the last iteration _local_no_escape_ctr is always correct and we don't need to add the objects SR in that iteration. 
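The over-count discussed in this reply can be checked with a small worked example. The sketch below is one reading of the thread, not the patch itself: it assumes that objects scalar replaced in one EA iteration no longer appear in the next iteration's connection graph, and the method names and per-iteration arrays are hypothetical.

```java
class EaCounterSketch {
    // Total as computed in the reviewed patch (per the discussion):
    // last iteration's no-escape count plus all scalar-replaced objects.
    static int asInPatch(int[] noEscapePerIter, int[] scalarReplacedPerIter) {
        int totalScalarReplaced = 0;
        for (int sr : scalarReplacedPerIter) {
            totalScalarReplaced += sr;
        }
        return noEscapePerIter[noEscapePerIter.length - 1] + totalScalarReplaced;
    }

    // Adjusted total: only objects replaced in previous iterations are added,
    // because the last iteration's replacements are still in its no-escape count.
    static int adjusted(int[] noEscapePerIter, int[] scalarReplacedPerIter) {
        int last = scalarReplacedPerIter.length - 1;
        return asInPatch(noEscapePerIter, scalarReplacedPerIter) - scalarReplacedPerIter[last];
    }

    public static void main(String[] args) {
        // Iteration 0 sees 5 no-escape objects and replaces 2 of them;
        // iteration 1 sees the remaining 3 and replaces 1.
        int[] noEscape = {5, 3};
        int[] scalarReplaced = {2, 1};
        System.out.println("patch:    " + asInPatch(noEscape, scalarReplaced));
        System.out.println("adjusted: " + adjusted(noEscape, scalarReplaced));
    }
}
```

With these made-up numbers the patch-style total is 6 while only 5 distinct no-escape objects exist, so the difference is exactly the last iteration's scalar-replaced count.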
------------- PR: https://git.openjdk.java.net/jdk/pull/8019 From vlivanov at openjdk.java.net Tue May 10 21:06:18 2022 From: vlivanov at openjdk.java.net (Vladimir Ivanov) Date: Tue, 10 May 2022 21:06:18 GMT Subject: RFR: 8283689: Update the foreign linker VM implementation [v7] In-Reply-To: References: Message-ID: <1Y6gupd3SigdhlFjonKu4x4z-0mwPwCc-LJsqt6pp4c=.201b0c49-4887-4440-8526-31420fa3dda3@github.com> On Mon, 9 May 2022 10:28:27 GMT, Jorn Vernee wrote: >> Hi, >> >> This PR updates the VM implementation of the foreign linker, by bringing over commits from the panama-foreign repo. >> >> This is split off from the main JEP integration for 19, since we have limited resources to handle this. As such, this PR might fall over to 20. >> >> I've written up an overview of the Linker architecture here: http://cr.openjdk.java.net/~jvernee/docs/FL_Overview.html it might be useful to read that first. >> >> This patch moves from the "legacy" implementation, to what is currently implemented in the panama-foreign repo, except for replacing the use of method handle combinators with ASM. That will come in a later path. To recap. This PR contains the following changes: >> >> 1. VM stubs for downcalls are now generated up front, instead of lazily by C2 [1]. >> 2. the VM support for upcalls/downcalls now support all possible call shapes. And VM stubs and Java code implementing the buffered invocation strategy has been removed [2], [3], [4], [5]. >> 3. The existing C2 intrinsification support for the `linkToNative` method handle linker was no longer needed and has been removed [6] (support might be re-added in another form later). >> 4. Some other cleanups, such as: OptimizedEntryBlob (for upcalls) now implements RuntimeBlob directly. Binding to java classes has been rewritten to use javaClasses.h/cpp (this wasn't previously possible due to these java classes being in an incubator module) [7], [8], [9]. 
>> >> While the patch mostly consists of VM changes, there are also some Java changes to support (2). >> >> The original commit structure has been mostly retained, so it might be useful to look at a specific commit, or the corresponding patch in the [panama-foreign](https://github.com/openjdk/panama-foreign/pulls?q=is%3Apr) repo as well. I've also left some inline comments to explain some of the changes, which will hopefully make reviewing easier. >> >> Testing: Tier1-4 >> >> Thanks, >> Jorn >> >> [1]: https://github.com/openjdk/jdk/pull/7959/commits/048b88156814579dca1f70742061ad24942fd358 >> [2]: https://github.com/openjdk/jdk/pull/7959/commits/2fbbef472b4c2b4fee5ede2f18cd81ab61e88f49 >> [3]: https://github.com/openjdk/jdk/pull/7959/commits/8a957a4ed9cc8d1f708ea8777212eb51ab403dc3 >> [4]: https://github.com/openjdk/jdk/pull/7959/commits/35ba1d964f1de4a77345dc58debe0565db4b0ff3 >> [5]: https://github.com/openjdk/jdk/pull/7959/commits/4e72aae22920300c5ffa16fed805b62ed9092120 >> [6]: https://github.com/openjdk/jdk/pull/7959/commits/08e22e1b468c5c8f0cfd7135c72849944068aa7a >> [7]: https://github.com/openjdk/jdk/pull/7959/commits/451cd9edf54016c182dab21a8b26bd8b609fc062 >> [8]: https://github.com/openjdk/jdk/pull/7959/commits/4c851d2795afafec3a3ab17f4142ee098692068f >> [9]: https://github.com/openjdk/jdk/pull/7959/commits/d025377799424f31512dca2ffe95491cd5ae22f9 > > Jorn Vernee has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 21 commits: > > - Merge branch 'foreign-preview-m' into JEP-19-VM-IMPL2 > - Remove unneeded ComputeMoveOrder > - Remove comment about native calls in lcm.cpp > - 8284072: foreign/StdLibTest.java randomly crashes on MacOS/AArch64 > > Reviewed-by: jvernee, mcimadamore > - Update riscv and arm stubs > - Remove spurious ProblemList change > - Pass pointer to LogStream > - Polish > - Replace TraceNativeInvokers flag with unified logging > - Fix other platforms, take 2 > - ... 
and 11 more: https://git.openjdk.java.net/jdk/compare/3c88a2ef...43fd1b91 Nice work! Looks good. Some minor comments/questions follow. src/hotspot/cpu/aarch64/frame_aarch64.cpp line 379: > 377: // need unextended_sp here, since normal sp is wrong for interpreter callees > 378: return reinterpret_cast( > 379: reinterpret_cast(frame.unextended_sp()) + in_bytes(_frame_data_offset)); Maybe use `address` instead of `char*`? src/hotspot/cpu/aarch64/macroAssembler_aarch64.cpp line 5531: > 5529: } > 5530: > 5531: // On 64 bit we will store integer like items to the stack as Time for a cleanup? `64 bit` vs `64bit`, `abi`, `Aarch64`. src/hotspot/cpu/aarch64/macroAssembler_aarch64.cpp line 5547: > 5545: } else if (dst.first()->is_stack()) { > 5546: // reg to stack > 5547: // Do we really have to sign extend??? Obsolete? Remove? src/hotspot/cpu/aarch64/universalUpcallHandler_aarch64.cpp line 306: > 304: intptr_t exception_handler_offset = __ pc() - start; > 305: > 306: // Native caller has no idea how to handle exceptions, Can you elaborate, please, how it is expected to work in presence of asynchronous exceptions? I'd expect to see a code which unconditionally clears pending exception with an assertion that verifies that the exception is of expected type. src/hotspot/cpu/x86/foreign_globals_x86.hpp line 30: > 28: #include "utilities/growableArray.hpp" > 29: > 30: class outputStream; Redundant declaration? src/hotspot/cpu/x86/foreign_globals_x86_64.cpp line 52: > 50: > 51: objArrayOop inputStorage = jdk_internal_foreign_abi_ABIDescriptor::inputStorage(abi_oop); > 52: loadArray(inputStorage, INTEGER_TYPE, abi._integer_argument_registers, as_Register); `loadArray` helper looks a bit misleading. In presence of `javaClass`-style accessors, it misleadingly hints that it refers to some Java-level operation/entity, though what it does it parses register list representation backed by a Java array. I suggest to rename it to something like `parse_argument_registers_array()`). 
src/hotspot/cpu/x86/macroAssembler_x86.cpp line 933: > 931: } else { > 932: assert(dst.is_single_reg(), "not a stack pair: (%s, %s), (%s, %s)", > 933: src.first()->name(), src.second()->name(), dst.first()->name(), dst.second()->name()); Wrong indentation. src/hotspot/cpu/x86/sharedRuntime_x86_64.cpp line 36: > 34: #include "code/nativeInst.hpp" > 35: #include "code/vtableStubs.hpp" > 36: #include "compiler/disassembler.hpp" Redundant includes? No new code added in the file. src/hotspot/share/c1/c1_GraphBuilder.cpp line 4230: > 4228: > 4229: case vmIntrinsics::_linkToNative: > 4230: print_inlining(callee, "Native call", /*success*/ false); Since the message is appended, lower case is preferred:`"native call"`. src/hotspot/share/code/codeBlob.hpp line 754: > 752: class ProgrammableUpcallHandler; > 753: > 754: class OptimizedEntryBlob: public RuntimeBlob { What's the motivation to move `OptimizedEntryBlob` up in the hierarchy (from `BufferBlob` to `RuntimeBlob`)? src/hotspot/share/opto/callGenerator.cpp line 1131: > 1129: > 1130: case vmIntrinsics::_linkToNative: > 1131: print_inlining_failure(C, callee, jvms->depth() - 1, jvms->bci(), Why is it unconditionally reported as inlining failure? src/hotspot/share/prims/foreign_globals.cpp line 147: > 145: // based on ComputeMoveOrder from x86_64 shared runtime code. > 146: // with some changes. > 147: class ForeignCMO: public StackObj { Considering how seldom it is used, I don't see much value in abbreviating it. Also, the comment is misleading: there's no such entity as `ComputeMoveOrder` in the code. And `compute_move_order` is completely removed by this change. src/hotspot/share/prims/foreign_globals.cpp line 217: > 215: > 216: public: > 217: ForeignCMO(int total_in_args, const VMRegPair* in_regs, int total_out_args, VMRegPair* out_regs, I propose to turn it into a trivial ctor and move all the logic into a helper static function which returns the computed moves. 
src/hotspot/share/prims/foreign_globals.hpp line 35: > 33: #include CPU_HEADER(foreign_globals) > 34: > 35: class CallConvClosure { Just a question on terminology: why is it called a `Closure`? src/hotspot/share/prims/foreign_globals.hpp line 62: > 60: > 61: > 62: class JavaCallConv : public CallConvClosure { Is it really worth abbreviating `CallingConvention` to `CallConv`? src/java.base/share/classes/jdk/internal/foreign/abi/SharedUtils.java line 313: > 311: MethodType newType = oldType.dropParameterTypes(destIndex, destIndex + 1); > 312: int[] reorder = new int[oldType.parameterCount()]; > 313: if (destIndex < sourceIndex) Missing braces. src/java.base/share/classes/jdk/internal/foreign/abi/aarch64/AArch64Architecture.java line 169: > 167: stackAlignment, > 168: shadowSpace, > 169: targetAddrStorage, retBufAddrStorage); Wrong indentation. src/java.base/share/classes/jdk/internal/foreign/abi/x64/X86_64Architecture.java line 156: > 154: stackAlignment, > 155: shadowSpace, > 156: targetAddrStorage, retBufAddrStorage); Wrong indentation. ------------- PR: https://git.openjdk.java.net/jdk/pull/7959 From vlivanov at openjdk.java.net Tue May 10 21:06:20 2022 From: vlivanov at openjdk.java.net (Vladimir Ivanov) Date: Tue, 10 May 2022 21:06:20 GMT Subject: RFR: 8283689: Update the foreign linker VM implementation [v7] In-Reply-To: References: Message-ID: On Mon, 28 Mar 2022 12:15:10 GMT, Jorn Vernee wrote: >> Jorn Vernee has updated the pull request with a new target base due to a merge or a rebase.
The pull request now contains 21 commits: >> >> - Merge branch 'foreign-preview-m' into JEP-19-VM-IMPL2 >> - Remove unneeded ComputeMoveOrder >> - Remove comment about native calls in lcm.cpp >> - 8284072: foreign/StdLibTest.java randomly crashes on MacOS/AArch64 >> >> Reviewed-by: jvernee, mcimadamore >> - Update riscv and arm stubs >> - Remove spurious ProblemList change >> - Pass pointer to LogStream >> - Polish >> - Replace TraceNativeInvokers flag with unified logging >> - Fix other platforms, take 2 >> - ... and 11 more: https://git.openjdk.java.net/jdk/compare/3c88a2ef...43fd1b91 > > src/hotspot/share/utilities/growableArray.hpp line 151: > >> 149: return _data; >> 150: } >> 151: > > This accessor is added to be able to temporarily view a stable GrowableArray instance as a C-style array. It is used by `NativeCallConv` and `RegSpiller` in `foreign_globals.hpp`. > > GrowableArray already has an `adr_at` accessor that does something similar, but using `adr_at(0)` fails on empty growable arrays since it also performs a bounds check. So it cannot be used. Any problems with migrating `CallConv` and `RegSpiller` away from `VMReg* + int` to `GrowableArray`?
------------- PR: https://git.openjdk.java.net/jdk/pull/7959 From xliu at openjdk.java.net Tue May 10 21:17:54 2022 From: xliu at openjdk.java.net (Xin Liu) Date: Tue, 10 May 2022 21:17:54 GMT Subject: RFR: 8282024: add EscapeAnalysis statistics under PrintOptoStatistics [v8] In-Reply-To: <59GZTeLJbHSb1akqX94aN11VrSWnMuAghynnF0ciPM4=.3d82a0a4-3968-4284-b252-e4d866075870@github.com> References: <59GZTeLJbHSb1akqX94aN11VrSWnMuAghynnF0ciPM4=.3d82a0a4-3968-4284-b252-e4d866075870@github.com> Message-ID: On Tue, 10 May 2022 19:31:03 GMT, Cesar Soares wrote: >> src/hotspot/share/opto/compile.cpp line 2217: >> >>> 2215: Atomic::add(&ConnectionGraph::_arg_escape_counter, _local_arg_escape_ctr); >>> 2216: Atomic::add(&ConnectionGraph::_global_escape_counter, _local_global_escape_ctr); >>> 2217: #endif >> >> These should be done inside `ConnectionGraph` - don't expose EA counters to an other class. You can use a static method in `ConnectionGraph` to do that. >> >> `_no_escape_counter, _local_no_escape_ctr + total_scalar_replaced` is wrong. You are doubling number because `total_scalar_replaced` is part of `_local_no_escape_ctr`. Keep these numbers separate. Also `mexp._local_scalar_replaced` could be update later during `PhaseMacroExpand::expand_macro_nodes()` call after loop optimizations. >> >> And such collection is not accurate (over-counted) due to EA iterations - each iteration may add the same numbers. Which could be fine if you say that in comments so people know. > > Thanks for the review @vnkozlov . > >> These should be done inside ConnectionGraph - don't expose EA counters to an other class. You can use a static method in ConnectionGraph to do that. > > I also didn't like this. The iterative EA and not having the last `congraph` available after the loop complicated a bit the logic here. We'll work to improve this code. > >> _no_escape_counter, _local_no_escape_ctr + total_scalar_replaced is wrong. 
You are doubling the number because total_scalar_replaced is part of _local_no_escape_ctr
>
> I see your point here. As @navyxliu mentioned, the "local_scalar_replaced" is used as an adjustment to keep record of the objects scalar replaced in **previous iterations**. However, this approach incorrectly handles numbers in the **last iteration**. In the last iteration _local_no_escape_ctr is always correct and we don't need to add the objects SR in that iteration.

I see. This patch counts the scalarized objects twice in the last iteration. We need to subtract `mexp._local_scalar_replaced`.

-------------

PR: https://git.openjdk.java.net/jdk/pull/8019

From duke at openjdk.java.net Wed May 11 01:00:39 2022
From: duke at openjdk.java.net (duke)
Date: Wed, 11 May 2022 01:00:39 GMT
Subject: Withdrawn: 8283103: compiler/vectorapi/TestMaskedMacroLogicVector.java failed due to incorrect os.simpleArch on some platforms
In-Reply-To:
References:
Message-ID:

On Mon, 14 Mar 2022 13:10:55 GMT, Ao Qi wrote:

> `os.simpleArch` is not set correctly on some platforms, for example on loongarch64 and mips64 ([CODETOOLS-7903120](https://bugs.openjdk.java.net/browse/CODETOOLS-7903120)). This issue aims to let the test work on these platforms.

This pull request has been closed without being integrated.
-------------

PR: https://git.openjdk.java.net/jdk/pull/7809

From xgong at openjdk.java.net Wed May 11 03:26:57 2022
From: xgong at openjdk.java.net (Xiaohong Gong)
Date: Wed, 11 May 2022 03:26:57 GMT
Subject: RFR: 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature [v3]
In-Reply-To:
References:
Message-ID:

On Mon, 9 May 2022 21:55:27 GMT, Paul Sandoz wrote:

>> Xiaohong Gong has updated the pull request incrementally with one additional commit since the last revision:
>>
>>   Rename "use_predicate" to "needs_predicate"
>
> I modified the code of this PR to avoid the conversion of `boolean` to `int`, so a constant integer value is passed all the way through, and the masked load is made intrinsic from the method at which the constants are passed as arguments i.e. the public `fromArray` mask accepting method.

Hi @PaulSandoz , thanks for the patch for the constant int parameter. I think the main change is:

-    ByteVector fromArray0Template(Class maskClass, C base, long offset, int index, M m, boolean offsetInRange,
+    ByteVector fromArray0Template(Class maskClass, C base, long offset, int index, M m, int offsetInRange,
                                   VectorSupport.LoadVectorMaskedOperation defaultImpl) {
         m.check(species());
         ByteSpecies vsp = vspecies();
-        if (offsetInRange) {
-            return VectorSupport.loadMasked(
-                vsp.vectorType(), maskClass, vsp.elementType(), vsp.laneCount(),
-                base, offset, m, /* offsetInRange */ 1,
-                base, index, vsp, defaultImpl);
-        } else {
-            return VectorSupport.loadMasked(
-                vsp.vectorType(), maskClass, vsp.elementType(), vsp.laneCount(),
-                base, offset, m, /* offsetInRange */ 0,
-                base, index, vsp, defaultImpl);
-        }
+        return VectorSupport.loadMasked(
+            vsp.vectorType(), maskClass, vsp.elementType(), vsp.laneCount(),
+            base, offset, m, offsetInRange == 1 ? 1 : 0,
+            base, index, vsp, defaultImpl);
     }

which uses `offsetInRange == 1 ? 1 : 0`. Unfortunately, this cannot always guarantee that `offsetInRange` is a constant at compile time.
Again, this change could also make the assertion fail randomly:

--- a/src/hotspot/share/opto/vectorIntrinsics.cpp
+++ b/src/hotspot/share/opto/vectorIntrinsics.cpp
@@ -1236,6 +1236,7 @@ bool LibraryCallKit::inline_vector_mem_masked_operation(bool is_store) {
     } else {
       // Masked vector load with IOOBE always uses the predicated load.
       const TypeInt* offset_in_range = gvn().type(argument(8))->isa_int();
+      assert(offset_in_range->is_con(), "must be a constant");
       if (!offset_in_range->is_con()) {
         if (C->print_intrinsics()) {
           tty->print_cr(" ** missing constant: offsetInRange=%s",

Sometimes the compiler can parse it as a constant. I think this depends on the compiler OSR and speculative optimization. Did you try an example with IOOBE on non-predicated hardware?

Here is the main code of my unit test to reproduce the issue:

    static final VectorSpecies I_SPECIES = IntVector.SPECIES_128;
    static final int LENGTH = 1026;
    public static int[] ia;
    public static int[] ib;

    private static void init() {
        for (int i = 0; i < LENGTH; i++) {
            ia[i] = i;
            ib[i] = 0;
        }
        for (int i = 0; i < 2; i++) {
            m[i] = i % 2 == 0;
        }
    }

    private static void func() {
        VectorMask mask = VectorMask.fromArray(I_SPECIES, m, 0);
        for (int i = 0; i < LENGTH; i += vl) {
            IntVector av = IntVector.fromArray(I_SPECIES, ia, i, mask);
            av.lanewise(VectorOperators.ABS).intoArray(ic, i, mask);
        }
    }

    public static void main(String[] args) {
        init();
        for (int i = 0; i < 10000; i++) {
            func();
        }
    }

-------------

PR: https://git.openjdk.java.net/jdk/pull/8035

From shade at openjdk.java.net Wed May 11 05:32:41 2022
From: shade at openjdk.java.net (Aleksey Shipilev)
Date: Wed, 11 May 2022 05:32:41 GMT
Subject: RFR: 8286339: compiler/c2/irTests/TestEnumFinalFold.java fails if Enum/String methods are not inlined
In-Reply-To:
References:
Message-ID: <2LQ9wLWKSGT2MIsXqWPmTzI1Bviw23pGmqlI4o98hEE=.712eb9d2-b99f-4cb8-945f-671f0e442a8d@github.com>

On Tue, 10 May 2022 08:11:12 GMT, Aleksey Shipilev wrote:

> As shown in the bug, the test relies on
test methods in `Enum` and `String` to be inlined for constant folding to happen. In some testing modes, this does not happen. The fix is to force inlining for methods that test uses. It is kinda odd to ask force inlining of system class methods, but it does not seem problematic. > > Additional testing: > - [x] Test now passes in unusual test modes reported > - [x] Test still passes in default test modes on x86_32 and x86_64 Thanks for reviews! ------------- PR: https://git.openjdk.java.net/jdk/pull/8625 From shade at openjdk.java.net Wed May 11 05:32:41 2022 From: shade at openjdk.java.net (Aleksey Shipilev) Date: Wed, 11 May 2022 05:32:41 GMT Subject: Integrated: 8286339: compiler/c2/irTests/TestEnumFinalFold.java fails if Enum/String methods are not inlined In-Reply-To: References: Message-ID: On Tue, 10 May 2022 08:11:12 GMT, Aleksey Shipilev wrote: > As shown in the bug, the test relies on test methods in `Enum` and `String` to be inlined for constant folding to happen. In some testing modes, this does not happen. The fix is to force inlining for methods that test uses. It is kinda odd to ask force inlining of system class methods, but it does not seem problematic. > > Additional testing: > - [x] Test now passes in unusual test modes reported > - [x] Test still passes in default test modes on x86_32 and x86_64 This pull request has now been integrated. 
Changeset: 9c254841
Author: Aleksey Shipilev
URL: https://git.openjdk.java.net/jdk/commit/9c2548414c71b4caaad6ad9e1b122f474e705300
Stats: 7 lines in 2 files changed: 5 ins; 1 del; 1 mod

8286339: compiler/c2/irTests/TestEnumFinalFold.java fails if Enum/String methods are not inlined

Reviewed-by: thartmann, jiefu

-------------

PR: https://git.openjdk.java.net/jdk/pull/8625

From chagedorn at openjdk.java.net Wed May 11 06:59:12 2022
From: chagedorn at openjdk.java.net (Christian Hagedorn)
Date: Wed, 11 May 2022 06:59:12 GMT
Subject: RFR: 8285965: TestScenarios.java does not check for "" correctly
Message-ID:

This is another rare occurrence of `` that is not handled correctly by `TestScenarios.java`.

We wrongly search for this safepoint message in the test VM output with `getTestVMOutput()`:

https://github.com/openjdk/jdk/blob/9c2548414c71b4caaad6ad9e1b122f474e705300/test/hotspot/jtreg/testlibrary_tests/ir_framework/tests/Utils.java#L44-L53

But this does not help since the IR matcher is parsing the `hotspot_pid` file for IR matching and not the test VM output. We could therefore find this safepoint message in the `hotspot_pid` file and bail out of IR matching while the test VM output does not contain it. This lets `TestScenarios.java` fail.

The fix we did for other IR framework tests is to redirect the output of the JTreg test VM itself to a stream in order to search it for ``. We are dumping this message as part of a warning when the IR matcher bails out:

https://github.com/openjdk/jdk/blob/9c2548414c71b4caaad6ad9e1b122f474e705300/test/hotspot/jtreg/compiler/lib/ir_framework/driver/irmatching/IRMatcher.java#L86-L96

Output for the reported failure:

Scenario #3 - [-XX:TLABRefillWasteFraction=53]:
[...]
Found , bail out of IR matching

I suggest using the same fix for `TestScenarios`.
Thanks,
Christian

-------------

Commit messages:
 - 8285965: TestScenarios.java does not check for "" correctly

Changes: https://git.openjdk.java.net/jdk/pull/8647/files
 Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=8647&range=00
  Issue: https://bugs.openjdk.java.net/browse/JDK-8285965
  Stats: 38 lines in 2 files changed: 11 ins; 16 del; 11 mod
  Patch: https://git.openjdk.java.net/jdk/pull/8647.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk pull/8647/head:pull/8647

PR: https://git.openjdk.java.net/jdk/pull/8647

From thartmann at openjdk.java.net Wed May 11 07:19:52 2022
From: thartmann at openjdk.java.net (Tobias Hartmann)
Date: Wed, 11 May 2022 07:19:52 GMT
Subject: RFR: 8285965: TestScenarios.java does not check for "" correctly
In-Reply-To:
References:
Message-ID:

On Wed, 11 May 2022 06:13:12 GMT, Christian Hagedorn wrote:

> This is another rare occurrence of `` that is not handled correctly by `TestScenarios.java`.
>
> We wrongly search for this safepoint message in the test VM output with `getTestVMOutput()`:
>
> https://github.com/openjdk/jdk/blob/9c2548414c71b4caaad6ad9e1b122f474e705300/test/hotspot/jtreg/testlibrary_tests/ir_framework/tests/Utils.java#L44-L53
>
> But this does not help since the IR matcher is parsing the `hotspot_pid` file for IR matching and not the test VM output. We could therefore find this safepoint message in the `hotspot_pid` file and bail out of IR matching while the test VM output does not contain it. This lets `TestScenarios.java` fail.
>
> The fix we did for other IR framework tests is to redirect the output of the JTreg test VM itself to a stream in order to search it for ``. We are dumping this message as part of a warning when the IR matcher bails out:
>
> https://github.com/openjdk/jdk/blob/9c2548414c71b4caaad6ad9e1b122f474e705300/test/hotspot/jtreg/compiler/lib/ir_framework/driver/irmatching/IRMatcher.java#L86-L96
>
> Output for the reported failure:
>
> Scenario #3 - [-XX:TLABRefillWasteFraction=53]:
> [...]
> Found , bail out of IR matching > > > I suggest to use the same fix for `TestScenarios`. > > Thanks, > Christian Looks good to me. ------------- Marked as reviewed by thartmann (Reviewer). PR: https://git.openjdk.java.net/jdk/pull/8647 From roland at openjdk.java.net Wed May 11 07:29:52 2022 From: roland at openjdk.java.net (Roland Westrelin) Date: Wed, 11 May 2022 07:29:52 GMT Subject: RFR: 8275201: C2: hide klass() accessor from TypeOopPtr and typeKlassPtr subclasses [v2] In-Reply-To: References: Message-ID: On Fri, 6 May 2022 18:27:31 GMT, Vladimir Ivanov wrote: > Looks very good! Thanks for the review. ------------- PR: https://git.openjdk.java.net/jdk/pull/6717 From roland at openjdk.java.net Wed May 11 07:29:54 2022 From: roland at openjdk.java.net (Roland Westrelin) Date: Wed, 11 May 2022 07:29:54 GMT Subject: Integrated: 8275201: C2: hide klass() accessor from TypeOopPtr and typeKlassPtr subclasses In-Reply-To: References: Message-ID: On Mon, 6 Dec 2021 09:59:44 GMT, Roland Westrelin wrote: > Outside the type system code itself, c2 usually assumes that a > TypeOopPtr or a TypeKlassPtr's java type is fully represented by its > klass(). To have proper support for interfaces, that can't be true as > a type needs to be represented by an instance class and a set of > interfaces. This patch hides the klass() accessor of > TypeOopPtr/TypeKlassPtr and reworks c2 code that relies on it in a way > that makes that code suitable for proper interface support in a > subsequent change. This patch doesn't add proper interface support yet > and is mostly refactoring. "Mostly" because there are cases where the > previous logic would use a ciKlass but the new one works with a > TypeKlassPtr/TypeInstPtr which carries the ciKlass and whether the > klass is exact or not. That extra bit of information can sometimes > help and so could result in slightly different decisions. 
>
> To remove the klass() accessors, the new logic either relies on:
>
> - new methods of TypeKlassPtr/TypeInstPtr. For instance, instead of:
>   toop->klass()->is_subtype_of(other_toop->klass())
>   the new code is:
>   toop->is_java_subtype_of(other_toop)
>
> - variants of the klass() accessors for narrower cases like
>   TypeInstPtr::instance_klass() (returns _klass except if _klass is an
>   interface in which case it returns Object),
>   TypeOopPtr::unloaded_klass() (returns _klass but only when the klass
>   is unloaded), TypeOopPtr::exact_klass() (returns _klass but only when
>   the type is exact).
>
> When I tested this patch, for most changes in this patch, I had the
> previous logic, the new logic and a check that verified that they
> return the same result. I ran as much testing as I could that way.

This pull request has now been integrated.

Changeset: aa7ccdf4
Author: Roland Westrelin
URL: https://git.openjdk.java.net/jdk/commit/aa7ccdf44549a52cce9e99f6569097d3343d9ee4
Stats: 1221 lines in 33 files changed: 560 ins; 180 del; 481 mod

8275201: C2: hide klass() accessor from TypeOopPtr and typeKlassPtr subclasses

Reviewed-by: vlivanov, iveresov

-------------

PR: https://git.openjdk.java.net/jdk/pull/6717

From chagedorn at openjdk.java.net Wed May 11 07:35:35 2022
From: chagedorn at openjdk.java.net (Christian Hagedorn)
Date: Wed, 11 May 2022 07:35:35 GMT
Subject: RFR: 8285965: TestScenarios.java does not check for "" correctly
In-Reply-To:
References:
Message-ID:

On Wed, 11 May 2022 06:13:12 GMT, Christian Hagedorn wrote:

> This is another rare occurrence of `` that is not handled correctly by `TestScenarios.java`.
>
> We wrongly search for this safepoint message in the test VM output with `getTestVMOutput()`:
>
> https://github.com/openjdk/jdk/blob/9c2548414c71b4caaad6ad9e1b122f474e705300/test/hotspot/jtreg/testlibrary_tests/ir_framework/tests/Utils.java#L44-L53
>
> But this does not help since the IR matcher is parsing the `hotspot_pid` file for IR matching and not the test VM output. We could therefore find this safepoint message in the `hotspot_pid` file and bail out of IR matching while the test VM output does not contain it. This lets `TestScenarios.java` fail.
>
> The fix we did for other IR framework tests is to redirect the output of the JTreg test VM itself to a stream in order to search it for ``. We are dumping this message as part of a warning when the IR matcher bails out:
>
> https://github.com/openjdk/jdk/blob/9c2548414c71b4caaad6ad9e1b122f474e705300/test/hotspot/jtreg/compiler/lib/ir_framework/driver/irmatching/IRMatcher.java#L86-L96
>
> Output for the reported failure:
>
> Scenario #3 - [-XX:TLABRefillWasteFraction=53]:
> [...]
> Found , bail out of IR matching
>
>
> I suggest using the same fix for `TestScenarios`.
>
> Thanks,
> Christian

Thanks Tobias for your review!

-------------

PR: https://git.openjdk.java.net/jdk/pull/8647

From ngasson at openjdk.java.net Wed May 11 10:03:47 2022
From: ngasson at openjdk.java.net (Nick Gasson)
Date: Wed, 11 May 2022 10:03:47 GMT
Subject: RFR: 8282966: AArch64: Optimize VectorMask.toLong with SVE2 [v2]
In-Reply-To:
References:
Message-ID:

On Thu, 5 May 2022 01:31:52 GMT, Eric Liu wrote:

>> This patch optimizes the backend implementation of VectorMaskToLong for
>> AArch64, given a more efficient approach to mov value bits from
>> predicate register to general purpose register as x86 PMOVMSK[1] does,
>> by using BEXT[2] which is available in SVE2.
>>
>> With this patch, the final code (input mask is byte type with
>> SPECIES_512, generated on an SVE vector reg size of 512-bit QEMU
>> emulator) changes as below:
>>
>> Before:
>>
>>     mov z16.b, p0/z, #1
>>     fmov x0, d16
>>     orr x0, x0, x0, lsr #7
>>     orr x0, x0, x0, lsr #14
>>     orr x0, x0, x0, lsr #28
>>     and x0, x0, #0xff
>>     fmov x8, v16.d[1]
>>     orr x8, x8, x8, lsr #7
>>     orr x8, x8, x8, lsr #14
>>     orr x8, x8, x8, lsr #28
>>     and x8, x8, #0xff
>>     orr x0, x0, x8, lsl #8
>>
>>     orr x8, xzr, #0x2
>>     whilele p1.d, xzr, x8
>>     lastb x8, p1, z16.d
>>     orr x8, x8, x8, lsr #7
>>     orr x8, x8, x8, lsr #14
>>     orr x8, x8, x8, lsr #28
>>     and x8, x8, #0xff
>>     orr x0, x0, x8, lsl #16
>>
>>     orr x8, xzr, #0x3
>>     whilele p1.d, xzr, x8
>>     lastb x8, p1, z16.d
>>     orr x8, x8, x8, lsr #7
>>     orr x8, x8, x8, lsr #14
>>     orr x8, x8, x8, lsr #28
>>     and x8, x8, #0xff
>>     orr x0, x0, x8, lsl #24
>>
>>     orr x8, xzr, #0x4
>>     whilele p1.d, xzr, x8
>>     lastb x8, p1, z16.d
>>     orr x8, x8, x8, lsr #7
>>     orr x8, x8, x8, lsr #14
>>     orr x8, x8, x8, lsr #28
>>     and x8, x8, #0xff
>>     orr x0, x0, x8, lsl #32
>>
>>     mov x8, #0x5
>>     whilele p1.d, xzr, x8
>>     lastb x8, p1, z16.d
>>     orr x8, x8, x8, lsr #7
>>     orr x8, x8, x8, lsr #14
>>     orr x8, x8, x8, lsr #28
>>     and x8, x8, #0xff
>>     orr x0, x0, x8, lsl #40
>>
>>     orr x8, xzr, #0x6
>>     whilele p1.d, xzr, x8
>>     lastb x8, p1, z16.d
>>     orr x8, x8, x8, lsr #7
>>     orr x8, x8, x8, lsr #14
>>     orr x8, x8, x8, lsr #28
>>     and x8, x8, #0xff
>>     orr x0, x0, x8, lsl #48
>>
>>     orr x8, xzr, #0x7
>>     whilele p1.d, xzr, x8
>>     lastb x8, p1, z16.d
>>     orr x8, x8, x8, lsr #7
>>     orr x8, x8, x8, lsr #14
>>     orr x8, x8, x8, lsr #28
>>     and x8, x8, #0xff
>>     orr x0, x0, x8, lsl #56
>>
>> After:
>>
>>     mov z16.b, p0/z, #1
>>     mov z17.b, #1
>>     bext z16.d, z16.d, z17.d
>>     mov z17.d, #0
>>     uzp1 z16.s, z16.s, z17.s
>>     uzp1 z16.h, z16.h, z17.h
>>     uzp1 z16.b, z16.b, z17.b
>>     mov x0, v16.d[0]
>>
>> [1] https://www.felixcloutier.com/x86/pmovmskb
>> [2] https://developer.arm.com/documentation/ddi0602/2020-12/SVE-Instructions/BEXT--Gather-lower-bits-from-positions-selected-by-bitmask-
>
> Eric Liu has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains two commits:
>
>  - Merge jdk:master
>
>    Change-Id: Ifa60f3b79513c22dbf932f1da623289687bc1070
>  - 8282966: AArch64: Optimize VectorMask.toLong with SVE2
>
>    This patch optimizes the backend implementation of VectorMaskToLong for
>    AArch64, given a more efficient approach to mov value bits from
>    predicate register to general purpose register as x86 PMOVMSK[1] does,
>    by using BEXT[2] which is available in SVE2.
>
>    With this patch, the final code (input mask is byte type with
>    SPECIES_512, generated on an SVE vector reg size of 512-bit QEMU
>    emulator) changes as below:
>
>    Before:
>
>        mov z16.b, p0/z, #1
>        fmov x0, d16
>        orr x0, x0, x0, lsr #7
>        orr x0, x0, x0, lsr #14
>        orr x0, x0, x0, lsr #28
>        and x0, x0, #0xff
>        fmov x8, v16.d[1]
>        orr x8, x8, x8, lsr #7
>        orr x8, x8, x8, lsr #14
>        orr x8, x8, x8, lsr #28
>        and x8, x8, #0xff
>        orr x0, x0, x8, lsl #8
>
>        orr x8, xzr, #0x2
>        whilele p1.d, xzr, x8
>        lastb x8, p1, z16.d
>        orr x8, x8, x8, lsr #7
>        orr x8, x8, x8, lsr #14
>        orr x8, x8, x8, lsr #28
>        and x8, x8, #0xff
>        orr x0, x0, x8, lsl #16
>
>        orr x8, xzr, #0x3
>        whilele p1.d, xzr, x8
>        lastb x8, p1, z16.d
>        orr x8, x8, x8, lsr #7
>        orr x8, x8, x8, lsr #14
>        orr x8, x8, x8, lsr #28
>        and x8, x8, #0xff
>        orr x0, x0, x8, lsl #24
>
>        orr x8, xzr, #0x4
>        whilele p1.d, xzr, x8
>        lastb x8, p1, z16.d
>        orr x8, x8, x8, lsr #7
>        orr x8, x8, x8, lsr #14
>        orr x8, x8, x8, lsr #28
>        and x8, x8, #0xff
>        orr x0, x0, x8, lsl #32
>
>        mov x8, #0x5
>        whilele p1.d, xzr, x8
>        lastb x8, p1, z16.d
>        orr x8, x8, x8, lsr #7
>        orr x8, x8, x8, lsr #14
>        orr x8, x8, x8, lsr #28
>        and x8, x8, #0xff
>        orr x0, x0, x8, lsl #40
>
>        orr x8, xzr, #0x6
>        whilele p1.d, xzr, x8
>        lastb x8, p1, z16.d
>        orr x8, x8, x8, lsr #7
>        orr x8, x8, x8, lsr #14
>        orr x8, x8, x8, lsr #28
>        and x8, x8, #0xff
>        orr x0, x0, x8, lsl #48
>
>        orr x8, xzr, #0x7
>        whilele p1.d, xzr, x8
>        lastb x8, p1, z16.d
>        orr x8, x8, x8, lsr #7
>        orr x8, x8, x8, lsr #14
>        orr x8, x8, x8, lsr #28
>        and x8, x8, #0xff
>        orr x0, x0, x8, lsl #56
>
>    After:
>
>        mov z16.b, p0/z, #1
>        mov z17.b, #1
>        bext z16.d, z16.d, z17.d
>        mov z17.d, #0
>        uzp1 z16.s, z16.s, z17.s
>        uzp1 z16.h, z16.h, z17.h
>        uzp1 z16.b, z16.b, z17.b
>        mov x0, v16.d[0]
>
>    [1] https://www.felixcloutier.com/x86/pmovmskb
>    [2] https://developer.arm.com/documentation/ddi0602/2020-12/SVE-Instructions/BEXT--Gather-lower-bits-from-positions-selected-by-bitmask-
>
>    Change-Id: Ia983a20c89f76403e557ac21328f2f2e05dd08e0

Looks OK to me.

-------------

Marked as reviewed by ngasson (Reviewer).

PR: https://git.openjdk.java.net/jdk/pull/8337

From jvernee at openjdk.java.net Wed May 11 10:45:12 2022
From: jvernee at openjdk.java.net (Jorn Vernee)
Date: Wed, 11 May 2022 10:45:12 GMT
Subject: RFR: 8283689: Update the foreign linker VM implementation [v7]
In-Reply-To: <1Y6gupd3SigdhlFjonKu4x4z-0mwPwCc-LJsqt6pp4c=.201b0c49-4887-4440-8526-31420fa3dda3@github.com>
References: <1Y6gupd3SigdhlFjonKu4x4z-0mwPwCc-LJsqt6pp4c=.201b0c49-4887-4440-8526-31420fa3dda3@github.com>
Message-ID:

On Tue, 10 May 2022 18:44:01 GMT, Vladimir Ivanov wrote:

>> Jorn Vernee has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 21 commits:
>>
>>  - Merge branch 'foreign-preview-m' into JEP-19-VM-IMPL2
>>  - Remove unneeded ComputeMoveOrder
>>  - Remove comment about native calls in lcm.cpp
>>  - 8284072: foreign/StdLibTest.java randomly crashes on MacOS/AArch64
>>
>>    Reviewed-by: jvernee, mcimadamore
>>  - Update riscv and arm stubs
>>  - Remove spurious ProblemList change
>>  - Pass pointer to LogStream
>>  - Polish
>>  - Replace TraceNativeInvokers flag with unified logging
>>  - Fix other platforms, take 2
>>  - ...
and 11 more: https://git.openjdk.java.net/jdk/compare/3c88a2ef...43fd1b91 > > src/hotspot/cpu/aarch64/frame_aarch64.cpp line 379: > >> 377: // need unextended_sp here, since normal sp is wrong for interpreter callees >> 378: return reinterpret_cast( >> 379: reinterpret_cast(frame.unextended_sp()) + in_bytes(_frame_data_offset)); > > Maybe use `address` instead of `char*`? Ok. I think I used `char*` to try and avoid a potential strict-aliasing violation, but I don't think we compile with that turned on any ways. Will change it to `address` (for x86 too) ------------- PR: https://git.openjdk.java.net/jdk/pull/7959 From jvernee at openjdk.java.net Wed May 11 10:50:47 2022 From: jvernee at openjdk.java.net (Jorn Vernee) Date: Wed, 11 May 2022 10:50:47 GMT Subject: RFR: 8283689: Update the foreign linker VM implementation [v7] In-Reply-To: <1Y6gupd3SigdhlFjonKu4x4z-0mwPwCc-LJsqt6pp4c=.201b0c49-4887-4440-8526-31420fa3dda3@github.com> References: <1Y6gupd3SigdhlFjonKu4x4z-0mwPwCc-LJsqt6pp4c=.201b0c49-4887-4440-8526-31420fa3dda3@github.com> Message-ID: On Tue, 10 May 2022 18:48:08 GMT, Vladimir Ivanov wrote: >> Jorn Vernee has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 21 commits: >> >> - Merge branch 'foreign-preview-m' into JEP-19-VM-IMPL2 >> - Remove unneeded ComputeMoveOrder >> - Remove comment about native calls in lcm.cpp >> - 8284072: foreign/StdLibTest.java randomly crashes on MacOS/AArch64 >> >> Reviewed-by: jvernee, mcimadamore >> - Update riscv and arm stubs >> - Remove spurious ProblemList change >> - Pass pointer to LogStream >> - Polish >> - Replace TraceNativeInvokers flag with unified logging >> - Fix other platforms, take 2 >> - ... and 11 more: https://git.openjdk.java.net/jdk/compare/3c88a2ef...43fd1b91 > > src/hotspot/cpu/aarch64/macroAssembler_aarch64.cpp line 5547: > >> 5545: } else if (dst.first()->is_stack()) { >> 5546: // reg to stack >> 5547: // Do we really have to sign extend??? 
> > Obsolete? Remove? Yes, this looks like it can be removed. (was copied over from SharedRuntime_aarch64) ------------- PR: https://git.openjdk.java.net/jdk/pull/7959 From jvernee at openjdk.java.net Wed May 11 10:55:09 2022 From: jvernee at openjdk.java.net (Jorn Vernee) Date: Wed, 11 May 2022 10:55:09 GMT Subject: RFR: 8283689: Update the foreign linker VM implementation [v7] In-Reply-To: <1Y6gupd3SigdhlFjonKu4x4z-0mwPwCc-LJsqt6pp4c=.201b0c49-4887-4440-8526-31420fa3dda3@github.com> References: <1Y6gupd3SigdhlFjonKu4x4z-0mwPwCc-LJsqt6pp4c=.201b0c49-4887-4440-8526-31420fa3dda3@github.com> Message-ID: On Tue, 10 May 2022 18:55:03 GMT, Vladimir Ivanov wrote: >> Jorn Vernee has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 21 commits: >> >> - Merge branch 'foreign-preview-m' into JEP-19-VM-IMPL2 >> - Remove unneeded ComputeMoveOrder >> - Remove comment about native calls in lcm.cpp >> - 8284072: foreign/StdLibTest.java randomly crashes on MacOS/AArch64 >> >> Reviewed-by: jvernee, mcimadamore >> - Update riscv and arm stubs >> - Remove spurious ProblemList change >> - Pass pointer to LogStream >> - Polish >> - Replace TraceNativeInvokers flag with unified logging >> - Fix other platforms, take 2 >> - ... and 11 more: https://git.openjdk.java.net/jdk/compare/3c88a2ef...43fd1b91 > > src/hotspot/cpu/x86/foreign_globals_x86.hpp line 30: > >> 28: #include "utilities/growableArray.hpp" >> 29: >> 30: class outputStream; > > Redundant declaration? 
Yeah, this whole file is redundant :) (replaced by foreign_globals_x86_64.hpp) ------------- PR: https://git.openjdk.java.net/jdk/pull/7959 From jvernee at openjdk.java.net Wed May 11 11:02:47 2022 From: jvernee at openjdk.java.net (Jorn Vernee) Date: Wed, 11 May 2022 11:02:47 GMT Subject: RFR: 8283689: Update the foreign linker VM implementation [v7] In-Reply-To: <1Y6gupd3SigdhlFjonKu4x4z-0mwPwCc-LJsqt6pp4c=.201b0c49-4887-4440-8526-31420fa3dda3@github.com> References: <1Y6gupd3SigdhlFjonKu4x4z-0mwPwCc-LJsqt6pp4c=.201b0c49-4887-4440-8526-31420fa3dda3@github.com> Message-ID: On Tue, 10 May 2022 19:16:35 GMT, Vladimir Ivanov wrote: >> Jorn Vernee has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 21 commits: >> >> - Merge branch 'foreign-preview-m' into JEP-19-VM-IMPL2 >> - Remove unneeded ComputeMoveOrder >> - Remove comment about native calls in lcm.cpp >> - 8284072: foreign/StdLibTest.java randomly crashes on MacOS/AArch64 >> >> Reviewed-by: jvernee, mcimadamore >> - Update riscv and arm stubs >> - Remove spurious ProblemList change >> - Pass pointer to LogStream >> - Polish >> - Replace TraceNativeInvokers flag with unified logging >> - Fix other platforms, take 2 >> - ... and 11 more: https://git.openjdk.java.net/jdk/compare/3c88a2ef...43fd1b91 > > src/hotspot/share/code/codeBlob.hpp line 754: > >> 752: class ProgrammableUpcallHandler; >> 753: >> 754: class OptimizedEntryBlob: public RuntimeBlob { > > What's the motivation to move `OptimizedEntryBlob` up in the hierarchy (from `BufferBlob` to `RuntimeBlob`)? Some of it is discussed here: https://github.com/openjdk/panama-foreign/pull/617 Essentially, it is to avoid accidentally inheriting behavior from BufferBlob which we don't want. Also, BufferBlob currently expects a fixed-sized header (`sizeof(BufferBlob)`), while OptimizedEntryBlobs has fields, so we'd have to pass the real header size to the `BufferBlob` constructor, which is a bit messy. 
It felt better to just cleanly break away from BufferBlob. ------------- PR: https://git.openjdk.java.net/jdk/pull/7959 From jvernee at openjdk.java.net Wed May 11 11:09:52 2022 From: jvernee at openjdk.java.net (Jorn Vernee) Date: Wed, 11 May 2022 11:09:52 GMT Subject: RFR: 8283689: Update the foreign linker VM implementation [v7] In-Reply-To: <1Y6gupd3SigdhlFjonKu4x4z-0mwPwCc-LJsqt6pp4c=.201b0c49-4887-4440-8526-31420fa3dda3@github.com> References: <1Y6gupd3SigdhlFjonKu4x4z-0mwPwCc-LJsqt6pp4c=.201b0c49-4887-4440-8526-31420fa3dda3@github.com> Message-ID: On Tue, 10 May 2022 19:21:58 GMT, Vladimir Ivanov wrote: >> Jorn Vernee has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 21 commits: >> >> - Merge branch 'foreign-preview-m' into JEP-19-VM-IMPL2 >> - Remove unneeded ComputeMoveOrder >> - Remove comment about native calls in lcm.cpp >> - 8284072: foreign/StdLibTest.java randomly crashes on MacOS/AArch64 >> >> Reviewed-by: jvernee, mcimadamore >> - Update riscv and arm stubs >> - Remove spurious ProblemList change >> - Pass pointer to LogStream >> - Polish >> - Replace TraceNativeInvokers flag with unified logging >> - Fix other platforms, take 2 >> - ... and 11 more: https://git.openjdk.java.net/jdk/compare/3c88a2ef...43fd1b91 > > src/hotspot/share/opto/callGenerator.cpp line 1131: > >> 1129: >> 1130: case vmIntrinsics::_linkToNative: >> 1131: print_inlining_failure(C, callee, jvms->depth() - 1, jvms->bci(), > > Why is it unconditionally reported as inlining failure? The call that is being processed here is `linkToNative`, and that call is not inlined, so reporting an inlining failure seems correct. We still go through the method handle trampoline stub which loads the actual target from the NativeEntryPoint (`jump_to_native_invoker` in methodHandles_x86.cpp). It's potentially faster here to generate a runtime call to the underlying invoker/downcall stub if the NativeEntryPoint is constant (i.e. 
avoid the lookup through NEP in the MH trampoline), but I hadn't gotten to investigating that yet.

Comparing the benchmark times between this and the old implementation (which generated an inline call), they are not all that different. So it seemed that doing something special here would not save that much time anyway. (But it would still be good to investigate at some point.)

-------------

PR: https://git.openjdk.java.net/jdk/pull/7959

From jvernee at openjdk.java.net Wed May 11 11:13:53 2022
From: jvernee at openjdk.java.net (Jorn Vernee)
Date: Wed, 11 May 2022 11:13:53 GMT
Subject: RFR: 8283689: Update the foreign linker VM implementation [v7]
In-Reply-To: <1Y6gupd3SigdhlFjonKu4x4z-0mwPwCc-LJsqt6pp4c=.201b0c49-4887-4440-8526-31420fa3dda3@github.com>
References: <1Y6gupd3SigdhlFjonKu4x4z-0mwPwCc-LJsqt6pp4c=.201b0c49-4887-4440-8526-31420fa3dda3@github.com>
Message-ID:

On Tue, 10 May 2022 20:30:09 GMT, Vladimir Ivanov wrote:

>> Jorn Vernee has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 21 commits:
>>
>>  - Merge branch 'foreign-preview-m' into JEP-19-VM-IMPL2
>>  - Remove unneeded ComputeMoveOrder
>>  - Remove comment about native calls in lcm.cpp
>>  - 8284072: foreign/StdLibTest.java randomly crashes on MacOS/AArch64
>>
>>    Reviewed-by: jvernee, mcimadamore
>>  - Update riscv and arm stubs
>>  - Remove spurious ProblemList change
>>  - Pass pointer to LogStream
>>  - Polish
>>  - Replace TraceNativeInvokers flag with unified logging
>>  - Fix other platforms, take 2
>>  - ... and 11 more: https://git.openjdk.java.net/jdk/compare/3c88a2ef...43fd1b91
>
> src/hotspot/share/prims/foreign_globals.cpp line 147:
>
>> 145: // based on ComputeMoveOrder from x86_64 shared runtime code.
>> 146: // with some changes.
>> 147: class ForeignCMO: public StackObj {
>
> Considering how seldom it is used, I don't see much value in abbreviating it.
Also, the comment is misleading: there's no such entity as `ComputeMoveOrder` in the code. And `compute_move_order` is completely removed by this change. Good points, I think we can just rename this class to `ComputeMoveOrder` at this point. ------------- PR: https://git.openjdk.java.net/jdk/pull/7959 From jvernee at openjdk.java.net Wed May 11 11:28:05 2022 From: jvernee at openjdk.java.net (Jorn Vernee) Date: Wed, 11 May 2022 11:28:05 GMT Subject: RFR: 8283689: Update the foreign linker VM implementation [v7] In-Reply-To: <1Y6gupd3SigdhlFjonKu4x4z-0mwPwCc-LJsqt6pp4c=.201b0c49-4887-4440-8526-31420fa3dda3@github.com> References: <1Y6gupd3SigdhlFjonKu4x4z-0mwPwCc-LJsqt6pp4c=.201b0c49-4887-4440-8526-31420fa3dda3@github.com> Message-ID: On Tue, 10 May 2022 20:48:47 GMT, Vladimir Ivanov wrote: >> Jorn Vernee has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 21 commits: >> >> - Merge branch 'foreign-preview-m' into JEP-19-VM-IMPL2 >> - Remove unneeded ComputeMoveOrder >> - Remove comment about native calls in lcm.cpp >> - 8284072: foreign/StdLibTest.java randomly crashes on MacOS/AArch64 >> >> Reviewed-by: jvernee, mcimadamore >> - Update riscv and arm stubs >> - Remove spurious ProblemList change >> - Pass pointer to LogStream >> - Polish >> - Replace TraceNativeInvokers flag with unified logging >> - Fix other platforms, take 2 >> - ... and 11 more: https://git.openjdk.java.net/jdk/compare/3c88a2ef...43fd1b91 > > src/hotspot/share/prims/foreign_globals.hpp line 35: > >> 33: #include CPU_HEADER(foreign_globals) >> 34: >> 35: class CallConvClosure { > > Just a question on terminology: why is it called a `Closure`? It is the terminology used in other parts of hotspot for function objects it seems. 
See for instance the classes in `iterator.hpp` > src/hotspot/share/prims/foreign_globals.hpp line 62: > >> 60: >> 61: >> 62: class JavaCallConv : public CallConvClosure { > > Does it really worth to abbreviate `CallingConvention` to `CallConv`? Maybe not... I'll spell out the full thing. ------------- PR: https://git.openjdk.java.net/jdk/pull/7959 From jvernee at openjdk.java.net Wed May 11 12:08:53 2022 From: jvernee at openjdk.java.net (Jorn Vernee) Date: Wed, 11 May 2022 12:08:53 GMT Subject: RFR: 8283689: Update the foreign linker VM implementation [v7] In-Reply-To: <1Y6gupd3SigdhlFjonKu4x4z-0mwPwCc-LJsqt6pp4c=.201b0c49-4887-4440-8526-31420fa3dda3@github.com> References: <1Y6gupd3SigdhlFjonKu4x4z-0mwPwCc-LJsqt6pp4c=.201b0c49-4887-4440-8526-31420fa3dda3@github.com> Message-ID: On Tue, 10 May 2022 21:01:48 GMT, Vladimir Ivanov wrote: >> Jorn Vernee has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 21 commits: >> >> - Merge branch 'foreign-preview-m' into JEP-19-VM-IMPL2 >> - Remove unneeded ComputeMoveOrder >> - Remove comment about native calls in lcm.cpp >> - 8284072: foreign/StdLibTest.java randomly crashes on MacOS/AArch64 >> >> Reviewed-by: jvernee, mcimadamore >> - Update riscv and arm stubs >> - Remove spurious ProblemList change >> - Pass pointer to LogStream >> - Polish >> - Replace TraceNativeInvokers flag with unified logging >> - Fix other platforms, take 2 >> - ... and 11 more: https://git.openjdk.java.net/jdk/compare/3c88a2ef...43fd1b91 > > src/hotspot/cpu/aarch64/universalUpcallHandler_aarch64.cpp line 306: > >> 304: intptr_t exception_handler_offset = __ pc() - start; >> 305: >> 306: // Native caller has no idea how to handle exceptions, > > Can you elaborate, please, how it is expected to work in presence of asynchronous exceptions? I'd expect to see a code which unconditionally clears pending exception with an assertion that verifies that the exception is of expected type. 
We have an exception handler in Java as well, so this code is only a fail-safe. But, I think in the case of asynchronous exceptions this might be problematic if the exception is discovered by the current thread outside of the Java exception handler, turned into a synchronous exception and then we get here and call `ProgrammableUpcallHandler::handle_uncaught_exception` and then crash. Or if the asynchronous exception is discovered in `ProgrammableUpcallHandler::on_exit` (where there is currently an assert for no exceptions). I think you're right that, in both of those cases, if the exception is asynchronous, we should just ignore it. ------------- PR: https://git.openjdk.java.net/jdk/pull/7959 From jvernee at openjdk.java.net Wed May 11 12:19:05 2022 From: jvernee at openjdk.java.net (Jorn Vernee) Date: Wed, 11 May 2022 12:19:05 GMT Subject: RFR: 8283689: Update the foreign linker VM implementation [v7] In-Reply-To: References: <1Y6gupd3SigdhlFjonKu4x4z-0mwPwCc-LJsqt6pp4c=.201b0c49-4887-4440-8526-31420fa3dda3@github.com> Message-ID: On Wed, 11 May 2022 10:51:10 GMT, Jorn Vernee wrote: >> src/hotspot/cpu/x86/foreign_globals_x86.hpp line 30: >> >>> 28: #include "utilities/growableArray.hpp" >>> 29: >>> 30: class outputStream; >> >> Redundant declaration? > > Yeah, this whole file is redundant :) (replaced by foreign_globals_x86_64.hpp) Hmm, it doesn't look like having x64-specific header files is supported by the CPU_HEADER macros. Will stick to the shared header file for x86_32 and x86_64 for now.
------------- PR: https://git.openjdk.java.net/jdk/pull/7959 From jvernee at openjdk.java.net Wed May 11 12:23:59 2022 From: jvernee at openjdk.java.net (Jorn Vernee) Date: Wed, 11 May 2022 12:23:59 GMT Subject: RFR: 8283689: Update the foreign linker VM implementation [v7] In-Reply-To: <1Y6gupd3SigdhlFjonKu4x4z-0mwPwCc-LJsqt6pp4c=.201b0c49-4887-4440-8526-31420fa3dda3@github.com> References: <1Y6gupd3SigdhlFjonKu4x4z-0mwPwCc-LJsqt6pp4c=.201b0c49-4887-4440-8526-31420fa3dda3@github.com> Message-ID: On Tue, 10 May 2022 20:45:02 GMT, Vladimir Ivanov wrote: >> Jorn Vernee has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 21 commits: >> >> - Merge branch 'foreign-preview-m' into JEP-19-VM-IMPL2 >> - Remove unneeded ComputeMoveOrder >> - Remove comment about native calls in lcm.cpp >> - 8284072: foreign/StdLibTest.java randomly crashes on MacOS/AArch64 >> >> Reviewed-by: jvernee, mcimadamore >> - Update riscv and arm stubs >> - Remove spurious ProblemList change >> - Pass pointer to LogStream >> - Polish >> - Replace TraceNativeInvokers flag with unified logging >> - Fix other platforms, take 2 >> - ... and 11 more: https://git.openjdk.java.net/jdk/compare/3c88a2ef...43fd1b91 > > src/hotspot/cpu/x86/foreign_globals_x86_64.cpp line 52: > >> 50: >> 51: objArrayOop inputStorage = jdk_internal_foreign_abi_ABIDescriptor::inputStorage(abi_oop); >> 52: loadArray(inputStorage, INTEGER_TYPE, abi._integer_argument_registers, as_Register); > > `loadArray` helper looks a bit misleading. In presence of `javaClass`-style accessors, it misleadingly hints that it refers to some Java-level operation/entity, though what it does it parses register list representation backed by a Java array. I suggest to rename it to something like `parse_argument_registers_array()`). 
I went with `parse_register_array` (since it's also used for return registers) ------------- PR: https://git.openjdk.java.net/jdk/pull/7959 From jvernee at openjdk.java.net Wed May 11 12:24:02 2022 From: jvernee at openjdk.java.net (Jorn Vernee) Date: Wed, 11 May 2022 12:24:02 GMT Subject: RFR: 8283689: Update the foreign linker VM implementation [v7] In-Reply-To: References: Message-ID: On Tue, 10 May 2022 20:53:37 GMT, Vladimir Ivanov wrote: >> src/hotspot/share/utilities/growableArray.hpp line 151: >> >>> 149: return _data; >>> 150: } >>> 151: >> >> This accessor is added to be able to temporarily view a stable GrowableArray instance as a C-style array. It is used by `NativeCallConv` and `RegSpiller` in `foreign_globals.hpp`. >> >> GrowableArray already has an `adr_at` accessor that does something similar, but using `adr_at(0)` fails on empty growable arrays since it also performs a bounds check. So it cannot be used. > > Any problems with migrating `CallConv` and `RegSpiller` away from `VMReg* + int` to `GrowableArray`? I'll try migrating to `GrowableArray` ------------- PR: https://git.openjdk.java.net/jdk/pull/7959 From fgao at openjdk.java.net Wed May 11 13:42:42 2022 From: fgao at openjdk.java.net (Fei Gao) Date: Wed, 11 May 2022 13:42:42 GMT Subject: RFR: 8283307: Vectorize unsigned shift right on signed subword types [v3] In-Reply-To: References: Message-ID: On Fri, 22 Apr 2022 11:09:09 GMT, Fei Gao wrote: >> public short[] vectorUnsignedShiftRight(short[] shorts) { >> short[] res = new short[SIZE]; >> for (int i = 0; i < SIZE; i++) { >> res[i] = (short) (shorts[i] >>> 3); >> } >> return res; >> } >> >> In C2's SLP, vectorization of unsigned shift right on signed subword types (byte/short) like the case above is intentionally disabled[1], because the vector unsigned shift on signed subword types behaves differently from the Java spec. It is worthwhile to vectorize more cases at quite low cost.
Also, unsigned shift right on signed subword is not uncommon and we may find similar cases in Lucene benchmark[2]. >> >> Taking unsigned right shift on short type as an example, >> ![image](https://user-images.githubusercontent.com/39403138/160313924-6bded802-c135-48db-98b8-7c5f43d8ff54.png) >> >> when the shift amount is a constant not greater than the number of sign extended bits, 16 higher bits for short type shown like >> above, the unsigned shift on signed subword types can be transformed into a signed shift and hence becomes vectorizable. Here is the transformation: >> ![image](https://user-images.githubusercontent.com/39403138/160314151-30249bfc-bdfc-4700-b4fb-97617b45184b.png) >> >> This patch does the transformation in `SuperWord::implemented()` and `SuperWord::output()`. It helps vectorize the short cases above. We can handle unsigned right shift on byte type in a similar way. The generated assembly code for one iteration on aarch64 is like: >> >> ... >> sbfiz x13, x10, #1, #32 >> add x15, x11, x13 >> ldr q16, [x15, #16] >> sshr v16.8h, v16.8h, #3 >> add x13, x17, x13 >> str q16, [x13, #16] >> ... >> >> >> Here is the performance data for micro-benchmark before and after this patch on both AArch64 and x64 machines. We can observe about ~80% improvement with this patch. >> >> The perf data on AArch64: >> Before the patch: >> Benchmark (SIZE) (shiftCount) Mode Cnt Score Error Units >> urShiftImmByte 1024 3 avgt 5 295.711 ? 0.117 ns/op >> urShiftImmShort 1024 3 avgt 5 284.559 ? 0.148 ns/op >> >> after the patch: >> Benchmark (SIZE) (shiftCount) Mode Cnt Score Error Units >> urShiftImmByte 1024 3 avgt 5 45.111 ? 0.047 ns/op >> urShiftImmShort 1024 3 avgt 5 55.294 ? 0.072 ns/op >> >> The perf data on X86: >> Before the patch: >> Benchmark (SIZE) (shiftCount) Mode Cnt Score Error Units >> urShiftImmByte 1024 3 avgt 5 361.374 ? 4.621 ns/op >> urShiftImmShort 1024 3 avgt 5 365.390 ? 
3.595 ns/op >> >> After the patch: >> Benchmark (SIZE) (shiftCount) Mode Cnt Score Error Units >> urShiftImmByte 1024 3 avgt 5 105.489 ? 0.488 ns/op >> urShiftImmShort 1024 3 avgt 5 43.400 ? 0.394 ns/op >> >> [1] https://github.com/openjdk/jdk/blob/002e3667443d94e2303c875daf72cf1ccbbb0099/src/hotspot/share/opto/vectornode.cpp#L190 >> [2] https://github.com/jpountz/decode-128-ints-benchmark/ > > Fei Gao has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains five additional commits since the last revision: > > - Rewrite the scalar calculation to avoid inline > > Change-Id: I5959d035278097de26ab3dfe6f667d6f7476c723 > - Merge branch 'master' into fg8283307 > > Change-Id: Id3ec8594da49fb4e6c6dcad888bcb1dfc0aac303 > - Remove related comments in some test files > > Change-Id: I5dd1c156bd80221dde53737e718da0254c5381d8 > - Merge branch 'master' into fg8283307 > > Change-Id: Ic4645656ea156e8cac993995a5dc675aa46cb21a > - 8283307: Vectorize unsigned shift right on signed subword types > > ``` > public short[] vectorUnsignedShiftRight(short[] shorts) { > short[] res = new short[SIZE]; > for (int i = 0; i < SIZE; i++) { > res[i] = (short) (shorts[i] >>> 3); > } > return res; > } > ``` > In C2's SLP, vectorization of unsigned shift right on signed > subword types (byte/short) like the case above is intentionally > disabled[1]. Because the vector unsigned shift on signed > subword types behaves differently from the Java spec. It's > worthy to vectorize more cases in quite low cost. Also, > unsigned shift right on signed subword is not uncommon and we > may find similar cases in Lucene benchmark[2]. > > Taking unsigned right shift on short type as an example, > > Short: > | <- 16 bits -> | <- 16 bits -> | > | 1 1 1 ... 
1 1 | data | > > when the shift amount is a constant not greater than the number > of sign extended bits, 16 higher bits for short type shown like > above, the unsigned shift on signed subword types can be > transformed into a signed shift and hence becomes vectorizable. > Here is the transformation: > > For T_SHORT (shift <= 16): > src RShiftCntV shift src RShiftCntV shift > \ / ==> \ / > URShiftVS RShiftVS > > This patch does the transformation in SuperWord::implemented() and > SuperWord::output(). It helps vectorize the short cases above. We > can handle unsigned right shift on byte type in a similar way. The > generated assembly code for one iteration on aarch64 is like: > ``` > ... > sbfiz x13, x10, #1, #32 > add x15, x11, x13 > ldr q16, [x15, #16] > sshr v16.8h, v16.8h, #3 > add x13, x17, x13 > str q16, [x13, #16] > ... > ``` > > Here is the performance data for micro-benchmark before and after > this patch on both AArch64 and x64 machines. We can observe about > ~80% improvement with this patch. > > The perf data on AArch64: > Before the patch: > Benchmark (SIZE) (shiftCount) Mode Cnt Score Error Units > urShiftImmByte 1024 3 avgt 5 295.711 ? 0.117 ns/op > urShiftImmShort 1024 3 avgt 5 284.559 ? 0.148 ns/op > > after the patch: > Benchmark (SIZE) (shiftCount) Mode Cnt Score Error Units > urShiftImmByte 1024 3 avgt 5 45.111 ? 0.047 ns/op > urShiftImmShort 1024 3 avgt 5 55.294 ? 0.072 ns/op > > The perf data on X86: > Before the patch: > Benchmark (SIZE) (shiftCount) Mode Cnt Score Error Units > urShiftImmByte 1024 3 avgt 5 361.374 ? 4.621 ns/op > urShiftImmShort 1024 3 avgt 5 365.390 ? 3.595 ns/op > > After the patch: > Benchmark (SIZE) (shiftCount) Mode Cnt Score Error Units > urShiftImmByte 1024 3 avgt 5 105.489 ? 0.488 ns/op > urShiftImmShort 1024 3 avgt 5 43.400 ? 
0.394 ns/op > > [1] https://github.com/openjdk/jdk/blob/002e3667443d94e2303c875daf72cf1ccbbb0099/src/hotspot/share/opto/vectornode.cpp#L190 > [2] https://github.com/jpountz/decode-128-ints-benchmark/ > > Change-Id: I9bd0cfdfcd9c477e8905a4c877d5e7ff14e39161 Can I get a second review please :) ------------- PR: https://git.openjdk.java.net/jdk/pull/7979 From jvernee at openjdk.java.net Wed May 11 14:29:19 2022 From: jvernee at openjdk.java.net (Jorn Vernee) Date: Wed, 11 May 2022 14:29:19 GMT Subject: RFR: 8283689: Update the foreign linker VM implementation [v8] In-Reply-To: References: Message-ID: > Hi, > > This PR updates the VM implementation of the foreign linker, by bringing over commits from the panama-foreign repo. > > This is split off from the main JEP integration for 19, since we have limited resources to handle this. As such, this PR might fall over to 20. > > I've written up an overview of the Linker architecture here: http://cr.openjdk.java.net/~jvernee/docs/FL_Overview.html it might be useful to read that first. > > This patch moves from the "legacy" implementation, to what is currently implemented in the panama-foreign repo, except for replacing the use of method handle combinators with ASM. That will come in a later patch. To recap, this PR contains the following changes: > > 1. VM stubs for downcalls are now generated up front, instead of lazily by C2 [1]. > 2. The VM support for upcalls/downcalls now supports all possible call shapes. The VM stubs and Java code implementing the buffered invocation strategy have been removed [2], [3], [4], [5]. > 3. The existing C2 intrinsification support for the `linkToNative` method handle linker was no longer needed and has been removed [6] (support might be re-added in another form later). > 4. Some other cleanups, such as: OptimizedEntryBlob (for upcalls) now implements RuntimeBlob directly.
Binding to java classes has been rewritten to use javaClasses.h/cpp (this wasn't previously possible due to these java classes being in an incubator module) [7], [8], [9]. > > While the patch mostly consists of VM changes, there are also some Java changes to support (2). > > The original commit structure has been mostly retained, so it might be useful to look at a specific commit, or the corresponding patch in the [panama-foreign](https://github.com/openjdk/panama-foreign/pulls?q=is%3Apr) repo as well. I've also left some inline comments to explain some of the changes, which will hopefully make reviewing easier. > > Testing: Tier1-4 > > Thanks, > Jorn > > [1]: https://github.com/openjdk/jdk/pull/7959/commits/048b88156814579dca1f70742061ad24942fd358 > [2]: https://github.com/openjdk/jdk/pull/7959/commits/2fbbef472b4c2b4fee5ede2f18cd81ab61e88f49 > [3]: https://github.com/openjdk/jdk/pull/7959/commits/8a957a4ed9cc8d1f708ea8777212eb51ab403dc3 > [4]: https://github.com/openjdk/jdk/pull/7959/commits/35ba1d964f1de4a77345dc58debe0565db4b0ff3 > [5]: https://github.com/openjdk/jdk/pull/7959/commits/4e72aae22920300c5ffa16fed805b62ed9092120 > [6]: https://github.com/openjdk/jdk/pull/7959/commits/08e22e1b468c5c8f0cfd7135c72849944068aa7a > [7]: https://github.com/openjdk/jdk/pull/7959/commits/451cd9edf54016c182dab21a8b26bd8b609fc062 > [8]: https://github.com/openjdk/jdk/pull/7959/commits/4c851d2795afafec3a3ab17f4142ee098692068f > [9]: https://github.com/openjdk/jdk/pull/7959/commits/d025377799424f31512dca2ffe95491cd5ae22f9 Jorn Vernee has updated the pull request incrementally with two additional commits since the last revision: - Migrate to GrowableArray - Address some review comments ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/7959/files - new: https://git.openjdk.java.net/jdk/pull/7959/files/43fd1b91..abd2b6ca Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=7959&range=07 - incr: 
https://webrevs.openjdk.java.net/?repo=jdk&pr=7959&range=06-07 Stats: 254 lines in 23 files changed: 22 ins; 125 del; 107 mod Patch: https://git.openjdk.java.net/jdk/pull/7959.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/7959/head:pull/7959 PR: https://git.openjdk.java.net/jdk/pull/7959 From jvernee at openjdk.java.net Wed May 11 14:29:25 2022 From: jvernee at openjdk.java.net (Jorn Vernee) Date: Wed, 11 May 2022 14:29:25 GMT Subject: RFR: 8283689: Update the foreign linker VM implementation [v7] In-Reply-To: <1Y6gupd3SigdhlFjonKu4x4z-0mwPwCc-LJsqt6pp4c=.201b0c49-4887-4440-8526-31420fa3dda3@github.com> References: <1Y6gupd3SigdhlFjonKu4x4z-0mwPwCc-LJsqt6pp4c=.201b0c49-4887-4440-8526-31420fa3dda3@github.com> Message-ID: <9eViilPD2YEH0Lt8StEhiGJfDlMSSa1NpPC02MKItuM=.0ce2490e-9a53-46b3-a14b-ab88b4a0f3fc@github.com> On Tue, 10 May 2022 21:02:39 GMT, Vladimir Ivanov wrote: >> Jorn Vernee has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 21 commits: >> >> - Merge branch 'foreign-preview-m' into JEP-19-VM-IMPL2 >> - Remove unneeded ComputeMoveOrder >> - Remove comment about native calls in lcm.cpp >> - 8284072: foreign/StdLibTest.java randomly crashes on MacOS/AArch64 >> >> Reviewed-by: jvernee, mcimadamore >> - Update riscv and arm stubs >> - Remove spurious ProblemList change >> - Pass pointer to LogStream >> - Polish >> - Replace TraceNativeInvokers flag with unified logging >> - Fix other platforms, take 2 >> - ... and 11 more: https://git.openjdk.java.net/jdk/compare/3c88a2ef...43fd1b91 > > Nice work! Looks good. Some minor comments/questions follow. @iwanowww Thanks for the review! I've addressed most comments already, but will have to think a bit on how to handle the asynchronous exceptions issue. > src/hotspot/cpu/aarch64/macroAssembler_aarch64.cpp line 5531: > >> 5529: } >> 5530: >> 5531: // On 64 bit we will store integer like items to the stack as > > Time for a cleanup? 
`64 bit` vs `64bit`, `abi`, `Aarch64`. I've cleaned up the spaces and capitalization here a bit. > src/hotspot/cpu/x86/sharedRuntime_x86_64.cpp line 36: > >> 34: #include "code/nativeInst.hpp" >> 35: #include "code/vtableStubs.hpp" >> 36: #include "compiler/disassembler.hpp" > > Redundant includes? No new code added in the file. Good catch. Seems like a merge artifact maybe. Removing them seems to be fine. ------------- PR: https://git.openjdk.java.net/jdk/pull/7959 From jvernee at openjdk.java.net Wed May 11 14:33:56 2022 From: jvernee at openjdk.java.net (Jorn Vernee) Date: Wed, 11 May 2022 14:33:56 GMT Subject: RFR: 8283689: Update the foreign linker VM implementation [v7] In-Reply-To: References: <1Y6gupd3SigdhlFjonKu4x4z-0mwPwCc-LJsqt6pp4c=.201b0c49-4887-4440-8526-31420fa3dda3@github.com> Message-ID: On Wed, 11 May 2022 11:06:51 GMT, Jorn Vernee wrote: >> src/hotspot/share/opto/callGenerator.cpp line 1131: >> >>> 1129: >>> 1130: case vmIntrinsics::_linkToNative: >>> 1131: print_inlining_failure(C, callee, jvms->depth() - 1, jvms->bci(), >> >> Why is it unconditionally reported as inlining failure? > > The call that is being processed here is `linkToNative`, and that call is not inlined, so reporting an inlining failure seems correct. We still go through the method handle trampoline stub which loads the actual target from the NativeEntryPoint appendix argument (`jump_to_native_invoker` in methodHandles_x86.cpp). > > It's potentially faster here to generate a runtime call to the underlying invoker/downcall stub if the NativeEntryPoint is constant (i.e. avoid the lookup through NEP in the MH trampoline), but I hadn't gotten to investigating that yet. > > From comparing the benchmark times between this and the old implementation (which generated an inline call), they are not all that different. So it seemed that doing something special here would not save that much time any ways. 
(but, still would be good to investigate at some point) I've filed: https://bugs.openjdk.java.net/browse/JDK-8286588 ------------- PR: https://git.openjdk.java.net/jdk/pull/7959 From jvernee at openjdk.java.net Wed May 11 14:34:02 2022 From: jvernee at openjdk.java.net (Jorn Vernee) Date: Wed, 11 May 2022 14:34:02 GMT Subject: RFR: 8283689: Update the foreign linker VM implementation [v7] In-Reply-To: <1Y6gupd3SigdhlFjonKu4x4z-0mwPwCc-LJsqt6pp4c=.201b0c49-4887-4440-8526-31420fa3dda3@github.com> References: <1Y6gupd3SigdhlFjonKu4x4z-0mwPwCc-LJsqt6pp4c=.201b0c49-4887-4440-8526-31420fa3dda3@github.com> Message-ID: <1xENEcoejLZmUfLYTJsJ-nUWF9dlZ3BdXdpvSU_JoPk=.8efe4cde-cdc1-4174-a4a2-5437464f55d2@github.com> On Tue, 10 May 2022 20:38:05 GMT, Vladimir Ivanov wrote: >> Jorn Vernee has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 21 commits: >> >> - Merge branch 'foreign-preview-m' into JEP-19-VM-IMPL2 >> - Remove unneeded ComputeMoveOrder >> - Remove comment about native calls in lcm.cpp >> - 8284072: foreign/StdLibTest.java randomly crashes on MacOS/AArch64 >> >> Reviewed-by: jvernee, mcimadamore >> - Update riscv and arm stubs >> - Remove spurious ProblemList change >> - Pass pointer to LogStream >> - Polish >> - Replace TraceNativeInvokers flag with unified logging >> - Fix other platforms, take 2 >> - ... and 11 more: https://git.openjdk.java.net/jdk/compare/3c88a2ef...43fd1b91 > > src/hotspot/share/prims/foreign_globals.cpp line 217: > >> 215: >> 216: public: >> 217: ForeignCMO(int total_in_args, const VMRegPair* in_regs, int total_out_args, VMRegPair* out_regs, > > I propose to turn it into a trivial ctor and move all the logic into a helper static function which returns the computed moves. Done. Moved constructor logic into a `compute` method, and added a static helper that constructs a ComputeMoveOrder, calls `compute`, and then returns the moves. 
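For readers following along, the shape described above (a trivial constructor, a `compute` method, and a static helper that returns the moves) might look roughly like the sketch below. This is a minimal standalone illustration and not the HotSpot code: `Reg`, `Move`, and `compute_move_order` are hypothetical stand-ins, it uses `std::vector` instead of `GrowableArray`, and the ordering logic is simplified to acyclic moves only (a swap would need a temporary register, which is not handled here).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-ins for VMReg-based moves; these are not HotSpot types.
using Reg = int;
struct Move { Reg from; Reg to; };

class ComputeMoveOrder {
  const std::vector<Move>& _in;
  std::vector<Move> _out;

  // Trivial constructor: it only stores its input.
  explicit ComputeMoveOrder(const std::vector<Move>& in) : _in(in) {}

  // All of the work lives here rather than in the constructor.
  void compute() {
    std::vector<Move> pending(_in);
    while (!pending.empty()) {
      bool progressed = false;
      for (std::size_t i = 0; i < pending.size(); i++) {
        // A move is safe to emit once no other pending move still reads its destination.
        bool clobbers = false;
        for (std::size_t j = 0; j < pending.size(); j++) {
          if (j != i && pending[j].from == pending[i].to) { clobbers = true; break; }
        }
        if (!clobbers) {
          _out.push_back(pending[i]);
          pending.erase(pending.begin() + static_cast<std::ptrdiff_t>(i));
          progressed = true;
          break;
        }
      }
      // Simplification: cyclic moves (e.g. a swap) are outside the scope of this sketch.
      assert(progressed && "cyclic moves are not handled in this sketch");
    }
  }

public:
  // Static helper: construct, compute, and return the ordered moves.
  static std::vector<Move> compute_move_order(const std::vector<Move>& in) {
    ComputeMoveOrder cmo(in);
    cmo.compute();
    return cmo._out;
  }
};
```

With this shape, callers only ever see the static helper, so the partially initialized object never escapes and the constructor cannot fail halfway through.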
------------- PR: https://git.openjdk.java.net/jdk/pull/7959 From redestad at openjdk.java.net Wed May 11 15:04:21 2022 From: redestad at openjdk.java.net (Claes Redestad) Date: Wed, 11 May 2022 15:04:21 GMT Subject: RFR: 8286401: Address possibly lossy conversions in Microbenchmarks Message-ID: #8599 would add a new warning. This addresses the conversions in the microbenchmark component by means of making the types precise or adding explicit casts. There are quite a few changes in the ByteBuffers benchmarks, but the real change is in the template as these are generated. I've run through a subset of the affected benchmarks and verified that the results are either neutral or improve somewhat (which seems to be the case in a few of the ByteBuffer micros). ------------- Commit messages: - 8286401: Address possibly lossy conversions in Microbenchmarks Changes: https://git.openjdk.java.net/jdk/pull/8654/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=8654&range=00 Issue: https://bugs.openjdk.java.net/browse/JDK-8286401 Stats: 166 lines in 12 files changed: 0 ins; 0 del; 166 mod Patch: https://git.openjdk.java.net/jdk/pull/8654.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8654/head:pull/8654 PR: https://git.openjdk.java.net/jdk/pull/8654 From psandoz at openjdk.java.net Wed May 11 15:07:57 2022 From: psandoz at openjdk.java.net (Paul Sandoz) Date: Wed, 11 May 2022 15:07:57 GMT Subject: RFR: 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature [v3] In-Reply-To: References: Message-ID: <4LG2tZgoxvuaUEi78DyUrMbI9dIOM8CPk7GbkpZtp6M=.49894db1-a271-47a1-b8dc-68a1d5f46915@github.com> On Wed, 11 May 2022 03:23:13 GMT, Xiaohong Gong wrote: >> I modified the code of this PR to avoid the conversion of `boolean` to `int`, so a constant integer value is passed all the way through, and the masked load is made intrinsic from the method at which the constants are passed as arguments i.e. the public `fromArray` mask accepting method.
> > Hi @PaulSandoz, thanks for the patch for the constant int parameter. I think the main change is: > > - ByteVector fromArray0Template(Class maskClass, C base, long offset, int index, M m, boolean offsetInRange, > + ByteVector fromArray0Template(Class maskClass, C base, long offset, int index, M m, int offsetInRange, > VectorSupport.LoadVectorMaskedOperation defaultImpl) { > m.check(species()); > ByteSpecies vsp = vspecies(); > - if (offsetInRange) { > - return VectorSupport.loadMasked( > - vsp.vectorType(), maskClass, vsp.elementType(), vsp.laneCount(), > - base, offset, m, /* offsetInRange */ 1, > - base, index, vsp, defaultImpl); > - } else { > - return VectorSupport.loadMasked( > - vsp.vectorType(), maskClass, vsp.elementType(), vsp.laneCount(), > - base, offset, m, /* offsetInRange */ 0, > - base, index, vsp, defaultImpl); > - } > + return VectorSupport.loadMasked( > + vsp.vectorType(), maskClass, vsp.elementType(), vsp.laneCount(), > + base, offset, m, offsetInRange == 1 ? 1 : 0, > + base, index, vsp, defaultImpl); > } > > which uses `offsetInRange == 1 ? 1 : 0`. Unfortunately, this cannot always ensure that `offsetInRange` is a constant at compile time. Again, this change could also make the assertion fail randomly: > > --- a/src/hotspot/share/opto/vectorIntrinsics.cpp > +++ b/src/hotspot/share/opto/vectorIntrinsics.cpp > @@ -1236,6 +1236,7 @@ bool LibraryCallKit::inline_vector_mem_masked_operation(bool is_store) { > } else { > // Masked vector load with IOOBE always uses the predicated load. > const TypeInt* offset_in_range = gvn().type(argument(8))->isa_int(); > + assert(offset_in_range->is_con(), "must be a constant"); > if (!offset_in_range->is_con()) { > if (C->print_intrinsics()) { > tty->print_cr(" ** missing constant: offsetInRange=%s", > > Sometimes the compiler can treat it as a constant. I think this depends on the compiler OSR and speculative optimization. Did you try an example with IOOBE on non-predicated hardware?
> > Here is the main code of my unittest to reproduce the issue: > > static final VectorSpecies I_SPECIES = IntVector.SPECIES_128; > static final int LENGTH = 1026; > public static int[] ia; > public static int[] ib; > > private static void init() { > for (int i = 0; i < LENGTH; i++) { > ia[i] = i; > ib[i] = 0; > } > > for (int i = 0; i < 2; i++) { > m[i] = i % 2 == 0; > } > } > > private static void func() { > VectorMask mask = VectorMask.fromArray(I_SPECIES, m, 0); > for (int i = 0; i < LENGTH; i += vl) { > IntVector av = IntVector.fromArray(I_SPECIES, ia, i, mask); > av.lanewise(VectorOperators.ABS).intoArray(ic, i, mask); > } > } > > public static void main(String[] args) { > init(); > for (int i = 0; i < 10000; i++) { > func(); > } > } @XiaohongGong Doh! The ternary was an experiment, and I forgot to re-run the code gen script before sending you the patch. See `IntVector`, which does not have that. I presume when the offset is not in range and the other code path is taken then it might be problematic unless all code paths are inlined. I will experiment further with tests. ------------- PR: https://git.openjdk.java.net/jdk/pull/8035 From jvernee at openjdk.java.net Wed May 11 15:24:39 2022 From: jvernee at openjdk.java.net (Jorn Vernee) Date: Wed, 11 May 2022 15:24:39 GMT Subject: RFR: 8283689: Update the foreign linker VM implementation [v9] In-Reply-To: References: Message-ID: > Hi, > > This PR updates the VM implementation of the foreign linker, by bringing over commits from the panama-foreign repo. > > This is split off from the main JEP integration for 19, since we have limited resources to handle this. As such, this PR might fall over to 20. > > I've written up an overview of the Linker architecture here: http://cr.openjdk.java.net/~jvernee/docs/FL_Overview.html it might be useful to read that first.
> > This patch moves from the "legacy" implementation, to what is currently implemented in the panama-foreign repo, except for replacing the use of method handle combinators with ASM. That will come in a later patch. To recap, this PR contains the following changes: > > 1. VM stubs for downcalls are now generated up front, instead of lazily by C2 [1]. > 2. The VM support for upcalls/downcalls now supports all possible call shapes. The VM stubs and Java code implementing the buffered invocation strategy have been removed [2], [3], [4], [5]. > 3. The existing C2 intrinsification support for the `linkToNative` method handle linker was no longer needed and has been removed [6] (support might be re-added in another form later). > 4. Some other cleanups, such as: OptimizedEntryBlob (for upcalls) now implements RuntimeBlob directly. Binding to java classes has been rewritten to use javaClasses.h/cpp (this wasn't previously possible due to these java classes being in an incubator module) [7], [8], [9]. > > While the patch mostly consists of VM changes, there are also some Java changes to support (2). > > The original commit structure has been mostly retained, so it might be useful to look at a specific commit, or the corresponding patch in the [panama-foreign](https://github.com/openjdk/panama-foreign/pulls?q=is%3Apr) repo as well. I've also left some inline comments to explain some of the changes, which will hopefully make reviewing easier.
> > Testing: Tier1-4 > > Thanks, > Jorn > > [1]: https://github.com/openjdk/jdk/pull/7959/commits/048b88156814579dca1f70742061ad24942fd358 > [2]: https://github.com/openjdk/jdk/pull/7959/commits/2fbbef472b4c2b4fee5ede2f18cd81ab61e88f49 > [3]: https://github.com/openjdk/jdk/pull/7959/commits/8a957a4ed9cc8d1f708ea8777212eb51ab403dc3 > [4]: https://github.com/openjdk/jdk/pull/7959/commits/35ba1d964f1de4a77345dc58debe0565db4b0ff3 > [5]: https://github.com/openjdk/jdk/pull/7959/commits/4e72aae22920300c5ffa16fed805b62ed9092120 > [6]: https://github.com/openjdk/jdk/pull/7959/commits/08e22e1b468c5c8f0cfd7135c72849944068aa7a > [7]: https://github.com/openjdk/jdk/pull/7959/commits/451cd9edf54016c182dab21a8b26bd8b609fc062 > [8]: https://github.com/openjdk/jdk/pull/7959/commits/4c851d2795afafec3a3ab17f4142ee098692068f > [9]: https://github.com/openjdk/jdk/pull/7959/commits/d025377799424f31512dca2ffe95491cd5ae22f9 Jorn Vernee has updated the pull request incrementally with one additional commit since the last revision: Fix use of rt_call ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/7959/files - new: https://git.openjdk.java.net/jdk/pull/7959/files/abd2b6ca..c6754a1c Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=7959&range=08 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=7959&range=07-08 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.java.net/jdk/pull/7959.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/7959/head:pull/7959 PR: https://git.openjdk.java.net/jdk/pull/7959 From shade at openjdk.java.net Wed May 11 15:26:45 2022 From: shade at openjdk.java.net (Aleksey Shipilev) Date: Wed, 11 May 2022 15:26:45 GMT Subject: RFR: 8286401: Address possibly lossy conversions in Microbenchmarks In-Reply-To: References: Message-ID: On Wed, 11 May 2022 14:57:16 GMT, Claes Redestad wrote: > #8599 would add a new warning. 
This addresses the conversions in the microbenchmark component by means of making the types precise or adding explicit casts. There are quite a few changes in the ByteBuffers benchmarks, but the real change is in the template as these are generated. > > I've run through a subset of the affected benchmarks and verified that the results are either neutral or improve somewhat (which seems to be the case in a few of the ByteBuffer micros). I have questions. Also, copyright dates are not consistently updated in affected files? test/micro/org/openjdk/bench/vm/compiler/PointerBenchmarkFlat.java line 151: > 149: int sum = 0; > 150: for (int i = 0 ; i < ELEM_SIZE ; i++) { > 151: sum += (int)ptr_ptr.get(i).address().toRawLongValue(); Here and later: `toRawLongValue` returns `long`, right? So why don't we change the accumulator and return value to `long`, like we do in other tests? ------------- PR: https://git.openjdk.java.net/jdk/pull/8654 From jvernee at openjdk.java.net Wed May 11 15:47:11 2022 From: jvernee at openjdk.java.net (Jorn Vernee) Date: Wed, 11 May 2022 15:47:11 GMT Subject: RFR: 8283689: Update the foreign linker VM implementation [v7] In-Reply-To: References: <1Y6gupd3SigdhlFjonKu4x4z-0mwPwCc-LJsqt6pp4c=.201b0c49-4887-4440-8526-31420fa3dda3@github.com> Message-ID: <_lwAL7Yg4Rr98gmWeQisR1ioc8MkVK87npZEUbB4vOw=.6434e6a4-35b9-4f81-9df3-d71973f1d75e@github.com> On Wed, 11 May 2022 12:05:29 GMT, Jorn Vernee wrote: >> src/hotspot/cpu/aarch64/universalUpcallHandler_aarch64.cpp line 306: >> >>> 304: intptr_t exception_handler_offset = __ pc() - start; >>> 305: >>> 306: // Native caller has no idea how to handle exceptions, >> >> Can you elaborate, please, how it is expected to work in presence of asynchronous exceptions? I'd expect to see a code which unconditionally clears pending exception with an assertion that verifies that the exception is of expected type. > > We have an exception handler in Java as well, so this code is only a fail-safe.
But, I think in the case of asynchronous exceptions this might be problematic if the exception is discovered by the current thread outside of the Java exception handler, turned into a synchronous exception and then we get here and call `ProgrammableUpcallhandler::handle_uncaught_exception` and then crash. Or if the asynchronous exception is discovered in `ProgrammableUpcallHandler::on_exit` (where there is currently an assert for no exceptions). > > I think you're right that, in both of those cases, if the exception is asynchronous, we should just ignore it. Discussed this with Maurizio as well. My understanding is as follows: 1. Async exceptions are installed into a thread's `pending_exception` field by handshake at a safepoint 2. From there they are "thrown" (I guess during the same safepoint?), in which case we either end up in a user-defined exception handler (higher up the stack), in our Java fallback exception handler (print stack trace and terminate), or in this fallback exception handler (print and terminate). 3. If we end up in this exception handler it means the async exception was installed somewhere outside of our Java exception handler and "thrown" from there. However, it also means that the Java code we were calling into completed abruptly. 4. We've previously established that we have no way of signalling to the native code that is calling us that something went wrong, and so the only safe option is to terminate, as to not leave the application in an inconsistent state. As a consequence, I don't think we have much choice in the case of async exceptions if we get here. Silently clearing them seems like it will leave the program in an inconsistent state (since we unwound some frames), so we have to terminate I think. (@dholmes-ora is my understanding of async exceptions in point 1. and 2. correct here?) 
------------- PR: https://git.openjdk.java.net/jdk/pull/7959 From jvernee at openjdk.java.net Wed May 11 15:47:12 2022 From: jvernee at openjdk.java.net (Jorn Vernee) Date: Wed, 11 May 2022 15:47:12 GMT Subject: RFR: 8283689: Update the foreign linker VM implementation [v9] In-Reply-To: References: Message-ID: On Wed, 11 May 2022 12:20:43 GMT, Jorn Vernee wrote: >> Any problems with migrating `CallConv` and `RegSpiller` away from `VMReg* + int` to `GrowableArray`? > > I'll try migrating to `GrowableArray` Done. I've removed these accessors as well. ------------- PR: https://git.openjdk.java.net/jdk/pull/7959 From redestad at openjdk.java.net Wed May 11 15:50:40 2022 From: redestad at openjdk.java.net (Claes Redestad) Date: Wed, 11 May 2022 15:50:40 GMT Subject: RFR: 8286401: Address possibly lossy conversions in Microbenchmarks [v2] In-Reply-To: References: Message-ID: > #8599 would add a new warning. This addresses the conversions in the microbenchmark component by means of making the types precise or adding explicit casts. There are quite a few changes in the ByteBuffers benchmarks, but the real change is in the template as these are generated. > > I've run through a subset of the affected benchmarks and verified that the results are either neutral or improve somewhat (which seems to be the case in a few of the ByteBuffer micros).
Claes Redestad has updated the pull request incrementally with one additional commit since the last revision: Copyrights, consistently use the exact accumulator type ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/8654/files - new: https://git.openjdk.java.net/jdk/pull/8654/files/5cba7820..41f37a25 Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=8654&range=01 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=8654&range=00-01 Stats: 17 lines in 11 files changed: 0 ins; 0 del; 17 mod Patch: https://git.openjdk.java.net/jdk/pull/8654.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8654/head:pull/8654 PR: https://git.openjdk.java.net/jdk/pull/8654 From shade at openjdk.java.net Wed May 11 15:50:40 2022 From: shade at openjdk.java.net (Aleksey Shipilev) Date: Wed, 11 May 2022 15:50:40 GMT Subject: RFR: 8286401: Address possibly lossy conversions in Microbenchmarks [v2] In-Reply-To: References: Message-ID: On Wed, 11 May 2022 15:47:29 GMT, Claes Redestad wrote: >> #8599 would add a new warning. This addresses the conversions in the microbenchmark component by means of making the types precise or adding explicit casts. There are quite a few changes in the ByteBuffers benchmarks, but the real change is in the template as these are generated. >> >> I've run through a subset of the affected benchmarks and verified that the results are either neutral or improve somewhat (which seems to be the case in a few of the ByteBuffer micros). > > Claes Redestad has updated the pull request incrementally with one additional commit since the last revision: > > Copyrights, consistently use the exact accumulator type Marked as reviewed by shade (Reviewer).
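Editorial sketch, not part of the original mail: the accumulator discussion in this thread can be illustrated with a small standalone example. In Java, a compound assignment such as `sum += value` implicitly casts the result back to the type of `sum`, so adding a `long` to an `int` accumulator compiles but may silently drop the upper 32 bits. The class and method names below are hypothetical and are not taken from the PR; the first method mirrors the "explicit cast" shape and the second the "exact accumulator type" shape.

```java
// Illustrative only: class and method names are hypothetical, not from the PR.
final class AccumulatorTypeSketch {

    // Original shape: int accumulator, so each long value is narrowed by a cast.
    static int sumWithIntAccumulator(long[] rawValues) {
        int sum = 0;
        for (int i = 0; i < rawValues.length; i++) {
            // Writing `sum += rawValues[i]` would also compile, because a
            // compound assignment narrows its result implicitly; the explicit
            // cast only makes the loss of the upper 32 bits visible.
            sum += (int) rawValues[i];
        }
        return sum;
    }

    // Revised shape: the accumulator has the exact type of the values added.
    static long sumWithLongAccumulator(long[] rawValues) {
        long sum = 0;
        for (int i = 0; i < rawValues.length; i++) {
            sum += rawValues[i];
        }
        return sum;
    }

    public static void main(String[] args) {
        long[] rawValues = {0x1_0000_0001L, 2L};
        System.out.println(sumWithIntAccumulator(rawValues));
        System.out.println(sumWithLongAccumulator(rawValues));
    }
}
```

With values that do not fit in 32 bits the two methods return different sums, which is presumably why the review asked for the `long` accumulator rather than a cast.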
------------- PR: https://git.openjdk.java.net/jdk/pull/8654 From redestad at openjdk.java.net Wed May 11 15:50:41 2022 From: redestad at openjdk.java.net (Claes Redestad) Date: Wed, 11 May 2022 15:50:41 GMT Subject: RFR: 8286401: Address possibly lossy conversions in Microbenchmarks [v2] In-Reply-To: References: Message-ID: On Wed, 11 May 2022 15:21:51 GMT, Aleksey Shipilev wrote: >> Claes Redestad has updated the pull request incrementally with one additional commit since the last revision: >> >> Copyrights, consistently use the exact accumulator type > > test/micro/org/openjdk/bench/vm/compiler/PointerBenchmarkFlat.java line 151: > >> 149: int sum = 0; >> 150: for (int i = 0 ; i < ELEM_SIZE ; i++) { >> 151: sum += (int)ptr_ptr.get(i).address().toRawLongValue(); > > Here and later: `toRawLongValue` returns `long`, right? So why don't we change the accumulator and return value to `long`, like we do in other tests? Yeah, let's be consistent. I have verified that this does not affect the raw scores of these benchmarks. ------------- PR: https://git.openjdk.java.net/jdk/pull/8654 From redestad at openjdk.java.net Wed May 11 16:03:49 2022 From: redestad at openjdk.java.net (Claes Redestad) Date: Wed, 11 May 2022 16:03:49 GMT Subject: RFR: 8286401: Address possibly lossy conversions in Microbenchmarks [v2] In-Reply-To: References: Message-ID: On Wed, 11 May 2022 15:50:40 GMT, Claes Redestad wrote: >> #8599 would add a new warning. This addresses the conversions in the microbenchmark component by means of making the types precise or adding explicit casts. There are quite a few changes in the ByteBuffers benchmarks, but the real change is in the template as these are generated. >> >> I've run through a subset of the affected benchmarks and verified that the results are either neutral or improve somewhat (which seems to be the case in a few of the ByteBuffer micros).
> > Claes Redestad has updated the pull request incrementally with one additional commit since the last revision: > > Copyrights, consistently use the exact accumulator type Thanks for reviewing. I'll let the GHA tests complete and integrate this tomorrow if all is clear. ------------- PR: https://git.openjdk.java.net/jdk/pull/8654 From ecaspole at openjdk.java.net Wed May 11 16:18:52 2022 From: ecaspole at openjdk.java.net (Eric Caspole) Date: Wed, 11 May 2022 16:18:52 GMT Subject: RFR: 8286401: Address possibly lossy conversions in Microbenchmarks [v2] In-Reply-To: References: Message-ID: On Wed, 11 May 2022 15:50:40 GMT, Claes Redestad wrote: >> #8599 would add a new warning. This addresses the conversions in the microbenchmark component by means of making the types precise or adding explicit casts. There are quite a few changes in the ByteBuffers benchmarks, but the real change is in the template as these are generated. >> >> I've run through a subset of the affected benchmarks and verified that the results are either neutral or improve somewhat (which seems to be the case in a few of the ByteBuffer micros). > > Claes Redestad has updated the pull request incrementally with one additional commit since the last revision: > > Copyrights, consistently use the exact accumulator type Looks good.
------------- PR: https://git.openjdk.java.net/jdk/pull/8654 From pchilanomate at openjdk.java.net Wed May 11 16:24:02 2022 From: pchilanomate at openjdk.java.net (Patricio Chilano Mateo) Date: Wed, 11 May 2022 16:24:02 GMT Subject: RFR: 8283689: Update the foreign linker VM implementation [v7] In-Reply-To: <_lwAL7Yg4Rr98gmWeQisR1ioc8MkVK87npZEUbB4vOw=.6434e6a4-35b9-4f81-9df3-d71973f1d75e@github.com> References: <1Y6gupd3SigdhlFjonKu4x4z-0mwPwCc-LJsqt6pp4c=.201b0c49-4887-4440-8526-31420fa3dda3@github.com> <_lwAL7Yg4Rr98gmWeQisR1ioc8MkVK87npZEUbB4vOw=.6434e6a4-35b9-4f81-9df3-d71973f1d75e@github.com> Message-ID: <9J0HneQ8kNy0t1-JDUQsXzoj4ljYwg80jiespX8laL8=.c02f9a8e-04b5-40db-8024-cdec556fcc53@github.com> On Wed, 11 May 2022 15:44:19 GMT, Jorn Vernee wrote: >> We have an exception handler in Java as well, so this code is only a fail safe. But, I think in the case of asynchronous exceptions this might be problematic if the exception is discovered by the current thread outside of the Java exception handler, turned into a synchronous exception and then we get here and call `ProgrammableUpcallhandler::handle_uncaught_exception` and then crash. Or if the asynchronous exception is discovered in `ProgrammableUpcallHandler::on_exit` (where there is currently an assert for no exceptions). >> >> I think you're right that, in both of those cases, if the exception is asynchronous, we should just ignore it. > > Discussed this with Maurizio as well. > > My understanding is as follows: > 1. Async exceptions are installed into a thread's `pending_exception` field by handshake at a safepoint > 2. From there they are "thrown" (I guess during the same safepoint?), in which case we either end up in a user-defined exception handler (higher up the stack), in our Java fallback exception handler (print stack trace and terminate), or in this fallback exception handler (print and terminate). > 3. 
If we end up in this exception handler it means the async exception was installed somewhere outside of our Java exception handler and "thrown" from there. However, it also means that the Java code we were calling into completed abruptly. > 4. We've previously established that we have no way of signalling to the native code that is calling us that something went wrong, and so the only safe option is to terminate, as to not leave the application in an inconsistent state. > > As a consequence, I don't think we have much choice in the case of async exceptions if we get here. Silently clearing them seems like it will leave the program in an inconsistent state (since we unwound some frames), so we have to terminate I think. > > (@dholmes-ora is my understanding of async exceptions in point 1. and 2. correct here?) If you want to avoid processing asynchronous exceptions during this upcall you could block them (check NoAsyncExceptionDeliveryMark in JavaThread::exit()). Seems you could set the flag in ProgrammableUpcallhandler::on_entry() and unset it back on ProgrammableUpcallhandler::on_exit(). While that flag is set any asynchronous exception in the handshake queue of this thread will be skipped from processing. Maybe we should add a public method in the JavaThread class, block_async_exceptions()/unblock_async_exceptions() so we hide the handshake implementation. ------------- PR: https://git.openjdk.java.net/jdk/pull/7959 From ecaspole at openjdk.java.net Wed May 11 16:38:51 2022 From: ecaspole at openjdk.java.net (Eric Caspole) Date: Wed, 11 May 2022 16:38:51 GMT Subject: RFR: 8286401: Address possibly lossy conversions in Microbenchmarks [v2] In-Reply-To: References: Message-ID: <6v98y-ilQDIQiWxh_Cq_tiKszBQS8GjJ0oPMlc8W_GI=.59ad3686-d1aa-4b0b-b742-33e83c7fb795@github.com> On Wed, 11 May 2022 15:50:40 GMT, Claes Redestad wrote: >> #8599 would add a new warning. 
This addresses the conversions in the microbenchmark component by means of making the types precise or adding explicit casts. There are quite a few changes in the ByteBuffers benchmarks, but the real change is in the template as these are generated. >> >> I've run through a subset of the affected benchmarks and verified that the results are either neutral or improve somewhat (which seems to be the case in a few of the ByteBuffer micros). > > Claes Redestad has updated the pull request incrementally with one additional commit since the last revision: > > Copyrights, consistently use the exact accumulator type Looks good, thanks for fixing this. ------------- Marked as reviewed by ecaspole (Committer). PR: https://git.openjdk.java.net/jdk/pull/8654 From shade at openjdk.java.net Wed May 11 16:42:47 2022 From: shade at openjdk.java.net (Aleksey Shipilev) Date: Wed, 11 May 2022 16:42:47 GMT Subject: RFR: 8286401: Address possibly lossy conversions in Microbenchmarks [v2] In-Reply-To: References: Message-ID: On Wed, 11 May 2022 16:00:42 GMT, Claes Redestad wrote: > Thanks for reviewing. I'll let the GHA tests complete and integrate this tomorrow if all is clear. I don't think GHA builds any microbenchmarks (because JMH is not enabled there), so there is no point in waiting for those.
------------- PR: https://git.openjdk.java.net/jdk/pull/8654 From jvernee at openjdk.java.net Wed May 11 16:42:59 2022 From: jvernee at openjdk.java.net (Jorn Vernee) Date: Wed, 11 May 2022 16:42:59 GMT Subject: RFR: 8283689: Update the foreign linker VM implementation [v7] In-Reply-To: <9J0HneQ8kNy0t1-JDUQsXzoj4ljYwg80jiespX8laL8=.c02f9a8e-04b5-40db-8024-cdec556fcc53@github.com> References: <1Y6gupd3SigdhlFjonKu4x4z-0mwPwCc-LJsqt6pp4c=.201b0c49-4887-4440-8526-31420fa3dda3@github.com> <_lwAL7Yg4Rr98gmWeQisR1ioc8MkVK87npZEUbB4vOw=.6434e6a4-35b9-4f81-9df3-d71973f1d75e@github.com> <9J0HneQ8kNy0t1-JDUQsXzoj4ljYwg80jiespX8laL8=.c02f9a8e-04b5-40db-8024-cdec556fcc53@github.com> Message-ID: On Wed, 11 May 2022 16:20:32 GMT, Patricio Chilano Mateo wrote: >> Discussed this with Maurizio as well. >> >> My understanding is as follows: >> 1. Async exceptions are installed into a thread's `pending_exception` field by handshake at a safepoint >> 2. From there they are "thrown" (I guess during the same safepoint?), in which case we either end up in a user-defined exception handler (higher up the stack), in our Java fallback exception handler (print stack trace and terminate), or in this fallback exception handler (print and terminate). >> 3. If we end up in this exception handler it means the async exception was installed somewhere outside of our Java exception handler and "thrown" from there. However, it also means that the Java code we were calling into completed abruptly. >> 4. We've previously established that we have no way of signalling to the native code that is calling us that something went wrong, and so the only safe option is to terminate, as to not leave the application in an inconsistent state. >> >> As a consequence, I don't think we have much choice in the case of async exceptions if we get here. Silently clearing them seems like it will leave the program in an inconsistent state (since we unwound some frames), so we have to terminate I think. 
>> >> (@dholmes-ora is my understanding of async exceptions in point 1. and 2. correct here?) > > If you want to avoid processing asynchronous exceptions during this upcall you could block them (check NoAsyncExceptionDeliveryMark in JavaThread::exit()). Seems you could set the flag in ProgrammableUpcallhandler::on_entry() and unset it back on ProgrammableUpcallhandler::on_exit(). While that flag is set any asynchronous exception in the handshake queue of this thread will be skipped from processing. Maybe we should add a public method in the JavaThread class, block_async_exceptions()/unblock_async_exceptions() so we hide the handshake implementation. Oh nice! I was just thinking that the only possible way out of this conundrum would be to somehow block the delivery of async exceptions (at least outside of the user's exception handler). So, that seems to be exactly what we need :) ------------- PR: https://git.openjdk.java.net/jdk/pull/7959 From jvernee at openjdk.java.net Wed May 11 17:51:31 2022 From: jvernee at openjdk.java.net (Jorn Vernee) Date: Wed, 11 May 2022 17:51:31 GMT Subject: RFR: 8283689: Update the foreign linker VM implementation [v10] In-Reply-To: References: Message-ID: > Hi, > > This PR updates the VM implementation of the foreign linker, by bringing over commits from the panama-foreign repo. > > This is split off from the main JEP integration for 19, since we have limited resources to handle this. As such, this PR might fall over to 20, but it would be nice if we could get it into 19. > > I've written up an overview of the Linker architecture here: http://cr.openjdk.java.net/~jvernee/docs/FL_Overview.html it might be useful to read that first. > > This patch moves from the "legacy" implementation, to what is currently implemented in the panama-foreign repo, except for replacing the use of method handle combinators with ASM. That will come in a later path. To recap. This PR contains the following changes: > > 1. 
VM stubs for downcalls are now generated up front, instead of lazily by C2 [1]. > 2. the VM support for upcalls/downcalls now support all possible call shapes. And VM stubs and Java code implementing the buffered invocation strategy has been removed [2], [3], [4], [5]. > 3. The existing C2 intrinsification support for the `linkToNative` method handle linker was no longer needed and has been removed [6] (support might be re-added in another form later). > 4. Some other cleanups, such as: OptimizedEntryBlob (for upcalls) now implements RuntimeBlob directly. Binding to java classes has been rewritten to use javaClasses.h/cpp (this wasn't previously possible due to these java classes being in an incubator module) [7], [8], [9]. > > While the patch mostly consists of VM changes, there are also some Java changes to support (2). > > The original commit structure has been mostly retained, so it might be useful to look at a specific commit, or the corresponding patch in the [panama-foreign](https://github.com/openjdk/panama-foreign/pulls?q=is%3Apr) repo as well. I've also left some inline comments to explain some of the changes, which will hopefully make reviewing easier. 
> > Testing: Tier1-4 > > Thanks, > Jorn > > [1]: https://github.com/openjdk/jdk/pull/7959/commits/048b88156814579dca1f70742061ad24942fd358 > [2]: https://github.com/openjdk/jdk/pull/7959/commits/2fbbef472b4c2b4fee5ede2f18cd81ab61e88f49 > [3]: https://github.com/openjdk/jdk/pull/7959/commits/8a957a4ed9cc8d1f708ea8777212eb51ab403dc3 > [4]: https://github.com/openjdk/jdk/pull/7959/commits/35ba1d964f1de4a77345dc58debe0565db4b0ff3 > [5]: https://github.com/openjdk/jdk/pull/7959/commits/4e72aae22920300c5ffa16fed805b62ed9092120 > [6]: https://github.com/openjdk/jdk/pull/7959/commits/08e22e1b468c5c8f0cfd7135c72849944068aa7a > [7]: https://github.com/openjdk/jdk/pull/7959/commits/451cd9edf54016c182dab21a8b26bd8b609fc062 > [8]: https://github.com/openjdk/jdk/pull/7959/commits/4c851d2795afafec3a3ab17f4142ee098692068f > [9]: https://github.com/openjdk/jdk/pull/7959/commits/d025377799424f31512dca2ffe95491cd5ae22f9 Jorn Vernee has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 26 commits: - Block async exceptions during upcalls - Merge branch 'foreign-preview-m' into JEP-19-VM-IMPL2 - Fix use of rt_call - Migrate to GrowableArray - Address some review comments - Merge branch 'foreign-preview-m' into JEP-19-VM-IMPL2 - Remove unneeded ComputeMoveOrder - Remove comment about native calls in lcm.cpp - 8284072: foreign/StdLibTest.java randomly crashes on MacOS/AArch64 Reviewed-by: jvernee, mcimadamore - Update riscv and arm stubs - ... 
and 16 more: https://git.openjdk.java.net/jdk/compare/cdd006e7...b29ad8f4 ------------- Changes: https://git.openjdk.java.net/jdk/pull/7959/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=7959&range=09 Stats: 6870 lines in 157 files changed: 2596 ins; 3218 del; 1056 mod Patch: https://git.openjdk.java.net/jdk/pull/7959.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/7959/head:pull/7959 PR: https://git.openjdk.java.net/jdk/pull/7959 From jvernee at openjdk.java.net Wed May 11 17:58:46 2022 From: jvernee at openjdk.java.net (Jorn Vernee) Date: Wed, 11 May 2022 17:58:46 GMT Subject: RFR: 8286002: Add support for intel syntax to capstone hsdis In-Reply-To: References: Message-ID: On Tue, 10 May 2022 08:19:04 GMT, Tobias Hartmann wrote: >> This patch adds support for outputting assembly in intel syntax to capstone hsdis, through the `-XX:PrintAssemblyOptions=intel` flag. >> >> Snippet of example output: >> >> >> [Verified Entry Point] >> # {method} {0x0000021c8a4002d8} 'add' '(II)I' in 'Main' >> # parm0: rdx = int >> # parm1: r8 = int >> # [sp+0x20] (sp of caller) >> 0x0000021cfa713780: sub rsp, 0x18 >> 0x0000021cfa713787: mov qword ptr [rsp + 0x10], rbp >> 0x0000021cfa71378c: mov eax, edx >> 0x0000021cfa71378e: add eax, r8d >> 0x0000021cfa713791: add rsp, 0x10 >> 0x0000021cfa713795: pop rbp >> 0x0000021cfa713796: cmp rsp, qword ptr [r15 + 0x338] >> ; {poll_return} >> 0x0000021cfa71379d: ja 0x21cfa7137a4 >> 0x0000021cfa7137a3: ret >> 0x0000021cfa7137a4: movabs r10, 0x21cfa713796 ; {internal_word} >> 0x0000021cfa7137ae: mov qword ptr [r15 + 0x350], r10 >> 0x0000021cfa7137b5: jmp 0x21cfa6f3400 ; {runtime_call SafepointBlob} >> ``` >> >> Testing: >> - Manual testing with and without `-XX:PrintAssemblyOptions=intel`, to make sure that both syntaxes work. >> - Manual testing with several different invalid options such as `-XX:PrintAssemblyOptions=asdf,,` to make sure that invalid options are handled correctly. 
>> >> Thanks, >> Jorn > > Looks reasonable. @TobiHartmann Thanks for the review! ------------- PR: https://git.openjdk.java.net/jdk/pull/8502 From jvernee at openjdk.java.net Wed May 11 17:58:46 2022 From: jvernee at openjdk.java.net (Jorn Vernee) Date: Wed, 11 May 2022 17:58:46 GMT Subject: RFR: 8283689: Update the foreign linker VM implementation [v7] In-Reply-To: References: <1Y6gupd3SigdhlFjonKu4x4z-0mwPwCc-LJsqt6pp4c=.201b0c49-4887-4440-8526-31420fa3dda3@github.com> <_lwAL7Yg4Rr98gmWeQisR1ioc8MkVK87npZEUbB4vOw=.6434e6a4-35b9-4f81-9df3-d71973f1d75e@github.com> <9J0HneQ8kNy0t1-JDUQsXzoj4ljYwg80jiespX8laL8=.c02f9a8e-04b5-40db-8024-cdec556fcc53@github.com> Message-ID: On Wed, 11 May 2022 16:38:54 GMT, Jorn Vernee wrote: >> If you want to avoid processing asynchronous exceptions during this upcall you could block them (check NoAsyncExceptionDeliveryMark in JavaThread::exit()). Seems you could set the flag in ProgrammableUpcallhandler::on_entry() and unset it back on ProgrammableUpcallhandler::on_exit(). While that flag is set any asynchronous exception in the handshake queue of this thread will be skipped from processing. Maybe we should add a public method in the JavaThread class, block_async_exceptions()/unblock_async_exceptions() so we hide the handshake implementation. > > Oh nice! I was just thinking that the only possible way out of this conundrum would be to somehow block the delivery of async exceptions (at least outside of the user's exception handler). So, that seems to be exactly what we need :) I went ahead and implemented this suggestion. Now we block async exceptions in on_entry, and unblock in on_exit. 
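Editorial sketch, not part of the original mail: the blocking scheme described in this thread (set a flag on upcall entry, clear it on exit, and skip queued asynchronous exceptions while the flag is set) can be modelled in a few lines. This is a toy model written in Java for illustration only; it is not HotSpot code, every name in it is hypothetical, and it deliberately ignores nested upcalls.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Toy model only: this is not HotSpot code, and every name here is hypothetical.
final class AsyncDeliveryModel {
    // Stands in for the per-thread queue of asynchronous exceptions.
    private final Deque<String> queued = new ArrayDeque<>();
    // Stands in for the "do not deliver async exceptions" flag.
    private boolean blocked;
    // Stands in for the thread's pending exception slot.
    private String pending;

    // Upcall entry: start skipping queued asynchronous exceptions.
    void onEntry() {
        blocked = true;
    }

    // Upcall exit: allow delivery again.
    void onExit() {
        blocked = false;
    }

    // Another thread posts an asynchronous exception for this thread.
    void post(String exceptionName) {
        queued.addLast(exceptionName);
    }

    // Models a poll: install one queued exception unless delivery is blocked
    // or an exception is already pending.
    void poll() {
        if (blocked || pending != null) {
            return;
        }
        pending = queued.pollFirst();
    }

    String pending() {
        return pending;
    }

    public static void main(String[] args) {
        AsyncDeliveryModel thread = new AsyncDeliveryModel();
        thread.onEntry();
        thread.post("AsyncException");
        thread.poll();
        System.out.println("inside upcall: " + thread.pending());
        thread.onExit();
        thread.poll();
        System.out.println("after upcall: " + thread.pending());
    }
}
```

In this model an exception posted during the upcall stays queued until the first poll after exit, which matches the intent stated above: the upcall itself never observes the asynchronous exception.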
------------- PR: https://git.openjdk.java.net/jdk/pull/7959 From jvernee at openjdk.java.net Wed May 11 18:01:49 2022 From: jvernee at openjdk.java.net (Jorn Vernee) Date: Wed, 11 May 2022 18:01:49 GMT Subject: Integrated: 8286002: Add support for intel syntax to capstone hsdis In-Reply-To: References: Message-ID: On Mon, 2 May 2022 14:01:53 GMT, Jorn Vernee wrote: > This patch adds support for outputting assembly in intel syntax to capstone hsdis, through the `-XX:PrintAssemblyOptions=intel` flag. > > Snippet of example output: > > > [Verified Entry Point] > # {method} {0x0000021c8a4002d8} 'add' '(II)I' in 'Main' > # parm0: rdx = int > # parm1: r8 = int > # [sp+0x20] (sp of caller) > 0x0000021cfa713780: sub rsp, 0x18 > 0x0000021cfa713787: mov qword ptr [rsp + 0x10], rbp > 0x0000021cfa71378c: mov eax, edx > 0x0000021cfa71378e: add eax, r8d > 0x0000021cfa713791: add rsp, 0x10 > 0x0000021cfa713795: pop rbp > 0x0000021cfa713796: cmp rsp, qword ptr [r15 + 0x338] > ; {poll_return} > 0x0000021cfa71379d: ja 0x21cfa7137a4 > 0x0000021cfa7137a3: ret > 0x0000021cfa7137a4: movabs r10, 0x21cfa713796 ; {internal_word} > 0x0000021cfa7137ae: mov qword ptr [r15 + 0x350], r10 > 0x0000021cfa7137b5: jmp 0x21cfa6f3400 ; {runtime_call SafepointBlob} > ``` > > Testing: > - Manual testing with and without `-XX:PrintAssemblyOptions=intel`, to make sure that both syntaxes work. > - Manual testing with several different invalid options such as `-XX:PrintAssemblyOptions=asdf,,` to make sure that invalid options are handled correctly. > > Thanks, > Jorn This pull request has now been integrated. 
Changeset: 4ad8cfa2 Author: Jorn Vernee URL: https://git.openjdk.java.net/jdk/commit/4ad8cfa26eb645f15a0aa77a58b2c333ded55c77 Stats: 37 lines in 1 file changed: 33 ins; 0 del; 4 mod 8286002: Add support for intel syntax to capstone hsdis Reviewed-by: thartmann ------------- PR: https://git.openjdk.java.net/jdk/pull/8502 From psandoz at openjdk.java.net Wed May 11 19:49:47 2022 From: psandoz at openjdk.java.net (Paul Sandoz) Date: Wed, 11 May 2022 19:49:47 GMT Subject: RFR: 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature [v3] In-Reply-To: References: Message-ID: On Thu, 5 May 2022 08:56:07 GMT, Xiaohong Gong wrote: >> Currently, a masked vector load whose index range extends beyond the array bounds is implemented with pure Java scalar code to avoid the IOOBE (IndexOutOfBoundsException). This is necessary for architectures that do not support the predicate feature, because there the masked load is implemented as a full vector load followed by a vector blend, and a full vector load would definitely raise the IOOBE, which is not valid. However, for architectures that support the predicate feature, like SVE/AVX-512/RVV, it can be vectorized with the predicated load instruction as long as the indexes of the lanes selected by the mask are within the bounds of the array. On these architectures, the lanes not selected by the mask do not raise an exception. >> >> This patch adds the vectorization support for the masked load with IOOBE part. Please see the original Java implementation (FIXME: optimize):
Please see the original java implementation (FIXME: optimize): >> >> >> @ForceInline >> public static >> ByteVector fromArray(VectorSpecies species, >> byte[] a, int offset, >> VectorMask m) { >> ByteSpecies vsp = (ByteSpecies) species; >> if (offset >= 0 && offset <= (a.length - species.length())) { >> return vsp.dummyVector().fromArray0(a, offset, m); >> } >> >> // FIXME: optimize >> checkMaskFromIndexSize(offset, vsp, m, 1, a.length); >> return vsp.vOp(m, i -> a[offset + i]); >> } >> >> Since it can only be vectorized with the predicate load, the hotspot must check whether the current backend supports it and falls back to the java scalar version if not. This is different from the normal masked vector load that the compiler will generate a full vector load and a vector blend if the predicate load is not supported. So to let the compiler make the expected action, an additional flag (i.e. `usePred`) is added to the existing "loadMasked" intrinsic, with the value "true" for the IOOBE part while "false" for the normal load. And the compiler will fail to intrinsify if the flag is "true" and the predicate load is not supported by the backend, which means that normal java path will be executed. 
>> >> Also adds the same vectorization support for masked: >> - fromByteArray/fromByteBuffer >> - fromBooleanArray >> - fromCharArray >> >> The performance of the newly added benchmarks improves by about `1.88x ~ 30.26x` on the x86 AVX-512 system: >> >> Benchmark Before After Units >> LoadMaskedIOOBEBenchmark.byteLoadArrayMaskIOOBE 737.542 1387.069 ops/ms >> LoadMaskedIOOBEBenchmark.doubleLoadArrayMaskIOOBE 118.366 330.776 ops/ms >> LoadMaskedIOOBEBenchmark.floatLoadArrayMaskIOOBE 233.832 6125.026 ops/ms >> LoadMaskedIOOBEBenchmark.intLoadArrayMaskIOOBE 233.816 7075.923 ops/ms >> LoadMaskedIOOBEBenchmark.longLoadArrayMaskIOOBE 119.771 330.587 ops/ms >> LoadMaskedIOOBEBenchmark.shortLoadArrayMaskIOOBE 431.961 939.301 ops/ms >> >> A similar performance gain can also be observed on a 512-bit SVE system. > > Xiaohong Gong has updated the pull request incrementally with one additional commit since the last revision: > > Rename "use_predicate" to "needs_predicate" I tried your test code with the patch and logged compilation (`-XX:-TieredCompilation -XX:+PrintCompilation -XX:+PrintInlining -XX:+PrintIntrinsics -Xbatch`). For `func` the first call to `VectorSupport::loadMasked` is intrinsic and inlined: @ 45 jdk.internal.vm.vector.VectorSupport::loadMasked (40 bytes) (intrinsic) But the second call (for the last loop iteration) fails to inline: @ 45 jdk.internal.vm.vector.VectorSupport::loadMasked (40 bytes) failed to inline (intrinsic) Since I am running on a MacBook, this looks right and aligns with the `-XX:+PrintIntrinsics` output: ** Rejected vector op (LoadVectorMasked,int,8) because architecture does not support it ** Rejected vector op (LoadVectorMasked,int,8) because architecture does not support it ** not supported: op=loadMasked vlen=8 etype=int using_byte_array=0 ? I have not looked at the code gen nor measured performance comparing the case when never out of bounds and only out of bounds for the last loop iteration.
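Editorial sketch, not part of the original mail: the scalar fallback discussed in this thread amounts to checking bounds only for the lanes selected by the mask and leaving the other lanes at zero. The standalone example below illustrates that behaviour with plain arrays; it does not use the Vector API, and the class and method names are hypothetical.

```java
// Illustrative only: plain arrays stand in for vectors and masks, and the
// names are hypothetical rather than taken from the Vector API sources.
final class MaskedLoadSketch {

    // Loads mask.length lanes starting at a[offset]. Only lanes whose mask
    // bit is set are read and bounds-checked; unset lanes stay zero.
    static int[] loadMasked(int[] a, int offset, boolean[] mask) {
        int[] lanes = new int[mask.length];
        for (int i = 0; i < mask.length; i++) {
            if (!mask[i]) {
                continue; // unselected lane: never touches the array
            }
            int index = offset + i;
            if (index < 0 || index >= a.length) {
                throw new IndexOutOfBoundsException(
                        "lane " + i + " maps to index " + index);
            }
            lanes[i] = a[index];
        }
        return lanes;
    }

    public static void main(String[] args) {
        int[] a = {1, 2, 3, 4, 5};
        // Tail of the array: four lanes requested, only two are in bounds,
        // and the mask selects exactly those two.
        boolean[] tailMask = {true, true, false, false};
        System.out.println(java.util.Arrays.toString(loadMasked(a, 3, tailMask)));
    }
}
```

This is the case where a full vector load plus blend would read past the end of the array, while a predicated load (or the scalar loop above) would not.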
------------- PR: https://git.openjdk.java.net/jdk/pull/8035 From pchilanomate at openjdk.java.net Wed May 11 20:31:05 2022 From: pchilanomate at openjdk.java.net (Patricio Chilano Mateo) Date: Wed, 11 May 2022 20:31:05 GMT Subject: RFR: 8283689: Update the foreign linker VM implementation [v7] In-Reply-To: References: <1Y6gupd3SigdhlFjonKu4x4z-0mwPwCc-LJsqt6pp4c=.201b0c49-4887-4440-8526-31420fa3dda3@github.com> <_lwAL7Yg4Rr98gmWeQisR1ioc8MkVK87npZEUbB4vOw=.6434e6a4-35b9-4f81-9df3-d71973f1d75e@github.com> <9J0HneQ8kNy0t1-JDUQsXzoj4ljYwg80jiespX8laL8=.c02f9a8e-04b5-40db-8024-cdec556fcc53@github.com> Message-ID: On Wed, 11 May 2022 17:55:16 GMT, Jorn Vernee wrote: >> Oh nice! I was just thinking that the only possible way out of this conundrum would be to somehow block the delivery of async exceptions (at least outside of the user's exception handler). So, that seems to be exactly what we need :) > > I went ahead and implemented this suggestion. Now we block async exceptions in on_entry, and unblock in on_exit. Is it possible for these upcalls to be nested? If yes, we could add a boolean to context to avoid unsetting the flag in those nested cases. And now that I think we should probably add that check in NoAsyncExceptionDeliveryMark too if we allow broader use of this flag. David added the NoAsyncExceptionDeliveryMark code with that assert about nesting so maybe he might have more insights about that. ------------- PR: https://git.openjdk.java.net/jdk/pull/7959 From redestad at openjdk.java.net Wed May 11 20:54:50 2022 From: redestad at openjdk.java.net (Claes Redestad) Date: Wed, 11 May 2022 20:54:50 GMT Subject: RFR: 8286401: Address possibly lossy conversions in Microbenchmarks [v2] In-Reply-To: References: Message-ID: On Wed, 11 May 2022 16:39:18 GMT, Aleksey Shipilev wrote: > > Thanks for reviewing. I'll let the GHA tests complete and integrate this tomorrow if all is clear. 
> > I don't think GHA builds any microbenchmarks (because JMH is not enabled there), so there is no point in waiting for those. Good to know, thanks. ------------- PR: https://git.openjdk.java.net/jdk/pull/8654 From redestad at openjdk.java.net Wed May 11 20:54:53 2022 From: redestad at openjdk.java.net (Claes Redestad) Date: Wed, 11 May 2022 20:54:53 GMT Subject: Integrated: 8286401: Address possibly lossy conversions in Microbenchmarks In-Reply-To: References: Message-ID: On Wed, 11 May 2022 14:57:16 GMT, Claes Redestad wrote: > #8599 would add a new warning. This addresses the conversions in the microbenchmark component by means of making the types precise or adding explicit casts. There are quite a few changes in the ByteBuffers benchmarks, but the real change is in the template as these are generated. > > I've run through a subset of the affected benchmarks and verified that the results are either neutral or improve somewhat (which seems to be the case in a few of the ByteBuffer micros). This pull request has now been integrated. Changeset: 1586bf86 Author: Claes Redestad URL: https://git.openjdk.java.net/jdk/commit/1586bf862b6faa6477630fad2e62b198771ad187 Stats: 179 lines in 13 files changed: 0 ins; 0 del; 179 mod 8286401: Address possibly lossy conversions in Microbenchmarks Reviewed-by: shade, ecaspole ------------- PR: https://git.openjdk.java.net/jdk/pull/8654 From duke at openjdk.java.net Wed May 11 21:45:00 2022 From: duke at openjdk.java.net (duke) Date: Wed, 11 May 2022 21:45:00 GMT Subject: Withdrawn: 8283085: AArch64: No need to leave a breadcrumb for JavaFrameAnchor::capture_last_Java_pc when leaf call In-Reply-To: References: Message-ID: On Tue, 15 Mar 2022 09:22:14 GMT, Denghui Dong wrote: > Hi, > > Could I have a review of this fix that removes the breadcrumb in `aarch64_enc_java_to_runtime` for `Op_CallLeaf`, `Op_CallLeafNoFP` and `Op_CallNative`. > > For more details please refer to the description of the issue.
> > Thanks, > Denghui This pull request has been closed without being integrated. ------------- PR: https://git.openjdk.java.net/jdk/pull/7815 From eliu at openjdk.java.net Thu May 12 01:18:44 2022 From: eliu at openjdk.java.net (Eric Liu) Date: Thu, 12 May 2022 01:18:44 GMT Subject: Integrated: 8282966: AArch64: Optimize VectorMask.toLong with SVE2 In-Reply-To: References: Message-ID: On Thu, 21 Apr 2022 12:17:57 GMT, Eric Liu wrote: > This patch optimizes the backend implementation of VectorMaskToLong for > AArch64, given a more efficient approach to mov value bits from > predicate register to general purpose register as x86 PMOVMSK[1] does, > by using BEXT[2] which is available in SVE2. > > With this patch, the final code (input mask is byte type with > SPECIESE_512, generated on an SVE vector reg size of 512-bit QEMU > emulator) changes as below: > > Before: > > mov z16.b, p0/z, #1 > fmov x0, d16 > orr x0, x0, x0, lsr #7 > orr x0, x0, x0, lsr #14 > orr x0, x0, x0, lsr #28 > and x0, x0, #0xff > fmov x8, v16.d[1] > orr x8, x8, x8, lsr #7 > orr x8, x8, x8, lsr #14 > orr x8, x8, x8, lsr #28 > and x8, x8, #0xff > orr x0, x0, x8, lsl #8 > > orr x8, xzr, #0x2 > whilele p1.d, xzr, x8 > lastb x8, p1, z16.d > orr x8, x8, x8, lsr #7 > orr x8, x8, x8, lsr #14 > orr x8, x8, x8, lsr #28 > and x8, x8, #0xff > orr x0, x0, x8, lsl #16 > > orr x8, xzr, #0x3 > whilele p1.d, xzr, x8 > lastb x8, p1, z16.d > orr x8, x8, x8, lsr #7 > orr x8, x8, x8, lsr #14 > orr x8, x8, x8, lsr #28 > and x8, x8, #0xff > orr x0, x0, x8, lsl #24 > > orr x8, xzr, #0x4 > whilele p1.d, xzr, x8 > lastb x8, p1, z16.d > orr x8, x8, x8, lsr #7 > orr x8, x8, x8, lsr #14 > orr x8, x8, x8, lsr #28 > and x8, x8, #0xff > orr x0, x0, x8, lsl #32 > > mov x8, #0x5 > whilele p1.d, xzr, x8 > lastb x8, p1, z16.d > orr x8, x8, x8, lsr #7 > orr x8, x8, x8, lsr #14 > orr x8, x8, x8, lsr #28 > and x8, x8, #0xff > orr x0, x0, x8, lsl #40 > > orr x8, xzr, #0x6 > whilele p1.d, xzr, x8 > lastb x8, p1, z16.d > orr x8, x8, x8, 
lsr #7 > orr x8, x8, x8, lsr #14 > orr x8, x8, x8, lsr #28 > and x8, x8, #0xff > orr x0, x0, x8, lsl #48 > > orr x8, xzr, #0x7 > whilele p1.d, xzr, x8 > lastb x8, p1, z16.d > orr x8, x8, x8, lsr #7 > orr x8, x8, x8, lsr #14 > orr x8, x8, x8, lsr #28 > and x8, x8, #0xff > orr x0, x0, x8, lsl #56 > > After: > > mov z16.b, p0/z, #1 > mov z17.b, #1 > bext z16.d, z16.d, z17.d > mov z17.d, #0 > uzp1 z16.s, z16.s, z17.s > uzp1 z16.h, z16.h, z17.h > uzp1 z16.b, z16.b, z17.b > mov x0, v16.d[0] > > [1] https://www.felixcloutier.com/x86/pmovmskb > [2] https://developer.arm.com/documentation/ddi0602/2020-12/SVE-Instructions/BEXT--Gather-lower-bits-from-positions-selected-by-bitmask- This pull request has now been integrated. Changeset: e9f45bb2 Author: Eric Liu Committer: Xiaohong Gong URL: https://git.openjdk.java.net/jdk/commit/e9f45bb270c832ea6cba52bef73e969eb78dddce Stats: 118 lines in 7 files changed: 66 ins; 2 del; 50 mod 8282966: AArch64: Optimize VectorMask.toLong with SVE2 Reviewed-by: xgong, ngasson ------------- PR: https://git.openjdk.java.net/jdk/pull/8337 From dholmes at openjdk.java.net Thu May 12 03:37:53 2022 From: dholmes at openjdk.java.net (David Holmes) Date: Thu, 12 May 2022 03:37:53 GMT Subject: RFR: 8283689: Update the foreign linker VM implementation [v10] In-Reply-To: References: Message-ID: On Wed, 11 May 2022 17:51:31 GMT, Jorn Vernee wrote: >> Hi, >> >> This PR updates the VM implementation of the foreign linker, by bringing over commits from the panama-foreign repo. >> >> This is split off from the main JEP integration for 19, since we have limited resources to handle this. As such, this PR might fall over to 20, but it would be nice if we could get it into 19. >> >> I've written up an overview of the Linker architecture here: http://cr.openjdk.java.net/~jvernee/docs/FL_Overview.html it might be useful to read that first. 
>> >> This patch moves from the "legacy" implementation, to what is currently implemented in the panama-foreign repo, except for replacing the use of method handle combinators with ASM. That will come in a later path. To recap. This PR contains the following changes: >> >> 1. VM stubs for downcalls are now generated up front, instead of lazily by C2 [1]. >> 2. the VM support for upcalls/downcalls now support all possible call shapes. And VM stubs and Java code implementing the buffered invocation strategy has been removed [2], [3], [4], [5]. >> 3. The existing C2 intrinsification support for the `linkToNative` method handle linker was no longer needed and has been removed [6] (support might be re-added in another form later). >> 4. Some other cleanups, such as: OptimizedEntryBlob (for upcalls) now implements RuntimeBlob directly. Binding to java classes has been rewritten to use javaClasses.h/cpp (this wasn't previously possible due to these java classes being in an incubator module) [7], [8], [9]. >> >> While the patch mostly consists of VM changes, there are also some Java changes to support (2). >> >> The original commit structure has been mostly retained, so it might be useful to look at a specific commit, or the corresponding patch in the [panama-foreign](https://github.com/openjdk/panama-foreign/pulls?q=is%3Apr) repo as well. I've also left some inline comments to explain some of the changes, which will hopefully make reviewing easier. 
>> >> Testing: Tier1-4 >> >> Thanks, >> Jorn >> >> [1]: https://github.com/openjdk/jdk/pull/7959/commits/048b88156814579dca1f70742061ad24942fd358 >> [2]: https://github.com/openjdk/jdk/pull/7959/commits/2fbbef472b4c2b4fee5ede2f18cd81ab61e88f49 >> [3]: https://github.com/openjdk/jdk/pull/7959/commits/8a957a4ed9cc8d1f708ea8777212eb51ab403dc3 >> [4]: https://github.com/openjdk/jdk/pull/7959/commits/35ba1d964f1de4a77345dc58debe0565db4b0ff3 >> [5]: https://github.com/openjdk/jdk/pull/7959/commits/4e72aae22920300c5ffa16fed805b62ed9092120 >> [6]: https://github.com/openjdk/jdk/pull/7959/commits/08e22e1b468c5c8f0cfd7135c72849944068aa7a >> [7]: https://github.com/openjdk/jdk/pull/7959/commits/451cd9edf54016c182dab21a8b26bd8b609fc062 >> [8]: https://github.com/openjdk/jdk/pull/7959/commits/4c851d2795afafec3a3ab17f4142ee098692068f >> [9]: https://github.com/openjdk/jdk/pull/7959/commits/d025377799424f31512dca2ffe95491cd5ae22f9 > > Jorn Vernee has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 26 commits: > > - Block async exceptions during upcalls > - Merge branch 'foreign-preview-m' into JEP-19-VM-IMPL2 > - Fix use of rt_call > - Migrate to GrowableArray > - Address some review comments > - Merge branch 'foreign-preview-m' into JEP-19-VM-IMPL2 > - Remove unneeded ComputeMoveOrder > - Remove comment about native calls in lcm.cpp > - 8284072: foreign/StdLibTest.java randomly crashes on MacOS/AArch64 > > Reviewed-by: jvernee, mcimadamore > - Update riscv and arm stubs > - ... and 16 more: https://git.openjdk.java.net/jdk/compare/cdd006e7...b29ad8f4 src/hotspot/cpu/aarch64/foreign_globals_aarch64.cpp line 3: > 1: /* > 2: * Copyright (c) 2020, 2021, Oracle and/or its affiliates. All rights reserved. > 3: * Copyright (c) 2019, 2021, Arm Limited. All rights reserved. Only update third-party copyrights under direction from that copyright holder. This may not be the correct format for example. Also it is 2022. 
:) ------------- PR: https://git.openjdk.java.net/jdk/pull/7959 From dholmes at openjdk.java.net Thu May 12 03:37:54 2022 From: dholmes at openjdk.java.net (David Holmes) Date: Thu, 12 May 2022 03:37:54 GMT Subject: RFR: 8283689: Update the foreign linker VM implementation [v7] In-Reply-To: References: <1Y6gupd3SigdhlFjonKu4x4z-0mwPwCc-LJsqt6pp4c=.201b0c49-4887-4440-8526-31420fa3dda3@github.com> <_lwAL7Yg4Rr98gmWeQisR1ioc8MkVK87npZEUbB4vOw=.6434e6a4-35b9-4f81-9df3-d71973f1d75e@github.com> <9J0HneQ8kNy0t1-JDUQsXzoj4ljYwg80jiespX8laL8=.c02f9a8e-04b5-40db-8024-cdec556fcc53@github.com> Message-ID: <95e2d32uLxJbWldoqsr9yAoT3LD8Yyd6cLmnFuvSEOI=.4e961828-6086-4c63-9bc3-6bb60f8a5931@github.com> On Wed, 11 May 2022 20:27:42 GMT, Patricio Chilano Mateo wrote: >> I went ahead and implemented this suggestion. Now we block async exceptions in on_entry, and unblock in on_exit. > > Is it possible for these upcalls to be nested? If yes, we could add a boolean to context to avoid unsetting the flag in those nested cases. And now that I think we should probably add that check in NoAsyncExceptionDeliveryMark too if we allow broader use of this flag. David added the NoAsyncExceptionDeliveryMark code with that assert about nesting so maybe he might have more insights about that. NoAsyncExceptionDeliveryMark is not for general use! There is no provision for blocking async exceptions when running user-defined Java code. NoAsyncExceptionDeliveryMark was purely for protecting "system Java code". 
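The nesting question raised in this thread (avoid unsetting the flag when a blocked region is entered a second time) can be illustrated outside the VM. The following is a minimal Java sketch, assuming a hypothetical per-thread flag; `BlockFlagGuard` and its members are invented for illustration and are not HotSpot code, nor the NoAsyncExceptionDeliveryMark implementation. The guard records whether the flag was already set on entry and clears it on exit only if it was the one that set it.

```java
// Hypothetical illustration only: not HotSpot code and not the
// NoAsyncExceptionDeliveryMark implementation.
final class BlockFlagGuard implements AutoCloseable {
    // Stand-in for a per-thread "async exceptions blocked" flag.
    private static final ThreadLocal<Boolean> BLOCKED =
            ThreadLocal.withInitial(() -> Boolean.FALSE);

    // Remembers whether an enclosing guard had already set the flag.
    private final boolean wasBlocked;

    BlockFlagGuard() {
        wasBlocked = BLOCKED.get();
        BLOCKED.set(Boolean.TRUE);       // plays the role of on_entry
    }

    @Override
    public void close() {
        if (!wasBlocked) {               // only the outermost guard unsets
            BLOCKED.set(Boolean.FALSE);  // plays the role of on_exit
        }
    }

    static boolean isBlocked() {
        return BLOCKED.get();
    }

    public static void main(String[] args) {
        try (BlockFlagGuard outer = new BlockFlagGuard()) {
            try (BlockFlagGuard inner = new BlockFlagGuard()) {
                System.out.println("inside inner: " + isBlocked());
            }
            // Leaving the inner guard must not clear the flag.
            System.out.println("after inner:  " + isBlocked());
        }
        System.out.println("after outer:  " + isBlocked());
    }
}
```

Saving the previous value rather than unconditionally clearing is what makes nested entry safe; whether blocking is appropriate around user-defined Java code at all is the separate objection made in the message above.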
-------------

PR: https://git.openjdk.java.net/jdk/pull/7959

From xgong at openjdk.java.net Thu May 12 03:42:40 2022
From: xgong at openjdk.java.net (Xiaohong Gong)
Date: Thu, 12 May 2022 03:42:40 GMT
Subject: RFR: 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature [v3]
In-Reply-To:
References:
Message-ID:

On Wed, 11 May 2022 19:45:55 GMT, Paul Sandoz wrote:

> I tried your test code with the patch and logged compilation (`-XX:-TieredCompilation -XX:+PrintCompilation -XX:+PrintInlining -XX:+PrintIntrinsics -Xbatch`)
>
> For `func` the first call to `VectorSupport::loadMasked` is intrinsic and inlined:
>
> ```
> @ 45 jdk.internal.vm.vector.VectorSupport::loadMasked (40 bytes) (intrinsic)
> ```
>
> But the second call (for the last loop iteration) fails to inline:
>
> ```
> @ 45 jdk.internal.vm.vector.VectorSupport::loadMasked (40 bytes) failed to inline (intrinsic)
> ```
>
> Since i am running on an mac book this looks right and aligns with the `-XX:+PrintIntrinsics` output:
>
> ```
> ** Rejected vector op (LoadVectorMasked,int,8) because architecture does not support it
> ** Rejected vector op (LoadVectorMasked,int,8) because architecture does not support it
> ** not supported: op=loadMasked vlen=8 etype=int using_byte_array=0
> ```
>
> ?
>
> I have not looked at the code gen nor measured performance comparing the case when never out of bounds and only out of bounds for the last loop iteration.

Yeah, it looks right from the log. Did you try to find whether there is the log `** missing constant: offsetInRange=Parm` with `-XX:+PrintIntrinsics`? Or insert an assertion in `vectorIntrinsics.cpp` like:

--- a/src/hotspot/share/opto/vectorIntrinsics.cpp
+++ b/src/hotspot/share/opto/vectorIntrinsics.cpp
@@ -1236,6 +1236,7 @@ bool LibraryCallKit::inline_vector_mem_masked_operation(bool is_store) {
   } else {
     // Masked vector load with IOOBE always uses the predicated load.
     const TypeInt* offset_in_range = gvn().type(argument(8))->isa_int();
+    assert(offset_in_range->is_con(), "must be a constant");
     if (!offset_in_range->is_con()) {
       if (C->print_intrinsics()) {
         tty->print_cr(" ** missing constant: offsetInRange=%s",

And run the tests with debug mode. Thanks!

-------------

PR: https://git.openjdk.java.net/jdk/pull/8035

From rcastanedalo at openjdk.java.net Thu May 12 07:08:08 2022
From: rcastanedalo at openjdk.java.net (Roberto Castañeda Lozano)
Date: Thu, 12 May 2022 07:08:08 GMT
Subject: Integrated: 8285820: C2: LCM prioritizes locally dependent CreateEx nodes over projections after 8270090
In-Reply-To:
References:
Message-ID:

On Fri, 6 May 2022 10:42:52 GMT, Roberto Castañeda Lozano wrote:

> This changeset lowers the priority of locally-dependent CreateEx nodes, that is CreateEx nodes that are not initially ready for scheduling in LCM. The proposed scheme assigns them the same priority as projection nodes when selecting the next node to be scheduled, restoring the relative prioritization between projections and CreateEx nodes to the state it was before [JDK-8270090](https://bugs.openjdk.java.net/browse/JDK-8270090). JDK-8270090 wrongly gave all CreateEx nodes the highest priority, which leads to failures whenever projection nodes are expected to get higher priority than locally-dependent CreateEx nodes. See the [JBS issue report](https://bugs.openjdk.java.net/browse/JDK-8285820) for further detail.
>
> More specifically, the current ranking to select the next node to be scheduled in `PhaseCFG::select()` is:
>
> 1. CreateEx nodes (initially ready or not)
> 2. Projections
> 3. Constants and CheckCastPP nodes (tie)
> 4. ...
>
> After this changeset, the ranking becomes:
>
> 1. Initially ready CreateEx nodes
> 2. Projections and other CreateEx nodes (tie)
> 3. Constants and CheckCastPP nodes (tie)
> 4. ...
>
> which still addresses the issue handled by JDK-8270090 but in a form that is closer to the original ranking before JDK-8270090:
>
> 1. Initially ready CreateEx nodes
> 2. Projections, other CreateEx nodes, constants and CheckCastPP nodes (tie)
> 3. ...
>
> This changeset implements the minimal changes to restore the relative prioritization between CreateEx nodes and projections to the state it was before JDK-8270090, for risk minimization and ease of backporting. I will file a separate RFE proposing a more robust alternative than altering the order of the LCM worklist for ensuring that initially ready CreateEx nodes are scheduled at the block start.
>
> #### Testing
>
> ##### Functionality
>
> - Original failure on x86_32 using `-XX:+UseShenandoahGC` (thanks to Aleksey Shipilev for testing).
> - Original failure of JDK-8270090 on arm32 (thanks to Marc Hoffmann for testing).
> - hs-tier1-5 (windows-x64, linux-x64, linux-aarch64, and macosx-x64; release and debug mode).
> - hs-tier1-3 (windows-x64, linux-x64, linux-aarch64, and macosx-x64; debug mode) with StressLCM and StressGCM (5 different seeds).
>
> ##### Performance
>
> Tested performance on a set of standard benchmark suites (DaCapo, SPECjbb2015, SPECjvm2008, ...) and on linux-x64, linux-aarch64, windows-x64, and macosx-x64. No significant regression was observed.

This pull request has now been integrated.

Changeset: 89392fb1
Author: Roberto Castañeda Lozano
URL: https://git.openjdk.java.net/jdk/commit/89392fb15e9652b7b562b3511f79bda725c5499c
Stats: 30 lines in 1 file changed: 11 ins; 9 del; 10 mod

8285820: C2: LCM prioritizes locally dependent CreateEx nodes over projections after 8270090

Co-authored-by: Aleksey Shipilev
Reviewed-by: thartmann, kvn

-------------

PR: https://git.openjdk.java.net/jdk/pull/8568

From fgao at openjdk.java.net Thu May 12 07:11:20 2022
From: fgao at openjdk.java.net (Fei Gao)
Date: Thu, 12 May 2022 07:11:20 GMT
Subject: RFR: 8283091: Support type conversion between different data sizes in SLP [v5]
In-Reply-To: <4qFrtxwRHZfnzJbdVl2z7l6XENvxR3N8Sruc2fraKkw=.1391f639-1514-4e50-899a-d56cdd4afa99@github.com>
References: <4qFrtxwRHZfnzJbdVl2z7l6XENvxR3N8Sruc2fraKkw=.1391f639-1514-4e50-899a-d56cdd4afa99@github.com>
Message-ID: <1B3doPHzaGFUCT_qkYIrlYzBgvs_nbEzjKcAlPSZTeM=.15f00dcc-9f2d-45fa-834b-2c2a129b149e@github.com>

> After JDK-8275317, C2's SLP vectorizer has supported type conversion between the same data size. We can also support conversions between different data sizes like:
> int <-> double
> float <-> long
> int <-> long
> float <-> double
>
> A typical test case:
>
> int[] a;
> double[] b;
> for (int i = start; i < limit; i++) {
>     b[i] = (double) a[i];
> }
>
> Our expected OptoAssembly code for one iteration is like below:
>
> add R12, R2, R11, LShiftL #2
> vector_load V16,[R12, #16]
> vectorcast_i2d V16, V16 # convert I to D vector
> add R11, R1, R11, LShiftL #3 # ptr
> add R13, R11, #16 # ptr
> vector_store [R13], V16
>
> To enable the vectorization, the patch solves the following problems in the SLP.
>
> There are three main operations in the case above, LoadI, ConvI2D and StoreD. Assuming that the vector length is 128 bits, how many scalar nodes should be packed together to a vector?
If we decide it separately for each operation node, like what we did before the patch in SuperWord::combine_packs(), a 128-bit vector will support 4 LoadI or 2 ConvI2D or 2 StoreD nodes. However, if we put these packed nodes in a vector node sequence, like loading 4 elements to a vector, then typecasting 2 elements and lastly storing these 2 elements, they become invalid. As a result, we should look through the whole def-use chain > and then pick up the minimum of these element sizes, like function SuperWord::max_vector_size_in_ud_chain() do in the superword.cpp. In this case, we pack 2 LoadI, 2 ConvI2D and 2 StoreD nodes, and then generate valid vector node sequence, like loading 2 elements, converting the 2 elements to another type and storing the 2 elements with new type. > > After this, LoadI nodes don't make full use of the whole vector and only occupy part of it. So we adapt the code in SuperWord::get_vw_bytes_special() to the situation. > > In SLP, we calculate a kind of alignment as position trace for each scalar node in the whole vector. In this case, the alignments for 2 LoadI nodes are 0, 4 while the alignment for 2 ConvI2D nodes are 0, 8. Sometimes, 4 for LoadI and 8 for ConvI2D work the same, both of which mark that this node is the second node in the whole vector, while the difference between 4 and 8 are just because of their own data sizes. In this situation, we should try to remove the impact caused by different data size in SLP. For example, in the stage of SuperWord::extend_packlist(), while determining if it's potential to pack a pair of def nodes in the function SuperWord::follow_use_defs(), we remove the side effect of different data size by transforming the target alignment from the use node. Because we believe that, assuming that the vector length is 512 bits, if the ConvI2D use nodes have alignments of 16-24 and their def nodes, LoadI, have alignments of 8-12, these two LoadI nodes should be packed as a pair as well. 
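The two calculations described above can be worked through with the numbers given in the text. This is a small Java sketch of the arithmetic only; it is not the SuperWord code, and the class and method names (`SlpPackSketch`, `packSize`, `defAlignment`) are hypothetical. It assumes the pack size is the vector width divided by the widest element in the def-use chain (equivalently, the minimum per-node element count), and that alignments are compared by converting them to a lane index first.

```java
// Illustrative arithmetic only; not taken from superword.cpp.
final class SlpPackSketch {
    // Scalars per pack: limited by the widest element in the def-use chain.
    static int packSize(int vectorBytes, int... elementBytesInChain) {
        int widest = 0;
        for (int size : elementBytesInChain) {
            widest = Math.max(widest, size);
        }
        return vectorBytes / widest;
    }

    // Map a use node's byte alignment to the alignment expected of its def
    // node by going through the lane index, which does not depend on size.
    static int defAlignment(int useAlignment, int useElementBytes, int defElementBytes) {
        int lane = useAlignment / useElementBytes;
        return lane * defElementBytes;
    }

    public static void main(String[] args) {
        // 128-bit vector; LoadI (4 bytes) -> ConvI2D (8 bytes) -> StoreD (8 bytes).
        System.out.println(packSize(16, 4, 8, 8));   // 2 nodes per pack
        // ConvI2D uses at alignments 16 and 24 match LoadI defs at 8 and 12.
        System.out.println(defAlignment(16, 8, 4));  // 8
        System.out.println(defAlignment(24, 8, 4));  // 12
    }
}
```

With these inputs the sketch reproduces the figures in the description: packs of 2 nodes for a 128-bit vector, and LoadI alignments 8 and 12 for ConvI2D alignments 16 and 24.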
>
> Similarly, when determining if the vectorization is profitable, type conversion between different data size takes a type of one size and produces a type of another size, hence the special checks on alignment and size should be applied, like what we do in SuperWord::is_vector_use().
>
> After solving these problems, we successfully implemented the vectorization of type conversion between different data sizes.
>
> Here is the test data (-XX:+UseSuperWord) on NEON:
>
> Before the patch:
> Benchmark   (length)  Mode  Cnt    Score    Error  Units
> convertD2F       523  avgt   15  216.431 ±  0.131  ns/op
> convertD2I       523  avgt   15  220.522 ±  0.311  ns/op
> convertF2D       523  avgt   15  217.034 ±  0.292  ns/op
> convertF2L       523  avgt   15  231.634 ±  1.881  ns/op
> convertI2D       523  avgt   15  229.538 ±  0.095  ns/op
> convertI2L       523  avgt   15  214.822 ±  0.131  ns/op
> convertL2F       523  avgt   15  230.188 ±  0.217  ns/op
> convertL2I       523  avgt   15  162.234 ±  0.235  ns/op
>
> After the patch:
> Benchmark   (length)  Mode  Cnt    Score    Error  Units
> convertD2F       523  avgt   15  124.352 ±  1.079  ns/op
> convertD2I       523  avgt   15  557.388 ±  8.166  ns/op
> convertF2D       523  avgt   15  118.082 ±  4.026  ns/op
> convertF2L       523  avgt   15  225.810 ± 11.180  ns/op
> convertI2D       523  avgt   15  166.247 ±  0.120  ns/op
> convertI2L       523  avgt   15  119.699 ±  2.925  ns/op
> convertL2F       523  avgt   15  220.847 ±  0.053  ns/op
> convertL2I       523  avgt   15  122.339 ±  2.738  ns/op
>
> perf data on X86:
> Before the patch:
> Benchmark   (length)  Mode  Cnt    Score    Error  Units
> convertD2F       523  avgt   15  279.466 ±  0.069  ns/op
> convertD2I       523  avgt   15  551.009 ±  7.459  ns/op
> convertF2D       523  avgt   15  276.066 ±  0.117  ns/op
> convertF2L       523  avgt   15  545.108 ±  5.697  ns/op
> convertI2D       523  avgt   15  745.303 ±  0.185  ns/op
> convertI2L       523  avgt   15  260.878 ±  0.044  ns/op
> convertL2F       523  avgt   15  502.016 ±  0.172  ns/op
> convertL2I       523  avgt   15  261.654 ±  3.326  ns/op
>
> After the patch:
> Benchmark   (length)  Mode  Cnt    Score    Error  Units
> convertD2F       523  avgt   15  106.975 ±  0.045  ns/op
> convertD2I       523  avgt   15  546.866 ±  9.287  ns/op
> convertF2D       523  avgt   15   82.414 ±  0.340  ns/op
> convertF2L       523  avgt   15  542.235 ±  2.785  ns/op
> convertI2D       523  avgt   15   92.966 ±  1.400  ns/op
> convertI2L       523  avgt   15   79.960 ±  0.528  ns/op
> convertL2F       523  avgt   15  504.712 ±  4.794  ns/op
> convertL2I       523  avgt   15  129.753 ±  0.094  ns/op
>
> perf data on AVX512:
> Before the patch:
> Benchmark   (length)  Mode  Cnt    Score    Error  Units
> convertD2F       523  avgt   15  282.984 ±  4.022  ns/op
> convertD2I       523  avgt   15  543.080 ±  3.873  ns/op
> convertF2D       523  avgt   15  273.950 ±  0.131  ns/op
> convertF2L       523  avgt   15  539.568 ±  2.747  ns/op
> convertI2D       523  avgt   15  745.238 ±  0.069  ns/op
> convertI2L       523  avgt   15  260.935 ±  0.169  ns/op
> convertL2F       523  avgt   15  501.870 ±  0.359  ns/op
> convertL2I       523  avgt   15  257.508 ±  0.174  ns/op
>
> After the patch:
> Benchmark   (length)  Mode  Cnt    Score    Error  Units
> convertD2F       523  avgt   15   76.687 ±  0.530  ns/op
> convertD2I       523  avgt   15  545.408 ±  4.657  ns/op
> convertF2D       523  avgt   15  273.935 ±  0.099  ns/op
> convertF2L       523  avgt   15  540.534 ±  3.032  ns/op
> convertI2D       523  avgt   15  745.234 ±  0.053  ns/op
> convertI2L       523  avgt   15  260.865 ±  0.104  ns/op
> convertL2F       523  avgt   15   63.834 ±  4.777  ns/op
> convertL2I       523  avgt   15   48.183 ±  0.990  ns/op

Fei Gao has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains five commits:

 - Merge branch 'master' into fg8283091

   Change-Id: I8deeae48449f1fc159c9bb5f82773e1bc6b5105f
 - Merge branch 'master' into fg8283091

   Change-Id: I1dfb4a6092302267e3796e08d411d0241b23df83
 - Add micro-benchmark cases

   Change-Id: I3c741255804ce410c8b6dcbdec974fa2c9051fd8
 - Merge branch 'master' into fg8283091

   Change-Id: I674581135fd0844accc65520574fcef161eededa
 - 8283091: Support type conversion between different data sizes in SLP

   After JDK-8275317, C2's SLP vectorizer has supported type conversion between the same data size.
We can also support conversions between different data sizes like: int <-> double float <-> long int <-> long float <-> double A typical test case: int[] a; double[] b; for (int i = start; i < limit; i++) { b[i] = (double) a[i]; } Our expected OptoAssembly code for one iteration is like below: add R12, R2, R11, LShiftL #2 vector_load V16,[R12, #16] vectorcast_i2d V16, V16 # convert I to D vector add R11, R1, R11, LShiftL #3 # ptr add R13, R11, #16 # ptr vector_store [R13], V16 To enable the vectorization, the patch solves the following problems in the SLP. There are three main operations in the case above, LoadI, ConvI2D and StoreD. Assuming that the vector length is 128 bits, how many scalar nodes should be packed together to a vector? If we decide it separately for each operation node, like what we did before the patch in SuperWord::combine_packs(), a 128-bit vector will support 4 LoadI or 2 ConvI2D or 2 StoreD nodes. However, if we put these packed nodes in a vector node sequence, like loading 4 elements to a vector, then typecasting 2 elements and lastly storing these 2 elements, they become invalid. As a result, we should look through the whole def-use chain and then pick up the minimum of these element sizes, like function SuperWord::max_vector_size_in_ud_chain() do in the superword.cpp. In this case, we pack 2 LoadI, 2 ConvI2D and 2 StoreD nodes, and then generate valid vector node sequence, like loading 2 elements, converting the 2 elements to another type and storing the 2 elements with new type. After this, LoadI nodes don't make full use of the whole vector and only occupy part of it. So we adapt the code in SuperWord::get_vw_bytes_special() to the situation. In SLP, we calculate a kind of alignment as position trace for each scalar node in the whole vector. In this case, the alignments for 2 LoadI nodes are 0, 4 while the alignment for 2 ConvI2D nodes are 0, 8. 
Sometimes, 4 for LoadI and 8 for ConvI2D work the same, both of which mark that this node is the second node in the whole vector, while the difference between 4 and 8 are just because of their own data sizes. In this situation, we should try to remove the impact caused by different data size in SLP. For example, in the stage of SuperWord::extend_packlist(), while determining if it's potential to pack a pair of def nodes in the function SuperWord::follow_use_defs(), we remove the side effect of different data size by transforming the target alignment from the use node. Because we believe that, assuming that the vector length is 512 bits, if the ConvI2D use nodes have alignments of 16-24 and their def nodes, LoadI, have alignments of 8-12, these two LoadI nodes should be packed as a pair as well.

   Similarly, when determining if the vectorization is profitable, type conversion between different data size takes a type of one size and produces a type of another size, hence the special checks on alignment and size should be applied, like what we do in SuperWord::is_vector_use.

   After solving these problems, we successfully implemented the vectorization of type conversion between different data sizes.

   Here is the test data on NEON:

   Before the patch:
   Benchmark              (length)  Mode  Cnt    Score    Error  Units
   VectorLoop.convertD2F       523  avgt   15  216.431 ±  0.131  ns/op
   VectorLoop.convertD2I       523  avgt   15  220.522 ±  0.311  ns/op
   VectorLoop.convertF2D       523  avgt   15  217.034 ±  0.292  ns/op
   VectorLoop.convertF2L       523  avgt   15  231.634 ±  1.881  ns/op
   VectorLoop.convertI2D       523  avgt   15  229.538 ±  0.095  ns/op
   VectorLoop.convertI2L       523  avgt   15  214.822 ±  0.131  ns/op
   VectorLoop.convertL2F       523  avgt   15  230.188 ±  0.217  ns/op
   VectorLoop.convertL2I       523  avgt   15  162.234 ±  0.235  ns/op

   After the patch:
   Benchmark              (length)  Mode  Cnt    Score    Error  Units
   VectorLoop.convertD2F       523  avgt   15  124.352 ±  1.079  ns/op
   VectorLoop.convertD2I       523  avgt   15  557.388 ±  8.166  ns/op
   VectorLoop.convertF2D       523  avgt   15  118.082 ±  4.026  ns/op
   VectorLoop.convertF2L       523  avgt   15  225.810 ± 11.180  ns/op
   VectorLoop.convertI2D       523  avgt   15  166.247 ±  0.120  ns/op
   VectorLoop.convertI2L       523  avgt   15  119.699 ±  2.925  ns/op
   VectorLoop.convertL2F       523  avgt   15  220.847 ±  0.053  ns/op
   VectorLoop.convertL2I       523  avgt   15  122.339 ±  2.738  ns/op

   perf data on X86:
   Before the patch:
   Benchmark              (length)  Mode  Cnt    Score    Error  Units
   VectorLoop.convertD2F       523  avgt   15  279.466 ±  0.069  ns/op
   VectorLoop.convertD2I       523  avgt   15  551.009 ±  7.459  ns/op
   VectorLoop.convertF2D       523  avgt   15  276.066 ±  0.117  ns/op
   VectorLoop.convertF2L       523  avgt   15  545.108 ±  5.697  ns/op
   VectorLoop.convertI2D       523  avgt   15  745.303 ±  0.185  ns/op
   VectorLoop.convertI2L       523  avgt   15  260.878 ±  0.044  ns/op
   VectorLoop.convertL2F       523  avgt   15  502.016 ±  0.172  ns/op
   VectorLoop.convertL2I       523  avgt   15  261.654 ±  3.326  ns/op

   After the patch:
   Benchmark              (length)  Mode  Cnt    Score    Error  Units
   VectorLoop.convertD2F       523  avgt   15  106.975 ±  0.045  ns/op
   VectorLoop.convertD2I       523  avgt   15  546.866 ±  9.287  ns/op
   VectorLoop.convertF2D       523  avgt   15   82.414 ±  0.340  ns/op
   VectorLoop.convertF2L       523  avgt   15  542.235 ±  2.785  ns/op
   VectorLoop.convertI2D       523  avgt   15   92.966 ±  1.400  ns/op
   VectorLoop.convertI2L       523  avgt   15   79.960 ±  0.528  ns/op
   VectorLoop.convertL2F       523  avgt   15  504.712 ±  4.794  ns/op
   VectorLoop.convertL2I       523  avgt   15  129.753 ±  0.094  ns/op

   perf data on AVX512:
   Before the patch:
   Benchmark              (length)  Mode  Cnt    Score    Error  Units
   VectorLoop.convertD2F       523  avgt   15  282.984 ±  4.022  ns/op
   VectorLoop.convertD2I       523  avgt   15  543.080 ±  3.873  ns/op
   VectorLoop.convertF2D       523  avgt   15  273.950 ±  0.131  ns/op
   VectorLoop.convertF2L       523  avgt   15  539.568 ±  2.747  ns/op
   VectorLoop.convertI2D       523  avgt   15  745.238 ±  0.069  ns/op
   VectorLoop.convertI2L       523  avgt   15  260.935 ±  0.169  ns/op
   VectorLoop.convertL2F       523  avgt   15  501.870 ±  0.359  ns/op
   VectorLoop.convertL2I       523  avgt   15  257.508 ±  0.174  ns/op

   After the patch:
   Benchmark              (length)  Mode  Cnt    Score    Error  Units
   VectorLoop.convertD2F       523  avgt   15   76.687 ±  0.530  ns/op
   VectorLoop.convertD2I       523  avgt   15  545.408 ±  4.657  ns/op
   VectorLoop.convertF2D       523  avgt   15  273.935 ±  0.099  ns/op
   VectorLoop.convertF2L       523  avgt   15  540.534 ±  3.032  ns/op
   VectorLoop.convertI2D       523  avgt   15  745.234 ±  0.053  ns/op
   VectorLoop.convertI2L       523  avgt   15  260.865 ±  0.104  ns/op
   VectorLoop.convertL2F       523  avgt   15   63.834 ±  4.777  ns/op
   VectorLoop.convertL2I       523  avgt   15   48.183 ±  0.990  ns/op

   Change-Id: I93e60fd956547dad9204ceec90220145c58a72ef

-------------

Changes: https://git.openjdk.java.net/jdk/pull/7806/files
Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=7806&range=04
Stats: 1140 lines in 15 files changed: 1092 ins; 13 del; 35 mod
Patch: https://git.openjdk.java.net/jdk/pull/7806.diff
Fetch: git fetch https://git.openjdk.java.net/jdk pull/7806/head:pull/7806

PR: https://git.openjdk.java.net/jdk/pull/7806

From eosterlund at openjdk.java.net Thu May 12 07:37:20 2022
From: eosterlund at openjdk.java.net (Erik Österlund)
Date: Thu, 12 May 2022 07:37:20 GMT
Subject: RFR: 8284404: Too aggressive sweeping with Loom
Message-ID:

The normal sweeping heuristics trigger sweeping whenever 0.5% of the reserved code cache could have died. Normally that is fine, but with loom such sweeping requires a full GC cycle, as stacks can now be in the Java heap as well. In that context, 0.5% does seem to be a bit too trigger happy. So this patch adjusts that default when using loom to 10x higher. If you run something like jython which spins up a lot of code, it unsurprisingly triggers a lot less GCs due to code cache pressure.
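To put the two thresholds in concrete terms, here is a small worked calculation in Java. Only the 0.5% figure and the 10x factor come from the message above; the 240 MB reserved code cache size is an assumption chosen for illustration, and `SweepThresholdSketch` and `thresholdBytes` are hypothetical names, not VM code.

```java
// Worked arithmetic only; not the sweeper heuristic itself.
final class SweepThresholdSketch {
    // Bytes of code cache that may have died before a sweep is triggered.
    static long thresholdBytes(long reservedCodeCacheBytes, double thresholdPercent) {
        return (long) (reservedCodeCacheBytes * thresholdPercent / 100.0);
    }

    public static void main(String[] args) {
        long reserved = 240L * 1024 * 1024;        // assumed size, for illustration
        double normalPercent = 0.5;                // figure quoted in the message
        double loomPercent = normalPercent * 10;   // "10x higher"

        System.out.println(thresholdBytes(reserved, normalPercent)); // 1258291 (about 1.2 MB)
        System.out.println(thresholdBytes(reserved, loomPercent));   // 12582912 (12 MB)
    }
}
```

Under that assumed cache size, a sweep (and with Loom, the GC cycle it needs) would be triggered after roughly 12 MB of potentially dead code rather than roughly 1.2 MB.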

-------------

Commit messages:
 - 8284404: Too aggressive sweeping with Loom

Changes: https://git.openjdk.java.net/jdk/pull/8673/files
Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=8673&range=00
Issue: https://bugs.openjdk.java.net/browse/JDK-8284404
Stats: 4 lines in 1 file changed: 3 ins; 0 del; 1 mod
Patch: https://git.openjdk.java.net/jdk/pull/8673.diff
Fetch: git fetch https://git.openjdk.java.net/jdk pull/8673/head:pull/8673

PR: https://git.openjdk.java.net/jdk/pull/8673

From rcastanedalo at openjdk.java.net Thu May 12 07:39:46 2022
From: rcastanedalo at openjdk.java.net (Roberto Castañeda Lozano)
Date: Thu, 12 May 2022 07:39:46 GMT
Subject: RFR: 8285820: C2: LCM prioritizes locally dependent CreateEx nodes over projections after 8270090
In-Reply-To:
References:
Message-ID:

On Fri, 6 May 2022 10:42:52 GMT, Roberto Castañeda Lozano wrote:

> I will file a separate RFE proposing a more robust alternative than altering the order of the LCM worklist for ensuring that initially ready CreateEx nodes are scheduled at the block start.

Here is the RFE: https://bugs.openjdk.java.net/browse/JDK-8286622.

-------------

PR: https://git.openjdk.java.net/jdk/pull/8568

From ngasson at openjdk.java.net Thu May 12 08:27:00 2022
From: ngasson at openjdk.java.net (Nick Gasson)
Date: Thu, 12 May 2022 08:27:00 GMT
Subject: RFR: 8283689: Update the foreign linker VM implementation [v10]
In-Reply-To:
References:
Message-ID: <61uaOdBJDL6FscYySqo_tZCDdBiitl8AwNYqmV1w-aU=.80f61175-19aa-4bfc-9301-e5930d54b0a2@github.com>

On Thu, 12 May 2022 03:34:19 GMT, David Holmes wrote:

>> Jorn Vernee has updated the pull request with a new target base due to a merge or a rebase.
The pull request now contains 26 commits: >> >> - Block async exceptions during upcalls >> - Merge branch 'foreign-preview-m' into JEP-19-VM-IMPL2 >> - Fix use of rt_call >> - Migrate to GrowableArray >> - Address some review comments >> - Merge branch 'foreign-preview-m' into JEP-19-VM-IMPL2 >> - Remove unneeded ComputeMoveOrder >> - Remove comment about native calls in lcm.cpp >> - 8284072: foreign/StdLibTest.java randomly crashes on MacOS/AArch64 >> >> Reviewed-by: jvernee, mcimadamore >> - Update riscv and arm stubs >> - ... and 16 more: https://git.openjdk.java.net/jdk/compare/cdd006e7...b29ad8f4 > > src/hotspot/cpu/aarch64/foreign_globals_aarch64.cpp line 3: > >> 1: /* >> 2: * Copyright (c) 2020, 2021, Oracle and/or its affiliates. All rights reserved. >> 3: * Copyright (c) 2019, 2021, Arm Limited. All rights reserved. > > Only update third-party copyrights under direction from that copyright holder. This may not be the correct format for example. > Also it is 2022. :) I think the Arm Ltd one was probably changed by me in one of the PRs in the Panama repo. ------------- PR: https://git.openjdk.java.net/jdk/pull/7959 From aph at openjdk.java.net Thu May 12 09:13:47 2022 From: aph at openjdk.java.net (Andrew Haley) Date: Thu, 12 May 2022 09:13:47 GMT Subject: RFR: 8283085: AArch64: No need to leave a breadcrumb for JavaFrameAnchor::capture_last_Java_pc when leaf call In-Reply-To: References: Message-ID: On Tue, 15 Mar 2022 09:22:14 GMT, Denghui Dong wrote: > Hi, > > Could I have a review of this fix that remove the breadcrumb in `aarch64_enc_java_to_runtime` for `Op_CallLeaf`, `Op_CallLeafNoFP` and `Op_CallNative`. > > For more details please refer to the description of the issue. > > Thanks, > Denghui Hi, I see this one has timed out. I'm sorry I didn't reply. I think you're probably right, but it would also require testing with asynchronous stack tracing, attached JVMTI, and so on. 
------------- PR: https://git.openjdk.java.net/jdk/pull/7815 From jvernee at openjdk.java.net Thu May 12 09:28:59 2022 From: jvernee at openjdk.java.net (Jorn Vernee) Date: Thu, 12 May 2022 09:28:59 GMT Subject: RFR: 8283689: Update the foreign linker VM implementation [v11] In-Reply-To: References: Message-ID: > Hi, > > This PR updates the VM implementation of the foreign linker, by bringing over commits from the panama-foreign repo. > > This is split off from the main JEP integration for 19, since we have limited resources to handle this. As such, this PR might fall over to 20, but it would be nice if we could get it into 19. > > I've written up an overview of the Linker architecture here: http://cr.openjdk.java.net/~jvernee/docs/FL_Overview.html it might be useful to read that first. > > This patch moves from the "legacy" implementation, to what is currently implemented in the panama-foreign repo, except for replacing the use of method handle combinators with ASM. That will come in a later path. To recap. This PR contains the following changes: > > 1. VM stubs for downcalls are now generated up front, instead of lazily by C2 [1]. > 2. the VM support for upcalls/downcalls now support all possible call shapes. And VM stubs and Java code implementing the buffered invocation strategy has been removed [2], [3], [4], [5]. > 3. The existing C2 intrinsification support for the `linkToNative` method handle linker was no longer needed and has been removed [6] (support might be re-added in another form later). > 4. Some other cleanups, such as: OptimizedEntryBlob (for upcalls) now implements RuntimeBlob directly. Binding to java classes has been rewritten to use javaClasses.h/cpp (this wasn't previously possible due to these java classes being in an incubator module) [7], [8], [9]. > > While the patch mostly consists of VM changes, there are also some Java changes to support (2). 
> > The original commit structure has been mostly retained, so it might be useful to look at a specific commit, or the corresponding patch in the [panama-foreign](https://github.com/openjdk/panama-foreign/pulls?q=is%3Apr) repo as well. I've also left some inline comments to explain some of the changes, which will hopefully make reviewing easier. > > Testing: Tier1-4 > > Thanks, > Jorn > > [1]: https://github.com/openjdk/jdk/pull/7959/commits/048b88156814579dca1f70742061ad24942fd358 > [2]: https://github.com/openjdk/jdk/pull/7959/commits/2fbbef472b4c2b4fee5ede2f18cd81ab61e88f49 > [3]: https://github.com/openjdk/jdk/pull/7959/commits/8a957a4ed9cc8d1f708ea8777212eb51ab403dc3 > [4]: https://github.com/openjdk/jdk/pull/7959/commits/35ba1d964f1de4a77345dc58debe0565db4b0ff3 > [5]: https://github.com/openjdk/jdk/pull/7959/commits/4e72aae22920300c5ffa16fed805b62ed9092120 > [6]: https://github.com/openjdk/jdk/pull/7959/commits/08e22e1b468c5c8f0cfd7135c72849944068aa7a > [7]: https://github.com/openjdk/jdk/pull/7959/commits/451cd9edf54016c182dab21a8b26bd8b609fc062 > [8]: https://github.com/openjdk/jdk/pull/7959/commits/4c851d2795afafec3a3ab17f4142ee098692068f > [9]: https://github.com/openjdk/jdk/pull/7959/commits/d025377799424f31512dca2ffe95491cd5ae22f9 Jorn Vernee has updated the pull request incrementally with one additional commit since the last revision: Revert "Block async exceptions during upcalls" This reverts commit b29ad8f46732666f2d07e63ce8701b1eb7bed790. 
------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/7959/files - new: https://git.openjdk.java.net/jdk/pull/7959/files/b29ad8f4..1c04a42e Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=7959&range=10 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=7959&range=09-10 Stats: 25 lines in 3 files changed: 0 ins; 21 del; 4 mod Patch: https://git.openjdk.java.net/jdk/pull/7959.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/7959/head:pull/7959 PR: https://git.openjdk.java.net/jdk/pull/7959 From jvernee at openjdk.java.net Thu May 12 09:29:00 2022 From: jvernee at openjdk.java.net (Jorn Vernee) Date: Thu, 12 May 2022 09:29:00 GMT Subject: RFR: 8283689: Update the foreign linker VM implementation [v7] In-Reply-To: <95e2d32uLxJbWldoqsr9yAoT3LD8Yyd6cLmnFuvSEOI=.4e961828-6086-4c63-9bc3-6bb60f8a5931@github.com> References: <1Y6gupd3SigdhlFjonKu4x4z-0mwPwCc-LJsqt6pp4c=.201b0c49-4887-4440-8526-31420fa3dda3@github.com> <_lwAL7Yg4Rr98gmWeQisR1ioc8MkVK87npZEUbB4vOw=.6434e6a4-35b9-4f81-9df3-d71973f1d75e@github.com> <9J0HneQ8kNy0t1-JDUQsXzoj4ljYwg80jiespX8laL8=.c02f9a8e-04b5-40db-8024-cdec556fcc53@github.com> <95e2d32uLxJbWldoqsr9yAoT3LD8Yyd6cLmnFuvSEOI=.4e961828-6086-4c63-9bc3-6bb60f8a5931@github.com> Message-ID: On Thu, 12 May 2022 03:32:15 GMT, David Holmes wrote: >> Is it possible for these upcalls to be nested? If yes, we could add a boolean to context to avoid unsetting the flag in those nested cases. And now that I think we should probably add that check in NoAsyncExceptionDeliveryMark too if we allow broader use of this flag. David added the NoAsyncExceptionDeliveryMark code with that assert about nesting so maybe he might have more insights about that. > > NoAsyncExceptionDeliveryMark is not for general use! There is no provision for blocking async exceptions when running user-defined Java code. NoAsyncExceptionDeliveryMark was purely for protecting "system Java code". Okay, I see. 
I think I acted a little too hastily on this yesterday. I'll revert the change that uses this blocking mechanism.

The stack more or less looks like this during an upcall:

| --- |
| <1: user-defined try block with exception handler (maybe)> |
| --- |
| <2: user code start> |
| --- |
| <3: method handle impl frames 1> |
| --- |
| <4: upcall wrapper class with fallback handler 1> |
| --- |
| <5: method handle impl frames 2> |
| --- |
| <6: upcall stub with fallback handler 2> |
| --- |
| <7: unknown native code> |
| --- |

I think there are several options to address async exceptions:

1. Do nothing special for async exceptions. i.e. if they happen anywhere between 1. and 6. they will end up in one of the fallback handlers and the VM will be terminated.
2. Block async exceptions in all code up from 6.
3. Somehow block async exceptions only between 6. and 1. I think that is possible by changing the API so that the user passes us a method handle to their fallback exception handler. We would need 2 methods for blocking and unblocking async exceptions from Java. Then we could disable async exceptions at the start of 6., enable them at the start of the try block in 4. (around the call to user code), and disable them at the end of this try block. Then finally re-enable them at the end of 6. If an exception occurs in the try block in 4., delegate to the user-defined exception handler (but use the VM terminate strategy as a fallback for when another exception occurs).

The other problem I see with option 3 is that to make it fast enough (i.e. not incur a ~1.5-2x cost on call overhead), we would need compiler intrinsics for the blocking/unblocking, and in the past we've been unable to define 'critical' sections of code like that in C2 (it's an unsolved problem at this point).
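
For readers following along, option 3 can be sketched as a small C++ model. Everything in it is an assumption made for illustration: `async_blocked`, `AsyncBlockedScope`, `upcall_stub` and `upcall_wrapper` are hypothetical names rather than HotSpot APIs, ordinary C++ exceptions stand in for exceptions thrown by Java user code, and `WouldTerminateVM` merely marks the point where a fallback handler would take over.

```cpp
#include <cassert>
#include <exception>
#include <functional>

// Hypothetical per-thread flag standing in for "async exceptions are blocked".
thread_local bool async_blocked = false;

// Hypothetical RAII helper: sets the flag for one scope and restores the
// previous value when the scope is left, normally or by an exception.
class AsyncBlockedScope {
  bool _saved;
 public:
  explicit AsyncBlockedScope(bool value) : _saved(async_blocked) { async_blocked = value; }
  ~AsyncBlockedScope() { async_blocked = _saved; }
  AsyncBlockedScope(const AsyncBlockedScope&) = delete;
  AsyncBlockedScope& operator=(const AsyncBlockedScope&) = delete;
};

enum class UpcallResult { Completed, HandledByUser, WouldTerminateVM };

using UserCode = std::function<void()>;
using UserHandler = std::function<void(const std::exception&)>;

// Models frame 4: the wrapper with the try block around the call to user code.
UpcallResult upcall_wrapper(const UserCode& user_code, const UserHandler& user_handler) {
  assert(async_blocked);  // frame 6 is expected to have blocked delivery already
  try {
    AsyncBlockedScope unblocked(false);  // delivery allowed only around user code
    user_code();
    return UpcallResult::Completed;
  } catch (const std::exception& e) {
    // The scope above has been unwound, so the flag is set again here.
    try {
      user_handler(e);  // delegate to the user-supplied handler
      return UpcallResult::HandledByUser;
    } catch (...) {
      return UpcallResult::WouldTerminateVM;  // fallback: a second exception
    }
  } catch (...) {
    return UpcallResult::WouldTerminateVM;  // fallback: not something the handler accepts
  }
}

// Models frame 6: the stub blocks delivery for the whole upcall.
UpcallResult upcall_stub(const UserCode& user_code, const UserHandler& user_handler) {
  AsyncBlockedScope blocked(true);
  return upcall_wrapper(user_code, user_handler);
}
```

In this model the flag is cleared only while `user_code` runs; the user handler and the return path back to native code both see it set. The sketch says nothing about the cost of toggling the flag, which is the open problem raised in the message.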
------------- PR: https://git.openjdk.java.net/jdk/pull/7959 From jvernee at openjdk.java.net Thu May 12 09:31:36 2022 From: jvernee at openjdk.java.net (Jorn Vernee) Date: Thu, 12 May 2022 09:31:36 GMT Subject: RFR: 8283689: Update the foreign linker VM implementation [v10] In-Reply-To: <61uaOdBJDL6FscYySqo_tZCDdBiitl8AwNYqmV1w-aU=.80f61175-19aa-4bfc-9301-e5930d54b0a2@github.com> References: <61uaOdBJDL6FscYySqo_tZCDdBiitl8AwNYqmV1w-aU=.80f61175-19aa-4bfc-9301-e5930d54b0a2@github.com> Message-ID: On Thu, 12 May 2022 08:23:11 GMT, Nick Gasson wrote: >> src/hotspot/cpu/aarch64/foreign_globals_aarch64.cpp line 3: >> >>> 1: /* >>> 2: * Copyright (c) 2020, 2021, Oracle and/or its affiliates. All rights reserved. >>> 3: * Copyright (c) 2019, 2021, Arm Limited. All rights reserved. >> >> Only update third-party copyrights under direction from that copyright holder. This may not be the correct format for example. >> Also it is 2022. :) > > I think the Arm Ltd one was probably changed by me in one of the PRs in the Panama repo. Right, I brought in these commits from the Panama repo, so this is from Nick's updates there (specifically from this PR: https://github.com/openjdk/panama-foreign/pull/610). I'm not sure what to do here. I own the PR, but Nick is a contributor on it, so some of the changes are his (including this copyright year change). 
------------- PR: https://git.openjdk.java.net/jdk/pull/7959 From mcimadamore at openjdk.java.net Thu May 12 12:12:55 2022 From: mcimadamore at openjdk.java.net (Maurizio Cimadamore) Date: Thu, 12 May 2022 12:12:55 GMT Subject: RFR: 8283689: Update the foreign linker VM implementation [v7] In-Reply-To: References: <1Y6gupd3SigdhlFjonKu4x4z-0mwPwCc-LJsqt6pp4c=.201b0c49-4887-4440-8526-31420fa3dda3@github.com> <_lwAL7Yg4Rr98gmWeQisR1ioc8MkVK87npZEUbB4vOw=.6434e6a4-35b9-4f81-9df3-d71973f1d75e@github.com> <9J0HneQ8kNy0t1-JDUQsXzoj4ljYwg80jiespX8laL8=.c02f9a8e-04b5-40db-8024-cdec556fcc53@github.com> <95e2d32uLxJbWldoqsr9yAoT3LD8Yyd6cLmnFuvSEOI=.4e961828-6086-4c63-9bc3-6bb60f8a5931@github.com> Message-ID: On Thu, 12 May 2022 09:24:23 GMT, Jorn Vernee wrote: > Do nothing special for async exceptions. i.e. if they happen anywhere between 1. and 6. they will end up in one of the fallback handlers and the VM will be terminated. My understanding is that if they materialize while we're executing the upcall Java code, and that code has a try/catch block, we will go there rather than crash the VM. In other words, IMHO the only problem with async exceptions is if they occur _after_ the Java user code has completed, because that will crash the Java adapter, thus preventing it from returning to the native call cleanly. So, either we disable async exceptions during that phase (e.g. after user code has executed, but before we return back to native code), or we just punt and stop. Since this seems like a corner^3 case, and since there are also other issues with upcalls that can occur if other threads do not cooperate (e.g. an upcall can get stuck in an infinite safepoint if the VM exits while an async native thread runs the upcall), and given that obtaining a linker is a restricted operation anyway, I don't think we should bend over backwards to try to add 1% more safety to something that's unavoidably sharp anyways.
------------- PR: https://git.openjdk.java.net/jdk/pull/7959 From duke at openjdk.java.net Thu May 12 12:42:31 2022 From: duke at openjdk.java.net (Emanuel Peter) Date: Thu, 12 May 2022 12:42:31 GMT Subject: RFR: 8286638: C2: CmpU needs to do more precise over/underflow analysis Message-ID:

`CmpUNode::Value` already does an under/overflow analysis, in case we have an `AddI` or `SubI` above it.
Instead of assuming the types are now the full `#int` range, we separately analyze the normal and the over/underflowed range.

We get the two ranges `tr1` and `tr2`, which we now both compare with the right input `t2`, via `sub`.
If both `cmp1` and `cmp2` are equal, for example both are `[ge]`, and below we have a Bool node that checks for `[lt]`, we know that this can never be true.

However, I now encountered a case where `cmp1` was `[gt]` and `cmp2` was `[ge]`. Unfortunately, they are not the same, so we just discard our analysis and, since it is an overflow case, cannot say anything. But we could actually know that both will never be `[lt]`.

Thus, **I propose** to take the `meet` (the union of all possible results) of `cmp1` and `cmp2`. In this example, the meet would be `[ge]`.

**Why is this important?**
I got a bug, where a ConvI2L node was able to determine the range was impossible, ripping out the data-flow. But the range-check did not manage to do the same analysis, because of an underflow. This leads to some mangled code further down.

**Detailed analysis of that case:**

type `i: [minint...0]`
access to `c[i-1]`

**Range-check:**
`int index = AddI(i, -1)`
-> type index: [minint-1 ... -1] -> underflow
We detect that this AddI may have 2 ranges:
`tr1: int:<=-1`
`tr2: int:max` (underflow: minint-1)

We then check how these ranges compare to in2:
`t2: int:>=0`

For this we compute:
`const Type* cmp1 = sub(tr1, t2);` -> TypeInt::CC_GT = [1]
`const Type* cmp2 = sub(tr2, t2);` -> TypeInt::CC_GE = [0...1]

But then, we only do something with this result if `cmp1 == cmp2`.
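
To make the difference concrete, here is a minimal C++ model of the two strategies. It is a sketch under stated assumptions, not the patch: `CCRange`, `combine_if_equal`, `combine_with_meet` and `lt_possible` are hypothetical names, condition codes are modelled as integer ranges inside [-1, 1] following the `CC_GT = [1]` and `CC_GE = [0...1]` values quoted above, and falling back to the full range when `cmp1 != cmp2` is a simplification of what the existing code does after discarding the analysis.

```cpp
#include <algorithm>
#include <cassert>

// Simplified stand-in for a condition-code type: an int range within [-1, 1],
// where -1 means "less", 0 means "equal" and 1 means "greater".
struct CCRange {
  int lo;
  int hi;
};

const CCRange CC_GT  = {1, 1};   // only "greater"
const CCRange CC_GE  = {0, 1};   // "equal" or "greater"
const CCRange CC_ALL = {-1, 1};  // nothing known

bool same(const CCRange& a, const CCRange& b) {
  return a.lo == b.lo && a.hi == b.hi;
}

// Meet of two ranges in this model: the smallest range covering both.
CCRange meet(const CCRange& a, const CCRange& b) {
  assert(a.lo <= a.hi && b.lo <= b.hi);
  return CCRange{std::min(a.lo, b.lo), std::max(a.hi, b.hi)};
}

// Simplified model of the behaviour described above: the two results are
// used only when they are identical, otherwise nothing is concluded.
CCRange combine_if_equal(const CCRange& cmp1, const CCRange& cmp2) {
  return same(cmp1, cmp2) ? cmp1 : CC_ALL;
}

// Model of the proposal: always combine the two results with a meet.
CCRange combine_with_meet(const CCRange& cmp1, const CCRange& cmp2) {
  return meet(cmp1, cmp2);
}

// Can a Bool [lt] below the compare ever be true for this result?
bool lt_possible(const CCRange& r) {
  return r.lo <= -1;
}
```

With `cmp1 = CC_GT` and `cmp2 = CC_GE`, `combine_with_meet` yields `[0...1]`, so `lt_possible` is false, while `combine_if_equal` gives up and leaves `[lt]` possible.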
We never detect that the `Bool [lt]` could never be true.

**Data-flow:**
`long index = ConvI2L( AddI(i, -1) )`
-> type of `ConvI2L: [0...maxint-1]`
-> why do we know this? Because this is before an array access. We assume range-check guarantees index in range `[0...c.size()-1]`, and `c.size()<=maxint`.
Then there is a push_thru_add, and we get:
`long index = AddL( ConvI2L(i), -1)`
-> type of new `ConvI2L: [1...maxint-1]` - because we correct the lo by 1 for the add. Somehow we do not adjust hi, in my opinion it should now be maxint, to correct by 1.
Consequence: if hi is maxint or maxint-1, there is no overflow.
Then, we statically detect that:
type `i: [minint...0]`
type `ConvI2L: [1...maxint-1]`
-> filter results in `TOP` -> data-flow is eliminated successfully.

Added **regression test** that matches this example above.
Running larger test suite now...

------------- Commit messages: - fix whitespace - 8286638: C2: CmpU needs to do more precise over/underflow analysis Changes: https://git.openjdk.java.net/jdk/pull/8679/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=8679&range=00 Issue: https://bugs.openjdk.java.net/browse/JDK-8286638 Stats: 67 lines in 2 files changed: 62 ins; 1 del; 4 mod Patch: https://git.openjdk.java.net/jdk/pull/8679.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8679/head:pull/8679 PR: https://git.openjdk.java.net/jdk/pull/8679 From duke at openjdk.java.net Thu May 12 12:49:48 2022 From: duke at openjdk.java.net (Emanuel Peter) Date: Thu, 12 May 2022 12:49:48 GMT Subject: RFR: 8286638: C2: CmpU needs to do more precise over/underflow analysis [v2] In-Reply-To: References: Message-ID:

> `CmpUNode::Value` already does an under/overflow analysis, in case we have an `AddI` or `SubI` above it.
> Instead of assuming the types are now the full `#int` range, we separately analyze the normal and the over/underflowed range.
> > We get the two ranges `tr1` and `tr2`, which we now both compare with the right input `t2`, via `sub`. > If both `cmp1` and `cmp2` are equal, for example both are `[ge]`, and below we have a Bool node that checks for `[lt]`, we know that this can never be true. > > However, I now encountered a case where `cmp1` was `[gt]` and `cmp2` was `[ge]`. Unfortunately, they are not the same, so we just discarded our analysis, and since it is an overflow case just cannot say anything. But we could actually know that both will never be `[lt]`. > > Thus, **I propose** to take the `meet` (the union of all possible results of `cmp1` and `cmp2`. In this example, the meet would be `[ge]`. > > **Why is this important?** > I got a bug, where a ConvI2L node was able to determine the range was impossible, ripping out the data-flow. But the range-check did not manage to do the same analysis, because of an underflow. This leads to some mangled code further down. > > **Detailed analysis of that case:** > > type `i: [minint...0]` > access to `c[i-1]` > > **Range-check:** > `int index = AddI(i, -1)` > -> type index: [minint-1 ... -1] -> underflow > We detect that this AddI may have 2 ranges: > `tr1: int:<=-1` > `tr2: int:max `(underflow: minint-1) > > We then check how these ranges compare to in2: > `t2: int:>=0` > > For this we compute: > `const Type* cmp1 = sub(tr1, t2);` -> TypeInt::CC_GT = [1] > `const Type* cmp2 = sub(tr2, t2);` -> TypeInt::CC_GE = [0...1] > > But then, we only do something with this result if `cmp1 == cmp2`. > We never detect that the `Bool [lt] `could never be true. > > > **Data-flow:** > `long index = ConvI2L( AddI(i, -1) )` > -> type of` ConvI2L: [0...maxint-1]` > -> why do we know this? Because this is before an array access. We assume range-check guarantees index in range `[0...c.size()-1]`, and `c.size()<=maxint`. 
> Then there is a push_thru_add, and we get: > `long index = AddL( ConvI2L(i), -1)` > -> type of new `ConvI2L: [1...maxint-1]` - because we correct the lo by 1 for the add. Somehow we do not adjust hi, in my opinion it should now be maxint, to correct by 1. > Consequence: if hi is maxint or maxint-1, there is no overflow. > Then, we statically detect that: > type `i: [minint...0]` > type` ConvI2L: [1...maxint-1]` > -> filter results in `TOP` -> data-flow is eliminated sucessfully. > > > Added **regression test** that matches this example above. > Running larger test suite now... Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: removed useless test case with fixed seed ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/8679/files - new: https://git.openjdk.java.net/jdk/pull/8679/files/a42dfe19..7d343308 Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=8679&range=01 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=8679&range=00-01 Stats: 5 lines in 1 file changed: 0 ins; 5 del; 0 mod Patch: https://git.openjdk.java.net/jdk/pull/8679.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8679/head:pull/8679 PR: https://git.openjdk.java.net/jdk/pull/8679 From dholmes at openjdk.java.net Thu May 12 13:12:04 2022 From: dholmes at openjdk.java.net (David Holmes) Date: Thu, 12 May 2022 13:12:04 GMT Subject: RFR: 8283689: Update the foreign linker VM implementation [v10] In-Reply-To: <61uaOdBJDL6FscYySqo_tZCDdBiitl8AwNYqmV1w-aU=.80f61175-19aa-4bfc-9301-e5930d54b0a2@github.com> References: <61uaOdBJDL6FscYySqo_tZCDdBiitl8AwNYqmV1w-aU=.80f61175-19aa-4bfc-9301-e5930d54b0a2@github.com> Message-ID: On Thu, 12 May 2022 08:23:11 GMT, Nick Gasson wrote: >> src/hotspot/cpu/aarch64/foreign_globals_aarch64.cpp line 3: >> >>> 1: /* >>> 2: * Copyright (c) 2020, 2021, Oracle and/or its affiliates. All rights reserved. >>> 3: * Copyright (c) 2019, 2021, Arm Limited. All rights reserved. 
>> >> Only update third-party copyrights under direction from that copyright holder. This may not be the correct format for example. >> Also it is 2022. :) > > I think the Arm Ltd one was probably changed by me in one of the PRs in the Panama repo. That's fine, just needed to check. But as these are now being committed to mainline the year does need to change to 2022 - at least in Oracle copyright. @nick-arm will have to advise what he thinks should be done with the ARM copyright. ------------- PR: https://git.openjdk.java.net/jdk/pull/7959 From ngasson at openjdk.java.net Thu May 12 13:59:11 2022 From: ngasson at openjdk.java.net (Nick Gasson) Date: Thu, 12 May 2022 13:59:11 GMT Subject: RFR: 8283689: Update the foreign linker VM implementation [v10] In-Reply-To: References: <61uaOdBJDL6FscYySqo_tZCDdBiitl8AwNYqmV1w-aU=.80f61175-19aa-4bfc-9301-e5930d54b0a2@github.com> Message-ID: On Thu, 12 May 2022 13:07:24 GMT, David Holmes wrote: >> I think the Arm Ltd one was probably changed by me in one of the PRs in the Panama repo. > > That's fine, just needed to check. But as these are now being committed to mainline the year does need to change to 2022 - at least in Oracle copyright. @nick-arm will have to advise what he thinks should be done with the ARM copyright. I think the Arm line can be updated to 2022 at the same time. ------------- PR: https://git.openjdk.java.net/jdk/pull/7959 From jvernee at openjdk.java.net Thu May 12 14:53:03 2022 From: jvernee at openjdk.java.net (Jorn Vernee) Date: Thu, 12 May 2022 14:53:03 GMT Subject: RFR: 8283689: Update the foreign linker VM implementation [v12] In-Reply-To: References: Message-ID: > Hi, > > This PR updates the VM implementation of the foreign linker, by bringing over commits from the panama-foreign repo. > > This is split off from the main JEP integration for 19, since we have limited resources to handle this. As such, this PR might fall over to 20, but it would be nice if we could get it into 19. 
> > I've written up an overview of the Linker architecture here: http://cr.openjdk.java.net/~jvernee/docs/FL_Overview.html it might be useful to read that first. > > This patch moves from the "legacy" implementation, to what is currently implemented in the panama-foreign repo, except for replacing the use of method handle combinators with ASM. That will come in a later path. To recap. This PR contains the following changes: > > 1. VM stubs for downcalls are now generated up front, instead of lazily by C2 [1]. > 2. the VM support for upcalls/downcalls now support all possible call shapes. And VM stubs and Java code implementing the buffered invocation strategy has been removed [2], [3], [4], [5]. > 3. The existing C2 intrinsification support for the `linkToNative` method handle linker was no longer needed and has been removed [6] (support might be re-added in another form later). > 4. Some other cleanups, such as: OptimizedEntryBlob (for upcalls) now implements RuntimeBlob directly. Binding to java classes has been rewritten to use javaClasses.h/cpp (this wasn't previously possible due to these java classes being in an incubator module) [7], [8], [9]. > > While the patch mostly consists of VM changes, there are also some Java changes to support (2). > > The original commit structure has been mostly retained, so it might be useful to look at a specific commit, or the corresponding patch in the [panama-foreign](https://github.com/openjdk/panama-foreign/pulls?q=is%3Apr) repo as well. I've also left some inline comments to explain some of the changes, which will hopefully make reviewing easier. 
> > Testing: Tier1-4 > > Thanks, > Jorn > > [1]: https://github.com/openjdk/jdk/pull/7959/commits/048b88156814579dca1f70742061ad24942fd358 > [2]: https://github.com/openjdk/jdk/pull/7959/commits/2fbbef472b4c2b4fee5ede2f18cd81ab61e88f49 > [3]: https://github.com/openjdk/jdk/pull/7959/commits/8a957a4ed9cc8d1f708ea8777212eb51ab403dc3 > [4]: https://github.com/openjdk/jdk/pull/7959/commits/35ba1d964f1de4a77345dc58debe0565db4b0ff3 > [5]: https://github.com/openjdk/jdk/pull/7959/commits/4e72aae22920300c5ffa16fed805b62ed9092120 > [6]: https://github.com/openjdk/jdk/pull/7959/commits/08e22e1b468c5c8f0cfd7135c72849944068aa7a > [7]: https://github.com/openjdk/jdk/pull/7959/commits/451cd9edf54016c182dab21a8b26bd8b609fc062 > [8]: https://github.com/openjdk/jdk/pull/7959/commits/4c851d2795afafec3a3ab17f4142ee098692068f > [9]: https://github.com/openjdk/jdk/pull/7959/commits/d025377799424f31512dca2ffe95491cd5ae22f9 Jorn Vernee has updated the pull request incrementally with one additional commit since the last revision: Update Oracle copyright years ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/7959/files - new: https://git.openjdk.java.net/jdk/pull/7959/files/1c04a42e..9a7bb6bb Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=7959&range=11 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=7959&range=10-11 Stats: 70 lines in 70 files changed: 0 ins; 0 del; 70 mod Patch: https://git.openjdk.java.net/jdk/pull/7959.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/7959/head:pull/7959 PR: https://git.openjdk.java.net/jdk/pull/7959 From ngasson at openjdk.java.net Thu May 12 15:00:16 2022 From: ngasson at openjdk.java.net (Nick Gasson) Date: Thu, 12 May 2022 15:00:16 GMT Subject: RFR: 8286596: AArch64: -XX:UseBranchProtection=pac-ret crashes after JDK-8284161 Message-ID: `RegisterSaver::restore_live_registers()` used to call `__ leave()` but after the Loom integration it directly pops LR/FP from the stack. 
With `-XX:UseBranchProtection=pac-ret` we need a call to `__ authenticate_return_address()` here to insert the AUTIA instruction to check and strip the PAC code from the saved LR. Tested `java -XX:UseBranchProtection=pac-ret -version` on a machine that supports PAC, plus tier1. Note that some additional fixes will be required to support virtual threads with PAC enabled. ------------- Commit messages: - 8286596: AArch64: -XX:UseBranchProtection=pac-ret crashes after JDK-8284161 Changes: https://git.openjdk.java.net/jdk/pull/8682/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=8682&range=00 Issue: https://bugs.openjdk.java.net/browse/JDK-8286596 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.java.net/jdk/pull/8682.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8682/head:pull/8682 PR: https://git.openjdk.java.net/jdk/pull/8682 From jvernee at openjdk.java.net Thu May 12 15:03:05 2022 From: jvernee at openjdk.java.net (Jorn Vernee) Date: Thu, 12 May 2022 15:03:05 GMT Subject: RFR: 8283689: Update the foreign linker VM implementation [v10] In-Reply-To: References: <61uaOdBJDL6FscYySqo_tZCDdBiitl8AwNYqmV1w-aU=.80f61175-19aa-4bfc-9301-e5930d54b0a2@github.com> Message-ID: On Thu, 12 May 2022 13:55:20 GMT, Nick Gasson wrote: >> That's fine, just needed to check. But as these are now being committed to mainline the year does need to change to 2022 - at least in Oracle copyright. @nick-arm will have to advise what he thinks should be done with the ARM copyright. > > I think the Arm line can be updated to 2022 at the same time. I've updated all the Oracle copyright years in the files touched by this PR. @nick-arm If you wouldn't mind, could you use the "Add a suggestion" feature (the +/- button when leaving a review comment) to suggest copyright year updates to the relevant files (I think it's just foreign_globals_aarch64.cpp and universalUpcallHandler_aarch64.cpp). 
That way, when I accept those suggestions, an automatic commit will be made with you as the commit co-author, and it's even clearer that I'm making that change on your behalf. Thanks. ------------- PR: https://git.openjdk.java.net/jdk/pull/7959 From jvernee at openjdk.java.net Thu May 12 15:07:57 2022 From: jvernee at openjdk.java.net (Jorn Vernee) Date: Thu, 12 May 2022 15:07:57 GMT Subject: RFR: 8283689: Update the foreign linker VM implementation [v13] In-Reply-To: References: Message-ID: > Hi, > > This PR updates the VM implementation of the foreign linker, by bringing over commits from the panama-foreign repo. > > This is split off from the main JEP integration for 19, since we have limited resources to handle this. As such, this PR might fall over to 20, but it would be nice if we could get it into 19. > > I've written up an overview of the Linker architecture here: http://cr.openjdk.java.net/~jvernee/docs/FL_Overview.html it might be useful to read that first. > > This patch moves from the "legacy" implementation, to what is currently implemented in the panama-foreign repo, except for replacing the use of method handle combinators with ASM. That will come in a later path. To recap. This PR contains the following changes: > > 1. VM stubs for downcalls are now generated up front, instead of lazily by C2 [1]. > 2. the VM support for upcalls/downcalls now support all possible call shapes. And VM stubs and Java code implementing the buffered invocation strategy has been removed [2], [3], [4], [5]. > 3. The existing C2 intrinsification support for the `linkToNative` method handle linker was no longer needed and has been removed [6] (support might be re-added in another form later). > 4. Some other cleanups, such as: OptimizedEntryBlob (for upcalls) now implements RuntimeBlob directly. Binding to java classes has been rewritten to use javaClasses.h/cpp (this wasn't previously possible due to these java classes being in an incubator module) [7], [8], [9]. 
> > While the patch mostly consists of VM changes, there are also some Java changes to support (2). > > The original commit structure has been mostly retained, so it might be useful to look at a specific commit, or the corresponding patch in the [panama-foreign](https://github.com/openjdk/panama-foreign/pulls?q=is%3Apr) repo as well. I've also left some inline comments to explain some of the changes, which will hopefully make reviewing easier. > > Testing: Tier1-4 > > Thanks, > Jorn > > [1]: https://github.com/openjdk/jdk/pull/7959/commits/048b88156814579dca1f70742061ad24942fd358 > [2]: https://github.com/openjdk/jdk/pull/7959/commits/2fbbef472b4c2b4fee5ede2f18cd81ab61e88f49 > [3]: https://github.com/openjdk/jdk/pull/7959/commits/8a957a4ed9cc8d1f708ea8777212eb51ab403dc3 > [4]: https://github.com/openjdk/jdk/pull/7959/commits/35ba1d964f1de4a77345dc58debe0565db4b0ff3 > [5]: https://github.com/openjdk/jdk/pull/7959/commits/4e72aae22920300c5ffa16fed805b62ed9092120 > [6]: https://github.com/openjdk/jdk/pull/7959/commits/08e22e1b468c5c8f0cfd7135c72849944068aa7a > [7]: https://github.com/openjdk/jdk/pull/7959/commits/451cd9edf54016c182dab21a8b26bd8b609fc062 > [8]: https://github.com/openjdk/jdk/pull/7959/commits/4c851d2795afafec3a3ab17f4142ee098692068f > [9]: https://github.com/openjdk/jdk/pull/7959/commits/d025377799424f31512dca2ffe95491cd5ae22f9 Jorn Vernee has updated the pull request incrementally with one additional commit since the last revision: Missed 2 years ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/7959/files - new: https://git.openjdk.java.net/jdk/pull/7959/files/9a7bb6bb..8100e0a7 Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=7959&range=12 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=7959&range=11-12 Stats: 2 lines in 2 files changed: 0 ins; 0 del; 2 mod Patch: https://git.openjdk.java.net/jdk/pull/7959.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/7959/head:pull/7959 PR: 
https://git.openjdk.java.net/jdk/pull/7959 From ngasson at openjdk.java.net Thu May 12 15:08:01 2022 From: ngasson at openjdk.java.net (Nick Gasson) Date: Thu, 12 May 2022 15:08:01 GMT Subject: RFR: 8283689: Update the foreign linker VM implementation [v12] In-Reply-To: References: Message-ID: On Thu, 12 May 2022 14:53:03 GMT, Jorn Vernee wrote: >> Hi, >> >> This PR updates the VM implementation of the foreign linker, by bringing over commits from the panama-foreign repo. >> >> This is split off from the main JEP integration for 19, since we have limited resources to handle this. As such, this PR might fall over to 20, but it would be nice if we could get it into 19. >> >> I've written up an overview of the Linker architecture here: http://cr.openjdk.java.net/~jvernee/docs/FL_Overview.html it might be useful to read that first. >> >> This patch moves from the "legacy" implementation, to what is currently implemented in the panama-foreign repo, except for replacing the use of method handle combinators with ASM. That will come in a later path. To recap. This PR contains the following changes: >> >> 1. VM stubs for downcalls are now generated up front, instead of lazily by C2 [1]. >> 2. the VM support for upcalls/downcalls now support all possible call shapes. And VM stubs and Java code implementing the buffered invocation strategy has been removed [2], [3], [4], [5]. >> 3. The existing C2 intrinsification support for the `linkToNative` method handle linker was no longer needed and has been removed [6] (support might be re-added in another form later). >> 4. Some other cleanups, such as: OptimizedEntryBlob (for upcalls) now implements RuntimeBlob directly. Binding to java classes has been rewritten to use javaClasses.h/cpp (this wasn't previously possible due to these java classes being in an incubator module) [7], [8], [9]. >> >> While the patch mostly consists of VM changes, there are also some Java changes to support (2). 
>> >> The original commit structure has been mostly retained, so it might be useful to look at a specific commit, or the corresponding patch in the [panama-foreign](https://github.com/openjdk/panama-foreign/pulls?q=is%3Apr) repo as well. I've also left some inline comments to explain some of the changes, which will hopefully make reviewing easier. >> >> Testing: Tier1-4 >> >> Thanks, >> Jorn >> >> [1]: https://github.com/openjdk/jdk/pull/7959/commits/048b88156814579dca1f70742061ad24942fd358 >> [2]: https://github.com/openjdk/jdk/pull/7959/commits/2fbbef472b4c2b4fee5ede2f18cd81ab61e88f49 >> [3]: https://github.com/openjdk/jdk/pull/7959/commits/8a957a4ed9cc8d1f708ea8777212eb51ab403dc3 >> [4]: https://github.com/openjdk/jdk/pull/7959/commits/35ba1d964f1de4a77345dc58debe0565db4b0ff3 >> [5]: https://github.com/openjdk/jdk/pull/7959/commits/4e72aae22920300c5ffa16fed805b62ed9092120 >> [6]: https://github.com/openjdk/jdk/pull/7959/commits/08e22e1b468c5c8f0cfd7135c72849944068aa7a >> [7]: https://github.com/openjdk/jdk/pull/7959/commits/451cd9edf54016c182dab21a8b26bd8b609fc062 >> [8]: https://github.com/openjdk/jdk/pull/7959/commits/4c851d2795afafec3a3ab17f4142ee098692068f >> [9]: https://github.com/openjdk/jdk/pull/7959/commits/d025377799424f31512dca2ffe95491cd5ae22f9 > > Jorn Vernee has updated the pull request incrementally with one additional commit since the last revision: > > Update Oracle copyright years src/hotspot/cpu/aarch64/foreign_globals_aarch64.cpp line 3: > 1: /* > 2: * Copyright (c) 2020, 2021, Oracle and/or its affiliates. All rights reserved. > 3: * Copyright (c) 2019, 2021, Arm Limited. All rights reserved. Suggestion: * Copyright (c) 2019, 2022, Arm Limited. All rights reserved. src/hotspot/cpu/aarch64/universalUpcallHandler_aarch64.cpp line 3: > 1: /* > 2: * Copyright (c) 2020, 2021, Oracle and/or its affiliates. All rights reserved. > 3: * Copyright (c) 2019, 2021, Arm Limited. All rights reserved. Suggestion: * Copyright (c) 2019, 2022, Arm Limited. 
All rights reserved. ------------- PR: https://git.openjdk.java.net/jdk/pull/7959 From jvernee at openjdk.java.net Thu May 12 15:16:08 2022 From: jvernee at openjdk.java.net (Jorn Vernee) Date: Thu, 12 May 2022 15:16:08 GMT Subject: RFR: 8283689: Update the foreign linker VM implementation [v14] In-Reply-To: References: Message-ID: > Hi, > > This PR updates the VM implementation of the foreign linker, by bringing over commits from the panama-foreign repo. > > This is split off from the main JEP integration for 19, since we have limited resources to handle this. As such, this PR might fall over to 20, but it would be nice if we could get it into 19. > > I've written up an overview of the Linker architecture here: http://cr.openjdk.java.net/~jvernee/docs/FL_Overview.html (it might be useful to read that first). > > This patch moves from the "legacy" implementation, to what is currently implemented in the panama-foreign repo, except for replacing the use of method handle combinators with ASM. That will come in a later patch. To recap, this PR contains the following changes: > > 1. VM stubs for downcalls are now generated up front, instead of lazily by C2 [1]. > 2. The VM support for upcalls/downcalls now supports all possible call shapes, and the VM stubs and Java code implementing the buffered invocation strategy have been removed [2], [3], [4], [5]. > 3. The existing C2 intrinsification support for the `linkToNative` method handle linker was no longer needed and has been removed [6] (support might be re-added in another form later). > 4. Some other cleanups, such as: OptimizedEntryBlob (for upcalls) now implements RuntimeBlob directly. Binding to Java classes has been rewritten to use javaClasses.h/cpp (this wasn't previously possible due to these Java classes being in an incubator module) [7], [8], [9]. > > While the patch mostly consists of VM changes, there are also some Java changes to support (2).
> > The original commit structure has been mostly retained, so it might be useful to look at a specific commit, or the corresponding patch in the [panama-foreign](https://github.com/openjdk/panama-foreign/pulls?q=is%3Apr) repo as well. I've also left some inline comments to explain some of the changes, which will hopefully make reviewing easier. > > Testing: Tier1-4 > > Thanks, > Jorn > > [1]: https://github.com/openjdk/jdk/pull/7959/commits/048b88156814579dca1f70742061ad24942fd358 > [2]: https://github.com/openjdk/jdk/pull/7959/commits/2fbbef472b4c2b4fee5ede2f18cd81ab61e88f49 > [3]: https://github.com/openjdk/jdk/pull/7959/commits/8a957a4ed9cc8d1f708ea8777212eb51ab403dc3 > [4]: https://github.com/openjdk/jdk/pull/7959/commits/35ba1d964f1de4a77345dc58debe0565db4b0ff3 > [5]: https://github.com/openjdk/jdk/pull/7959/commits/4e72aae22920300c5ffa16fed805b62ed9092120 > [6]: https://github.com/openjdk/jdk/pull/7959/commits/08e22e1b468c5c8f0cfd7135c72849944068aa7a > [7]: https://github.com/openjdk/jdk/pull/7959/commits/451cd9edf54016c182dab21a8b26bd8b609fc062 > [8]: https://github.com/openjdk/jdk/pull/7959/commits/4c851d2795afafec3a3ab17f4142ee098692068f > [9]: https://github.com/openjdk/jdk/pull/7959/commits/d025377799424f31512dca2ffe95491cd5ae22f9 Jorn Vernee has updated the pull request incrementally with one additional commit since the last revision: Apply copyright year updates per request of @nick-arm Co-authored-by: Nick Gasson ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/7959/files - new: https://git.openjdk.java.net/jdk/pull/7959/files/8100e0a7..0f49ff0b Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=7959&range=13 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=7959&range=12-13 Stats: 2 lines in 2 files changed: 0 ins; 0 del; 2 mod Patch: https://git.openjdk.java.net/jdk/pull/7959.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/7959/head:pull/7959 PR: https://git.openjdk.java.net/jdk/pull/7959 From jvernee 
at openjdk.java.net Thu May 12 15:16:10 2022 From: jvernee at openjdk.java.net (Jorn Vernee) Date: Thu, 12 May 2022 15:16:10 GMT Subject: RFR: 8283689: Update the foreign linker VM implementation [v10] In-Reply-To: References: <61uaOdBJDL6FscYySqo_tZCDdBiitl8AwNYqmV1w-aU=.80f61175-19aa-4bfc-9301-e5930d54b0a2@github.com> Message-ID: On Thu, 12 May 2022 13:55:20 GMT, Nick Gasson wrote: >> That's fine, just needed to check. But as these are now being committed to mainline the year does need to change to 2022 - at least in Oracle copyright. @nick-arm will have to advise what he thinks should be done with the ARM copyright. > > I think the Arm line can be updated to 2022 at the same time. @nick-arm Thanks! ------------- PR: https://git.openjdk.java.net/jdk/pull/7959 From duke at openjdk.java.net Thu May 12 15:35:27 2022 From: duke at openjdk.java.net (Emanuel Peter) Date: Thu, 12 May 2022 15:35:27 GMT Subject: RFR: 8283775: VM support for graph querying in debugger with BFS traversal and node filtering [v7] In-Reply-To: References: Message-ID: > I implemented VM support for BFS traversal of IR nodes, where one can filter for input/output edges, and the node-type (control / data / memory). If one specifies the `target` node, we find a shortest path between `this` and `target`. With the `options` string one can easily select which node types to traverse. > > `void Node::print_bfs(const uint max_distance, Node* target, const char* options)` > > While a sufficient summary and description is in the comments above the function definition, I want to explain some useful use-cases here. > > Please let me know if you would find this helpful, or if you have any feedback to improve it. > Thanks, Emanuel > > **1. Better dump()** > The function is similar to `dump()`, we can also follow input / output edges up to a certain distance. There are a few improvements/added features: > > 1. Display the distance to the specified node. 
This way I can quickly assess how many nodes I have at a certain distance, and I can traverse edges from one distance to the next. The parent column shows the node one step closer to the BFS root (this). > 2. Node filtering with types taken from `type->category()` help one limit traversal to control, data or memory flow. The categories are the same as in IGV. > 3. Probably controversial: coloring of node types in terminal. By default it is off. May not work on all terminals. But I find it very useful to quickly separate memory, data and control - just as in IGV - but in my terminal! > 4. After matching, we create Mach nodes from the IR nodes. For most Mach nodes, there is an associated old node. Often an assert is triggered for Mach nodes, but the graph already breaks in the IR graph. It is thus very helpful to have the old node idx displayed. > 5. I also display the head node idx of the block a Mach node is scheduled for. This can be helpful if one gets bad dominance asserts, and needs to see why the blocks were scheduled in a bad way. > > Example: > > (rr) p find_node(35)->print_bfs(2, 0, "cdmox+") > No target: perform BFS. > dis par c dump > --------------------------------------------- > 0 35 d 35 CmpP === _ 34 25 [[ 36 ]] > 1 35 d 34 LoadP === _ 31 33 [[ 35 ]] > 1 35 d 25 ConP === 0 [[ 26 27 31 35 41 ]] #NULL > 2 34 m 31 StoreP === 20 27 29 25 [[ 23 34 41 42 ]] > 2 34 d 33 AddP === _ 1 12 32 [[ 34 ]] > > > Example with Mach nodes: > > (rr) p ctrl->print_bfs(4, 0, "cdmox+OB") > No target: perform BFS. 
> dis [head idom d] old par c dump > --------------------------------------------- > 0 159 147 6 _ 159 c 159 Region === 159 57 [[ 159 158 59 ]] > 1 147 148 5 o183 159 c 57 IfTrue === 8 [[ 159 ]] > 2 147 148 5 o182 57 c 8 jmpConU === 147 9 [[ 7 57 ]] > 3 147 148 5 _ 8 c 147 Region === 147 14 [[ 147 8 ]] > 3 147 148 5 o180 8 d 9 compUL_rReg === _ 10 13 [[ 8 ]] > 4 148 149 4 o174 147 c 14 IfTrue === 15 [[ 147 ]] > 4 147 148 5 o203 9 d 10 decL_rReg === _ 11 [[ 12 9 ]] > 4 147 148 5 o179 9 d 13 convI2L_reg_reg === _ 28 [[ 9 ]] > > > **2. Find loop body** > When I find myself in a loop, I try to localize the loop head and end, and map out at least one path between them. > `loop_end->print_bfs(20, loop_head, "cox+")` > This provides us with a shortest path, given this path has a distance of at most 20. > > Example: > > (rr) p find_node(158)->print_bfs(20, find_node(160), "cox+") > Find shortest path: 158 -> 160. > > Backtrace target. > dis c dump > --------------------------------------------- > 9 c 160 OuterStripMinedLoop === 160 339 159 [[ 160 358 ]] > 8 c 358 CountedLoop === 358 160 143 [[ 358 362 363 ]] > 7 c 363 If === 358 351 [[ 364 367 ]] > 6 c 364 IfTrue === 363 [[ 128 ]] > 5 c 128 If === 364 127 [[ 129 130 ]] > 4 c 129 IfTrue === 128 [[ 155 ]] > 3 c 155 CountedLoopEnd === 129 154 [[ 157 143 ]] [lt] > 2 c 157 IfFalse === 155 [[ 162 163 ]] > 1 c 162 SafePoint === 157 1 7 1 1 163 100 1 1 13 27 133 [[ 158 ]] > 0 c 158 OuterStripMinedLoopEnd === 162 156 [[ 159 227 ]] > > Example with Mach nodes: > > (rr) p ctrl->print_bfs(10, val, "cdmox-+OB") > Find shortest path: 159 -> 27. > > Backtrace target. 
> dis [head idom d] old e c dump > --------------------------------------------- > 2 24 1 2 o10 + d 27 MachProj === 24 [[ 19 28 4 59 95 99 118 ]] > 1 56 159 7 o239 - d 59 loadB === 159 29 27 60 [[ 55 ]] > 0 159 147 6 _ c 159 Region === 159 57 [[ 159 158 59 ]] Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: refactoring into class, no lambdas ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/8468/files - new: https://git.openjdk.java.net/jdk/pull/8468/files/4b7e3b1d..d6d2666d Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=8468&range=06 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=8468&range=05-06 Stats: 393 lines in 1 file changed: 163 ins; 141 del; 89 mod Patch: https://git.openjdk.java.net/jdk/pull/8468.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8468/head:pull/8468 PR: https://git.openjdk.java.net/jdk/pull/8468 From duke at openjdk.java.net Thu May 12 15:52:26 2022 From: duke at openjdk.java.net (Emanuel Peter) Date: Thu, 12 May 2022 15:52:26 GMT Subject: RFR: 8283775: VM support for graph querying in debugger with BFS traversal and node filtering [v8] In-Reply-To: References: Message-ID: > I implemented VM support for BFS traversal of IR nodes, where one can filter for input/output edges, and the node-type (control / data / memory). If one specifies the `target` node, we find a shortest path between `this` and `target`. With the `options` string one can easily select which node types to traverse. > > `void Node::print_bfs(const uint max_distance, Node* target, const char* options)` > > While a sufficient summary and description is in the comments above the function definition, I want to explain some useful use-cases here. > > Please let me know if you would find this helpful, or if you have any feedback to improve it. > Thanks, Emanuel > > **1. 
Better dump()** > The function is similar to `dump()`, we can also follow input / output edges up to a certain distance. There are a few improvements/added features: > > 1. Display the distance to the specified node. This way I can quickly assess how many nodes I have at a certain distance, and I can traverse edges from one distance to the next. The parent column shows the node one step closer to the BFS root (this). > 2. Node filtering with types taken from `type->category()` help one limit traversal to control, data or memory flow. The categories are the same as in IGV. > 3. Probably controversial: coloring of node types in terminal. By default it is off. May not work on all terminals. But I find it very useful to quickly separate memory, data and control - just as in IGV - but in my terminal! > 4. After matching, we create Mach nodes from the IR nodes. For most Mach nodes, there is an associated old node. Often an assert is triggered for Mach nodes, but the graph already breaks in the IR graph. It is thus very helpful to have the old node idx displayed. > 5. I also display the head node idx of the block a Mach node is scheduled for. This can be helpful if one gets bad dominance asserts, and needs to see why the blocks were scheduled in a bad way. > > Example: > > (rr) p find_node(35)->print_bfs(2, 0, "cdmox+") > No target: perform BFS. > dis par c dump > --------------------------------------------- > 0 35 d 35 CmpP === _ 34 25 [[ 36 ]] > 1 35 d 34 LoadP === _ 31 33 [[ 35 ]] > 1 35 d 25 ConP === 0 [[ 26 27 31 35 41 ]] #NULL > 2 34 m 31 StoreP === 20 27 29 25 [[ 23 34 41 42 ]] > 2 34 d 33 AddP === _ 1 12 32 [[ 34 ]] > > > Example with Mach nodes: > > (rr) p ctrl->print_bfs(4, 0, "cdmox+OB") > No target: perform BFS. 
> dis [head idom d] old par c dump > --------------------------------------------- > 0 159 147 6 _ 159 c 159 Region === 159 57 [[ 159 158 59 ]] > 1 147 148 5 o183 159 c 57 IfTrue === 8 [[ 159 ]] > 2 147 148 5 o182 57 c 8 jmpConU === 147 9 [[ 7 57 ]] > 3 147 148 5 _ 8 c 147 Region === 147 14 [[ 147 8 ]] > 3 147 148 5 o180 8 d 9 compUL_rReg === _ 10 13 [[ 8 ]] > 4 148 149 4 o174 147 c 14 IfTrue === 15 [[ 147 ]] > 4 147 148 5 o203 9 d 10 decL_rReg === _ 11 [[ 12 9 ]] > 4 147 148 5 o179 9 d 13 convI2L_reg_reg === _ 28 [[ 9 ]] > > > **2. Find loop body** > When I find myself in a loop, I try to localize the loop head and end, and map out at least one path between them. > `loop_end->print_bfs(20, loop_head, "cox+")` > This provides us with a shortest path, given this path has a distance of at most 20. > > Example: > > (rr) p find_node(158)->print_bfs(20, find_node(160), "cox+") > Find shortest path: 158 -> 160. > > Backtrace target. > dis c dump > --------------------------------------------- > 9 c 160 OuterStripMinedLoop === 160 339 159 [[ 160 358 ]] > 8 c 358 CountedLoop === 358 160 143 [[ 358 362 363 ]] > 7 c 363 If === 358 351 [[ 364 367 ]] > 6 c 364 IfTrue === 363 [[ 128 ]] > 5 c 128 If === 364 127 [[ 129 130 ]] > 4 c 129 IfTrue === 128 [[ 155 ]] > 3 c 155 CountedLoopEnd === 129 154 [[ 157 143 ]] [lt] > 2 c 157 IfFalse === 155 [[ 162 163 ]] > 1 c 162 SafePoint === 157 1 7 1 1 163 100 1 1 13 27 133 [[ 158 ]] > 0 c 158 OuterStripMinedLoopEnd === 162 156 [[ 159 227 ]] > > Example with Mach nodes: > > (rr) p ctrl->print_bfs(10, val, "cdmox-+OB") > Find shortest path: 159 -> 27. > > Backtrace target. > dis [head idom d] old e c dump > --------------------------------------------- > 2 24 1 2 o10 + d 27 MachProj === 24 [[ 19 28 4 59 95 99 118 ]] > 1 56 159 7 o239 - d 59 loadB === 159 29 27 60 [[ 55 ]] > 0 159 147 6 _ c 159 Region === 159 57 [[ 159 158 59 ]] Emanuel Peter has updated the pull request with a new target base due to a merge or a rebase. 
The pull request now contains 11 commits: - Merge master - refactoring into class, no lambdas - missed in last commit - fix int / long issue, and improve idx printing for paths - kill trailing whitespaces - refactoring, and making edge its own colunm - small refactoring and some beautification/comments - refactored print_bfs to be member function of Node - some white spaces fixed - fixing up root in shortest path backtracking - ... and 1 more: https://git.openjdk.java.net/jdk/compare/82aa0455...4bb56ba4 ------------- Changes: https://git.openjdk.java.net/jdk/pull/8468/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=8468&range=07 Stats: 322 lines in 2 files changed: 322 ins; 0 del; 0 mod Patch: https://git.openjdk.java.net/jdk/pull/8468.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8468/head:pull/8468 PR: https://git.openjdk.java.net/jdk/pull/8468 From jvernee at openjdk.java.net Thu May 12 15:58:04 2022 From: jvernee at openjdk.java.net (Jorn Vernee) Date: Thu, 12 May 2022 15:58:04 GMT Subject: RFR: 8283689: Update the foreign linker VM implementation [v15] In-Reply-To: References: Message-ID: > Hi, > > This PR updates the VM implementation of the foreign linker, by bringing over commits from the panama-foreign repo. > > This is split off from the main JEP integration for 19, since we have limited resources to handle this. As such, this PR might fall over to 20, but it would be nice if we could get it into 19. > > I've written up an overview of the Linker architecture here: http://cr.openjdk.java.net/~jvernee/docs/FL_Overview.html (it might be useful to read that first). > > This patch moves from the "legacy" implementation, to what is currently implemented in the panama-foreign repo, except for replacing the use of method handle combinators with ASM. That will come in a later patch. To recap, this PR contains the following changes: > > 1. VM stubs for downcalls are now generated up front, instead of lazily by C2 [1]. > 2.
The VM support for upcalls/downcalls now supports all possible call shapes, and the VM stubs and Java code implementing the buffered invocation strategy have been removed [2], [3], [4], [5]. > 3. The existing C2 intrinsification support for the `linkToNative` method handle linker was no longer needed and has been removed [6] (support might be re-added in another form later). > 4. Some other cleanups, such as: OptimizedEntryBlob (for upcalls) now implements RuntimeBlob directly. Binding to Java classes has been rewritten to use javaClasses.h/cpp (this wasn't previously possible due to these Java classes being in an incubator module) [7], [8], [9]. > > While the patch mostly consists of VM changes, there are also some Java changes to support (2). > > The original commit structure has been mostly retained, so it might be useful to look at a specific commit, or the corresponding patch in the [panama-foreign](https://github.com/openjdk/panama-foreign/pulls?q=is%3Apr) repo as well. I've also left some inline comments to explain some of the changes, which will hopefully make reviewing easier.
> > Testing: Tier1-4 > > Thanks, > Jorn > > [1]: https://github.com/openjdk/jdk/pull/7959/commits/048b88156814579dca1f70742061ad24942fd358 > [2]: https://github.com/openjdk/jdk/pull/7959/commits/2fbbef472b4c2b4fee5ede2f18cd81ab61e88f49 > [3]: https://github.com/openjdk/jdk/pull/7959/commits/8a957a4ed9cc8d1f708ea8777212eb51ab403dc3 > [4]: https://github.com/openjdk/jdk/pull/7959/commits/35ba1d964f1de4a77345dc58debe0565db4b0ff3 > [5]: https://github.com/openjdk/jdk/pull/7959/commits/4e72aae22920300c5ffa16fed805b62ed9092120 > [6]: https://github.com/openjdk/jdk/pull/7959/commits/08e22e1b468c5c8f0cfd7135c72849944068aa7a > [7]: https://github.com/openjdk/jdk/pull/7959/commits/451cd9edf54016c182dab21a8b26bd8b609fc062 > [8]: https://github.com/openjdk/jdk/pull/7959/commits/4c851d2795afafec3a3ab17f4142ee098692068f > [9]: https://github.com/openjdk/jdk/pull/7959/commits/d025377799424f31512dca2ffe95491cd5ae22f9 Jorn Vernee has updated the pull request incrementally with two additional commits since the last revision: - Merge branch 'JEP-19-VM-IMPL2' of https://github.com/JornVernee/jdk into JEP-19-VM-IMPL2 - Fix overwritten copyright years. 
------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/7959/files - new: https://git.openjdk.java.net/jdk/pull/7959/files/0f49ff0b..aab2d15c Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=7959&range=14 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=7959&range=13-14 Stats: 36 lines in 33 files changed: 4 ins; 0 del; 32 mod Patch: https://git.openjdk.java.net/jdk/pull/7959.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/7959/head:pull/7959 PR: https://git.openjdk.java.net/jdk/pull/7959 From jvernee at openjdk.java.net Thu May 12 16:01:17 2022 From: jvernee at openjdk.java.net (Jorn Vernee) Date: Thu, 12 May 2022 16:01:17 GMT Subject: RFR: 8283689: Update the foreign linker VM implementation [v16] In-Reply-To: References: Message-ID: <1WJgr8j7aHe_iD2t5KwVDk3c-raW7Q0NEbaSNfeo3zA=.5b14ab8c-b2a3-4fba-8209-643228a8b85d@github.com> > Hi, > > This PR updates the VM implementation of the foreign linker, by bringing over commits from the panama-foreign repo. > > This is split off from the main JEP integration for 19, since we have limited resources to handle this. As such, this PR might fall over to 20, but it would be nice if we could get it into 19. > > I've written up an overview of the Linker architecture here: http://cr.openjdk.java.net/~jvernee/docs/FL_Overview.html (it might be useful to read that first). > > This patch moves from the "legacy" implementation, to what is currently implemented in the panama-foreign repo, except for replacing the use of method handle combinators with ASM. That will come in a later patch. To recap, this PR contains the following changes: > > 1. VM stubs for downcalls are now generated up front, instead of lazily by C2 [1]. > 2. The VM support for upcalls/downcalls now supports all possible call shapes, and the VM stubs and Java code implementing the buffered invocation strategy have been removed [2], [3], [4], [5]. > 3.
The existing C2 intrinsification support for the `linkToNative` method handle linker was no longer needed and has been removed [6] (support might be re-added in another form later). > 4. Some other cleanups, such as: OptimizedEntryBlob (for upcalls) now implements RuntimeBlob directly. Binding to java classes has been rewritten to use javaClasses.h/cpp (this wasn't previously possible due to these java classes being in an incubator module) [7], [8], [9]. > > While the patch mostly consists of VM changes, there are also some Java changes to support (2). > > The original commit structure has been mostly retained, so it might be useful to look at a specific commit, or the corresponding patch in the [panama-foreign](https://github.com/openjdk/panama-foreign/pulls?q=is%3Apr) repo as well. I've also left some inline comments to explain some of the changes, which will hopefully make reviewing easier. > > Testing: Tier1-4 > > Thanks, > Jorn > > [1]: https://github.com/openjdk/jdk/pull/7959/commits/048b88156814579dca1f70742061ad24942fd358 > [2]: https://github.com/openjdk/jdk/pull/7959/commits/2fbbef472b4c2b4fee5ede2f18cd81ab61e88f49 > [3]: https://github.com/openjdk/jdk/pull/7959/commits/8a957a4ed9cc8d1f708ea8777212eb51ab403dc3 > [4]: https://github.com/openjdk/jdk/pull/7959/commits/35ba1d964f1de4a77345dc58debe0565db4b0ff3 > [5]: https://github.com/openjdk/jdk/pull/7959/commits/4e72aae22920300c5ffa16fed805b62ed9092120 > [6]: https://github.com/openjdk/jdk/pull/7959/commits/08e22e1b468c5c8f0cfd7135c72849944068aa7a > [7]: https://github.com/openjdk/jdk/pull/7959/commits/451cd9edf54016c182dab21a8b26bd8b609fc062 > [8]: https://github.com/openjdk/jdk/pull/7959/commits/4c851d2795afafec3a3ab17f4142ee098692068f > [9]: https://github.com/openjdk/jdk/pull/7959/commits/d025377799424f31512dca2ffe95491cd5ae22f9 Jorn Vernee has updated the pull request incrementally with one additional commit since the last revision: Undo spurious changes. 
------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/7959/files - new: https://git.openjdk.java.net/jdk/pull/7959/files/aab2d15c..f961121a Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=7959&range=15 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=7959&range=14-15 Stats: 4 lines in 1 file changed: 0 ins; 4 del; 0 mod Patch: https://git.openjdk.java.net/jdk/pull/7959.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/7959/head:pull/7959 PR: https://git.openjdk.java.net/jdk/pull/7959
From kvn at openjdk.java.net Thu May 12 16:10:42 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Thu, 12 May 2022 16:10:42 GMT Subject: RFR: 8284404: Too aggressive sweeping with Loom In-Reply-To: References: Message-ID:
On Thu, 12 May 2022 07:30:39 GMT, Erik Österlund wrote:
> The normal sweeping heuristics trigger sweeping whenever 0.5% of the reserved code cache could have died. Normally that is fine, but with Loom such sweeping requires a full GC cycle, as stacks can now be in the Java heap as well. In that context, 0.5% does seem to be a bit too trigger happy. So this patch adjusts that default when using Loom to 10x higher.
> If you run something like Jython, which spins up a lot of code, it unsurprisingly triggers far fewer GCs due to code cache pressure.

Did you run our regular performance testing with Loom to see how this change affects performance? Why 10x and not some other number?
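
For a sense of scale, the two thresholds discussed above can be worked out with a few lines of arithmetic. This is only a sketch: the 240 MB reserved code cache size is an assumed figure that is not stated in this thread, `SweepThresholdSketch` and `thresholdBytes` are hypothetical names rather than HotSpot code, and the 5% value is simply 10x the 0.5% default quoted above.

```java
class SweepThresholdSketch {
    // Number of bytes of code that may have died before a sweep is triggered,
    // for a given reserved code cache size and a threshold in percent.
    static long thresholdBytes(long reservedBytes, double percent) {
        return (long) (reservedBytes * percent / 100.0);
    }

    public static void main(String[] args) {
        // Assumption for illustration only: a 240 MB reserved code cache.
        long reserved = 240L * 1024 * 1024;

        // 0.5% is the default quoted in the thread; 5% is that default times 10.
        long normal = thresholdBytes(reserved, 0.5);
        long withLoom = thresholdBytes(reserved, 5.0);

        System.out.println("default threshold:   " + normal + " bytes");
        System.out.println("threshold with Loom: " + withLoom + " bytes");
    }
}
```

Under that assumption, the change moves the trigger from roughly a megabyte of possibly dead code to roughly twelve megabytes, which is the trade-off the question about "why 10x" is probing.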
------------- PR: https://git.openjdk.java.net/jdk/pull/8673
From psandoz at openjdk.java.net Thu May 12 16:11:58 2022 From: psandoz at openjdk.java.net (Paul Sandoz) Date: Thu, 12 May 2022 16:11:58 GMT Subject: RFR: 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature [v3] In-Reply-To: References: Message-ID:
On Thu, 5 May 2022 08:56:07 GMT, Xiaohong Gong wrote:
>> Currently, a masked vector load whose index range falls outside the array bounds is implemented with pure Java scalar code to avoid the IOOBE (IndexOutOfBoundsException). This is necessary for architectures that do not support the predicate feature, because there the masked load is implemented as a full vector load followed by a vector blend, and the full vector load would throw an IOOBE, which is not valid. However, for architectures that support the predicate feature, such as SVE/AVX-512/RVV, it can be vectorized with the predicated load instruction as long as the indexes of the masked lanes are within the bounds of the array. On these architectures, loading does not raise an exception for the unmasked lanes.
>>
>> This patch adds the vectorization support for the masked load with IOOBE part. Please see the original Java implementation (FIXME: optimize):
>>
>>     @ForceInline
>>     public static
>>     ByteVector fromArray(VectorSpecies<Byte> species,
>>                          byte[] a, int offset,
>>                          VectorMask<Byte> m) {
>>         ByteSpecies vsp = (ByteSpecies) species;
>>         if (offset >= 0 && offset <= (a.length - species.length())) {
>>             return vsp.dummyVector().fromArray0(a, offset, m);
>>         }
>>
>>         // FIXME: optimize
>>         checkMaskFromIndexSize(offset, vsp, m, 1, a.length);
>>         return vsp.vOp(m, i -> a[offset + i]);
>>     }
>>
>> Since it can only be vectorized with the predicated load, HotSpot must check whether the current backend supports it and fall back to the Java scalar version if not.
This is different from the normal masked vector load, for which the compiler generates a full vector load and a vector blend if the predicated load is not supported. So, to let the compiler take the expected action, an additional flag (i.e. `usePred`) is added to the existing "loadMasked" intrinsic, with the value "true" for the IOOBE part and "false" for the normal load. The compiler will fail to intrinsify if the flag is "true" and the predicated load is not supported by the backend, which means that the normal Java path will be executed.
>>
>> Also adds the same vectorization support for masked:
>> - fromByteArray/fromByteBuffer
>> - fromBooleanArray
>> - fromCharArray
>>
>> The performance of the newly added benchmarks improves by about `1.88x ~ 30.26x` on the x86 AVX-512 system:
>>
>> Benchmark                                            Before     After      Units
>> LoadMaskedIOOBEBenchmark.byteLoadArrayMaskIOOBE      737.542    1387.069   ops/ms
>> LoadMaskedIOOBEBenchmark.doubleLoadArrayMaskIOOBE    118.366    330.776    ops/ms
>> LoadMaskedIOOBEBenchmark.floatLoadArrayMaskIOOBE     233.832    6125.026   ops/ms
>> LoadMaskedIOOBEBenchmark.intLoadArrayMaskIOOBE       233.816    7075.923   ops/ms
>> LoadMaskedIOOBEBenchmark.longLoadArrayMaskIOOBE      119.771    330.587    ops/ms
>> LoadMaskedIOOBEBenchmark.shortLoadArrayMaskIOOBE     431.961    939.301    ops/ms
>>
>> A similar performance gain can also be observed on a 512-bit SVE system.
>
> Xiaohong Gong has updated the pull request incrementally with one additional commit since the last revision:
>
> Rename "use_predicate" to "needs_predicate"

Yes, the tests were run in debug mode.
The reporting of the missing constant occurs for the compiled method that is called from the method where the constants are declared, e.g.:

    719  240    b        jdk.incubator.vector.Int256Vector::fromArray0 (15 bytes)
      ** Rejected vector op (LoadVectorMasked,int,8) because architecture does not support it
      ** missing constant: offsetInRange=Parm
                            @ 11   jdk.incubator.vector.IntVector::fromArray0Template (22 bytes)   force inline by annotation

So it appears to be working as expected. A similar pattern occurs at a lower level for the passing of the mask class. `Int256Vector::fromArray0` passes a constant class to `IntVector::fromArray0Template` (the compilation of which bails out before checking that the `offsetInRange` is constant).
------------- PR: https://git.openjdk.java.net/jdk/pull/8035
From aph at openjdk.java.net Thu May 12 16:13:43 2022 From: aph at openjdk.java.net (Andrew Haley) Date: Thu, 12 May 2022 16:13:43 GMT Subject: RFR: 8286596: AArch64: -XX:UseBranchProtection=pac-ret crashes after JDK-8284161 In-Reply-To: References: Message-ID:
On Thu, 12 May 2022 14:52:18 GMT, Nick Gasson wrote:
> `RegisterSaver::restore_live_registers()` used to call `__ leave()` but after the Loom integration it directly pops LR/FP from the stack. With `-XX:UseBranchProtection=pac-ret` we need a call to `__ authenticate_return_address()` here to insert the AUTIA instruction to check and strip the PAC code from the saved LR.
>
> Tested `java -XX:UseBranchProtection=pac-ret -version` on a machine that supports PAC, plus tier1. Note that some additional fixes will be required to support virtual threads with PAC enabled.

Marked as reviewed by aph (Reviewer).
------------- PR: https://git.openjdk.java.net/jdk/pull/8682 From duke at openjdk.java.net Thu May 12 16:26:41 2022 From: duke at openjdk.java.net (openjdk-notifier[bot]) Date: Thu, 12 May 2022 16:26:41 GMT Subject: RFR: 8283689: Update the foreign linker VM implementation [v16] In-Reply-To: <1WJgr8j7aHe_iD2t5KwVDk3c-raW7Q0NEbaSNfeo3zA=.5b14ab8c-b2a3-4fba-8209-643228a8b85d@github.com> References: <1WJgr8j7aHe_iD2t5KwVDk3c-raW7Q0NEbaSNfeo3zA=.5b14ab8c-b2a3-4fba-8209-643228a8b85d@github.com> Message-ID: On Thu, 12 May 2022 16:01:17 GMT, Jorn Vernee wrote: >> Hi, >> >> This PR updates the VM implementation of the foreign linker, by bringing over commits from the panama-foreign repo. >> >> This is split off from the main JEP integration for 19, since we have limited resources to handle this. As such, this PR might fall over to 20, but it would be nice if we could get it into 19. >> >> I've written up an overview of the Linker architecture here: http://cr.openjdk.java.net/~jvernee/docs/FL_Overview.html (it might be useful to read that first). >> >> This patch moves from the "legacy" implementation, to what is currently implemented in the panama-foreign repo, except for replacing the use of method handle combinators with ASM. That will come in a later patch. To recap, this PR contains the following changes: >> >> 1. VM stubs for downcalls are now generated up front, instead of lazily by C2 [1]. >> 2. The VM support for upcalls/downcalls now supports all possible call shapes, and the VM stubs and Java code implementing the buffered invocation strategy have been removed [2], [3], [4], [5]. >> 3. The existing C2 intrinsification support for the `linkToNative` method handle linker was no longer needed and has been removed [6] (support might be re-added in another form later). >> 4. Some other cleanups, such as: OptimizedEntryBlob (for upcalls) now implements RuntimeBlob directly.
Binding to java classes has been rewritten to use javaClasses.h/cpp (this wasn't previously possible due to these java classes being in an incubator module) [7], [8], [9]. >> >> While the patch mostly consists of VM changes, there are also some Java changes to support (2). >> >> The original commit structure has been mostly retained, so it might be useful to look at a specific commit, or the corresponding patch in the [panama-foreign](https://github.com/openjdk/panama-foreign/pulls?q=is%3Apr) repo as well. I've also left some inline comments to explain some of the changes, which will hopefully make reviewing easier. >> >> Testing: Tier1-4 >> >> Thanks, >> Jorn >> >> [1]: https://github.com/openjdk/jdk/pull/7959/commits/048b88156814579dca1f70742061ad24942fd358 >> [2]: https://github.com/openjdk/jdk/pull/7959/commits/2fbbef472b4c2b4fee5ede2f18cd81ab61e88f49 >> [3]: https://github.com/openjdk/jdk/pull/7959/commits/8a957a4ed9cc8d1f708ea8777212eb51ab403dc3 >> [4]: https://github.com/openjdk/jdk/pull/7959/commits/35ba1d964f1de4a77345dc58debe0565db4b0ff3 >> [5]: https://github.com/openjdk/jdk/pull/7959/commits/4e72aae22920300c5ffa16fed805b62ed9092120 >> [6]: https://github.com/openjdk/jdk/pull/7959/commits/08e22e1b468c5c8f0cfd7135c72849944068aa7a >> [7]: https://github.com/openjdk/jdk/pull/7959/commits/451cd9edf54016c182dab21a8b26bd8b609fc062 >> [8]: https://github.com/openjdk/jdk/pull/7959/commits/4c851d2795afafec3a3ab17f4142ee098692068f >> [9]: https://github.com/openjdk/jdk/pull/7959/commits/d025377799424f31512dca2ffe95491cd5ae22f9 > > Jorn Vernee has updated the pull request incrementally with one additional commit since the last revision: > > Undo spurious changes. The dependent pull request has now been integrated, and the target branch of this pull request has been updated. This means that changes from the dependent pull request can start to show up as belonging to this pull request, which may be confusing for reviewers. 
To remedy this situation, simply merge the latest changes from the new target branch into this pull request by running commands similar to these in the local repository for your personal fork:

git checkout JEP-19-VM-IMPL2
git fetch https://git.openjdk.java.net/jdk master
git merge FETCH_HEAD
# if there are conflicts, follow the instructions given by git merge
git commit -m "Merge master"
git push

------------- PR: https://git.openjdk.java.net/jdk/pull/7959 From jvernee at openjdk.java.net Thu May 12 16:58:36 2022 From: jvernee at openjdk.java.net (Jorn Vernee) Date: Thu, 12 May 2022 16:58:36 GMT Subject: RFR: 8283689: Update the foreign linker VM implementation [v17] In-Reply-To: References: Message-ID: > Hi, > > This PR updates the VM implementation of the foreign linker, by bringing over commits from the panama-foreign repo. > > This is split off from the main JEP integration for 19, since we have limited resources to handle this. As such, this PR might fall over to 20, but it would be nice if we could get it into 19. > > I've written up an overview of the Linker architecture here: http://cr.openjdk.java.net/~jvernee/docs/FL_Overview.html it might be useful to read that first. > > This patch moves from the "legacy" implementation, to what is currently implemented in the panama-foreign repo, except for replacing the use of method handle combinators with ASM. That will come in a later patch. To recap, this PR contains the following changes: > > 1. VM stubs for downcalls are now generated up front, instead of lazily by C2 [1]. > 2. The VM support for upcalls/downcalls now supports all possible call shapes, and the VM stubs and Java code implementing the buffered invocation strategy have been removed [2], [3], [4], [5]. > 3. The existing C2 intrinsification support for the `linkToNative` method handle linker was no longer needed and has been removed [6] (support might be re-added in another form later). > 4.
Some other cleanups, such as: OptimizedEntryBlob (for upcalls) now implements RuntimeBlob directly. Binding to java classes has been rewritten to use javaClasses.h/cpp (this wasn't previously possible due to these java classes being in an incubator module) [7], [8], [9]. > > While the patch mostly consists of VM changes, there are also some Java changes to support (2). > > The original commit structure has been mostly retained, so it might be useful to look at a specific commit, or the corresponding patch in the [panama-foreign](https://github.com/openjdk/panama-foreign/pulls?q=is%3Apr) repo as well. I've also left some inline comments to explain some of the changes, which will hopefully make reviewing easier. > > Testing: Tier1-4 > > Thanks, > Jorn > > [1]: https://github.com/openjdk/jdk/pull/7959/commits/048b88156814579dca1f70742061ad24942fd358 > [2]: https://github.com/openjdk/jdk/pull/7959/commits/2fbbef472b4c2b4fee5ede2f18cd81ab61e88f49 > [3]: https://github.com/openjdk/jdk/pull/7959/commits/8a957a4ed9cc8d1f708ea8777212eb51ab403dc3 > [4]: https://github.com/openjdk/jdk/pull/7959/commits/35ba1d964f1de4a77345dc58debe0565db4b0ff3 > [5]: https://github.com/openjdk/jdk/pull/7959/commits/4e72aae22920300c5ffa16fed805b62ed9092120 > [6]: https://github.com/openjdk/jdk/pull/7959/commits/08e22e1b468c5c8f0cfd7135c72849944068aa7a > [7]: https://github.com/openjdk/jdk/pull/7959/commits/451cd9edf54016c182dab21a8b26bd8b609fc062 > [8]: https://github.com/openjdk/jdk/pull/7959/commits/4c851d2795afafec3a3ab17f4142ee098692068f > [9]: https://github.com/openjdk/jdk/pull/7959/commits/d025377799424f31512dca2ffe95491cd5ae22f9 Jorn Vernee has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 98 commits: - Merge branch 'master' into JEP-19-VM-IMPL2 - Undo spurious changes. 
- Merge branch 'JEP-19-VM-IMPL2' of https://github.com/JornVernee/jdk into JEP-19-VM-IMPL2 - Apply copyright year updates per request of @nick-arm Co-authored-by: Nick Gasson - Fix overwritten copyright years. - Missed 2 years - Update Oracle copyright years - Revert "Block async exceptions during upcalls" This reverts commit b29ad8f46732666f2d07e63ce8701b1eb7bed790. - Block async exceptions during upcalls - Merge branch 'foreign-preview-m' into JEP-19-VM-IMPL2 - ... and 88 more: https://git.openjdk.java.net/jdk/compare/2c5d1362...f55b6c59 ------------- Changes: https://git.openjdk.java.net/jdk/pull/7959/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=7959&range=16 Stats: 6913 lines in 155 files changed: 2576 ins; 3219 del; 1118 mod Patch: https://git.openjdk.java.net/jdk/pull/7959.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/7959/head:pull/7959 PR: https://git.openjdk.java.net/jdk/pull/7959 From duke at openjdk.java.net Thu May 12 17:08:20 2022 From: duke at openjdk.java.net (brianjstafford) Date: Thu, 12 May 2022 17:08:20 GMT Subject: RFR: 8263075: C2: simplify anti-dependence check in PhaseCFG::implicit_null_check() Message-ID: The reporter for this issue (https://bugs.openjdk.java.net/browse/JDK-8263075) indicated that there's an assumption that we can rely on that the while loop in question will run exactly one time. Based on this, I've done the following: - Asserted the condition that makes sure it runs at least once - Asserted the condition that makes sure it runs only once - Removed the `while` loop - Changed a couple of `break` statements into `continue` statements. They no longer need to break out of the `while` loop, now that it's gone. However, they were early exits from the `while` loop that ended up resulting in `continue` statements for the larger enclosing loop. Thus we can just call `continue` directly. - Removed the local variable `b`, as we no longer need to traverse the node hierarchy. We can use `mb` directly. 
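As a rough illustration of the refactoring listed above (not the actual `PhaseCFG::implicit_null_check()` code), the sketch below shows how a walk that is known to take exactly one step can be replaced by two assertions and straight-line code. The `Block` struct, its fields, and both function names are hypothetical stand-ins invented for this example.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for a CFG block; this is not HotSpot's Block class.
struct Block {
    Block* pred;                // single predecessor, nullptr if none
    bool has_conflicting_load;  // pretend anti-dependence hazard
};

// Before: walk predecessors from mb until reaching target. An early `break`
// leaves b != target, which the caller turns into "skip this candidate".
int count_hoistable_with_loop(const std::vector<Block*>& candidates, Block* target) {
    int hoistable = 0;
    for (Block* mb : candidates) {
        Block* b = mb;
        while (b != target) {
            if (b->has_conflicting_load) break;  // hazard found
            if (b->pred == nullptr) break;       // cannot walk further
            b = b->pred;
        }
        if (b != target) continue;               // early exit means skip
        ++hoistable;
    }
    return hoistable;
}

// After: the walk is assumed to take exactly one step, so the loop and the
// cursor `b` disappear and the hazard check uses `continue` directly.
int count_hoistable_without_loop(const std::vector<Block*>& candidates, Block* target) {
    int hoistable = 0;
    for (Block* mb : candidates) {
        assert(mb != target);        // the loop would have run at least once
        assert(mb->pred == target);  // ...and only once
        if (mb->has_conflicting_load) continue;
        ++hoistable;
    }
    return hoistable;
}
```

Under the single-step assumption both versions accept and reject the same candidates; the second simply states the assumption explicitly instead of encoding it in a loop.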
Passes jdk, langtools, and hotspot Tier 1 tests on Linux (x64 and ARM64) and macOS (x64 and ARM64). Most Tier 1 tests pass on Windows (x64 and ARM64), but there are a handful of failures unrelated to this change. ------------- Commit messages: - Removed whitespace - This change simplifies the anti-dependence check in PhaseCFG::implicit_null_check(). JDK-8263075 Changes: https://git.openjdk.java.net/jdk/pull/8684/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=8684&range=00 Issue: https://bugs.openjdk.java.net/browse/JDK-8263075 Stats: 22 lines in 1 file changed: 3 ins; 0 del; 19 mod Patch: https://git.openjdk.java.net/jdk/pull/8684.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8684/head:pull/8684 PR: https://git.openjdk.java.net/jdk/pull/8684 From jvernee at openjdk.java.net Thu May 12 17:30:00 2022 From: jvernee at openjdk.java.net (Jorn Vernee) Date: Thu, 12 May 2022 17:30:00 GMT Subject: RFR: 8283689: Update the foreign linker VM implementation [v7] In-Reply-To: References: <1Y6gupd3SigdhlFjonKu4x4z-0mwPwCc-LJsqt6pp4c=.201b0c49-4887-4440-8526-31420fa3dda3@github.com> <_lwAL7Yg4Rr98gmWeQisR1ioc8MkVK87npZEUbB4vOw=.6434e6a4-35b9-4f81-9df3-d71973f1d75e@github.com> <9J0HneQ8kNy0t1-JDUQsXzoj4ljYwg80jiespX8laL8=.c02f9a8e-04b5-40db-8024-cdec556fcc53@github.com> <95e2d32uLxJbWldoqsr9yAoT3LD8Yyd6cLmnFuvSEOI=.4e961828-6086-4c63-9bc3-6bb60f8a5931@github.com> Message-ID: <9NhIJsBLpV42NNz7rjhBu_cEvljMy1KIAA7IdTz1aGM=.1ac390e6-4834-4616-b85d-fda842c8e4fa@github.com> On Thu, 12 May 2022 12:10:53 GMT, Maurizio Cimadamore wrote: >> Okay, I see. I think I acted a little too hastily on this yesterday. I'll revert the change that uses this blocking mechanism. 
>>
>> The stack more or less looks like this during an upcall:
>>
>> | ---
>> | |
>> | ---
>> | <1: user-defined try block with exception handler (maybe)> |
>> | ---
>> | <2: user code start> |
>> | ---
>> | <3: method handle impl frames 1> |
>> | ---
>> | <4: upcall wrapper class with fallback handler 1> |
>> | ---
>> | <5: method handle impl frames 2> |
>> | ---
>> | <6: upcall stub with fallback handler 2> |
>> | <7: unknown native code>
>> | ---
>> | |
>>
>> I think there are several options to address async exceptions:
>> 1. Do nothing special for async exceptions. i.e. if they happen anywhere between 1. and 6. they will end up in one of the fallback handlers and the VM will be terminated.
>> 2. Block async exceptions in all code up from 6.
>> 3. Somehow block async exceptions only between 6. and 1.
>> I think that is possible by changing the API so that the user passes us a method handle to their fallback exception handler, and we invoke it in our code in 4. We would need 2 methods for blocking and unblocking async exceptions from Java. Then we could disable async exceptions at the start of 6., enable them at the start of the try block in 4. (around the call to user code), and disable them at the end of this try block. Then finally re-enable them at the end of 6. If an exception occurs in the try block in 4., delegate to the user-defined exception handler (but use the VM terminate strategy as a fallback for when another exception occurs). The other problem I see with that is that to make that fast enough (i.e. not incur a ~1.5-2x cost on call overhead), we would need compiler intrinsics for the blocking/unblocking, and in the past we've been unable to define 'critical' sections of code like that in C2 (it's an unsolved problem at this point).
>
>> Do nothing special for async exceptions. i.e. if they happen anywhere between 1. and 6. they will end up in one of the fallback handlers and the VM will be terminated.
> > My understanding is that if they materialize while we're executing the upcall Java code, if that code has a try/catch block, we will go there, rather than crash the VM. > > In other words, IMHO the only problem with async exceptions is if they occur _after_ the Java user code has completed, because that will crash the Java adapter, thus preventing it from returning to the native call cleanly. > > So, either we disable async exceptions during that phase (e.g. after user code has executed, but before we return back to native code), or we just punt and stop. Since this seems like a corner^3 case, and since there are also other issues with upcalls that can occur if other threads do not cooperate (e.g. an upcall can get stuck in an infinite safepoint if the VM exits while an async native thread runs the upcall), and given that obtaining a linker is a restricted operation anyway, I don't think we should bend over backwards to try to add 1% more safety to something that's unavoidably sharp anyways. Ok. Then, if no one objects, I will leave this area as-is for now (and perhaps come back to this issue in the future, if it becomes more pressing). ------------- PR: https://git.openjdk.java.net/jdk/pull/7959 From eosterlund at openjdk.java.net Thu May 12 17:37:00 2022 From: eosterlund at openjdk.java.net (Erik Österlund) Date: Thu, 12 May 2022 17:37:00 GMT Subject: RFR: 8284404: Too aggressive sweeping with Loom In-Reply-To: References: Message-ID: On Thu, 12 May 2022 16:07:11 GMT, Vladimir Kozlov wrote: > Did you run our regular performance testing with loom to see how this change affects performance? > > Why 10x and not another number? I tried to find a problematic workload where this is a real issue and manually found that jython compiles a lot of methods, yielding a lot of sweeper-triggered GCs. This new threshold remedied the problem so that only a few were triggered. I have however not run the wider perf suite. I can do that though.
------------- PR: https://git.openjdk.java.net/jdk/pull/8673 From kvn at openjdk.java.net Thu May 12 18:30:44 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Thu, 12 May 2022 18:30:44 GMT Subject: RFR: 8284404: Too aggressive sweeping with Loom In-Reply-To: References: Message-ID: On Thu, 12 May 2022 07:30:39 GMT, Erik Österlund wrote: > The normal sweeping heuristics trigger sweeping whenever 0.5% of the reserved code cache could have died. Normally that is fine, but with loom such sweeping requires a full GC cycle, as stacks can now be in the Java heap as well. In that context, 0.5% does seem to be a bit too trigger happy. So this patch adjusts that default when using loom to 10x higher. > If you run something like jython which spins up a lot of code, it unsurprisingly triggers a lot fewer GCs due to code cache pressure. Thank you for explaining where the number came from. I think `SweeperThreshold` should still be limited to some reasonable number. Otherwise this code may set it to `1000.0` (the flag's maximum allowed value is `100.0`). ------------- Changes requested by kvn (Reviewer). PR: https://git.openjdk.java.net/jdk/pull/8673 From xliu at openjdk.java.net Thu May 12 21:27:30 2022 From: xliu at openjdk.java.net (Xin Liu) Date: Thu, 12 May 2022 21:27:30 GMT Subject: RFR: 8286104: use aggressive liveness for unstable_if traps [v2] In-Reply-To: References: Message-ID: > I found that some phi nodes are useful only because their uses are uncommon_trap nodes. In the worst case, this hinders boxing object elimination and scalar replacement. Test.java of JDK-8286104 reveals this issue. This patch allows the C2 parser to collect liveness based on the next bci for unstable_if traps. In most cases, the next bci is closer to the exits, so the set of live locals shrinks. This helps to reduce the number of inputs of unstable_if traps. > > This is not a REDO of Optimization of Box nodes in uncommon_trap (JDK-8261137). The two patches are orthogonal.
I adapted the test from [TestEliminateBoxInDebugInfo.java](https://github.com/openjdk/jdk/pull/2401/files#diff-49b2e38825aa4c28ca196bdc70c3cbecc2e835c2899f4f393527df4796b177ea), so part of the credit goes to the original author. I found that scalar replacement can take care of the object `Integer ii = Integer.valueOf(value)` in the original test even if it can't be removed by the later inliner. I tweak the profiling data of Integer.valueOf() to hinder scalar replacement. > > This patch can cover the problem discovered by JDK-8261137. I ran the microbench and got a 9x speedup on x86_64. It's almost the same as JDK-8261137.
>
> Before:
>
> Benchmark               Mode  Cnt   Score   Error  Units
> MyBenchmark.testMethod  avgt   10  32.776 ± 0.075  us/op
>
> After:
>
> Benchmark               Mode  Cnt  Score   Error  Units
> MyBenchmark.testMethod  avgt   10  3.656 ± 0.006  us/op
>
> Testing > I ran tier1 test with and without `-XX:+DeoptimizeALot`. No regression has been found yet. Xin Liu has updated the pull request incrementally with 11 additional commits since the last revision: - revert code change from 1st revision. - Merge branch 'JDK-8276998' into JDK-8286104 - rule out if a If nodes has 2 branches of unstable_if trap. - change the flag to diagnostic. - add sanity check for operands if bc is if_acmp_eq/ne and ifnull/nonnull - fix release build - update unstable_if after igvn. - adjust unstable_if after fold_compares - disable comparison_folding temporarily. This feature not only folds two CMPI but also merge two uncommon_traps. it uses the dominating uncommon_trap and revaluate the two if in interpreter. currently, aggressiveliveness can't work for that. - retain bci for unstable_if - ...
and 1 more: https://git.openjdk.java.net/jdk/compare/2c38b87b...2f047457 ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/8545/files - new: https://git.openjdk.java.net/jdk/pull/8545/files/2c38b87b..2f047457 Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=8545&range=01 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=8545&range=00-01 Stats: 156 lines in 13 files changed: 111 ins; 24 del; 21 mod Patch: https://git.openjdk.java.net/jdk/pull/8545.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8545/head:pull/8545 PR: https://git.openjdk.java.net/jdk/pull/8545 From vlivanov at openjdk.java.net Thu May 12 23:59:46 2022 From: vlivanov at openjdk.java.net (Vladimir Ivanov) Date: Thu, 12 May 2022 23:59:46 GMT Subject: RFR: 8284960: Integration of JEP 426: Vector API (Fourth Incubator) [v3] In-Reply-To: <15GChtdthFmu9Cup-Ykj5NBvAanOC8QOJsnhH9g20KY=.f35eba31-15f9-40e8-95ce-a54049792840@github.com> References: <15GChtdthFmu9Cup-Ykj5NBvAanOC8QOJsnhH9g20KY=.f35eba31-15f9-40e8-95ce-a54049792840@github.com> Message-ID: On Tue, 10 May 2022 12:48:25 GMT, Jatin Bhateja wrote: >> Hi All, >> >> Patch adds the planned support for new vector operations and APIs targeted for [JEP 426: Vector API (Fourth Incubator).](https://bugs.openjdk.java.net/browse/JDK-8280173) >> >> Following is the brief summary of changes:- >> >> 1) Extends the scope of existing lanewise API for following new vector operations. >> - VectorOperations.BIT_COUNT: counts the number of one-bits >> - VectorOperations.LEADING_ZEROS_COUNT: counts the number of leading zero bits >> - VectorOperations.TRAILING_ZEROS_COUNT: counts the number of trailing zero bits >> - VectorOperations.REVERSE: reversing the order of bits >> - VectorOperations.REVERSE_BYTES: reversing the order of bytes >> - compress and expand bits: Semantics are based on Hacker's Delight section 7-4 Compress, or Generalized Extract. 
>> >> 2) Adds following new APIs to perform cross lane vector compress and expansion operations under the influence of a mask. >> - Vector.compress >> - Vector.expand >> - VectorMask.compress >> >> 3) Adds predicated and non-predicated versions of following new APIs to load and store the contents of vector from foreign MemorySegments. >> - Vector.fromMemorySegment >> - Vector.intoMemorySegment >> >> 4) C2 Compiler IR enhancements and optimized X86 and AARCH64 backend support for each newly added operation. >> >> >> Patch has been regressed over AARCH64 and X86 targets different AVX levels. >> >> Kindly review and share your feedback. >> >> Best Regards, >> Jatin > > Jatin Bhateja has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 11 commits: > > - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8284960 > - 8284960: Correcting a typo. > - 8284960: Integrating changes from panama-vector (Add @since 19 tags). > - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8284960 > - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8284960 > - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8284960 > - 8284960: AARCH64 backend changes. > - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8284960 > - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8284960 > - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8284960 > - ... and 1 more: https://git.openjdk.java.net/jdk/compare/3fa1c404...b021e082 Overall, looks good. Some minor questions/suggestions follow. src/hotspot/cpu/aarch64/aarch64_neon.ad line 5700: > 5698: as_FloatRegister($dst$$reg)); > 5699: } > 5700: if (bt == T_INT) { I find it hard to reason about the code in its current form. Maybe make the second `if` (`bt == T_INT`) nested and move it under `if (bt == T_SHORT || bt == T_INT)`? 
src/hotspot/cpu/x86/macroAssembler_x86.cpp line 2587:

> 2585:
> 2586: void MacroAssembler::vmovdqu(XMMRegister dst, AddressLiteral src, Register scratch_reg, int vector_len) {
> 2587: assert(vector_len <= AVX_512bit, "unexpected vector length");

The assert becomes redundant.

src/hotspot/cpu/x86/matcher_x86.hpp line 195:

> 193: case Op_PopCountVI:
> 194: return ((ety == T_INT && VM_Version::supports_avx512_vpopcntdq()) ||
> 195: (is_subword_type(ety) && VM_Version::supports_avx512_bitalg())) ? 0 : 50;

Should be easier to read when the condition is split. E.g.:

    if (is_subword_type(ety)) {
      return VM_Version::supports_avx512_bitalg() ? 0 : 50;
    } else {
      assert(ety == T_INT, "sanity"); // for documentation purposes
      return VM_Version::supports_avx512_vpopcntdq() ? 0 : 50;
    }

src/hotspot/cpu/x86/stubGenerator_x86_64.cpp line 7953:

> 7951: StubRoutines::x86::_vector_iota_indices = generate_iota_indices("iota_indices");
> 7952:
> 7953: if (UsePopCountInstruction && VM_Version::supports_avx2() && !VM_Version::supports_avx512_vpopcntdq()) {

Why is the LUT unconditionally generated? `UsePopCountInstruction` still guides the usages.

src/hotspot/cpu/x86/vm_version_x86.hpp line 375:

> 373: decl(RDTSCP, "rdtscp", 48) /* RDTSCP instruction */ \
> 374: decl(RDPID, "rdpid", 49) /* RDPID instruction */ \
> 375: decl(FSRM, "fsrm", 50) /* Fast Short REP MOV */ \

`test/lib-test/jdk/test/whitebox/CPUInfoTest.java` should be adjusted as well, shouldn't it?

src/hotspot/cpu/x86/x86.ad line 2113:

> 2111:
> 2112: case Op_CountLeadingZerosV:
> 2113: if ((bt == T_INT || bt == T_LONG) && VM_Version::supports_avx512cd()) {

Newly introduced `is_non_subword_integral_type(bt)` can be used here instead of `bt == T_INT || bt == T_LONG`.
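To make the two readability suggestions above concrete, here is a small self-contained sketch. It is not HotSpot code: `BasicType` is trimmed to four values, `CpuFeatures` is a hypothetical replacement for the `VM_Version::supports_*()` queries, and the body given for `is_non_subword_integral_type` is only the meaning implied by the review comment.

```cpp
#include <cassert>

// Hypothetical, trimmed-down stand-ins for HotSpot's BasicType and CPU
// feature queries; the names below exist only for this sketch.
enum BasicType { T_BYTE, T_SHORT, T_INT, T_LONG };

struct CpuFeatures {
    bool avx512_bitalg;
    bool avx512_vpopcntdq;
};

bool is_subword_type(BasicType t) { return t == T_BYTE || t == T_SHORT; }

// Assumed meaning of the helper mentioned in the review.
bool is_non_subword_integral_type(BasicType t) { return t == T_INT || t == T_LONG; }

// Original shape: one compound condition.
int popcount_cost_compound(BasicType ety, const CpuFeatures& cpu) {
    return ((ety == T_INT && cpu.avx512_vpopcntdq) ||
            (is_subword_type(ety) && cpu.avx512_bitalg)) ? 0 : 50;
}

// Suggested shape: branch on the element type first.
int popcount_cost_split(BasicType ety, const CpuFeatures& cpu) {
    if (is_subword_type(ety)) {
        return cpu.avx512_bitalg ? 0 : 50;
    }
    assert(ety == T_INT);  // documents the only other expected element type
    return cpu.avx512_vpopcntdq ? 0 : 50;
}
```

For the element types the split version expects (byte, short, int), both functions return the same cost for every feature combination, so in this sketch the rewrite is purely a readability change.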
src/hotspot/share/classfile/vmIntrinsics.hpp line 1152: > 1150: "Ljdk/internal/vm/vector/VectorSupport$ComExpOperation;)" \ > 1151: "Ljdk/internal/vm/vector/VectorSupport$VectorPayload;") \ > 1152: do_name(vector_comexp_op_name, "comExpOp") \ I don't see much value in trying to shorten the name by abbreviating it. I find it easier to read in an expanded form: ` compressExpandOp`, `vector_compress_expand_op_name`, `_VectorCompressExpand`, etc. src/hotspot/share/opto/c2compiler.cpp line 521: > 519: if (!Matcher::match_rule_supported(Op_SignumF)) return false; > 520: break; > 521: case vmIntrinsics::_VectorComExp: Why `_VectorComExp` intrinsic is special? Other vector intrinsics are handled later and in a different manner. What about `ExpandV` case? src/hotspot/share/opto/compile.cpp line 3416: > 3414: > 3415: case Op_ReverseBytesV: > 3416: case Op_ReverseV: { Can you elaborate, please, why it is performed so late in the optimization phase (at the very end during graph reshaping) and not during GVN? ------------- PR: https://git.openjdk.java.net/jdk/pull/8425 From xgong at openjdk.java.net Fri May 13 01:52:44 2022 From: xgong at openjdk.java.net (Xiaohong Gong) Date: Fri, 13 May 2022 01:52:44 GMT Subject: RFR: 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature [v3] In-Reply-To: References: Message-ID: On Thu, 12 May 2022 16:07:54 GMT, Paul Sandoz wrote: > Yes, the tests were run in debug mode. The reporting of the missing constant occurs for the compiled method that is called from the method where the constants are declared e.g.: > > ``` > 719 240 b jdk.incubator.vector.Int256Vector::fromArray0 (15 bytes) > ** Rejected vector op (LoadVectorMasked,int,8) because architecture does not support it > ** missing constant: offsetInRange=Parm > @ 11 jdk.incubator.vector.IntVector::fromArray0Template (22 bytes) force inline by annotation > ``` > > So it appears to be working as expected. 
A similar pattern occurs at a lower-level for the passing of the mask class. `Int256Vector::fromArray0` passes a constant class to `IntVector::fromArray0Template` (the compilation of which bails out before checking that the `offsetInRange` is constant). Make sense to me! I think I understand what you mean. I will have more tests with the integer constant change. Thanks! ------------- PR: https://git.openjdk.java.net/jdk/pull/8035 From dlong at openjdk.java.net Fri May 13 02:27:13 2022 From: dlong at openjdk.java.net (Dean Long) Date: Fri, 13 May 2022 02:27:13 GMT Subject: RFR: 8271078: jdk/incubator/vector/Float128VectorTests.java failed a subtest Message-ID: This test was failing because the safepoint polling stub was only saving the low 8 bytes of XMM16-XMM31. We need to save all 16 bytes by default. I also added "wide" to the boolean parameter name to better reflect what it controls. And I made the asserts in save_live_registers() match what we have in restore_live_registers(). ------------- Commit messages: - save 16-byte vectors Changes: https://git.openjdk.java.net/jdk/pull/8690/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=8690&range=00 Issue: https://bugs.openjdk.java.net/browse/JDK-8271078 Stats: 27 lines in 1 file changed: 1 ins; 1 del; 25 mod Patch: https://git.openjdk.java.net/jdk/pull/8690.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8690/head:pull/8690 PR: https://git.openjdk.java.net/jdk/pull/8690 From kvn at openjdk.java.net Fri May 13 03:09:47 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Fri, 13 May 2022 03:09:47 GMT Subject: RFR: 8271078: jdk/incubator/vector/Float128VectorTests.java failed a subtest In-Reply-To: References: Message-ID: <-kBZstxHuNYypM3lf6ZPEEdSByhxvX-HNA0T9clYeyg=.dd27fb71-6ddf-4d07-a946-0909b97cd9df@github.com> On Fri, 13 May 2022 02:21:06 GMT, Dean Long wrote: > This test was failing because the safepoint polling stub was only saving the low 8 bytes of XMM16-XMM31. 
We need to save all 16 bytes by default. I also added "wide" to the boolean parameter name to better reflect what it controls. And I made the asserts in save_live_registers() match what we have in restore_live_registers(). Thank you for finding the cause and fixing the issue! ------------- Marked as reviewed by kvn (Reviewer). PR: https://git.openjdk.java.net/jdk/pull/8690 From dlong at openjdk.java.net Fri May 13 03:41:43 2022 From: dlong at openjdk.java.net (Dean Long) Date: Fri, 13 May 2022 03:41:43 GMT Subject: RFR: 8271078: jdk/incubator/vector/Float128VectorTests.java failed a subtest In-Reply-To: References: Message-ID: On Fri, 13 May 2022 02:21:06 GMT, Dean Long wrote: > This test was failing because the safepoint polling stub was only saving the low 8 bytes of XMM16-XMM31. We need to save all 16 bytes by default. I also added "wide" to the boolean parameter name to better reflect what it controls. And I made the asserts in save_live_registers() match what we have in restore_live_registers(). Some tests are failing with the new asserts. I need to investigate why. ------------- PR: https://git.openjdk.java.net/jdk/pull/8690 From dlong at openjdk.java.net Fri May 13 03:46:44 2022 From: dlong at openjdk.java.net (Dean Long) Date: Fri, 13 May 2022 03:46:44 GMT Subject: RFR: 8271078: jdk/incubator/vector/Float128VectorTests.java failed a subtest In-Reply-To: References: Message-ID: On Fri, 13 May 2022 02:21:06 GMT, Dean Long wrote: > This test was failing because the safepoint polling stub was only saving the low 8 bytes of XMM16-XMM31. We need to save all 16 bytes by default. I also added "wide" to the boolean parameter name to better reflect what it controls. And I made the asserts in save_live_registers() match what we have in restore_live_registers(). I forgot generate_deopt_blob() wants to save wide vectors unconditionally. I will revert the assert changes, as it's not critical for this fix. 
------------- PR: https://git.openjdk.java.net/jdk/pull/8690 From dlong at openjdk.java.net Fri May 13 03:46:44 2022 From: dlong at openjdk.java.net (Dean Long) Date: Fri, 13 May 2022 03:46:44 GMT Subject: RFR: 8271078: jdk/incubator/vector/Float128VectorTests.java failed a subtest In-Reply-To: <-kBZstxHuNYypM3lf6ZPEEdSByhxvX-HNA0T9clYeyg=.dd27fb71-6ddf-4d07-a946-0909b97cd9df@github.com> References: <-kBZstxHuNYypM3lf6ZPEEdSByhxvX-HNA0T9clYeyg=.dd27fb71-6ddf-4d07-a946-0909b97cd9df@github.com> Message-ID: <2uo04RHGgMjH5NJ2Vs6n7A9MTjAvcxWwiGoWbhakzoU=.ce458cf7-84e6-4014-9853-4d2ffddffbd1@github.com> On Fri, 13 May 2022 03:06:32 GMT, Vladimir Kozlov wrote: > Thank you for finding the cause and fixing the issue! Thanks Vladimir! I'm a little surprised the code has been broken since 2015 and we never noticed until now. ------------- PR: https://git.openjdk.java.net/jdk/pull/8690 From dlong at openjdk.java.net Fri May 13 03:53:24 2022 From: dlong at openjdk.java.net (Dean Long) Date: Fri, 13 May 2022 03:53:24 GMT Subject: RFR: 8271078: jdk/incubator/vector/Float128VectorTests.java failed a subtest [v2] In-Reply-To: References: Message-ID: > This test was failing because the safepoint polling stub was only saving the low 8 bytes of XMM16-XMM31. We need to save all 16 bytes by default. I also added "wide" to the boolean parameter name to better reflect what it controls. And I made the asserts in save_live_registers() match what we have in restore_live_registers(). 
Dean Long has updated the pull request incrementally with one additional commit since the last revision: revert changes to asserts ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/8690/files - new: https://git.openjdk.java.net/jdk/pull/8690/files/8709b627..e6a533a9 Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=8690&range=01 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=8690&range=00-01 Stats: 4 lines in 1 file changed: 1 ins; 1 del; 2 mod Patch: https://git.openjdk.java.net/jdk/pull/8690.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8690/head:pull/8690 PR: https://git.openjdk.java.net/jdk/pull/8690 From dlong at openjdk.java.net Fri May 13 04:08:37 2022 From: dlong at openjdk.java.net (Dean Long) Date: Fri, 13 May 2022 04:08:37 GMT Subject: RFR: 8271078: jdk/incubator/vector/Float128VectorTests.java failed a subtest [v3] In-Reply-To: References: Message-ID: > This test was failing because the safepoint polling stub was only saving the low 8 bytes of XMM16-XMM31. We need to save all 16 bytes by default. I also added "wide" to the boolean parameter name to better reflect what it controls. And I made the asserts in save_live_registers() match what we have in restore_live_registers(). 
Dean Long has updated the pull request incrementally with one additional commit since the last revision: save_vectors --> save_wide_vectors ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/8690/files - new: https://git.openjdk.java.net/jdk/pull/8690/files/e6a533a9..a2bc8306 Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=8690&range=02 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=8690&range=01-02 Stats: 3 lines in 1 file changed: 0 ins; 0 del; 3 mod Patch: https://git.openjdk.java.net/jdk/pull/8690.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8690/head:pull/8690 PR: https://git.openjdk.java.net/jdk/pull/8690 From chagedorn at openjdk.java.net Fri May 13 07:52:42 2022 From: chagedorn at openjdk.java.net (Christian Hagedorn) Date: Fri, 13 May 2022 07:52:42 GMT Subject: RFR: 8284115: [IR Framework] Compilation is not found due to rare safepoint while dumping PrintIdeal/PrintOptoAssembly Message-ID: This is yet another manifestation of the safepointing problem while printing a `PrintIdeal/PrintOptoAssembly` block. In this case here, a safepoint is done and the `` message is emitted while dumping a `PrintIdeal` block of `retainDenominator()` inside the `hotspot_pid` file. 
During this interruption, another test class method is enqueued for compilation which is logged to the `hotspot_pid` file before the printing of the `PrintIdeal` block resumes:

# PrintIdeal output of retainDenominator()
3 Start === 3 0 [[ 3 5 6 7 8 9 13 11 ]] #{0:control, 1:abIO, 2:memory, 3:rawptr:BotPTR, 4:return_address, 5:compiler/c2/irTests/DivLNodeIdealizationTests:NotNull *, 6:long, 7:half, 8:long, 9:half}
36 CallStaticJava === 34 6 7 8 9 ( 35 1 1 1 1 1 26 1 27 1 ) [[ 37 ]] # Static uncommon_trap(reason='div0_check' action='maybe_recompile' debug_id='0') void ( int ) C=0.000100 DivLNodeIdealizationTests::retainDenominator @ bci:4 (line 130) !jvms: DivLNodeIdealizationTests::retainDenominator
# Safepoint interruption
# Enqueuing of another test class method identityThird()
@ bci:4 (line 130)
# Continue to dump PrintIdeal of retainDenominator()
41 DivL === 33 26 13 [[ 42 ]] !jvms: DivLNodeIdealizationTests::retainDenominator @ bci:4 (line 130)
9 Parm === 3 [[ 42 36 ]] ReturnAdr !jvms: DivLNodeIdealizationTests::retainDenominator @ bci:-1 (line 130)

The `HotSpotPidFileParser` looks for these enqueue messages containing the method name in order to find and correctly map the corresponding `PrintIdeal` and `PrintOptoAssembly` outputs. However, the `HotSpotPidFileParser` does not expect such an enqueuing message to be found inside a `PrintIdeal/PrintOptoAssembly` block and thus ignores it. As a result, we later do not parse the `PrintIdeal` and `PrintOptoAssembly` output of the method that was enqueued during the safepoint and fail with the assertion that we did not find any compilation output for the method. In the example above, the assertion says that we did not find the compilation output of `identityThird()` whose enqueue message was ignored inside the `PrintIdeal` block of `retainDenominator()`.
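The failure mode described above can be reproduced with a toy line scanner. This is only a sketch: the real `HotSpotPidFileParser` is written in Java, and the `ENQUEUE`, `BLOCK_START` and `BLOCK_END` markers as well as the `enqueued_methods` function are hypothetical simplifications of the `hotspot_pid` file format, written in C++ to match the HotSpot sources.

```cpp
#include <cassert>
#include <set>
#include <sstream>
#include <string>

// Hypothetical log markers; the real hotspot_pid file uses a different format.
static const std::string kEnqueue = "ENQUEUE ";
static const std::string kBlockStart = "BLOCK_START";
static const std::string kBlockEnd = "BLOCK_END";

// Collects the names of methods reported as enqueued for compilation.
// When check_inside_blocks is false, enqueue lines that appear inside an
// output block are skipped, which mimics the behavior described above.
std::set<std::string> enqueued_methods(const std::string& log, bool check_inside_blocks) {
    std::set<std::string> methods;
    std::istringstream in(log);
    std::string line;
    bool in_block = false;
    while (std::getline(in, line)) {
        if (line == kBlockStart) { in_block = true; continue; }
        if (line == kBlockEnd) { in_block = false; continue; }
        bool is_enqueue = line.compare(0, kEnqueue.size(), kEnqueue) == 0;
        if (is_enqueue && (!in_block || check_inside_blocks)) {
            methods.insert(line.substr(kEnqueue.size()));
        }
    }
    return methods;
}
```

With a log in which `identityThird` is enqueued in the middle of the block for `retainDenominator`, the scanner misses `identityThird` unless it also inspects lines inside the block.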
The proposed fix is to make `HotSpotPidFileParser` aware of the possibility of a safepoint while reading the `PrintIdeal` or `PrintOptoAssembly` output and therefore add a check if there was a method enqueued for compilation while reading inside `BlockOutputReader::readBlock()`. Thanks, Christian ------------- Commit messages: - 8284115: [IR Framework] Compilation is not found due to rare safepoint while dumping PrintIdeal/PrintOptoAssembly Changes: https://git.openjdk.java.net/jdk/pull/8692/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=8692&range=00 Issue: https://bugs.openjdk.java.net/browse/JDK-8284115 Stats: 160 lines in 6 files changed: 109 ins; 40 del; 11 mod Patch: https://git.openjdk.java.net/jdk/pull/8692.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8692/head:pull/8692 PR: https://git.openjdk.java.net/jdk/pull/8692 From jbhateja at openjdk.java.net Fri May 13 08:31:09 2022 From: jbhateja at openjdk.java.net (Jatin Bhateja) Date: Fri, 13 May 2022 08:31:09 GMT Subject: RFR: 8284960: Integration of JEP 426: Vector API (Fourth Incubator) [v4] In-Reply-To: References: Message-ID: <9BFz3-71uc1dcsLybF4_IGlQmh43DBdLkI6FEGxKTro=.d020993a-a112-46fe-9902-6c057918b700@github.com> > Hi All, > > Patch adds the planned support for new vector operations and APIs targeted for [JEP 426: Vector API (Fourth Incubator).](https://bugs.openjdk.java.net/browse/JDK-8280173) > > Following is the brief summary of changes:- > > 1) Extends the scope of existing lanewise API for following new vector operations. 
> - VectorOperators.BIT_COUNT: counts the number of one-bits > - VectorOperators.LEADING_ZEROS_COUNT: counts the number of leading zero bits > - VectorOperators.TRAILING_ZEROS_COUNT: counts the number of trailing zero bits > - VectorOperators.REVERSE: reverses the order of bits > - VectorOperators.REVERSE_BYTES: reverses the order of bytes > - compress and expand bits: Semantics are based on Hacker's Delight section 7-4 Compress, or Generalized Extract. > > 2) Adds the following new APIs to perform cross-lane vector compress and expand operations under the influence of a mask. > - Vector.compress > - Vector.expand > - VectorMask.compress > > 3) Adds predicated and non-predicated versions of the following new APIs to load and store the contents of a vector from foreign MemorySegments. > - Vector.fromMemorySegment > - Vector.intoMemorySegment > > 4) C2 Compiler IR enhancements and optimized X86 and AARCH64 backend support for each newly added operation. > > > The patch has been regression-tested on AARCH64 and on X86 targets with different AVX levels. > > Kindly review and share your feedback. > > Best Regards, > Jatin Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: 8284960: Review comments resolution.
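For readers who want the per-lane meaning of the new lanewise operations listed in the summary above, here is a scalar sketch for a single `int` lane. It relies only on long-standing `java.lang.Integer` methods. `compressBits` is a hypothetical hand-written helper in the spirit of the simple loop form of Hacker's Delight 7-4; it is not the code from this patch and makes no claim about how the intrinsics are implemented.

```java
// Scalar counterparts of the new lanewise operations for one int lane.
// Class and helper names are illustrative only.
class LaneOpsSketch {

    // Gathers the bits of x selected by mask into the low-order bits of the
    // result, preserving their relative order (bit "compress").
    static int compressBits(int x, int mask) {
        int result = 0;
        int shift = 0;
        while (mask != 0) {
            int lowest = mask & -mask;      // lowest set bit of the mask
            if ((x & lowest) != 0) {
                result |= 1 << shift;
            }
            shift++;
            mask &= mask - 1;               // clear that mask bit
        }
        return result;
    }

    public static void main(String[] args) {
        int x = 0xB6; // binary 1011_0110
        System.out.println(Integer.bitCount(x));               // BIT_COUNT
        System.out.println(Integer.numberOfLeadingZeros(x));   // LEADING_ZEROS_COUNT
        System.out.println(Integer.numberOfTrailingZeros(x));  // TRAILING_ZEROS_COUNT
        System.out.println(Integer.toHexString(Integer.reverse(x)));      // REVERSE
        System.out.println(Integer.toHexString(Integer.reverseBytes(x))); // REVERSE_BYTES
        System.out.println(Integer.toBinaryString(compressBits(x, 0xF0))); // 1011
    }
}
```

The vector forms apply the same function independently to every lane, optionally under a mask.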
------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/8425/files - new: https://git.openjdk.java.net/jdk/pull/8425/files/b021e082..adf205f9 Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=8425&range=03 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=8425&range=02-03 Stats: 121 lines in 49 files changed: 8 ins; 5 del; 108 mod Patch: https://git.openjdk.java.net/jdk/pull/8425.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8425/head:pull/8425 PR: https://git.openjdk.java.net/jdk/pull/8425 From jbhateja at openjdk.java.net Fri May 13 08:31:22 2022 From: jbhateja at openjdk.java.net (Jatin Bhateja) Date: Fri, 13 May 2022 08:31:22 GMT Subject: RFR: 8284960: Integration of JEP 426: Vector API (Fourth Incubator) [v3] In-Reply-To: References: <15GChtdthFmu9Cup-Ykj5NBvAanOC8QOJsnhH9g20KY=.f35eba31-15f9-40e8-95ce-a54049792840@github.com> Message-ID: On Thu, 12 May 2022 22:40:50 GMT, Vladimir Ivanov wrote: >> Jatin Bhateja has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 11 commits: >> >> - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8284960 >> - 8284960: Correcting a typo. >> - 8284960: Integrating changes from panama-vector (Add @since 19 tags). >> - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8284960 >> - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8284960 >> - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8284960 >> - 8284960: AARCH64 backend changes. >> - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8284960 >> - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8284960 >> - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8284960 >> - ... 
and 1 more: https://git.openjdk.java.net/jdk/compare/3fa1c404...b021e082 > > src/hotspot/cpu/x86/matcher_x86.hpp line 195: > >> 193: case Op_PopCountVI: >> 194: return ((ety == T_INT && VM_Version::supports_avx512_vpopcntdq()) || >> 195: (is_subword_type(ety) && VM_Version::supports_avx512_bitalg())) ? 0 : 50; > > Should be easier to read when the condition is split. E.g.: > > if (is_subword_type(ety)) { > return VM_Version::supports_avx512_bitalg())) ? 0 : 50; > } else { > assert(ety == T_INT, "sanity"); // for documentation purposes > return VM_Version::supports_avx512_vpopcntdq() ? 0 : 50; > } DONE > src/hotspot/cpu/x86/vm_version_x86.hpp line 375: > >> 373: decl(RDTSCP, "rdtscp", 48) /* RDTSCP instruction */ \ >> 374: decl(RDPID, "rdpid", 49) /* RDPID instruction */ \ >> 375: decl(FSRM, "fsrm", 50) /* Fast Short REP MOV */ \ > > `test/lib-test/jdk/test/whitebox/CPUInfoTest.java` should be adjusted as well, shouldn't it? Yes, test updated appropriately. > src/hotspot/share/classfile/vmIntrinsics.hpp line 1152: > >> 1150: "Ljdk/internal/vm/vector/VectorSupport$ComExpOperation;)" \ >> 1151: "Ljdk/internal/vm/vector/VectorSupport$VectorPayload;") \ >> 1152: do_name(vector_comexp_op_name, "comExpOp") \ > > I don't see much value in trying to shorten the name by abbreviating it. I find it easier to read in an expanded form: > ` compressExpandOp`, `vector_compress_expand_op_name`, `_VectorCompressExpand`, etc. DONE > src/hotspot/share/opto/c2compiler.cpp line 521: > >> 519: if (!Matcher::match_rule_supported(Op_SignumF)) return false; >> 520: break; >> 521: case vmIntrinsics::_VectorComExp: > > Why `_VectorComExp` intrinsic is special? Other vector intrinsics are handled later and in a different manner. > > What about `ExpandV` case? It was an attempt to facilitate in-lining of these APIs over targets which do not intrinsify them. 
I agree it's not a generic fix, since three APIs are piggybacking on the same entry point and, without knowledge of the opcode, it would be inappropriate to take any call at this place. Lazy intrinsification gives some of the predications an opportunity to concretize, as compilation happens under closed-world assumptions. > src/hotspot/share/opto/compile.cpp line 3416: > >> 3414: >> 3415: case Op_ReverseBytesV: >> 3416: case Op_ReverseV: { > > Can you elaborate, please, why it is performed so late in the optimization phase (at the very end during graph reshaping) and not during GVN? It's more of a chicken-and-egg problem here. For a masked reverse operation, the Reverse IR node is followed by a Blend node, so doing an eager Identity transform in Reverse::Identity will not work. Deferring this to the blend may also not work, since it could be a non-masked reverse operation. We could have handled it as a special case in inline_vector_nary_operation, but handling such a special case in final graph reshaping looked more appropriate. https://github.com/openjdk/panama-vector/pull/182#discussion_r845678080 ------------- PR: https://git.openjdk.java.net/jdk/pull/8425 From jbhateja at openjdk.java.net Fri May 13 08:31:24 2022 From: jbhateja at openjdk.java.net (Jatin Bhateja) Date: Fri, 13 May 2022 08:31:24 GMT Subject: RFR: 8284960: Integration of JEP 426: Vector API (Fourth Incubator) [v4] In-Reply-To: References: <15GChtdthFmu9Cup-Ykj5NBvAanOC8QOJsnhH9g20KY=.f35eba31-15f9-40e8-95ce-a54049792840@github.com> Message-ID: On Thu, 12 May 2022 22:48:26 GMT, Vladimir Ivanov wrote: >> Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: >> >> 8284960: Review comments resolution.
> > src/hotspot/cpu/x86/stubGenerator_x86_64.cpp line 7953: > >> 7951: StubRoutines::x86::_vector_iota_indices = generate_iota_indices("iota_indices"); >> 7952: >> 7953: if (UsePopCountInstruction && VM_Version::supports_avx2() && !VM_Version::supports_avx512_vpopcntdq()) { > > Why is the LUT unconditionally generated? `UsePopCountInstruction` still guides the usages. LUT should be generated only if UsePopCountInstruction is false and iff the target does not support the necessary features, AVX512POPCNTDQ (for int/long vectors) and AVX512_BITALG (for sub-word vectors). Please refer to the following discussion, where it was suggested to restrict the scope of the flag to only the scalar popcount operation. https://github.com/openjdk/panama-vector/pull/185#discussion_r847758463 ------------- PR: https://git.openjdk.java.net/jdk/pull/8425 From xgong at openjdk.java.net Fri May 13 09:01:49 2022 From: xgong at openjdk.java.net (Xiaohong Gong) Date: Fri, 13 May 2022 09:01:49 GMT Subject: RFR: 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature [v3] In-Reply-To: References: Message-ID: <_c_QPZQIL-ZxBs9TaKmrh7_1WcbEDH1pUwhTpOc6PD8=.75e4a61b-ebb6-491c-9c5b-9a035f0b9eaf@github.com> On Thu, 12 May 2022 16:07:54 GMT, Paul Sandoz wrote: >> Xiaohong Gong has updated the pull request incrementally with one additional commit since the last revision: >> >> Rename "use_predicate" to "needs_predicate" > > Yes, the tests were run in debug mode. The reporting of the missing constant occurs for the compiled method that is called from the method where the constants are declared e.g.: > > 719 240 b jdk.incubator.vector.Int256Vector::fromArray0 (15 bytes) > ** Rejected vector op (LoadVectorMasked,int,8) because architecture does not support it > ** missing constant: offsetInRange=Parm > @ 11 jdk.incubator.vector.IntVector::fromArray0Template (22 bytes) force inline by annotation > > > So it appears to be working as expected.
A similar pattern occurs at a lower-level for the passing of the mask class. `Int256Vector::fromArray0` passes a constant class to `IntVector::fromArray0Template` (the compilation of which bails out before checking that the `offsetInRange` is constant). You are right @PaulSandoz ! I ran the tests and benchmarks with your patch, and no failure and performance regression are found. I will update the patch soon. Thanks for the help! ------------- PR: https://git.openjdk.java.net/jdk/pull/8035 From shade at openjdk.java.net Fri May 13 09:19:09 2022 From: shade at openjdk.java.net (Aleksey Shipilev) Date: Fri, 13 May 2022 09:19:09 GMT Subject: RFR: 8286660: codestrings gtest fails on AArch64: "udf" in padding Message-ID: On hsdis-enabled AArch64 machine this test fails even with [JDK-8274039](https://bugs.openjdk.java.net/browse/JDK-8274039): $ CONF=linux-aarch64-server-fastdebug make run-test TEST=jtreg:gtest/GTestWrapper.java [----------] 1 test from codestrings [ RUN ] codestrings.validate_vm ------------- Commit messages: - Fix Changes: https://git.openjdk.java.net/jdk/pull/8695/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=8695&range=00 Issue: https://bugs.openjdk.java.net/browse/JDK-8286660 Stats: 2 lines in 1 file changed: 1 ins; 0 del; 1 mod Patch: https://git.openjdk.java.net/jdk/pull/8695.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8695/head:pull/8695 PR: https://git.openjdk.java.net/jdk/pull/8695 From xgong at openjdk.java.net Fri May 13 09:57:40 2022 From: xgong at openjdk.java.net (Xiaohong Gong) Date: Fri, 13 May 2022 09:57:40 GMT Subject: RFR: 8283667: [vectorapi] Vectorization for masked load with IOOBE with predicate feature [v4] In-Reply-To: References: Message-ID: <5WQ3tKFVgp4s4hW0rMZ4aVOo24I32lsIcrFG2cqkszc=.62ce9cc7-9b41-4a59-adff-cbd50e34069f@github.com> > Currently the vector load with mask when the given index happens out of the array boundary is implemented with pure java scalar code to avoid the IOOBE 
(IndexOutOfBoundsException). This is necessary for architectures that do not support the predicate feature, because there the masked load is implemented with a full vector load followed by a vector blend, and a full vector load would raise the IOOBE, which is not valid. However, for architectures that support the predicate feature like SVE/AVX-512/RVV, it can be vectorized with the predicated load instruction as long as the indexes of the masked lanes are within the bounds of the array. For these architectures, loading with unmasked lanes does not raise an exception. > > This patch adds the vectorization support for the masked load with IOOBE part. Please see the original java implementation (FIXME: optimize): > > > @ForceInline > public static > ByteVector fromArray(VectorSpecies<Byte> species, > byte[] a, int offset, > VectorMask<Byte> m) { > ByteSpecies vsp = (ByteSpecies) species; > if (offset >= 0 && offset <= (a.length - species.length())) { > return vsp.dummyVector().fromArray0(a, offset, m); > } > > // FIXME: optimize > checkMaskFromIndexSize(offset, vsp, m, 1, a.length); > return vsp.vOp(m, i -> a[offset + i]); > } > > Since it can only be vectorized with the predicate load, HotSpot must check whether the current backend supports it and fall back to the java scalar version if not. This is different from the normal masked vector load, where the compiler will generate a full vector load and a vector blend if the predicate load is not supported. So to let the compiler take the expected action, an additional flag (i.e. `usePred`) is added to the existing "loadMasked" intrinsic, with the value "true" for the IOOBE part and "false" for the normal load. The compiler will fail to intrinsify if the flag is "true" and the predicate load is not supported by the backend, which means that the normal java path will be executed.
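To make the intended semantics of the out-of-range masked load concrete, here is a plain-Java scalar model. It is a sketch only: it uses a `boolean[]` in place of `VectorMask`, the class and method names are hypothetical, and it does not touch the incubator API or the intrinsic described above.

```java
import java.util.Arrays;

// Scalar model of a masked load that may start close to the end of the array.
// Only lanes whose mask bit is set are read, so an out-of-range index is an
// error only when it belongs to a set lane. All names are illustrative.
class MaskedLoadSketch {

    static byte[] loadMasked(byte[] a, int offset, boolean[] mask) {
        byte[] lanes = new byte[mask.length];
        for (int i = 0; i < mask.length; i++) {
            if (!mask[i]) {
                continue; // unset lanes stay zero and the array is never touched
            }
            int index = offset + i;
            if (index < 0 || index >= a.length) {
                throw new IndexOutOfBoundsException("lane " + i + " reads index " + index);
            }
            lanes[i] = a[index];
        }
        return lanes;
    }

    public static void main(String[] args) {
        byte[] a = {1, 2, 3, 4, 5};
        // Lanes 2 and 3 would read a[5] and a[6], but they are masked off.
        boolean[] tail = {true, true, false, false};
        System.out.println(Arrays.toString(loadMasked(a, 3, tail))); // [4, 5, 0, 0]
    }
}
```

A full 4-lane load at offset 3 would read past the array, which is why a "full load plus blend" lowering cannot be used for this case, while a predicated load that skips inactive lanes can.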
> > Also adds the same vectorization support for masked: > - fromByteArray/fromByteBuffer > - fromBooleanArray > - fromCharArray > > The performance for the new added benchmarks improve about `1.88x ~ 30.26x` on the x86 AVX-512 system: > > Benchmark before After Units > LoadMaskedIOOBEBenchmark.byteLoadArrayMaskIOOBE 737.542 1387.069 ops/ms > LoadMaskedIOOBEBenchmark.doubleLoadArrayMaskIOOBE 118.366 330.776 ops/ms > LoadMaskedIOOBEBenchmark.floatLoadArrayMaskIOOBE 233.832 6125.026 ops/ms > LoadMaskedIOOBEBenchmark.intLoadArrayMaskIOOBE 233.816 7075.923 ops/ms > LoadMaskedIOOBEBenchmark.longLoadArrayMaskIOOBE 119.771 330.587 ops/ms > LoadMaskedIOOBEBenchmark.shortLoadArrayMaskIOOBE 431.961 939.301 ops/ms > > Similar performance gain can also be observed on 512-bit SVE system. Xiaohong Gong has updated the pull request incrementally with one additional commit since the last revision: Use integer constant for offsetInRange all the way through ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/8035/files - new: https://git.openjdk.java.net/jdk/pull/8035/files/9c69206e..07edfbd5 Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=8035&range=03 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=8035&range=02-03 Stats: 438 lines in 39 files changed: 33 ins; 118 del; 287 mod Patch: https://git.openjdk.java.net/jdk/pull/8035.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8035/head:pull/8035 PR: https://git.openjdk.java.net/jdk/pull/8035 From eliu at openjdk.java.net Fri May 13 10:07:28 2022 From: eliu at openjdk.java.net (Eric Liu) Date: Fri, 13 May 2022 10:07:28 GMT Subject: RFR: 8282875: AArch64: [vectorapi] Optimize Vector.reduceLane for SVE 64/128 vector size [v3] In-Reply-To: <6KkmSzP3na8JUnMFh3Yfzm6yuzk1EWr1LORhDLDPxRM=.c4bf6878-3964-4b9e-9225-f0fe3da1fc16@github.com> References: <6KkmSzP3na8JUnMFh3Yfzm6yuzk1EWr1LORhDLDPxRM=.c4bf6878-3964-4b9e-9225-f0fe3da1fc16@github.com> Message-ID: 
<0MVuCv_uSJ7N8THWankvN5eXbb7fPXyPeSytqUIVAa0=.fd5bd524-a064-4639-8d05-22973b7d0294@github.com> > This patch speeds up add/mul/min/max reductions for SVE for 64/128 > vector size. > > According to Neoverse N2/V1 software optimization guide[1][2], for > 128-bit vector size reduction operations, we prefer using NEON > instructions instead of SVE instructions. This patch adds some rules to > distinguish 64/128 bits vector size with others, so that for these two > special cases, they can generate code the same as NEON. E.g., For > ByteVector.SPECIES_128, "ByteVector.reduceLanes(VectorOperators.ADD)" > generates code as below: > > > Before: > uaddv d17, p0, z16.b > smov x15, v17.b[0] > add w15, w14, w15, sxtb > > After: > addv b17, v16.16b > smov x12, v17.b[0] > add w12, w12, w16, sxtb > > No multiply reduction instruction in SVE, this patch generates code for > MulReductionVL by using scalar insnstructions for 128-bit vector size. > > With this patch, all of them have performance gain for specific vector > micro benchmarks in my SVE testing system. > > [1] https://developer.arm.com/documentation/pjdoc466751330-9685/latest/ > [2] https://developer.arm.com/documentation/PJDOC-466751330-18256/0001 > > Change-Id: I4bef0b3eb6ad1bac582e4236aef19787ccbd9b1c Eric Liu has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains four commits: - refine m4 Change-Id: I7d76e606485727ca1f3de1d3af733f7e28fb9867 - Merge jdk:master Change-Id: I275eb5834eacce029bc286b1b48128f07dd4070e - Generate SVE reduction for MIN/MAX/ADD as before Change-Id: Ibc6b9c1f46c42cd07f7bb73b81ed38829e9d0975 - 8282875: AArch64: [vectorapi] Optimize Vector.reduceLane for SVE 64/128 vector size This patch speeds up add/mul/min/max reductions for SVE for 64/128 vector size. According to Neoverse N2/V1 software optimization guide[1][2], for 128-bit vector size reduction operations, we prefer using NEON instructions instead of SVE instructions. 
This patch adds some rules to distinguish the 64/128-bit vector sizes from the others, so that these two special cases generate the same code as NEON. E.g., for ByteVector.SPECIES_128, "ByteVector.reduceLanes(VectorOperators.ADD)" generates code as below: ``` Before: uaddv d17, p0, z16.b smov x15, v17.b[0] add w15, w14, w15, sxtb After: addv b17, v16.16b smov x12, v17.b[0] add w12, w12, w16, sxtb ``` SVE has no multiply reduction instruction, so this patch generates code for MulReductionVL by using scalar instructions for 128-bit vector size. With this patch, all of them show a performance gain on specific vector micro benchmarks in my SVE testing system. [1] https://developer.arm.com/documentation/pjdoc466751330-9685/latest/ [2] https://developer.arm.com/documentation/PJDOC-466751330-18256/0001 Change-Id: I4bef0b3eb6ad1bac582e4236aef19787ccbd9b1c ------------- Changes: https://git.openjdk.java.net/jdk/pull/7999/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=7999&range=02 Stats: 1653 lines in 6 files changed: 672 ins; 691 del; 290 mod Patch: https://git.openjdk.java.net/jdk/pull/7999.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/7999/head:pull/7999 PR: https://git.openjdk.java.net/jdk/pull/7999 From smonteith at openjdk.java.net Fri May 13 11:07:50 2022 From: smonteith at openjdk.java.net (Stuart Monteith) Date: Fri, 13 May 2022 11:07:50 GMT Subject: RFR: 8286660: codestrings gtest fails on AArch64: "udf" in padding In-Reply-To: References: Message-ID: On Fri, 13 May 2022 09:10:09 GMT, Aleksey Shipilev wrote: > On hsdis-enabled AArch64 machine this test fails even with [JDK-8274039](https://bugs.openjdk.java.net/browse/JDK-8274039): > > > $ CONF=linux-aarch64-server-fastdebug make run-test TEST=jtreg:gtest/GTestWrapper.java > > [----------] 1 test from codestrings > [ RUN ] codestrings.validate_vm That's interesting.
While the permanently undefined instruction, and immediate, is defined, I suppose it could be treated as an undefined instruction, and ignored. We don't actually emit or even define this instruction in the aarch64 assembly, so it can be safely ignored. LGTM. ------------- PR: https://git.openjdk.java.net/jdk/pull/8695 From duke at openjdk.java.net Fri May 13 11:46:43 2022 From: duke at openjdk.java.net (Emanuel Peter) Date: Fri, 13 May 2022 11:46:43 GMT Subject: RFR: 8283775: VM support for graph querying in debugger with BFS traversal and node filtering [v9] In-Reply-To: References: Message-ID: > I implemented VM support for BFS traversal of IR nodes, where one can filter for input/output edges, and the node-type (control / data / memory). If one specifies the `target` node, we find a shortest path between `this` and `target`. With the `options` string one can easily select which node types to traverse. > > `void Node::print_bfs(const uint max_distance, Node* target, const char* options)` > > While a sufficient summary and description is in the comments above the function definition, I want to explain some useful use-cases here. > > Please let me know if you would find this helpful, or if you have any feedback to improve it. > Thanks, Emanuel > > **1. Better dump()** > The function is similar to `dump()`, we can also follow input / output edges up to a certain distance. There are a few improvements/added features: > > 1. Display the distance to the specified node. This way I can quickly assess how many nodes I have at a certain distance, and I can traverse edges from one distance to the next. The parent column shows the node one step closer to the BFS root (this). > 2. Node filtering with types taken from `type->category()` help one limit traversal to control, data or memory flow. The categories are the same as in IGV. > 3. Probably controversial: coloring of node types in terminal. By default it is off. May not work on all terminals. 
But I find it very useful to quickly separate memory, data and control - just as in IGV - but in my terminal! > 4. After matching, we create Mach nodes from the IR nodes. For most Mach nodes, there is an associated old node. Often an assert is triggered for Mach nodes, but the graph already breaks in the IR graph. It is thus very helpful to have the old node idx displayed. > 5. I also display the head node idx of the block a Mach node is scheduled for. This can be helpful if one gets bad dominance asserts, and needs to see why the blocks were scheduled in a bad way. > > Example: > > (rr) p find_node(35)->print_bfs(2, 0, "cdmox+") > No target: perform BFS. > dis par c dump > --------------------------------------------- > 0 35 d 35 CmpP === _ 34 25 [[ 36 ]] > 1 35 d 34 LoadP === _ 31 33 [[ 35 ]] > 1 35 d 25 ConP === 0 [[ 26 27 31 35 41 ]] #NULL > 2 34 m 31 StoreP === 20 27 29 25 [[ 23 34 41 42 ]] > 2 34 d 33 AddP === _ 1 12 32 [[ 34 ]] > > > Example with Mach nodes: > > (rr) p ctrl->print_bfs(4, 0, "cdmox+OB") > No target: perform BFS. > dis [head idom d] old par c dump > --------------------------------------------- > 0 159 147 6 _ 159 c 159 Region === 159 57 [[ 159 158 59 ]] > 1 147 148 5 o183 159 c 57 IfTrue === 8 [[ 159 ]] > 2 147 148 5 o182 57 c 8 jmpConU === 147 9 [[ 7 57 ]] > 3 147 148 5 _ 8 c 147 Region === 147 14 [[ 147 8 ]] > 3 147 148 5 o180 8 d 9 compUL_rReg === _ 10 13 [[ 8 ]] > 4 148 149 4 o174 147 c 14 IfTrue === 15 [[ 147 ]] > 4 147 148 5 o203 9 d 10 decL_rReg === _ 11 [[ 12 9 ]] > 4 147 148 5 o179 9 d 13 convI2L_reg_reg === _ 28 [[ 9 ]] > > > **2. Find loop body** > When I find myself in a loop, I try to localize the loop head and end, and map out at least one path between them. > `loop_end->print_bfs(20, loop_head, "cox+")` > This provides us with a shortest path, given this path has a distance of at most 20. > > Example: > > (rr) p find_node(158)->print_bfs(20, find_node(160), "cox+") > Find shortest path: 158 -> 160. > > Backtrace target. 
> dis c dump > --------------------------------------------- > 9 c 160 OuterStripMinedLoop === 160 339 159 [[ 160 358 ]] > 8 c 358 CountedLoop === 358 160 143 [[ 358 362 363 ]] > 7 c 363 If === 358 351 [[ 364 367 ]] > 6 c 364 IfTrue === 363 [[ 128 ]] > 5 c 128 If === 364 127 [[ 129 130 ]] > 4 c 129 IfTrue === 128 [[ 155 ]] > 3 c 155 CountedLoopEnd === 129 154 [[ 157 143 ]] [lt] > 2 c 157 IfFalse === 155 [[ 162 163 ]] > 1 c 162 SafePoint === 157 1 7 1 1 163 100 1 1 13 27 133 [[ 158 ]] > 0 c 158 OuterStripMinedLoopEnd === 162 156 [[ 159 227 ]] > > Example with Mach nodes: > > (rr) p ctrl->print_bfs(10, val, "cdmox-+OB") > Find shortest path: 159 -> 27. > > Backtrace target. > dis [head idom d] old e c dump > --------------------------------------------- > 2 24 1 2 o10 + d 27 MachProj === 24 [[ 19 28 4 59 95 99 118 ]] > 1 56 159 7 o239 - d 59 loadB === 159 29 27 60 [[ 55 ]] > 0 159 147 6 _ c 159 Region === 159 57 [[ 159 158 59 ]] Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: moved comment, added newlines between functions ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/8468/files - new: https://git.openjdk.java.net/jdk/pull/8468/files/4bb56ba4..9f6f0438 Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=8468&range=08 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=8468&range=07-08 Stats: 112 lines in 1 file changed: 64 ins; 47 del; 1 mod Patch: https://git.openjdk.java.net/jdk/pull/8468.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8468/head:pull/8468 PR: https://git.openjdk.java.net/jdk/pull/8468 From duke at openjdk.java.net Fri May 13 11:50:32 2022 From: duke at openjdk.java.net (Emanuel Peter) Date: Fri, 13 May 2022 11:50:32 GMT Subject: RFR: 8286638: C2: CmpU needs to do more precise over/underflow analysis [v3] In-Reply-To: References: Message-ID: > `CmpUNode::Value` already does an under/overflow analysis, in case we have an `AddI` or `SubI` above 
it. > Instead of assuming the types are now the full `#int` range, we separately analyze the normal and the over/underflowed range. > > We get the two ranges `tr1` and `tr2`, which we now both compare with the right input `t2`, via `sub`. > If both `cmp1` and `cmp2` are equal, for example both are `[ge]`, and below we have a Bool node that checks for `[lt]`, we know that this can never be true. > > However, I now encountered a case where `cmp1` was `[gt]` and `cmp2` was `[ge]`. Unfortunately, they are not the same, so we just discarded our analysis, and since it is an overflow case just cannot say anything. But we could actually know that both will never be `[lt]`. > > Thus, **I propose** to take the `meet` (the union of all possible results of `cmp1` and `cmp2`. In this example, the meet would be `[ge]`. > > **Why is this important?** > I got a bug, where a ConvI2L node was able to determine the range was impossible, ripping out the data-flow. But the range-check did not manage to do the same analysis, because of an underflow. This leads to some mangled code further down. > > **Detailed analysis of that case:** > > type `i: [minint...0]` > access to `c[i-1]` > > **Range-check:** > `int index = AddI(i, -1)` > -> type index: [minint-1 ... -1] -> underflow > We detect that this AddI may have 2 ranges: > `tr1: int:<=-1` > `tr2: int:max `(underflow: minint-1) > > We then check how these ranges compare to in2: > `t2: int:>=0` > > For this we compute: > `const Type* cmp1 = sub(tr1, t2);` -> TypeInt::CC_GT = [1] > `const Type* cmp2 = sub(tr2, t2);` -> TypeInt::CC_GE = [0...1] > > But then, we only do something with this result if `cmp1 == cmp2`. > We never detect that the `Bool [lt] `could never be true. > > > **Data-flow:** > `long index = ConvI2L( AddI(i, -1) )` > -> type of` ConvI2L: [0...maxint-1]` > -> why do we know this? Because this is before an array access. We assume range-check guarantees index in range `[0...c.size()-1]`, and `c.size()<=maxint`. 
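To see that the range check and the data flow ought to agree in this example, one can probe the unsigned comparison behind the range check on boundary values. The following is a small sketch with hypothetical names; it samples a few values of `i` in `[minint...0]` and of the array length, and is not an exhaustive proof nor the C2 type analysis itself.

```java
// Probes the range check for c[i - 1] when i is in [minint...0].
// The check passes only if (i - 1) <u len, i.e. the Bool [lt] case.
class CmpUUnderflowSketch {

    // i - 1 wraps around to Integer.MAX_VALUE for i == Integer.MIN_VALUE,
    // which is the underflow range tr2 from the analysis.
    static boolean rangeCheckPasses(int i, int len) {
        return Integer.compareUnsigned(i - 1, len) < 0;
    }

    // Returns true if any sampled (i, len) pair would pass the range check.
    static boolean anySamplePasses() {
        int[] indices = {Integer.MIN_VALUE, Integer.MIN_VALUE + 1, -2, -1, 0};
        int[] lengths = {0, 1, 2, Integer.MAX_VALUE - 1, Integer.MAX_VALUE};
        for (int i : indices) {
            for (int len : lengths) {
                if (rangeCheckPasses(i, len)) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(anySamplePasses()); // false
    }
}
```

For `i - 1` in the normal range the value is negative and therefore huge as an unsigned number (`[gt]`); in the underflow case it is `maxint`, which is at best equal to the length (`[ge]`). Neither case yields `[lt]`, matching the meet `[ge]` proposed above.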
> Then there is a push_thru_add, and we get: > `long index = AddL( ConvI2L(i), -1)` > -> type of new `ConvI2L: [1...maxint-1]` - because we correct the lo by 1 for the add. Somehow we do not adjust hi; in my opinion it should now be maxint, to correct by 1. > Consequence: if hi is maxint or maxint-1, there is no overflow. > Then, we statically detect that: > type `i: [minint...0]` > type `ConvI2L: [1...maxint-1]` > -> filter results in `TOP` -> data-flow is eliminated successfully. > > > Added a **regression test** that matches the example above. > Larger test suite passes. Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: fixed bug number ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/8679/files - new: https://git.openjdk.java.net/jdk/pull/8679/files/7d343308..16ab5b9f Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=8679&range=02 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=8679&range=01-02 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.java.net/jdk/pull/8679.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8679/head:pull/8679 PR: https://git.openjdk.java.net/jdk/pull/8679 From duke at openjdk.java.net Fri May 13 11:54:37 2022 From: duke at openjdk.java.net (Emanuel Peter) Date: Fri, 13 May 2022 11:54:37 GMT Subject: RFR: 8283775: VM support for graph querying in debugger with BFS traversal and node filtering [v10] In-Reply-To: References: Message-ID: > I implemented VM support for BFS traversal of IR nodes, where one can filter for input/output edges, and the node-type (control / data / memory). If one specifies the `target` node, we find a shortest path between `this` and `target`. With the `options` string one can easily select which node types to traverse.
> > `void Node::print_bfs(const uint max_distance, Node* target, const char* options)` > > While a sufficient summary and description is in the comments above the function definition, I want to explain some useful use-cases here. > > Please let me know if you would find this helpful, or if you have any feedback to improve it. > Thanks, Emanuel > > **1. Better dump()** > The function is similar to `dump()`, we can also follow input / output edges up to a certain distance. There are a few improvements/added features: > > 1. Display the distance to the specified node. This way I can quickly assess how many nodes I have at a certain distance, and I can traverse edges from one distance to the next. The parent column shows the node one step closer to the BFS root (this). > 2. Node filtering with types taken from `type->category()` help one limit traversal to control, data or memory flow. The categories are the same as in IGV. > 3. Probably controversial: coloring of node types in terminal. By default it is off. May not work on all terminals. But I find it very useful to quickly separate memory, data and control - just as in IGV - but in my terminal! > 4. After matching, we create Mach nodes from the IR nodes. For most Mach nodes, there is an associated old node. Often an assert is triggered for Mach nodes, but the graph already breaks in the IR graph. It is thus very helpful to have the old node idx displayed. > 5. I also display the head node idx of the block a Mach node is scheduled for. This can be helpful if one gets bad dominance asserts, and needs to see why the blocks were scheduled in a bad way. > > Example: > > (rr) p find_node(35)->print_bfs(2, 0, "cdmox+") > No target: perform BFS. 
> dis par c dump > --------------------------------------------- > 0 35 d 35 CmpP === _ 34 25 [[ 36 ]] > 1 35 d 34 LoadP === _ 31 33 [[ 35 ]] > 1 35 d 25 ConP === 0 [[ 26 27 31 35 41 ]] #NULL > 2 34 m 31 StoreP === 20 27 29 25 [[ 23 34 41 42 ]] > 2 34 d 33 AddP === _ 1 12 32 [[ 34 ]] > > > Example with Mach nodes: > > (rr) p ctrl->print_bfs(4, 0, "cdmox+OB") > No target: perform BFS. > dis [head idom d] old par c dump > --------------------------------------------- > 0 159 147 6 _ 159 c 159 Region === 159 57 [[ 159 158 59 ]] > 1 147 148 5 o183 159 c 57 IfTrue === 8 [[ 159 ]] > 2 147 148 5 o182 57 c 8 jmpConU === 147 9 [[ 7 57 ]] > 3 147 148 5 _ 8 c 147 Region === 147 14 [[ 147 8 ]] > 3 147 148 5 o180 8 d 9 compUL_rReg === _ 10 13 [[ 8 ]] > 4 148 149 4 o174 147 c 14 IfTrue === 15 [[ 147 ]] > 4 147 148 5 o203 9 d 10 decL_rReg === _ 11 [[ 12 9 ]] > 4 147 148 5 o179 9 d 13 convI2L_reg_reg === _ 28 [[ 9 ]] > > > **2. Find loop body** > When I find myself in a loop, I try to localize the loop head and end, and map out at least one path between them. > `loop_end->print_bfs(20, loop_head, "cox+")` > This provides us with a shortest path, given this path has a distance of at most 20. > > Example: > > (rr) p find_node(158)->print_bfs(20, find_node(160), "cox+") > Find shortest path: 158 -> 160. > > Backtrace target. > dis c dump > --------------------------------------------- > 9 c 160 OuterStripMinedLoop === 160 339 159 [[ 160 358 ]] > 8 c 358 CountedLoop === 358 160 143 [[ 358 362 363 ]] > 7 c 363 If === 358 351 [[ 364 367 ]] > 6 c 364 IfTrue === 363 [[ 128 ]] > 5 c 128 If === 364 127 [[ 129 130 ]] > 4 c 129 IfTrue === 128 [[ 155 ]] > 3 c 155 CountedLoopEnd === 129 154 [[ 157 143 ]] [lt] > 2 c 157 IfFalse === 155 [[ 162 163 ]] > 1 c 162 SafePoint === 157 1 7 1 1 163 100 1 1 13 27 133 [[ 158 ]] > 0 c 158 OuterStripMinedLoopEnd === 162 156 [[ 159 227 ]] > > Example with Mach nodes: > > (rr) p ctrl->print_bfs(10, val, "cdmox-+OB") > Find shortest path: 159 -> 27. 
> > Backtrace target. > dis [head idom d] old e c dump > --------------------------------------------- > 2 24 1 2 o10 + d 27 MachProj === 24 [[ 19 28 4 59 95 99 118 ]] > 1 56 159 7 o239 - d 59 loadB === 159 29 27 60 [[ 55 ]] > 0 159 147 6 _ c 159 Region === 159 57 [[ 159 158 59 ]] Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: fixing file copyright year ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/8468/files - new: https://git.openjdk.java.net/jdk/pull/8468/files/9f6f0438..5b0f43e6 Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=8468&range=09 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=8468&range=08-09 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.java.net/jdk/pull/8468.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8468/head:pull/8468 PR: https://git.openjdk.java.net/jdk/pull/8468 From shade at openjdk.java.net Fri May 13 12:40:48 2022 From: shade at openjdk.java.net (Aleksey Shipilev) Date: Fri, 13 May 2022 12:40:48 GMT Subject: RFR: 8286660: codestrings gtest fails on AArch64: "udf" in padding In-Reply-To: References: Message-ID: On Fri, 13 May 2022 11:04:06 GMT, Stuart Monteith wrote: > That's interesting. While the permanently undefined instruction, and immediate, is defined, I suppose it could be treated as an undefined instruction, and ignored. We don't actually emit or even define this instruction in the aarch64 assembly, so it can be safely ignored. LGTM. Yeah. I think that's just the zeroed memory at the end of the code buffer that gets disassembled into `udf`. So maybe I should just match `udf #0` without matching other masks. We could probably figure out why that tail is not `nop`-ped fully, but it does not seem worth investing into. 
------------- PR: https://git.openjdk.java.net/jdk/pull/8695 From smonteith at openjdk.java.net Fri May 13 12:56:46 2022 From: smonteith at openjdk.java.net (Stuart Monteith) Date: Fri, 13 May 2022 12:56:46 GMT Subject: RFR: 8286660: codestrings gtest fails on AArch64: "udf" in padding In-Reply-To: References: Message-ID: On Fri, 13 May 2022 09:10:09 GMT, Aleksey Shipilev wrote: > On hsdis-enabled AArch64 machine this test fails even with [JDK-8274039](https://bugs.openjdk.java.net/browse/JDK-8274039): > > > $ CONF=linux-aarch64-server-fastdebug make run-test TEST=jtreg:gtest/GTestWrapper.java > > [----------] 1 test from codestrings > [ RUN ] codestrings.validate_vm I didn't consider that; I assumed the code would be all zeros and nopped for a specific purpose. I'd be happy with `udf #0`, as anything else would likely be problematic garbage. The immediate field is the least significant 16 bits, so two ASCII bytes followed by two zero bytes would be disassembled as a udf instruction.
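The encoding point above can be illustrated with a small sketch. This is not HotSpot or hsdis code: the helper names `insn_from_bytes`, `is_udf`, `udf_immediate` and `is_udf_zero` are made up for this note. It assumes the A64 `UDF` encoding, in which bits 31..16 are zero and bits 15..0 hold the immediate, and a little-endian instruction stream.

```cpp
#include <cassert>
#include <cstdint>

// Assemble one 32-bit instruction word from four bytes of a little-endian
// instruction stream (hypothetical helper, for illustration only).
inline std::uint32_t insn_from_bytes(const unsigned char* p) {
  assert(p != nullptr);
  return static_cast<std::uint32_t>(p[0]) |
         (static_cast<std::uint32_t>(p[1]) << 8) |
         (static_cast<std::uint32_t>(p[2]) << 16) |
         (static_cast<std::uint32_t>(p[3]) << 24);
}

// Assumed encoding: "udf #imm16" is any word whose upper 16 bits are zero.
inline bool is_udf(std::uint32_t insn) {
  return (insn & 0xFFFF0000u) == 0;
}

// The immediate is the least significant 16 bits of the word.
inline std::uint32_t udf_immediate(std::uint32_t insn) {
  return insn & 0xFFFFu;
}

// The stricter check the thread settles on: only zeroed padding, i.e. "udf #0".
inline bool is_udf_zero(std::uint32_t insn) {
  return insn == 0;
}
```

With these helpers, zeroed padding is `udf #0`, while two ASCII bytes followed by two zero bytes also decode as `udf` but with a non-zero immediate, which is the kind of garbage that matching only `udf #0` rejects.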
------------- PR: https://git.openjdk.java.net/jdk/pull/8695 From shade at openjdk.java.net Fri May 13 13:05:38 2022 From: shade at openjdk.java.net (Aleksey Shipilev) Date: Fri, 13 May 2022 13:05:38 GMT Subject: RFR: 8286660: codestrings gtest fails on AArch64: "udf" in padding [v2] In-Reply-To: References: Message-ID: > On hsdis-enabled AArch64 machine this test fails even with [JDK-8274039](https://bugs.openjdk.java.net/browse/JDK-8274039): > > > $ CONF=linux-aarch64-server-fastdebug make run-test TEST=jtreg:gtest/GTestWrapper.java > > [----------] 1 test from codestrings > [ RUN ] codestrings.validate_vm Aleksey Shipilev has updated the pull request incrementally with one additional commit since the last revision: Accept udf 0 only ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/8695/files - new: https://git.openjdk.java.net/jdk/pull/8695/files/b6af01d4..33c87181 Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=8695&range=01 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=8695&range=00-01 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.java.net/jdk/pull/8695.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8695/head:pull/8695 PR: https://git.openjdk.java.net/jdk/pull/8695 From kvn at openjdk.java.net Fri May 13 14:52:42 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Fri, 13 May 2022 14:52:42 GMT Subject: RFR: 8271078: jdk/incubator/vector/Float128VectorTests.java failed a subtest [v3] In-Reply-To: References: Message-ID: On Fri, 13 May 2022 04:08:37 GMT, Dean Long wrote: >> This test was failing because the safepoint polling stub was only saving the low 8 bytes of XMM16-XMM31. We need to save all 16 bytes by default. I also added "wide" to the boolean parameter name to better reflect what it controls. And I made the asserts in save_live_registers() match what we have in restore_live_registers(). 
> > Dean Long has updated the pull request incrementally with one additional commit since the last revision: > > save_vectors --> save_wide_vectors Update is good. ------------- Marked as reviewed by kvn (Reviewer). PR: https://git.openjdk.java.net/jdk/pull/8690 From kvn at openjdk.java.net Fri May 13 14:52:43 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Fri, 13 May 2022 14:52:43 GMT Subject: RFR: 8271078: jdk/incubator/vector/Float128VectorTests.java failed a subtest [v3] In-Reply-To: <2uo04RHGgMjH5NJ2Vs6n7A9MTjAvcxWwiGoWbhakzoU=.ce458cf7-84e6-4014-9853-4d2ffddffbd1@github.com> References: <-kBZstxHuNYypM3lf6ZPEEdSByhxvX-HNA0T9clYeyg=.dd27fb71-6ddf-4d07-a946-0909b97cd9df@github.com> <2uo04RHGgMjH5NJ2Vs6n7A9MTjAvcxWwiGoWbhakzoU=.ce458cf7-84e6-4014-9853-4d2ffddffbd1@github.com> Message-ID: <1smhh7NUqxc_e-hmhx_UFfj-hZHbufMPH31B6SimSPg=.5d1caace-3f48-43b2-83c0-0005a8e641d4@github.com> On Fri, 13 May 2022 03:43:24 GMT, Dean Long wrote: > > Thank you for finding the cause and fixing the issue! > > Thanks Vladimir! I'm a little surprised the code has been broken since 2015 and we never noticed until now. Yes, it was obvious copy/paste issue we missed during review of original changes. ------------- PR: https://git.openjdk.java.net/jdk/pull/8690 From kvn at openjdk.java.net Fri May 13 15:08:55 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Fri, 13 May 2022 15:08:55 GMT Subject: RFR: 8284115: [IR Framework] Compilation is not found due to rare safepoint while dumping PrintIdeal/PrintOptoAssembly In-Reply-To: References: Message-ID: On Fri, 13 May 2022 07:45:18 GMT, Christian Hagedorn wrote: > This is yet another manifestation of the safepointing problem while printing a `PrintIdeal/PrintOptoAssembly` block. > > In this case here, a safepoint is done and the `` message is emitted while dumping a `PrintIdeal` block of `retainDenominator()` inside the `hotspot_pid` file. 
During this interruption, another test class method is enqueued for compilation, which is logged to the `hotspot_pid` file before the printing of the `PrintIdeal` block resumes: > > > # PrintIdeal output of retainDenominator() > 3 Start === 3 0 [[ 3 5 6 7 8 9 13 11 ]] #{0:control, 1:abIO, 2:memory, 3:rawptr:BotPTR, 4:return_address, 5:compiler/c2/irTests/DivLNodeIdealizationTests:NotNull *, 6:long, 7:half, 8:long, 9:half} > 36 CallStaticJava === 34 6 7 8 9 ( 35 1 1 1 1 1 26 1 27 1 ) [[ 37 ]] # Static uncommon_trap(reason='div0_check' action='maybe_recompile' debug_id='0') void ( int ) C=0.000100 DivLNodeIdealizationTests::retainDenominator @ bci:4 (line 130) !jvms: DivLNodeIdealizationTests::retainDenominator > > # Safepoint interruption > > > # Enqueuing of another test class method identityThird() > > > @ bci:4 (line 130) > > # Continue to dump PrintIdeal of retainDenominator() > 41 DivL === 33 26 13 [[ 42 ]] !jvms: DivLNodeIdealizationTests::retainDenominator @ bci:4 (line 130) > 9 Parm === 3 [[ 42 36 ]] ReturnAdr !jvms: DivLNodeIdealizationTests::retainDenominator @ bci:-1 (line 130) > > > The `HotSpotPidFileParser` looks for these enqueue messages containing the method name in order to find and correctly map the corresponding `PrintIdeal` and `PrintOptoAssembly` outputs. However, the `HotSpotPidFileParser` does not expect such an enqueuing message to be found inside a `PrintIdeal/PrintOptoAssembly` block and thus ignores it. As a result, we later do not parse the `PrintIdeal` and `PrintOptoAssembly` output of the method enqueued during the safepoint and fail with the assertion that we did not find any compilation output for the method. > > In the example above, the assertion says that we did not find the compilation output of `identityThird()`, whose enqueue message was ignored inside the `PrintIdeal` block of `retainDenominator()`.
> > The proposed fix is to make `HotSpotPidFileParser` aware of the possibility of a safepoint while reading the `PrintIdeal` or `PrintOptoAssembly` output and therefore add a check if there was a method enqueued for compilation while reading inside `BlockOutputReader::readBlock()`. > > Thanks, > Christian I am concerned that PrintIdeal output is interrupted by output from other threads. It may cause other issues again in the future. Can we redirect `PrintIdeal` output into a separate file or reorder output like `LogCompilation` since it is used for IR testing now (automatic tool)? Originally it did not matter since such output was not used in any tool. ------------- PR: https://git.openjdk.java.net/jdk/pull/8692 From psandoz at openjdk.java.net Fri May 13 15:18:02 2022 From: psandoz at openjdk.java.net (Paul Sandoz) Date: Fri, 13 May 2022 15:18:02 GMT Subject: RFR: 8271078: jdk/incubator/vector/Float128VectorTests.java failed a subtest [v3] In-Reply-To: References: Message-ID: On Fri, 13 May 2022 04:08:37 GMT, Dean Long wrote: >> This test was failing because the safepoint polling stub was only saving the low 8 bytes of XMM16-XMM31. We need to save all 16 bytes by default. I also added "wide" to the boolean parameter name to better reflect what it controls. And I made the asserts in save_live_registers() match what we have in restore_live_registers(). > > Dean Long has updated the pull request incrementally with one additional commit since the last revision: > > save_vectors --> save_wide_vectors I don't feel qualified to review the exact changes, but nice work finding the cause and a fix.
------------- PR: https://git.openjdk.java.net/jdk/pull/8690 From smonteith at openjdk.java.net Fri May 13 15:29:47 2022 From: smonteith at openjdk.java.net (Stuart Monteith) Date: Fri, 13 May 2022 15:29:47 GMT Subject: RFR: 8286660: codestrings gtest fails on AArch64: "udf" in padding [v2] In-Reply-To: References: Message-ID: On Fri, 13 May 2022 13:05:38 GMT, Aleksey Shipilev wrote: >> On hsdis-enabled AArch64 machine this test fails even with [JDK-8274039](https://bugs.openjdk.java.net/browse/JDK-8274039): >> >> >> $ CONF=linux-aarch64-server-fastdebug make run-test TEST=jtreg:gtest/GTestWrapper.java >> >> [----------] 1 test from codestrings >> [ RUN ] codestrings.validate_vm > > Aleksey Shipilev has updated the pull request incrementally with one additional commit since the last revision: > > Accept udf 0 only LGTM ------------- PR: https://git.openjdk.java.net/jdk/pull/8695 From kvn at openjdk.java.net Fri May 13 15:41:56 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Fri, 13 May 2022 15:41:56 GMT Subject: RFR: 8286638: C2: CmpU needs to do more precise over/underflow analysis [v3] In-Reply-To: References: Message-ID: On Fri, 13 May 2022 11:50:32 GMT, Emanuel Peter wrote: >> `CmpUNode::Value` already does an under/overflow analysis, in case we have an `AddI` or `SubI` above it. >> Instead of assuming the types are now the full `#int` range, we separately analyze the normal and the over/underflowed range. >> >> We get the two ranges `tr1` and `tr2`, which we now both compare with the right input `t2`, via `sub`. >> If both `cmp1` and `cmp2` are equal, for example both are `[ge]`, and below we have a Bool node that checks for `[lt]`, we know that this can never be true. >> >> However, I now encountered a case where `cmp1` was `[gt]` and `cmp2` was `[ge]`. Unfortunately, they are not the same, so we just discarded our analysis, and since it is an overflow case just cannot say anything. But we could actually know that both will never be `[lt]`. 
>> >> Thus, **I propose** to take the `meet` (the union of all possible results) of `cmp1` and `cmp2`. In this example, the meet would be `[ge]`. >> >> **Why is this important?** >> I hit a bug where a ConvI2L node was able to determine the range was impossible, ripping out the data-flow. But the range-check did not manage to do the same analysis, because of an underflow. This leads to some mangled code further down. >> >> **Detailed analysis of that case:** >> >> type `i: [minint...0]` >> access to `c[i-1]` >> >> **Range-check:** >> `int index = AddI(i, -1)` >> -> type index: [minint-1 ... -1] -> underflow >> We detect that this AddI may have 2 ranges: >> `tr1: int:<=-1` >> `tr2: int:max` (underflow: minint-1) >> >> We then check how these ranges compare to in2: >> `t2: int:>=0` >> >> For this we compute: >> `const Type* cmp1 = sub(tr1, t2);` -> TypeInt::CC_GT = [1] >> `const Type* cmp2 = sub(tr2, t2);` -> TypeInt::CC_GE = [0...1] >> >> But then, we only do something with this result if `cmp1 == cmp2`. >> We never detect that the `Bool [lt]` could never be true. >> >> >> **Data-flow:** >> `long index = ConvI2L( AddI(i, -1) )` >> -> type of `ConvI2L: [0...maxint-1]` >> -> why do we know this? Because this is before an array access. We assume range-check guarantees index in range `[0...c.size()-1]`, and `c.size()<=maxint`. >> Then there is a push_thru_add, and we get: >> `long index = AddL( ConvI2L(i), -1)` >> -> type of new `ConvI2L: [1...maxint-1]` - because we correct the lo by 1 for the add. Somehow we do not adjust hi; in my opinion it should now be maxint, to correct by 1. >> Consequence: if hi is maxint or maxint-1, there is no overflow. >> Then, we statically detect that: >> type `i: [minint...0]` >> type `ConvI2L: [1...maxint-1]` >> -> filter results in `TOP` -> data-flow is eliminated successfully. >> >> >> Added **regression test** that matches this example above. >> Larger test suite passes.
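The `cmp1`/`cmp2` reasoning quoted above can be sketched with a simplified model. This is not the HotSpot implementation: `CCRange`, `meet`, `may_be_lt` and the two `*_proves_never_lt` helpers are hypothetical names, and the condition-code types are modelled as plain inclusive intervals over {-1, 0, 1} (less, equal, greater), with `CC_GT = [1]` and `CC_GE = [0...1]` as given in the analysis and `CC_LT = [-1]` assumed by analogy.

```cpp
#include <algorithm>
#include <cassert>

// Simplified stand-in for a condition-code type: an inclusive interval over
// {-1, 0, 1}, where -1 means "less", 0 "equal" and 1 "greater".
struct CCRange {
  int lo;
  int hi;
  CCRange(int l, int h) : lo(l), hi(h) { assert(-1 <= l && l <= h && h <= 1); }
  bool operator==(const CCRange& o) const { return lo == o.lo && hi == o.hi; }
};

// Model counterparts of the condition-code types named in the analysis.
const CCRange CC_LT(-1, -1);
const CCRange CC_GT(1, 1);
const CCRange CC_GE(0, 1);

// Meet in this model: the smallest interval covering both inputs.
inline CCRange meet(const CCRange& a, const CCRange& b) {
  return CCRange(std::min(a.lo, b.lo), std::max(a.hi, b.hi));
}

// Could a Bool [lt] test on this compare result ever be true?
inline bool may_be_lt(const CCRange& cmp) {
  return cmp.lo <= -1;
}

// Behaviour described as the status quo: the result is used only if both
// ranges give exactly the same answer.
inline bool equality_rule_proves_never_lt(const CCRange& cmp1, const CCRange& cmp2) {
  return cmp1 == cmp2 && !may_be_lt(cmp1);
}

// Proposed behaviour: reason about the meet of both answers.
inline bool meet_rule_proves_never_lt(const CCRange& cmp1, const CCRange& cmp2) {
  return !may_be_lt(meet(cmp1, cmp2));
}
```

In this model, `cmp1 = CC_GT` and `cmp2 = CC_GE` differ, so the equality-based rule learns nothing, while their meet is `CC_GE`, which excludes `[lt]`.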
> > Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: > > fixed bug number Nice analysis. ------------- Marked as reviewed by kvn (Reviewer). PR: https://git.openjdk.java.net/jdk/pull/8679 From ngasson at openjdk.java.net Fri May 13 16:04:54 2022 From: ngasson at openjdk.java.net (Nick Gasson) Date: Fri, 13 May 2022 16:04:54 GMT Subject: RFR: 8286660: codestrings gtest fails on AArch64: "udf" in padding [v2] In-Reply-To: References: Message-ID: On Fri, 13 May 2022 13:05:38 GMT, Aleksey Shipilev wrote: >> On hsdis-enabled AArch64 machine this test fails even with [JDK-8274039](https://bugs.openjdk.java.net/browse/JDK-8274039): >> >> >> $ CONF=linux-aarch64-server-fastdebug make run-test TEST=jtreg:gtest/GTestWrapper.java >> >> [----------] 1 test from codestrings >> [ RUN ] codestrings.validate_vm > > Aleksey Shipilev has updated the pull request incrementally with one additional commit since the last revision: > > Accept udf 0 only Marked as reviewed by ngasson (Reviewer). ------------- PR: https://git.openjdk.java.net/jdk/pull/8695 From xliu at openjdk.java.net Fri May 13 17:11:48 2022 From: xliu at openjdk.java.net (Xin Liu) Date: Fri, 13 May 2022 17:11:48 GMT Subject: RFR: 8286104: use aggressive liveness for unstable_if traps In-Reply-To: <6j5-fkcqNLNpblA4P2i4A2EGN40O3_mprSKLAxt2wSY=.e35166cf-2249-4d48-90e6-8e4dfda941a9@github.com> References: <6j5-fkcqNLNpblA4P2i4A2EGN40O3_mprSKLAxt2wSY=.e35166cf-2249-4d48-90e6-8e4dfda941a9@github.com> Message-ID: On Fri, 6 May 2022 22:40:39 GMT, Vladimir Kozlov wrote: >> I found that some phi nodes are useful because its uses are uncommon_trap nodes. In worse case, it would hinder boxing object eliminating and scalar replacement. Test.java of JDK-8286104 reveals this issue. This patch allows c2 parser to collect liveness based on next bci for unstable_if traps. In most cases, next bci is closer to exits, so live locals are diminishing. 
It helps to reduce the number of inputs of unstable_if traps. >> >> This is not a REDO of Optimization of Box nodes in uncommon_trap(JDK-8261137). Two patches are orthogonal. I adapt test from [TestEliminateBoxInDebugInfo.java](https://github.com/openjdk/jdk/pull/2401/files#diff-49b2e38825aa4c28ca196bdc70c3cbecc2e835c2899f4f393527df4796b177ea), so part of credit goes to the original author. I found that Scalar replacement can take care of the object `Integer ii = Integer.valueOf(value)` in original test even if it can't be removed by later inliner. I tweak the profiling data of Integer.valueOf() to hinder scalar replacement. >> >> This patch can cover the problem discovered by JDK-8261137. I ran the microbench and got 9x speedup on x86_64. It's almost the same as JDK-8261137. >> >> Before: >> >> Benchmark Mode Cnt Score Error Units >> MyBenchmark.testMethod avgt 10 32.776 ± 0.075 us/op >> >> After: >> >> Benchmark Mode Cnt Score Error Units >> MyBenchmark.testMethod avgt 10 3.656 ± 0.006 us/op >> ``` >> >> Testing >> I ran tier1 test with and without `-XX:+DeoptimizeALot`. No regression has been found yet. > > Item 2. Yes, that is what will happen and that is why we may do this optimization. Your original words were confusing. > > Again, MDO may not be updated for a rare case even after running in Interpreter for some time. As a result recompiled code will be the same and we again hit unc trap. > > In my additional comment I stated that placing uncommon trap to BC after merge point is wrong. You may not have all info in general cases (several branches merging to the same BC). hi, @vnkozlov I tried this idea with another approach. I moved it from parser to optimizer, right after igvn. This approach keeps bci of uncommon_trap. I remember a speculative bci in IfNode when it has a stable_if. One corner case is that 'fold-compares' of IfNode may share an uncommon_trap. The speculative bci would be wrong if this transformation occurs, so I drop this case.
I also came up with an idea to work around the case where the current bytecode, e.g. 'if_acmpeq', does reference scalarized objects. The operands are on the stack of uncommon_trap's JVMState. If a to-be-killed local variable is the same as either lhs or rhs, just be conservative. @TobiHartmann With that configuration, C2 compiles the IRTest before the method is mature. The parser doesn't generate unstable_if at all. I removed the uncommon_trap check from the test. thanks, --lx ------------- PR: https://git.openjdk.java.net/jdk/pull/8545 From vlivanov at openjdk.java.net Fri May 13 18:02:47 2022 From: vlivanov at openjdk.java.net (Vladimir Ivanov) Date: Fri, 13 May 2022 18:02:47 GMT Subject: RFR: 8271078: jdk/incubator/vector/Float128VectorTests.java failed a subtest [v3] In-Reply-To: References: Message-ID: On Fri, 13 May 2022 04:08:37 GMT, Dean Long wrote: >> This test was failing because the safepoint polling stub was only saving the low 8 bytes of XMM16-XMM31. We need to save all 16 bytes by default. I also added "wide" to the boolean parameter name to better reflect what it controls. And I made the asserts in save_live_registers() match what we have in restore_live_registers(). > > Dean Long has updated the pull request incrementally with one additional commit since the last revision: > > save_vectors --> save_wide_vectors Nice catch, Dean! Looks good. ------------- Marked as reviewed by vlivanov (Reviewer).
PR: https://git.openjdk.java.net/jdk/pull/8690 From vlivanov at openjdk.java.net Fri May 13 19:17:46 2022 From: vlivanov at openjdk.java.net (Vladimir Ivanov) Date: Fri, 13 May 2022 19:17:46 GMT Subject: RFR: 8283689: Update the foreign linker VM implementation [v7] In-Reply-To: <9NhIJsBLpV42NNz7rjhBu_cEvljMy1KIAA7IdTz1aGM=.1ac390e6-4834-4616-b85d-fda842c8e4fa@github.com> References: <1Y6gupd3SigdhlFjonKu4x4z-0mwPwCc-LJsqt6pp4c=.201b0c49-4887-4440-8526-31420fa3dda3@github.com> <_lwAL7Yg4Rr98gmWeQisR1ioc8MkVK87npZEUbB4vOw=.6434e6a4-35b9-4f81-9df3-d71973f1d75e@github.com> <9J0HneQ8kNy0t1-JDUQsXzoj4ljYwg80jiespX8laL8=.c02f9a8e-04b5-40db-8024-cdec556fcc53@github.com> <95e2d32uLxJbWldoqsr9yAoT3LD8Yyd6cLmnFuvSEOI=.4e961828-6086-4c63-9bc3-6bb60f8a5931@github.com> <9NhIJsBLpV42NNz7rjhBu_cEvljMy1KIAA7IdTz1aGM=.1ac390e6-4834-4616-b85d-fda842c8e4fa@github.com> Message-ID: On Thu, 12 May 2022 17:26:44 GMT, Jorn Vernee wrote: >>> Do nothing special for async exceptions. i.e. if they happen anywhere between 1. and 6. they will end up in one of the fallback handlers and the VM will be terminated. >> >> My understanding is that if they materialize while we're executing the upcall Java code, if that code has a try/catch block, we will go there, rather than crash the VM. >> >> In other words, IMHO the only problem with async exception is if they occur _after_ the Java user code has completed, because that will crash the Java adapter, this preventing it from returning to native call cleanly. >> >> So, either we disable async exceptions during that phase (e.g. after user code has executed, but before we return back to native code), or we just punt and stop. Since this seems like a corner^3 case, and since there are also other issue with upcalls that can occur if other threads do not cooperate (e.g. 
an upcall can get stuck in an infinite safepoint if the VM exits while an async native thread runs the upcall), and given that obtaining a linker is a restricted operation anyway, I don't think we should bend over backwards to try to add 1% more safety to something that's unavoidably sharp anyways. > > Ok. Then, if no one objects, I will leave this area as-is for now. (and perhaps come back to this issue in the future, if it becomes more pressing). > > (I'll also note that this issue already exists in the current code that's in mainline. So, it seems fair to address this as a followup as well, if needed). I don't see a way to reliably handle async exceptions purely with `try/catch`. In the end, there's always a safepoint poll right before returning from a method where a new exception can be installed. So, there's always a small chance of observing a pending exception on the VM side irrespective of how hard you try on the Java side. From a reliability perspective, I find it important to gracefully handle such corner cases. But I'm fine with addressing the problem separately. As an alternative solution, exception handling for upcalls can be handled by another upcall (into a catch handler when a pending exception is encountered). As a bonus, it allows handling repeated exceptions. In the worst case, it would manifest as a hang (when async exceptions are continuously delivered). Still, some special handling is needed for stack overflow errors. (Not sure how those are handled now. Are they?)
------------- PR: https://git.openjdk.java.net/jdk/pull/7959 From vlivanov at openjdk.java.net Fri May 13 19:24:41 2022 From: vlivanov at openjdk.java.net (Vladimir Ivanov) Date: Fri, 13 May 2022 19:24:41 GMT Subject: RFR: 8283689: Update the foreign linker VM implementation [v17] In-Reply-To: References: Message-ID: On Thu, 12 May 2022 16:58:36 GMT, Jorn Vernee wrote: >> Hi, >> >> This PR updates the VM implementation of the foreign linker, by bringing over commits from the panama-foreign repo. >> >> This is split off from the main JEP integration for 19, since we have limited resources to handle this. As such, this PR might fall over to 20, but it would be nice if we could get it into 19. >> >> I've written up an overview of the Linker architecture here: http://cr.openjdk.java.net/~jvernee/docs/FL_Overview.html it might be useful to read that first. >> >> This patch moves from the "legacy" implementation, to what is currently implemented in the panama-foreign repo, except for replacing the use of method handle combinators with ASM. That will come in a later path. To recap. This PR contains the following changes: >> >> 1. VM stubs for downcalls are now generated up front, instead of lazily by C2 [1]. >> 2. the VM support for upcalls/downcalls now support all possible call shapes. And VM stubs and Java code implementing the buffered invocation strategy has been removed [2], [3], [4], [5]. >> 3. The existing C2 intrinsification support for the `linkToNative` method handle linker was no longer needed and has been removed [6] (support might be re-added in another form later). >> 4. Some other cleanups, such as: OptimizedEntryBlob (for upcalls) now implements RuntimeBlob directly. Binding to java classes has been rewritten to use javaClasses.h/cpp (this wasn't previously possible due to these java classes being in an incubator module) [7], [8], [9]. >> >> While the patch mostly consists of VM changes, there are also some Java changes to support (2). 
>> >> The original commit structure has been mostly retained, so it might be useful to look at a specific commit, or the corresponding patch in the [panama-foreign](https://github.com/openjdk/panama-foreign/pulls?q=is%3Apr) repo as well. I've also left some inline comments to explain some of the changes, which will hopefully make reviewing easier. >> >> Testing: Tier1-4 >> >> Thanks, >> Jorn >> >> [1]: https://github.com/openjdk/jdk/pull/7959/commits/048b88156814579dca1f70742061ad24942fd358 >> [2]: https://github.com/openjdk/jdk/pull/7959/commits/2fbbef472b4c2b4fee5ede2f18cd81ab61e88f49 >> [3]: https://github.com/openjdk/jdk/pull/7959/commits/8a957a4ed9cc8d1f708ea8777212eb51ab403dc3 >> [4]: https://github.com/openjdk/jdk/pull/7959/commits/35ba1d964f1de4a77345dc58debe0565db4b0ff3 >> [5]: https://github.com/openjdk/jdk/pull/7959/commits/4e72aae22920300c5ffa16fed805b62ed9092120 >> [6]: https://github.com/openjdk/jdk/pull/7959/commits/08e22e1b468c5c8f0cfd7135c72849944068aa7a >> [7]: https://github.com/openjdk/jdk/pull/7959/commits/451cd9edf54016c182dab21a8b26bd8b609fc062 >> [8]: https://github.com/openjdk/jdk/pull/7959/commits/4c851d2795afafec3a3ab17f4142ee098692068f >> [9]: https://github.com/openjdk/jdk/pull/7959/commits/d025377799424f31512dca2ffe95491cd5ae22f9 > > Jorn Vernee has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 98 commits: > > - Merge branch 'master' into JEP-19-VM-IMPL2 > - Undo spurious changes. > - Merge branch 'JEP-19-VM-IMPL2' of https://github.com/JornVernee/jdk into JEP-19-VM-IMPL2 > - Apply copyright year updates per request of @nick-arm > > Co-authored-by: Nick Gasson > - Fix overwritten copyright years. > - Missed 2 years > - Update Oracle copyright years > - Revert "Block async exceptions during upcalls" > > This reverts commit b29ad8f46732666f2d07e63ce8701b1eb7bed790. > - Block async exceptions during upcalls > - Merge branch 'foreign-preview-m' into JEP-19-VM-IMPL2 > - ... 
and 88 more: https://git.openjdk.java.net/jdk/compare/2c5d1362...f55b6c59 Looks good. src/hotspot/cpu/aarch64/macroAssembler_aarch64.cpp line 5531: > 5529: } > 5530: > 5531: // On64 bit we will store integer like items to the stack as Missing space. src/hotspot/cpu/x86/macroAssembler_x86.cpp line 933: > 931: } else { > 932: assert(dst.is_single_reg(), "not a stack pair: (%s, %s), (%s, %s)", > 933: src.first()->name(), src.second()->name(), dst.first()->name(), dst.second()->name()); Still not indented properly. ------------- Marked as reviewed by vlivanov (Reviewer). PR: https://git.openjdk.java.net/jdk/pull/7959 From dlong at openjdk.java.net Fri May 13 19:38:49 2022 From: dlong at openjdk.java.net (Dean Long) Date: Fri, 13 May 2022 19:38:49 GMT Subject: RFR: 8271078: jdk/incubator/vector/Float128VectorTests.java failed a subtest [v3] In-Reply-To: References: Message-ID: <2IFdQaUvfHBa3chi5VliZ9qPCPIXU6_6hOI3wbdQVBs=.0f3a83cc-53eb-4008-aa71-88c8f79629b3@github.com> On Fri, 13 May 2022 04:08:37 GMT, Dean Long wrote: >> This test was failing because the safepoint polling stub was only saving the low 8 bytes of XMM16-XMM31. We need to save all 16 bytes by default. I also added "wide" to the boolean parameter name to better reflect what it controls. And I made the asserts in save_live_registers() match what we have in restore_live_registers(). > > Dean Long has updated the pull request incrementally with one additional commit since the last revision: > > save_vectors --> save_wide_vectors There is one test failing with this assert: assert(((!attributes->uses_vl()) || (attributes->get_vector_len() == AVX_512bit) || (!_legacy_mode_vl) || (attributes->is_legacy_mode())),"XMM register should be 0-15"); I'll need to investigate. 
------------- PR: https://git.openjdk.java.net/jdk/pull/8690 From dlong at openjdk.java.net Fri May 13 19:46:36 2022 From: dlong at openjdk.java.net (Dean Long) Date: Fri, 13 May 2022 19:46:36 GMT Subject: RFR: 8271078: jdk/incubator/vector/Float128VectorTests.java failed a subtest [v3] In-Reply-To: References: Message-ID: On Fri, 13 May 2022 04:08:37 GMT, Dean Long wrote: >> This test was failing because the safepoint polling stub was only saving the low 8 bytes of XMM16-XMM31. We need to save all 16 bytes by default. I also added "wide" to the boolean parameter name to better reflect what it controls. And I made the asserts in save_live_registers() match what we have in restore_live_registers(). > > Dean Long has updated the pull request incrementally with one additional commit since the last revision: > > save_vectors --> save_wide_vectors -XX:+UseKNLSetting is a problem. ------------- PR: https://git.openjdk.java.net/jdk/pull/8690 From kvn at openjdk.java.net Fri May 13 19:52:38 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Fri, 13 May 2022 19:52:38 GMT Subject: RFR: 8271078: jdk/incubator/vector/Float128VectorTests.java failed a subtest [v3] In-Reply-To: References: Message-ID: On Fri, 13 May 2022 04:08:37 GMT, Dean Long wrote: >> This test was failing because the safepoint polling stub was only saving the low 8 bytes of XMM16-XMM31. We need to save all 16 bytes by default. I also added "wide" to the boolean parameter name to better reflect what it controls. And I made the asserts in save_live_registers() match what we have in restore_live_registers(). > > Dean Long has updated the pull request incrementally with one additional commit since the last revision: > > save_vectors --> save_wide_vectors Why are you not using `vextractf128_high` if you need to save the upper 128 bits?
------------- PR: https://git.openjdk.java.net/jdk/pull/8690 From jvernee at openjdk.java.net Fri May 13 20:03:11 2022 From: jvernee at openjdk.java.net (Jorn Vernee) Date: Fri, 13 May 2022 20:03:11 GMT Subject: RFR: 8283689: Update the foreign linker VM implementation [v17] In-Reply-To: References: Message-ID: On Fri, 13 May 2022 19:16:36 GMT, Vladimir Ivanov wrote: >> Jorn Vernee has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 98 commits: >> >> - Merge branch 'master' into JEP-19-VM-IMPL2 >> - Undo spurious changes. >> - Merge branch 'JEP-19-VM-IMPL2' of https://github.com/JornVernee/jdk into JEP-19-VM-IMPL2 >> - Apply copyright year updates per request of @nick-arm >> >> Co-authored-by: Nick Gasson >> - Fix overwritten copyright years. >> - Missed 2 years >> - Update Oracle copyright years >> - Revert "Block async exceptions during upcalls" >> >> This reverts commit b29ad8f46732666f2d07e63ce8701b1eb7bed790. >> - Block async exceptions during upcalls >> - Merge branch 'foreign-preview-m' into JEP-19-VM-IMPL2 >> - ... and 88 more: https://git.openjdk.java.net/jdk/compare/2c5d1362...f55b6c59 > > src/hotspot/cpu/aarch64/macroAssembler_aarch64.cpp line 5531: > >> 5529: } >> 5530: >> 5531: // On64 bit we will store integer like items to the stack as > > Missing space. Oh, looks like i deleted the wrong space by accident. > src/hotspot/cpu/x86/macroAssembler_x86.cpp line 933: > >> 931: } else { >> 932: assert(dst.is_single_reg(), "not a stack pair: (%s, %s), (%s, %s)", >> 933: src.first()->name(), src.second()->name(), dst.first()->name(), dst.second()->name()); > > Still not indented properly. Shouldn't there be a 2-space indentation wrt the assert here? I could also indent all the arguments to be aligned with the format string, if that seems better. 
------------- PR: https://git.openjdk.java.net/jdk/pull/7959 From jvernee at openjdk.java.net Fri May 13 20:15:47 2022 From: jvernee at openjdk.java.net (Jorn Vernee) Date: Fri, 13 May 2022 20:15:47 GMT Subject: RFR: 8283689: Update the foreign linker VM implementation [v7] In-Reply-To: References: <1Y6gupd3SigdhlFjonKu4x4z-0mwPwCc-LJsqt6pp4c=.201b0c49-4887-4440-8526-31420fa3dda3@github.com> <_lwAL7Yg4Rr98gmWeQisR1ioc8MkVK87npZEUbB4vOw=.6434e6a4-35b9-4f81-9df3-d71973f1d75e@github.com> <9J0HneQ8kNy0t1-JDUQsXzoj4ljYwg80jiespX8laL8=.c02f9a8e-04b5-40db-8024-cdec556fcc53@github.com> <95e2d32uLxJbWldoqsr9yAoT3LD8Yyd6cLmnFuvSEOI=.4e961828-6086-4c63-9bc3-6bb60f8a5931@github.com> <9NhIJsBLpV42NNz7rjhBu_cEvljMy1KIAA7IdTz1aGM=.1ac390e6-4834-4616-b85d-fda842c8e4fa@github.com> Message-ID: On Fri, 13 May 2022 19:13:46 GMT, Vladimir Ivanov wrote: >> Ok. Then, if no one objects, I will leave this area as-is for now. (and perhaps come back to this issue in the future, if it becomes more pressing). >> >> (I'll also note that this issue already exists in the current code that's in mainline. So, it seems fair to address this as a followup as well, if needed). > > I don't see a way to reliably handle async exceptions purely with `try/catch`. In the end, there's always a safepoint poll right before returning from a method where new exception can be installed. So, there's always a small chance present to observe a pending exception on VM side irrespective of how hard you try on Java side. > > From reliability perspective, I find it important to gracefully handle such corner cases. But I'm fine with addressing the problem separately. > > As an alternative solution, exception handling for upcalls can be handled by another upcall (into catch handler when pending exception is encountered). As a bonus, it allows to handle repeated exceptions. In the worst case, it would manifest as a hang (when async exceptions are continuously delivered). 
Still, some special handling is needed for stack overflow errors. (Not sure how those are handled now. Are they?) SOE (of the Java exception kind) is not specially handled right now. I think the same rule applies there: we can't unwind or return to native frames (at least not without some guidance from the user). I've filed an issue here to capture some of the discussion: https://bugs.openjdk.java.net/browse/JDK-8286761 ------------- PR: https://git.openjdk.java.net/jdk/pull/7959 From vlivanov at openjdk.java.net Fri May 13 20:50:46 2022 From: vlivanov at openjdk.java.net (Vladimir Ivanov) Date: Fri, 13 May 2022 20:50:46 GMT Subject: RFR: 8283689: Update the foreign linker VM implementation [v17] In-Reply-To: References: Message-ID: On Fri, 13 May 2022 19:59:40 GMT, Jorn Vernee wrote: >> src/hotspot/cpu/x86/macroAssembler_x86.cpp line 933: >> >>> 931: } else { >>> 932: assert(dst.is_single_reg(), "not a stack pair: (%s, %s), (%s, %s)", >>> 933: src.first()->name(), src.second()->name(), dst.first()->name(), dst.second()->name()); >> >> Still not indented properly. > > Shouldn't there be a 2-space indentation wrt the assert here? I could also indent all the arguments to be aligned with the format string, if that seems better. It's preferred to indent multi-line argument lists on the column where argument list starts. ------------- PR: https://git.openjdk.java.net/jdk/pull/7959 From vlivanov at openjdk.java.net Fri May 13 20:50:46 2022 From: vlivanov at openjdk.java.net (Vladimir Ivanov) Date: Fri, 13 May 2022 20:50:46 GMT Subject: RFR: 8283689: Update the foreign linker VM implementation [v17] In-Reply-To: References: Message-ID: On Fri, 13 May 2022 20:46:19 GMT, Vladimir Ivanov wrote: >> Shouldn't there be a 2-space indentation wrt the assert here? I could also indent all the arguments to be aligned with the format string, if that seems better. > > It's preferred to indent multi-line argument lists on the column where argument list starts. 
assert(dst.is_single_reg(), "not a stack pair: (%s, %s), (%s, %s)",
       src.first()->name(), src.second()->name(), dst.first()->name(), dst.second()->name());
------------- PR: https://git.openjdk.java.net/jdk/pull/7959 From jvernee at openjdk.java.net Fri May 13 21:01:10 2022 From: jvernee at openjdk.java.net (Jorn Vernee) Date: Fri, 13 May 2022 21:01:10 GMT Subject: RFR: 8283689: Update the foreign linker VM implementation [v18] In-Reply-To: References: Message-ID: > Hi, > > This PR updates the VM implementation of the foreign linker, by bringing over commits from the panama-foreign repo. > > This is split off from the main JEP integration for 19, since we have limited resources to handle this. As such, this PR might fall over to 20, but it would be nice if we could get it into 19. > > I've written up an overview of the Linker architecture here: http://cr.openjdk.java.net/~jvernee/docs/FL_Overview.html it might be useful to read that first. > > This patch moves from the "legacy" implementation, to what is currently implemented in the panama-foreign repo, except for replacing the use of method handle combinators with ASM. That will come in a later patch. To recap, this PR contains the following changes: > > 1. VM stubs for downcalls are now generated up front, instead of lazily by C2 [1]. > 2. The VM support for upcalls/downcalls now supports all possible call shapes. The VM stubs and Java code implementing the buffered invocation strategy have been removed [2], [3], [4], [5]. > 3. The existing C2 intrinsification support for the `linkToNative` method handle linker was no longer needed and has been removed [6] (support might be re-added in another form later). > 4. Some other cleanups, such as: OptimizedEntryBlob (for upcalls) now implements RuntimeBlob directly. Binding to java classes has been rewritten to use javaClasses.h/cpp (this wasn't previously possible due to these java classes being in an incubator module) [7], [8], [9].
> > While the patch mostly consists of VM changes, there are also some Java changes to support (2). > > The original commit structure has been mostly retained, so it might be useful to look at a specific commit, or the corresponding patch in the [panama-foreign](https://github.com/openjdk/panama-foreign/pulls?q=is%3Apr) repo as well. I've also left some inline comments to explain some of the changes, which will hopefully make reviewing easier. > > Testing: Tier1-4 > > Thanks, > Jorn > > [1]: https://github.com/openjdk/jdk/pull/7959/commits/048b88156814579dca1f70742061ad24942fd358 > [2]: https://github.com/openjdk/jdk/pull/7959/commits/2fbbef472b4c2b4fee5ede2f18cd81ab61e88f49 > [3]: https://github.com/openjdk/jdk/pull/7959/commits/8a957a4ed9cc8d1f708ea8777212eb51ab403dc3 > [4]: https://github.com/openjdk/jdk/pull/7959/commits/35ba1d964f1de4a77345dc58debe0565db4b0ff3 > [5]: https://github.com/openjdk/jdk/pull/7959/commits/4e72aae22920300c5ffa16fed805b62ed9092120 > [6]: https://github.com/openjdk/jdk/pull/7959/commits/08e22e1b468c5c8f0cfd7135c72849944068aa7a > [7]: https://github.com/openjdk/jdk/pull/7959/commits/451cd9edf54016c182dab21a8b26bd8b609fc062 > [8]: https://github.com/openjdk/jdk/pull/7959/commits/4c851d2795afafec3a3ab17f4142ee098692068f > [9]: https://github.com/openjdk/jdk/pull/7959/commits/d025377799424f31512dca2ffe95491cd5ae22f9 Jorn Vernee has updated the pull request incrementally with two additional commits since the last revision: - indentation - fix space ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/7959/files - new: https://git.openjdk.java.net/jdk/pull/7959/files/f55b6c59..2ea5bc94 Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=7959&range=17 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=7959&range=16-17 Stats: 2 lines in 2 files changed: 0 ins; 0 del; 2 mod Patch: https://git.openjdk.java.net/jdk/pull/7959.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/7959/head:pull/7959 PR: 
https://git.openjdk.java.net/jdk/pull/7959 From jvernee at openjdk.java.net Fri May 13 21:01:11 2022 From: jvernee at openjdk.java.net (Jorn Vernee) Date: Fri, 13 May 2022 21:01:11 GMT Subject: RFR: 8283689: Update the foreign linker VM implementation [v18] In-Reply-To: References: Message-ID: On Fri, 13 May 2022 20:47:22 GMT, Vladimir Ivanov wrote: >> It's preferred to indent multi-line argument lists on the column where argument list starts. > > assert(dst.is_single_reg(), "not a stack pair: (%s, %s), (%s, %s)", > src.first()->name(), src.second()->name(), dst.first()->name(), dst.second()->name()); Done ------------- PR: https://git.openjdk.java.net/jdk/pull/7959 From vlivanov at openjdk.java.net Fri May 13 21:06:43 2022 From: vlivanov at openjdk.java.net (Vladimir Ivanov) Date: Fri, 13 May 2022 21:06:43 GMT Subject: RFR: 8286638: C2: CmpU needs to do more precise over/underflow analysis [v3] In-Reply-To: References: Message-ID: On Fri, 13 May 2022 11:50:32 GMT, Emanuel Peter wrote: >> `CmpUNode::Value` already does an under/overflow analysis, in case we have an `AddI` or `SubI` above it. >> Instead of assuming the types are now the full `#int` range, we separately analyze the normal and the over/underflowed range. >> >> We get the two ranges `tr1` and `tr2`, which we now both compare with the right input `t2`, via `sub`. >> If both `cmp1` and `cmp2` are equal, for example both are `[ge]`, and below we have a Bool node that checks for `[lt]`, we know that this can never be true. >> >> However, I now encountered a case where `cmp1` was `[gt]` and `cmp2` was `[ge]`. Unfortunately, they are not the same, so we just discarded our analysis, and since it is an overflow case just cannot say anything. But we could actually know that both will never be `[lt]`. >> >> Thus, **I propose** to take the `meet` (the union of all possible results of `cmp1` and `cmp2`. In this example, the meet would be `[ge]`. 
>> >> **Why is this important?** >> I got a bug, where a ConvI2L node was able to determine the range was impossible, ripping out the data-flow. But the range-check did not manage to do the same analysis, because of an underflow. This leads to some mangled code further down. >> >> **Detailed analysis of that case:** >> >> type `i: [minint...0]` >> access to `c[i-1]` >> >> **Range-check:** >> `int index = AddI(i, -1)` >> -> type index: [minint-1 ... -1] -> underflow >> We detect that this AddI may have 2 ranges: >> `tr1: int:<=-1` >> `tr2: int:max `(underflow: minint-1) >> >> We then check how these ranges compare to in2: >> `t2: int:>=0` >> >> For this we compute: >> `const Type* cmp1 = sub(tr1, t2);` -> TypeInt::CC_GT = [1] >> `const Type* cmp2 = sub(tr2, t2);` -> TypeInt::CC_GE = [0...1] >> >> But then, we only do something with this result if `cmp1 == cmp2`. >> We never detect that the `Bool [lt] `could never be true. >> >> >> **Data-flow:** >> `long index = ConvI2L( AddI(i, -1) )` >> -> type of` ConvI2L: [0...maxint-1]` >> -> why do we know this? Because this is before an array access. We assume range-check guarantees index in range `[0...c.size()-1]`, and `c.size()<=maxint`. >> Then there is a push_thru_add, and we get: >> `long index = AddL( ConvI2L(i), -1)` >> -> type of new `ConvI2L: [1...maxint-1]` - because we correct the lo by 1 for the add. Somehow we do not adjust hi, in my opinion it should now be maxint, to correct by 1. >> Consequence: if hi is maxint or maxint-1, there is no overflow. >> Then, we statically detect that: >> type `i: [minint...0]` >> type` ConvI2L: [1...maxint-1]` >> -> filter results in `TOP` -> data-flow is eliminated sucessfully. >> >> >> Added **regression test** that matches this example above. >> Larger test suite passes. > > Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: > > fixed bug number Looks good. ------------- Marked as reviewed by vlivanov (Reviewer). 
PR: https://git.openjdk.java.net/jdk/pull/8679 From dlong at openjdk.java.net Sat May 14 00:48:50 2022 From: dlong at openjdk.java.net (Dean Long) Date: Sat, 14 May 2022 00:48:50 GMT Subject: RFR: 8271078: jdk/incubator/vector/Float128VectorTests.java failed a subtest [v3] In-Reply-To: References: Message-ID: <1aevcU3p-YZMOQZAecyPTKjyN4pD1mSf8ZSB8DU9q5k=.2d66afd3-3a1b-4335-9ea2-e233685895f0@github.com> On Fri, 13 May 2022 19:48:58 GMT, Vladimir Kozlov wrote:

> Why are you not using `vextractf128_high` if you need to save the 128 upper bits?

The problem is the low 128 bits for XMM16-XMM31. It requires avx512vl to read/write fewer than 512 bits. I see two options:

1) Save/restore all 512 bits if avx512vl is not supported.

    int vector_len = VM_Version::supports_avx512vl() ? Assembler::AVX_128bit : Assembler::AVX_512bit;
    for (int n = 16; n < num_xmm_regs; n++) {
      __ evmovdqul(Address(rsp, base_addr+(off++*64)), as_XMMRegister(n), vector_len);
    }

2) Save/restore 64 or 128 bits

    if (VM_Version::supports_avx512vlbwdq()) {
      __ evmovdqul(Address(rsp, base_addr+(off++*64)), as_XMMRegister(n), Assembler::AVX_128bit);
    } else {
      __ movsd(Address(rsp, base_addr+(off++*64)), as_XMMRegister(n));
    }

1) seems safer. 2) is fragile because it needs to match what `reg_class_dynamic vectorx_reg_vlbwdq` is doing in x86.ad. If reviewers like 2) better then I should probably create a new function, like c2_uses_hi_xmm_vectors(), that both x86.ad and save/restore can use.
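For readers following the two options above, here is a minimal sketch of the decision in option 1, written as a standalone C++ model rather than as HotSpot assembler code. `CpuFeatures`, `save_width_bytes` and `slot_offset` are hypothetical names introduced only for illustration; the 64-byte slot size is taken from the `off++*64` addressing in the snippet above, and the register range 16..31 is assumed from "XMM16-XMM31".

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for the CPU feature query; this is not a HotSpot API.
struct CpuFeatures {
  bool avx512vl;  // true if narrow (128/256-bit) EVEX moves are available
};

// Option 1: how many bytes are written per register for XMM16-XMM31.
// With AVX512VL a 128-bit move is enough; without it, fall back to the
// full 512-bit register.
inline std::size_t save_width_bytes(const CpuFeatures& cpu) {
  return cpu.avx512vl ? 16 : 64;
}

// Each high register gets a fixed 64-byte stack slot, so the slot offset is
// the same whichever width is actually written.
inline std::size_t slot_offset(std::size_t base_addr, int reg) {
  assert(reg >= 16 && reg < 32);  // only XMM16-XMM31 are modelled here
  return base_addr + static_cast<std::size_t>(reg - 16) * 64;
}
```

Because the slots are already 64 bytes wide in this model, writing all 512 bits costs no extra stack space, which is presumably why the fallback in option 1 is cheap to adopt.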
------------- PR: https://git.openjdk.java.net/jdk/pull/8690 From dlong at openjdk.java.net Sat May 14 19:18:38 2022 From: dlong at openjdk.java.net (Dean Long) Date: Sat, 14 May 2022 19:18:38 GMT Subject: RFR: 8271078: jdk/incubator/vector/Float128VectorTests.java failed a subtest [v3] In-Reply-To: References: Message-ID: <5MO3RPC5Ourjnz-Jwal25lRXt11iW--M8WBZCeS72mU=.2d014a1f-1268-46c4-a079-a90ee05a7803@github.com> On Fri, 13 May 2022 04:08:37 GMT, Dean Long wrote: >> This test was failing because the safepoint polling stub was only saving the low 8 bytes of XMM16-XMM31. We need to save all 16 bytes by default. I also added "wide" to the boolean parameter name to better reflect what it controls. And I made the asserts in save_live_registers() match what we have in restore_live_registers(). > > Dean Long has updated the pull request incrementally with one additional commit since the last revision: > > save_vectors --> save_wide_vectors A third option is to copy the XMM16-XMM31 register to a XMM0-XMM15 scratch register and then do the memory write. ------------- PR: https://git.openjdk.java.net/jdk/pull/8690 From kvn at openjdk.java.net Sun May 15 15:27:46 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Sun, 15 May 2022 15:27:46 GMT Subject: RFR: 8271078: jdk/incubator/vector/Float128VectorTests.java failed a subtest [v3] In-Reply-To: References: Message-ID: On Fri, 13 May 2022 04:08:37 GMT, Dean Long wrote: >> This test was failing because the safepoint polling stub was only saving the low 8 bytes of XMM16-XMM31. We need to save all 16 bytes by default. I also added "wide" to the boolean parameter name to better reflect what it controls. And I made the asserts in save_live_registers() match what we have in restore_live_registers(). 
> > Dean Long has updated the pull request incrementally with one additional commit since the last revision: > > save_vectors --> save_wide_vectors I agree with option 1) but may be save whole 512 bits unconditionally with comment explaining that we need to save 128 bits but it requires avx512vl support which is not available on all CPUs. We do have space for that anyway (i*64). ------------- PR: https://git.openjdk.java.net/jdk/pull/8690 From duke at openjdk.java.net Mon May 16 07:18:08 2022 From: duke at openjdk.java.net (Emanuel Peter) Date: Mon, 16 May 2022 07:18:08 GMT Subject: RFR: 8286638: C2: CmpU needs to do more precise over/underflow analysis [v3] In-Reply-To: References: Message-ID: On Fri, 13 May 2022 11:50:32 GMT, Emanuel Peter wrote: >> `CmpUNode::Value` already does an under/overflow analysis, in case we have an `AddI` or `SubI` above it. >> Instead of assuming the types are now the full `#int` range, we separately analyze the normal and the over/underflowed range. >> >> We get the two ranges `tr1` and `tr2`, which we now both compare with the right input `t2`, via `sub`. >> If both `cmp1` and `cmp2` are equal, for example both are `[ge]`, and below we have a Bool node that checks for `[lt]`, we know that this can never be true. >> >> However, I now encountered a case where `cmp1` was `[gt]` and `cmp2` was `[ge]`. Unfortunately, they are not the same, so we just discarded our analysis, and since it is an overflow case just cannot say anything. But we could actually know that both will never be `[lt]`. >> >> Thus, **I propose** to take the `meet` (the union of all possible results of `cmp1` and `cmp2`. In this example, the meet would be `[ge]`. >> >> **Why is this important?** >> I got a bug, where a ConvI2L node was able to determine the range was impossible, ripping out the data-flow. But the range-check did not manage to do the same analysis, because of an underflow. This leads to some mangled code further down. 
>> >> **Detailed analysis of that case:** >> >> type `i: [minint...0]` >> access to `c[i-1]` >> >> **Range-check:** >> `int index = AddI(i, -1)` >> -> type index: [minint-1 ... -1] -> underflow >> We detect that this AddI may have 2 ranges: >> `tr1: int:<=-1` >> `tr2: int:max `(underflow: minint-1) >> >> We then check how these ranges compare to in2: >> `t2: int:>=0` >> >> For this we compute: >> `const Type* cmp1 = sub(tr1, t2);` -> TypeInt::CC_GT = [1] >> `const Type* cmp2 = sub(tr2, t2);` -> TypeInt::CC_GE = [0...1] >> >> But then, we only do something with this result if `cmp1 == cmp2`. >> We never detect that the `Bool [lt] `could never be true. >> >> >> **Data-flow:** >> `long index = ConvI2L( AddI(i, -1) )` >> -> type of` ConvI2L: [0...maxint-1]` >> -> why do we know this? Because this is before an array access. We assume range-check guarantees index in range `[0...c.size()-1]`, and `c.size()<=maxint`. >> Then there is a push_thru_add, and we get: >> `long index = AddL( ConvI2L(i), -1)` >> -> type of new `ConvI2L: [1...maxint-1]` - because we correct the lo by 1 for the add. Somehow we do not adjust hi, in my opinion it should now be maxint, to correct by 1. >> Consequence: if hi is maxint or maxint-1, there is no overflow. >> Then, we statically detect that: >> type `i: [minint...0]` >> type` ConvI2L: [1...maxint-1]` >> -> filter results in `TOP` -> data-flow is eliminated sucessfully. >> >> >> Added **regression test** that matches this example above. >> Larger test suite passes. > > Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: > > fixed bug number Thanks @TobiHartmann for the help. Thanks @iwanowww @vnkozlov for the reviews. 
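The `meet` idea in the description quoted above can be illustrated with a small standalone model. This is a sketch under stated assumptions, not the code from the patch: `CCRange`, `meet`, `may_be_less` and `same` are hypothetical helpers, and condition-code types are modelled as closed ranges over {-1, 0, 1}, consistent with `CC_GT = [1]` and `CC_GE = [0...1]` in the description.

```cpp
#include <algorithm>
#include <cassert>

// Simplified model of a condition-code type: a closed range within
// {-1, 0, 1}, where -1 means "less", 0 "equal" and 1 "greater".
// This is an illustration only, not HotSpot's TypeInt.
struct CCRange {
  int lo;
  int hi;
};

constexpr CCRange CC_LT{-1, -1};
constexpr CCRange CC_GT{1, 1};
constexpr CCRange CC_GE{0, 1};

// Meet of two ranges: the smallest range that covers both inputs.
inline CCRange meet(CCRange a, CCRange b) {
  assert(a.lo <= a.hi && b.lo <= b.hi);
  return CCRange{std::min(a.lo, b.lo), std::max(a.hi, b.hi)};
}

// Could a Bool [lt] test on this comparison result ever be true?
inline bool may_be_less(CCRange r) { return r.lo <= -1; }

// Structural equality, as used by the older "cmp1 == cmp2" check.
inline bool same(CCRange a, CCRange b) { return a.lo == b.lo && a.hi == b.hi; }
```

With `cmp1 = CC_GT` and `cmp2 = CC_GE`, the equality check `same(cmp1, cmp2)` fails, but `meet(cmp1, cmp2)` is `CC_GE`, for which `may_be_less` is false. That is the extra fact the description says the range check was missing.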
------------- PR: https://git.openjdk.java.net/jdk/pull/8679 From thartmann at openjdk.java.net Mon May 16 07:21:45 2022 From: thartmann at openjdk.java.net (Tobias Hartmann) Date: Mon, 16 May 2022 07:21:45 GMT Subject: RFR: 8286638: C2: CmpU needs to do more precise over/underflow analysis [v3] In-Reply-To: References: Message-ID: On Fri, 13 May 2022 11:50:32 GMT, Emanuel Peter wrote: >> `CmpUNode::Value` already does an under/overflow analysis, in case we have an `AddI` or `SubI` above it. >> Instead of assuming the types are now the full `#int` range, we separately analyze the normal and the over/underflowed range. >> >> We get the two ranges `tr1` and `tr2`, which we now both compare with the right input `t2`, via `sub`. >> If both `cmp1` and `cmp2` are equal, for example both are `[ge]`, and below we have a Bool node that checks for `[lt]`, we know that this can never be true. >> >> However, I now encountered a case where `cmp1` was `[gt]` and `cmp2` was `[ge]`. Unfortunately, they are not the same, so we just discarded our analysis, and since it is an overflow case just cannot say anything. But we could actually know that both will never be `[lt]`. >> >> Thus, **I propose** to take the `meet` (the union of all possible results of `cmp1` and `cmp2`. In this example, the meet would be `[ge]`. >> >> **Why is this important?** >> I got a bug, where a ConvI2L node was able to determine the range was impossible, ripping out the data-flow. But the range-check did not manage to do the same analysis, because of an underflow. This leads to some mangled code further down. >> >> **Detailed analysis of that case:** >> >> type `i: [minint...0]` >> access to `c[i-1]` >> >> **Range-check:** >> `int index = AddI(i, -1)` >> -> type index: [minint-1 ... 
-1] -> underflow >> We detect that this AddI may have 2 ranges: >> `tr1: int:<=-1` >> `tr2: int:max `(underflow: minint-1) >> >> We then check how these ranges compare to in2: >> `t2: int:>=0` >> >> For this we compute: >> `const Type* cmp1 = sub(tr1, t2);` -> TypeInt::CC_GT = [1] >> `const Type* cmp2 = sub(tr2, t2);` -> TypeInt::CC_GE = [0...1] >> >> But then, we only do something with this result if `cmp1 == cmp2`. >> We never detect that the `Bool [lt] `could never be true. >> >> >> **Data-flow:** >> `long index = ConvI2L( AddI(i, -1) )` >> -> type of` ConvI2L: [0...maxint-1]` >> -> why do we know this? Because this is before an array access. We assume range-check guarantees index in range `[0...c.size()-1]`, and `c.size()<=maxint`. >> Then there is a push_thru_add, and we get: >> `long index = AddL( ConvI2L(i), -1)` >> -> type of new `ConvI2L: [1...maxint-1]` - because we correct the lo by 1 for the add. Somehow we do not adjust hi, in my opinion it should now be maxint, to correct by 1. >> Consequence: if hi is maxint or maxint-1, there is no overflow. >> Then, we statically detect that: >> type `i: [minint...0]` >> type` ConvI2L: [1...maxint-1]` >> -> filter results in `TOP` -> data-flow is eliminated sucessfully. >> >> >> Added **regression test** that matches this example above. >> Larger test suite passes. > > Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: > > fixed bug number Nice analysis, looks good! ------------- Marked as reviewed by thartmann (Reviewer). 
PR: https://git.openjdk.java.net/jdk/pull/8679 From duke at openjdk.java.net Mon May 16 07:24:22 2022 From: duke at openjdk.java.net (Emanuel Peter) Date: Mon, 16 May 2022 07:24:22 GMT Subject: Integrated: 8286638: C2: CmpU needs to do more precise over/underflow analysis In-Reply-To: References: Message-ID: On Thu, 12 May 2022 12:29:22 GMT, Emanuel Peter wrote: > `CmpUNode::Value` already does an under/overflow analysis, in case we have an `AddI` or `SubI` above it. > Instead of assuming the types are now the full `#int` range, we separately analyze the normal and the over/underflowed range. > > We get the two ranges `tr1` and `tr2`, which we now both compare with the right input `t2`, via `sub`. > If both `cmp1` and `cmp2` are equal, for example both are `[ge]`, and below we have a Bool node that checks for `[lt]`, we know that this can never be true. > > However, I now encountered a case where `cmp1` was `[gt]` and `cmp2` was `[ge]`. Unfortunately, they are not the same, so we just discarded our analysis, and since it is an overflow case, we just cannot say anything. But we could actually know that both will never be `[lt]`. > > Thus, **I propose** to take the `meet` (the union of all possible results of `cmp1` and `cmp2`). In this example, the meet would be `[ge]`. > > **Why is this important?** > I got a bug, where a ConvI2L node was able to determine the range was impossible, ripping out the data-flow. But the range-check did not manage to do the same analysis, because of an underflow. This leads to some mangled code further down. > > **Detailed analysis of that case:** > > type `i: [minint...0]` > access to `c[i-1]` > > **Range-check:** > `int index = AddI(i, -1)` > -> type index: [minint-1 ... 
-1] -> underflow > We detect that this AddI may have 2 ranges: > `tr1: int:<=-1` > `tr2: int:max` (underflow: minint-1) > > We then check how these ranges compare to in2: > `t2: int:>=0` > > For this we compute: > `const Type* cmp1 = sub(tr1, t2);` -> TypeInt::CC_GT = [1] > `const Type* cmp2 = sub(tr2, t2);` -> TypeInt::CC_GE = [0...1] > > But then, we only do something with this result if `cmp1 == cmp2`. > We never detect that the `Bool [lt]` could never be true. > > > **Data-flow:** > `long index = ConvI2L( AddI(i, -1) )` > -> type of `ConvI2L: [0...maxint-1]` > -> why do we know this? Because this is before an array access. We assume range-check guarantees index in range `[0...c.size()-1]`, and `c.size()<=maxint`. > Then there is a push_thru_add, and we get: > `long index = AddL( ConvI2L(i), -1)` > -> type of new `ConvI2L: [1...maxint-1]` - because we correct the lo by 1 for the add. Somehow we do not adjust hi, in my opinion it should now be maxint, to correct by 1. > Consequence: if hi is maxint or maxint-1, there is no overflow. > Then, we statically detect that: > type `i: [minint...0]` > type `ConvI2L: [1...maxint-1]` > -> filter results in `TOP` -> data-flow is eliminated successfully. > > > Added **regression test** that matches this example above. > Larger test suite passes. This pull request has now been integrated. 
Changeset: 2d34acfe Author: Emanuel Peter Committer: Tobias Hartmann URL: https://git.openjdk.java.net/jdk/commit/2d34acfec908e6cdfb8e920b54d5b932029e4bac Stats: 62 lines in 2 files changed: 57 ins; 1 del; 4 mod 8286638: C2: CmpU needs to do more precise over/underflow analysis Reviewed-by: kvn, vlivanov, thartmann ------------- PR: https://git.openjdk.java.net/jdk/pull/8679 From njian at openjdk.java.net Mon May 16 07:28:42 2022 From: njian at openjdk.java.net (Ningsheng Jian) Date: Mon, 16 May 2022 07:28:42 GMT Subject: RFR: 8281712: [REDO] AArch64: Implement string_compare intrinsic in SVE Message-ID: This is the REDO of JDK-8269559 and JDK-8275448. Those two backouts finally turned out to be caused by a system zlib issue on AArch64 macOS, and are not related to the patch itself. See [1][2] for details. This patch is generally the same as JDK-8275448, which uses SVE to optimize string_compare intrinsics for long string comparisons. I did a rebase with small tweaks to get better performance on recent Neoverse hardware. 
Test data on systems with different SVE vector sizes:

case         delta  size  128-bits  256-bits  512-bits
compareToLL      2    24     0.17%     0.58%     0.00%
compareToLL      2    36     0.00%     2.25%     0.04%
compareToLL      2    72    -4.40%     3.87%   -12.82%
compareToLL      2   128     4.55%    58.31%    13.53%
compareToLL      2   256    19.39%    69.77%    82.03%
compareToLL      2   512     1.81%    68.38%   170.93%
compareToLU      2    24    25.57%    46.98%    54.61%
compareToLU      2    36    36.03%    70.26%    94.33%
compareToLU      2    72    35.86%    90.58%   146.04%
compareToLU      2   128    70.82%   119.19%   266.22%
compareToLU      2   256    80.77%   146.33%   420.01%
compareToLU      2   512    94.62%   171.72%   530.87%
compareToUL      2    24    20.82%    34.48%    62.14%
compareToUL      2    36    39.77%    60.79%    69.77%
compareToUL      2    72    35.46%    84.34%   121.90%
compareToUL      2   128    67.77%   110.97%   220.53%
compareToUL      2   256    77.05%   160.29%   331.30%
compareToUL      2   512    91.88%   184.57%   524.21%
compareToUU      2    24    -0.13%     0.40%     0.00%
compareToUU      2    36    -9.18%    12.84%   -13.93%
compareToUU      2    72     1.67%    60.61%     6.69%
compareToUU      2   128    13.51%    60.33%    55.27%
compareToUU      2   256     2.55%    62.17%   153.26%
compareToUU      2   512     4.12%    68.62%   201.68%

JTreg tests passed on SVE hardware.
[1] https://bugs.openjdk.java.net/browse/JDK-8275448 [2] https://bugs.openjdk.java.net/browse/JDK-8282954 ------------- Commit messages: - 8281712: [REDO] AArch64: Implement string_compare intrinsic in SVE Changes: https://git.openjdk.java.net/jdk/pull/8723/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=8723&range=00 Issue: https://bugs.openjdk.java.net/browse/JDK-8281712 Stats: 443 lines in 7 files changed: 433 ins; 0 del; 10 mod Patch: https://git.openjdk.java.net/jdk/pull/8723.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8723/head:pull/8723 PR: https://git.openjdk.java.net/jdk/pull/8723 From chagedorn at openjdk.java.net Mon May 16 08:00:48 2022 From: chagedorn at openjdk.java.net (Christian Hagedorn) Date: Mon, 16 May 2022 08:00:48 GMT Subject: RFR: 8284115: [IR Framework] Compilation is not found due to rare safepoint while dumping PrintIdeal/PrintOptoAssembly In-Reply-To: References: Message-ID: <33kPuIDuamEddgehtIahnoIX9chfkmubviBROzaLhH0=.6ec024b3-bb95-4171-af7d-f615b2343408@github.com> On Fri, 13 May 2022 07:45:18 GMT, Christian Hagedorn wrote: > This is yet another manifestation of the safepointing problem while printing a `PrintIdeal/PrintOptoAssembly` block. > > In this case here, a safepoint is done and the `` message is emitted while dumping a `PrintIdeal` block of `retainDenominator()` inside the `hotspot_pid` file. 
During this interruption, another test class method is enqueued for compilation, which is logged to the `hotspot_pid` file before the printing of the `PrintIdeal` block resumes:
>
>
> # PrintIdeal output of retainDenominator()
> 3 Start === 3 0 [[ 3 5 6 7 8 9 13 11 ]] #{0:control, 1:abIO, 2:memory, 3:rawptr:BotPTR, 4:return_address, 5:compiler/c2/irTests/DivLNodeIdealizationTests:NotNull *, 6:long, 7:half, 8:long, 9:half}
> 36 CallStaticJava === 34 6 7 8 9 ( 35 1 1 1 1 1 26 1 27 1 ) [[ 37 ]] # Static uncommon_trap(reason='div0_check' action='maybe_recompile' debug_id='0') void ( int ) C=0.000100 DivLNodeIdealizationTests::retainDenominator @ bci:4 (line 130) !jvms: DivLNodeIdealizationTests::retainDenominator
>
> # Safepoint interruption
>
>
> # Enqueuing of another test class method identityThird()
>
>
> @ bci:4 (line 130)
>
> # Continue to dump PrintIdeal of retainDenominator()
> 41 DivL === 33 26 13 [[ 42 ]] !jvms: DivLNodeIdealizationTests::retainDenominator @ bci:4 (line 130)
> 9 Parm === 3 [[ 42 36 ]] ReturnAdr !jvms: DivLNodeIdealizationTests::retainDenominator @ bci:-1 (line 130)
>
>
> The `HotSpotPidFileParser` looks for these enqueue messages containing the method name in order to find and correctly map the corresponding `PrintIdeal` and `PrintOptoAssembly` outputs. However, the `HotSpotPidFileParser` does not expect such an enqueuing message to be found inside a `PrintIdeal/PrintOptoAssembly` block and thus ignores it. As a result, we later do not parse the `PrintIdeal` and `PrintOptoAssembly` output of the enqueued method during the safepoint and fail with the assertion that we did not find any compilation output for the method.
>
> In the example above, the assertion says that we did not find the compilation output of `identityThird()` whose enqueue message was ignored inside the `PrintIdeal` block of `retainDenominator()`.
> > The proposed fix is to make `HotSpotPidFileParser` aware of the possibility of a safepoint while reading the `PrintIdeal` or `PrintOptoAssembly` output and therefore add a check if there was a method enqueued for compilation while reading inside `BlockOutputReader::readBlock()`. > > Thanks, > Christian Thanks for your review, Vladimir! > I am concerned that PrintIdeal output is interrupted by output from other threads. It may cause other issues in the future again. Can we redirect `PrintIdeal` output into a separate file or reorder output like `LogCompilation` since it is used for IR testing now (automatic tool)? Originally it did not matter since such output was not used in any tool. Can you elaborate more on what you mean by reordering the `LogCompilation` output? I agree that we could make this safer by dumping `PrintIdeal` and `PrintOptoAssembly` to a separate file. However, this would require a change to the VM to generate the new file, which probably also needs to be documented and might be a security concern. With this new file, I think we should also add a new flag to enable/disable such a redirection because it will only be used by the IR framework. Since this kind of bug is quite rare and only concerns the IR framework, I'm not sure if it would be justified to make such a change to the VM. The current fix seems to be less invasive and hopefully now completes the entire handling of safepoints while printing. What do you think? 
------------- PR: https://git.openjdk.java.net/jdk/pull/8692 From thartmann at openjdk.java.net Mon May 16 09:00:44 2022 From: thartmann at openjdk.java.net (Tobias Hartmann) Date: Mon, 16 May 2022 09:00:44 GMT Subject: RFR: 8284115: [IR Framework] Compilation is not found due to rare safepoint while dumping PrintIdeal/PrintOptoAssembly In-Reply-To: References: Message-ID: <5TBylKh9D_10pw8aBztm2Wo5DKUA7ZiEHajKyrYHJ7Y=.efd37f69-4cd2-449f-89a7-1aae4e29e665@github.com> On Fri, 13 May 2022 07:45:18 GMT, Christian Hagedorn wrote: > This is yet another manifestation of the safepointing problem while printing a `PrintIdeal/PrintOptoAssembly` block. > > In this case here, a safepoint is done and the `` message is emitted while dumping a `PrintIdeal` block of `retainDenominator()` inside the `hotspot_pid` file. During this interruption, another test class method is enqueued for compilation which is logged to the `hotspot_pid` file before the printing of the `PrintIdeal` block resumes: > > > # PrintIdeal output of retainDenominator() > 3 Start === 3 0 [[ 3 5 6 7 8 9 13 11 ]] #{0:control, 1:abIO, 2:memory, 3:rawptr:BotPTR, 4:return_address, 5:compiler/c2/irTests/DivLNodeIdealizationTests:NotNull *, 6:long, 7:half, 8:long, 9:half} > 36 CallStaticJava === 34 6 7 8 9 ( 35 1 1 1 1 1 26 1 27 1 ) [[ 37 ]] # Static uncommon_trap(reason='div0_check' action='maybe_recompile' debug_id='0') void ( int ) C=0.000100 DivLNodeIdealizationTests::retainDenominator @ bci:4 (line 130) !jvms: DivLNodeIdealizationTests::retainDenominator > > # Safepoint interruption > > > # Enqueuing of another test class method identityThird() > > > @ bci:4 (line 130) > > # Continue to dump PrintIdeal of retainDenominator() > 41 DivL === 33 26 13 [[ 42 ]] !jvms: DivLNodeIdealizationTests::retainDenominator @ bci:4 (line 130) > 9 Parm === 3 [[ 42 36 ]] ReturnAdr !jvms: DivLNodeIdealizationTests::retainDenominator @ bci:-1 (line 130) > > > The `HotSpotPidFileParser` looks for these enqueue messages containing the 
method name in order to find and correctly map the corresponding `PrintIdeal` and `PrintOptoAssembly` outputs. However, the `HotSpotPidFileParser` does not expect such an enqueuing message to be found inside a `PrintIdeal/PrintOptoAssembly` block and thus ignores it. As a result, we later do not parse the `PrintIdeal` and `PrintOptoAssembly` output of the method that was enqueued during the safepoint, and we fail with the assertion that we did not find any compilation output for the method. > > In the example above, the assertion says that we did not find the compilation output of `identityThird()` whose enqueue message was ignored inside the `PrintIdeal` block of `retainDenominator()`. > > The proposed fix is to make `HotSpotPidFileParser` aware of the possibility of a safepoint while reading the `PrintIdeal` or `PrintOptoAssembly` output and therefore add a check if there was a method enqueued for compilation while reading inside `BlockOutputReader::readBlock()`. > > Thanks, > Christian I would also prefer to not introduce another log file just for IR matching. Together with https://github.com/openjdk/jdk/pull/8647, this should hopefully fix all issues related to safepointing while printing. The change looks good to me. ------------- Marked as reviewed by thartmann (Reviewer).
PR: https://git.openjdk.java.net/jdk/pull/8692 From shade at openjdk.java.net Mon May 16 09:27:55 2022 From: shade at openjdk.java.net (Aleksey Shipilev) Date: Mon, 16 May 2022 09:27:55 GMT Subject: RFR: 8286660: codestrings gtest fails on AArch64: "udf" in padding [v2] In-Reply-To: References: Message-ID: <5vtBiNnzX5EjxdB_J_03SSELKPp_A7GiZKnmHs6Z1aI=.701172ce-49d4-4964-8630-40f8e75bbb02@github.com> On Fri, 13 May 2022 13:05:38 GMT, Aleksey Shipilev wrote: >> On hsdis-enabled AArch64 machine this test fails even with [JDK-8274039](https://bugs.openjdk.java.net/browse/JDK-8274039): >> >> >> $ CONF=linux-aarch64-server-fastdebug make run-test TEST=jtreg:gtest/GTestWrapper.java >> >> [----------] 1 test from codestrings >> [ RUN ] codestrings.validate_vm > > Aleksey Shipilev has updated the pull request incrementally with one additional commit since the last revision: > > Accept udf 0 only Any other opinions? I'll integrate this soon. ------------- PR: https://git.openjdk.java.net/jdk/pull/8695 From aph at openjdk.java.net Mon May 16 11:28:02 2022 From: aph at openjdk.java.net (Andrew Haley) Date: Mon, 16 May 2022 11:28:02 GMT Subject: RFR: 8286660: codestrings gtest fails on AArch64: "udf" in padding [v2] In-Reply-To: <5vtBiNnzX5EjxdB_J_03SSELKPp_A7GiZKnmHs6Z1aI=.701172ce-49d4-4964-8630-40f8e75bbb02@github.com> References: <5vtBiNnzX5EjxdB_J_03SSELKPp_A7GiZKnmHs6Z1aI=.701172ce-49d4-4964-8630-40f8e75bbb02@github.com> Message-ID: On Mon, 16 May 2022 09:23:52 GMT, Aleksey Shipilev wrote: > Any other opinions? I'll integrate this soon. I guess it's OK, but ewww. Is it really right to be scanning code memory that hasn't been written? 
------------- PR: https://git.openjdk.java.net/jdk/pull/8695 From shade at openjdk.java.net Mon May 16 11:28:04 2022 From: shade at openjdk.java.net (Aleksey Shipilev) Date: Mon, 16 May 2022 11:28:04 GMT Subject: RFR: 8286660: codestrings gtest fails on AArch64: "udf" in padding [v2] In-Reply-To: References: <5vtBiNnzX5EjxdB_J_03SSELKPp_A7GiZKnmHs6Z1aI=.701172ce-49d4-4964-8630-40f8e75bbb02@github.com> Message-ID: On Mon, 16 May 2022 11:21:40 GMT, Andrew Haley wrote: > > Any other opinions? I'll integrate this soon. > > I guess it's OK, but ewww. Is it really right to be scanning code memory that hasn't been written? The last time I checked, CodeBuffers are filled/zeroed to help disassemblers to parse the unwritten sections. The padding code there was quite hairy to fix... ------------- PR: https://git.openjdk.java.net/jdk/pull/8695 From dholmes at openjdk.java.net Mon May 16 11:31:56 2022 From: dholmes at openjdk.java.net (David Holmes) Date: Mon, 16 May 2022 11:31:56 GMT Subject: RFR: 8280844: Epoch shift synchronization point for Compiler threads is inadequate In-Reply-To: References: Message-ID: On Mon, 16 May 2022 10:17:42 GMT, Markus Grönlund wrote: > Greetings, > > [JDK-8233111](https://bugs.openjdk.java.net/browse/JDK-8233111) attempted to address artefact tagging for Compiler threads, letting threads run _thread_in_native to avoid the transition. Unfortunately, that attempt proved inadequate. > > The epoch race is avoided only by performing the transition to _thread_in_vm. > > Testing: jdk_jfr > > Thanks > Markus src/hotspot/share/compiler/compilerEvent.cpp line 126: > 124: static inline void commit(EventType& event) { > 125: JavaThread* thread = JavaThread::current(); > 126: assert(thread->thread_state() == _thread_in_native, "invariant"); You don't need this assert as `ThreadInVMfromNative` already has it.
------------- PR: https://git.openjdk.java.net/jdk/pull/8724 From mgronlun at openjdk.java.net Mon May 16 11:37:43 2022 From: mgronlun at openjdk.java.net (Markus Grönlund) Date: Mon, 16 May 2022 11:37:43 GMT Subject: RFR: 8280844: Epoch shift synchronization point for Compiler threads is inadequate [v2] In-Reply-To: References: Message-ID: <1e7AHT4Z2EMgm-mxAOrVO0kLCOkjzFpzvh-yL7KNr-I=.42f520bd-25f8-4a0d-9d89-0c238dce9ef1@github.com> > Greetings, > > [JDK-8233111](https://bugs.openjdk.java.net/browse/JDK-8233111) attempted to address artefact tagging for Compiler threads, letting threads run _thread_in_native to avoid the transition. Unfortunately, that attempt proved inadequate. > > The epoch race is avoided only by performing the transition to _thread_in_vm. > > Testing: jdk_jfr > > Thanks > Markus Markus Grönlund has updated the pull request incrementally with one additional commit since the last revision: delegate assertion ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/8724/files - new: https://git.openjdk.java.net/jdk/pull/8724/files/9ce10130..b4e59b72 Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=8724&range=01 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=8724&range=00-01 Stats: 3 lines in 1 file changed: 0 ins; 2 del; 1 mod Patch: https://git.openjdk.java.net/jdk/pull/8724.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8724/head:pull/8724 PR: https://git.openjdk.java.net/jdk/pull/8724 From aph at openjdk.java.net Mon May 16 11:48:47 2022 From: aph at openjdk.java.net (Andrew Haley) Date: Mon, 16 May 2022 11:48:47 GMT Subject: RFR: 8286660: codestrings gtest fails on AArch64: "udf" in padding [v2] In-Reply-To: References: Message-ID: <5T_ySe5jAofE9CgpX0lHkTd9lEmk5dcZt4ux3UhD2oQ=.9f37e370-f01f-4bc2-aab7-37ffe738a29d@github.com> On Fri, 13 May 2022 13:05:38 GMT, Aleksey Shipilev wrote: >> On hsdis-enabled AArch64 machine this test fails even with
[JDK-8274039](https://bugs.openjdk.java.net/browse/JDK-8274039): >> >> >> $ CONF=linux-aarch64-server-fastdebug make run-test TEST=jtreg:gtest/GTestWrapper.java >> >> [----------] 1 test from codestrings >> [ RUN ] codestrings.validate_vm > > Aleksey Shipilev has updated the pull request incrementally with one additional commit since the last revision: > > Accept udf 0 only Marked as reviewed by aph (Reviewer). ------------- PR: https://git.openjdk.java.net/jdk/pull/8695 From aph at openjdk.java.net Mon May 16 11:48:47 2022 From: aph at openjdk.java.net (Andrew Haley) Date: Mon, 16 May 2022 11:48:47 GMT Subject: RFR: 8286660: codestrings gtest fails on AArch64: "udf" in padding [v2] In-Reply-To: References: <5vtBiNnzX5EjxdB_J_03SSELKPp_A7GiZKnmHs6Z1aI=.701172ce-49d4-4964-8630-40f8e75bbb02@github.com> Message-ID: On Mon, 16 May 2022 11:24:22 GMT, Aleksey Shipilev wrote: > > > Any other opinions? I'll integrate this soon. > > > > > > I guess it's OK, but ewww. Is it really right to be scanning code memory that hasn't been written? > > The last time I checked, CodeBuffers are filled/zeroed to help disassemblers to parse the unwritten sections. The padding code there was quite hairy to fix... Aha! OK. ------------- PR: https://git.openjdk.java.net/jdk/pull/8695 From thartmann at openjdk.java.net Mon May 16 12:11:01 2022 From: thartmann at openjdk.java.net (Tobias Hartmann) Date: Mon, 16 May 2022 12:11:01 GMT Subject: RFR: 8284932: [Vector API] Incorrect implementation of LSHR operator for negative byte/short elements In-Reply-To: References: Message-ID: <7Kc1dbOWuBDaz9MY5Fv2fxzBObf6nDGL9HIotCiMtik=.b33b0951-4067-4640-af80-54a2defc9d01@github.com> On Wed, 27 Apr 2022 09:13:34 GMT, Jie Fu wrote: >> Hi all, >> >> According to the Vector API doc, the `LSHR` operator computes `a>>>(n&(ESIZE*8-1))`. >> However, current implementation is incorrect for negative bytes/shorts. 
>> >> The background is that one of our customers try to vectorize `urshift` with `urshiftVector` like the following. >> >> 13 public static void urshift(byte[] src, byte[] dst) { >> 14 for (int i = 0; i < src.length; i++) { >> 15 dst[i] = (byte)(src[i] >>> 3); >> 16 } >> 17 } >> 18 >> 19 public static void urshiftVector(byte[] src, byte[] dst) { >> 20 int i = 0; >> 21 for (; i < spec.loopBound(src.length); i +=spec.length()) { >> 22 var va = ByteVector.fromArray(spec, src, i); >> 23 var vb = va.lanewise(VectorOperators.LSHR, 3); >> 24 vb.intoArray(dst, i); >> 25 } >> 26 >> 27 for (; i < src.length; i++) { >> 28 dst[i] = (byte)(src[i] >>> 3); >> 29 } >> 30 } >> >> >> Unfortunately and to our surprise, the code at line 28 computes different results from the code at line 23. >> It took quite a long time to figure out this bug. >> >> The root cause is that the current implementation of the Vector API can't compute the unsigned right shift results the way scalar `>>>` does for negative byte/short elements. >> Actually, current implementation will do `(a & 0xFF) >>> (n & 7)` [1] for all bytes, which is unable to compute the vectorized `>>>` for negative bytes. >> So this seems unreasonable and unfriendly to Java developers. >> It would be better to fix it. >> >> The key idea to support unsigned right shift of negative bytes/shorts is just to replace the unsigned right shift operation with the signed right shift operation. >> This logic is: >> - For byte elements, unsigned right shift is equal to signed right shift if the shift_cnt <= 24. >> - For short elements, unsigned right shift is equal to signed right shift if the shift_cnt <= 16. >> - For Vector API, the shift_cnt will be masked to shift_cnt <= 7 for bytes and shift_cnt <= 15 for shorts. >> >> I just learned this idea from https://github.com/openjdk/jdk/pull/7979 . >> And many thanks to @fg1417 . >> >> >> Thanks.
>> Best regards, >> Jie >> >> [1] https://github.com/openjdk/jdk/blob/master/src/jdk.incubator.vector/share/classes/jdk/incubator/vector/ByteVector.java#L935 > >> > > According to the Vector API doc, the LSHR operator computes a>>>(n&(ESIZE*8-1)) >> >> Documentation is correct if viewed strictly in context of subword vector lane, JVM internally promotes/sign extends subword type scalar variables into int type, but vectors are loaded from continuous memory holding subwords, it will not be correct for developer to imagine that individual subword type lanes will be upcasted into int lanes before being operated upon. >> >> Thus both java implementation and compiler handling looks correct. > > Thanks @jatin-bhateja for taking a look at this. > After the discussion, I think it's fine to keep the current implementation of LSHR. > So we're now fixing the misleading doc here: https://github.com/openjdk/jdk/pull/8291 . > > And I think it would be better to add one more operator for `>>>`. > Thanks. @DamonFool should this PR and the JBS issue be closed? ------------- PR: https://git.openjdk.java.net/jdk/pull/8276 From jiefu at openjdk.java.net Mon May 16 12:21:57 2022 From: jiefu at openjdk.java.net (Jie Fu) Date: Mon, 16 May 2022 12:21:57 GMT Subject: RFR: 8284932: [Vector API] Incorrect implementation of LSHR operator for negative byte/short elements In-Reply-To: References: Message-ID: On Wed, 27 Apr 2022 09:13:34 GMT, Jie Fu wrote: >> Hi all, >> >> According to the Vector API doc, the `LSHR` operator computes `a>>>(n&(ESIZE*8-1))`. >> However, current implementation is incorrect for negative bytes/shorts. >> >> The background is that one of our customers try to vectorize `urshift` with `urshiftVector` like the following. 
>> >> 13 public static void urshift(byte[] src, byte[] dst) { >> 14 for (int i = 0; i < src.length; i++) { >> 15 dst[i] = (byte)(src[i] >>> 3); >> 16 } >> 17 } >> 18 >> 19 public static void urshiftVector(byte[] src, byte[] dst) { >> 20 int i = 0; >> 21 for (; i < spec.loopBound(src.length); i +=spec.length()) { >> 22 var va = ByteVector.fromArray(spec, src, i); >> 23 var vb = va.lanewise(VectorOperators.LSHR, 3); >> 24 vb.intoArray(dst, i); >> 25 } >> 26 >> 27 for (; i < src.length; i++) { >> 28 dst[i] = (byte)(src[i] >>> 3); >> 29 } >> 30 } >> >> >> Unfortunately and to our surprise, the code at line 28 computes different results from the code at line 23. >> It took quite a long time to figure out this bug. >> >> The root cause is that the current implementation of the Vector API can't compute the unsigned right shift results the way scalar `>>>` does for negative byte/short elements. >> Actually, current implementation will do `(a & 0xFF) >>> (n & 7)` [1] for all bytes, which is unable to compute the vectorized `>>>` for negative bytes. >> So this seems unreasonable and unfriendly to Java developers. >> It would be better to fix it. >> >> The key idea to support unsigned right shift of negative bytes/shorts is just to replace the unsigned right shift operation with the signed right shift operation. >> This logic is: >> - For byte elements, unsigned right shift is equal to signed right shift if the shift_cnt <= 24. >> - For short elements, unsigned right shift is equal to signed right shift if the shift_cnt <= 16. >> - For Vector API, the shift_cnt will be masked to shift_cnt <= 7 for bytes and shift_cnt <= 15 for shorts. >> >> I just learned this idea from https://github.com/openjdk/jdk/pull/7979 . >> And many thanks to @fg1417 . >> >> >> Thanks.
>> Best regards, >> Jie >> >> [1] https://github.com/openjdk/jdk/blob/master/src/jdk.incubator.vector/share/classes/jdk/incubator/vector/ByteVector.java#L935 > >> > > According to the Vector API doc, the LSHR operator computes a>>>(n&(ESIZE*8-1)) >> >> Documentation is correct if viewed strictly in context of subword vector lane, JVM internally promotes/sign extends subword type scalar variables into int type, but vectors are loaded from continuous memory holding subwords, it will not be correct for developer to imagine that individual subword type lanes will be upcasted into int lanes before being operated upon. >> >> Thus both java implementation and compiler handling looks correct. > > Thanks @jatin-bhateja for taking a look at this. > After the discussion, I think it's fine to keep the current implementation of LSHR. > So we're now fixing the misleading doc here: https://github.com/openjdk/jdk/pull/8291 . > > And I think it would be better to add one more operator for `>>>`. > Thanks. > @DamonFool should this PR and the JBS issue be closed? Okay. Let's close it. ------------- PR: https://git.openjdk.java.net/jdk/pull/8276 From jiefu at openjdk.java.net Mon May 16 12:21:59 2022 From: jiefu at openjdk.java.net (Jie Fu) Date: Mon, 16 May 2022 12:21:59 GMT Subject: Withdrawn: 8284932: [Vector API] Incorrect implementation of LSHR operator for negative byte/short elements In-Reply-To: References: Message-ID: On Sun, 17 Apr 2022 14:35:14 GMT, Jie Fu wrote: > Hi all, > > According to the Vector API doc, the `LSHR` operator computes `a>>>(n&(ESIZE*8-1))`. > However, current implementation is incorrect for negative bytes/shorts. > > The background is that one of our customers try to vectorize `urshift` with `urshiftVector` like the following. 
> > 13 public static void urshift(byte[] src, byte[] dst) { > 14 for (int i = 0; i < src.length; i++) { > 15 dst[i] = (byte)(src[i] >>> 3); > 16 } > 17 } > 18 > 19 public static void urshiftVector(byte[] src, byte[] dst) { > 20 int i = 0; > 21 for (; i < spec.loopBound(src.length); i +=spec.length()) { > 22 var va = ByteVector.fromArray(spec, src, i); > 23 var vb = va.lanewise(VectorOperators.LSHR, 3); > 24 vb.intoArray(dst, i); > 25 } > 26 > 27 for (; i < src.length; i++) { > 28 dst[i] = (byte)(src[i] >>> 3); > 29 } > 30 } > > > Unfortunately and to our surprise, the code at line 28 computes different results from the code at line 23. > It took quite a long time to figure out this bug. > > The root cause is that the current implementation of the Vector API can't compute the unsigned right shift results the way scalar `>>>` does for negative byte/short elements. > Actually, current implementation will do `(a & 0xFF) >>> (n & 7)` [1] for all bytes, which is unable to compute the vectorized `>>>` for negative bytes. > So this seems unreasonable and unfriendly to Java developers. > It would be better to fix it. > > The key idea to support unsigned right shift of negative bytes/shorts is just to replace the unsigned right shift operation with the signed right shift operation. > This logic is: > - For byte elements, unsigned right shift is equal to signed right shift if the shift_cnt <= 24. > - For short elements, unsigned right shift is equal to signed right shift if the shift_cnt <= 16. > - For Vector API, the shift_cnt will be masked to shift_cnt <= 7 for bytes and shift_cnt <= 15 for shorts. > > I just learned this idea from https://github.com/openjdk/jdk/pull/7979 . > And many thanks to @fg1417 . > > > Thanks. > Best regards, > Jie > > [1] https://github.com/openjdk/jdk/blob/master/src/jdk.incubator.vector/share/classes/jdk/incubator/vector/ByteVector.java#L935 This pull request has been closed without being integrated.
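For readers following the thread, the scalar/lane-wise mismatch discussed in the quoted report can be reproduced without the Vector API. The following is only a minimal plain-Java sketch, assuming the lane-wise behaviour quoted in the report, `(a & 0xFF) >>> (n & 7)`; the class and method names are hypothetical and are not part of the JDK.

```java
public class ByteShiftSketch {
    // Scalar Java: the byte is sign-extended to int before >>>, then narrowed back to byte.
    static byte scalarUrshift(byte a, int n) {
        return (byte) (a >>> n);
    }

    // Byte-lane LSHR as quoted in the report: zero-extend the lane, mask the count, then shift.
    static byte laneLshr(byte a, int n) {
        return (byte) ((a & 0xFF) >>> (n & 7));
    }

    // Signed right shift followed by narrowing.
    static byte scalarAshr(byte a, int n) {
        return (byte) (a >> n);
    }

    // Checks the claim from the thread: for byte elements, (byte)(a >>> n) == (byte)(a >> n) when n <= 24.
    static boolean urshiftMatchesAshrUpTo24() {
        for (int v = Byte.MIN_VALUE; v <= Byte.MAX_VALUE; v++) {
            for (int n = 0; n <= 24; n++) {
                if (scalarUrshift((byte) v, n) != scalarAshr((byte) v, n)) {
                    return false;
                }
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte a = (byte) -8;                          // bit pattern 0xF8
        System.out.println(scalarUrshift(a, 3));     // prints -1
        System.out.println(laneLshr(a, 3));          // prints 31
        System.out.println(scalarAshr(a, 3));        // prints -1
        System.out.println(urshiftMatchesAshrUpTo24());
    }
}
```

For the negative input the scalar `>>>` result agrees with the signed shift and not with the zero-extending lane-wise shift, which is the difference the reporter observed between lines 28 and 23 of the quoted code.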
------------- PR: https://git.openjdk.java.net/jdk/pull/8276 From mdoerr at openjdk.java.net Mon May 16 12:43:35 2022 From: mdoerr at openjdk.java.net (Martin Doerr) Date: Mon, 16 May 2022 12:43:35 GMT Subject: RFR: 8286182: C2: crash with SIGFPE when executing compiled code Message-ID: The bug is not assigned to me, but I have seen that the C2 code which checks for div by 0 is not aware of the new nodes from [JDK-8284742](https://bugs.openjdk.java.net/browse/JDK-8284742). This fixes the VM to pass the reproducer. I'm not sure if more opcode checks are required to get added. ------------- Commit messages: - 8286182: C2: crash with SIGFPE when executing compiled code Changes: https://git.openjdk.java.net/jdk/pull/8726/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=8726&range=00 Issue: https://bugs.openjdk.java.net/browse/JDK-8286182 Stats: 8 lines in 2 files changed: 5 ins; 0 del; 3 mod Patch: https://git.openjdk.java.net/jdk/pull/8726.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8726/head:pull/8726 PR: https://git.openjdk.java.net/jdk/pull/8726 From thartmann at openjdk.java.net Mon May 16 13:04:44 2022 From: thartmann at openjdk.java.net (Tobias Hartmann) Date: Mon, 16 May 2022 13:04:44 GMT Subject: RFR: 8286182: C2: crash with SIGFPE when executing compiled code In-Reply-To: References: Message-ID: On Mon, 16 May 2022 12:36:43 GMT, Martin Doerr wrote: > The bug is not assigned to me, but I have seen that the C2 code which checks for div by 0 is not aware of the new nodes from [JDK-8284742](https://bugs.openjdk.java.net/browse/JDK-8284742). > This fixes the VM to pass the reproducer. I'm not sure if more opcode checks are required to get added. So the fix for [JDK-8257822](https://bugs.openjdk.java.net/browse/JDK-8257822) misses the new nodes added by [JDK-8284742](https://bugs.openjdk.java.net/browse/JDK-8284742). 
I wonder if this code needs to be adjusted as well: https://github.com/openjdk/jdk/blob/fa1ca98fff66fb91cfd5b00404645e0574d03101/src/hotspot/share/opto/loopnode.cpp#L5780-L5785 @chhagedorn should have a look as well. ------------- PR: https://git.openjdk.java.net/jdk/pull/8726 From egahlin at openjdk.java.net Mon May 16 13:35:58 2022 From: egahlin at openjdk.java.net (Erik Gahlin) Date: Mon, 16 May 2022 13:35:58 GMT Subject: RFR: 8280844: Epoch shift synchronization point for Compiler threads is inadequate [v2] In-Reply-To: <1e7AHT4Z2EMgm-mxAOrVO0kLCOkjzFpzvh-yL7KNr-I=.42f520bd-25f8-4a0d-9d89-0c238dce9ef1@github.com> References: <1e7AHT4Z2EMgm-mxAOrVO0kLCOkjzFpzvh-yL7KNr-I=.42f520bd-25f8-4a0d-9d89-0c238dce9ef1@github.com> Message-ID: On Mon, 16 May 2022 11:37:43 GMT, Markus Grönlund wrote: >> Greetings, >> >> [JDK-8233111](https://bugs.openjdk.java.net/browse/JDK-8233111) attempted to address artefact tagging for Compiler threads, letting threads run _thread_in_native to avoid the transition. Unfortunately, that attempt proved inadequate. >> >> The epoch race is avoided only by performing the transition to _thread_in_vm. >> >> Testing: jdk_jfr >> >> Thanks >> Markus > > Markus Grönlund has updated the pull request incrementally with one additional commit since the last revision: > > delegate assertion Marked as reviewed by egahlin (Reviewer).
------------- PR: https://git.openjdk.java.net/jdk/pull/8724 From mdoerr at openjdk.java.net Mon May 16 14:02:50 2022 From: mdoerr at openjdk.java.net (Martin Doerr) Date: Mon, 16 May 2022 14:02:50 GMT Subject: RFR: 8286182: C2: crash with SIGFPE when executing compiled code In-Reply-To: References: Message-ID: <4HMnno6o8A9IcLLDDwAtxP5lIuBvZfqLEfE_fI8rTtk=.898bbfab-c33f-404c-8eaf-0662ddd3f5d1@github.com> On Mon, 16 May 2022 13:01:50 GMT, Tobias Hartmann wrote: > So the fix for [JDK-8257822](https://bugs.openjdk.java.net/browse/JDK-8257822) misses the new nodes added by [JDK-8284742](https://bugs.openjdk.java.net/browse/JDK-8284742). > > I wonder if this code needs to be adjusted as well: > > https://github.com/openjdk/jdk/blob/fa1ca98fff66fb91cfd5b00404645e0574d03101/src/hotspot/share/opto/loopnode.cpp#L5780-L5785 > > @chhagedorn should have a look as well. I was wondering about that, too. Also why we have checks for Div/ModI nodes, but not for Div/ModL at some places. We can have loops with long trip count since [JDK-8223051](https://bugs.openjdk.java.net/browse/JDK-8223051). ------------- PR: https://git.openjdk.java.net/jdk/pull/8726 From jvernee at openjdk.java.net Mon May 16 14:52:04 2022 From: jvernee at openjdk.java.net (Jorn Vernee) Date: Mon, 16 May 2022 14:52:04 GMT Subject: RFR: 8283689: Update the foreign linker VM implementation [v19] In-Reply-To: References: Message-ID: > Hi, > > This PR updates the VM implementation of the foreign linker, by bringing over commits from the panama-foreign repo. > > This is split off from the main JEP integration for 19, since we have limited resources to handle this. As such, this PR might fall over to 20, but it would be nice if we could get it into 19. > > I've written up an overview of the Linker architecture here: http://cr.openjdk.java.net/~jvernee/docs/FL_Overview.html it might be useful to read that first. 
> > This patch moves from the "legacy" implementation, to what is currently implemented in the panama-foreign repo, except for replacing the use of method handle combinators with ASM. That will come in a later patch. To recap, this PR contains the following changes: > > 1. VM stubs for downcalls are now generated up front, instead of lazily by C2 [1]. > 2. The VM support for upcalls/downcalls now supports all possible call shapes, and the VM stubs and Java code implementing the buffered invocation strategy have been removed [2], [3], [4], [5]. > 3. The existing C2 intrinsification support for the `linkToNative` method handle linker was no longer needed and has been removed [6] (support might be re-added in another form later). > 4. Some other cleanups, such as: OptimizedEntryBlob (for upcalls) now implements RuntimeBlob directly. Binding to java classes has been rewritten to use javaClasses.h/cpp (this wasn't previously possible due to these java classes being in an incubator module) [7], [8], [9]. > > While the patch mostly consists of VM changes, there are also some Java changes to support (2). > > The original commit structure has been mostly retained, so it might be useful to look at a specific commit, or the corresponding patch in the [panama-foreign](https://github.com/openjdk/panama-foreign/pulls?q=is%3Apr) repo as well. I've also left some inline comments to explain some of the changes, which will hopefully make reviewing easier.
> > Testing: Tier1-4 > > Thanks, > Jorn > > [1]: https://github.com/openjdk/jdk/pull/7959/commits/048b88156814579dca1f70742061ad24942fd358 > [2]: https://github.com/openjdk/jdk/pull/7959/commits/2fbbef472b4c2b4fee5ede2f18cd81ab61e88f49 > [3]: https://github.com/openjdk/jdk/pull/7959/commits/8a957a4ed9cc8d1f708ea8777212eb51ab403dc3 > [4]: https://github.com/openjdk/jdk/pull/7959/commits/35ba1d964f1de4a77345dc58debe0565db4b0ff3 > [5]: https://github.com/openjdk/jdk/pull/7959/commits/4e72aae22920300c5ffa16fed805b62ed9092120 > [6]: https://github.com/openjdk/jdk/pull/7959/commits/08e22e1b468c5c8f0cfd7135c72849944068aa7a > [7]: https://github.com/openjdk/jdk/pull/7959/commits/451cd9edf54016c182dab21a8b26bd8b609fc062 > [8]: https://github.com/openjdk/jdk/pull/7959/commits/4c851d2795afafec3a3ab17f4142ee098692068f > [9]: https://github.com/openjdk/jdk/pull/7959/commits/d025377799424f31512dca2ffe95491cd5ae22f9 Jorn Vernee has updated the pull request incrementally with one additional commit since the last revision: Fix failure with SPEC disabled (accidentally dropped change) ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/7959/files - new: https://git.openjdk.java.net/jdk/pull/7959/files/2ea5bc94..ff8835ee Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=7959&range=18 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=7959&range=17-18 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.java.net/jdk/pull/7959.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/7959/head:pull/7959 PR: https://git.openjdk.java.net/jdk/pull/7959 From jbhateja at openjdk.java.net Mon May 16 15:25:55 2022 From: jbhateja at openjdk.java.net (Jatin Bhateja) Date: Mon, 16 May 2022 15:25:55 GMT Subject: RFR: 8283894: Intrinsify compress and expand bits on x86 [v2] In-Reply-To: References: Message-ID: > Summary of changes: > > - Patch intrinsifies following newly added Java SE APIs > - Integer.compress > - Integer.expand > - 
Long.compress > - Long.expand > > - Adds C2 IR nodes and corresponding ideal transformations for new operations. > - We see around ~10x performance speedup due to intrinsification over X86 target. > - Adds an IR framework based test to validate newly introduced IR transformations. > > Kindly review and share your feedback. > > Best Regards, > Jatin Jatin Bhateja has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains four additional commits since the last revision: - 8283894: Review comments resolutions. - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8283894 - 8283894: Extending IR framework testcase with some functional test points. - 8283894: Intrinsify compress and expand bits on x86 ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/8498/files - new: https://git.openjdk.java.net/jdk/pull/8498/files/7dcbfd01..f7ed0f8d Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=8498&range=01 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=8498&range=00-01 Stats: 217088 lines in 2755 files changed: 165017 ins; 37477 del; 14594 mod Patch: https://git.openjdk.java.net/jdk/pull/8498.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8498/head:pull/8498 PR: https://git.openjdk.java.net/jdk/pull/8498 From jbhateja at openjdk.java.net Mon May 16 15:25:58 2022 From: jbhateja at openjdk.java.net (Jatin Bhateja) Date: Mon, 16 May 2022 15:25:58 GMT Subject: RFR: 8283894: Intrinsify compress and expand bits on x86 [v2] In-Reply-To: References: Message-ID: <_BcFZGyUcq9X1sQeAUNp-X0FEJQKnN2E_YiE5odhRgk=.12e82234-22f2-4fc4-ab99-893d66f774c9@github.com> On Wed, 4 May 2022 17:59:20 GMT, Sandhya Viswanathan wrote: >> Jatin Bhateja has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. 
The pull request contains four additional commits since the last revision: >> >> - 8283894: Review comments resolutions. >> - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8283894 >> - 8283894: Extending IR framework testcase with some functional test points. >> - 8283894: Intrinsify compress and expand bits on x86 > > src/hotspot/cpu/x86/x86.ad line 6191: > >> 6189: %} >> 6190: >> 6191: instruct compressBitsL_reg(rRegL dst, rRegL src, rRegL mask) %{ > > All the compress/expand rules could be moved to x86_64.ad. DONE. Only integer patterns which are common to both targets are in x86.ad; a special instruction sequence has been introduced for bit extraction / compression for 32 bit targets. It shows 10x improvement over existing non-intrinsic routine. > src/hotspot/share/opto/intrinsicnode.cpp line 160: > >> 158: >> 159: Node* compress_expand_identity(PhaseGVN* phase, Node* n) { >> 160: BasicType bt = n->bottom_type()->array_element_basic_type(); > > Why use of array_element_basic_type() here? These are not arrays. DONE, in this case we are only concerned about INT/LONG so most accurate integral type based on value ranges is not needed. ------------- PR: https://git.openjdk.java.net/jdk/pull/8498 From jbhateja at openjdk.java.net Mon May 16 15:26:02 2022 From: jbhateja at openjdk.java.net (Jatin Bhateja) Date: Mon, 16 May 2022 15:26:02 GMT Subject: RFR: 8283894: Intrinsify compress and expand bits on x86 [v2] In-Reply-To: References: Message-ID: On Tue, 3 May 2022 00:04:15 GMT, John R Rose wrote: >> Jatin Bhateja has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains four additional commits since the last revision: >> >> - 8283894: Review comments resolutions. >> - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8283894 >> - 8283894: Extending IR framework testcase with some functional test points.
>> - 8283894: Intrinsify compress and expand bits on x86 > > src/hotspot/share/opto/intrinsicnode.cpp line 155: > >> 153: return new AndLNode(compr, src->in(1)); >> 154: } >> 155: } > > I think a further rule for `compress(m, m)` could be in order. > > > compress(m, m) = m==-1 ? m : (1L << PopCount[IL](m))-1 > > > This should be its own path through `Ideal`, not special logic at this particular point. > > Don't use it unless `Matcher::match_rule_supported(Op_PopCount[IL])` is true. This is a special case when both source and mask are same and we cannot take this call from C2Compiler::is_intrinsic_supported method. So currently if target does not support compress/expand bits then we do not create intrinsic in the first place. > Can you update the jtreg tests: > > 1. Modify `CompressExpandTest` to run with and without the intrinsic enabled > 2. Disable (by default) `CompressExpandSanityTest` > ? By disable you mean move it to ProblemList.txt. > src/hotspot/share/opto/intrinsicnode.cpp line 213: > >> 211: } >> 212: >> 213: Node* ExpandBitsNode::Identity(PhaseGVN* phase) { > > I also suggest adding a boolean if `compress_expand_identity` if you add rules which don't apply to both equally. 
> > Here is possible type-propagation logic for compress and expand: > > > let SIGN_BIT = (((IntOrLong)-1)>>>1)+1 (bit 31 or 63) > let MAX_POS = (((IntOrLong)-1)>>>1) > let BITS = 1+bitCount(MAX_POS) (32 or 64) > if (both x, m are con) { > // maybe use these rules, by porting the Java code to C++ > compress(CON[x], CON[m]) ] = CON[portable_compress(x,m)] > expand(CON[x], CON[m]) ] = CON[portable_expand(x,m)] > // see also https://stackoverflow.com/questions/38938911/portable-efficient-alternative-to-pdep-without-using-bmi2 > } else if (m is CON[m] && m != -1) { > //compress(x, -1) = x //identity handled elsewhere > //expand(x, -1) = x //identity handled elsewhere > let bitc = bitCount(m) > LO[ compress(x, CON[m]) ] = 0 //sign bit is never set > HI[ compress(x, CON[m]) ] = ((1L< LO[ expand(x, CON[m]) ] = (m >= 0) ? 0 : SIGN_BIT //sign bit might be set alone > HI[ expand(x, CON[m]) ] = (m >= 0) ? m : m ^ SIGN_BIT > // could improve a little by looking TYPE[x], but do not bother > } else { > // estimate maximum possible weight of m (in 0..63) > let maxbitc = BITS if (LO[m] < 0 && HI[m] >= -1) // could be -1 > else maxbitc = BITS-1 if (LO[m] < 0 || HI[m] == MAX_POS) // <0 or maxint > else maxbitc = BITS-1 - numberOfLeadingZeros(HI[m]) > LO[ compress(x, m) ] = (maxbitc == 64 && LO[x] < 0) ? SIGN_BIT : 0 > HI[ compress(x, m) ] = (maxbitc >= 63) ? HI[x] : MIN(HI[x], (1L< LO[ expand(x, m) ] = (LO[m] >= 0) ? 0 : SIGN_BIT > HI[ expand(x, m) ] = (LO[m] >= 0) ? HI[m] : MAX_POS > } > > > The operands of compress and expand are inherently unsigned bitmasks, so the signed type system of C2 gets in the way. In the future, a somewhat more thorough job could be done if we had bitwise types as well in C2. For what that would mean, see https://bugs.openjdk.java.net/browse/JDK-8001436 I have handled these transformations separately in ideal/identity and value routines.
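As an aside for readers, the `compress(m, m)` rule proposed in the review, `m==-1 ? m : (1L << PopCount(m))-1`, can be checked against a bit-by-bit reference. The following is only a sketch in plain Java: `portableCompress` and `portableExpand` are hypothetical helpers written for this illustration (simple loops, not the HotSpot or JDK implementation), and `compressSelf` is a hypothetical name for the closed form from the review comment.

```java
public class BitCompressSketch {
    // Gathers the bits of x selected by mask into the low-order bits of the result.
    static long portableCompress(long x, long mask) {
        long result = 0;
        int outBit = 0;
        while (mask != 0) {
            long lowest = mask & -mask;          // isolate the lowest set bit of mask
            if ((x & lowest) != 0) {
                result |= 1L << outBit;
            }
            outBit++;
            mask &= mask - 1;                    // clear the lowest set bit of mask
        }
        return result;
    }

    // Scatters the low-order bits of x to the bit positions selected by mask.
    static long portableExpand(long x, long mask) {
        long result = 0;
        int inBit = 0;
        while (mask != 0) {
            long lowest = mask & -mask;
            if (((x >>> inBit) & 1L) != 0) {
                result |= lowest;
            }
            inBit++;
            mask &= mask - 1;
        }
        return result;
    }

    // Closed form suggested in the review for compress(m, m).
    // The m == -1 case is special because 1L << 64 wraps around to 1L << 0 in Java.
    static long compressSelf(long m) {
        return (m == -1L) ? m : (1L << Long.bitCount(m)) - 1;
    }

    public static void main(String[] args) {
        long m = 0xACL;                                          // binary 1010_1100, four bits set
        System.out.println(portableCompress(m, m));              // prints 15
        System.out.println(compressSelf(m));                     // prints 15
        System.out.println(portableCompress(0xB6L, 0xF0L));      // prints 11
        System.out.println(portableExpand(0xBL, 0xF0L));         // prints 176
    }
}
```

The loops visit one mask bit per iteration, so they are meant for checking results on small inputs and are not a performance reference.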
------------- PR: https://git.openjdk.java.net/jdk/pull/8498 From jbhateja at openjdk.java.net Mon May 16 15:26:03 2022 From: jbhateja at openjdk.java.net (Jatin Bhateja) Date: Mon, 16 May 2022 15:26:03 GMT Subject: RFR: 8283894: Intrinsify compress and expand bits on x86 [v2] In-Reply-To: References: Message-ID: <-GFc--2fKYjhzpidMTExUpnsRwMbVAxnPYXWeECozI8=.6c1ef2c1-282e-43e5-ad98-fe67a33213cd@github.com> On Mon, 16 May 2022 15:18:26 GMT, Jatin Bhateja wrote: >> src/hotspot/share/opto/intrinsicnode.cpp line 155: >> >>> 153: return new AndLNode(compr, src->in(1)); >>> 154: } >>> 155: } >> >> I think a further rule for `compress(m, m)` could be in order. >> >> >> compress(m, m) = m==-1 ? m : (1L << PopCount[IL](m))-1 >> >> >> This should be its own path through `Ideal`, not special logic at this particular point. >> >> Don't use it unless `Matcher::match_rule_supported(Op_PopCount[IL])` is true. > > This is a special case when both source and mask are same and we cannot take this call from C2Compiler::is_intrinsic_supported method. So currently if target does not support compress/expand bits then we do not create intrinsic in the first place. > >> Can you update the jtreg tests: >> >> 1. Modify `CompressExpandTest` to run with and without the intrinsic enabled >> 2. Disable (by default) `CompressExpandSanityTest` >> ? > > By disable you mean move it to ProblemList.txt. > I think a further rule for `compress(m, m)` could be in order. > > ``` > compress(m, m) = m==-1 ? m : (1L << PopCount[IL](m))-1 > ``` > > This should be its own path through `Ideal`, not special logic at this particular point. > > Don't use it unless `Matcher::match_rule_supported(Op_PopCount[IL])` is true. This is a special case when both source and mask are same and we cannot take this call from C2Compiler::is_intrinsic_supported method. So currently if target does not support compress/expand bits then we do not create intrinsic in the first place. 
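The `compress(m, m)` rule suggested in the review can be checked with a small Java sketch. This is an assumption-laden illustration, not code from the PR: the class `CompressSelfIdentity`, the reference `compress` loop, and the helper `compressSelf` are all hypothetical names. The `m == -1` special case exists because a shift by 64 would not produce the intended all-ones value.

```java
// Hypothetical check of the proposed rule
//   compress(m, m) = m == -1 ? m : (1L << PopCount(m)) - 1
// using a simple reference implementation of compress.
final class CompressSelfIdentity {

    // Reference compress: gathers the bits of x selected by mask into the low bits.
    static long compress(long x, long mask) {
        long result = 0;
        int outPos = 0;
        for (long m = mask; m != 0; m &= m - 1) {
            if ((x & (m & -m)) != 0) {
                result |= 1L << outPos;
            }
            outPos++;
        }
        return result;
    }

    // Closed form from the review comment; every selected bit of m is 1,
    // so the result is bitCount(m) low-order ones.
    static long compressSelf(long m) {
        return m == -1L ? m : (1L << Long.bitCount(m)) - 1;
    }

    public static void main(String[] args) {
        long[] samples = {0L, 1L, 0xF0L, 0x8000000000000001L, Long.MAX_VALUE, -1L};
        for (long m : samples) {
            if (compress(m, m) != compressSelf(m)) {
                throw new AssertionError("mismatch for " + Long.toHexString(m));
            }
        }
        System.out.println("identity holds for all samples");
    }
}
```

In C2 the closed form would need a `PopCount[IL]` node, which is why the review asks for a `Matcher::match_rule_supported` guard; the sketch above only demonstrates the arithmetic.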
------------- PR: https://git.openjdk.java.net/jdk/pull/8498 From kvn at openjdk.java.net Mon May 16 16:05:46 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Mon, 16 May 2022 16:05:46 GMT Subject: RFR: 8284115: [IR Framework] Compilation is not found due to rare safepoint while dumping PrintIdeal/PrintOptoAssembly In-Reply-To: References: Message-ID: On Fri, 13 May 2022 07:45:18 GMT, Christian Hagedorn wrote: > This is yet another manifestation of the safepointing problem while printing a `PrintIdeal/PrintOptoAssembly` block. > > In this case here, a safepoint is done and the `` message is emitted while dumping a `PrintIdeal` block of `retainDenominator()` inside the `hotspot_pid` file. During this interruption, another test class method is enqueued for compilation which is logged to the `hotspot_pid` file before the printing of the `PrintIdeal` block resumes: > > > # PrintIdeal output of retainDenominator() > 3 Start === 3 0 [[ 3 5 6 7 8 9 13 11 ]] #{0:control, 1:abIO, 2:memory, 3:rawptr:BotPTR, 4:return_address, 5:compiler/c2/irTests/DivLNodeIdealizationTests:NotNull *, 6:long, 7:half, 8:long, 9:half} > 36 CallStaticJava === 34 6 7 8 9 ( 35 1 1 1 1 1 26 1 27 1 ) [[ 37 ]] # Static uncommon_trap(reason='div0_check' action='maybe_recompile' debug_id='0') void ( int ) C=0.000100 DivLNodeIdealizationTests::retainDenominator @ bci:4 (line 130) !jvms: DivLNodeIdealizationTests::retainDenominator > > # Safepoint interruption > > > # Enqueuing of another test class method identityThird() > > > @ bci:4 (line 130) > > # Continue to dump PrintIdeal of retainDenominator() > 41 DivL === 33 26 13 [[ 42 ]] !jvms: DivLNodeIdealizationTests::retainDenominator @ bci:4 (line 130) > 9 Parm === 3 [[ 42 36 ]] ReturnAdr !jvms: DivLNodeIdealizationTests::retainDenominator @ bci:-1 (line 130) > > > The `HotSpotPidFileParser` looks for these enqueue messages containing the method name in order to find and correctly map the corresponding `PrintIdeal` and `PrintOptoAssembly` 
outputs. However, the `HotSpotPidFileParser` does not expect such an enqueuing message to be found inside a `PrintIdeal/PrintOptoAssembly` block and thus ignores it. As a result, we later do not parse the `PrintIdeal` and `PrintOptoAssembly` output of the enqueued method during the safepoint and fail with the assertion that we did not find any compilation output for the method. > > In the example above, the assertion says that we did not find the compilation output of `identityThird()` whose enqueue message was ignored inside the `PrintIdeal` block of `retainDenominator()`. > > The proposed fix is to make `HotSpotPidFileParser` aware of the possibility of a safepoint while reading the `PrintIdeal` or `PrintOptoAssembly` output and therefore add a check if there was a method enqueued for compilation while reading inside `BlockOutputReader::readBlock()`. > > Thanks, > Christian Okay if you say so. ------------- Marked as reviewed by kvn (Reviewer). PR: https://git.openjdk.java.net/jdk/pull/8692 From jvernee at openjdk.java.net Mon May 16 16:06:24 2022 From: jvernee at openjdk.java.net (Jorn Vernee) Date: Mon, 16 May 2022 16:06:24 GMT Subject: RFR: 8283689: Update the foreign linker VM implementation [v20] In-Reply-To: References: Message-ID: > Hi, > > This PR updates the VM implementation of the foreign linker, by bringing over commits from the panama-foreign repo. > > This is split off from the main JEP integration for 19, since we have limited resources to handle this. As such, this PR might fall over to 20, but it would be nice if we could get it into 19. > > I've written up an overview of the Linker architecture here: http://cr.openjdk.java.net/~jvernee/docs/FL_Overview.html it might be useful to read that first. > > This patch moves from the "legacy" implementation, to what is currently implemented in the panama-foreign repo, except for replacing the use of method handle combinators with ASM. That will come in a later patch. To recap.
This PR contains the following changes: > > 1. VM stubs for downcalls are now generated up front, instead of lazily by C2 [1]. > 2. the VM support for upcalls/downcalls now support all possible call shapes. And VM stubs and Java code implementing the buffered invocation strategy has been removed [2], [3], [4], [5]. > 3. The existing C2 intrinsification support for the `linkToNative` method handle linker was no longer needed and has been removed [6] (support might be re-added in another form later). > 4. Some other cleanups, such as: OptimizedEntryBlob (for upcalls) now implements RuntimeBlob directly. Binding to java classes has been rewritten to use javaClasses.h/cpp (this wasn't previously possible due to these java classes being in an incubator module) [7], [8], [9]. > > While the patch mostly consists of VM changes, there are also some Java changes to support (2). > > The original commit structure has been mostly retained, so it might be useful to look at a specific commit, or the corresponding patch in the [panama-foreign](https://github.com/openjdk/panama-foreign/pulls?q=is%3Apr) repo as well. I've also left some inline comments to explain some of the changes, which will hopefully make reviewing easier. 
> > Testing: Tier1-4 > > Thanks, > Jorn > > [1]: https://github.com/openjdk/jdk/pull/7959/commits/048b88156814579dca1f70742061ad24942fd358 > [2]: https://github.com/openjdk/jdk/pull/7959/commits/2fbbef472b4c2b4fee5ede2f18cd81ab61e88f49 > [3]: https://github.com/openjdk/jdk/pull/7959/commits/8a957a4ed9cc8d1f708ea8777212eb51ab403dc3 > [4]: https://github.com/openjdk/jdk/pull/7959/commits/35ba1d964f1de4a77345dc58debe0565db4b0ff3 > [5]: https://github.com/openjdk/jdk/pull/7959/commits/4e72aae22920300c5ffa16fed805b62ed9092120 > [6]: https://github.com/openjdk/jdk/pull/7959/commits/08e22e1b468c5c8f0cfd7135c72849944068aa7a > [7]: https://github.com/openjdk/jdk/pull/7959/commits/451cd9edf54016c182dab21a8b26bd8b609fc062 > [8]: https://github.com/openjdk/jdk/pull/7959/commits/4c851d2795afafec3a3ab17f4142ee098692068f > [9]: https://github.com/openjdk/jdk/pull/7959/commits/d025377799424f31512dca2ffe95491cd5ae22f9 Jorn Vernee has updated the pull request incrementally with one additional commit since the last revision: Cleanup UL usage ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/7959/files - new: https://git.openjdk.java.net/jdk/pull/7959/files/ff8835ee..d611f365 Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=7959&range=19 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=7959&range=18-19 Stats: 14 lines in 5 files changed: 2 ins; 1 del; 11 mod Patch: https://git.openjdk.java.net/jdk/pull/7959.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/7959/head:pull/7959 PR: https://git.openjdk.java.net/jdk/pull/7959 From jvernee at openjdk.java.net Mon May 16 16:06:25 2022 From: jvernee at openjdk.java.net (Jorn Vernee) Date: Mon, 16 May 2022 16:06:25 GMT Subject: RFR: 8283689: Update the foreign linker VM implementation [v19] In-Reply-To: References: Message-ID: On Mon, 16 May 2022 14:52:04 GMT, Jorn Vernee wrote: >> Hi, >> >> This PR updates the VM implementation of the foreign linker, by bringing over commits from the 
panama-foreign repo. >> >> This is split off from the main JEP integration for 19, since we have limited resources to handle this. As such, this PR might fall over to 20, but it would be nice if we could get it into 19. >> >> I've written up an overview of the Linker architecture here: http://cr.openjdk.java.net/~jvernee/docs/FL_Overview.html it might be useful to read that first. >> >> This patch moves from the "legacy" implementation, to what is currently implemented in the panama-foreign repo, except for replacing the use of method handle combinators with ASM. That will come in a later patch. To recap. This PR contains the following changes: >> >> 1. VM stubs for downcalls are now generated up front, instead of lazily by C2 [1]. >> 2. the VM support for upcalls/downcalls now support all possible call shapes. And VM stubs and Java code implementing the buffered invocation strategy has been removed [2], [3], [4], [5]. >> 3. The existing C2 intrinsification support for the `linkToNative` method handle linker was no longer needed and has been removed [6] (support might be re-added in another form later). >> 4. Some other cleanups, such as: OptimizedEntryBlob (for upcalls) now implements RuntimeBlob directly. Binding to java classes has been rewritten to use javaClasses.h/cpp (this wasn't previously possible due to these java classes being in an incubator module) [7], [8], [9]. >> >> While the patch mostly consists of VM changes, there are also some Java changes to support (2). >> >> The original commit structure has been mostly retained, so it might be useful to look at a specific commit, or the corresponding patch in the [panama-foreign](https://github.com/openjdk/panama-foreign/pulls?q=is%3Apr) repo as well. I've also left some inline comments to explain some of the changes, which will hopefully make reviewing easier.
>> >> Testing: Tier1-4 >> >> Thanks, >> Jorn >> >> [1]: https://github.com/openjdk/jdk/pull/7959/commits/048b88156814579dca1f70742061ad24942fd358 >> [2]: https://github.com/openjdk/jdk/pull/7959/commits/2fbbef472b4c2b4fee5ede2f18cd81ab61e88f49 >> [3]: https://github.com/openjdk/jdk/pull/7959/commits/8a957a4ed9cc8d1f708ea8777212eb51ab403dc3 >> [4]: https://github.com/openjdk/jdk/pull/7959/commits/35ba1d964f1de4a77345dc58debe0565db4b0ff3 >> [5]: https://github.com/openjdk/jdk/pull/7959/commits/4e72aae22920300c5ffa16fed805b62ed9092120 >> [6]: https://github.com/openjdk/jdk/pull/7959/commits/08e22e1b468c5c8f0cfd7135c72849944068aa7a >> [7]: https://github.com/openjdk/jdk/pull/7959/commits/451cd9edf54016c182dab21a8b26bd8b609fc062 >> [8]: https://github.com/openjdk/jdk/pull/7959/commits/4c851d2795afafec3a3ab17f4142ee098692068f >> [9]: https://github.com/openjdk/jdk/pull/7959/commits/d025377799424f31512dca2ffe95491cd5ae22f9 > > Jorn Vernee has updated the pull request incrementally with one additional commit since the last revision: > > Fix failure with SPEC disabled (accidentally dropped change) @robehn found a test failure in a non-default configuration that I've fixed [1] (it was addressed in the panama repo by a different patch). I've also cleaned up use of UL a bit [2]: the `panama` tag was renamed to `foreign` and I've added the `downcall` and `upcall` tags as well, for down/up call logging respectively. This is now all under NOT_PRODUCT. 
[1]: https://github.com/openjdk/jdk/pull/7959/commits/ff8835ee99203e94fb216c5bd7cf1ce610d5737f [2]: https://github.com/openjdk/jdk/pull/7959/commits/d611f365ade15cd7a7d005547814ce88fff0ca1a ------------- PR: https://git.openjdk.java.net/jdk/pull/7959 From jvernee at openjdk.java.net Mon May 16 16:15:49 2022 From: jvernee at openjdk.java.net (Jorn Vernee) Date: Mon, 16 May 2022 16:15:49 GMT Subject: RFR: 8283689: Update the foreign linker VM implementation [v21] In-Reply-To: References: Message-ID: <-H-hj0CmArcV48YOFCPUC1yIZXJxZ41p9PGBP-E_Vc0=.dad63c91-0e01-4124-9b44-467134a26b75@github.com> > Hi, > > This PR updates the VM implementation of the foreign linker, by bringing over commits from the panama-foreign repo. > > This is split off from the main JEP integration for 19, since we have limited resources to handle this. As such, this PR might fall over to 20, but it would be nice if we could get it into 19. > > I've written up an overview of the Linker architecture here: http://cr.openjdk.java.net/~jvernee/docs/FL_Overview.html it might be useful to read that first. > > This patch moves from the "legacy" implementation, to what is currently implemented in the panama-foreign repo, except for replacing the use of method handle combinators with ASM. That will come in a later patch. To recap. This PR contains the following changes: > > 1. VM stubs for downcalls are now generated up front, instead of lazily by C2 [1]. > 2. the VM support for upcalls/downcalls now support all possible call shapes. And VM stubs and Java code implementing the buffered invocation strategy has been removed [2], [3], [4], [5]. > 3. The existing C2 intrinsification support for the `linkToNative` method handle linker was no longer needed and has been removed [6] (support might be re-added in another form later). > 4. Some other cleanups, such as: OptimizedEntryBlob (for upcalls) now implements RuntimeBlob directly.
Binding to java classes has been rewritten to use javaClasses.h/cpp (this wasn't previously possible due to these java classes being in an incubator module) [7], [8], [9]. > > While the patch mostly consists of VM changes, there are also some Java changes to support (2). > > The original commit structure has been mostly retained, so it might be useful to look at a specific commit, or the corresponding patch in the [panama-foreign](https://github.com/openjdk/panama-foreign/pulls?q=is%3Apr) repo as well. I've also left some inline comments to explain some of the changes, which will hopefully make reviewing easier. > > Testing: Tier1-4 > > Thanks, > Jorn > > [1]: https://github.com/openjdk/jdk/pull/7959/commits/048b88156814579dca1f70742061ad24942fd358 > [2]: https://github.com/openjdk/jdk/pull/7959/commits/2fbbef472b4c2b4fee5ede2f18cd81ab61e88f49 > [3]: https://github.com/openjdk/jdk/pull/7959/commits/8a957a4ed9cc8d1f708ea8777212eb51ab403dc3 > [4]: https://github.com/openjdk/jdk/pull/7959/commits/35ba1d964f1de4a77345dc58debe0565db4b0ff3 > [5]: https://github.com/openjdk/jdk/pull/7959/commits/4e72aae22920300c5ffa16fed805b62ed9092120 > [6]: https://github.com/openjdk/jdk/pull/7959/commits/08e22e1b468c5c8f0cfd7135c72849944068aa7a > [7]: https://github.com/openjdk/jdk/pull/7959/commits/451cd9edf54016c182dab21a8b26bd8b609fc062 > [8]: https://github.com/openjdk/jdk/pull/7959/commits/4c851d2795afafec3a3ab17f4142ee098692068f > [9]: https://github.com/openjdk/jdk/pull/7959/commits/d025377799424f31512dca2ffe95491cd5ae22f9 Jorn Vernee has updated the pull request incrementally with one additional commit since the last revision: Missing ASSERT -> NOT_PRODUCT ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/7959/files - new: https://git.openjdk.java.net/jdk/pull/7959/files/d611f365..406f3e83 Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=7959&range=20 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=7959&range=19-20 Stats: 2 lines in 2 files 
changed: 0 ins; 0 del; 2 mod Patch: https://git.openjdk.java.net/jdk/pull/7959.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/7959/head:pull/7959 PR: https://git.openjdk.java.net/jdk/pull/7959 From jbhateja at openjdk.java.net Mon May 16 17:05:27 2022 From: jbhateja at openjdk.java.net (Jatin Bhateja) Date: Mon, 16 May 2022 17:05:27 GMT Subject: RFR: 8283894: Intrinsify compress and expand bits on x86 [v3] In-Reply-To: References: Message-ID: > Summary of changes: > > - Patch intrinsifies following newly added Java SE APIs > - Integer.compress > - Integer.expand > - Long.compress > - Long.expand > > - Adds C2 IR nodes and corresponding ideal transformations for new operations. > - We see around ~10x performance speedup due to intrinsification over X86 target. > - Adds an IR framework based test to validate newly introduced IR transformations. > > Kindly review and share your feedback. > > Best Regards, > Jatin Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: 8283894: Add missing -XX:+UnlockDiagnosticVMOptions. 
------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/8498/files - new: https://git.openjdk.java.net/jdk/pull/8498/files/f7ed0f8d..93aa5e2d Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=8498&range=02 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=8498&range=01-02 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.java.net/jdk/pull/8498.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8498/head:pull/8498 PR: https://git.openjdk.java.net/jdk/pull/8498 From dlong at openjdk.java.net Mon May 16 20:10:29 2022 From: dlong at openjdk.java.net (Dean Long) Date: Mon, 16 May 2022 20:10:29 GMT Subject: RFR: 8271078: jdk/incubator/vector/Float128VectorTests.java failed a subtest [v4] In-Reply-To: References: Message-ID: > This test was failing because the safepoint polling stub was only saving the low 8 bytes of XMM16-XMM31. We need to save all 16 bytes by default. I also added "wide" to the boolean parameter name to better reflect what it controls. And I made the asserts in save_live_registers() match what we have in restore_live_registers(). 
Dean Long has updated the pull request incrementally with one additional commit since the last revision: Just do full 512-bit memory accesses when -XX:+UseKNLSetting is set ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/8690/files - new: https://git.openjdk.java.net/jdk/pull/8690/files/a2bc8306..014035e2 Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=8690&range=03 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=8690&range=02-03 Stats: 4 lines in 1 file changed: 2 ins; 0 del; 2 mod Patch: https://git.openjdk.java.net/jdk/pull/8690.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8690/head:pull/8690 PR: https://git.openjdk.java.net/jdk/pull/8690 From vlivanov at openjdk.java.net Mon May 16 20:27:34 2022 From: vlivanov at openjdk.java.net (Vladimir Ivanov) Date: Mon, 16 May 2022 20:27:34 GMT Subject: RFR: 8271078: jdk/incubator/vector/Float128VectorTests.java failed a subtest [v4] In-Reply-To: References: Message-ID: On Mon, 16 May 2022 20:10:29 GMT, Dean Long wrote: >> This test was failing because the safepoint polling stub was only saving the low 8 bytes of XMM16-XMM31. We need to save all 16 bytes by default. I also added "wide" to the boolean parameter name to better reflect what it controls. And I made the asserts in save_live_registers() match what we have in restore_live_registers(). > > Dean Long has updated the pull request incrementally with one additional commit since the last revision: > > Just do full 512-bit memory accesses when -XX:+UseKNLSetting is set Still looks good. I prefer option #1 as well. ------------- Marked as reviewed by vlivanov (Reviewer). 
PR: https://git.openjdk.java.net/jdk/pull/8690 From kvn at openjdk.java.net Mon May 16 21:43:44 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Mon, 16 May 2022 21:43:44 GMT Subject: RFR: 8271078: jdk/incubator/vector/Float128VectorTests.java failed a subtest [v4] In-Reply-To: References: Message-ID: <97MR1dFM2_fsHIT-Cjb_nD3hXzj7G4vPUBTVtSRtp30=.93fc5ec0-ffde-4364-a56c-c2072b307e0d@github.com> On Mon, 16 May 2022 20:10:29 GMT, Dean Long wrote: >> This test was failing because the safepoint polling stub was only saving the low 8 bytes of XMM16-XMM31. We need to save all 16 bytes by default. I also added "wide" to the boolean parameter name to better reflect what it controls. And I made the asserts in save_live_registers() match what we have in restore_live_registers(). > > Dean Long has updated the pull request incrementally with one additional commit since the last revision: > > Just do full 512-bit memory accesses when -XX:+UseKNLSetting is set Update looks good. ------------- Marked as reviewed by kvn (Reviewer). PR: https://git.openjdk.java.net/jdk/pull/8690 From sviswanathan at openjdk.java.net Mon May 16 22:25:41 2022 From: sviswanathan at openjdk.java.net (Sandhya Viswanathan) Date: Mon, 16 May 2022 22:25:41 GMT Subject: RFR: 8271078: jdk/incubator/vector/Float128VectorTests.java failed a subtest [v4] In-Reply-To: References: Message-ID: On Mon, 16 May 2022 20:10:29 GMT, Dean Long wrote: >> This test was failing because the safepoint polling stub was only saving the low 8 bytes of XMM16-XMM31. We need to save all 16 bytes by default. I also added "wide" to the boolean parameter name to better reflect what it controls. And I made the asserts in save_live_registers() match what we have in restore_live_registers(). 
> > Dean Long has updated the pull request incrementally with one additional commit since the last revision: > > Just do full 512-bit memory accesses when -XX:+UseKNLSetting is set src/hotspot/cpu/x86/sharedRuntime_x86_64.cpp line 237: > 235: } else { > 236: if (VM_Version::supports_evex()) { > 237: // Save upper bank of XMM registers(16..31) for scalar or 16-byte vector usage I have a question: This is on the else path where save_wide_vectors is false. Why do we need to save the full vector registers here? Could this be a problem at the call site of this method (save_live_registers) from where save_wide_vectors is not being sent as true? ------------- PR: https://git.openjdk.java.net/jdk/pull/8690 From dlong at openjdk.java.net Mon May 16 22:39:33 2022 From: dlong at openjdk.java.net (Dean Long) Date: Mon, 16 May 2022 22:39:33 GMT Subject: RFR: 8271078: jdk/incubator/vector/Float128VectorTests.java failed a subtest [v3] In-Reply-To: References: Message-ID: On Sun, 15 May 2022 15:24:13 GMT, Vladimir Kozlov wrote: >> Dean Long has updated the pull request incrementally with one additional commit since the last revision: >> >> save_vectors --> save_wide_vectors > > I agree with option 1) but may be save whole 512 bits unconditionally with comment explaining that we need to save 128 bits but it requires avx512vl support which is not available on all CPUs. We do have space for that anyway (i*64). > > But it would be more memory traffic :( So lets do as you suggested 1). At least in most cases (avx512vl supported) we get less traffic. Thanks @vnkozlov, @iwanowww, and @PaulSandoz. 
------------- PR: https://git.openjdk.java.net/jdk/pull/8690 From dlong at openjdk.java.net Mon May 16 22:39:33 2022 From: dlong at openjdk.java.net (Dean Long) Date: Mon, 16 May 2022 22:39:33 GMT Subject: RFR: 8271078: jdk/incubator/vector/Float128VectorTests.java failed a subtest [v4] In-Reply-To: References: Message-ID: On Mon, 16 May 2022 22:22:11 GMT, Sandhya Viswanathan wrote: >> Dean Long has updated the pull request incrementally with one additional commit since the last revision: >> >> Just do full 512-bit memory accesses when -XX:+UseKNLSetting is set > > src/hotspot/cpu/x86/sharedRuntime_x86_64.cpp line 237: > >> 235: } else { >> 236: if (VM_Version::supports_evex()) { >> 237: // Save upper bank of XMM registers(16..31) for scalar or 16-byte vector usage > > I have a question: > This is on the else path where save_wide_vectors is false. Why do we need to save the full vector registers here? > Could this be a problem at the call site of this method (save_live_registers) from where save_wide_vectors is not being sent as true? C2 uses SharedRuntime::is_wide_vector(C->max_vector_size()) to determine the value of save_wide_vectors, and is_wide_vector() does `return size > 16;` so we need to save vector bytes 0 - 15 even if save_wide_vectors is false. ------------- PR: https://git.openjdk.java.net/jdk/pull/8690 From xliu at openjdk.java.net Mon May 16 22:50:43 2022 From: xliu at openjdk.java.net (Xin Liu) Date: Mon, 16 May 2022 22:50:43 GMT Subject: RFR: 8275201: C2: hide klass() accessor from TypeOopPtr and typeKlassPtr subclasses [v2] In-Reply-To: References: Message-ID: <3hWYncRMCaWRwoXJA-jSHTJ9yzl_puvIrGDAwZlu9mE=.5ab1cda1-4ac1-4121-93a4-358ac6b80f76@github.com> On Wed, 11 May 2022 07:24:06 GMT, Roland Westrelin wrote: >> Looks very good! >> >> Sorry for the delay with the review. > >> Looks very good! > > Thanks for the review. hi, @rwestrel , I see this patch uses covariant return in a few places, eg. 
- virtual const Type *cast_to_exactness(bool klass_is_exact) const; + virtual const TypeInstPtr* cast_to_exactness(bool klass_is_exact) const; and // Speculative type helper methods. - virtual const Type* remove_speculative() const; + virtual const TypeOopPtr* remove_speculative() const; This contradicts "Avoid covariant return types." from [hotspot-style](https://github.com/openjdk/jdk/blob/master/doc/hotspot-style.md). Is there any particular reason to change like that? I see that compile.cpp leverages that to improve expressiveness. ------------- PR: https://git.openjdk.java.net/jdk/pull/6717 From sviswanathan at openjdk.java.net Tue May 17 00:53:50 2022 From: sviswanathan at openjdk.java.net (Sandhya Viswanathan) Date: Tue, 17 May 2022 00:53:50 GMT Subject: RFR: 8283894: Intrinsify compress and expand bits on x86 [v2] In-Reply-To: References: Message-ID: <9aueQTI4qzCTfobf-NKLAWwx_fXh3FhWVg3TwVrlZJs=.2125ee41-f606-4883-a4bc-d0fffa0b3dbd@github.com> On Mon, 16 May 2022 15:25:55 GMT, Jatin Bhateja wrote: >> Summary of changes: >> >> - Patch intrinsifies following newly added Java SE APIs >> - Integer.compress >> - Integer.expand >> - Long.compress >> - Long.expand >> >> - Adds C2 IR nodes and corresponding ideal transformations for new operations. >> - We see around ~10x performance speedup due to intrinsification over X86 target. >> - Adds an IR framework based test to validate newly introduced IR transformations. >> >> Kindly review and share your feedback. >> >> Best Regards, >> Jatin > > Jatin Bhateja has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains four additional commits since the last revision: > > - 8283894: Review comments resolutions. > - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8283894 > - 8283894: Extending IR framework testcase with some functional test points. 
> - 8283894: Intrinsify compress and expand bits on x86 src/hotspot/cpu/x86/x86_32.ad line 11478: > 11476: //----------------------------- CompressBits/ExpandBits ------------------------ > 11477: > 11478: instruct compressBitsL_reg(eADXRegL dst, eBCXRegL src, eBDPRegL mask, eSIRegI rtmp, regF xtmp) %{ eFlags should be added here in effect statement. Same for expand. src/hotspot/cpu/x86/x86_32.ad line 11482: > 11480: match(Set dst (CompressBits src mask)); > 11481: effect(TEMP rtmp, TEMP xtmp); > 11482: format %{ "compress_bits32 $dst, $src, $mask\t! using $rtmp and $xtmp as TEMP" %} We are compressing 64 bits here so the name usage as compress_bits32 would cause confusion. You are probably indicating that this is on 32 bits platform but that distinction is not required. As you are using XMM register and not all 32 bit platforms support XMM, you also need to check for UseSSE > 0 in the predicate. The UseSSE > 0 check is needed for expand as well. src/hotspot/cpu/x86/x86_32.ad line 11552: > 11550: // from lower source register. > 11551: __ bind(mask_clipping); > 11552: __ blsrl($mask$$Register, $mask$$Register); Need to add check for BMI1 support in predicate. src/hotspot/share/opto/intrinsicnode.cpp line 275: > 273: mask >>= 1; > 274: } > 275: return TypeInteger::make(res, res, w, bt); For bt == T_INT res needs to be sign extended properly. Otherwise the checked_cast in TypeInteger::make() would assert if the bit 31 (MSB) is set in the res. Both jint and jlong are signed types. ------------- PR: https://git.openjdk.java.net/jdk/pull/8498 From kvn at openjdk.java.net Tue May 17 01:29:38 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Tue, 17 May 2022 01:29:38 GMT Subject: RFR: 8286104: use aggressive liveness for unstable_if traps [v2] In-Reply-To: References: Message-ID: On Thu, 12 May 2022 21:27:30 GMT, Xin Liu wrote: >> I found that some phi nodes are useful because its uses are uncommon_trap nodes. 
In worse case, it would hinder boxing object eliminating and scalar replacement. Test.java of JDK-8286104 reveals this issue. This patch allows c2 parser to collect liveness based on next bci for unstable_if traps. In most cases, next bci is closer to exits, so live locals are diminishing. It helps to reduce the number of inputs of unstable_if traps. >> >> This is not a REDO of Optimization of Box nodes in uncommon_trap(JDK-8261137). Two patches are orthogonal. I adapt test from [TestEliminateBoxInDebugInfo.java](https://github.com/openjdk/jdk/pull/2401/files#diff-49b2e38825aa4c28ca196bdc70c3cbecc2e835c2899f4f393527df4796b177ea), so part of credit goes to the original author. I found that Scalar replacement can take care of the object `Integer ii = Integer.valueOf(value)` in original test even it can't be removed by later inliner. I tweak the profiling data of Integer.valueOf() to hinder scalar replacement. >> >> This patch can cover the problem discovered by JDK-8261137. I ran the microbench and got 9x speedup on x86_64. It's almost same as JDK-8261137. >> >> Before: >> >> Benchmark Mode Cnt Score Error Units >> MyBenchmark.testMethod avgt 10 32.776 ± 0.075 us/op >> >> After: >> >> Benchmark Mode Cnt Score Error Units >> MyBenchmark.testMethod avgt 10 3.656 ± 0.006 us/op >> ``` >> >> Testing >> I ran tier1 test with and without `-XX:+DeoptimizeALot`. No regression has been found yet. > > Xin Liu has updated the pull request incrementally with 11 additional commits since the last revision: > > - revert code change from 1st revision. > - Merge branch 'JDK-8276998' into JDK-8286104 > - rule out if a If nodes has 2 branches of unstable_if trap. > - change the flag to diagnostic. > - add sanity check for operands if bc is if_acmp_eq/ne and ifnull/nonnull > - fix release build > - update unstable_if after igvn. > - adjust unstable_if after fold_compares > - disable comparison_folding temporarily.
> > This feature not only folds two CMPI but also merge two uncommon_traps. > it uses the dominating uncommon_trap and revaluate the two if in > interpreter. currently, aggressiveliveness can't work for that. > - retain bci for unstable_if > - ... and 1 more: https://git.openjdk.java.net/jdk/compare/2c38b87b...2f047457 We can work with this approach. Did you consider recording uncommon traps in `adjust_map_after_if()` instead of the `If` node in its `Ideal()` method? And using a new `CallStaticJavaNode::_unc_bci` instead of the one in `IfNode`. I am suggesting it because it looks logical to track uncommon traps instead of Ifs for this optimization. And `process_for_unstable_ifs()` mostly processes information from uncommon traps. I am not sure why you decided to record `IfNode`. Maybe it was a much simpler way when you need to reset them when uncommon traps are merged in the two places where you do `set_unc_bci()`. But you can do the same for uncommon traps there. You also need to make sure uncommon trap call nodes are removed from the list when they are removed from code. Also `uncommon_trap_proj()` may return a merged uncommon trap (referenced through a Region node). Instead you can check if it is a simple `If->Proj->unc_trap` pattern to filter out complex cases. You call `process_for_unstable_ifs` in 2 places. I understand why you want to do that before `inline_boxing_calls()`. But why before `inline_incrementally()`? Why do you need the second call? ------------- PR: https://git.openjdk.java.net/jdk/pull/8545 From duke at openjdk.java.net Tue May 17 01:58:42 2022 From: duke at openjdk.java.net (Quan Anh Mai) Date: Tue, 17 May 2022 01:58:42 GMT Subject: RFR: 8286182: C2: crash with SIGFPE when executing compiled code In-Reply-To: References: Message-ID: On Mon, 16 May 2022 12:36:43 GMT, Martin Doerr wrote: > The bug is not assigned to me, but I have seen that the C2 code which checks for div by 0 is not aware of the new nodes from [JDK-8284742](https://bugs.openjdk.java.net/browse/JDK-8284742).
> This fixes the VM to pass the reproducer. I'm not sure if more opcode checks are required to get added. Thanks a lot for taking a look at this. I am considering this option, too. The problem is that `NoOvfDivI` does not only depend on the zero-divisor check but a possible overflow check as well. So with this fix it is still possible for a `SIGFPE` to occur. IIUC this trouble comes from the fact that on x86 a `Div` node must be pinned to its zero-divisor check but may float with regards to other control nodes. Maybe we can remove all this special handling and simply catch `SIGFPE` instead? The result is guaranteed to not be used in those cases so we may not worry about the correctness of the compiled code. ------------- PR: https://git.openjdk.java.net/jdk/pull/8726 From njian at openjdk.java.net Tue May 17 02:07:39 2022 From: njian at openjdk.java.net (Ningsheng Jian) Date: Tue, 17 May 2022 02:07:39 GMT Subject: RFR: 8281712: [REDO] AArch64: Implement string_compare intrinsic in SVE In-Reply-To: References: Message-ID: On Mon, 16 May 2022 07:21:27 GMT, Ningsheng Jian wrote: > This is the REDO of JDK-8269559 and JDK-8275448. Those two backouts finally turned to be some system zlib issue in AArch64 macOS, and is not related to the patch itself. See [1][2] for details. > > This patch is generally the same as JDK-8275448, which uses SVE to optimize string_compare intrinsics for long string comparisons. I did a rebase with small tweaks to get better performance on recent Neoverse hardware. 
Test data on systems with different SVE vector sizes: > > > case delta size 128-bits 256-bits 512-bits > compareToLL 2 24 0.17% 0.58% 0.00% > compareToLL 2 36 0.00% 2.25% 0.04% > compareToLL 2 72 -4.40% 3.87% -12.82% > compareToLL 2 128 4.55% 58.31% 13.53% > compareToLL 2 256 19.39% 69.77% 82.03% > compareToLL 2 512 1.81% 68.38% 170.93% > compareToLU 2 24 25.57% 46.98% 54.61% > compareToLU 2 36 36.03% 70.26% 94.33% > compareToLU 2 72 35.86% 90.58% 146.04% > compareToLU 2 128 70.82% 119.19% 266.22% > compareToLU 2 256 80.77% 146.33% 420.01% > compareToLU 2 512 94.62% 171.72% 530.87% > compareToUL 2 24 20.82% 34.48% 62.14% > compareToUL 2 36 39.77% 60.79% 69.77% > compareToUL 2 72 35.46% 84.34% 121.90% > compareToUL 2 128 67.77% 110.97% 220.53% > compareToUL 2 256 77.05% 160.29% 331.30% > compareToUL 2 512 91.88% 184.57% 524.21% > compareToUU 2 24 -0.13% 0.40% 0.00% > compareToUU 2 36 -9.18% 12.84% -13.93% > compareToUU 2 72 1.67% 60.61% 6.69% > compareToUU 2 128 13.51% 60.33% 55.27% > compareToUU 2 256 2.55% 62.17% 153.26% > compareToUU 2 512 4.12% 68.62% 201.68% > > JTreg tests passed on SVE hardware. > > [1] https://bugs.openjdk.java.net/browse/JDK-8275448 > [2] https://bugs.openjdk.java.net/browse/JDK-8282954 @TobiHartmann Could you please help to test this patch in Oracle test system? Thanks! ------------- PR: https://git.openjdk.java.net/jdk/pull/8723 From duke at openjdk.java.net Tue May 17 03:19:24 2022 From: duke at openjdk.java.net (Haomin) Date: Tue, 17 May 2022 03:19:24 GMT Subject: RFR: 8286847: Rotate vectors don't support byte or short Message-ID: static void test_fun(byte[] a0, int[] b0, byte[] c0) { for (int i=0; i>> (-7)); } } when I implement RotateLeftV in loongarch.ad, I found this executed by c2 vector and executed by interpreter are not equal. It's executed on x86 would create an assert error. 
# # Internal Error (/home/wanghaomin/jdk/src/hotspot/share/opto/vectornode.cpp:347), pid=26469, tid=26485 # assert(false) failed: not supported: byte # RotateRightV for byte, short values produces incorrect Java result. Because java code should convert a byte, short value into int value, and then do RotateI. ------------- Commit messages: - 8286847: Rotate vectors don't support byte or short Changes: https://git.openjdk.java.net/jdk/pull/8740/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=8740&range=00 Issue: https://bugs.openjdk.java.net/browse/JDK-8286847 Stats: 224 lines in 3 files changed: 222 ins; 0 del; 2 mod Patch: https://git.openjdk.java.net/jdk/pull/8740.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8740/head:pull/8740 PR: https://git.openjdk.java.net/jdk/pull/8740 From kvn at openjdk.java.net Tue May 17 03:38:47 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Tue, 17 May 2022 03:38:47 GMT Subject: RFR: 8271078: jdk/incubator/vector/Float128VectorTests.java failed a subtest [v4] In-Reply-To: References: Message-ID: On Mon, 16 May 2022 20:10:29 GMT, Dean Long wrote: >> This test was failing because the safepoint polling stub was only saving the low 8 bytes of XMM16-XMM31. We need to save all 16 bytes by default. I also added "wide" to the boolean parameter name to better reflect what it controls. And I made the asserts in save_live_registers() match what we have in restore_live_registers(). > > Dean Long has updated the pull request incrementally with one additional commit since the last revision: > > Just do full 512-bit memory accesses when -XX:+UseKNLSetting is set Regarding @sviswa7 question. The comment in [sharedRuntime_x86_64.cpp#L458](https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/x86/sharedRuntime_x86_64.cpp#L458) says: `16 bytes XMM registers are saved by default using fxsave/fxrstor instructions.` That is why we did not care about saving 128 bit xmm registers before AVX512. 
Unfortunately `fxsave` saves only `xmm0-xmm15`, so we save `xmm16-xmm31` manually in the code Dean is fixing. But until now we saved only 64 bits of each. What surprised me is that there is no EVEX instruction to save only 128 bits of the `xmm16-31` registers if `avx512vl` is not supported. I see specific asserts regarding that: [macroAssembler_x86.cpp#L2561](https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/x86/macroAssembler_x86.cpp#L2561) ------------- PR: https://git.openjdk.java.net/jdk/pull/8690 From duke at openjdk.java.net Tue May 17 04:30:18 2022 From: duke at openjdk.java.net (Srinivas Vamsi Parasa) Date: Tue, 17 May 2022 04:30:18 GMT Subject: RFR: 8285868: x86_64 intrinsics for floating point methods isNaN, isFinite, IsInfinite [v2] In-Reply-To: References: Message-ID: > We develop optimized x86_64 intrinsics for the floating point class check methods isNaN(), isFinite() and isInfinite() for Float and Double classes. JMH benchmarks show ~8x improvement for isNaN(), ~3x improvement for isInfinite() and 15% gain for isFinite(). Srinivas Vamsi Parasa has updated the pull request with a new target base due to a merge or a rebase.
The pull request now contains two commits: - Merge branch 'master' into float - 8285868: x86_64 intrinsics for floating point methods isNaN, isFinite and isInfinite ------------- Changes: https://git.openjdk.java.net/jdk/pull/8459/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=8459&range=01 Stats: 750 lines in 20 files changed: 748 ins; 0 del; 2 mod Patch: https://git.openjdk.java.net/jdk/pull/8459.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8459/head:pull/8459 PR: https://git.openjdk.java.net/jdk/pull/8459 From thartmann at openjdk.java.net Tue May 17 06:12:54 2022 From: thartmann at openjdk.java.net (Tobias Hartmann) Date: Tue, 17 May 2022 06:12:54 GMT Subject: RFR: 8281712: [REDO] AArch64: Implement string_compare intrinsic in SVE In-Reply-To: References: Message-ID: On Mon, 16 May 2022 07:21:27 GMT, Ningsheng Jian wrote: > This is the REDO of JDK-8269559 and JDK-8275448. Those two backouts finally turned to be some system zlib issue in AArch64 macOS, and is not related to the patch itself. See [1][2] for details. > > This patch is generally the same as JDK-8275448, which uses SVE to optimize string_compare intrinsics for long string comparisons. I did a rebase with small tweaks to get better performance on recent Neoverse hardware. 
Test data on systems with different SVE vector sizes: > > > case delta size 128-bits 256-bits 512-bits > compareToLL 2 24 0.17% 0.58% 0.00% > compareToLL 2 36 0.00% 2.25% 0.04% > compareToLL 2 72 -4.40% 3.87% -12.82% > compareToLL 2 128 4.55% 58.31% 13.53% > compareToLL 2 256 19.39% 69.77% 82.03% > compareToLL 2 512 1.81% 68.38% 170.93% > compareToLU 2 24 25.57% 46.98% 54.61% > compareToLU 2 36 36.03% 70.26% 94.33% > compareToLU 2 72 35.86% 90.58% 146.04% > compareToLU 2 128 70.82% 119.19% 266.22% > compareToLU 2 256 80.77% 146.33% 420.01% > compareToLU 2 512 94.62% 171.72% 530.87% > compareToUL 2 24 20.82% 34.48% 62.14% > compareToUL 2 36 39.77% 60.79% 69.77% > compareToUL 2 72 35.46% 84.34% 121.90% > compareToUL 2 128 67.77% 110.97% 220.53% > compareToUL 2 256 77.05% 160.29% 331.30% > compareToUL 2 512 91.88% 184.57% 524.21% > compareToUU 2 24 -0.13% 0.40% 0.00% > compareToUU 2 36 -9.18% 12.84% -13.93% > compareToUU 2 72 1.67% 60.61% 6.69% > compareToUU 2 128 13.51% 60.33% 55.27% > compareToUU 2 256 2.55% 62.17% 153.26% > compareToUU 2 512 4.12% 68.62% 201.68% > > JTreg tests passed on SVE hardware. > > [1] https://bugs.openjdk.java.net/browse/JDK-8275448 > [2] https://bugs.openjdk.java.net/browse/JDK-8282954 Marked as reviewed by thartmann (Reviewer). Sure, I already submitted testing yesterday, it all passed. ------------- PR: https://git.openjdk.java.net/jdk/pull/8723 From njian at openjdk.java.net Tue May 17 06:28:48 2022 From: njian at openjdk.java.net (Ningsheng Jian) Date: Tue, 17 May 2022 06:28:48 GMT Subject: RFR: 8281712: [REDO] AArch64: Implement string_compare intrinsic in SVE In-Reply-To: References: Message-ID: On Tue, 17 May 2022 06:09:11 GMT, Tobias Hartmann wrote: > Sure, I already submitted testing yesterday, it all passed. Thank you, Tobias! 
------------- PR: https://git.openjdk.java.net/jdk/pull/8723 From eliu at openjdk.java.net Tue May 17 06:56:50 2022 From: eliu at openjdk.java.net (Eric Liu) Date: Tue, 17 May 2022 06:56:50 GMT Subject: RFR: 8286847: Rotate vectors don't support byte or short In-Reply-To: References: Message-ID: On Tue, 17 May 2022 03:09:12 GMT, Haomin wrote: > static void test_fun(byte[] a0, int[] b0, byte[] c0) { > for (int i=0; i c0[i] = (byte)(a0[i] << (7) | a0[i] >>> (-7)); > } > } > > > when I implement RotateLeftV in loongarch.ad, I found this executed by c2 vector and executed by interpreter are not equal. > > It's executed on x86 would create an assert error. > > > # Internal Error (/home/wanghaomin/jdk/src/hotspot/share/opto/vectornode.cpp:347), pid=26469, tid=26485 > # assert(false) failed: not supported: byte > > > RotateLeftV for byte, short values produces incorrect Java result. Because java code should convert a byte, short value into int value, and then do RotateI. test/hotspot/jtreg/compiler/vectorization/TestRotateByteVector.java line 90: > 88: res[i] = (byte) ((arr[i] << shift) | (arr[i] >>> -shift)); > 89: } > 90: } This function is duplicated with with `testRotateLeft`. It's not enough to verify the correctness by comparing the results of these two functions. ------------- PR: https://git.openjdk.java.net/jdk/pull/8740 From duke at openjdk.java.net Tue May 17 07:01:20 2022 From: duke at openjdk.java.net (Srinivas Vamsi Parasa) Date: Tue, 17 May 2022 07:01:20 GMT Subject: RFR: 8285868: x86_64 intrinsics for floating point methods isNaN, isFinite and isInfinite [v3] In-Reply-To: References: Message-ID: > We develop optimized x86_64 intrinsics for the floating point class check methods isNaN(), isFinite() and IsInfinite() for Float and Double classes. JMH benchmarks show ~8x improvement for isNan(), ~3x improvement for isInfinite() and 15% gain for isFinite(). 
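For reference, the behaviour these intrinsics have to preserve can be written in plain Java. The sketch below only illustrates what the three `Float` checks compute in terms of the IEEE 754 single-precision bit layout. The class name `FloatClassCheckSketch` and its methods are made up for this illustration, and nothing here reflects the x86_64 code in the patch.

```java
final class FloatClassCheckSketch {
    // Bits 30..23 hold the 8-bit exponent of an IEEE 754 single-precision value.
    private static final int EXP_MASK = 0x7F800000;
    // Bits 22..0 hold the fraction.
    private static final int FRAC_MASK = 0x007FFFFF;

    // NaN: exponent all ones and a non-zero fraction.
    static boolean isNaN(float f) {
        int bits = Float.floatToRawIntBits(f);
        return (bits & EXP_MASK) == EXP_MASK && (bits & FRAC_MASK) != 0;
    }

    // Infinity (either sign): exponent all ones and a zero fraction.
    static boolean isInfinite(float f) {
        int bits = Float.floatToRawIntBits(f);
        return (bits & EXP_MASK) == EXP_MASK && (bits & FRAC_MASK) == 0;
    }

    // Finite: any value whose exponent is not all ones.
    static boolean isFinite(float f) {
        int bits = Float.floatToRawIntBits(f);
        return (bits & EXP_MASK) != EXP_MASK;
    }

    public static void main(String[] args) {
        float[] samples = {
            0.0f, -1.5f, Float.MIN_VALUE, Float.MAX_VALUE,
            Float.POSITIVE_INFINITY, Float.NEGATIVE_INFINITY, Float.NaN
        };
        for (float f : samples) {
            // Cross-check the bit-level version against the library methods.
            boolean agrees = isNaN(f) == Float.isNaN(f)
                    && isInfinite(f) == Float.isInfinite(f)
                    && isFinite(f) == Float.isFinite(f);
            System.out.println(f + " agrees with java.lang.Float: " + agrees);
        }
    }
}
```

The `Double` variants follow the same pattern with an 11-bit exponent and a 52-bit fraction.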
Srinivas Vamsi Parasa has updated the pull request incrementally with one additional commit since the last revision: update jmh tests ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/8459/files - new: https://git.openjdk.java.net/jdk/pull/8459/files/79d407bd..f4769cd3 Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=8459&range=02 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=8459&range=01-02 Stats: 6 lines in 2 files changed: 0 ins; 0 del; 6 mod Patch: https://git.openjdk.java.net/jdk/pull/8459.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8459/head:pull/8459 PR: https://git.openjdk.java.net/jdk/pull/8459 From duke at openjdk.java.net Tue May 17 07:14:34 2022 From: duke at openjdk.java.net (Haomin) Date: Tue, 17 May 2022 07:14:34 GMT Subject: RFR: 8286847: Rotate vectors don't support byte or short In-Reply-To: References: Message-ID: <0CPtvcY81Q2R9_6iONGV87qAJ4OH5izCGIAByWBBqSM=.5fd06621-8ecb-4788-ac3e-d5cd29784236@github.com> On Tue, 17 May 2022 06:53:31 GMT, Eric Liu wrote: >> static void test_fun(byte[] a0, int[] b0, byte[] c0) { >> for (int i=0; i> c0[i] = (byte)(a0[i] << (7) | a0[i] >>> (-7)); >> } >> } >> >> >> when I implement RotateLeftV in loongarch.ad, I found this executed by c2 vector and executed by interpreter are not equal. >> >> It's executed on x86 would create an assert error. >> >> >> # Internal Error (/home/wanghaomin/jdk/src/hotspot/share/opto/vectornode.cpp:347), pid=26469, tid=26485 >> # assert(false) failed: not supported: byte >> >> >> RotateLeftV for byte, short values produces incorrect Java result. Because java code should convert a byte, short value into int value, and then do RotateI. > > test/hotspot/jtreg/compiler/vectorization/TestRotateByteVector.java line 90: > >> 88: res[i] = (byte) ((arr[i] << shift) | (arr[i] >>> -shift)); >> 89: } >> 90: } > > This function is duplicated with with `testRotateLeft`. 
It's not enough to verify the correctness by comparing the results of these two functions. Yes, `rotateLeftRes` is duplicated with `testRotateLeft`. But compile only `testRotateLeft`, and then compare the result between the two function. Could you give me some suggestions ? How should I modify this ? ------------- PR: https://git.openjdk.java.net/jdk/pull/8740 From jbhateja at openjdk.java.net Tue May 17 08:11:37 2022 From: jbhateja at openjdk.java.net (Jatin Bhateja) Date: Tue, 17 May 2022 08:11:37 GMT Subject: RFR: 8284960: Integration of JEP 426: Vector API (Fourth Incubator) [v5] In-Reply-To: References: Message-ID: > Hi All, > > Patch adds the planned support for new vector operations and APIs targeted for [JEP 426: Vector API (Fourth Incubator).](https://bugs.openjdk.java.net/browse/JDK-8280173) > > Following is the brief summary of changes:- > > 1) Extends the scope of existing lanewise API for following new vector operations. > - VectorOperations.BIT_COUNT: counts the number of one-bits > - VectorOperations.LEADING_ZEROS_COUNT: counts the number of leading zero bits > - VectorOperations.TRAILING_ZEROS_COUNT: counts the number of trailing zero bits > - VectorOperations.REVERSE: reversing the order of bits > - VectorOperations.REVERSE_BYTES: reversing the order of bytes > - compress and expand bits: Semantics are based on Hacker's Delight section 7-4 Compress, or Generalized Extract. > > 2) Adds following new APIs to perform cross lane vector compress and expansion operations under the influence of a mask. > - Vector.compress > - Vector.expand > - VectorMask.compress > > 3) Adds predicated and non-predicated versions of following new APIs to load and store the contents of vector from foreign MemorySegments. > - Vector.fromMemorySegment > - Vector.intoMemorySegment > > 4) C2 Compiler IR enhancements and optimized X86 and AARCH64 backend support for each newly added operation. > > > Patch has been regressed over AARCH64 and X86 targets different AVX levels. 
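As a rough guide to what the new lanewise operations in item 1 compute, the following Java sketch spells out the semantics for a single `int` lane using scalar code. It deliberately avoids the incubating `jdk.incubator.vector` API. The `compressBits`/`expandBits` loops are a naive rendering of the Hacker's Delight description, and the class and method names are assumptions made for illustration only, not the implementation in the patch.

```java
final class LaneOpSketch {
    // Gather the bits of x selected by mask into the low-order bits of the result.
    static int compressBits(int x, int mask) {
        int result = 0;
        int outBit = 0;
        for (int i = 0; i < Integer.SIZE; i++) {
            if (((mask >>> i) & 1) != 0) {
                result |= ((x >>> i) & 1) << outBit;
                outBit++;
            }
        }
        return result;
    }

    // Inverse direction: scatter the low-order bits of x to the positions set in mask.
    static int expandBits(int x, int mask) {
        int result = 0;
        int inBit = 0;
        for (int i = 0; i < Integer.SIZE; i++) {
            if (((mask >>> i) & 1) != 0) {
                result |= ((x >>> inBit) & 1) << i;
                inBit++;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        int lane = 0b1011_0110;
        int mask = 0b1111_0000;
        // Scalar counterparts of the remaining per-lane operations.
        System.out.println("BIT_COUNT            " + Integer.bitCount(lane));
        System.out.println("LEADING_ZEROS_COUNT  " + Integer.numberOfLeadingZeros(lane));
        System.out.println("TRAILING_ZEROS_COUNT " + Integer.numberOfTrailingZeros(lane));
        System.out.println("REVERSE              " + Integer.toHexString(Integer.reverse(lane)));
        System.out.println("REVERSE_BYTES        " + Integer.toHexString(Integer.reverseBytes(lane)));
        System.out.println("compress bits        " + Integer.toBinaryString(compressBits(lane, mask)));
        System.out.println("expand bits          " + Integer.toBinaryString(expandBits(0b1011, mask)));
    }
}
```

Note that the cross-lane `Vector.compress` and `Vector.expand` in item 2 move whole lanes under a mask rather than bits within a lane, so they are not covered by this sketch.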
> > Kindly review and share your feedback. > > Best Regards, > Jatin Jatin Bhateja has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 13 commits: - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8284960 - 8284960: Review comments resolution. - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8284960 - 8284960: Correcting a typo. - 8284960: Integrating changes from panama-vector (Add @since 19 tags). - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8284960 - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8284960 - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8284960 - 8284960: AARCH64 backend changes. - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8284960 - ... and 3 more: https://git.openjdk.java.net/jdk/compare/5e5500cb...df7eb90e ------------- Changes: https://git.openjdk.java.net/jdk/pull/8425/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=8425&range=04 Stats: 38068 lines in 254 files changed: 16705 ins; 16921 del; 4442 mod Patch: https://git.openjdk.java.net/jdk/pull/8425.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8425/head:pull/8425 PR: https://git.openjdk.java.net/jdk/pull/8425 From mcimadamore at openjdk.java.net Tue May 17 08:31:03 2022 From: mcimadamore at openjdk.java.net (Maurizio Cimadamore) Date: Tue, 17 May 2022 08:31:03 GMT Subject: RFR: 8283689: Update the foreign linker VM implementation [v21] In-Reply-To: <-H-hj0CmArcV48YOFCPUC1yIZXJxZ41p9PGBP-E_Vc0=.dad63c91-0e01-4124-9b44-467134a26b75@github.com> References: <-H-hj0CmArcV48YOFCPUC1yIZXJxZ41p9PGBP-E_Vc0=.dad63c91-0e01-4124-9b44-467134a26b75@github.com> Message-ID: On Mon, 16 May 2022 16:15:49 GMT, Jorn Vernee wrote: >> Hi, >> >> This PR updates the VM implementation of the foreign linker, by bringing over commits from the panama-foreign repo. 
>> >> This is split off from the main JEP integration for 19, since we have limited resources to handle this. As such, this PR might fall over to 20, but it would be nice if we could get it into 19. >> >> I've written up an overview of the Linker architecture here: http://cr.openjdk.java.net/~jvernee/docs/FL_Overview.html it might be useful to read that first. >> >> This patch moves from the "legacy" implementation, to what is currently implemented in the panama-foreign repo, except for replacing the use of method handle combinators with ASM. That will come in a later path. To recap. This PR contains the following changes: >> >> 1. VM stubs for downcalls are now generated up front, instead of lazily by C2 [1]. >> 2. the VM support for upcalls/downcalls now support all possible call shapes. And VM stubs and Java code implementing the buffered invocation strategy has been removed [2], [3], [4], [5]. >> 3. The existing C2 intrinsification support for the `linkToNative` method handle linker was no longer needed and has been removed [6] (support might be re-added in another form later). >> 4. Some other cleanups, such as: OptimizedEntryBlob (for upcalls) now implements RuntimeBlob directly. Binding to java classes has been rewritten to use javaClasses.h/cpp (this wasn't previously possible due to these java classes being in an incubator module) [7], [8], [9]. >> >> While the patch mostly consists of VM changes, there are also some Java changes to support (2). >> >> The original commit structure has been mostly retained, so it might be useful to look at a specific commit, or the corresponding patch in the [panama-foreign](https://github.com/openjdk/panama-foreign/pulls?q=is%3Apr) repo as well. I've also left some inline comments to explain some of the changes, which will hopefully make reviewing easier. 
>> >> Testing: Tier1-4 >> >> Thanks, >> Jorn >> >> [1]: https://github.com/openjdk/jdk/pull/7959/commits/048b88156814579dca1f70742061ad24942fd358 >> [2]: https://github.com/openjdk/jdk/pull/7959/commits/2fbbef472b4c2b4fee5ede2f18cd81ab61e88f49 >> [3]: https://github.com/openjdk/jdk/pull/7959/commits/8a957a4ed9cc8d1f708ea8777212eb51ab403dc3 >> [4]: https://github.com/openjdk/jdk/pull/7959/commits/35ba1d964f1de4a77345dc58debe0565db4b0ff3 >> [5]: https://github.com/openjdk/jdk/pull/7959/commits/4e72aae22920300c5ffa16fed805b62ed9092120 >> [6]: https://github.com/openjdk/jdk/pull/7959/commits/08e22e1b468c5c8f0cfd7135c72849944068aa7a >> [7]: https://github.com/openjdk/jdk/pull/7959/commits/451cd9edf54016c182dab21a8b26bd8b609fc062 >> [8]: https://github.com/openjdk/jdk/pull/7959/commits/4c851d2795afafec3a3ab17f4142ee098692068f >> [9]: https://github.com/openjdk/jdk/pull/7959/commits/d025377799424f31512dca2ffe95491cd5ae22f9 > > Jorn Vernee has updated the pull request incrementally with one additional commit since the last revision: > > Missing ASSERT -> NOT_PRODUCT src/java.base/share/classes/jdk/internal/foreign/abi/ProgrammableInvoker.java line 66: > 64: private static final boolean USE_SPEC = Boolean.parseBoolean( > 65: GetPropertyAction.privilegedGetProperty("jdk.internal.foreign.ProgrammableInvoker.USE_SPEC", "true")); > 66: private static final boolean USE_INTRINSICS = Boolean.parseBoolean( Do we need to update TestMatrix given that we're removing one dimension in the invokers? ------------- PR: https://git.openjdk.java.net/jdk/pull/7959 From ngasson at openjdk.java.net Tue May 17 08:40:56 2022 From: ngasson at openjdk.java.net (Nick Gasson) Date: Tue, 17 May 2022 08:40:56 GMT Subject: RFR: 8281712: [REDO] AArch64: Implement string_compare intrinsic in SVE In-Reply-To: References: Message-ID: On Mon, 16 May 2022 07:21:27 GMT, Ningsheng Jian wrote: > This is the REDO of JDK-8269559 and JDK-8275448. 
Those two backouts finally turned to be some system zlib issue in AArch64 macOS, and is not related to the patch itself. See [1][2] for details. > > This patch is generally the same as JDK-8275448, which uses SVE to optimize string_compare intrinsics for long string comparisons. I did a rebase with small tweaks to get better performance on recent Neoverse hardware. Test data on systems with different SVE vector sizes: > > > case delta size 128-bits 256-bits 512-bits > compareToLL 2 24 0.17% 0.58% 0.00% > compareToLL 2 36 0.00% 2.25% 0.04% > compareToLL 2 72 -4.40% 3.87% -12.82% > compareToLL 2 128 4.55% 58.31% 13.53% > compareToLL 2 256 19.39% 69.77% 82.03% > compareToLL 2 512 1.81% 68.38% 170.93% > compareToLU 2 24 25.57% 46.98% 54.61% > compareToLU 2 36 36.03% 70.26% 94.33% > compareToLU 2 72 35.86% 90.58% 146.04% > compareToLU 2 128 70.82% 119.19% 266.22% > compareToLU 2 256 80.77% 146.33% 420.01% > compareToLU 2 512 94.62% 171.72% 530.87% > compareToUL 2 24 20.82% 34.48% 62.14% > compareToUL 2 36 39.77% 60.79% 69.77% > compareToUL 2 72 35.46% 84.34% 121.90% > compareToUL 2 128 67.77% 110.97% 220.53% > compareToUL 2 256 77.05% 160.29% 331.30% > compareToUL 2 512 91.88% 184.57% 524.21% > compareToUU 2 24 -0.13% 0.40% 0.00% > compareToUU 2 36 -9.18% 12.84% -13.93% > compareToUU 2 72 1.67% 60.61% 6.69% > compareToUU 2 128 13.51% 60.33% 55.27% > compareToUU 2 256 2.55% 62.17% 153.26% > compareToUU 2 512 4.12% 68.62% 201.68% > > JTreg tests passed on SVE hardware. > > [1] https://bugs.openjdk.java.net/browse/JDK-8275448 > [2] https://bugs.openjdk.java.net/browse/JDK-8282954 LGTM! ------------- Marked as reviewed by ngasson (Reviewer). 
PR: https://git.openjdk.java.net/jdk/pull/8723 From eliu at openjdk.java.net Tue May 17 08:49:54 2022 From: eliu at openjdk.java.net (Eric Liu) Date: Tue, 17 May 2022 08:49:54 GMT Subject: RFR: 8286847: Rotate vectors don't support byte or short In-Reply-To: <0CPtvcY81Q2R9_6iONGV87qAJ4OH5izCGIAByWBBqSM=.5fd06621-8ecb-4788-ac3e-d5cd29784236@github.com> References: <0CPtvcY81Q2R9_6iONGV87qAJ4OH5izCGIAByWBBqSM=.5fd06621-8ecb-4788-ac3e-d5cd29784236@github.com> Message-ID: On Tue, 17 May 2022 07:11:29 GMT, Haomin wrote: >> test/hotspot/jtreg/compiler/vectorization/TestRotateByteVector.java line 90: >> >>> 88: res[i] = (byte) ((arr[i] << shift) | (arr[i] >>> -shift)); >>> 89: } >>> 90: } >> >> This function is duplicated with `testRotateLeft`. It's not enough to verify the correctness by comparing the results of these two functions. > > Yes, `rotateLeftRes` is duplicated with `testRotateLeft`. But I compile only `testRotateLeft`, and then compare the results of the two functions. Could you give me some suggestions? How should I modify this? Two options off the top of my head: a) With `-Xcomp`, hardcode the expected values, calculated by the interpreter or by hand, and compare them with the results of C2. b) With `-XX:-TieredCompilation`, generate the expected results while the method is not yet hot enough to be compiled by C2, then compare them with the C2 results, which can be obtained once the iteration count exceeds 10K. Please refer to https://github.com/openjdk/jdk/pull/5403/files/71aa6ac439b67b27828e9aabe51845fa34602837#diff-d14ca09ba5fa806904a4db333037a14621cfeba81505ba369a375c57bd90c7a8 for more detail. Option a) is much simpler. Option b) can verify more random values.
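Why the byte case is delicate can be seen without any vectorization. The sketch below contrasts what the Java expression from the test computes, with the `byte` promoted to `int` before shifting, against a rotation confined to the 8 bits of a byte lane. The class and method names are illustrative and not taken from the test. Values hardcoded from the first method are one form that option a) could take.

```java
final class ByteRotateSketch {
    // Java semantics of the test expression: the byte is sign-extended to int,
    // int shift distances are taken modulo 32, and the result is narrowed to byte.
    static byte javaExpression(byte a, int shift) {
        return (byte) ((a << shift) | (a >>> -shift));
    }

    // A rotation that stays within the 8 bits of the value, which is what a
    // byte-wide vector rotate would naturally compute.
    static byte rotateLeft8(byte a, int shift) {
        int s = shift & 7;
        int v = a & 0xFF;
        return (byte) ((v << s) | (v >>> (8 - s)));
    }

    public static void main(String[] args) {
        int shift = 7;
        int mismatches = 0;
        for (int i = Byte.MIN_VALUE; i <= Byte.MAX_VALUE; i++) {
            byte a = (byte) i;
            if (javaExpression(a, shift) != rotateLeft8(a, shift)) {
                mismatches++;
            }
        }
        System.out.println("inputs where the two definitions differ: " + mismatches);
        System.out.println("javaExpression(0x81, 7) = " + javaExpression((byte) 0x81, shift));
        System.out.println("rotateLeft8(0x81, 7)    = " + rotateLeft8((byte) 0x81, shift));
    }
}
```

For the input `(byte) 0x81` and a shift of 7, the Java expression yields -1 while the 8-bit rotation yields -64, so a vectorized version that rotates within byte lanes cannot be substituted for the scalar loop.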
------------- PR: https://git.openjdk.java.net/jdk/pull/8740 From shade at openjdk.java.net Tue May 17 08:52:56 2022 From: shade at openjdk.java.net (Aleksey Shipilev) Date: Tue, 17 May 2022 08:52:56 GMT Subject: Integrated: 8286660: codestrings gtest fails on AArch64: "udf" in padding In-Reply-To: References: Message-ID: On Fri, 13 May 2022 09:10:09 GMT, Aleksey Shipilev wrote: > On hsdis-enabled AArch64 machine this test fails even with [JDK-8274039](https://bugs.openjdk.java.net/browse/JDK-8274039): > > > $ CONF=linux-aarch64-server-fastdebug make run-test TEST=jtreg:gtest/GTestWrapper.java > > [----------] 1 test from codestrings > [ RUN ] codestrings.validate_vm This pull request has now been integrated. Changeset: 63cace75 Author: Aleksey Shipilev URL: https://git.openjdk.java.net/jdk/commit/63cace759ee0a913536171d1e498decb517cc71a Stats: 2 lines in 1 file changed: 1 ins; 0 del; 1 mod 8286660: codestrings gtest fails on AArch64: "udf" in padding Reviewed-by: ngasson, aph ------------- PR: https://git.openjdk.java.net/jdk/pull/8695 From aph-open at littlepinkcloud.com Tue May 17 08:54:25 2022 From: aph-open at littlepinkcloud.com (Andrew Haley) Date: Tue, 17 May 2022 09:54:25 +0100 Subject: -XX:+StressCodeBuffers Message-ID: <6d4f0ce3-1970-d6ee-5914-46353afcf821@littlepinkcloud.com> Working on 8272094: compiler/codecache/TestStressCodeBuffers.java crashes..., I found that it took so long to reproduce the problem, even with many parallel runs, that I gave up. Even after hours it didn't show up. Output from StressCodeBuffers looks like this: StressCodeBuffers: have expanded 128 times StressCodeBuffers: have expanded 256 times StressCodeBuffers: have expanded 512 times StressCodeBuffers: have expanded 1024 times StressCodeBuffers: have expanded 2048 times Each time this message is printed, a code buffer allocation fails. So it only tests code buffer allocation failure a few times. 
I changed the logic (patch below) to make codeBuffer allocation fail much more frequently, every 40 times, and "Boom!" I reproduced the bug immediately. Every run, in fact. So, my question is: why is -XX:+StressCodeBuffers so very gentle? It seems to me like it's not even trying to stress the system. diff --git a/src/hotspot/share/asm/codeBuffer.cpp b/src/hotspot/share/asm/codeBuffer.cpp index ddd946d7542..c74ba21cf63 100644 --- a/src/hotspot/share/asm/codeBuffer.cpp +++ b/src/hotspot/share/asm/codeBuffer.cpp @@ -837,7 +837,8 @@ void CodeBuffer::expand(CodeSection* which_cs, csize_t amount) { if (StressCodeBuffers && blob() != NULL) { static int expand_count = 0; if (expand_count >= 0) expand_count += 1; - if (expand_count > 100 && is_power_of_2(expand_count)) { + if (expand_count > 100 && // is_power_of_2(expand_count) + expand_count % 40 == 0) { tty->print_cr("StressCodeBuffers: have expanded %d times", expand_count); // simulate an occasional allocation failure: free_blob(); -- Andrew Haley (he/him) Java Platform Lead Engineer Red Hat UK Ltd. https://keybase.io/andrewhaley EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671 From chagedorn at openjdk.java.net Tue May 17 09:06:39 2022 From: chagedorn at openjdk.java.net (Christian Hagedorn) Date: Tue, 17 May 2022 09:06:39 GMT Subject: RFR: 8284115: [IR Framework] Compilation is not found due to rare safepoint while dumping PrintIdeal/PrintOptoAssembly In-Reply-To: References: Message-ID: On Fri, 13 May 2022 07:45:18 GMT, Christian Hagedorn wrote: > This is yet another manifestation of the safepointing problem while printing a `PrintIdeal/PrintOptoAssembly` block. > > In this case here, a safepoint is done and the `` message is emitted while dumping a `PrintIdeal` block of `retainDenominator()` inside the `hotspot_pid` file. 
During this interruption, another test class method is enqueued for compilation which is logged to the `hotspot_pid` file before the printing of the `PrintIdeal` block resumes: > > > # PrintIdeal output of retainDenominator() > 3 Start === 3 0 [[ 3 5 6 7 8 9 13 11 ]] #{0:control, 1:abIO, 2:memory, 3:rawptr:BotPTR, 4:return_address, 5:compiler/c2/irTests/DivLNodeIdealizationTests:NotNull *, 6:long, 7:half, 8:long, 9:half} > 36 CallStaticJava === 34 6 7 8 9 ( 35 1 1 1 1 1 26 1 27 1 ) [[ 37 ]] # Static uncommon_trap(reason='div0_check' action='maybe_recompile' debug_id='0') void ( int ) C=0.000100 DivLNodeIdealizationTests::retainDenominator @ bci:4 (line 130) !jvms: DivLNodeIdealizationTests::retainDenominator > > # Safepoint interruption > > > # Enqueuing of another test class method identityThird() > > > @ bci:4 (line 130) > > # Continue to dump PrintIdeal of retainDenominator() > 41 DivL === 33 26 13 [[ 42 ]] !jvms: DivLNodeIdealizationTests::retainDenominator @ bci:4 (line 130) > 9 Parm === 3 [[ 42 36 ]] ReturnAdr !jvms: DivLNodeIdealizationTests::retainDenominator @ bci:-1 (line 130) > > > The `HotSpotPidFileParser` looks for these enqueue messages containing the method name in order to find and correctly map the corresponding `PrintIdeal` and `PrintOptoAssembly` outputs. However, the `HotSpotPidFileParser` does not expect such an enqueuing message to be found inside a `PrintIdeal/PrintOptoAssembly` block and thus ignores it. As a result, we later do not parse the `PrintIdeal` and `PrintOptoAssembly` output of the enqueued method during the safepoint and fail with the assertion that we did not find any compilation output for the method. > > In the example above, the assertion says that we did not find the compilation output of `identityThird()` whose enqueue message was ignored inside the `PrintIdeal` block of `retainDominator()`. 
> > The proposed fix is to make `HotSpotPidFileParser` aware of the possibility of a safepoint while reading the `PrintIdeal` or `PrintOptoAssembly` output and therefore add a check if there was a method enqueued for compilation while reading inside `BlockOutputReader::readBlock()`. > > Thanks, > Christian Thanks Tobias and Vladimir for your reviews! If we find more problems in the future or other limitations, we can still come back to this question if we want to introduce a new file. ------------- PR: https://git.openjdk.java.net/jdk/pull/8692 From shade at redhat.com Tue May 17 09:12:31 2022 From: shade at redhat.com (Aleksey Shipilev) Date: Tue, 17 May 2022 11:12:31 +0200 Subject: -XX:+StressCodeBuffers In-Reply-To: <6d4f0ce3-1970-d6ee-5914-46353afcf821@littlepinkcloud.com> References: <6d4f0ce3-1970-d6ee-5914-46353afcf821@littlepinkcloud.com> Message-ID: <9ef2c1ec-46c9-ab91-0464-f9a48d9605e2@redhat.com> On 5/17/22 10:54, Andrew Haley wrote: > Working on 8272094: compiler/codecache/TestStressCodeBuffers.java > crashes..., I found that it took so long to reproduce the problem, > even with many parallel runs, that I gave up. Even after hours it > didn't show up. > > Output from StressCodeBuffers looks like this: > > StressCodeBuffers: have expanded 128 times > StressCodeBuffers: have expanded 256 times > StressCodeBuffers: have expanded 512 times > StressCodeBuffers: have expanded 1024 times > StressCodeBuffers: have expanded 2048 times > > Each time this message is printed, a code buffer allocation fails. So > it only tests code buffer allocation failure a few times. > > I changed the logic (patch below) to make codeBuffer allocation fail > much more frequently, every 40 times, and "Boom!" I reproduced the > bug immediately. Every run, in fact. > > So, my question is: why is -XX:+StressCodeBuffers so very gentle? It > seems to me like it's not even trying to stress the system. Good question. In fact, it is strange to see exponential backoff in stress option. 
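To put numbers on the exponential backoff, here is a small Java simulation that counts how many simulated allocation failures each trigger condition from the quoted patch produces over a given number of buffer expansions. It is an illustrative model only, not HotSpot code, and the class and method names are invented for this sketch.

```java
final class StressFrequencySketch {
    // Models the original condition: expand_count > 100 && is_power_of_2(expand_count).
    static boolean originalTrigger(int expandCount) {
        return expandCount > 100 && Integer.bitCount(expandCount) == 1;
    }

    // Models the patched condition: expand_count > 100 && expand_count % 40 == 0.
    static boolean patchedTrigger(int expandCount) {
        return expandCount > 100 && expandCount % 40 == 0;
    }

    // Counts how often the trigger fires while the counter runs from 1 to expansions.
    static int countFailures(int expansions, java.util.function.IntPredicate trigger) {
        int failures = 0;
        for (int count = 1; count <= expansions; count++) {
            if (trigger.test(count)) {
                failures++;
            }
        }
        return failures;
    }

    public static void main(String[] args) {
        int expansions = 2048;
        System.out.println("original: "
                + countFailures(expansions, StressFrequencySketch::originalTrigger));
        System.out.println("patched:  "
                + countFailures(expansions, StressFrequencySketch::patchedTrigger));
    }
}
```

For 2048 expansions the model reports 5 failures for the original condition, matching the five messages quoted above, against 49 for the patched one.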
I suspect it was that way because it was enough to manifest some other bug at the time. Given the evidence that a more frequent allocation failure is beneficial for testing, there should be no problem in bumping the frequency. (

> diff --git a/src/hotspot/share/asm/codeBuffer.cpp b/src/hotspot/share/asm/codeBuffer.cpp
> index ddd946d7542..c74ba21cf63 100644
> --- a/src/hotspot/share/asm/codeBuffer.cpp
> +++ b/src/hotspot/share/asm/codeBuffer.cpp
> @@ -837,7 +837,8 @@ void CodeBuffer::expand(CodeSection* which_cs, csize_t amount) {
>    if (StressCodeBuffers && blob() != NULL) {
>      static int expand_count = 0;
>      if (expand_count >= 0)  expand_count += 1;
> -    if (expand_count > 100 && is_power_of_2(expand_count)) {
> +    if (expand_count > 100 && // is_power_of_2(expand_count)
> +        expand_count % 40 == 0) {
>        tty->print_cr("StressCodeBuffers: have expanded %d times", expand_count);
>        // simulate an occasional allocation failure:
>        free_blob();

In fact, that whole code could be simplified to just e.g.:

  if (StressCodeBuffers && blob() != NULL) {
    static int expand_count = 0;
    if ((++expand_count % 100) == 0) {
      tty->print_cr("StressCodeBuffers: have expanded %d times", expand_count);
      // simulate an occasional allocation failure:
      free_blob();
    }
  }

-- Thanks, -Aleksey From aph-open at littlepinkcloud.com Tue May 17 09:26:45 2022 From: aph-open at littlepinkcloud.com (Andrew Haley) Date: Tue, 17 May 2022 10:26:45 +0100 Subject: -XX:+StressCodeBuffers In-Reply-To: <9ef2c1ec-46c9-ab91-0464-f9a48d9605e2@redhat.com> References: <6d4f0ce3-1970-d6ee-5914-46353afcf821@littlepinkcloud.com> <9ef2c1ec-46c9-ab91-0464-f9a48d9605e2@redhat.com> Message-ID: On 5/17/22 10:12, Aleksey Shipilev wrote: > Good question. In fact, it is strange to see exponential backoff in stress option. Yep. I've never seen that before. > I suspect it was that way because it was enough to manifest some other bug at the time.
Given the > evidence that a more frequent allocation failure is beneficial for testing, there should be no > problem in bumping the frequency. ( There are quite a few tests using StressCodeBuffers, and they would definitely run much more slowly, and produce more output. I don't know if this would be a problem. -- Andrew Haley (he/him) Java Platform Lead Engineer Red Hat UK Ltd. https://keybase.io/andrewhaley EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671 From jvernee at openjdk.java.net Tue May 17 10:06:05 2022 From: jvernee at openjdk.java.net (Jorn Vernee) Date: Tue, 17 May 2022 10:06:05 GMT Subject: RFR: 8283689: Update the foreign linker VM implementation [v21] In-Reply-To: References: <-H-hj0CmArcV48YOFCPUC1yIZXJxZ41p9PGBP-E_Vc0=.dad63c91-0e01-4124-9b44-467134a26b75@github.com> Message-ID: On Tue, 17 May 2022 08:27:41 GMT, Maurizio Cimadamore wrote: >> Jorn Vernee has updated the pull request incrementally with one additional commit since the last revision: >> >> Missing ASSERT -> NOT_PRODUCT > > src/java.base/share/classes/jdk/internal/foreign/abi/ProgrammableInvoker.java line 66: > >> 64: private static final boolean USE_SPEC = Boolean.parseBoolean( >> 65: GetPropertyAction.privilegedGetProperty("jdk.internal.foreign.ProgrammableInvoker.USE_SPEC", "true")); >> 66: private static final boolean USE_INTRINSICS = Boolean.parseBoolean( > > Do we need to update TestMatrix given that we're removing one dimension in the invokers? Looks like that already happened as part of the main JEP integration. ------------- PR: https://git.openjdk.java.net/jdk/pull/7959 From jvernee at openjdk.java.net Tue May 17 10:38:39 2022 From: jvernee at openjdk.java.net (Jorn Vernee) Date: Tue, 17 May 2022 10:38:39 GMT Subject: RFR: 8283689: Update the foreign linker VM implementation [v22] In-Reply-To: References: Message-ID: > Hi, > > This PR updates the VM implementation of the foreign linker, by bringing over commits from the panama-foreign repo. 
> > This is split off from the main JEP integration for 19, since we have limited resources to handle this. As such, this PR might fall over to 20, but it would be nice if we could get it into 19. > > I've written up an overview of the Linker architecture here: http://cr.openjdk.java.net/~jvernee/docs/FL_Overview.html; it might be useful to read that first. > > This patch moves from the "legacy" implementation, to what is currently implemented in the panama-foreign repo, except for replacing the use of method handle combinators with ASM. That will come in a later patch. To recap, this PR contains the following changes: > > 1. VM stubs for downcalls are now generated up front, instead of lazily by C2 [1]. > 2. The VM support for upcalls/downcalls now supports all possible call shapes, and the VM stubs and Java code implementing the buffered invocation strategy have been removed [2], [3], [4], [5]. > 3. The existing C2 intrinsification support for the `linkToNative` method handle linker was no longer needed and has been removed [6] (support might be re-added in another form later). > 4. Some other cleanups, such as: OptimizedEntryBlob (for upcalls) now implements RuntimeBlob directly. Binding to Java classes has been rewritten to use javaClasses.h/cpp (this wasn't previously possible due to these Java classes being in an incubator module) [7], [8], [9]. > > While the patch mostly consists of VM changes, there are also some Java changes to support (2). > > The original commit structure has been mostly retained, so it might be useful to look at a specific commit, or the corresponding patch in the [panama-foreign](https://github.com/openjdk/panama-foreign/pulls?q=is%3Apr) repo as well. I've also left some inline comments to explain some of the changes, which will hopefully make reviewing easier.
> > Testing: Tier1-4 > > Thanks, > Jorn > > [1]: https://github.com/openjdk/jdk/pull/7959/commits/048b88156814579dca1f70742061ad24942fd358 > [2]: https://github.com/openjdk/jdk/pull/7959/commits/2fbbef472b4c2b4fee5ede2f18cd81ab61e88f49 > [3]: https://github.com/openjdk/jdk/pull/7959/commits/8a957a4ed9cc8d1f708ea8777212eb51ab403dc3 > [4]: https://github.com/openjdk/jdk/pull/7959/commits/35ba1d964f1de4a77345dc58debe0565db4b0ff3 > [5]: https://github.com/openjdk/jdk/pull/7959/commits/4e72aae22920300c5ffa16fed805b62ed9092120 > [6]: https://github.com/openjdk/jdk/pull/7959/commits/08e22e1b468c5c8f0cfd7135c72849944068aa7a > [7]: https://github.com/openjdk/jdk/pull/7959/commits/451cd9edf54016c182dab21a8b26bd8b609fc062 > [8]: https://github.com/openjdk/jdk/pull/7959/commits/4c851d2795afafec3a3ab17f4142ee098692068f > [9]: https://github.com/openjdk/jdk/pull/7959/commits/d025377799424f31512dca2ffe95491cd5ae22f9 Jorn Vernee has updated the pull request incrementally with one additional commit since the last revision: ifdef NOT_PRODUCT -> ifndef PRODUCT ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/7959/files - new: https://git.openjdk.java.net/jdk/pull/7959/files/406f3e83..c3abb732 Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=7959&range=21 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=7959&range=20-21 Stats: 6 lines in 4 files changed: 0 ins; 0 del; 6 mod Patch: https://git.openjdk.java.net/jdk/pull/7959.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/7959/head:pull/7959 PR: https://git.openjdk.java.net/jdk/pull/7959 From chagedorn at openjdk.java.net Tue May 17 11:26:59 2022 From: chagedorn at openjdk.java.net (Christian Hagedorn) Date: Tue, 17 May 2022 11:26:59 GMT Subject: Integrated: 8284115: [IR Framework] Compilation is not found due to rare safepoint while dumping PrintIdeal/PrintOptoAssembly In-Reply-To: References: Message-ID: On Fri, 13 May 2022 07:45:18 GMT, Christian Hagedorn wrote: > This is yet another 
manifestation of the safepointing problem while printing a `PrintIdeal/PrintOptoAssembly` block. > > In this case here, a safepoint is done and the `` message is emitted while dumping a `PrintIdeal` block of `retainDenominator()` inside the `hotspot_pid` file. During this interruption, another test class method is enqueued for compilation which is logged to the `hotspot_pid` file before the printing of the `PrintIdeal` block resumes: > > > # PrintIdeal output of retainDenominator() > 3 Start === 3 0 [[ 3 5 6 7 8 9 13 11 ]] #{0:control, 1:abIO, 2:memory, 3:rawptr:BotPTR, 4:return_address, 5:compiler/c2/irTests/DivLNodeIdealizationTests:NotNull *, 6:long, 7:half, 8:long, 9:half} > 36 CallStaticJava === 34 6 7 8 9 ( 35 1 1 1 1 1 26 1 27 1 ) [[ 37 ]] # Static uncommon_trap(reason='div0_check' action='maybe_recompile' debug_id='0') void ( int ) C=0.000100 DivLNodeIdealizationTests::retainDenominator @ bci:4 (line 130) !jvms: DivLNodeIdealizationTests::retainDenominator > > # Safepoint interruption > > > # Enqueuing of another test class method identityThird() > > > @ bci:4 (line 130) > > # Continue to dump PrintIdeal of retainDenominator() > 41 DivL === 33 26 13 [[ 42 ]] !jvms: DivLNodeIdealizationTests::retainDenominator @ bci:4 (line 130) > 9 Parm === 3 [[ 42 36 ]] ReturnAdr !jvms: DivLNodeIdealizationTests::retainDenominator @ bci:-1 (line 130) > > > The `HotSpotPidFileParser` looks for these enqueue messages containing the method name in order to find and correctly map the corresponding `PrintIdeal` and `PrintOptoAssembly` outputs. However, the `HotSpotPidFileParser` does not expect such an enqueuing message to be found inside a `PrintIdeal/PrintOptoAssembly` block and thus ignores it. As a result, we later do not parse the `PrintIdeal` and `PrintOptoAssembly` output of the enqueued method during the safepoint and fail with the assertion that we did not find any compilation output for the method. 
> > In the example above, the assertion says that we did not find the compilation output of `identityThird()` whose enqueue message was ignored inside the `PrintIdeal` block of `retainDenominator()`. > > The proposed fix is to make `HotSpotPidFileParser` aware of the possibility of a safepoint while reading the `PrintIdeal` or `PrintOptoAssembly` output and therefore add a check if there was a method enqueued for compilation while reading inside `BlockOutputReader::readBlock()`. > > Thanks, > Christian This pull request has now been integrated. Changeset: 39842538 Author: Christian Hagedorn URL: https://git.openjdk.java.net/jdk/commit/39842538004c5fca57701070484c78cacf95ed64 Stats: 160 lines in 6 files changed: 109 ins; 40 del; 11 mod 8284115: [IR Framework] Compilation is not found due to rare safepoint while dumping PrintIdeal/PrintOptoAssembly Reviewed-by: kvn, thartmann ------------- PR: https://git.openjdk.java.net/jdk/pull/8692 From thartmann at openjdk.java.net Tue May 17 11:56:21 2022 From: thartmann at openjdk.java.net (Tobias Hartmann) Date: Tue, 17 May 2022 11:56:21 GMT Subject: RFR: 8286870: Memory leak with RepeatCompilation Message-ID: <0SwF1Qb_W-aDBnwRcLLtBCb2JzfTpNjYhaCYApP_Z6M=.de140365-682c-422b-8fc5-5c9394cf631a@github.com> While using `RepeatCompilation` in combination with replay compilation and stress options to reproduce an intermittent issue, I noticed that it does not free the compiler thread arena after each compilation, leading to a (temporary) memory leak and out of memory errors. For example, each compilation of [JDK-8280696](https://bugs.openjdk.java.net/browse/JDK-8280696) allocates an additional 1218 kB. The fix is to simply add a `ResourceMark` in the loop.
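The effect of a scoped mark inside the repeat loop can be sketched in Java as an analogy. This is not HotSpot code: `ToyArena`, `Mark`, and both `leftover...` methods are hypothetical names invented for the illustration, and try-with-resources only stands in for the C++ scope that a `ResourceMark` would occupy.

```java
import java.util.ArrayDeque;
import java.util.Deque;

final class RepeatCompilationSketch {

    // Hypothetical stand-in for a compiler thread's resource arena.
    static final class ToyArena {
        private final Deque<byte[]> chunks = new ArrayDeque<>();
        private long bytesInUse = 0;

        byte[] allocate(int size) {
            byte[] chunk = new byte[size];
            chunks.push(chunk);
            bytesInUse += size;
            return chunk;
        }

        long bytesInUse() {
            return bytesInUse;
        }

        // Scoped mark: closing it releases everything allocated after it.
        Mark mark() {
            return new Mark(chunks.size());
        }

        final class Mark implements AutoCloseable {
            private final int savedDepth;

            private Mark(int savedDepth) {
                this.savedDepth = savedDepth;
            }

            @Override
            public void close() {
                while (chunks.size() > savedDepth) {
                    bytesInUse -= chunks.pop().length;
                }
            }
        }
    }

    // Repeated "compilations" without a mark: usage grows with every repeat.
    static long leftoverWithoutMark(int repeats, int bytesPerCompile) {
        ToyArena arena = new ToyArena();
        for (int i = 0; i < repeats; i++) {
            arena.allocate(bytesPerCompile);
        }
        return arena.bytesInUse();
    }

    // Same loop with a mark per iteration: each repeat releases its memory.
    static long leftoverWithMark(int repeats, int bytesPerCompile) {
        ToyArena arena = new ToyArena();
        for (int i = 0; i < repeats; i++) {
            try (ToyArena.Mark mark = arena.mark()) {
                arena.allocate(bytesPerCompile);
            }
        }
        return arena.bytesInUse();
    }

    public static void main(String[] args) {
        System.out.println("without mark: " + leftoverWithoutMark(10, 1024));
        System.out.println("with mark:    " + leftoverWithMark(10, 1024));
    }
}
```

In this toy model the unmarked loop retains memory in proportion to the repeat count, while the marked loop returns to its starting level after every iteration.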
Thanks, Tobias ------------- Commit messages: - 8286870: Memory leak with RepeatCompilation Changes: https://git.openjdk.java.net/jdk/pull/8744/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=8744&range=00 Issue: https://bugs.openjdk.java.net/browse/JDK-8286870 Stats: 2 lines in 1 file changed: 1 ins; 0 del; 1 mod Patch: https://git.openjdk.java.net/jdk/pull/8744.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8744/head:pull/8744 PR: https://git.openjdk.java.net/jdk/pull/8744 From rehn at openjdk.java.net Tue May 17 12:27:11 2022 From: rehn at openjdk.java.net (Robbin Ehn) Date: Tue, 17 May 2022 12:27:11 GMT Subject: RFR: 8283689: Update the foreign linker VM implementation [v22] In-Reply-To: References: Message-ID: On Tue, 17 May 2022 10:38:39 GMT, Jorn Vernee wrote: >> Hi, >> >> This PR updates the VM implementation of the foreign linker, by bringing over commits from the panama-foreign repo. >> >> This is split off from the main JEP integration for 19, since we have limited resources to handle this. As such, this PR might fall over to 20, but it would be nice if we could get it into 19. >> >> I've written up an overview of the Linker architecture here: http://cr.openjdk.java.net/~jvernee/docs/FL_Overview.html; it might be useful to read that first. >> >> This patch moves from the "legacy" implementation, to what is currently implemented in the panama-foreign repo, except for replacing the use of method handle combinators with ASM. That will come in a later patch. To recap, this PR contains the following changes: >> >> 1. VM stubs for downcalls are now generated up front, instead of lazily by C2 [1]. >> 2. The VM support for upcalls/downcalls now supports all possible call shapes, and the VM stubs and Java code implementing the buffered invocation strategy have been removed [2], [3], [4], [5]. >> 3.
The existing C2 intrinsification support for the `linkToNative` method handle linker was no longer needed and has been removed [6] (support might be re-added in another form later). >> 4. Some other cleanups, such as: OptimizedEntryBlob (for upcalls) now implements RuntimeBlob directly. Binding to java classes has been rewritten to use javaClasses.h/cpp (this wasn't previously possible due to these java classes being in an incubator module) [7], [8], [9]. >> >> While the patch mostly consists of VM changes, there are also some Java changes to support (2). >> >> The original commit structure has been mostly retained, so it might be useful to look at a specific commit, or the corresponding patch in the [panama-foreign](https://github.com/openjdk/panama-foreign/pulls?q=is%3Apr) repo as well. I've also left some inline comments to explain some of the changes, which will hopefully make reviewing easier. >> >> Testing: Tier1-4 >> >> Thanks, >> Jorn >> >> [1]: https://github.com/openjdk/jdk/pull/7959/commits/048b88156814579dca1f70742061ad24942fd358 >> [2]: https://github.com/openjdk/jdk/pull/7959/commits/2fbbef472b4c2b4fee5ede2f18cd81ab61e88f49 >> [3]: https://github.com/openjdk/jdk/pull/7959/commits/8a957a4ed9cc8d1f708ea8777212eb51ab403dc3 >> [4]: https://github.com/openjdk/jdk/pull/7959/commits/35ba1d964f1de4a77345dc58debe0565db4b0ff3 >> [5]: https://github.com/openjdk/jdk/pull/7959/commits/4e72aae22920300c5ffa16fed805b62ed9092120 >> [6]: https://github.com/openjdk/jdk/pull/7959/commits/08e22e1b468c5c8f0cfd7135c72849944068aa7a >> [7]: https://github.com/openjdk/jdk/pull/7959/commits/451cd9edf54016c182dab21a8b26bd8b609fc062 >> [8]: https://github.com/openjdk/jdk/pull/7959/commits/4c851d2795afafec3a3ab17f4142ee098692068f >> [9]: https://github.com/openjdk/jdk/pull/7959/commits/d025377799424f31512dca2ffe95491cd5ae22f9 > > Jorn Vernee has updated the pull request incrementally with one additional commit since the last revision: > > ifdef NOT_PRODUCT -> ifndef PRODUCT Looks 
good, thanks. ------------- Marked as reviewed by rehn (Reviewer). PR: https://git.openjdk.java.net/jdk/pull/7959 From njian at openjdk.java.net Tue May 17 13:00:49 2022 From: njian at openjdk.java.net (Ningsheng Jian) Date: Tue, 17 May 2022 13:00:49 GMT Subject: RFR: 8286596: AArch64: -XX:UseBranchProtection=pac-ret crashes after JDK-8284161 In-Reply-To: References: Message-ID: <3BSwcMMNNzTCc66ZMY_5tw8zFHfu8r4eNZAl6J_eOhM=.a13f916e-382f-4e7c-b31c-909e62042cc0@github.com> On Thu, 12 May 2022 14:52:18 GMT, Nick Gasson wrote: > `RegisterSaver::restore_live_registers()` used to call `__ leave()` but after the Loom integration it directly pops LR/FP from the stack. With `-XX:UseBranchProtection=pac-ret` we need a call to `__ authenticate_return_address()` here to insert the AUTIA instruction to check and strip the PAC code from the saved LR. > > Tested `java -XX:UseBranchProtection=pac-ret -version` on a machine that supports PAC, plus tier1. Note that some additional fixes will be required to support virtual threads with PAC enabled. Looks good! ------------- Marked as reviewed by njian (Committer). PR: https://git.openjdk.java.net/jdk/pull/8682 From mdoerr at openjdk.java.net Tue May 17 14:32:03 2022 From: mdoerr at openjdk.java.net (Martin Doerr) Date: Tue, 17 May 2022 14:32:03 GMT Subject: RFR: 8286182: C2: crash with SIGFPE when executing compiled code In-Reply-To: References: Message-ID: On Mon, 16 May 2022 12:36:43 GMT, Martin Doerr wrote: > The bug is not assigned to me, but I have seen that the C2 code which checks for div by 0 is not aware of the new nodes from [JDK-8284742](https://bugs.openjdk.java.net/browse/JDK-8284742). > This fixes the VM to pass the reproducer. I'm not sure if more opcode checks are required to get added. Right, I had forgotten that x86 also raises SIGFPE in case of overflow. Catching SIGFPE is possible. 
I had experimented with

--- a/src/hotspot/os_cpu/linux_x86/os_linux_x86.cpp
+++ b/src/hotspot/os_cpu/linux_x86/os_linux_x86.cpp
@@ -278,6 +278,15 @@ bool PosixSignals::pd_hotspot_signal_handler(int sig, siginfo_t* info,
                                               pc,
                                               SharedRuntime::
                                               IMPLICIT_DIVIDE_BY_ZERO);
+      if (stub == nullptr) {
+        int op = pc[0];
+        // TODO: make sure to handle all variants used by C2!
+        if (op == 0xF7) {
+          // ignore SIGFPE by speculative div
+          uc->uc_mcontext.gregs[REG_PC] += 4;
+          return true;
+        }
+      }
 #else
       if (sig == SIGFPE /* && info->si_code == FPE_INTDIV */) {
         // HACK: si_code does not work on linux 2.2.12-20!!!
diff --git a/src/hotspot/share/code/compiledMethod.cpp b/src/hotspot/share/code/compiledMethod.cpp
index 83c33408ea3..1c5e197511d 100644
--- a/src/hotspot/share/code/compiledMethod.cpp
+++ b/src/hotspot/share/code/compiledMethod.cpp
@@ -746,7 +746,7 @@ address CompiledMethod::continuation_for_implicit_exception(address pc, bool for
   int exception_offset = pc - code_begin();
   int cont_offset = ImplicitExceptionTable(this).continuation_offset( exception_offset );
 #ifdef ASSERT
-  if (cont_offset == 0) {
+  if (cont_offset == 0 && !for_div0_check) {
     Thread* thread = Thread::current();
     ResourceMark rm(thread);
     CodeBlob* cb = CodeCache::find_blob(pc);

This works and behaves like PPC64. However, we need to make sure it doesn't show up on any hot path. Otherwise, performance will be terrible. Feel free to create a new PR if you want to propose a solution. I only opened this one to show my findings and can close it again if we decide for something else.
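For readers following the SIGFPE discussion, the Java-level behaviour that compiled code has to preserve can be spelled out as explicit checks. This is a minimal sketch of the language semantics only, not of what C2 emits; `DivisionSemantics`, `checkedDiv`, and `throwsOnZeroDivisor` are hypothetical names chosen for the illustration.

```java
final class DivisionSemantics {

    // Explicit form of the two special cases of int division a / b.
    static int checkedDiv(int a, int b) {
        if (b == 0) {
            // Java requires an ArithmeticException here.
            throw new ArithmeticException("/ by zero");
        }
        if (b == -1) {
            // Integer.MIN_VALUE / -1 overflows. Java defines the result as
            // the wrapped negation of the dividend and throws nothing,
            // whereas a raw hardware divide may trap on this input.
            return -a;
        }
        return a / b;
    }

    // Helper for the demo: reports whether dividing by zero throws.
    static boolean throwsOnZeroDivisor(int a) {
        try {
            checkedDiv(a, 0);
            return false;
        } catch (ArithmeticException expected) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(checkedDiv(Integer.MIN_VALUE, -1) == Integer.MIN_VALUE);
        System.out.println(Integer.MIN_VALUE / -1 == Integer.MIN_VALUE);
        System.out.println(throwsOnZeroDivisor(1));
    }
}
```

The sketch shows why a zero-divisor check alone is not enough on hardware whose divide instruction also traps on overflow: the `b == -1` case needs either its own guard or a handler that resumes execution.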
------------- PR: https://git.openjdk.java.net/jdk/pull/8726 From ngasson at openjdk.java.net Tue May 17 15:15:50 2022 From: ngasson at openjdk.java.net (Nick Gasson) Date: Tue, 17 May 2022 15:15:50 GMT Subject: Integrated: 8286596: AArch64: -XX:UseBranchProtection=pac-ret crashes after JDK-8284161 In-Reply-To: References: Message-ID: On Thu, 12 May 2022 14:52:18 GMT, Nick Gasson wrote: > `RegisterSaver::restore_live_registers()` used to call `__ leave()` but after the Loom integration it directly pops LR/FP from the stack. With `-XX:UseBranchProtection=pac-ret` we need a call to `__ authenticate_return_address()` here to insert the AUTIA instruction to check and strip the PAC code from the saved LR. > > Tested `java -XX:UseBranchProtection=pac-ret -version` on a machine that supports PAC, plus tier1. Note that some additional fixes will be required to support virtual threads with PAC enabled. This pull request has now been integrated. Changeset: 87d9d7f5 Author: Nick Gasson URL: https://git.openjdk.java.net/jdk/commit/87d9d7f54207b00ffea510f16930f38a64b612d9 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod 8286596: AArch64: -XX:UseBranchProtection=pac-ret crashes after JDK-8284161 Co-authored-by: Alan Hayward Reviewed-by: aph, njian ------------- PR: https://git.openjdk.java.net/jdk/pull/8682 From jvernee at openjdk.java.net Tue May 17 15:53:05 2022 From: jvernee at openjdk.java.net (Jorn Vernee) Date: Tue, 17 May 2022 15:53:05 GMT Subject: RFR: 8283689: Update the foreign linker VM implementation [v23] In-Reply-To: References: Message-ID: > Hi, > > This PR updates the VM implementation of the foreign linker, by bringing over commits from the panama-foreign repo. > > This is split off from the main JEP integration for 19, since we have limited resources to handle this. As such, this PR might fall over to 20, but it would be nice if we could get it into 19. 
> > I've written up an overview of the Linker architecture here: http://cr.openjdk.java.net/~jvernee/docs/FL_Overview.html; it might be useful to read that first. > > This patch moves from the "legacy" implementation, to what is currently implemented in the panama-foreign repo, except for replacing the use of method handle combinators with ASM. That will come in a later patch. To recap, this PR contains the following changes: > > 1. VM stubs for downcalls are now generated up front, instead of lazily by C2 [1]. > 2. The VM support for upcalls/downcalls now supports all possible call shapes, and the VM stubs and Java code implementing the buffered invocation strategy have been removed [2], [3], [4], [5]. > 3. The existing C2 intrinsification support for the `linkToNative` method handle linker was no longer needed and has been removed [6] (support might be re-added in another form later). > 4. Some other cleanups, such as: OptimizedEntryBlob (for upcalls) now implements RuntimeBlob directly. Binding to Java classes has been rewritten to use javaClasses.h/cpp (this wasn't previously possible due to these Java classes being in an incubator module) [7], [8], [9]. > > While the patch mostly consists of VM changes, there are also some Java changes to support (2). > > The original commit structure has been mostly retained, so it might be useful to look at a specific commit, or the corresponding patch in the [panama-foreign](https://github.com/openjdk/panama-foreign/pulls?q=is%3Apr) repo as well. I've also left some inline comments to explain some of the changes, which will hopefully make reviewing easier.
> > Testing: Tier1-4 > > Thanks, > Jorn > > [1]: https://github.com/openjdk/jdk/pull/7959/commits/048b88156814579dca1f70742061ad24942fd358 > [2]: https://github.com/openjdk/jdk/pull/7959/commits/2fbbef472b4c2b4fee5ede2f18cd81ab61e88f49 > [3]: https://github.com/openjdk/jdk/pull/7959/commits/8a957a4ed9cc8d1f708ea8777212eb51ab403dc3 > [4]: https://github.com/openjdk/jdk/pull/7959/commits/35ba1d964f1de4a77345dc58debe0565db4b0ff3 > [5]: https://github.com/openjdk/jdk/pull/7959/commits/4e72aae22920300c5ffa16fed805b62ed9092120 > [6]: https://github.com/openjdk/jdk/pull/7959/commits/08e22e1b468c5c8f0cfd7135c72849944068aa7a > [7]: https://github.com/openjdk/jdk/pull/7959/commits/451cd9edf54016c182dab21a8b26bd8b609fc062 > [8]: https://github.com/openjdk/jdk/pull/7959/commits/4c851d2795afafec3a3ab17f4142ee098692068f > [9]: https://github.com/openjdk/jdk/pull/7959/commits/d025377799424f31512dca2ffe95491cd5ae22f9 Jorn Vernee has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 105 commits: - Merge branch 'master' into JEP-19-VM-IMPL2 - ifdef NOT_PRODUCT -> ifndef PRODUCT - Missing ASSERT -> NOT_PRODUCT - Cleanup UL usage - Fix failure with SPEC disabled (accidentally dropped change) - indentation - fix space - Merge branch 'master' into JEP-19-VM-IMPL2 - Undo spurious changes. - Merge branch 'JEP-19-VM-IMPL2' of https://github.com/JornVernee/jdk into JEP-19-VM-IMPL2 - ... 
and 95 more: https://git.openjdk.java.net/jdk/compare/af07919e...c3c1421b ------------- Changes: https://git.openjdk.java.net/jdk/pull/7959/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=7959&range=22 Stats: 6914 lines in 155 files changed: 2577 ins; 3219 del; 1118 mod Patch: https://git.openjdk.java.net/jdk/pull/7959.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/7959/head:pull/7959 PR: https://git.openjdk.java.net/jdk/pull/7959 From kvn at openjdk.java.net Tue May 17 16:19:01 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Tue, 17 May 2022 16:19:01 GMT Subject: RFR: 8286870: Memory leak with RepeatCompilation In-Reply-To: <0SwF1Qb_W-aDBnwRcLLtBCb2JzfTpNjYhaCYApP_Z6M=.de140365-682c-422b-8fc5-5c9394cf631a@github.com> References: <0SwF1Qb_W-aDBnwRcLLtBCb2JzfTpNjYhaCYApP_Z6M=.de140365-682c-422b-8fc5-5c9394cf631a@github.com> Message-ID: On Tue, 17 May 2022 11:50:38 GMT, Tobias Hartmann wrote: > While using `RepeatCompilation` in combination with replay compilation and stress options to reproduce an intermittent issue, I noticed that it does not free the compiler thread arena after each compilation, leading to a (temporary) memory leak and out of memory errors. For example, each compilation of [JDK-8280696](https://bugs.openjdk.java.net/browse/JDK-8280696) allocates an additional 1218 kB. > > The fix is to simply add a `ResourceMark` in the loop. > > Thanks, > Tobias Good. You got github testing failures. ------------- Marked as reviewed by kvn (Reviewer). PR: https://git.openjdk.java.net/jdk/pull/8744 From kvn at openjdk.java.net Tue May 17 16:45:55 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Tue, 17 May 2022 16:45:55 GMT Subject: RFR: 8263075: C2: simplify anti-dependence check in PhaseCFG::implicit_null_check() In-Reply-To: References: Message-ID: On Thu, 12 May 2022 16:48:40 GMT, Brian J. 
Stafford wrote: > The reporter for this issue (https://bugs.openjdk.java.net/browse/JDK-8263075) indicated that there's an assumption that we can rely on that the while loop in question will run exactly one time. Based on this, I've done the following: > > - Asserted the condition that makes sure the code runs at least once > - Asserted the condition that makes sure the code runs only once > - Removed the `while` loop > - Changed a couple of `break` statements into `continue` statements. They no longer need to break out of the `while` loop, now that it's gone. However, they were early exits from the `while` loop that ended up resulting in `continue` statements for the larger enclosing loop. Thus we can just call `continue` directly. > - Removed the local variable `b`, as we no longer need to traverse the node hierarchy. We can use `mb` directly. > > Passes jdk, langtools, and hotspot Tier 1 tests on Linux (x64 and ARM64) and macOS (x64 and ARM64). Most Tier 1 tests pass on Windows (x64 and ARM64), but there are a handful of failures unrelated to this change. src/hotspot/share/opto/lcm.cpp line 344: > 342: } > 343: if (k < num_nodes) > 344: continue; // Found anti-dependent load Our code style requires `{}` around every `if` body. The same applies to the `if` at line 340. When we fix code, we also fix its style if it was wrong. ------------- PR: https://git.openjdk.java.net/jdk/pull/8684 From mdoerr at openjdk.java.net Tue May 17 17:18:56 2022 From: mdoerr at openjdk.java.net (Martin Doerr) Date: Tue, 17 May 2022 17:18:56 GMT Subject: RFR: 8286182: C2: crash with SIGFPE when executing compiled code In-Reply-To: References: Message-ID: On Mon, 16 May 2022 12:36:43 GMT, Martin Doerr wrote: > The bug is not assigned to me, but I have seen that the C2 code which checks for div by 0 is not aware of the new nodes from [JDK-8284742](https://bugs.openjdk.java.net/browse/JDK-8284742). > This fixes the VM to pass the reproducer. I'm not sure if more opcode checks are required to get added.
Btw. it is possible to make the div node dependent on only one check by using the following scheme for a/b with unsigned comparison:

  if (b+1 >u 1) return a/b; // <-1 or >0
  else if (b==0) arithmetic_exception();
  else return -a; // div by -1

It could be transformed back later to enable implicit div by 0 checks. Sounds complicated and I don't know if this makes sense. ------------- PR: https://git.openjdk.java.net/jdk/pull/8726 From jbhateja at openjdk.java.net Tue May 17 17:29:00 2022 From: jbhateja at openjdk.java.net (Jatin Bhateja) Date: Tue, 17 May 2022 17:29:00 GMT Subject: RFR: 8284960: Integration of JEP 426: Vector API (Fourth Incubator) [v6] In-Reply-To: References: Message-ID: > Hi All, > > Patch adds the planned support for new vector operations and APIs targeted for [JEP 426: Vector API (Fourth Incubator).](https://bugs.openjdk.java.net/browse/JDK-8280173) > > Following is the brief summary of changes:- > > 1) Extends the scope of existing lanewise API for following new vector operations. > - VectorOperators.BIT_COUNT: counts the number of one-bits > - VectorOperators.LEADING_ZEROS_COUNT: counts the number of leading zero bits > - VectorOperators.TRAILING_ZEROS_COUNT: counts the number of trailing zero bits > - VectorOperators.REVERSE: reversing the order of bits > - VectorOperators.REVERSE_BYTES: reversing the order of bytes > - compress and expand bits: Semantics are based on Hacker's Delight section 7-4 Compress, or Generalized Extract. > > 2) Adds following new APIs to perform cross lane vector compress and expansion operations under the influence of a mask. > - Vector.compress > - Vector.expand > - VectorMask.compress > > 3) Adds predicated and non-predicated versions of following new APIs to load and store the contents of vector from foreign MemorySegments. > - Vector.fromMemorySegment > - Vector.intoMemorySegment > > 4) C2 Compiler IR enhancements and optimized X86 and AARCH64 backend support for each newly added operation.
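As a scalar point of reference for the lanewise operations listed under (1), the sketch below applies the corresponding `java.lang.Integer` methods to a single value and adds loop-based versions of bit compress and expand. It assumes that each lane behaves like its scalar counterpart; `ScalarLaneOps`, `compressBits`, and `expandBits` are hypothetical names, and the loops are a plain reference formulation rather than the optimized code the patch generates.

```java
final class ScalarLaneOps {

    // Gathers the bits of value selected by mask into the low-order bits
    // of the result, preserving their relative order.
    static int compressBits(int value, int mask) {
        int result = 0;
        int outPos = 0;
        for (int bit = 0; bit < 32; bit++) {
            if (((mask >>> bit) & 1) != 0) {
                result |= ((value >>> bit) & 1) << outPos;
                outPos++;
            }
        }
        return result;
    }

    // Inverse direction: scatters the low-order bits of value to the
    // positions selected by mask.
    static int expandBits(int value, int mask) {
        int result = 0;
        int inPos = 0;
        for (int bit = 0; bit < 32; bit++) {
            if (((mask >>> bit) & 1) != 0) {
                result |= ((value >>> inPos) & 1) << bit;
                inPos++;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        int x = 0b1011_0110;
        System.out.println(Integer.bitCount(x));              // one-bits
        System.out.println(Integer.numberOfLeadingZeros(x));  // leading zeros
        System.out.println(Integer.numberOfTrailingZeros(x)); // trailing zeros
        System.out.println(Integer.toHexString(Integer.reverse(x)));
        System.out.println(Integer.toHexString(Integer.reverseBytes(0x11223344)));
        System.out.println(Integer.toBinaryString(compressBits(x, 0b1111_0000)));
        System.out.println(Integer.toBinaryString(expandBits(0b1011, 0b1111_0000)));
    }
}
```

Compress followed by expand with the same mask reproduces the masked bits of the original value, which is a convenient property to check when validating a vectorized implementation against a scalar one.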
> > > Patch has been regressed over AARCH64 and X86 targets different AVX levels. > > Kindly review and share your feedback. > > Best Regards, > Jatin Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: 8284960: Adding --enable-preview in vectorAPI benchmarks. ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/8425/files - new: https://git.openjdk.java.net/jdk/pull/8425/files/df7eb90e..0b7f84bb Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=8425&range=05 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=8425&range=04-05 Stats: 21 lines in 10 files changed: 7 ins; 4 del; 10 mod Patch: https://git.openjdk.java.net/jdk/pull/8425.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8425/head:pull/8425 PR: https://git.openjdk.java.net/jdk/pull/8425 From chagedorn at openjdk.java.net Tue May 17 17:44:49 2022 From: chagedorn at openjdk.java.net (Christian Hagedorn) Date: Tue, 17 May 2022 17:44:49 GMT Subject: RFR: 8286182: C2: crash with SIGFPE when executing compiled code In-Reply-To: References: Message-ID: On Mon, 16 May 2022 12:36:43 GMT, Martin Doerr wrote: > The bug is not assigned to me, but I have seen that the C2 code which checks for div by 0 is not aware of the new nodes from [JDK-8284742](https://bugs.openjdk.java.net/browse/JDK-8284742). > This fixes the VM to pass the reproducer. I'm not sure if more opcode checks are required to get added. The fix looks reasonable to address the same problems in [JDK-8257822](https://bugs.openjdk.java.net/browse/JDK-8257822) and [JDK-8248552](https://bugs.openjdk.java.net/browse/JDK-8248552) for the new nodes `Div` and `Mod` nodes. > The problem is that NoOvfDivI does not only depend on the zero-divisor check but a possible overflow check as well. So with this fix it is still possible for a SIGFPE to occur. Do you have a failing test for this case? We've been seeing a lot of SIGFPE failures lately with Java Fuzzer. 
I have to walk through them to see if I can find a case that is still failing with this fix. Will get back with the result of this analysis. > IIUC this trouble comes from the fact that on x86 a Div node must be pinned to its zero-divisor check but may float with regards to other control nodes. Maybe we can remove all this special handling and simply catch SIGFPE instead? The result is guaranteed to not be used in those cases so we may not worry about the correctness of the compiled code. I'm not sure if we should rely on signal catching to fix the cases where a division is wrongly floating above its zero check. I think we should not intentionally leave a graph in a broken state with the intention to fix it later at runtime. ------------- PR: https://git.openjdk.java.net/jdk/pull/8726 From xliu at openjdk.java.net Tue May 17 18:19:48 2022 From: xliu at openjdk.java.net (Xin Liu) Date: Tue, 17 May 2022 18:19:48 GMT Subject: RFR: 8286104: use aggressive liveness for unstable_if traps [v2] In-Reply-To: References: Message-ID: On Thu, 12 May 2022 21:27:30 GMT, Xin Liu wrote: >> I found that some phi nodes are useful because its uses are uncommon_trap nodes. In worse case, it would hinder boxing object eliminating and scalar replacement. Test.java of JDK-8286104 reveals this issue. This patch allows c2 parser to collect liveness based on next bci for unstable_if traps. In most cases, next bci is closer to exits, so live locals are diminishing. It helps to reduce the number of inputs of unstable_if traps. >> >> This is not a REDO of Optimization of Box nodes in uncommon_trap(JDK-8261137). Two patches are orthogonal. I adapt test from [TestEliminateBoxInDebugInfo.java](https://github.com/openjdk/jdk/pull/2401/files#diff-49b2e38825aa4c28ca196bdc70c3cbecc2e835c2899f4f393527df4796b177ea), so part of credit goes to the original author. 
I found that Scalar replacement can take care of the object `Integer ii = Integer.valueOf(value)` in original test even it can't be removed by later inliner. I tweak the profiling data of Integer.valueOf() to hinder scalar replacement.
>>
>> This patch can cover the problem discovered by JDK-8261137. I ran the microbench and got 9x speedup on x86_64. It's almost same as JDK-8261137.
>>
>> Before:
>>
>> Benchmark               Mode  Cnt   Score   Error  Units
>> MyBenchmark.testMethod  avgt   10  32.776 ± 0.075  us/op
>>
>> After:
>>
>> Benchmark               Mode  Cnt   Score   Error  Units
>> MyBenchmark.testMethod  avgt   10   3.656 ± 0.006  us/op
>>
>> Testing
>> I ran tier1 test with and without `-XX:+DeoptimizeALot`. No regression has been found yet.
>
> Xin Liu has updated the pull request incrementally with 11 additional commits since the last revision:
>
>  - revert code change from 1st revision.
>  - Merge branch 'JDK-8276998' into JDK-8286104
>  - rule out if a If nodes has 2 branches of unstable_if trap.
>  - change the flag to diagnostic.
>  - add sanity check for operands if bc is if_acmp_eq/ne and ifnull/nonnull
>  - fix release build
>  - update unstable_if after igvn.
>  - adjust unstable_if after fold_compares
>  - disable comparison_folding temporarily.
>
>    This feature not only folds two CMPI but also merge two uncommon_traps.
>    it uses the dominating uncommon_trap and revaluate the two if in
>    interpreter. currently, aggressiveliveness can't work for that.
>  - retain bci for unstable_if
>  - ... and 1 more: https://git.openjdk.java.net/jdk/compare/2c38b87b...2f047457

> Did you consider to record uncommon traps in adjust_map_after_if() instead of If node in its Ideal() method? And using new CallStaticJavaNode::_unc_bci instead of one in IfNode.
> I am not sure why you decided to record IfNode.

I thought there are a variety of reasons for uncommon_trap calls, it would be wasteful to add a field just for unstable_if. That's why I add a field 'unc_bci' to IfNode. You are right, I have the same feeling.
I will try your idea. > You call process_for_unstable_ifs in 2 places. I understand why you want to do that before inline_boxing_calls(). But why before inline_incrementally()? Why you need second call? The first call secures the microbenchmark score because it can remove boxing objects completely. I think incremental inliner and special inlining such as boxing or string methods could also introduce similar code snippets. I would like to sweep again for them. Is it necessary? ------------- PR: https://git.openjdk.java.net/jdk/pull/8545 From kvn at openjdk.java.net Tue May 17 18:46:01 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Tue, 17 May 2022 18:46:01 GMT Subject: RFR: 8286104: use aggressive liveness for unstable_if traps [v2] In-Reply-To: References: Message-ID: On Tue, 17 May 2022 18:16:17 GMT, Xin Liu wrote: > > You call process_for_unstable_ifs in 2 places. I understand why you want to do that before inline_boxing_calls(). But why before inline_incrementally()? Why you need second call? > > The first call secures the microbenchmark score because it can remove boxing objects completely. I think incremental inliner and special inlining such as boxing or string methods could also introduce similar code snippets. I would like to sweep again for them. Is it necessary? After some thinking we may need 3rd call before inlining boxing calls. if (eliminate_boxing()) { + process_for_unstable_ifs(igvn); // incremental inlining may added new unstable ifs. // Inline valueOf() methods now. 
inline_boxing_calls(igvn);
if (AlwaysIncrementalInline) {
  inline_incrementally(igvn);
}

-------------

PR: https://git.openjdk.java.net/jdk/pull/8545

From kvn at openjdk.java.net Tue May 17 19:06:56 2022
From: kvn at openjdk.java.net (Vladimir Kozlov)
Date: Tue, 17 May 2022 19:06:56 GMT
Subject: RFR: 8286104: use aggressive liveness for unstable_if traps [v2]
In-Reply-To:
References:
Message-ID:

On Thu, 12 May 2022 21:27:30 GMT, Xin Liu wrote:

>> I found that some phi nodes are useful because its uses are uncommon_trap nodes. In worse case, it would hinder boxing object eliminating and scalar replacement. Test.java of JDK-8286104 reveals this issue. This patch allows c2 parser to collect liveness based on next bci for unstable_if traps. In most cases, next bci is closer to exits, so live locals are diminishing. It helps to reduce the number of inputs of unstable_if traps.
>>
>> This is not a REDO of Optimization of Box nodes in uncommon_trap(JDK-8261137). Two patches are orthogonal. I adapt test from [TestEliminateBoxInDebugInfo.java](https://github.com/openjdk/jdk/pull/2401/files#diff-49b2e38825aa4c28ca196bdc70c3cbecc2e835c2899f4f393527df4796b177ea), so part of credit goes to the original author. I found that Scalar replacement can take care of the object `Integer ii = Integer.valueOf(value)` in original test even it can't be removed by later inliner. I tweak the profiling data of Integer.valueOf() to hinder scalar replacement.
>>
>> This patch can cover the problem discovered by JDK-8261137. I ran the microbench and got 9x speedup on x86_64. It's almost same as JDK-8261137.
>>
>> Before:
>>
>> Benchmark               Mode  Cnt   Score   Error  Units
>> MyBenchmark.testMethod  avgt   10  32.776 ± 0.075  us/op
>>
>> After:
>>
>> Benchmark               Mode  Cnt   Score   Error  Units
>> MyBenchmark.testMethod  avgt   10   3.656 ± 0.006  us/op
>>
>> Testing
>> I ran tier1 test with and without `-XX:+DeoptimizeALot`. No regression has been found yet.
> > Xin Liu has updated the pull request incrementally with 11 additional commits since the last revision: > > - revert code change from 1st revision. > - Merge branch 'JDK-8276998' into JDK-8286104 > - rule out if a If nodes has 2 branches of unstable_if trap. > - change the flag to diagnostic. > - add sanity check for operands if bc is if_acmp_eq/ne and ifnull/nonnull > - fix release build > - update unstable_if after igvn. > - adjust unstable_if after fold_compares > - disable comparison_folding temporarily. > > This feature not only folds two CMPI but also merge two uncommon_traps. > it uses the dominating uncommon_trap and revaluate the two if in > interpreter. currently, aggressiveliveness can't work for that. > - retain bci for unstable_if > - ... and 1 more: https://git.openjdk.java.net/jdk/compare/2c38b87b...2f047457 I think you also need to call `inline_incrementally_cleanup()` on exit from `process_for_unstable_ifs()` if it made progress (found dead locals). Or do similar thing to cleanup `*_late_inlines` lists. Placing `IF` node on work list is not enough. ------------- PR: https://git.openjdk.java.net/jdk/pull/8545 From duke at openjdk.java.net Tue May 17 20:19:14 2022 From: duke at openjdk.java.net (Brian J. Stafford) Date: Tue, 17 May 2022 20:19:14 GMT Subject: RFR: 8263075: C2: simplify anti-dependence check in PhaseCFG::implicit_null_check() [v2] In-Reply-To: References: Message-ID: > The reporter for this issue (https://bugs.openjdk.java.net/browse/JDK-8263075) indicated that there's an assumption that we can rely on that the while loop in question will run exactly one time. Based on this, I've done the following: > > - Asserted the condition that makes sure the code runs at least once > - Asserted the condition that makes sure the code runs only once > - Removed the `while` loop > - Changed a couple of `break` statements into `continue` statements. They no longer need to break out of the `while` loop, now that it's gone. 
However, they were early exits from the `while` loop that ended up resulting in `continue` statements for the larger enclosing loop. Thus we can just call `continue` directly. > - Removed the local variable `b`, as we no longer need to traverse the node hierarchy. We can use `mb` directly. > > Passes jdk, langtools, and hotspot Tier 1 tests on Linux (x64 and ARM64) and macOS (x64 and ARM64). Most Tier 1 tests pass on Windows (x64 and ARM64), but there are a handful of failures unrelated to this change. Brian J. Stafford has updated the pull request incrementally with two additional commits since the last revision: - Removed whitespace - Added braces for if statements ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/8684/files - new: https://git.openjdk.java.net/jdk/pull/8684/files/6de40386..c055174a Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=8684&range=01 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=8684&range=00-01 Stats: 4 lines in 1 file changed: 2 ins; 0 del; 2 mod Patch: https://git.openjdk.java.net/jdk/pull/8684.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8684/head:pull/8684 PR: https://git.openjdk.java.net/jdk/pull/8684 From duke at openjdk.java.net Tue May 17 20:19:16 2022 From: duke at openjdk.java.net (Brian J. Stafford) Date: Tue, 17 May 2022 20:19:16 GMT Subject: RFR: 8263075: C2: simplify anti-dependence check in PhaseCFG::implicit_null_check() [v2] In-Reply-To: References: Message-ID: On Tue, 17 May 2022 16:33:43 GMT, Vladimir Kozlov wrote: >> Brian J. Stafford has updated the pull request incrementally with two additional commits since the last revision: >> >> - Removed whitespace >> - Added braces for if statements > > src/hotspot/share/opto/lcm.cpp line 344: > >> 342: } >> 343: if (k < num_nodes) >> 344: continue; // Found anti-dependent load > > Our code style requires to use {} for `if() {}`. the same for `if` at line 340. > When we fixing code we fix its style too if it was wrong. 
Thank you for pointing this out @vnkozlov, I've updated those if statements. ------------- PR: https://git.openjdk.java.net/jdk/pull/8684 From sviswanathan at openjdk.java.net Tue May 17 22:01:56 2022 From: sviswanathan at openjdk.java.net (Sandhya Viswanathan) Date: Tue, 17 May 2022 22:01:56 GMT Subject: RFR: 8271078: jdk/incubator/vector/Float128VectorTests.java failed a subtest [v4] In-Reply-To: References: Message-ID: On Tue, 17 May 2022 03:34:58 GMT, Vladimir Kozlov wrote: >> Dean Long has updated the pull request incrementally with one additional commit since the last revision: >> >> Just do full 512-bit memory accesses when -XX:+UseKNLSetting is set > > Regarding @sviswa7 question. > > The comment in [sharedRuntime_x86_64.cpp#L458](https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/x86/sharedRuntime_x86_64.cpp#L458) says: > `16 bytes XMM registers are saved by default using fxsave/fxrstor instructions.` > That is why we did not care about saving 128 bit xmm registers before AVX512. Unfortunately `fxsave` saves only `xmm0-xmm15`. So we save `xmm16-xmm31` manually in the code Dean is fixing. But we save only 64-bits before. > > What I was surprise that there is no evex instruction to save only 128 bit of `xmm15-31` registers if `avx512vl` is not supported. I see specific asserts regarding that: [macroAssembler_x86.cpp#L2561](https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/x86/macroAssembler_x86.cpp#L2561) @vnkozlov @dean-long The "else" path is for scalar usage of xmm registers. For vector usage the "if" path should have been taken. It looks to me that C->max_vector_size() is not being set properly from some specific Vector API path. It is set properly for auto vectorizer. Do we know which subtest fails in Float128VectorTest.java and what is the command line/platform where the subtest fails? 
------------- PR: https://git.openjdk.java.net/jdk/pull/8690 From sviswanathan at openjdk.java.net Tue May 17 22:05:52 2022 From: sviswanathan at openjdk.java.net (Sandhya Viswanathan) Date: Tue, 17 May 2022 22:05:52 GMT Subject: RFR: 8271078: jdk/incubator/vector/Float128VectorTests.java failed a subtest [v4] In-Reply-To: References: Message-ID: On Tue, 17 May 2022 03:34:58 GMT, Vladimir Kozlov wrote: >> Dean Long has updated the pull request incrementally with one additional commit since the last revision: >> >> Just do full 512-bit memory accesses when -XX:+UseKNLSetting is set > > Regarding @sviswa7 question. > > The comment in [sharedRuntime_x86_64.cpp#L458](https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/x86/sharedRuntime_x86_64.cpp#L458) says: > `16 bytes XMM registers are saved by default using fxsave/fxrstor instructions.` > That is why we did not care about saving 128 bit xmm registers before AVX512. Unfortunately `fxsave` saves only `xmm0-xmm15`. So we save `xmm16-xmm31` manually in the code Dean is fixing. But we save only 64-bits before. > > What I was surprise that there is no evex instruction to save only 128 bit of `xmm15-31` registers if `avx512vl` is not supported. I see specific asserts regarding that: [macroAssembler_x86.cpp#L2561](https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/x86/macroAssembler_x86.cpp#L2561) @vnkozlov vec_spill_helper in x86.ad shows how to save 128 bits or 256 bits on platforms where avx512vl is not supported. 
-------------

PR: https://git.openjdk.java.net/jdk/pull/8690

From duke at openjdk.java.net Tue May 17 22:12:38 2022
From: duke at openjdk.java.net (Srinivas Vamsi Parasa)
Date: Tue, 17 May 2022 22:12:38 GMT
Subject: RFR: 8285868: x86_64 intrinsics for floating point methods isNaN, isFinite and isInfinite [v4]
In-Reply-To:
References:
Message-ID: <5yRIuUsVJKpD4zIbZJFzxGlj5AsTHtf4cUDPB9BccJo=.daaae706-3e48-43ed-921c-6d0c6dcf864b@github.com>

> We develop optimized x86_64 intrinsics for the floating point class check methods isNaN(), isFinite() and IsInfinite() for Float and Double classes. JMH benchmarks show ~8x improvement for isNan(), ~3x improvement for isInfinite() and 15% gain for isFinite().
>
>
> JMH Benchmark (ns/op)             Baseline   This PR (WITH vfpclassss/sd)      Speedup
>
> FloatClassCheck.testIsFinite      0.559      0.4                               1.4x
> FloatClassCheck.testIsInfinite    0.828      0.386                             2.15x
> FloatClassCheck.testIsNaN         2.589      0.387                             6.7x
> DoubleClassCheck.testIsFinite     0.568      0.414                             1.37x
> DoubleClassCheck.testIsInfinite   0.836      0.395                             2.11x
> DoubleClassCheck.testIsNaN        2.592      0.393                             6.6x
>
> JMH Benchmark (ns/op)             Baseline   This PR (WITHOUT vfpclassss/sd)   Speedup
> FloatClassCheck.testIsFinite      0.561      0.468                             1.2x
> FloatClassCheck.testIsInfinite    0.793      0.491                             1.61x
> FloatClassCheck.testIsNaN         2.587      0.469                             5.5x
> DoubleClassCheck.testIsFinite     0.561      0.592                             0.94x
> DoubleClassCheck.testIsInfinite   0.828      0.592                             1.4x
> DoubleClassCheck.testIsNaN        2.593      0.594                             4.4x

Srinivas Vamsi Parasa has updated the pull request incrementally with one additional commit since the last revision:

  Split the macros using predicate

-------------

Changes:
  - all: https://git.openjdk.java.net/jdk/pull/8459/files
  - new: https://git.openjdk.java.net/jdk/pull/8459/files/f4769cd3..802f2c15

Webrevs:
 - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=8459&range=03
 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=8459&range=02-03

  Stats: 39 lines in 1 file changed: 24 ins; 2 del; 13 mod
  Patch:
https://git.openjdk.java.net/jdk/pull/8459.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8459/head:pull/8459 PR: https://git.openjdk.java.net/jdk/pull/8459 From kvn at openjdk.java.net Tue May 17 23:07:47 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Tue, 17 May 2022 23:07:47 GMT Subject: RFR: 8263075: C2: simplify anti-dependence check in PhaseCFG::implicit_null_check() [v2] In-Reply-To: References: Message-ID: On Tue, 17 May 2022 20:19:14 GMT, Brian J. Stafford wrote: >> The reporter for this issue (https://bugs.openjdk.java.net/browse/JDK-8263075) indicated that there's an assumption that we can rely on that the while loop in question will run exactly one time. Based on this, I've done the following: >> >> - Asserted the condition that makes sure the code runs at least once >> - Asserted the condition that makes sure the code runs only once >> - Removed the `while` loop >> - Changed a couple of `break` statements into `continue` statements. They no longer need to break out of the `while` loop, now that it's gone. However, they were early exits from the `while` loop that ended up resulting in `continue` statements for the larger enclosing loop. Thus we can just call `continue` directly. >> - Removed the local variable `b`, as we no longer need to traverse the node hierarchy. We can use `mb` directly. >> >> Passes jdk, langtools, and hotspot Tier 1 tests on Linux (x64 and ARM64) and macOS (x64 and ARM64). Most Tier 1 tests pass on Windows (x64 and ARM64), but there are a handful of failures unrelated to this change. > > Brian J. Stafford has updated the pull request incrementally with two additional commits since the last revision: > > - Removed whitespace > - Added braces for if statements Good. Tobias is running testing. Please wait results. ------------- Marked as reviewed by kvn (Reviewer). 
PR: https://git.openjdk.java.net/jdk/pull/8684 From duke at openjdk.java.net Wed May 18 00:15:59 2022 From: duke at openjdk.java.net (Quan Anh Mai) Date: Wed, 18 May 2022 00:15:59 GMT Subject: RFR: 8286182: C2: crash with SIGFPE when executing compiled code In-Reply-To: References: Message-ID: On Tue, 17 May 2022 17:39:51 GMT, Christian Hagedorn wrote: > Do you have a failing test for this case? We've been seeing a lot of SIGFPE failures lately with Java Fuzzer. I have to walk through them to see if I can find a case that is still failing with this fix. Will get back with the result of this analysis. Overflow only happens with dividend = MIN_VALUE and divisor = -1 so it would be hard to have a failing test for this case. I will try to come up with one. > I'm not sure if we should rely on signal catching to fix the cases where a division is wrongly floating above its zero check. I think we should not intentionally leave a graph in a broken state with the intention to fix it later at runtime. IIUC the graph is broken only because the division raises SIGFPE for invalid inputs. If we chose to ignore the signal instead then we can treat the Div nodes similar to how we treat other nodes such as Add or Sub and let them freely float around. Thanks. ------------- PR: https://git.openjdk.java.net/jdk/pull/8726 From duke at openjdk.java.net Wed May 18 01:13:00 2022 From: duke at openjdk.java.net (Haomin) Date: Wed, 18 May 2022 01:13:00 GMT Subject: RFR: 8286847: Rotate vectors don't support byte or short [v2] In-Reply-To: References: Message-ID: > static void test_fun(byte[] a0, int[] b0, byte[] c0) { > for (int i=0; i c0[i] = (byte)(a0[i] << (7) | a0[i] >>> (-7)); > } > } > > > when I implement RotateLeftV in loongarch.ad, I found this executed by c2 vector and executed by interpreter are not equal. > > It's executed on x86 would create an assert error. 
> > > # Internal Error (/home/wanghaomin/jdk/src/hotspot/share/opto/vectornode.cpp:347), pid=26469, tid=26485 > # assert(false) failed: not supported: byte > > > RotateLeftV for byte, short values produces incorrect Java result. Because java code should convert a byte, short value into int value, and then do RotateI. Haomin has updated the pull request incrementally with one additional commit since the last revision: modify testcase ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/8740/files - new: https://git.openjdk.java.net/jdk/pull/8740/files/2a01af4a..05ebce18 Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=8740&range=01 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=8740&range=00-01 Stats: 123 lines in 2 files changed: 36 ins; 38 del; 49 mod Patch: https://git.openjdk.java.net/jdk/pull/8740.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8740/head:pull/8740 PR: https://git.openjdk.java.net/jdk/pull/8740 From duke at openjdk.java.net Wed May 18 01:33:49 2022 From: duke at openjdk.java.net (Haomin) Date: Wed, 18 May 2022 01:33:49 GMT Subject: RFR: 8286847: Rotate vectors don't support byte or short [v2] In-Reply-To: References: Message-ID: <_dbS1KUqySC9ITrfoHqXI1Auo4d5YqrlNF23E3seK94=.326c115e-1b4f-457a-b5c9-461b2e51de78@github.com> On Wed, 18 May 2022 01:13:00 GMT, Haomin wrote: >> static void test_fun(byte[] a0, int[] b0, byte[] c0) { >> for (int i=0; i> c0[i] = (byte)(a0[i] << (7) | a0[i] >>> (-7)); >> } >> } >> >> >> when I implement RotateLeftV in loongarch.ad, I found this executed by c2 vector and executed by interpreter are not equal. >> >> It's executed on x86 would create an assert error. >> >> >> # Internal Error (/home/wanghaomin/jdk/src/hotspot/share/opto/vectornode.cpp:347), pid=26469, tid=26485 >> # assert(false) failed: not supported: byte >> >> >> RotateLeftV for byte, short values produces incorrect Java result. 
Because java code should convert a byte, short value into int value, and then do RotateI. > > Haomin has updated the pull request incrementally with one additional commit since the last revision: > > modify testcase @theRealELiu I have changed. Please review again, thanks. ------------- PR: https://git.openjdk.java.net/jdk/pull/8740 From njian at openjdk.java.net Wed May 18 01:34:50 2022 From: njian at openjdk.java.net (Ningsheng Jian) Date: Wed, 18 May 2022 01:34:50 GMT Subject: RFR: 8281712: [REDO] AArch64: Implement string_compare intrinsic in SVE In-Reply-To: References: Message-ID: On Tue, 17 May 2022 08:37:52 GMT, Nick Gasson wrote: > LGTM! Thank you, Nick! ------------- PR: https://git.openjdk.java.net/jdk/pull/8723 From njian at openjdk.java.net Wed May 18 01:37:48 2022 From: njian at openjdk.java.net (Ningsheng Jian) Date: Wed, 18 May 2022 01:37:48 GMT Subject: Integrated: 8281712: [REDO] AArch64: Implement string_compare intrinsic in SVE In-Reply-To: References: Message-ID: <8Q5uhLsweukyIRxn1emxuU_TVweZAfHjaK2xed66h4M=.bb9ca325-ce0d-4d93-a5d7-6e9cebcb182a@github.com> On Mon, 16 May 2022 07:21:27 GMT, Ningsheng Jian wrote: > This is the REDO of JDK-8269559 and JDK-8275448. Those two backouts finally turned to be some system zlib issue in AArch64 macOS, and is not related to the patch itself. See [1][2] for details. > > This patch is generally the same as JDK-8275448, which uses SVE to optimize string_compare intrinsics for long string comparisons. I did a rebase with small tweaks to get better performance on recent Neoverse hardware. 
Test data on systems with different SVE vector sizes:
>
>
> case           delta   size   128-bits   256-bits   512-bits
> compareToLL    2       24       0.17%      0.58%      0.00%
> compareToLL    2       36       0.00%      2.25%      0.04%
> compareToLL    2       72      -4.40%      3.87%    -12.82%
> compareToLL    2       128      4.55%     58.31%     13.53%
> compareToLL    2       256     19.39%     69.77%     82.03%
> compareToLL    2       512      1.81%     68.38%    170.93%
> compareToLU    2       24      25.57%     46.98%     54.61%
> compareToLU    2       36      36.03%     70.26%     94.33%
> compareToLU    2       72      35.86%     90.58%    146.04%
> compareToLU    2       128     70.82%    119.19%    266.22%
> compareToLU    2       256     80.77%    146.33%    420.01%
> compareToLU    2       512     94.62%    171.72%    530.87%
> compareToUL    2       24      20.82%     34.48%     62.14%
> compareToUL    2       36      39.77%     60.79%     69.77%
> compareToUL    2       72      35.46%     84.34%    121.90%
> compareToUL    2       128     67.77%    110.97%    220.53%
> compareToUL    2       256     77.05%    160.29%    331.30%
> compareToUL    2       512     91.88%    184.57%    524.21%
> compareToUU    2       24      -0.13%      0.40%      0.00%
> compareToUU    2       36      -9.18%     12.84%    -13.93%
> compareToUU    2       72       1.67%     60.61%      6.69%
> compareToUU    2       128     13.51%     60.33%     55.27%
> compareToUU    2       256      2.55%     62.17%    153.26%
> compareToUU    2       512      4.12%     68.62%    201.68%
>
> JTreg tests passed on SVE hardware.
>
> [1] https://bugs.openjdk.java.net/browse/JDK-8275448
> [2] https://bugs.openjdk.java.net/browse/JDK-8282954

This pull request has now been integrated.
Changeset: b5526e5e Author: Ningsheng Jian URL: https://git.openjdk.java.net/jdk/commit/b5526e5e5935658ed1d39938441ae1a3417c0545 Stats: 443 lines in 7 files changed: 433 ins; 0 del; 10 mod 8281712: [REDO] AArch64: Implement string_compare intrinsic in SVE Co-authored-by: Tat Wai Chong Reviewed-by: thartmann, ngasson ------------- PR: https://git.openjdk.java.net/jdk/pull/8723 From duke at openjdk.java.net Wed May 18 02:26:55 2022 From: duke at openjdk.java.net (Quan Anh Mai) Date: Wed, 18 May 2022 02:26:55 GMT Subject: RFR: 8285868: x86_64 intrinsics for floating point methods isNaN, isFinite and isInfinite [v4] In-Reply-To: <5yRIuUsVJKpD4zIbZJFzxGlj5AsTHtf4cUDPB9BccJo=.daaae706-3e48-43ed-921c-6d0c6dcf864b@github.com> References: <5yRIuUsVJKpD4zIbZJFzxGlj5AsTHtf4cUDPB9BccJo=.daaae706-3e48-43ed-921c-6d0c6dcf864b@github.com> Message-ID: On Tue, 17 May 2022 22:12:38 GMT, Srinivas Vamsi Parasa wrote: >> We develop optimized x86_64 intrinsics for the floating point class check methods isNaN(), isFinite() and IsInfinite() for Float and Double classes. JMH benchmarks show ~8x improvement for isNan(), ~3x improvement for isInfinite() and 15% gain for isFinite(). 
>>
>>
>> JMH Benchmark (ns/op)             Baseline   This PR (WITH vfpclassss/sd)      Speedup
>>
>> FloatClassCheck.testIsFinite      0.559      0.4                               1.4x
>> FloatClassCheck.testIsInfinite    0.828      0.386                             2.15x
>> FloatClassCheck.testIsNaN         2.589      0.387                             6.7x
>> DoubleClassCheck.testIsFinite     0.568      0.414                             1.37x
>> DoubleClassCheck.testIsInfinite   0.836      0.395                             2.11x
>> DoubleClassCheck.testIsNaN        2.592      0.393                             6.6x
>>
>> JMH Benchmark (ns/op)             Baseline   This PR (WITHOUT vfpclassss/sd)   Speedup
>> FloatClassCheck.testIsFinite      0.561      0.468                             1.2x
>> FloatClassCheck.testIsInfinite    0.793      0.491                             1.61x
>> FloatClassCheck.testIsNaN         2.587      0.469                             5.5x
>> DoubleClassCheck.testIsFinite     0.561      0.592                             0.94x
>> DoubleClassCheck.testIsInfinite   0.828      0.592                             1.4x
>> DoubleClassCheck.testIsNaN        2.593      0.594                             4.4x
>
> Srinivas Vamsi Parasa has updated the pull request incrementally with one additional commit since the last revision:
>
>   Split the macros using predicate

Hi,

I'm working on #8525 which also improves the performance of these methods. With that patch, `isNaN` is reduced to the optimal sequence `ucomiss x, x; jp label`. This patch still benefits the performance of `isFinite` and `isInfinite` for float cases. For double cases without `vfpclass`, I'm not sure due to the materialisation of long constants, though.

Also, can we output the result of the intrinsics directly in the flag registers?

Thanks.

src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 4753:

> 4751:   switch (opcode) {
> 4752:     case Op_IsFiniteF:
> 4753:       setb(Assembler::below, dst);

This partial write may stall later reads on `dst`, you could emit a `xor dst, dst` before doing the comparison.

src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 4785:

> 4783:   kmovbl(dst, tmp);
> 4784:   if (opcode == Op_IsFiniteF) {
> 4785:     xorl(dst, 0x00000001);

`notl(dst)`?

src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 4800:

> 4798:   mov64(temp1, KILL_SIGN_MASK);
> 4799:   andq(temp, temp1);
> 4800:   mov64(temp2, POS_INF);

Can we use `temp1` for this, too?
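
For readers following the review comments above, here is a minimal Java sketch of the bit-level checks that the quoted `KILL_SIGN_MASK` / `POS_INF` sequence appears to perform for doubles. The class name and the constant values are assumptions chosen for illustration (the values are the standard IEEE 754 double layout), and the sketch is not taken from the patch itself. The `d != d` form of `isNaN` corresponds to the `ucomiss x, x; jp label` idea mentioned above.

```java
// Illustrative sketch only: class name and constants are assumptions, not code from the PR.
class FpClassCheckSketch {
    // Clears the sign bit of an IEEE 754 double bit pattern.
    static final long KILL_SIGN_MASK = 0x7FFFFFFFFFFFFFFFL;
    // Bit pattern of +Infinity: exponent all ones, mantissa zero.
    static final long POS_INF = 0x7FF0000000000000L;

    // NaN is the only value that compares unequal to itself.
    static boolean isNaN(double d) {
        return d != d;
    }

    // After dropping the sign, an infinity has exactly the +Infinity bit pattern.
    static boolean isInfinite(double d) {
        return (Double.doubleToRawLongBits(d) & KILL_SIGN_MASK) == POS_INF;
    }

    // After dropping the sign, finite values sort strictly below +Infinity;
    // NaN patterns sort above it because their mantissa is non-zero.
    static boolean isFinite(double d) {
        return (Double.doubleToRawLongBits(d) & KILL_SIGN_MASK) < POS_INF;
    }

    public static void main(String[] args) {
        double[] samples = {0.0, -0.0, 1.5, Double.MAX_VALUE, Double.MIN_VALUE,
                            Double.POSITIVE_INFINITY, Double.NEGATIVE_INFINITY, Double.NaN};
        for (double d : samples) {
            System.out.println(d + " nan=" + isNaN(d)
                    + " inf=" + isInfinite(d) + " finite=" + isFinite(d));
        }
    }
}
```

The sketch needs two 64-bit constants for the double case, which is the "materialisation of long constants" cost raised in the review; reusing one temporary for both constants, as suggested in the last comment, would not change the results.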
------------- PR: https://git.openjdk.java.net/jdk/pull/8459 From dlong at openjdk.java.net Wed May 18 02:34:44 2022 From: dlong at openjdk.java.net (Dean Long) Date: Wed, 18 May 2022 02:34:44 GMT Subject: RFR: 8271078: jdk/incubator/vector/Float128VectorTests.java failed a subtest [v4] In-Reply-To: References: Message-ID: On Tue, 17 May 2022 03:34:58 GMT, Vladimir Kozlov wrote: >> Dean Long has updated the pull request incrementally with one additional commit since the last revision: >> >> Just do full 512-bit memory accesses when -XX:+UseKNLSetting is set > > Regarding @sviswa7 question. > > The comment in [sharedRuntime_x86_64.cpp#L458](https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/x86/sharedRuntime_x86_64.cpp#L458) says: > `16 bytes XMM registers are saved by default using fxsave/fxrstor instructions.` > That is why we did not care about saving 128 bit xmm registers before AVX512. Unfortunately `fxsave` saves only `xmm0-xmm15`. So we save `xmm16-xmm31` manually in the code Dean is fixing. But we save only 64-bits before. > > What I was surprise that there is no evex instruction to save only 128 bit of `xmm15-31` registers if `avx512vl` is not supported. I see specific asserts regarding that: [macroAssembler_x86.cpp#L2561](https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/x86/macroAssembler_x86.cpp#L2561) > @vnkozlov @dean-long The "else" path is for scalar usage of xmm registers. For vector usage the "if" path should have been taken. It looks to me that C->max_vector_size() is not being set properly from some specific Vector API path. It is set properly for auto vectorizer. Do we know which subtest fails in Float128VectorTest.java and what is the command line/platform where the subtest fails? The failures are in MINReduceFloat128VectorTestsMasked() and MAXReduceFloat128VectorTestsMasked() when XMM16-XMM31 are used with -XX:UseAVX=3. Currently the "else" path is not just for scalars. It is also for vector sizes not considered "wide". 
------------- PR: https://git.openjdk.java.net/jdk/pull/8690 From duke at openjdk.java.net Wed May 18 03:03:46 2022 From: duke at openjdk.java.net (Quan Anh Mai) Date: Wed, 18 May 2022 03:03:46 GMT Subject: RFR: 8286847: Rotate vectors don't support byte or short [v2] In-Reply-To: References: <0CPtvcY81Q2R9_6iONGV87qAJ4OH5izCGIAByWBBqSM=.5fd06621-8ecb-4788-ac3e-d5cd29784236@github.com> Message-ID: On Tue, 17 May 2022 08:46:21 GMT, Eric Liu wrote: >> Yes, `rotateLeftRes` is duplicated with `testRotateLeft`. But compile only `testRotateLeft`, and then compare the result between the two function. Could you give me some suggestions ? How should I modify this ? > > Two options off the top of my head: > > a) With `-Xcomp`, hardcode the expected values which are calculated by interpreter or just by hand and compared with the results of C2. > > b) With `-XX:-TieredCompilation`, generate the expected results if the method was not hot enough to compiled by C2. Then compared with C2 results, which can be got when the iteration count is more than 10K. Please refer to https://github.com/openjdk/jdk/pull/5403/files/71aa6ac439b67b27828e9aabe51845fa34602837#diff-d14ca09ba5fa806904a4db333037a14621cfeba81505ba369a375c57bd90c7a8 for more detail. > > Option a) is much more simpler. Option b) can verify more random values. You can tell the compiler to only compile the tested methods by using the `-XX:CompileCommand`, the other methods will only run in interpreter mode. As a result, although they look the same their real executions are completely different. Thanks. 
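
As background for why the interpreter result is the reference in this test, the following sketch shows what the Java expression from the bug report computes. The byte operand is promoted to `int` before shifting, so the expression is a 32-bit rotate followed by a narrowing cast. The `laneRotate` method is a hypothetical 8-bit rotate, included only to show how a rotate performed within a byte lane could differ; it is an assumption for illustration and not a description of what any backend emits.

```java
// Illustrative sketch only: class and method names are assumptions.
class ByteRotateSketch {
    // The expression from the bug report. `a` is promoted to int, and the
    // shift distance -7 is masked to 25, so this equals
    // (byte) Integer.rotateLeft(a, 7).
    static byte javaRotate(byte a) {
        return (byte) (a << 7 | a >>> -7);
    }

    // Hypothetical rotate-left by 7 confined to the low 8 bits.
    static byte laneRotate(byte a) {
        int u = a & 0xFF;
        return (byte) ((u << 7) | (u >>> 1));
    }

    public static void main(String[] args) {
        int mismatches = 0;
        for (int i = Byte.MIN_VALUE; i <= Byte.MAX_VALUE; i++) {
            byte a = (byte) i;
            if (javaRotate(a) != laneRotate(a)) {
                mismatches++;
            }
        }
        System.out.println("inputs where the two rotates differ: " + mismatches);
    }
}
```

For example, with input 2 the Java expression yields 0 (the bit shifted out of the low byte is lost in the cast), while the 8-bit rotate yields 1.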
------------- PR: https://git.openjdk.java.net/jdk/pull/8740 From dlong at openjdk.java.net Wed May 18 03:07:53 2022 From: dlong at openjdk.java.net (Dean Long) Date: Wed, 18 May 2022 03:07:53 GMT Subject: RFR: 8271078: jdk/incubator/vector/Float128VectorTests.java failed a subtest [v4] In-Reply-To: References: Message-ID: On Tue, 17 May 2022 03:34:58 GMT, Vladimir Kozlov wrote: >> Dean Long has updated the pull request incrementally with one additional commit since the last revision: >> >> Just do full 512-bit memory accesses when -XX:+UseKNLSetting is set > > Regarding @sviswa7 question. > > The comment in [sharedRuntime_x86_64.cpp#L458](https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/x86/sharedRuntime_x86_64.cpp#L458) says: > `16 bytes XMM registers are saved by default using fxsave/fxrstor instructions.` > That is why we did not care about saving 128 bit xmm registers before AVX512. Unfortunately `fxsave` saves only `xmm0-xmm15`. So we save `xmm16-xmm31` manually in the code Dean is fixing. But we save only 64-bits before. > > What I was surprise that there is no evex instruction to save only 128 bit of `xmm15-31` registers if `avx512vl` is not supported. I see specific asserts regarding that: [macroAssembler_x86.cpp#L2561](https://github.com/openjdk/jdk/blob/master/src/hotspot/cpu/x86/macroAssembler_x86.cpp#L2561) > @vnkozlov vec_spill_helper in x86.ad shows how to save 128 bits or 256 bits on platforms where avx512vl is not supported. Thanks, I could change the code to use vextractf32x4/vinsertf32x4, but if it's really important to optimize memory bandwidth for this case, then we should probably go with the C2-specific solution #2. @sviswa7, the problem happens when this C2 register class `reg_class_dynamic vectorx_reg_vlbwdq(vectorx_reg_evex, vectorx_reg_legacy, %{ VM_Version::supports_avx512vlbwdq() %} ); ` selects `vectorx_reg_evex` based on `supports_avx512vlbwdq()`. 
This will still use the "else" path because 16-byte vectors are not considered "wide" by is_wide_vector(). ------------- PR: https://git.openjdk.java.net/jdk/pull/8690 From eliu at openjdk.java.net Wed May 18 03:17:49 2022 From: eliu at openjdk.java.net (Eric Liu) Date: Wed, 18 May 2022 03:17:49 GMT Subject: RFR: 8286847: Rotate vectors don't support byte or short [v2] In-Reply-To: References: Message-ID: <2V5Rgl0oZfnB3HoRD_dAQg9Krz0Dp70EXUVJsHCtRJc=.f1b775b7-a071-4ecf-8825-9037e4e7b150@github.com> On Wed, 18 May 2022 01:13:00 GMT, Haomin wrote: >> static void test_fun(byte[] a0, int[] b0, byte[] c0) { >> for (int i=0; i> c0[i] = (byte)(a0[i] << (7) | a0[i] >>> (-7)); >> } >> } >> >> >> when I implement RotateLeftV in loongarch.ad, I found this executed by c2 vector and executed by interpreter are not equal. >> >> It's executed on x86 would create an assert error. >> >> >> # Internal Error (/home/wanghaomin/jdk/src/hotspot/share/opto/vectornode.cpp:347), pid=26469, tid=26485 >> # assert(false) failed: not supported: byte >> >> >> RotateLeftV for byte, short values produces incorrect Java result. Because java code should convert a byte, short value into int value, and then do RotateI. > > Haomin has updated the pull request incrementally with one additional commit since the last revision: > > modify testcase I prefer to merge these two test cases into a single one. Vector related tests passed on AArch64 in my local dev container. 
@jatin-bhateja ------------- PR: https://git.openjdk.java.net/jdk/pull/8740 From duke at openjdk.java.net Wed May 18 03:17:50 2022 From: duke at openjdk.java.net (Haomin) Date: Wed, 18 May 2022 03:17:50 GMT Subject: RFR: 8286847: Rotate vectors don't support byte or short [v2] In-Reply-To: References: Message-ID: On Wed, 18 May 2022 01:13:00 GMT, Haomin wrote: >> static void test_fun(byte[] a0, int[] b0, byte[] c0) { >> for (int i=0; i> c0[i] = (byte)(a0[i] << (7) | a0[i] >>> (-7)); >> } >> } >> >> >> when I implement RotateLeftV in loongarch.ad, I found this executed by c2 vector and executed by interpreter are not equal. >> >> It's executed on x86 would create an assert error. >> >> >> # Internal Error (/home/wanghaomin/jdk/src/hotspot/share/opto/vectornode.cpp:347), pid=26469, tid=26485 >> # assert(false) failed: not supported: byte >> >> >> RotateLeftV for byte, short values produces incorrect Java result. Because java code should convert a byte, short value into int value, and then do RotateI. 
> > Haomin has updated the pull request incrementally with one additional commit since the last revision: > > modify testcase test/hotspot/jtreg/compiler/vectorization/TestRotateByteVector.java line 30: > 28: * @summary Test vectorization of rotate byte > 29: * @library /test/lib > 30: * @run main/othervm -XX:-TieredCompilation -XX:CompileCommand=compileonly,TestRotateByteVector::testRotate* -Xbatch TestRotateByteVector have used `-XX:CompileCommand=compileonly,TestRotateByteVector::testRotate*` ------------- PR: https://git.openjdk.java.net/jdk/pull/8740 From duke at openjdk.java.net Wed May 18 03:19:49 2022 From: duke at openjdk.java.net (Quan Anh Mai) Date: Wed, 18 May 2022 03:19:49 GMT Subject: RFR: 8285868: x86_64 intrinsics for floating point methods isNaN, isFinite and isInfinite [v4] In-Reply-To: <5yRIuUsVJKpD4zIbZJFzxGlj5AsTHtf4cUDPB9BccJo=.daaae706-3e48-43ed-921c-6d0c6dcf864b@github.com> References: <5yRIuUsVJKpD4zIbZJFzxGlj5AsTHtf4cUDPB9BccJo=.daaae706-3e48-43ed-921c-6d0c6dcf864b@github.com> Message-ID: On Tue, 17 May 2022 22:12:38 GMT, Srinivas Vamsi Parasa wrote: >> We develop optimized x86_64 intrinsics for the floating point class check methods isNaN(), isFinite() and IsInfinite() for Float and Double classes. JMH benchmarks show ~8x improvement for isNan(), ~3x improvement for isInfinite() and 15% gain for isFinite(). 
>> >> >> JMH Benchmark (ns/op) Baseline This PR (WITH vfpclassss/sd) Speedup >> >> FloatClassCheck.testIsFinite 0.559 0.4 1.4x >> FloatClassCheck.testIsInfinite 0.828 0.386 2.15x >> FloatClassCheck.testIsNaN 2.589 0.387 6.7x >> DoubleClassCheck.testIsFinite 0.568 0.414 1.37x >> DoubleClassCheck.testIsInfinite 0.836 0.395 2.11x >> DoubleClassCheck.testIsNaN 2.592 0.393 6.6x >> >> JMH Benchmark (ns/op) Baseline This PR (WITHOUT vfpclassss/sd) Speedup >> FloatClassCheck.testIsFinite 0.561 0.468 1.2x >> FloatClassCheck.testIsInfinite 0.793 0.491 1.61x >> FloatClassCheck.testIsNaN 2.587 0.469 5.5x >> DoubleClassCheck.testIsFinite 0.561 0.592 0.94x >> DoubleClassCheck.testIsInfinite 0.828 0.592 1.4x >> DoubleClassCheck.testIsNaN 2.593 0.594 4.4x > > Srinivas Vamsi Parasa has updated the pull request incrementally with one additional commit since the last revision: > > Split the macros using predicate Also, for non `vfpclass` cases, it would be simpler and more efficient to implement in the Java side instead static Float::isFinite(float f) { return (floatToRawIntBits(f) & SIGN_ELIMINATION) < POS_INFINITY_BITS; } ------------- PR: https://git.openjdk.java.net/jdk/pull/8459 From xgong at openjdk.java.net Wed May 18 04:15:42 2022 From: xgong at openjdk.java.net (Xiaohong Gong) Date: Wed, 18 May 2022 04:15:42 GMT Subject: RFR: 8286847: Rotate vectors don't support byte or short [v2] In-Reply-To: References: Message-ID: On Wed, 18 May 2022 01:13:00 GMT, Haomin wrote: >> static void test_fun(byte[] a0, int[] b0, byte[] c0) { >> for (int i=0; i> c0[i] = (byte)(a0[i] << (7) | a0[i] >>> (-7)); >> } >> } >> >> >> when I implement RotateLeftV in loongarch.ad, I found this executed by c2 vector and executed by interpreter are not equal. >> >> It's executed on x86 would create an assert error. 
>>
>>
>> # Internal Error (/home/wanghaomin/jdk/src/hotspot/share/opto/vectornode.cpp:347), pid=26469, tid=26485
>> # assert(false) failed: not supported: byte
>>
>>
>> RotateLeftV for byte, short values produces incorrect Java result. Because java code should convert a byte, short value into int value, and then do RotateI.
>
> Haomin has updated the pull request incrementally with one additional commit since the last revision:
>
>   modify testcase

src/hotspot/share/opto/vectornode.cpp line 158:

> 156:     default: return 0; // RotateLeftV for byte, short values produces incorrect Java result.
> 157:                        // Because java code should convert a byte, short value into int value,
> 158:                        // and then do RotateI.

I'm afraid this will influence the VectorAPI intrinsification for `ByteVector` and `ShortVector`. Did you test the benchmarks for these two APIs with the subword type?

-------------

PR: https://git.openjdk.java.net/jdk/pull/8740

From duke at openjdk.java.net Wed May 18 04:30:53 2022
From: duke at openjdk.java.net (Srinivas Vamsi Parasa)
Date: Wed, 18 May 2022 04:30:53 GMT
Subject: RFR: 8285868: x86_64 intrinsics for floating point methods isNaN, isFinite and isInfinite [v4]
In-Reply-To:
References: <5yRIuUsVJKpD4zIbZJFzxGlj5AsTHtf4cUDPB9BccJo=.daaae706-3e48-43ed-921c-6d0c6dcf864b@github.com>
Message-ID:

On Wed, 18 May 2022 02:23:10 GMT, Quan Anh Mai wrote:

> Hi, I'm working on #8525 which also improves the performance of these methods. With that patch, `isNaN` is reduced to the optimal sequence `ucomiss x, x; jp label`. This patch still benefits the performance of `isFinite` and `isInfinite` for float cases. For double cases without `vfpclass`, I'm not sure due to the materialisation of long constants, though.
>
> Also, can we output the result of the intrinsics directly in the flag registers? Thanks.

Thank you for the review!
The main bottleneck for the performance of `isNaN()` comes from the popfq instruction used for fixing up the bits in the flags register.
Is your patch #8525 fixing that? > Also, for non `vfpclass` cases, it would be simpler and more efficient to implement in the Java side instead > > ``` > static Float::isFinite(float f) { > return (floatToRawIntBits(f) & SIGN_ELIMINATION) < POS_INFINITY_BITS; > } > ``` True, but that needs changes to java.lang.Float. Will that be approved? > src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 4753: > >> 4751: switch (opcode) { >> 4752: case Op_IsFiniteF: >> 4753: setb(Assembler::below, dst); > > This partial write may stall later reads on `dst`, you could emit a `xor dst, dst` before doing the comparison. Will try the `xor dst, dst` and see the performance changes. > src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 4785: > >> 4783: kmovbl(dst, tmp); >> 4784: if (opcode == Op_IsFiniteF) { >> 4785: xorl(dst, 0x00000001); > > `notl(dst)`? I can't recall why `notl(dst)` didn't work. Will try it and let you know. > src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 4800: > >> 4798: mov64(temp1, KILL_SIGN_MASK); >> 4799: andq(temp, temp1); >> 4800: mov64(temp2, POS_INF); > > Can we use `temp1` for this, too? Sure, will make the change ------------- PR: https://git.openjdk.java.net/jdk/pull/8459 From xxinliu at amazon.com Wed May 18 04:46:22 2022 From: xxinliu at amazon.com (Liu, Xin) Date: Tue, 17 May 2022 21:46:22 -0700 Subject: [External] : Re: API to create a new Allocate node? In-Reply-To: <7457d99a-16c9-d3f3-a8b8-0592b8a99cb9@oracle.com> References: <03A842D1-E6A1-476F-8282-DD3E579A23DE@oracle.com> <1E5122C4-21E0-4B37-B4A8-47A83A133E63@oracle.com> <7457d99a-16c9-d3f3-a8b8-0592b8a99cb9@oracle.com> Message-ID: hi, Vladimir and Cesar, I fail to comprehend this part. Here I build a program like you did. black() is an external method. 
$cat MergeObjects.java
public class MergeObjects {
    static class Point {
        public int x;
        public int y;

        Point(int x, int y) {
            this.x = x;
            this.y = y;
        }
    }

    public static void black() {}

    // java -Xcomp -XX:CompileOnly=compileonly,MergeObjects::merge2 -XX:CompileCommand=dontinline,MergeObjects::black -XX:+PrintCompilation -XX:CompileCommand=quiet -XX:+PrintEscapeAnalysis -XX:+PrintEscapeAnalysis MergeObject
    public static int merge2(boolean cond) {
        Point p;
        if (cond) {
            p = new Point(0, 0);
        } else {
            p = new Point(1, 2);
        }
        black(); // Allocation does not escape but it is referenced in debug info in JVMS of this call.
        return p.x;
    }

    public static void main(String[] args) {
        Point p = new Point(0, 0); // force to load class Point()
        merge2(true);
    }
}

If CU can't inline black(), black() becomes a game changer. We have to preserve live locals because bci=30, right after invokestatic #12, references local #1, which is p2 = phi(p0, p1). We don't know black(); it may trigger deoptimization for whatever reason and return to bci:30 in the interpreter.

  public static int merge2(boolean);
    descriptor: (Z)I
    flags: (0x0009) ACC_PUBLIC, ACC_STATIC
    Code:
      stack=4, locals=2, args_size=1
         0: iload_0
         1: ifeq          17
         4: new           #7                  // class MergeObjects$Point
         7: dup
         8: iconst_0
         9: iconst_0
        10: invokespecial #9                  // Method MergeObjects$Point."<init>":(II)V
        13: astore_1
        14: goto          27
        17: new           #7                  // class MergeObjects$Point
        20: dup
        21: iconst_1
        22: iconst_2
        23: invokespecial #9                  // Method MergeObjects$Point."<init>":(II)V
        26: astore_1
        27: invokestatic  #12                 // Method black:()V
        30: aload_1
        31: getfield      #17                 // Field MergeObjects$Point.x:I
        34: ireturn

No matter whether we invent a device or split the If node, I think it's inevitable to rematerialize local #1 in deoptimization. Does HotSpot support that? From my reading, we serialize an object pool in the backend. Each object has an identifier which was its allocate node idx (even though the allocate node has long gone after codegen).
I feel we can't perform this rematerialization in flow sensitive way because we miss a selector in debuginfo. I see many uncommon_trap nodes retain locals references. That's why I am working on JDK-8286104. I think this can be a new feature of deoptimization module. Allow me rephrase it. Is it worth pursuing? In general, there is a phiNode L = phi(p0, p1, ... , pk). p0 is a unique non-escaping object and others are global objects. L is live at a safepoint Node. deoptimization supports to rematerialize it. thanks, --lx On 4/29/22 4:20 PM, Vladimir Kozlov wrote: > has own ID which we can use to > construct debug info for deoptimization. And it will be gone after macro expansion - replaced with > SafePointScalarObjectNode or simply removed if it is not referenced. From duke at openjdk.java.net Wed May 18 04:55:37 2022 From: duke at openjdk.java.net (Srinivas Vamsi Parasa) Date: Wed, 18 May 2022 04:55:37 GMT Subject: RFR: 8285868: x86_64 intrinsics for floating point methods isNaN, isFinite and isInfinite [v5] In-Reply-To: References: Message-ID: > We develop optimized x86_64 intrinsics for the floating point class check methods `isNaN()`, `isFinite()` and `IsInfinite()` for Float and Double classes. JMH benchmarks show ~6x improvement for `isNan()`, ~2x improvement for `isInfinite()` and 40% gain for `isFinite()` using` vfpclasss(s/d)` instructions. 
> > > JMH Benchmark (ns/op) Baseline This PR (WITH vfpclassss/sd) Speedup > > FloatClassCheck.testIsFinite 0.559 0.4 1.4x > FloatClassCheck.testIsInfinite 0.828 0.386 2.15x > FloatClassCheck.testIsNaN 2.589 0.387 6.7x > DoubleClassCheck.testIsFinite 0.568 0.414 1.37x > DoubleClassCheck.testIsInfinite 0.836 0.395 2.11x > DoubleClassCheck.testIsNaN 2.592 0.393 6.6x > > JMH Benchmark (ns/op) Baseline This PR (WITHOUT vfpclassss/sd) Speedup > FloatClassCheck.testIsFinite 0.561 0.468 1.2x > FloatClassCheck.testIsInfinite 0.793 0.491 1.61x > FloatClassCheck.testIsNaN 2.587 0.469 5.5x > DoubleClassCheck.testIsFinite 0.561 0.592 0.94x > DoubleClassCheck.testIsInfinite 0.828 0.592 1.4x > DoubleClassCheck.testIsNaN 2.593 0.594 4.4x Srinivas Vamsi Parasa has updated the pull request incrementally with one additional commit since the last revision: remove the redundant temp register ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/8459/files - new: https://git.openjdk.java.net/jdk/pull/8459/files/802f2c15..cf13ec76 Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=8459&range=04 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=8459&range=03-04 Stats: 8 lines in 3 files changed: 0 ins; 1 del; 7 mod Patch: https://git.openjdk.java.net/jdk/pull/8459.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8459/head:pull/8459 PR: https://git.openjdk.java.net/jdk/pull/8459 From duke at openjdk.java.net Wed May 18 05:17:44 2022 From: duke at openjdk.java.net (Srinivas Vamsi Parasa) Date: Wed, 18 May 2022 05:17:44 GMT Subject: RFR: 8285868: x86_64 intrinsics for floating point methods isNaN, isFinite and isInfinite [v6] In-Reply-To: References: Message-ID: > We develop optimized x86_64 intrinsics for the floating point class check methods `isNaN()`, `isFinite()` and `IsInfinite()` for Float and Double classes. 
JMH benchmarks show ~6x improvement for `isNan()`, ~2x improvement for `isInfinite()` and 40% gain for `isFinite()` using` vfpclasss(s/d)` instructions. > > > JMH Benchmark (ns/op) Baseline This PR (WITH vfpclassss/sd) Speedup > > FloatClassCheck.testIsFinite 0.559 0.4 1.4x > FloatClassCheck.testIsInfinite 0.828 0.386 2.15x > FloatClassCheck.testIsNaN 2.589 0.387 6.7x > DoubleClassCheck.testIsFinite 0.568 0.414 1.37x > DoubleClassCheck.testIsInfinite 0.836 0.395 2.11x > DoubleClassCheck.testIsNaN 2.592 0.393 6.6x > > JMH Benchmark (ns/op) Baseline This PR (WITHOUT vfpclassss/sd) Speedup > FloatClassCheck.testIsFinite 0.561 0.468 1.2x > FloatClassCheck.testIsInfinite 0.793 0.491 1.61x > FloatClassCheck.testIsNaN 2.587 0.469 5.5x > DoubleClassCheck.testIsFinite 0.561 0.592 0.94x > DoubleClassCheck.testIsInfinite 0.828 0.592 1.4x > DoubleClassCheck.testIsNaN 2.593 0.594 4.4x Srinivas Vamsi Parasa has updated the pull request incrementally with one additional commit since the last revision: use 0x1 to be simpler ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/8459/files - new: https://git.openjdk.java.net/jdk/pull/8459/files/cf13ec76..0fc8679c Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=8459&range=05 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=8459&range=04-05 Stats: 2 lines in 1 file changed: 0 ins; 0 del; 2 mod Patch: https://git.openjdk.java.net/jdk/pull/8459.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8459/head:pull/8459 PR: https://git.openjdk.java.net/jdk/pull/8459 From duke at openjdk.java.net Wed May 18 05:17:44 2022 From: duke at openjdk.java.net (Srinivas Vamsi Parasa) Date: Wed, 18 May 2022 05:17:44 GMT Subject: RFR: 8285868: x86_64 intrinsics for floating point methods isNaN, isFinite and isInfinite [v4] In-Reply-To: References: <5yRIuUsVJKpD4zIbZJFzxGlj5AsTHtf4cUDPB9BccJo=.daaae706-3e48-43ed-921c-6d0c6dcf864b@github.com> Message-ID: On Wed, 18 May 2022 04:26:23 GMT, Srinivas Vamsi Parasa 
wrote: >> src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 4785: >> >>> 4783: kmovbl(dst, tmp); >>> 4784: if (opcode == Op_IsFiniteF) { >>> 4785: xorl(dst, 0x00000001); >> >> `notl(dst)`? > > I can't recall why `notl(dst)` didn't work. Will try it and let you know. notl(dst) doesn't work. we need to flip only the last bit as the kreg stores the result in the last bit. ------------- PR: https://git.openjdk.java.net/jdk/pull/8459 From duke at openjdk.java.net Wed May 18 05:57:52 2022 From: duke at openjdk.java.net (Srinivas Vamsi Parasa) Date: Wed, 18 May 2022 05:57:52 GMT Subject: RFR: 8285868: x86_64 intrinsics for floating point methods isNaN, isFinite and isInfinite [v6] In-Reply-To: References: Message-ID: On Fri, 29 Apr 2022 00:36:18 GMT, Vladimir Kozlov wrote: > Impressive. Few comments. > > You are testing performance of storing `boolean` results into array but usually these Java methods used in conditions. Measuring that will be more real word case. For both case: with `avx512dq` On and OFF. > > And you need to post you perf results at least in RFE. Please, also show what instructions are currently generated vs your changes. I don't get how you made `isNaN()` faster - you generate more instructions is seems. > > Instead of 3 new Ideal nodes per type you can use one and store instrinsic id (or other enum) in its field which you can read in `.ad` file instructions. Instead I suggest to split those mach instructions based on `avx512dq` support to avoid unused registers killing. > > Why Double type support is limited to LP64? Why there is no `x86_32.ad` changes? > > You can reuse `tmp1` in `double_class_check()`. Hi Vladimir (@vnkozlov), Sorry for the delay! As per your suggestions, the JMH benchmarks were updated to use these Java methods in conditions and updated the RFE with performance data with and without the vfpclassss/d instructions. Also removed the redundant temp2 as per your suggestion. 
Will work on condensing the 3 nodes to one and add support for x86_32.ad. Thanks, Vamsi ------------- PR: https://git.openjdk.java.net/jdk/pull/8459 From duke at openjdk.java.net Wed May 18 05:57:55 2022 From: duke at openjdk.java.net (Srinivas Vamsi Parasa) Date: Wed, 18 May 2022 05:57:55 GMT Subject: RFR: 8285868: x86_64 intrinsics for floating point methods isNaN, isFinite and isInfinite [v4] In-Reply-To: References: <5yRIuUsVJKpD4zIbZJFzxGlj5AsTHtf4cUDPB9BccJo=.daaae706-3e48-43ed-921c-6d0c6dcf864b@github.com> Message-ID: On Wed, 18 May 2022 04:23:33 GMT, Srinivas Vamsi Parasa wrote: >> src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 4753: >> >>> 4751: switch (opcode) { >>> 4752: case Op_IsFiniteF: >>> 4753: setb(Assembler::below, dst); >> >> This partial write may stall later reads on `dst`, you could emit a `xor dst, dst` before doing the comparison. > > Will try the `xor dst, dst` and see the performance changes. setb either writes a 0 or 1 based on the condition. So, will it cause partial writes? ------------- PR: https://git.openjdk.java.net/jdk/pull/8459 From duke at openjdk.java.net Wed May 18 06:05:47 2022 From: duke at openjdk.java.net (Quan Anh Mai) Date: Wed, 18 May 2022 06:05:47 GMT Subject: RFR: 8285868: x86_64 intrinsics for floating point methods isNaN, isFinite and isInfinite [v4] In-Reply-To: References: <5yRIuUsVJKpD4zIbZJFzxGlj5AsTHtf4cUDPB9BccJo=.daaae706-3e48-43ed-921c-6d0c6dcf864b@github.com> Message-ID: On Wed, 18 May 2022 05:54:40 GMT, Srinivas Vamsi Parasa wrote: >> Will try the `xor dst, dst` and see the performance changes. > > setb either writes a 0 or 1 based on the condition. So, will it cause partial writes? Actually `setb` only writes the byte portion and leaves the remaining of the register intact, so it would be wrong without clearing the register beforehand. 
------------- PR: https://git.openjdk.java.net/jdk/pull/8459 From duke at openjdk.java.net Wed May 18 06:10:49 2022 From: duke at openjdk.java.net (Quan Anh Mai) Date: Wed, 18 May 2022 06:10:49 GMT Subject: RFR: 8285868: x86_64 intrinsics for floating point methods isNaN, isFinite and isInfinite [v6] In-Reply-To: References: Message-ID: On Wed, 18 May 2022 05:52:02 GMT, Srinivas Vamsi Parasa wrote: >> Impressive. Few comments. >> >> You are testing performance of storing `boolean` results into array but usually these Java methods used in conditions. Measuring that will be more real word case. For both case: with `avx512dq` On and OFF. >> >> And you need to post you perf results at least in RFE. Please, also show what instructions are currently generated vs your changes. I don't get how you made `isNaN()` faster - you generate more instructions is seems. >> >> Instead of 3 new Ideal nodes per type you can use one and store instrinsic id (or other enum) in its field which you can read in `.ad` file instructions. Instead I suggest to split those mach instructions based on `avx512dq` support to avoid unused registers killing. >> >> Why Double type support is limited to LP64? Why there is no `x86_32.ad` changes? >> >> You can reuse `tmp1` in `double_class_check()`. > >> Impressive. Few comments. >> >> You are testing performance of storing `boolean` results into array but usually these Java methods used in conditions. Measuring that will be more real word case. For both case: with `avx512dq` On and OFF. >> >> And you need to post you perf results at least in RFE. Please, also show what instructions are currently generated vs your changes. I don't get how you made `isNaN()` faster - you generate more instructions is seems. >> >> Instead of 3 new Ideal nodes per type you can use one and store instrinsic id (or other enum) in its field which you can read in `.ad` file instructions. 
Instead I suggest to split those mach instructions based on `avx512dq` support to avoid unused registers killing. >> >> Why Double type support is limited to LP64? Why there is no `x86_32.ad` changes? >> >> You can reuse `tmp1` in `double_class_check()`. > > Hi Vladimir (@vnkozlov), > Sorry for the delay! > As per your suggestions, the JMH benchmarks were updated to use these Java methods in conditions and updated the RFE with performance data with and without the vfpclassss/d instructions. Also removed the redundant temp2 as per your suggestion. > Will work on condensing the 3 nodes to one and add support for x86_32.ad. > > Thanks, > Vamsi @vamsi-parasa Yes #8525 change the matching of `Bool` to `cmpOpUCF` for eq/ne where both inputs are the same and add `CMove` rules for `cmpOpUCF2`, which should prevent floating point comparison from matching `cmpOpU`, which has bad overhead of fixing the flags. ------------- PR: https://git.openjdk.java.net/jdk/pull/8459 From jbhateja at openjdk.java.net Wed May 18 06:10:49 2022 From: jbhateja at openjdk.java.net (Jatin Bhateja) Date: Wed, 18 May 2022 06:10:49 GMT Subject: RFR: 8285868: x86_64 intrinsics for floating point methods isNaN, isFinite and isInfinite [v6] In-Reply-To: References: Message-ID: <-nfCUHmYevQi8eaDKHC3kre2d0yAkbAE3y9neU87o2I=.e054b453-34e3-45fa-a687-9137b5826175@github.com> On Wed, 18 May 2022 05:17:44 GMT, Srinivas Vamsi Parasa wrote: >> We develop optimized x86_64 intrinsics for the floating point class check methods `isNaN()`, `isFinite()` and `IsInfinite()` for Float and Double classes. JMH benchmarks show ~6x improvement for `isNan()`, ~2x improvement for `isInfinite()` and 40% gain for `isFinite()` using` vfpclasss(s/d)` instructions. 
>> >> >> JMH Benchmark (ns/op) Baseline This PR (WITH vfpclassss/sd) Speedup >> >> FloatClassCheck.testIsFinite 0.559 0.4 1.4x >> FloatClassCheck.testIsInfinite 0.828 0.386 2.15x >> FloatClassCheck.testIsNaN 2.589 0.387 6.7x >> DoubleClassCheck.testIsFinite 0.568 0.414 1.37x >> DoubleClassCheck.testIsInfinite 0.836 0.395 2.11x >> DoubleClassCheck.testIsNaN 2.592 0.393 6.6x >> >> JMH Benchmark (ns/op) Baseline This PR (WITHOUT vfpclassss/sd) Speedup >> FloatClassCheck.testIsFinite 0.561 0.468 1.2x >> FloatClassCheck.testIsInfinite 0.793 0.491 1.61x >> FloatClassCheck.testIsNaN 2.587 0.469 5.5x >> DoubleClassCheck.testIsFinite 0.561 0.592 0.94x >> DoubleClassCheck.testIsInfinite 0.828 0.592 1.4x >> DoubleClassCheck.testIsNaN 2.593 0.594 4.4x > > Srinivas Vamsi Parasa has updated the pull request incrementally with one additional commit since the last revision: > > use 0x1 to be simpler src/hotspot/share/opto/fpclassnode.hpp line 86: > 84: }; > 85: > 86: #endif // SHARE_OPTO_FPCLASSNODE_HPP You can move these to src/hotspot/share/opto/intrinsicnode.hpp ------------- PR: https://git.openjdk.java.net/jdk/pull/8459 From duke at openjdk.java.net Wed May 18 06:37:49 2022 From: duke at openjdk.java.net (Srinivas Vamsi Parasa) Date: Wed, 18 May 2022 06:37:49 GMT Subject: RFR: 8285868: x86_64 intrinsics for floating point methods isNaN, isFinite and isInfinite [v4] In-Reply-To: References: <5yRIuUsVJKpD4zIbZJFzxGlj5AsTHtf4cUDPB9BccJo=.daaae706-3e48-43ed-921c-6d0c6dcf864b@github.com> Message-ID: On Wed, 18 May 2022 06:02:27 GMT, Quan Anh Mai wrote: >> setb either writes a 0 or 1 based on the condition. So, will it cause partial writes? > > Actually `setb` only writes the byte portion and leaves the remaining of the register intact, so it would be wrong without clearing the register beforehand. `setb` is producing the correct results and also adding the `xor dst, dst` didn't give any performance improvement. Is it still necessary? 
------------- PR: https://git.openjdk.java.net/jdk/pull/8459 From duke at openjdk.java.net Wed May 18 06:37:50 2022 From: duke at openjdk.java.net (Srinivas Vamsi Parasa) Date: Wed, 18 May 2022 06:37:50 GMT Subject: RFR: 8285868: x86_64 intrinsics for floating point methods isNaN, isFinite and isInfinite [v6] In-Reply-To: <-nfCUHmYevQi8eaDKHC3kre2d0yAkbAE3y9neU87o2I=.e054b453-34e3-45fa-a687-9137b5826175@github.com> References: <-nfCUHmYevQi8eaDKHC3kre2d0yAkbAE3y9neU87o2I=.e054b453-34e3-45fa-a687-9137b5826175@github.com> Message-ID: On Wed, 18 May 2022 06:07:00 GMT, Jatin Bhateja wrote: >> Srinivas Vamsi Parasa has updated the pull request incrementally with one additional commit since the last revision: >> >> use 0x1 to be simpler > > src/hotspot/share/opto/fpclassnode.hpp line 86: > >> 84: }; >> 85: >> 86: #endif // SHARE_OPTO_FPCLASSNODE_HPP > > You can move these to src/hotspot/share/opto/intrinsicnode.hpp Thanks Jatin! Will make the change and push the update. ------------- PR: https://git.openjdk.java.net/jdk/pull/8459 From jbhateja at openjdk.java.net Wed May 18 06:50:00 2022 From: jbhateja at openjdk.java.net (Jatin Bhateja) Date: Wed, 18 May 2022 06:50:00 GMT Subject: RFR: 8285868: x86_64 intrinsics for floating point methods isNaN, isFinite and isInfinite [v4] In-Reply-To: References: <5yRIuUsVJKpD4zIbZJFzxGlj5AsTHtf4cUDPB9BccJo=.daaae706-3e48-43ed-921c-6d0c6dcf864b@github.com> Message-ID: On Wed, 18 May 2022 06:34:34 GMT, Srinivas Vamsi Parasa wrote: >> Actually `setb` only writes the byte portion and leaves the remaining of the register intact, so it would be wrong without clearing the register beforehand. > > `setb` is producing the correct results and also adding the `xor dst, dst` didn't give any performance improvement. Is it still necessary? > This partial write may stall later reads on `dst`, you could emit a `xor dst, dst` before doing the comparison. APIs generate boolean result, and any reader should only be consuming 8 bit result. 
Thus its not a scenario for partial register stall from HW perspective, a narrow write followed by wider read cause a partial register stall due to additional cycles penalty incurred while re-assembling the unmodified bits with modified portion. ------------- PR: https://git.openjdk.java.net/jdk/pull/8459 From jbhateja at openjdk.java.net Wed May 18 06:50:12 2022 From: jbhateja at openjdk.java.net (Jatin Bhateja) Date: Wed, 18 May 2022 06:50:12 GMT Subject: RFR: 8285868: x86_64 intrinsics for floating point methods isNaN, isFinite and isInfinite [v6] In-Reply-To: References: Message-ID: On Wed, 18 May 2022 05:17:44 GMT, Srinivas Vamsi Parasa wrote: >> We develop optimized x86_64 intrinsics for the floating point class check methods `isNaN()`, `isFinite()` and `IsInfinite()` for Float and Double classes. JMH benchmarks show ~6x improvement for `isNan()`, ~2x improvement for `isInfinite()` and 40% gain for `isFinite()` using` vfpclasss(s/d)` instructions. >> >> >> JMH Benchmark (ns/op) Baseline This PR (WITH vfpclassss/sd) Speedup >> >> FloatClassCheck.testIsFinite 0.559 0.4 1.4x >> FloatClassCheck.testIsInfinite 0.828 0.386 2.15x >> FloatClassCheck.testIsNaN 2.589 0.387 6.7x >> DoubleClassCheck.testIsFinite 0.568 0.414 1.37x >> DoubleClassCheck.testIsInfinite 0.836 0.395 2.11x >> DoubleClassCheck.testIsNaN 2.592 0.393 6.6x >> >> JMH Benchmark (ns/op) Baseline This PR (WITHOUT vfpclassss/sd) Speedup >> FloatClassCheck.testIsFinite 0.561 0.468 1.2x >> FloatClassCheck.testIsInfinite 0.793 0.491 1.61x >> FloatClassCheck.testIsNaN 2.587 0.469 5.5x >> DoubleClassCheck.testIsFinite 0.561 0.592 0.94x >> DoubleClassCheck.testIsInfinite 0.828 0.592 1.4x >> DoubleClassCheck.testIsNaN 2.593 0.594 4.4x > > Srinivas Vamsi Parasa has updated the pull request incrementally with one additional commit since the last revision: > > use 0x1 to be simpler src/hotspot/cpu/x86/x86_64.ad line 11183: > 11181: ins_pipe(pipe_slow); > 11182: %} > 11183: Should this be guarded by LP64 ? 
since associated macro assembly routines have that check ? Also match_rule_supported vector should prevent intrinsifying these for 32 bit platform if you do not plan to handle them. src/hotspot/share/opto/fpclassnode.hpp line 65: > 63: virtual int Opcode() const; > 64: const Type* bottom_type() const { return TypeInt::BOOL; } > 65: virtual uint ideal_reg() const { return Op_RegI; } None of the IR nodes handle constant folding scenarios. test/micro/org/openjdk/bench/java/lang/DoubleClassCheck.java line 70: > 68: public void testIsFinite() { > 69: for (int i = 0; i < BUFFER_SIZE; i++) { > 70: outputs[i] = Double.isFinite(inputs[i]) ? false : true; Any specific reason to explicitly add conditional check to return true/false when isFinite returns bool value ? test/micro/org/openjdk/bench/java/lang/DoubleClassCheck.java line 78: > 76: public void testIsInfinite() { > 77: for (int i = 0; i < BUFFER_SIZE; i++) { > 78: outputs[i] = Double.isInfinite(inputs[i]) ? false : true; Any specific reason to explicitly add conditional check to return true/false when isInfinit returns bool value ? test/micro/org/openjdk/bench/java/lang/DoubleClassCheck.java line 86: > 84: public void testIsNaN() { > 85: for (int i = 0; i < BUFFER_SIZE; i++) { > 86: outputs[i] = Double.isNaN(inputs[i]) ? false : true; Any specific reason to explicitly add conditional check to return true/false when isNaN returns bool value ? test/micro/org/openjdk/bench/java/lang/FloatClassCheck.java line 45: > 43: public class FloatClassCheck { > 44: > 45: RandomGenerator rng; Just a suggestion we can also create one benchmark to handle both the floating point types. test/micro/org/openjdk/bench/java/lang/FloatClassCheck.java line 70: > 68: public void testIsFinite() { > 69: for (int i = 0; i < BUFFER_SIZE; i++) { > 70: outputs[i] = Float.isFinite(inputs[i]) ? 
false : true;

Same as above

test/micro/org/openjdk/bench/java/lang/FloatClassCheck.java line 78:

> 76:     public void testIsInfinite() {
> 77:         for (int i = 0; i < BUFFER_SIZE; i++) {
> 78:             outputs[i] = Float.isInfinite(inputs[i]) ? false : true;

Same as above

test/micro/org/openjdk/bench/java/lang/FloatClassCheck.java line 86:

> 84:     public void testIsNaN() {
> 85:         for (int i = 0; i < BUFFER_SIZE; i++) {
> 86:             outputs[i] = Float.isNaN(inputs[i]) ? false : true;

Same as above

-------------

PR: https://git.openjdk.java.net/jdk/pull/8459

From duke at openjdk.java.net Wed May 18 07:01:00 2022
From: duke at openjdk.java.net (Haomin)
Date: Wed, 18 May 2022 07:01:00 GMT
Subject: RFR: 8286847: Rotate vectors don't support byte or short [v3]
In-Reply-To:
References:
Message-ID:

> static void test_fun(byte[] a0, int[] b0, byte[] c0) {
>     for (int i=0; i<a0.length; i++) {
>         c0[i] = (byte)(a0[i] << (7) | a0[i] >>> (-7));
>     }
> }
>
>
> when I implement RotateLeftV in loongarch.ad, I found this executed by c2 vector and executed by interpreter are not equal.
>
> It's executed on x86 would create an assert error.
>
>
> # Internal Error (/home/wanghaomin/jdk/src/hotspot/share/opto/vectornode.cpp:347), pid=26469, tid=26485
> # assert(false) failed: not supported: byte
>
>
> RotateLeftV for byte, short values produces incorrect Java result. Because java code should convert a byte, short value into int value, and then do RotateI.
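
A minimal Java sketch of the mismatch described above. The class and method names are hypothetical, and `rotate8` only models what a lane-wise 8-bit rotate would compute; it is not the actual `RotateLeftV` implementation on any platform. The first method is the expression from `test_fun`, where Java promotes the `byte` operands to `int` before shifting, so the shift count `-7` is masked to 25.

```java
class ByteRotateDemo {
    // The expression from test_fun: both operands are promoted to int,
    // so (a >>> -7) is really (a >>> 25) on a 32-bit value.
    static byte javaSemantics(byte a) {
        return (byte) (a << 7 | a >>> -7);
    }

    // Hypothetical model of a true 8-bit rotate-left on one byte lane.
    static byte rotate8(byte a, int shift) {
        int v = a & 0xFF;          // treat the lane as an unsigned 8-bit value
        int s = shift & 7;         // rotate distance modulo the lane width
        return (byte) ((v << s) | (v >>> (8 - s)));
    }

    public static void main(String[] args) {
        byte a = 3;
        System.out.println(javaSemantics(a)); // prints -128
        System.out.println(rotate8(a, 7));    // prints -127
    }
}
```

For `a = 3` the promoted-int expression gives -128 while an 8-bit rotate gives -127, which is why the pattern cannot simply be matched as a byte rotate.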
Haomin has updated the pull request incrementally with one additional commit since the last revision: merge the two cases into one ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/8740/files - new: https://git.openjdk.java.net/jdk/pull/8740/files/05ebce18..ba5428de Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=8740&range=02 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=8740&range=01-02 Stats: 364 lines in 3 files changed: 156 ins; 208 del; 0 mod Patch: https://git.openjdk.java.net/jdk/pull/8740.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8740/head:pull/8740 PR: https://git.openjdk.java.net/jdk/pull/8740 From duke at openjdk.java.net Wed May 18 07:03:57 2022 From: duke at openjdk.java.net (Srinivas Vamsi Parasa) Date: Wed, 18 May 2022 07:03:57 GMT Subject: RFR: 8285868: x86_64 intrinsics for floating point methods isNaN, isFinite and isInfinite [v6] In-Reply-To: References: Message-ID: On Wed, 18 May 2022 05:52:02 GMT, Srinivas Vamsi Parasa wrote: >> Impressive. Few comments. >> >> You are testing performance of storing `boolean` results into array but usually these Java methods used in conditions. Measuring that will be more real word case. For both case: with `avx512dq` On and OFF. >> >> And you need to post you perf results at least in RFE. Please, also show what instructions are currently generated vs your changes. I don't get how you made `isNaN()` faster - you generate more instructions is seems. >> >> Instead of 3 new Ideal nodes per type you can use one and store instrinsic id (or other enum) in its field which you can read in `.ad` file instructions. Instead I suggest to split those mach instructions based on `avx512dq` support to avoid unused registers killing. >> >> Why Double type support is limited to LP64? Why there is no `x86_32.ad` changes? >> >> You can reuse `tmp1` in `double_class_check()`. > >> Impressive. Few comments. 
>> >> You are testing performance of storing `boolean` results into array but usually these Java methods used in conditions. Measuring that will be more real word case. For both case: with `avx512dq` On and OFF. >> >> And you need to post you perf results at least in RFE. Please, also show what instructions are currently generated vs your changes. I don't get how you made `isNaN()` faster - you generate more instructions is seems. >> >> Instead of 3 new Ideal nodes per type you can use one and store instrinsic id (or other enum) in its field which you can read in `.ad` file instructions. Instead I suggest to split those mach instructions based on `avx512dq` support to avoid unused registers killing. >> >> Why Double type support is limited to LP64? Why there is no `x86_32.ad` changes? >> >> You can reuse `tmp1` in `double_class_check()`. > > Hi Vladimir (@vnkozlov), > Sorry for the delay! > As per your suggestions, the JMH benchmarks were updated to use these Java methods in conditions and updated the RFE with performance data with and without the vfpclassss/d instructions. Also removed the redundant temp2 as per your suggestion. > Will work on condensing the 3 nodes to one and add support for x86_32.ad. > > Thanks, > Vamsi > @vamsi-parasa Yes #8525 change the matching of `Bool` to `cmpOpUCF` for eq/ne where both inputs are the same and add `CMove` rules for `cmpOpUCF2`, which should prevent floating point comparison from matching `cmpOpU`, which has bad overhead of fixing the flags. Glad to know that you removed the overhead of fixing up the flags register. 
------------- PR: https://git.openjdk.java.net/jdk/pull/8459 From duke at openjdk.java.net Wed May 18 07:03:58 2022 From: duke at openjdk.java.net (Quan Anh Mai) Date: Wed, 18 May 2022 07:03:58 GMT Subject: RFR: 8285868: x86_64 intrinsics for floating point methods isNaN, isFinite and isInfinite [v4] In-Reply-To: References: <5yRIuUsVJKpD4zIbZJFzxGlj5AsTHtf4cUDPB9BccJo=.daaae706-3e48-43ed-921c-6d0c6dcf864b@github.com> Message-ID: On Wed, 18 May 2022 06:34:34 GMT, Srinivas Vamsi Parasa wrote: >> Actually `setb` only writes the byte portion and leaves the remaining of the register intact, so it would be wrong without clearing the register beforehand. > > `setb` is producing the correct results and also adding the `xor dst, dst` didn't give any performance improvement. Is it still necessary? @vamsi-parasa Other boolean producers such as `Conv2B` or `VectorTest` widen the byte value to 32 bit after `setb` so I believe a zeroing is necessary here. You accidentally achieve correct results in tests because the boolean stores only read the least significant 8 bits of the `dst` registers. Other operations such as `test` will read full `int` and may lead to incorrect results. @jatin-bhateja Boolean is just dressed-up int except in stores so most readers will consume 32 bits of the register. Thanks. 
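The point about `setb` can be modelled in a few lines of Java. This is only a sketch under stated assumptions: the "register" is a plain `int`, and the class and method names are hypothetical helpers, not HotSpot code. It shows why a byte store can look correct while a 32-bit `test` on the same register is not.

```java
// Hypothetical model of a 32-bit register; not taken from HotSpot.
final class SetbDemo {
    // Models `setcc r8`: only the low 8 bits of the register are written,
    // the upper 24 bits keep whatever value they had before.
    static int setb(int reg, boolean cond) {
        return (reg & 0xFFFFFF00) | (cond ? 1 : 0);
    }

    // Models a boolean store, which reads only the least significant byte.
    static byte storeByte(int reg) {
        return (byte) reg;
    }

    // Models `test r32, r32` followed by a branch on non-zero.
    static boolean testNonZero(int reg) {
        return reg != 0;
    }

    public static void main(String[] args) {
        int stale = 0x12345600;              // leftover upper bits in dst
        int noZeroing = setb(stale, false);  // setb without clearing dst first
        int withZeroing = setb(0, false);    // dst cleared first, then setb

        System.out.println(storeByte(noZeroing));     // byte store sees false
        System.out.println(testNonZero(noZeroing));   // 32-bit read sees true
        System.out.println(testNonZero(withZeroing)); // 32-bit read sees false
    }
}
```

In this model the benchmark, which only stores the byte, cannot observe the stale upper bits, while any consumer that reads the full register can.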
------------- PR: https://git.openjdk.java.net/jdk/pull/8459 From duke at openjdk.java.net Wed May 18 07:04:03 2022 From: duke at openjdk.java.net (Srinivas Vamsi Parasa) Date: Wed, 18 May 2022 07:04:03 GMT Subject: RFR: 8285868: x86_64 intrinsics for floating point methods isNaN, isFinite and isInfinite [v6] In-Reply-To: References: Message-ID: On Wed, 18 May 2022 06:17:37 GMT, Jatin Bhateja wrote: >> Srinivas Vamsi Parasa has updated the pull request incrementally with one additional commit since the last revision: >> >> use 0x1 to be simpler > > test/micro/org/openjdk/bench/java/lang/DoubleClassCheck.java line 70: > >> 68: public void testIsFinite() { >> 69: for (int i = 0; i < BUFFER_SIZE; i++) { >> 70: outputs[i] = Double.isFinite(inputs[i]) ? false : true; > > Is there a specific reason to add an explicit conditional that returns true/false when isFinite already returns a boolean value? Initially, the benchmark stored the returned boolean value directly in the output buffer. Vladimir suggested that the real-world use case of these methods is in conditions, so the benchmarks were modified accordingly. ------------- PR: https://git.openjdk.java.net/jdk/pull/8459 From duke at openjdk.java.net Wed May 18 07:05:00 2022 From: duke at openjdk.java.net (Haomin) Date: Wed, 18 May 2022 07:05:00 GMT Subject: RFR: 8286847: Rotate vectors don't support byte or short [v2] In-Reply-To: <2V5Rgl0oZfnB3HoRD_dAQg9Krz0Dp70EXUVJsHCtRJc=.f1b775b7-a071-4ecf-8825-9037e4e7b150@github.com> References: <2V5Rgl0oZfnB3HoRD_dAQg9Krz0Dp70EXUVJsHCtRJc=.f1b775b7-a071-4ecf-8825-9037e4e7b150@github.com> Message-ID: On Wed, 18 May 2022 03:11:31 GMT, Eric Liu wrote: > I prefer to merge these two test cases into a single one. 
DONE ------------- PR: https://git.openjdk.java.net/jdk/pull/8740 From duke at openjdk.java.net Wed May 18 07:05:04 2022 From: duke at openjdk.java.net (Haomin) Date: Wed, 18 May 2022 07:05:04 GMT Subject: RFR: 8286847: Rotate vectors don't support byte or short [v3] In-Reply-To: References: Message-ID: On Wed, 18 May 2022 04:12:39 GMT, Xiaohong Gong wrote: >> Haomin has updated the pull request incrementally with one additional commit since the last revision: >> >> merge the two cases into one > > src/hotspot/share/opto/vectornode.cpp line 158: > >> 156: default: return 0; // RotateLeftV for byte, short values produces incorrect Java result. >> 157: // Because java code should convert a byte, short value into int value, >> 158: // and then do RotateI. > > I'm afraid this will influence the VectorAPI intrinsification for `ByteVector` and `ShortVector`. Did you test the benchmarks for these two APIs with the subword type? I have tested `hotspot_vector_1` and `jdk_vector`, all pass. ------------- PR: https://git.openjdk.java.net/jdk/pull/8740 From duke at openjdk.java.net Wed May 18 07:18:59 2022 From: duke at openjdk.java.net (Emanuel Peter) Date: Wed, 18 May 2022 07:18:59 GMT Subject: RFR: 8283775: VM support for graph querying in debugger with BFS traversal and node filtering [v10] In-Reply-To: References: Message-ID: On Fri, 13 May 2022 11:54:37 GMT, Emanuel Peter wrote: >> I implemented VM support for BFS traversal of IR nodes, where one can filter for input/output edges, and the node-type (control / data / memory). If one specifies the `target` node, we find a shortest path between `this` and `target`. With the `options` string one can easily select which node types to traverse. >> >> `void Node::print_bfs(const uint max_distance, Node* target, const char* options)` >> >> While a sufficient summary and description is in the comments above the function definition, I want to explain some useful use-cases here. 
>> >> Please let me know if you would find this helpful, or if you have any feedback to improve it. >> Thanks, Emanuel >> >> **1. Better dump()** >> The function is similar to `dump()`, we can also follow input / output edges up to a certain distance. There are a few improvements/added features: >> >> 1. Display the distance to the specified node. This way I can quickly assess how many nodes I have at a certain distance, and I can traverse edges from one distance to the next. The parent column shows the node one step closer to the BFS root (this). >> 2. Node filtering with types taken from `type->category()` help one limit traversal to control, data or memory flow. The categories are the same as in IGV. >> 3. Probably controversial: coloring of node types in terminal. By default it is off. May not work on all terminals. But I find it very useful to quickly separate memory, data and control - just as in IGV - but in my terminal! >> 4. After matching, we create Mach nodes from the IR nodes. For most Mach nodes, there is an associated old node. Often an assert is triggered for Mach nodes, but the graph already breaks in the IR graph. It is thus very helpful to have the old node idx displayed. >> 5. I also display the head node idx of the block a Mach node is scheduled for. This can be helpful if one gets bad dominance asserts, and needs to see why the blocks were scheduled in a bad way. >> >> Example: >> >> (rr) p find_node(35)->print_bfs(2, 0, "cdmox+") >> No target: perform BFS. >> dis par c dump >> --------------------------------------------- >> 0 35 d 35 CmpP === _ 34 25 [[ 36 ]] >> 1 35 d 34 LoadP === _ 31 33 [[ 35 ]] >> 1 35 d 25 ConP === 0 [[ 26 27 31 35 41 ]] #NULL >> 2 34 m 31 StoreP === 20 27 29 25 [[ 23 34 41 42 ]] >> 2 34 d 33 AddP === _ 1 12 32 [[ 34 ]] >> >> >> Example with Mach nodes: >> >> (rr) p ctrl->print_bfs(4, 0, "cdmox+OB") >> No target: perform BFS. 
>> dis [head idom d] old par c dump >> --------------------------------------------- >> 0 159 147 6 _ 159 c 159 Region === 159 57 [[ 159 158 59 ]] >> 1 147 148 5 o183 159 c 57 IfTrue === 8 [[ 159 ]] >> 2 147 148 5 o182 57 c 8 jmpConU === 147 9 [[ 7 57 ]] >> 3 147 148 5 _ 8 c 147 Region === 147 14 [[ 147 8 ]] >> 3 147 148 5 o180 8 d 9 compUL_rReg === _ 10 13 [[ 8 ]] >> 4 148 149 4 o174 147 c 14 IfTrue === 15 [[ 147 ]] >> 4 147 148 5 o203 9 d 10 decL_rReg === _ 11 [[ 12 9 ]] >> 4 147 148 5 o179 9 d 13 convI2L_reg_reg === _ 28 [[ 9 ]] >> >> >> **2. Find loop body** >> When I find myself in a loop, I try to localize the loop head and end, and map out at least one path between them. >> `loop_end->print_bfs(20, loop_head, "cox+")` >> This provides us with a shortest path, given this path has a distance of at most 20. >> >> Example: >> >> (rr) p find_node(158)->print_bfs(20, find_node(160), "cox+") >> Find shortest path: 158 -> 160. >> >> Backtrace target. >> dis c dump >> --------------------------------------------- >> 9 c 160 OuterStripMinedLoop === 160 339 159 [[ 160 358 ]] >> 8 c 358 CountedLoop === 358 160 143 [[ 358 362 363 ]] >> 7 c 363 If === 358 351 [[ 364 367 ]] >> 6 c 364 IfTrue === 363 [[ 128 ]] >> 5 c 128 If === 364 127 [[ 129 130 ]] >> 4 c 129 IfTrue === 128 [[ 155 ]] >> 3 c 155 CountedLoopEnd === 129 154 [[ 157 143 ]] [lt] >> 2 c 157 IfFalse === 155 [[ 162 163 ]] >> 1 c 162 SafePoint === 157 1 7 1 1 163 100 1 1 13 27 133 [[ 158 ]] >> 0 c 158 OuterStripMinedLoopEnd === 162 156 [[ 159 227 ]] >> >> Example with Mach nodes: >> >> (rr) p ctrl->print_bfs(10, val, "cdmox-+OB") >> Find shortest path: 159 -> 27. >> >> Backtrace target. 
>> dis [head idom d] old e c dump >> --------------------------------------------- >> 2 24 1 2 o10 + d 27 MachProj === 24 [[ 19 28 4 59 95 99 118 ]] >> 1 56 159 7 o239 - d 59 loadB === 159 29 27 60 [[ 55 ]] >> 0 159 147 6 _ c 159 Region === 159 57 [[ 159 158 59 ]] > > Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: > > fixing file copyright year Feedback from wider group: I will remove `Node::dump_related` and related functions. It seems it was added 7 years ago and people did not use it, and it is not even implemented fully for all node types. I will improve my query mechanism to not just filter during the traversal, but also optionally include boundary nodes of the search. I will make sure that all of the `dump` functions use this new query mechanism underneath. ------------- PR: https://git.openjdk.java.net/jdk/pull/8468 From jbhateja at openjdk.java.net Wed May 18 07:38:03 2022 From: jbhateja at openjdk.java.net (Jatin Bhateja) Date: Wed, 18 May 2022 07:38:03 GMT Subject: RFR: 8283894: Intrinsify compress and expand bits on x86 [v4] In-Reply-To: References: Message-ID: > Summary of changes: > > - Patch intrinsifies following newly added Java SE APIs > - Integer.compress > - Integer.expand > - Long.compress > - Long.expand > > - Adds C2 IR nodes and corresponding ideal transformations for new operations. > - We see around ~10x performance speedup due to intrinsification over X86 target. > - Adds an IR framework based test to validate newly introduced IR transformations. > > Kindly review and share your feedback. > > Best Regards, > Jatin Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: 8283894: Review comments resolved. 
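For readers unfamiliar with the compress and expand operations listed in the summary above, the following is a plain-Java reference sketch of their bit-level meaning for `int`. It is a loop-based illustration with hypothetical names, not the intrinsic and not the JDK library code; the example values in the comments follow the ones used in the `Integer.compress` and `Integer.expand` API documentation.

```java
// Hypothetical reference implementation for illustration only.
final class BitCompressExpandDemo {
    // Gathers the bits of `value` selected by `mask` into the low-order
    // bits of the result, preserving their relative order.
    static int compress(int value, int mask) {
        int result = 0;
        int outBit = 0;
        for (int bit = 0; bit < 32; bit++) {
            if ((mask >>> bit & 1) != 0) {
                result |= (value >>> bit & 1) << outBit;
                outBit++;
            }
        }
        return result;
    }

    // Scatters the low-order bits of `value` to the positions selected by
    // `mask`; all other result bits are zero.
    static int expand(int value, int mask) {
        int result = 0;
        int inBit = 0;
        for (int bit = 0; bit < 32; bit++) {
            if ((mask >>> bit & 1) != 0) {
                result |= (value >>> inBit & 1) << bit;
                inBit++;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // compress(0xCAFEBABE, 0xFF00FFF0) gathers 0xCA and 0xBAB into 0xCABAB.
        System.out.println(Integer.toHexString(compress(0xCAFEBABE, 0xFF00FFF0)));
        // expand(0xCABAB, 0xFF00FFF0) scatters them back to 0xCA00BAB0.
        System.out.println(Integer.toHexString(expand(0x000CABAB, 0xFF00FFF0)));
    }
}
```

Each call walks all 32 bit positions, which gives a feel for why replacing such loops with dedicated instructions can yield a large speedup.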
------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/8498/files - new: https://git.openjdk.java.net/jdk/pull/8498/files/93aa5e2d..1cbb353a Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=8498&range=03 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=8498&range=02-03 Stats: 154 lines in 4 files changed: 63 ins; 46 del; 45 mod Patch: https://git.openjdk.java.net/jdk/pull/8498.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8498/head:pull/8498 PR: https://git.openjdk.java.net/jdk/pull/8498 From duke at openjdk.java.net Wed May 18 07:47:59 2022 From: duke at openjdk.java.net (Srinivas Vamsi Parasa) Date: Wed, 18 May 2022 07:47:59 GMT Subject: RFR: 8285868: x86_64 intrinsics for floating point methods isNaN, isFinite and isInfinite [v4] In-Reply-To: References: <5yRIuUsVJKpD4zIbZJFzxGlj5AsTHtf4cUDPB9BccJo=.daaae706-3e48-43ed-921c-6d0c6dcf864b@github.com> Message-ID: On Wed, 18 May 2022 07:00:10 GMT, Quan Anh Mai wrote: >> `setb` is producing the correct results and also adding the `xor dst, dst` didn't give any performance improvement. Is it still necessary? > > @vamsi-parasa Other boolean producers such as `Conv2B` or `VectorTest` widen the byte value to 32 bit after `setb` so I believe a zeroing is necessary here. You accidentally achieve correct results in tests because the boolean stores only read the least significant 8 bits of the `dst` registers. Other operations such as `test` will read full `int` and may lead to incorrect results. > > @jatin-bhateja Boolean is just dressed-up int except in stores so most readers will consume 32 bits of the register. > > Thanks. @merykitty You might be right. `setb` was causing a weird JDK build error in `java.lang.HashMap` due to an error in `isNaN()`. It got fixed by clearing the higher bytes of the `dst` register which is being set by `setb`. 
------------- PR: https://git.openjdk.java.net/jdk/pull/8459 From jbhateja at openjdk.java.net Wed May 18 07:50:39 2022 From: jbhateja at openjdk.java.net (Jatin Bhateja) Date: Wed, 18 May 2022 07:50:39 GMT Subject: RFR: 8283894: Intrinsify compress and expand bits on x86 [v5] In-Reply-To: References: Message-ID: > Summary of changes: > > - Patch intrinsifies following newly added Java SE APIs > - Integer.compress > - Integer.expand > - Long.compress > - Long.expand > > - Adds C2 IR nodes and corresponding ideal transformations for new operations. > - We see around ~10x performance speedup due to intrinsification over X86 target. > - Adds an IR framework based test to validate newly introduced IR transformations. > > Kindly review and share your feedback. > > Best Regards, > Jatin Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: 8283894: Updating test tag spec. ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/8498/files - new: https://git.openjdk.java.net/jdk/pull/8498/files/1cbb353a..666e8589 Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=8498&range=04 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=8498&range=03-04 Stats: 3 lines in 1 file changed: 2 ins; 0 del; 1 mod Patch: https://git.openjdk.java.net/jdk/pull/8498.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8498/head:pull/8498 PR: https://git.openjdk.java.net/jdk/pull/8498 From thartmann at openjdk.java.net Wed May 18 08:42:41 2022 From: thartmann at openjdk.java.net (Tobias Hartmann) Date: Wed, 18 May 2022 08:42:41 GMT Subject: RFR: 8286182: C2: crash with SIGFPE when executing compiled code In-Reply-To: References: Message-ID: On Mon, 16 May 2022 12:36:43 GMT, Martin Doerr wrote: > The bug is not assigned to me, but I have seen that the C2 code which checks for div by 0 is not aware of the new nodes from [JDK-8284742](https://bugs.openjdk.java.net/browse/JDK-8284742). 
> This fixes the VM to pass the reproducer. I'm not sure if more opcode checks are required to get added. I completely agree with Christian, we should **not** let the division float above its zero check for various reasons: - The graph is in a broken/inconsistent state that might lead to all kinds of subsequent issues. For example, if the division floats upwards, the divisor can become statically known to be zero. The division could then be replaced by TOP which would propagate downwards while the zero check is not necessarily removed as well, leading to an unschedulable graph. The same can happen with overflows. - Catching the SIGFPE has an impact on performance and can potentially happen in a hot path. - There is a risk of ignoring a "real" SIGFPE caused by a C2 bug. For example, if C2 erroneously removes the zero check, we would just continue execution on a SIGFPE, making things worse by essentially converting a crash to a wrong execution issue that might lead to all kinds of weird failures that are hard to debug. Given that this issue leads to massive failures in our fuzzer testing, I would suggest to back out [JDK-8284742](https://bugs.openjdk.java.net/browse/JDK-8284742) for now and properly re-implement it. ------------- PR: https://git.openjdk.java.net/jdk/pull/8726 From mgronlun at openjdk.java.net Wed May 18 09:09:52 2022 From: mgronlun at openjdk.java.net (Markus Grönlund) Date: Wed, 18 May 2022 09:09:52 GMT Subject: Integrated: 8280844: Epoch shift synchronization point for Compiler threads is inadequate In-Reply-To: References: Message-ID: On Mon, 16 May 2022 10:17:42 GMT, Markus Grönlund wrote: > Greetings, > > [JDK-8233111](https://bugs.openjdk.java.net/browse/JDK-8233111) attempted to address artefact tagging for Compiler threads, letting threads run _thread_in_native to avoid the transition. Unfortunately, that attempt proved inadequate. > > The epoch race is avoided only by performing the transition to _thread_in_vm. 
> > Testing: jdk_jfr > > Thanks > Markus This pull request has now been integrated. Changeset: d936c302 Author: Markus Grönlund URL: https://git.openjdk.java.net/jdk/commit/d936c3024acf428df6d1fb3064a1d8aa5038d277 Stats: 100 lines in 6 files changed: 12 ins; 84 del; 4 mod 8280844: Epoch shift synchronization point for Compiler threads is inadequate Reviewed-by: egahlin ------------- PR: https://git.openjdk.java.net/jdk/pull/8724 From thartmann at openjdk.java.net Wed May 18 09:47:54 2022 From: thartmann at openjdk.java.net (Tobias Hartmann) Date: Wed, 18 May 2022 09:47:54 GMT Subject: RFR: 8286870: Memory leak with RepeatCompilation In-Reply-To: <0SwF1Qb_W-aDBnwRcLLtBCb2JzfTpNjYhaCYApP_Z6M=.de140365-682c-422b-8fc5-5c9394cf631a@github.com> References: <0SwF1Qb_W-aDBnwRcLLtBCb2JzfTpNjYhaCYApP_Z6M=.de140365-682c-422b-8fc5-5c9394cf631a@github.com> Message-ID: On Tue, 17 May 2022 11:50:38 GMT, Tobias Hartmann wrote: > While using `RepeatCompilation` in combination with replay compilation and stress options to reproduce an intermittent issue, I noticed that it does not free the compiler thread arena after each compilation, leading to a (temporary) memory leak and out of memory errors. For example, each compilation of [JDK-8280696](https://bugs.openjdk.java.net/browse/JDK-8280696) allocates an additional 1218 kB. > > The fix is to simply add a `ResourceMark` in the loop. > > Thanks, > Tobias Thanks for the review, Vladimir. 
The github failures seem unrelated and due to missing Loom support on 32-bit: testEnablePreview (--enable-preview) # Internal Error (stubGenerator_x86_32.cpp:3877), pid=33952, tid=33955 # Error: Unimplemented() Which is this stub: RuntimeStub* generate_cont_doYield() { if (!Continuations::enabled()) return nullptr; Unimplemented(); return nullptr; } ------------- PR: https://git.openjdk.java.net/jdk/pull/8744 From mdoerr at openjdk.java.net Wed May 18 09:52:00 2022 From: mdoerr at openjdk.java.net (Martin Doerr) Date: Wed, 18 May 2022 09:52:00 GMT Subject: RFR: 8286182: C2: crash with SIGFPE when executing compiled code In-Reply-To: References: Message-ID: On Mon, 16 May 2022 12:36:43 GMT, Martin Doerr wrote: > The bug is not assigned to me, but I have seen that the C2 code which checks for div by 0 is not aware of the new nodes from [JDK-8284742](https://bugs.openjdk.java.net/browse/JDK-8284742). > This fixes the VM to pass the reproducer. I'm not sure if more opcode checks are required to get added. I agree, too. Catching SIGFPE was an interesting experiment, but I didn't want to propose it as fix. If we decide for a complete backout of [JDK-8284742](https://bugs.openjdk.java.net/browse/JDK-8284742), we need to backout [JDK-8285390](https://bugs.openjdk.java.net/browse/JDK-8285390) as well. (Alternatively, we could only revert the x86 parts. Current implementation works on PPC64.) ------------- PR: https://git.openjdk.java.net/jdk/pull/8726 From jvernee at openjdk.java.net Wed May 18 09:53:31 2022 From: jvernee at openjdk.java.net (Jorn Vernee) Date: Wed, 18 May 2022 09:53:31 GMT Subject: Integrated: 8283689: Update the foreign linker VM implementation In-Reply-To: References: Message-ID: <96UqTMEdAdl4yleY0IBU6LJjL8ww23IUcktML7Aa8UM=.9bfa69dc-720b-43aa-9b77-147b29d925cd@github.com> On Fri, 25 Mar 2022 13:48:20 GMT, Jorn Vernee wrote: > Hi, > > This PR updates the VM implementation of the foreign linker, by bringing over commits from the panama-foreign repo. 
> > This is split off from the main JEP integration for 19, since we have limited resources to handle this. As such, this PR might fall over to 20, but it would be nice if we could get it into 19. > > I've written up an overview of the Linker architecture here: http://cr.openjdk.java.net/~jvernee/docs/FL_Overview.html. It might be useful to read that first. > > This patch moves from the "legacy" implementation to what is currently implemented in the panama-foreign repo, except for replacing the use of method handle combinators with ASM. That will come in a later patch. To recap, this PR contains the following changes: > > 1. VM stubs for downcalls are now generated up front, instead of lazily by C2 [1]. > 2. The VM support for upcalls/downcalls now supports all possible call shapes, and the VM stubs and Java code implementing the buffered invocation strategy have been removed [2], [3], [4], [5]. > 3. The existing C2 intrinsification support for the `linkToNative` method handle linker was no longer needed and has been removed [6] (support might be re-added in another form later). > 4. Some other cleanups, such as: OptimizedEntryBlob (for upcalls) now implements RuntimeBlob directly. Binding to java classes has been rewritten to use javaClasses.h/cpp (this wasn't previously possible due to these java classes being in an incubator module) [7], [8], [9]. > > While the patch mostly consists of VM changes, there are also some Java changes to support (2). > > The original commit structure has been mostly retained, so it might be useful to look at a specific commit, or the corresponding patch in the [panama-foreign](https://github.com/openjdk/panama-foreign/pulls?q=is%3Apr) repo as well. I've also left some inline comments to explain some of the changes, which will hopefully make reviewing easier. 
> > Testing: Tier1-4 > > Thanks, > Jorn > > [1]: https://github.com/openjdk/jdk/pull/7959/commits/048b88156814579dca1f70742061ad24942fd358 > [2]: https://github.com/openjdk/jdk/pull/7959/commits/2fbbef472b4c2b4fee5ede2f18cd81ab61e88f49 > [3]: https://github.com/openjdk/jdk/pull/7959/commits/8a957a4ed9cc8d1f708ea8777212eb51ab403dc3 > [4]: https://github.com/openjdk/jdk/pull/7959/commits/35ba1d964f1de4a77345dc58debe0565db4b0ff3 > [5]: https://github.com/openjdk/jdk/pull/7959/commits/4e72aae22920300c5ffa16fed805b62ed9092120 > [6]: https://github.com/openjdk/jdk/pull/7959/commits/08e22e1b468c5c8f0cfd7135c72849944068aa7a > [7]: https://github.com/openjdk/jdk/pull/7959/commits/451cd9edf54016c182dab21a8b26bd8b609fc062 > [8]: https://github.com/openjdk/jdk/pull/7959/commits/4c851d2795afafec3a3ab17f4142ee098692068f > [9]: https://github.com/openjdk/jdk/pull/7959/commits/d025377799424f31512dca2ffe95491cd5ae22f9 This pull request has now been integrated. Changeset: 81e4bdbe Author: Jorn Vernee URL: https://git.openjdk.java.net/jdk/commit/81e4bdbe1358b7feced08ba758ddb66415968036 Stats: 6914 lines in 155 files changed: 2577 ins; 3219 del; 1118 mod 8283689: Update the foreign linker VM implementation Co-authored-by: Jorn Vernee Co-authored-by: Nick Gasson Reviewed-by: mcimadamore, vlivanov, rehn ------------- PR: https://git.openjdk.java.net/jdk/pull/7959 From chagedorn at openjdk.java.net Wed May 18 10:46:01 2022 From: chagedorn at openjdk.java.net (Christian Hagedorn) Date: Wed, 18 May 2022 10:46:01 GMT Subject: RFR: 8286870: Memory leak with RepeatCompilation In-Reply-To: <0SwF1Qb_W-aDBnwRcLLtBCb2JzfTpNjYhaCYApP_Z6M=.de140365-682c-422b-8fc5-5c9394cf631a@github.com> References: <0SwF1Qb_W-aDBnwRcLLtBCb2JzfTpNjYhaCYApP_Z6M=.de140365-682c-422b-8fc5-5c9394cf631a@github.com> Message-ID: <1_Xk4vXG4eIDFt37i2ITppRGGQPLxUn5FoPzcHCNb_Y=.8e4ea4cd-a044-4d6a-8024-32179fe8a945@github.com> On Tue, 17 May 2022 11:50:38 GMT, Tobias Hartmann wrote: > While using `RepeatCompilation` in 
combination with replay compilation and stress options to reproduce an intermittent issue, I noticed that it does not free the compiler thread arena after each compilation, leading to a (temporary) memory leak and out of memory errors. For example, each compilation of [JDK-8280696](https://bugs.openjdk.java.net/browse/JDK-8280696) allocates an additional 1218 kB. > > The fix is to simply add a `ResourceMark` in the loop. > > Thanks, > Tobias Looks good! ------------- Marked as reviewed by chagedorn (Reviewer). PR: https://git.openjdk.java.net/jdk/pull/8744 From thartmann at openjdk.java.net Wed May 18 10:52:47 2022 From: thartmann at openjdk.java.net (Tobias Hartmann) Date: Wed, 18 May 2022 10:52:47 GMT Subject: RFR: 8286870: Memory leak with RepeatCompilation In-Reply-To: <0SwF1Qb_W-aDBnwRcLLtBCb2JzfTpNjYhaCYApP_Z6M=.de140365-682c-422b-8fc5-5c9394cf631a@github.com> References: <0SwF1Qb_W-aDBnwRcLLtBCb2JzfTpNjYhaCYApP_Z6M=.de140365-682c-422b-8fc5-5c9394cf631a@github.com> Message-ID: On Tue, 17 May 2022 11:50:38 GMT, Tobias Hartmann wrote: > While using `RepeatCompilation` in combination with replay compilation and stress options to reproduce an intermittent issue, I noticed that it does not free the compiler thread arena after each compilation, leading to a (temporary) memory leak and out of memory errors. For example, each compilation of [JDK-8280696](https://bugs.openjdk.java.net/browse/JDK-8280696) allocates an additional 1218 kB. > > The fix is to simply add a `ResourceMark` in the loop. > > Thanks, > Tobias Thanks, Christian! 
------------- PR: https://git.openjdk.java.net/jdk/pull/8744 From chagedorn at openjdk.java.net Wed May 18 10:59:53 2022 From: chagedorn at openjdk.java.net (Christian Hagedorn) Date: Wed, 18 May 2022 10:59:53 GMT Subject: RFR: 8286182: C2: crash with SIGFPE when executing compiled code In-Reply-To: References: Message-ID: On Wed, 18 May 2022 00:11:56 GMT, Quan Anh Mai wrote: >> The fix looks reasonable to address the same problems in [JDK-8257822](https://bugs.openjdk.java.net/browse/JDK-8257822) and [JDK-8248552](https://bugs.openjdk.java.net/browse/JDK-8248552) for the new nodes `Div` and `Mod` nodes. >> >>> The problem is that NoOvfDivI does not only depend on the zero-divisor check but a possible overflow check as well. So with this fix it is still possible for a SIGFPE to occur. >> >> Do you have a failing test for this case? We've been seeing a lot of SIGFPE failures lately with Java Fuzzer. I have to walk through them to see if I can find a case that is still failing with this fix. Will get back with the result of this analysis. >> >>> IIUC this trouble comes from the fact that on x86 a Div node must be pinned to its zero-divisor check but may float with regards to other control nodes. Maybe we can remove all this special handling and simply catch SIGFPE instead? The result is guaranteed to not be used in those cases so we may not worry about the correctness of the compiled code. >> >> I'm not sure if we should rely on signal catching to fix the cases where a division is wrongly floating above its zero check. I think we should not intentionally leave a graph in a broken state with the intention to fix it later at runtime. > >> Do you have a failing test for this case? We've been seeing a lot of SIGFPE failures lately with Java Fuzzer. I have to walk through them to see if I can find a case that is still failing with this fix. Will get back with the result of this analysis. 
> > Overflow only happens with dividend = MIN_VALUE and divisor = -1 so it would be hard to have a failing test for this case. I will try to come up with one. > >> I'm not sure if we should rely on signal catching to fix the cases where a division is wrongly floating above its zero check. I think we should not intentionally leave a graph in a broken state with the intention to fix it later at runtime. > > IIUC the graph is broken only because the division raises SIGFPE for invalid inputs. If we choose to ignore the signal instead then we can treat the Div nodes similar to how we treat other nodes such as Add or Sub and let them freely float around. > > Thanks. I've checked all our fuzzer crashes and the current fix works for all of them. But given that there is potentially another issue with the overflow check mentioned by @merykitty, we could indeed think about backing the changes out for JDK 19 as we are getting closer to the fork. I'd suggest to wait and see for now if @merykitty was able to find a case where we still crash. ------------- PR: https://git.openjdk.java.net/jdk/pull/8726 From thartmann at openjdk.java.net Wed May 18 11:16:48 2022 From: thartmann at openjdk.java.net (Tobias Hartmann) Date: Wed, 18 May 2022 11:16:48 GMT Subject: Integrated: 8286870: Memory leak with RepeatCompilation In-Reply-To: <0SwF1Qb_W-aDBnwRcLLtBCb2JzfTpNjYhaCYApP_Z6M=.de140365-682c-422b-8fc5-5c9394cf631a@github.com> References: <0SwF1Qb_W-aDBnwRcLLtBCb2JzfTpNjYhaCYApP_Z6M=.de140365-682c-422b-8fc5-5c9394cf631a@github.com> Message-ID: On Tue, 17 May 2022 11:50:38 GMT, Tobias Hartmann wrote: > While using `RepeatCompilation` in combination with replay compilation and stress options to reproduce an intermittent issue, I noticed that it does not free the compiler thread arena after each compilation, leading to a (temporary) memory leak and out of memory errors. 
For example, each compilation of [JDK-8280696](https://bugs.openjdk.java.net/browse/JDK-8280696) allocates an additional 1218 kB. > > The fix is to simply add a `ResourceMark` in the loop. > > Thanks, > Tobias This pull request has now been integrated. Changeset: 69ff86a3 Author: Tobias Hartmann URL: https://git.openjdk.java.net/jdk/commit/69ff86a32088d9664e5e0dae12edddc0643e3fd3 Stats: 2 lines in 1 file changed: 1 ins; 0 del; 1 mod 8286870: Memory leak with RepeatCompilation Reviewed-by: kvn, chagedorn ------------- PR: https://git.openjdk.java.net/jdk/pull/8744 From mdoerr at openjdk.java.net Wed May 18 12:40:58 2022 From: mdoerr at openjdk.java.net (Martin Doerr) Date: Wed, 18 May 2022 12:40:58 GMT Subject: RFR: 8286182: C2: crash with SIGFPE when executing compiled code In-Reply-To: References: Message-ID: On Mon, 16 May 2022 12:36:43 GMT, Martin Doerr wrote: > The bug is not assigned to me, but I have seen that the C2 code which checks for div by 0 is not aware of the new nodes from [JDK-8284742](https://bugs.openjdk.java.net/browse/JDK-8284742). > This fixes the VM to pass the reproducer. I'm not sure if more opcode checks are required to get added. While evaluation is ongoing I have thought a bit about making the div node dependent on a single check which avoids all SIGFPE cases: diff --git a/src/hotspot/share/opto/parse3.cpp b/src/hotspot/share/opto/parse3.cpp index 7e0f854436b..9d4226571ae 100644 --- a/src/hotspot/share/opto/parse3.cpp +++ b/src/hotspot/share/opto/parse3.cpp @@ -484,26 +484,30 @@ void Parse::do_divmod_fixup() { return; } - // The generated graph is equivalent to (in2 == -1) ? 
-in1 : (in1 / in2) - // we need to have a separate branch for in2 == -1 due to the special - // case of min_jint / -1 - Node* cmp = _gvn.transform(CmpNode::make(in2, _gvn.integercon(-1, bt), bt)); - Node* bol = Bool(cmp, BoolTest::eq); - IfNode* iff = create_and_map_if(control(), bol, PROB_UNLIKELY_MAG(3), COUNT_UNKNOWN); + // Check if in2 < -1 or > 1 by one CmpUNode (in2 + 1 >u 1). Reason: + // Div node will have to get pinned, but we can only pin to one region. + Node* inc = _gvn.transform(AddNode::make(in2, _gvn.integercon(1, bt), bt)); + Node* cmp = _gvn.transform(CmpUNode::make(inc, _gvn.integercon(1, bt), bt)); + + Node* bol = Bool(cmp, BoolTest::gt); + IfNode* iff = create_and_map_if(control(), bol, PROB_MAX, COUNT_UNKNOWN); Node* iff_true = IfTrue(iff); Node* iff_false = IfFalse(iff); + + Node* res_slow = generate_division(_gvn, iff_true, in1, in2, bc); + Node* res_fast = (bc == Bytecodes::_idiv || bc == Bytecodes::_ldiv) ? _gvn.transform(SubNode::make(_gvn.zerocon(bt), in1, bt)) : _gvn.zerocon(bt); - Node* res_slow = generate_division(_gvn, iff_false, in1, in2, bc); + Node* merge = new RegionNode(3); merge->init_req(1, iff_true); merge->init_req(2, iff_false); record_for_igvn(merge); set_control(_gvn.transform(merge)); Node* res = new PhiNode(merge, Type::get_const_basic_type(bt)); - res->init_req(1, res_fast); - res->init_req(2, res_slow); + res->init_req(1, res_slow); + res->init_req(2, res_fast); res = _gvn.transform(res); push_result(*this, res, bt); } This effectively disables implicit div by 0 checks, but I believe they could get repaired. Are there any opinions about this idea? 
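The single-comparison idea in the diff above, `in2 + 1 >u 1`, can be checked in isolation with plain Java. This is only a sketch of the arithmetic, with hypothetical class and method names; it says nothing about how C2 would pin or schedule the resulting nodes. The unsigned comparison holds exactly when `in2` is neither 0 nor -1, which are the two divisors that can raise SIGFPE on x86 (division by zero, and `MIN_VALUE / -1`).

```java
// Hypothetical illustration of the unsigned-compare trick; not C2 code.
final class DivisorRangeCheckDemo {
    // One unsigned comparison: (in2 + 1) >u 1.
    // in2 == -1 maps to 0 and in2 == 0 maps to 1, both of which fail;
    // every other value maps to something in [2, 0xFFFFFFFF] and passes.
    static boolean singleCheck(int in2) {
        return Integer.compareUnsigned(in2 + 1, 1) > 0;
    }

    // The straightforward two-comparison version, for reference.
    static boolean twoChecks(int in2) {
        return in2 != 0 && in2 != -1;
    }

    public static void main(String[] args) {
        int[] samples = {Integer.MIN_VALUE, -2, -1, 0, 1, 2, Integer.MAX_VALUE};
        for (int in2 : samples) {
            System.out.println(in2 + " -> " + singleCheck(in2)
                    + " (reference: " + twoChecks(in2) + ")");
        }
    }
}
```

Note that `in2 == 1` passes the check, so the condition is slightly wider than the "in2 < -1 or > 1" wording in the diff comment suggests.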
------------- PR: https://git.openjdk.java.net/jdk/pull/8726 From duke at openjdk.java.net Wed May 18 12:43:31 2022 From: duke at openjdk.java.net (Tobias Holenstein) Date: Wed, 18 May 2022 12:43:31 GMT Subject: RFR: JDK-8284944: # assert(cnt++ < 40) failed: infinite cycle in loop optimization Message-ID: <6tYUlU6To3dIk5NZNcyu2PI8m72uLsw09qO_5ca4GBY=.97d63197-96e1-4f4f-b854-0d54c1628267@github.com> `_loop_opts_cnt` is set to `LoopOptsCount` which can have a maximum value of 43. `_loop_opts_cnt` is decremented in `PHASE_PHASEIDEALLOOP1`, `PHASE_PHASEIDEALLOOP2` and `PHASE_PHASEIDEALLOOP3` before it reaches `PHASE_PHASEIDEALLOOP_ITERATIONS` where it is decremented further in a loop until `_loop_opts_cnt` is 0. The assert assumes that `_loop_opts_cnt` has max. value 40 in `PHASE_PHASEIDEALLOOP_ITERATIONS`. But when `PartialPeelLoop` is turned off `PHASE_PHASEIDEALLOOP2` is skipped and `_loop_opts_cnt` can have max. value 41 in `PHASE_PHASEIDEALLOOP_ITERATIONS`. Therefore the assert is wrong. I propose to remove the assert entirely since the loop already has a condition `_loop_opts_cnt > 0` and `_loop_opts_cnt` is decremented in every iteration. 
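The counting argument above can be checked with a small model. This is a sketch under stated assumptions, not HotSpot code: it assumes each of the three named phases decrements the counter by exactly one, and that the iteration loop keeps running until the counter reaches 0. The class and method names are hypothetical.

```java
final class LoopOptsBudget {
    // Maximum value of LoopOptsCount, as stated in the message above.
    static final int MAX_LOOP_OPTS_COUNT = 43;

    // Rounds left when PHASE_PHASEIDEALLOOP_ITERATIONS is reached (assumed model).
    static int roundsLeftForIterations(int loopOptsCount, boolean partialPeelLoop) {
        int cnt = loopOptsCount;
        cnt--;                  // PHASE_PHASEIDEALLOOP1
        if (partialPeelLoop) {
            cnt--;              // PHASE_PHASEIDEALLOOP2, skipped when PartialPeelLoop is off
        }
        cnt--;                  // PHASE_PHASEIDEALLOOP3
        return cnt;
    }

    // Mirrors assert(cnt++ < 40) evaluated once per round of the iteration loop.
    static boolean assertWouldFire(int rounds) {
        int cnt = 0;
        for (int left = rounds; left > 0; left--) {
            if (!(cnt++ < 40)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        for (boolean partialPeel : new boolean[] {true, false}) {
            int rounds = roundsLeftForIterations(MAX_LOOP_OPTS_COUNT, partialPeel);
            System.out.println("PartialPeelLoop=" + partialPeel + ": rounds=" + rounds
                    + ", assert fires=" + assertWouldFire(rounds));
        }
    }
}
```

In this model the assert can only fire in the configuration without `PartialPeelLoop`, which matches the 40 versus 41 budget described in the message.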
------------- Commit messages: - JDK-8284944: # assert(cnt++ < 40) failed: infinite cycle in loop optimization Changes: https://git.openjdk.java.net/jdk/pull/8767/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=8767&range=00 Issue: https://bugs.openjdk.java.net/browse/JDK-8284944 Stats: 122 lines in 2 files changed: 120 ins; 2 del; 0 mod Patch: https://git.openjdk.java.net/jdk/pull/8767.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8767/head:pull/8767 PR: https://git.openjdk.java.net/jdk/pull/8767
From duke at openjdk.java.net Wed May 18 13:06:45 2022 From: duke at openjdk.java.net (Tobias Holenstein) Date: Wed, 18 May 2022 13:06:45 GMT Subject: RFR: JDK-8284944: assert(cnt++ < 40) failed: infinite cycle in loop optimization [v2] In-Reply-To: <6tYUlU6To3dIk5NZNcyu2PI8m72uLsw09qO_5ca4GBY=.97d63197-96e1-4f4f-b854-0d54c1628267@github.com> References: <6tYUlU6To3dIk5NZNcyu2PI8m72uLsw09qO_5ca4GBY=.97d63197-96e1-4f4f-b854-0d54c1628267@github.com> Message-ID: <048Bn62N56IG6Z-1e1POyo-gjogYkq9lDy4I5TQXOLk=.f5a67ec6-22f6-4541-b41b-7920851b1bd5@github.com>
Tobias Holenstein has updated the pull request incrementally with one additional commit since the last revision: reformat spaces in test ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/8767/files - new: https://git.openjdk.java.net/jdk/pull/8767/files/e50f9442..73871c70 Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=8767&range=01 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=8767&range=00-01 Stats: 93 lines in 1 file changed: 5 ins; 0 del; 88 mod Patch: https://git.openjdk.java.net/jdk/pull/8767.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8767/head:pull/8767 PR: https://git.openjdk.java.net/jdk/pull/8767 From roland at openjdk.java.net Wed May 18 13:32:50 2022 From: roland at openjdk.java.net (Roland Westrelin) Date: Wed, 18 May 2022 13:32:50 GMT Subject: RFR: 8275201: C2: hide klass() accessor from TypeOopPtr and typeKlassPtr subclasses [v2] In-Reply-To: <3hWYncRMCaWRwoXJA-jSHTJ9yzl_puvIrGDAwZlu9mE=.5ab1cda1-4ac1-4121-93a4-358ac6b80f76@github.com> References: <3hWYncRMCaWRwoXJA-jSHTJ9yzl_puvIrGDAwZlu9mE=.5ab1cda1-4ac1-4121-93a4-358ac6b80f76@github.com> Message-ID: On Mon, 16 May 2022 22:47:41 GMT, Xin Liu wrote: >>> Looks very good! >> >> Thanks for the review. > > hi, @rwestrel , > I see this patch uses covariant return in a few places, eg. > > - virtual const Type *cast_to_exactness(bool klass_is_exact) const; > + virtual const TypeInstPtr* cast_to_exactness(bool klass_is_exact) const; > > > and > > > // Speculative type helper methods. > - virtual const Type* remove_speculative() const; > + virtual const TypeOopPtr* remove_speculative() const; > > > This contradicts "Avoid covariant return types." from [hotspot-style](https://github.com/openjdk/jdk/blob/master/doc/hotspot-style.md). Is there any particular reason to change like that? I see that compile.cpp leverages that to improve expressiveness. Hi @navyxliu, > This contradicts "Avoid covariant return types." 
from [hotspot-style](https://github.com/openjdk/jdk/blob/master/doc/hotspot-style.md). Is there any particular reason to change like that? I see that compile.cpp leverages that to improve expressiveness. I missed that. I will work on a subsequent change to remove the covariant return types. ------------- PR: https://git.openjdk.java.net/jdk/pull/6717 From thartmann at openjdk.java.net Wed May 18 14:43:20 2022 From: thartmann at openjdk.java.net (Tobias Hartmann) Date: Wed, 18 May 2022 14:43:20 GMT Subject: RFR: 8280696: C2 compilation hits assert(is_dominator(c, n_ctrl)) failed Message-ID: We hit an assert when computing early control via `PhaseIdealLoop::compute_early_ctrl` for `1726 AddP` because one of its control inputs `1478 Region` does not dominate current control `1350 Loop` of the AddP: ![Screenshot from 2022-05-18 15-10-24](https://user-images.githubusercontent.com/5312595/169046641-2ee94257-0aae-4ddf-bb5f-49dc19b466b3.png) I.e., current control of the AddP is incorrect. The problem is that the code in `PhaseIdealLoop::has_local_phi_input` that special cases AddP's only checks control of the Address (and Offset) input, assuming that control of the Base input is consistent. https://github.com/openjdk/jdk/blob/aa7ccdf44549a52cce9e99f6569097d3343d9ee4/src/hotspot/share/opto/loopopts.cpp#L333-L337 This is not guaranteed though, leading to the AddP ending up with control that is not dominated by control of its base input. As described below, this only reproduces with a very specific sequence of optimizations triggered by replay compilation with `-XX:+StressIGVN` and a fixed seed. I was not able to extract a regression test. The fix is to also check control of the Base input when moving the AddP up to a dominating point. For testing purposes, I added an `assert(get_ctrl(m->in(1)) != n_ctrl, "sanity")` without the fix to verify that this change does not affect common cases. It triggers in the failing case but not for any test in tier 1 - 5. 
In addition, I slightly refactored the code of `PhaseIdealLoop::compute_early_ctrl` and added comments. Gory details below. Relevant graph after parsing: 197 CastPP === 1460 60 [[ ... 1724 1725 ]] #java/io/BufferedReader:NotNull * 1459 CastPP === 1460 60 [[ ... 1724 1725 ]] #java/io/BufferedReader:NotNull * 1725 Phi === 1478 1459 197 [[ 1726 ]] #java/io/BufferedReader:NotNull * 1724 Phi === 1478 1459 197 [[ 1726 ]] #java/io/BufferedReader:NotNull * 1726 AddP === _ 1724 1725 41 [[ ... ]] Oop:java/io/BufferedReader:NotNull+24 * `1724 Phi` is then processed by the following code in `PhiNode::Ideal` that replaces its inputs by a cast of the unique input `1730 CastPP`: https://github.com/openjdk/jdk/blob/aa7ccdf44549a52cce9e99f6569097d3343d9ee4/src/hotspot/share/opto/cfgnode.cpp#L2013-L2016 ``` 197 CastPP === 1460 60 [[ ... 1725 ]] #java/io/BufferedReader:NotNull * 1459 CastPP === 1460 60 [[ ... 1725 ]] #java/io/BufferedReader:NotNull * 1730 CastPP === 1478 60 [[ ... 1724 1724 ]] #java/io/BufferedReader:NotNull * strong dependency 1725 Phi === 1478 1459 197 [[ 1726 ]] #java/io/BufferedReader:NotNull * 1724 Phi === 1478 1730 1730 [[ 1726 ]] #java/io/BufferedReader:NotNull * 1726 AddP === _ 1724 1725 41 [[ ... ]] Oop:java/io/BufferedReader:NotNull+24 * ``` Then `1724 Phi` is replaced by the unique input `1730 CastPP`: 197 CastPP === 1460 60 [[ ... 1725 ]] #java/io/BufferedReader:NotNull * 1459 CastPP === 1460 60 [[ ... 1725 ]] #java/io/BufferedReader:NotNull * 1725 Phi === 1478 1459 197 [[ 1726 ]] #java/io/BufferedReader:NotNull * 1730 CastPP === 1478 60 [[ ... 1726 ]] #java/io/BufferedReader:NotNull * strong dependency 1726 AddP === _ 1730 1725 41 [[ ... ]] Oop:java/io/BufferedReader:NotNull+24 * Now `1459 CastPP` is replaced by identical `197 CastPP`: 197 CastPP === 1460 60 [[ ... 1725 ]] #java/io/BufferedReader:NotNull * 1725 Phi === 1478 197 197 [[ 1726 1739 ]] #java/io/BufferedReader:NotNull * 1730 CastPP === 1478 60 [[ ... 
1726 ]] #java/io/BufferedReader:NotNull * strong dependency 1726 AddP === _ 1730 1725 41 [[ ... ]] Oop:java/io/BufferedReader:NotNull+24 * Finally, `1725 Phi` is replaced by unique input `197 CastPP` and the AddP ends up with two casts with different control of the same oop for Base and Address: 197 CastPP === 1460 60 [[ ... 1725 ]] #java/io/BufferedReader:NotNull * 1730 CastPP === 1478 60 [[ ... 1726 ]] #java/io/BufferedReader:NotNull * strong dependency 1726 AddP === _ 1730 197 41 [[ ... ]] Oop:java/io/BufferedReader:NotNull+24 * Looking at the above transformation, the root cause is really the `1730 CastPP` added by `PhiNode::Ideal` which is not needed and prevents the two casts from being merged. Is it worth filing a follow-up enhancement to fix this? Thanks, Tobias ------------- Commit messages: - 8280696: C2 compilation hits assert(is_dominator(c, n_ctrl)) failed Changes: https://git.openjdk.java.net/jdk/pull/8770/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=8770&range=00 Issue: https://bugs.openjdk.java.net/browse/JDK-8280696 Stats: 14 lines in 1 file changed: 4 ins; 3 del; 7 mod Patch: https://git.openjdk.java.net/jdk/pull/8770.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8770/head:pull/8770 PR: https://git.openjdk.java.net/jdk/pull/8770 From duke at openjdk.java.net Wed May 18 14:59:49 2022 From: duke at openjdk.java.net (Quan Anh Mai) Date: Wed, 18 May 2022 14:59:49 GMT Subject: RFR: 8285973: x86_64: Improve fp comparison and cmove for eq/ne [v2] In-Reply-To: References: Message-ID: > Hi, > > This patch optimises the matching rules for floating-point comparison with respects to eq/ne on x86-64 > > 1, When the inputs of a comparison is the same (i.e `isNaN` patterns), `ZF` is always set, so we don't need `cmpOpUCF2` for the eq/ne cases, which improves the sequence of `If (CmpF x x) (Bool ne)` from > > ucomiss xmm0, xmm0 > jp label > jne label > > into > > ucomiss xmm0, xmm0 > jp label > > 2, The move rules for `cmpOpUCF2` is 
missing, which makes patterns such as `x == y ? 1 : 0` fall back to `cmpOpU`, which has a really high cost of fixing the flags, such as
>
> xorl ecx, ecx
> ucomiss xmm0, xmm1
> jnp done
> pushf
> andq [rsp], 0xffffff2b
> popf
> done:
> movl eax, 1
> cmovel eax, ecx
>
> The patch changes this sequence into
>
> xorl ecx, ecx
> ucomiss xmm0, xmm1
> movl eax, 1
> cmovpl eax, ecx
> cmovnel eax, ecx
>
> 3, The patch also changes the pattern of `isInfinite` to be more optimised by using `Math.abs` to save one comparison and compares the result with `MAX_VALUE`, since `>` is more optimised than `==` for floating-point types.
>
> The benchmark results are as follows:
>
> Before:
> Benchmark                      Mode  Cnt     Score     Error  Units
> FPComparison.equalDouble       avgt    5  2876.242 ±  58.875  ns/op
> FPComparison.equalFloat        avgt    5  3062.430 ±  31.371  ns/op
> FPComparison.isFiniteDouble    avgt    5   475.749 ±  19.027  ns/op
> FPComparison.isFiniteFloat     avgt    5   506.525 ±  14.417  ns/op
> FPComparison.isInfiniteDouble  avgt    5  1232.800 ±  31.677  ns/op
> FPComparison.isInfiniteFloat   avgt    5  1234.708 ±  70.239  ns/op
> FPComparison.isNanDouble       avgt    5  2255.847 ±   7.238  ns/op
> FPComparison.isNanFloat        avgt    5  2567.044 ±  36.078  ns/op
>
> After:
> Benchmark                      Mode  Cnt     Score     Error  Units
> FPComparison.equalDouble       avgt    5   594.636 ±   8.922  ns/op
> FPComparison.equalFloat        avgt    5   663.849 ±   3.656  ns/op
> FPComparison.isFiniteDouble    avgt    5   518.309 ± 107.352  ns/op
> FPComparison.isFiniteFloat     avgt    5   515.576 ±  14.669  ns/op
> FPComparison.isInfiniteDouble  avgt    5   621.185 ±  11.935  ns/op
> FPComparison.isInfiniteFloat   avgt    5   623.566 ±  15.206  ns/op
> FPComparison.isNanDouble       avgt    5   400.124 ±   0.762  ns/op
> FPComparison.isNanFloat        avgt    5   546.486 ±   1.509  ns/op
>
> Thank you very much.

Quan Anh Mai has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase.
The pull request contains eight additional commits since the last revision: - incidental ws - add tests - Merge branch 'master' into fpcompare - fix tests - test - improve infinity - remove expensive rules - improve fp comparison ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/8525/files - new: https://git.openjdk.java.net/jdk/pull/8525/files/b64e04b5..ba93dcf2 Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=8525&range=01 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=8525&range=00-01 Stats: 210103 lines in 2627 files changed: 159508 ins; 36691 del; 13904 mod Patch: https://git.openjdk.java.net/jdk/pull/8525.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8525/head:pull/8525 PR: https://git.openjdk.java.net/jdk/pull/8525
From duke at openjdk.java.net Wed May 18 15:01:38 2022 From: duke at openjdk.java.net (Quan Anh Mai) Date: Wed, 18 May 2022 15:01:38 GMT Subject: RFR: 8285973: x86_64: Improve fp comparison and cmove for eq/ne In-Reply-To: References: Message-ID:
On Wed, 4 May 2022 01:59:17 GMT, Quan Anh Mai wrote:
> This patch optimises the matching rules for floating-point comparison with respects to eq/ne on x86-64
> [...]
I have reverted the changes to `java.lang.Float` and `java.lang.Double` to not interfere with the intrinsic PR. More tests are added to cover all cases regarding floating-point comparison of compiled code. The rules for fp comparison that output the result to `rFlagRegsU` are expensive and should be avoided.
As a result, I removed the shortcut rules with memory or constant operands to reduce the number of match rules. Only the basic rules are kept. Thanks. ------------- PR: https://git.openjdk.java.net/jdk/pull/8525
From kvn at openjdk.java.net Wed May 18 15:16:50 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Wed, 18 May 2022 15:16:50 GMT Subject: RFR: JDK-8284944: assert(cnt++ < 40) failed: infinite cycle in loop optimization [v2] In-Reply-To: <048Bn62N56IG6Z-1e1POyo-gjogYkq9lDy4I5TQXOLk=.f5a67ec6-22f6-4541-b41b-7920851b1bd5@github.com> References: <6tYUlU6To3dIk5NZNcyu2PI8m72uLsw09qO_5ca4GBY=.97d63197-96e1-4f4f-b854-0d54c1628267@github.com> <048Bn62N56IG6Z-1e1POyo-gjogYkq9lDy4I5TQXOLk=.f5a67ec6-22f6-4541-b41b-7920851b1bd5@github.com> Message-ID:
On Wed, 18 May 2022 13:06:45 GMT, Tobias Holenstein wrote:
>> [...]
>> I propose to remove the assert entirely since the loop already has a condition `_loop_opts_cnt > 0` and `_loop_opts_cnt` is decremented in every iteration.
Having more than 40 cycles in loop opts is a bug: something has gone wrong there. That is what this assert is for. We should detect such a case in loop opts and stop early. I don't think removing this assert is the correct thing to do. ------------- Changes requested by kvn (Reviewer).
PR: https://git.openjdk.java.net/jdk/pull/8767
From kvn at openjdk.java.net Wed May 18 15:25:40 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Wed, 18 May 2022 15:25:40 GMT Subject: RFR: 8280696: C2 compilation hits assert(is_dominator(c, n_ctrl)) failed In-Reply-To: References: Message-ID:
On Wed, 18 May 2022 14:34:58 GMT, Tobias Hartmann wrote:
> We hit an assert when computing early control via `PhaseIdealLoop::compute_early_ctrl` for `1726 AddP` because one of its control inputs `1478 Region` does not dominate current control `1350 Loop` of the AddP:
> [...]
The question is why we have separate similar `Phi` for Base and Address?: 1725 Phi === 1478 1459 197 [[ 1726 ]] #java/io/BufferedReader:NotNull * 1724 Phi === 1478 1459 197 [[ 1726 ]] #java/io/BufferedReader:NotNull * ------------- PR: https://git.openjdk.java.net/jdk/pull/8770
From kvn at openjdk.java.net Wed May 18 15:29:46 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Wed, 18 May 2022 15:29:46 GMT Subject: RFR: 8280696: C2 compilation hits assert(is_dominator(c, n_ctrl)) failed In-Reply-To: References: Message-ID:
On Wed, 18 May 2022 14:34:58 GMT, Tobias Hartmann wrote:
> [...]
> Looking at the above transformation, the root cause is really the `1730 CastPP` added by `PhiNode::Ideal` which is not needed and prevents the two casts from being merged. Is it worth filing a follow-up enhancement to fix this?
I am fine with adding the check for Base's control. But I would like to have additional investigation why we have similar Phi nodes for Base and Address. And also your suggestion about CastPP. Both in other RFEs. ------------- Marked as reviewed by kvn (Reviewer). PR: https://git.openjdk.java.net/jdk/pull/8770
From thartmann at openjdk.java.net Wed May 18 15:34:59 2022 From: thartmann at openjdk.java.net (Tobias Hartmann) Date: Wed, 18 May 2022 15:34:59 GMT Subject: RFR: 8280696: C2 compilation hits assert(is_dominator(c, n_ctrl)) failed In-Reply-To: References: Message-ID: <7TPpAcAnDAZ1zCHPGDEpqiza9m2DmUbGHTFk1xc13cw=.8cd622a6-1492-4a10-b08e-7e144bfa0195@github.com>
On Wed, 18 May 2022 14:34:58 GMT, Tobias Hartmann wrote:
> [...]
Thanks for the review, Vladimir!
> The question is why we have separate similar Phi for Base and Address?
That happens here: https://github.com/openjdk/jdk/blob/aa7ccdf44549a52cce9e99f6569097d3343d9ee4/src/hotspot/share/opto/cfgnode.cpp#L2198-L2219 ------------- PR: https://git.openjdk.java.net/jdk/pull/8770 From thartmann at openjdk.java.net Wed May 18 15:46:15 2022 From: thartmann at openjdk.java.net (Tobias Hartmann) Date: Wed, 18 May 2022 15:46:15 GMT Subject: RFR: 8280696: C2 compilation hits assert(is_dominator(c, n_ctrl)) failed In-Reply-To: References: Message-ID: On Wed, 18 May 2022 14:34:58 GMT, Tobias Hartmann wrote: > We hit an assert when computing early control via `PhaseIdealLoop::compute_early_ctrl` for `1726 AddP` because one of its control inputs `1478 Region` does not dominate current control `1350 Loop` of the AddP: > ![Screenshot from 2022-05-18 15-10-24](https://user-images.githubusercontent.com/5312595/169046641-2ee94257-0aae-4ddf-bb5f-49dc19b466b3.png) > > I.e., current control of the AddP is incorrect. The problem is that the code in `PhaseIdealLoop::has_local_phi_input` that special cases AddP's only checks control of the Address (and Offset) input, assuming that control of the Base input is consistent. > https://github.com/openjdk/jdk/blob/aa7ccdf44549a52cce9e99f6569097d3343d9ee4/src/hotspot/share/opto/loopopts.cpp#L333-L337 > This is not guaranteed though, leading to the AddP ending up with control that is not dominated by control of its base input. > > As described below, this only reproduces with a very specific sequence of optimizations triggered by replay compilation with `-XX:+StressIGVN` and a fixed seed. I was not able to extract a regression test. > > The fix is to also check control of the Base input when moving the AddP up to a dominating point. For testing purposes, I added an `assert(get_ctrl(m->in(1)) != n_ctrl, "sanity")` without the fix to verify that this change does not affect common cases. It triggers in the failing case but not for any test in tier 1 - 5. 
In addition, I slightly refactored the code of `PhaseIdealLoop::compute_early_ctrl` and added comments. > > Gory details below. > > Relevant graph after parsing: > > 197 CastPP === 1460 60 [[ ... 1724 1725 ]] #java/io/BufferedReader:NotNull * > 1459 CastPP === 1460 60 [[ ... 1724 1725 ]] #java/io/BufferedReader:NotNull * > 1725 Phi === 1478 1459 197 [[ 1726 ]] #java/io/BufferedReader:NotNull * > 1724 Phi === 1478 1459 197 [[ 1726 ]] #java/io/BufferedReader:NotNull * > 1726 AddP === _ 1724 1725 41 [[ ... ]] Oop:java/io/BufferedReader:NotNull+24 * > > `1724 Phi` is then processed by the following code in `PhiNode::Ideal` that replaces its inputs by a cast of the unique input `1730 CastPP`: > > https://github.com/openjdk/jdk/blob/aa7ccdf44549a52cce9e99f6569097d3343d9ee4/src/hotspot/share/opto/cfgnode.cpp#L2013-L2016 > > ``` > 197 CastPP === 1460 60 [[ ... 1725 ]] #java/io/BufferedReader:NotNull * > 1459 CastPP === 1460 60 [[ ... 1725 ]] #java/io/BufferedReader:NotNull * > 1730 CastPP === 1478 60 [[ ... 1724 1724 ]] #java/io/BufferedReader:NotNull * strong dependency > 1725 Phi === 1478 1459 197 [[ 1726 ]] #java/io/BufferedReader:NotNull * > 1724 Phi === 1478 1730 1730 [[ 1726 ]] #java/io/BufferedReader:NotNull * > 1726 AddP === _ 1724 1725 41 [[ ... ]] Oop:java/io/BufferedReader:NotNull+24 * > ``` > Then `1724 Phi` is replaced by the unique input `1730 CastPP`: > > 197 CastPP === 1460 60 [[ ... 1725 ]] #java/io/BufferedReader:NotNull * > 1459 CastPP === 1460 60 [[ ... 1725 ]] #java/io/BufferedReader:NotNull * > 1725 Phi === 1478 1459 197 [[ 1726 ]] #java/io/BufferedReader:NotNull * > 1730 CastPP === 1478 60 [[ ... 1726 ]] #java/io/BufferedReader:NotNull * strong dependency > 1726 AddP === _ 1730 1725 41 [[ ... ]] Oop:java/io/BufferedReader:NotNull+24 * > > Now `1459 CastPP` is replaced by identical `197 CastPP`: > > 197 CastPP === 1460 60 [[ ... 
1725 ]] #java/io/BufferedReader:NotNull * > 1725 Phi === 1478 197 197 [[ 1726 1739 ]] #java/io/BufferedReader:NotNull * > 1730 CastPP === 1478 60 [[ ... 1726 ]] #java/io/BufferedReader:NotNull * strong dependency > 1726 AddP === _ 1730 1725 41 [[ ... ]] Oop:java/io/BufferedReader:NotNull+24 * > > Finally, `1725 Phi` is replaced by unique input `197 CastPP` and the AddP ends up with two casts with different control of the same oop for Base and Address: > > 197 CastPP === 1460 60 [[ ... 1725 ]] #java/io/BufferedReader:NotNull * > 1730 CastPP === 1478 60 [[ ... 1726 ]] #java/io/BufferedReader:NotNull * strong dependency > 1726 AddP === _ 1730 197 41 [[ ... ]] Oop:java/io/BufferedReader:NotNull+24 * > > > Looking at the above transformation, the root cause is really the `1730 CastPP` added by `PhiNode::Ideal` which is not needed and prevents the two casts from being merged. Is it worth filing a follow-up enhancement to fix this? > > Thanks, > Tobias That code was introduced by JDK-8231291. Maybe @rwestrel can comment on why it's necessary to create individual Phis for base and address. ------------- PR: https://git.openjdk.java.net/jdk/pull/8770 From duke at openjdk.java.net Wed May 18 15:55:58 2022 From: duke at openjdk.java.net (Quan Anh Mai) Date: Wed, 18 May 2022 15:55:58 GMT Subject: RFR: 8285973: x86_64: Improve fp comparison and cmove for eq/ne [v2] In-Reply-To: <6hRgLEWJfB8OHOYNJReUaMac469hvDuemoURK-aMy4Y=.963b7561-33e0-4069-8d89-8b447d4e0f0f@github.com> References: <6hRgLEWJfB8OHOYNJReUaMac469hvDuemoURK-aMy4Y=.963b7561-33e0-4069-8d89-8b447d4e0f0f@github.com> Message-ID: <9U6TeG5OxWKYDHqd4X4SvId4jPXNeG0F0KxubPYdrYQ=.1b09aef1-5de6-47b4-b674-02f521ef9825@github.com> On Wed, 4 May 2022 23:16:41 GMT, Vladimir Kozlov wrote: >> src/hotspot/cpu/x86/x86_64.ad line 6998: >> >>> 6996: ins_encode %{ >>> 6997: __ cmovl(Assembler::parity, $dst$$Register, $src$$Register); >>> 6998: __ cmovl(Assembler::notEqual, $dst$$Register, $src$$Register); >> >> Should this be `equal`? 
> > I see that you swapped `src, dst` in `match()` but `format` is still incorrect and the code is confusing. This flips the sense of the test by swapping the inputs of the `CMove`, so it is essentially the same as the rule above. Thanks. ------------- PR: https://git.openjdk.java.net/jdk/pull/8525 From duke at openjdk.java.net Wed May 18 15:57:43 2022 From: duke at openjdk.java.net (Quan Anh Mai) Date: Wed, 18 May 2022 15:57:43 GMT Subject: RFR: 8286182: C2: crash with SIGFPE when executing compiled code In-Reply-To: References: Message-ID: On Mon, 16 May 2022 12:36:43 GMT, Martin Doerr wrote: > The bug is not assigned to me, but I have seen that the C2 code which checks for div by 0 is not aware of the new nodes from [JDK-8284742](https://bugs.openjdk.java.net/browse/JDK-8284742). > This fixes the VM to pass the reproducer. I'm not sure if more opcode checks need to be added. I have prepared a backout of the changes made by [JDK-8285390](https://bugs.openjdk.java.net/browse/JDK-8285390) and [JDK-8284742](https://bugs.openjdk.java.net/browse/JDK-8284742) in #8774; a more appropriate fix will be submitted for JDK 20, as suggested by Tobias and Christian. Thanks. ------------- PR: https://git.openjdk.java.net/jdk/pull/8726 From mdoerr at openjdk.java.net Wed May 18 16:19:00 2022 From: mdoerr at openjdk.java.net (Martin Doerr) Date: Wed, 18 May 2022 16:19:00 GMT Subject: RFR: 8286182: C2: crash with SIGFPE when executing compiled code In-Reply-To: References: Message-ID: On Mon, 16 May 2022 12:36:43 GMT, Martin Doerr wrote: > The bug is not assigned to me, but I have seen that the C2 code which checks for div by 0 is not aware of the new nodes from [JDK-8284742](https://bugs.openjdk.java.net/browse/JDK-8284742). > This fixes the VM to pass the reproducer. I'm not sure if more opcode checks need to be added. Closing in favor of https://github.com/openjdk/jdk/pull/8774.
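As background for the "div by 0" check mentioned in this thread, here is a minimal Java sketch of the language-level behavior that compiled code has to preserve. It is not the reproducer from JDK-8286182; the class and method names are hypothetical. The only facts relied on are Java's documented semantics: integer division by zero throws `ArithmeticException`, and `Integer.MIN_VALUE / -1` overflows silently to `Integer.MIN_VALUE`.

```java
public class DivSemantics {
    // Returns true if the integer division throws ArithmeticException.
    static boolean throwsOnDivide(int dividend, int divisor) {
        try {
            int unused = dividend / divisor;
            return false;
        } catch (ArithmeticException e) {
            return true;
        }
    }

    // The overflowing case: no exception, the result wraps around.
    static int minDividedByMinusOne() {
        int minusOne = -1; // kept in a variable so the division is not a compile-time constant
        return Integer.MIN_VALUE / minusOne;
    }

    public static void main(String[] args) {
        System.out.println(throwsOnDivide(42, 0));
        System.out.println(minDividedByMinusOne() == Integer.MIN_VALUE);
    }
}
```

A JIT compiler may only drop the zero check or the overflow handling when it can prove the divisor cannot take the problematic value, which is why any new division-like node would presumably need to be covered by the same checks.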
------------- PR: https://git.openjdk.java.net/jdk/pull/8726 From mdoerr at openjdk.java.net Wed May 18 16:19:01 2022 From: mdoerr at openjdk.java.net (Martin Doerr) Date: Wed, 18 May 2022 16:19:01 GMT Subject: Withdrawn: 8286182: C2: crash with SIGFPE when executing compiled code In-Reply-To: References: Message-ID: <_3qIvDkJJkw1qa3LVLR_DDvFFGQFuMM-IoyXLEr1BIw=.42f2ff00-4269-4d2a-8fa5-38b671a5ee02@github.com> On Mon, 16 May 2022 12:36:43 GMT, Martin Doerr wrote: > The bug is not assigned to me, but I have seen that the C2 code which checks for div by 0 is not aware of the new nodes from [JDK-8284742](https://bugs.openjdk.java.net/browse/JDK-8284742). > This fixes the VM to pass the reproducer. I'm not sure if more opcode checks are required to get added. This pull request has been closed without being integrated. ------------- PR: https://git.openjdk.java.net/jdk/pull/8726 From kvn at openjdk.java.net Wed May 18 16:41:40 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Wed, 18 May 2022 16:41:40 GMT Subject: RFR: 8280696: C2 compilation hits assert(is_dominator(c, n_ctrl)) failed In-Reply-To: References: Message-ID: On Wed, 18 May 2022 15:42:36 GMT, Tobias Hartmann wrote: > That code was introduced by JDK-8231291. Maybe @rwestrel can comment on why it's necessary to create individual Phis for base and address. 
Bug!!!: https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/cfgnode.cpp#L2168 should use `AddPNode::Base`: if (in(i)->in(AddPNode::Base) != base) { base = NULL; } ------------- PR: https://git.openjdk.java.net/jdk/pull/8770 From xliu at openjdk.java.net Wed May 18 17:14:04 2022 From: xliu at openjdk.java.net (Xin Liu) Date: Wed, 18 May 2022 17:14:04 GMT Subject: RFR: 8275201: C2: hide klass() accessor from TypeOopPtr and typeKlassPtr subclasses [v3] In-Reply-To: References: Message-ID: On Tue, 10 May 2022 08:28:15 GMT, Roland Westrelin wrote: >> Outside the type system code itself, c2 usually assumes that a >> TypeOopPtr or a TypeKlassPtr's java type is fully represented by its >> klass(). To have proper support for interfaces, that can't be true as >> a type needs to be represented by an instance class and a set of >> interfaces. This patch hides the klass() accessor of >> TypeOopPtr/TypeKlassPtr and reworks c2 code that relies on it in a way >> that makes that code suitable for proper interface support in a >> subsequent change. This patch doesn't add proper interface support yet >> and is mostly refactoring. "Mostly" because there are cases where the >> previous logic would use a ciKlass but the new one works with a >> TypeKlassPtr/TypeInstPtr which carries the ciKlass and whether the >> klass is exact or not. That extra bit of information can sometimes >> help and so could result in slightly different decisions. >> >> To remove the klass() accessors, the new logic either relies on: >> >> - new methods of TypeKlassPtr/TypeInstPtr. 
For instance, instead of: >> toop->klass()->is_subtype_of(other_toop->klass()) >> the new code is: >> toop->is_java_subtype_of(other_toop) >> >> - variants of the klass() accessors for narrower cases like >> TypeInstPtr::instance_klass() (returns _klass except if _klass is an >> interface in which case it returns Object), >> TypeOopPtr::unloaded_klass() (returns _klass but only when the klass >> is unloaded), TypeOopPtr::exact_klass() (returns _klass but only when >> the type is exact). >> >> When I tested this patch, for most changes in this patch, I had the >> previous logic, the new logic and a check that verified that they >> return the same result. I ran as much testing as I could that way. > > Roland Westrelin has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains eight commits: > > - Merge branch 'master' into JDK-8275201 > - review > - Merge branch 'master' into JDK-8275201 > - Merge branch 'master' into JDK-8275201 > - build fix > - Merge branch 'master' into JDK-8275201 > - whitespaces > - remove klass accessor Hi Roland, I don't mean that you should remove it. The HotSpot code style uses "avoid" instead of "forbid". I am curious what we gain when we use this feature. In some scenarios we operate on subclass pointers, e.g. in `flatten_alias_type()`: `const TypeInstPtr *to = tj->isa_instptr();`. We can save a `static_cast<>` for them. It is also more expressive, isn't it? ------------- PR: https://git.openjdk.java.net/jdk/pull/6717 From duke at openjdk.java.net Wed May 18 18:58:49 2022 From: duke at openjdk.java.net (Srinivas Vamsi Parasa) Date: Wed, 18 May 2022 18:58:49 GMT Subject: RFR: 8285868: x86_64 intrinsics for floating point methods isNaN, isFinite and isInfinite [v7] In-Reply-To: References: Message-ID: > We develop optimized x86_64 intrinsics for the floating point class check methods `isNaN()`, `isFinite()` and `isInfinite()` for Float and Double classes.
JMH benchmarks show ~6x improvement for `isNan()`, ~2x improvement for `isInfinite()` and 40% gain for `isFinite()` using` vfpclasss(s/d)` instructions. > > > JMH Benchmark (ns/op) Baseline This PR (WITH vfpclassss/sd) Speedup > > FloatClassCheck.testIsFinite 0.559 0.4 1.4x > FloatClassCheck.testIsInfinite 0.828 0.386 2.15x > FloatClassCheck.testIsNaN 2.589 0.387 6.7x > DoubleClassCheck.testIsFinite 0.568 0.414 1.37x > DoubleClassCheck.testIsInfinite 0.836 0.395 2.11x > DoubleClassCheck.testIsNaN 2.592 0.393 6.6x > > JMH Benchmark (ns/op) Baseline This PR (WITHOUT vfpclassss/sd) Speedup > FloatClassCheck.testIsFinite 0.561 0.468 1.2x > FloatClassCheck.testIsInfinite 0.793 0.491 1.61x > FloatClassCheck.testIsNaN 2.587 0.469 5.5x > DoubleClassCheck.testIsFinite 0.561 0.592 0.94x > DoubleClassCheck.testIsInfinite 0.828 0.592 1.4x > DoubleClassCheck.testIsNaN 2.593 0.594 4.4x Srinivas Vamsi Parasa has updated the pull request incrementally with one additional commit since the last revision: zero out the upper bits not written by setb ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/8459/files - new: https://git.openjdk.java.net/jdk/pull/8459/files/0fc8679c..41aa87fc Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=8459&range=06 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=8459&range=05-06 Stats: 3 lines in 1 file changed: 2 ins; 1 del; 0 mod Patch: https://git.openjdk.java.net/jdk/pull/8459.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8459/head:pull/8459 PR: https://git.openjdk.java.net/jdk/pull/8459 From jbhateja at openjdk.java.net Wed May 18 19:23:57 2022 From: jbhateja at openjdk.java.net (Jatin Bhateja) Date: Wed, 18 May 2022 19:23:57 GMT Subject: RFR: 8285868: x86_64 intrinsics for floating point methods isNaN, isFinite and isInfinite [v4] In-Reply-To: References: <5yRIuUsVJKpD4zIbZJFzxGlj5AsTHtf4cUDPB9BccJo=.daaae706-3e48-43ed-921c-6d0c6dcf864b@github.com> Message-ID: On Wed, 18 May 2022 07:44:25 GMT, 
Srinivas Vamsi Parasa wrote: >> @vamsi-parasa Other boolean producers such as `Conv2B` or `VectorTest` widen the byte value to 32 bits after `setb` so I believe a zeroing is necessary here. You accidentally achieve correct results in tests because the boolean stores only read the least significant 8 bits of the `dst` registers. Other operations such as `test` will read full `int` and may lead to incorrect results. >> >> @jatin-bhateja Boolean is just dressed-up int except in stores so most readers will consume 32 bits of the register. >> >> Thanks. > > @merykitty You might be right. `setb` was causing a weird JDK build error in `java.lang.HashMap` due to an error in `isNaN()`. It got fixed by clearing the higher bytes of the `dst` register which is being set by `setb`. > @vamsi-parasa Other boolean producers such as `Conv2B` or `VectorTest` widen the byte value to 32 bits after `setb` so I believe a zeroing is necessary here. You accidentally achieve correct results in tests because the boolean stores only read the least significant 8 bits of the `dst` registers. Other operations such as `test` will read full `int` and may lead to incorrect results. > > @jatin-bhateja Boolean is just dressed-up int except in stores so most readers will consume 32 bits of the register. > > Thanks. For explicit widening from byte to int, the compiler will emit an explicit zero-extension instruction (`movzbl`). There will not be any case where the byte is read directly, along with the unmodified bits, by a consumer.
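The concern raised in this exchange can be illustrated with a small simulation. This is a hedged sketch in plain Java, not HotSpot code: `setLowByte` is a hypothetical stand-in for an instruction that writes only the low 8 bits of a 32-bit register, the two `readAs...` helpers model the two kinds of consumers described above, and the stale register value is made up.

```java
public class SetbModel {
    // Models an instruction that writes only the low 8 bits of a 32-bit register
    // and leaves the upper 24 bits untouched (assumption for illustration).
    static int setLowByte(int reg, boolean flag) {
        return (reg & 0xFFFFFF00) | (flag ? 1 : 0);
    }

    // Models a consumer such as a boolean store: only the low byte is read.
    static boolean readAsByte(int reg) {
        return (reg & 0xFF) != 0;
    }

    // Models a consumer that tests the full 32-bit value.
    static boolean readAsInt(int reg) {
        return reg != 0;
    }

    public static void main(String[] args) {
        int stale = 0x12345600;                          // hypothetical leftover register contents
        int withoutZeroing = setLowByte(stale, false);   // upper bits survive
        int withZeroing = setLowByte(0, false);          // register cleared first

        System.out.println(readAsByte(withoutZeroing));  // byte-sized reader sees false
        System.out.println(readAsInt(withoutZeroing));   // full-width reader sees true
        System.out.println(readAsInt(withZeroing));      // zeroing makes both readers agree
    }
}
```

In this model a byte-sized reader is unaffected by stale upper bits, while a full-width reader is not, which matches the explanation given for why the tests passed before the upper bits were cleared.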
------------- PR: https://git.openjdk.java.net/jdk/pull/8459 From jbhateja at openjdk.java.net Wed May 18 19:31:43 2022 From: jbhateja at openjdk.java.net (Jatin Bhateja) Date: Wed, 18 May 2022 19:31:43 GMT Subject: RFR: 8285868: x86_64 intrinsics for floating point methods isNaN, isFinite and isInfinite [v6] In-Reply-To: References: Message-ID: On Wed, 18 May 2022 06:57:38 GMT, Srinivas Vamsi Parasa wrote: >> test/micro/org/openjdk/bench/java/lang/DoubleClassCheck.java line 70: >> >>> 68: public void testIsFinite() { >>> 69: for (int i = 0; i < BUFFER_SIZE; i++) { >>> 70: outputs[i] = Double.isFinite(inputs[i]) ? false : true; >> >> Any specific reason to explicitly add a conditional check to return true/false when isFinite already returns a boolean value? > > Initially, it was returning a boolean value and storing it in the output buffer. Vladimir suggested that the real-world use case of these methods is in conditions. Hence, the benchmarks were modified. Yes, that makes sense, but since this is a microbenchmark focused on the performance gain from this particular API, maybe you could also keep one standalone case. ------------- PR: https://git.openjdk.java.net/jdk/pull/8459 From sviswanathan at openjdk.java.net Wed May 18 21:53:08 2022 From: sviswanathan at openjdk.java.net (Sandhya Viswanathan) Date: Wed, 18 May 2022 21:53:08 GMT Subject: RFR: 8286972: Support the new loop induction variable related PopulateIndex IR node on x86 Message-ID: This PR adds x86 backend support for the new loop induction variable related PopulateIndex IR node. This IR node was added as part of [JDK-8280510](https://bugs.openjdk.java.net/browse/JDK-8280510). The performance numbers are as follows:

Before:
Benchmark                   (count)   Mode  Cnt       Score       Error  Units
IndexVector.exprWithIndex1    65536  thrpt    3   64556.552 ±  1126.396  ops/s
IndexVector.exprWithIndex2    65536  thrpt    3   22117.050 ± 11452.098  ops/s
IndexVector.indexArrayFill    65536  thrpt    3  117776.383 ±  1120.957  ops/s

After:
Benchmark                   (count)   Mode  Cnt       Score       Error  Units
IndexVector.exprWithIndex1    65536  thrpt    3  203180.290 ±  2147.807  ops/s
IndexVector.exprWithIndex2    65536  thrpt    3  274132.756 ±  6853.393  ops/s
IndexVector.indexArrayFill    65536  thrpt    3  374165.202 ± 46930.779  ops/s

Please review. Best Regards, Sandhya ------------- Commit messages: - fix 32 bit build - 8286972: Support the new loop induction variable related PopulateIndex IR node on x86 Changes: https://git.openjdk.java.net/jdk/pull/8778/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=8778&range=00 Issue: https://bugs.openjdk.java.net/browse/JDK-8286972 Stats: 152 lines in 3 files changed: 130 ins; 21 del; 1 mod Patch: https://git.openjdk.java.net/jdk/pull/8778.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8778/head:pull/8778 PR: https://git.openjdk.java.net/jdk/pull/8778 From vlivanov at openjdk.java.net Wed May 18 23:25:42 2022 From: vlivanov at openjdk.java.net (Vladimir Ivanov) Date: Wed, 18 May 2022 23:25:42 GMT Subject: RFR: 8284960: Integration of JEP 426: Vector API (Fourth Incubator) [v6] In-Reply-To: References: <15GChtdthFmu9Cup-Ykj5NBvAanOC8QOJsnhH9g20KY=.f35eba31-15f9-40e8-95ce-a54049792840@github.com> Message-ID: <-lS_36hGarJvCL26lgWyXJd-e2SuLD9g1wWL5PuoLXI=.5ddb3b74-493f-41df-8544-8a963c66fc5d@github.com> On Fri, 13 May 2022 08:24:21 GMT, Jatin Bhateja wrote: > LUT should be generated only if UsePopCountInstruction is false Should there be a `!UsePopCountInstruction` check then? > restrict the scope of flag to only scalar popcount operation Interesting. But the AArch64 code does cover vector cases, which just adds confusion.
------------- PR: https://git.openjdk.java.net/jdk/pull/8425 From vlivanov at openjdk.java.net Wed May 18 23:39:53 2022 From: vlivanov at openjdk.java.net (Vladimir Ivanov) Date: Wed, 18 May 2022 23:39:53 GMT Subject: RFR: 8284960: Integration of JEP 426: Vector API (Fourth Incubator) [v3] In-Reply-To: References: <15GChtdthFmu9Cup-Ykj5NBvAanOC8QOJsnhH9g20KY=.f35eba31-15f9-40e8-95ce-a54049792840@github.com> Message-ID: On Fri, 13 May 2022 08:24:24 GMT, Jatin Bhateja wrote: >> src/hotspot/share/opto/c2compiler.cpp line 521: >> >>> 519: if (!Matcher::match_rule_supported(Op_SignumF)) return false; >>> 520: break; >>> 521: case vmIntrinsics::_VectorComExp: >> >> Why `_VectorComExp` intrinsic is special? Other vector intrinsics are handled later and in a different manner. >> >> What about `ExpandV` case? > > It was an attempt to facilitate in-lining of these APIs over targets which do not intrinsify them. I agree its not a generic fix since three APIs are piggybacking on same entry point and without the knowledge of opcode it will be inappropriate to take any call at this place, lazy intrinsification gives opportunity for some of the predications to concertize as compilation happens under closed world assumptions. Still not clear why the code is shaped the way it is. `Matcher::match_rule_supported_vector()` already checks that there are relevant matching rules. The checks require both `CompressM` and `CompressV` to be present to enable the intrinsic. Is it important? Also, it doesn't take `EnableVectorSupport` into account while all other vector intrinsics respect it. >> src/hotspot/share/opto/compile.cpp line 3416: >> >>> 3414: >>> 3415: case Op_ReverseBytesV: >>> 3416: case Op_ReverseV: { >> >> Can you elaborate, please, why it is performed so late in the optimization phase (at the very end during graph reshaping) and not during GVN? 
> > Its more of a chicken-egg problem here, for masked reverse operation, Reverse IR node is followed by a Blend Node, thus in such a case doing an eager Identity transform in Reverse::Identity will not work, also deferring this to blend may also not work since it could be a non-masked reverse operation, we could have handled it as a special case in inline_vector_nary_operation, but handling such special case in final graph reshaping looked more appropriate. > > https://github.com/openjdk/panama-vector/pull/182#discussion_r845678080 Do you mean it's important to apply the transformation at the right node (pick the right node as the root) and it is hard to make a decision during GVN? ------------- PR: https://git.openjdk.java.net/jdk/pull/8425 From kvn at openjdk.java.net Wed May 18 23:40:34 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Wed, 18 May 2022 23:40:34 GMT Subject: RFR: 8286972: Support the new loop induction variable related PopulateIndex IR node on x86 In-Reply-To: References: Message-ID: On Wed, 18 May 2022 17:25:38 GMT, Sandhya Viswanathan wrote: > This PR adds x86 backend support for the new loop induction variable related PopulateIndex IR node. > This IR node was added as part of [JDK-8280510](https://bugs.openjdk.java.net/browse/JDK-8280510). > > The performance numbers are as follows: > Before: > Benchmark (count) Mode Cnt Score Error Units > IndexVector.exprWithIndex1 65536 thrpt 3 64556.552 ? 1126.396 ops/s > IndexVector.exprWithIndex2 65536 thrpt 3 22117.050 ? 11452.098 ops/s > IndexVector.indexArrayFill 65536 thrpt 3 117776.383 ? 1120.957 ops/s > > After: > Benchmark (count) Mode Cnt Score Error Units > IndexVector.exprWithIndex1 65536 thrpt 3 203180.290 ? 2147.807 ops/s > IndexVector.exprWithIndex2 65536 thrpt 3 274132.756 ? 6853.393 ops/s > IndexVector.indexArrayFill 65536 thrpt 3 374165.202 ? 46930.779 ops/s > > Please review. 
> > Best Regards, > Sandhya src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 2285: > 2283: assert((dst->encoding() < 16) && (src1->encoding() < 16) && (src2->encoding() < 16), > 2284: "XMM register should be 0-15"); > 2285: } This whole block could be under `#ifdef ASSERT`. ------------- PR: https://git.openjdk.java.net/jdk/pull/8778 From sviswanathan at openjdk.java.net Wed May 18 23:52:46 2022 From: sviswanathan at openjdk.java.net (Sandhya Viswanathan) Date: Wed, 18 May 2022 23:52:46 GMT Subject: RFR: 8283894: Intrinsify compress and expand bits on x86 [v5] In-Reply-To: References: Message-ID: On Wed, 18 May 2022 07:50:39 GMT, Jatin Bhateja wrote: >> Summary of changes: >> >> - Patch intrinsifies following newly added Java SE APIs >> - Integer.compress >> - Integer.expand >> - Long.compress >> - Long.expand >> >> - Adds C2 IR nodes and corresponding ideal transformations for new operations. >> - We see around ~10x performance speedup due to intrinsification over X86 target. >> - Adds an IR framework based test to validate newly introduced IR transformations. >> >> Kindly review and share your feedback. >> >> Best Regards, >> Jatin > > Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: > > 8283894: Updating test tag spec. Looks good to me. ------------- Marked as reviewed by sviswanathan (Reviewer). PR: https://git.openjdk.java.net/jdk/pull/8498 From kvn at openjdk.java.net Wed May 18 23:55:34 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Wed, 18 May 2022 23:55:34 GMT Subject: RFR: 8286972: Support the new loop induction variable related PopulateIndex IR node on x86 In-Reply-To: References: Message-ID: On Wed, 18 May 2022 17:25:38 GMT, Sandhya Viswanathan wrote: > This PR adds x86 backend support for the new loop induction variable related PopulateIndex IR node. > This IR node was added as part of [JDK-8280510](https://bugs.openjdk.java.net/browse/JDK-8280510). 
> > The performance numbers are as follows: > Before: > Benchmark (count) Mode Cnt Score Error Units > IndexVector.exprWithIndex1 65536 thrpt 3 64556.552 ± 1126.396 ops/s > IndexVector.exprWithIndex2 65536 thrpt 3 22117.050 ± 11452.098 ops/s > IndexVector.indexArrayFill 65536 thrpt 3 117776.383 ± 1120.957 ops/s > > After: > Benchmark (count) Mode Cnt Score Error Units > IndexVector.exprWithIndex1 65536 thrpt 3 203180.290 ± 2147.807 ops/s > IndexVector.exprWithIndex2 65536 thrpt 3 274132.756 ± 6853.393 ops/s > IndexVector.indexArrayFill 65536 thrpt 3 374165.202 ± 46930.779 ops/s > > Please review. > > Best Regards, > Sandhya src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 2282: > 2280: bool is_bw_supported = VM_Version::supports_avx512bw(); > 2281: if (is_bw && !is_bw_supported) { > 2282: assert(vlen_enc != Assembler::AVX_512bit, "required"); What are acceptable values of `vlen_enc`? src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 2314: > 2312: } else { > 2313: assert(vlen_enc != Assembler::AVX_512bit, "required"); > 2314: assert((dst->encoding() < 16),"XMM register should be 0-15"); The `} else {` case will also be executed on a KNL CPU. Did you test with `-XX:+UseKNLSetting`? src/hotspot/cpu/x86/x86.ad line 1475: > 1473: return false; > 1474: } > 1475: // fallthrough Please don't do `fallthrough` - `RoundVF` is not related to `PopulateIndex`. Why not `if (!is_LP64 || UseAVX < 2)`? Are there limitations on 32-bit, or do you not want to spend time on a non-major platform (which is also understandable)? src/hotspot/cpu/x86/x86.ad line 1821: > 1819: case Op_PopulateIndex: > 1820: if (size_in_bits > 256 && !VM_Version::supports_avx512bw()) > 1821: return false; Use `{}` according to our style.
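For readers following this review, the following is a hedged Java sketch of the kind of loop the PopulateIndex node is described as targeting: one where the loop induction variable itself is stored or used in an expression. The method names echo the benchmark names quoted above, but the bodies are assumptions for illustration; the actual `IndexVector` benchmark source is not shown in this thread.

```java
public class IndexFill {
    // Fills the array so that element i holds the value i
    // (assumed shape of an "index array fill" loop).
    static int[] indexArrayFill(int count) {
        int[] a = new int[count];
        for (int i = 0; i < count; i++) {
            a[i] = i;
        }
        return a;
    }

    // Uses the induction variable inside an arithmetic expression
    // (assumed shape of an "expression with index" loop).
    static int[] exprWithIndex(int[] in) {
        int[] out = new int[in.length];
        for (int i = 0; i < in.length; i++) {
            out[i] = in[i] * i;
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(java.util.Arrays.toString(indexArrayFill(4)));
        System.out.println(java.util.Arrays.toString(exprWithIndex(new int[] {5, 5, 5})));
    }
}
```

Whether such loops are actually vectorized depends on the platform checks discussed in the review (for example AVX level and AVX-512BW availability), so this sketch only shows the source-level pattern.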
------------- PR: https://git.openjdk.java.net/jdk/pull/8778 From duke at openjdk.java.net Thu May 19 00:06:00 2022 From: duke at openjdk.java.net (Srinivas Vamsi Parasa) Date: Thu, 19 May 2022 00:06:00 GMT Subject: RFR: 8285868: x86_64 intrinsics for floating point methods isNaN, isFinite and isInfinite [v8] In-Reply-To: References: Message-ID: > We develop optimized x86_64 intrinsics for the floating point class check methods `isNaN()`, `isFinite()` and `IsInfinite()` for Float and Double classes. JMH benchmarks show ~6x improvement for `isNan()`, ~2x improvement for `isInfinite()` and 40% gain for `isFinite()` using` vfpclasss(s/d)` instructions. > > > JMH Benchmark (ns/op) Baseline This PR (WITH vfpclassss/sd) Speedup > > FloatClassCheck.testIsFinite 0.559 0.4 1.4x > FloatClassCheck.testIsInfinite 0.828 0.386 2.15x > FloatClassCheck.testIsNaN 2.589 0.387 6.7x > DoubleClassCheck.testIsFinite 0.568 0.414 1.37x > DoubleClassCheck.testIsInfinite 0.836 0.395 2.11x > DoubleClassCheck.testIsNaN 2.592 0.393 6.6x > > JMH Benchmark (ns/op) Baseline This PR (WITHOUT vfpclassss/sd) Speedup > FloatClassCheck.testIsFinite 0.561 0.468 1.2x > FloatClassCheck.testIsInfinite 0.793 0.491 1.61x > FloatClassCheck.testIsNaN 2.587 0.469 5.5x > DoubleClassCheck.testIsFinite 0.561 0.592 0.94x > DoubleClassCheck.testIsInfinite 0.828 0.592 1.4x > DoubleClassCheck.testIsNaN 2.593 0.594 4.4x Srinivas Vamsi Parasa has updated the pull request with a new target base due to a merge or a rebase. 
The pull request now contains ten commits: - add comment for vfpclasss/d for isFinite() - Merge branch 'master' of https://git.openjdk.java.net/jdk into float - zero out the upper bits not written by setb - use 0x1 to be simpler - remove the redundant temp register - Split the macros using predicate - update jmh tests - Merge branch 'master' into float - 8285868: x86_64 intrinsics for floating point methods isNaN, isFinite and isInfinite ------------- Changes: https://git.openjdk.java.net/jdk/pull/8459/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=8459&range=07 Stats: 772 lines in 20 files changed: 770 ins; 0 del; 2 mod Patch: https://git.openjdk.java.net/jdk/pull/8459.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8459/head:pull/8459 PR: https://git.openjdk.java.net/jdk/pull/8459 From sviswanathan at openjdk.java.net Thu May 19 00:07:40 2022 From: sviswanathan at openjdk.java.net (Sandhya Viswanathan) Date: Thu, 19 May 2022 00:07:40 GMT Subject: RFR: 8286972: Support the new loop induction variable related PopulateIndex IR node on x86 In-Reply-To: References: Message-ID: <7iZwwMFnJyA0fZ-exsX3xI3YINMPi8plVG2ycDuE6lA=.0bf546f2-96cf-405b-b58e-5b5689f9b6d8@github.com> On Wed, 18 May 2022 23:38:25 GMT, Vladimir Kozlov wrote: >> This PR adds x86 backend support for the new loop induction variable related PopulateIndex IR node. >> This IR node was added as part of [JDK-8280510](https://bugs.openjdk.java.net/browse/JDK-8280510). >> >> The performance numbers are as follows: >> Before: >> Benchmark (count) Mode Cnt Score Error Units >> IndexVector.exprWithIndex1 65536 thrpt 3 64556.552 ? 1126.396 ops/s >> IndexVector.exprWithIndex2 65536 thrpt 3 22117.050 ? 11452.098 ops/s >> IndexVector.indexArrayFill 65536 thrpt 3 117776.383 ? 1120.957 ops/s >> >> After: >> Benchmark (count) Mode Cnt Score Error Units >> IndexVector.exprWithIndex1 65536 thrpt 3 203180.290 ? 2147.807 ops/s >> IndexVector.exprWithIndex2 65536 thrpt 3 274132.756 ? 
6853.393 ops/s >> IndexVector.indexArrayFill 65536 thrpt 3 374165.202 ? 46930.779 ops/s >> >> Please review. >> >> Best Regards, >> Sandhya > > src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 2282: > >> 2280: bool is_bw_supported = VM_Version::supports_avx512bw(); >> 2281: if (is_bw && !is_bw_supported) { >> 2282: assert(vlen_enc != Assembler::AVX_512bit, "required"); > > What are acceptable values of `vlen_enc`? For KNL, PopulateIndex support is limited to 256-bit as we need avx512bw() for the 512-bit support. For other AVX2 and AVX512 architectures, all vector widths up to and including 512-bit are supported. ------------- PR: https://git.openjdk.java.net/jdk/pull/8778 From sviswanathan at openjdk.java.net Thu May 19 00:13:47 2022 From: sviswanathan at openjdk.java.net (Sandhya Viswanathan) Date: Thu, 19 May 2022 00:13:47 GMT Subject: RFR: 8286972: Support the new loop induction variable related PopulateIndex IR node on x86 In-Reply-To: References: Message-ID: On Wed, 18 May 2022 23:51:07 GMT, Vladimir Kozlov wrote: >> This PR adds x86 backend support for the new loop induction variable related PopulateIndex IR node. >> This IR node was added as part of [JDK-8280510](https://bugs.openjdk.java.net/browse/JDK-8280510). >> >> The performance numbers are as follows: >> Before: >> Benchmark (count) Mode Cnt Score Error Units >> IndexVector.exprWithIndex1 65536 thrpt 3 64556.552 ? 1126.396 ops/s >> IndexVector.exprWithIndex2 65536 thrpt 3 22117.050 ? 11452.098 ops/s >> IndexVector.indexArrayFill 65536 thrpt 3 117776.383 ? 1120.957 ops/s >> >> After: >> Benchmark (count) Mode Cnt Score Error Units >> IndexVector.exprWithIndex1 65536 thrpt 3 203180.290 ? 2147.807 ops/s >> IndexVector.exprWithIndex2 65536 thrpt 3 274132.756 ? 6853.393 ops/s >> IndexVector.indexArrayFill 65536 thrpt 3 374165.202 ? 46930.779 ops/s >> >> Please review. 
>> >> Best Regards, >> Sandhya > > src/hotspot/cpu/x86/x86.ad line 1475: > >> 1473: return false; >> 1474: } >> 1475: // fallthrough > > Please, don't do `fallthrough` - `RoundVF` is not related to `PopulateIndex`. Why not `if (!is_LP64 || UseAVX < 2)`? > Are there limitations in 32 bits or you don't want spend time on not major platform (which is also understandable)? It is not a major platform so I didn't spend time on it. ------------- PR: https://git.openjdk.java.net/jdk/pull/8778 From sviswanathan at openjdk.java.net Thu May 19 00:25:33 2022 From: sviswanathan at openjdk.java.net (Sandhya Viswanathan) Date: Thu, 19 May 2022 00:25:33 GMT Subject: RFR: 8286972: Support the new loop induction variable related PopulateIndex IR node on x86 [v2] In-Reply-To: References: Message-ID: <0gCuEnJHcbS5WwOKcRAt6rZy8bhoX2Rsxpd06GW2-p8=.c87668b8-73a1-42a0-b6d5-9e52ca092cc0@github.com> > This PR adds x86 backend support for the new loop induction variable related PopulateIndex IR node. > This IR node was added as part of [JDK-8280510](https://bugs.openjdk.java.net/browse/JDK-8280510). > > The performance numbers are as follows: > Before: > Benchmark (count) Mode Cnt Score Error Units > IndexVector.exprWithIndex1 65536 thrpt 3 64556.552 ? 1126.396 ops/s > IndexVector.exprWithIndex2 65536 thrpt 3 22117.050 ? 11452.098 ops/s > IndexVector.indexArrayFill 65536 thrpt 3 117776.383 ? 1120.957 ops/s > > After: > Benchmark (count) Mode Cnt Score Error Units > IndexVector.exprWithIndex1 65536 thrpt 3 203180.290 ? 2147.807 ops/s > IndexVector.exprWithIndex2 65536 thrpt 3 274132.756 ? 6853.393 ops/s > IndexVector.indexArrayFill 65536 thrpt 3 374165.202 ? 46930.779 ops/s > > Please review. 
> > Best Regards, > Sandhya Sandhya Viswanathan has updated the pull request incrementally with one additional commit since the last revision: review comment resolution ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/8778/files - new: https://git.openjdk.java.net/jdk/pull/8778/files/a21939ea..ab07fae9 Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=8778&range=01 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=8778&range=00-01 Stats: 6 lines in 2 files changed: 3 ins; 0 del; 3 mod Patch: https://git.openjdk.java.net/jdk/pull/8778.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8778/head:pull/8778 PR: https://git.openjdk.java.net/jdk/pull/8778 From pli at openjdk.java.net Thu May 19 00:56:34 2022 From: pli at openjdk.java.net (Pengfei Li) Date: Thu, 19 May 2022 00:56:34 GMT Subject: RFR: 8286972: Support the new loop induction variable related PopulateIndex IR node on x86 [v2] In-Reply-To: <0gCuEnJHcbS5WwOKcRAt6rZy8bhoX2Rsxpd06GW2-p8=.c87668b8-73a1-42a0-b6d5-9e52ca092cc0@github.com> References: <0gCuEnJHcbS5WwOKcRAt6rZy8bhoX2Rsxpd06GW2-p8=.c87668b8-73a1-42a0-b6d5-9e52ca092cc0@github.com> Message-ID: On Thu, 19 May 2022 00:25:33 GMT, Sandhya Viswanathan wrote: >> This PR adds x86 backend support for the new loop induction variable related PopulateIndex IR node. >> This IR node was added as part of [JDK-8280510](https://bugs.openjdk.java.net/browse/JDK-8280510). >> >> The performance numbers are as follows: >> Before: >> Benchmark (count) Mode Cnt Score Error Units >> IndexVector.exprWithIndex1 65536 thrpt 3 64556.552 ? 1126.396 ops/s >> IndexVector.exprWithIndex2 65536 thrpt 3 22117.050 ? 11452.098 ops/s >> IndexVector.indexArrayFill 65536 thrpt 3 117776.383 ? 1120.957 ops/s >> >> After: >> Benchmark (count) Mode Cnt Score Error Units >> IndexVector.exprWithIndex1 65536 thrpt 3 203180.290 ? 2147.807 ops/s >> IndexVector.exprWithIndex2 65536 thrpt 3 274132.756 ? 
6853.393 ops/s
>> IndexVector.indexArrayFill 65536 thrpt 3 374165.202 ± 46930.779 ops/s
>>
>> Please review.
>>
>> Best Regards,
>> Sandhya
>
> Sandhya Viswanathan has updated the pull request incrementally with one additional commit since the last revision:
>
>   review comment resolution

src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 2278:

> 2276:
> 2277: void C2_MacroAssembler::vpadd(BasicType elem_bt, XMMRegister dst, XMMRegister src1, XMMRegister src2, int vlen_enc) {
> 2278:   assert(UseAVX >= 2, "required");

Why not include this line in the #ifdef?

-------------

PR: https://git.openjdk.java.net/jdk/pull/8778

From xgong at openjdk.java.net Thu May 19 01:51:41 2022
From: xgong at openjdk.java.net (Xiaohong Gong)
Date: Thu, 19 May 2022 01:51:41 GMT
Subject: RFR: 8286847: Rotate vectors don't support byte or short [v3]
In-Reply-To:
References:
Message-ID: <9zUsdzzbL3abbWSfrBcU3pqp8sPIJhP2B08nKAmMBZE=.381b9c9e-5111-45c5-aa2a-fd3ce502d6b6@github.com>

On Wed, 18 May 2022 07:00:53 GMT, Haomin wrote:

>> src/hotspot/share/opto/vectornode.cpp line 158:
>>
>>> 156:   default: return 0; // RotateLeftV for byte, short values produces incorrect Java result.
>>> 157:                      // Because java code should convert a byte, short value into int value,
>>> 158:                      // and then do RotateI.
>>
>> I'm afraid this will influence the VectorAPI intrinsification for `ByteVector` and `ShortVector`. Did you test the benchmarks for these two APIs with the subword type?
>
> I have tested `hotspot_vector_1` and `jdk_vector`, all pass.

Do you mean the benchmarks?
Currently the vectorapi benchmarks only exist on panama-vector, please see: https://github.com/openjdk/panama-vector/tree/vectorIntrinsics/test/micro/org/openjdk/bench/jdk/incubator/vector/operation ------------- PR: https://git.openjdk.java.net/jdk/pull/8740 From duke at openjdk.java.net Thu May 19 02:22:36 2022 From: duke at openjdk.java.net (Quan Anh Mai) Date: Thu, 19 May 2022 02:22:36 GMT Subject: RFR: 8285868: x86_64 intrinsics for floating point methods isNaN, isFinite and isInfinite [v8] In-Reply-To: References: Message-ID: On Thu, 19 May 2022 00:06:00 GMT, Srinivas Vamsi Parasa wrote: >> We develop optimized x86_64 intrinsics for the floating point class check methods `isNaN()`, `isFinite()` and `IsInfinite()` for Float and Double classes. JMH benchmarks show ~6x improvement for `isNan()`, ~2x improvement for `isInfinite()` and 40% gain for `isFinite()` using` vfpclasss(s/d)` instructions. >> >> >> JMH Benchmark (ns/op) Baseline This PR (WITH vfpclassss/sd) Speedup >> >> FloatClassCheck.testIsFinite 0.559 0.4 1.4x >> FloatClassCheck.testIsInfinite 0.828 0.386 2.15x >> FloatClassCheck.testIsNaN 2.589 0.387 6.7x >> DoubleClassCheck.testIsFinite 0.568 0.414 1.37x >> DoubleClassCheck.testIsInfinite 0.836 0.395 2.11x >> DoubleClassCheck.testIsNaN 2.592 0.393 6.6x >> >> JMH Benchmark (ns/op) Baseline This PR (WITHOUT vfpclassss/sd) Speedup >> FloatClassCheck.testIsFinite 0.561 0.468 1.2x >> FloatClassCheck.testIsInfinite 0.793 0.491 1.61x >> FloatClassCheck.testIsNaN 2.587 0.469 5.5x >> DoubleClassCheck.testIsFinite 0.561 0.592 0.94x >> DoubleClassCheck.testIsInfinite 0.828 0.592 1.4x >> DoubleClassCheck.testIsNaN 2.593 0.594 4.4x > > Srinivas Vamsi Parasa has updated the pull request with a new target base due to a merge or a rebase. 
The pull request now contains ten commits:
>
> - add comment for vfpclasss/d for isFinite()
> - Merge branch 'master' of https://git.openjdk.java.net/jdk into float
> - zero out the upper bits not written by setb
> - use 0x1 to be simpler
> - remove the redundant temp register
> - Split the macros using predicate
> - update jmh tests
> - Merge branch 'master' into float
> - 8285868: x86_64 intrinsics for floating point methods isNaN, isFinite and isInfinite

Please remove `isNaN` intrinsics in favour of #8525 .

Also, you should not use `andl(dst, 0xff)` to zero out the upper bits of `dst`, since it is a 32-bit read following an 8-bit write, which constitutes a partial register stall.

Refer to section 3.5.2.4, Partial register stalls, of the Intel® 64 and IA-32 Architectures Optimization Reference Manual:

> A partial register stall happens when an instruction refers to a register, portions of which were previously modified by other instructions.

There are 2 options worth considering here:

- Zeroing the register before the `setb` instruction, referring to the same section:

  > For optimal performance, use of zero idioms, before the use of the register, eliminates the need for partial register merge micro-ops

  This is preferable since it does not contribute an execution uop in the backend (but still consumes a slot in the decoder and uop cache).
- Zero extending the register after the `setb` instruction. This is less optimal since it adds the latency of the zero extension and a real executed uop in the backend.

Thanks.
-------------

PR: https://git.openjdk.java.net/jdk/pull/8459

From kvn at openjdk.java.net Thu May 19 02:42:38 2022
From: kvn at openjdk.java.net (Vladimir Kozlov)
Date: Thu, 19 May 2022 02:42:38 GMT
Subject: RFR: 8286972: Support the new loop induction variable related PopulateIndex IR node on x86 [v2]
In-Reply-To: <0gCuEnJHcbS5WwOKcRAt6rZy8bhoX2Rsxpd06GW2-p8=.c87668b8-73a1-42a0-b6d5-9e52ca092cc0@github.com>
References: <0gCuEnJHcbS5WwOKcRAt6rZy8bhoX2Rsxpd06GW2-p8=.c87668b8-73a1-42a0-b6d5-9e52ca092cc0@github.com>
Message-ID:

On Thu, 19 May 2022 00:25:33 GMT, Sandhya Viswanathan wrote:

>> This PR adds x86 backend support for the new loop induction variable related PopulateIndex IR node.
>> This IR node was added as part of [JDK-8280510](https://bugs.openjdk.java.net/browse/JDK-8280510).
>>
>> The performance numbers are as follows:
>> Before:
>> Benchmark (count) Mode Cnt Score Error Units
>> IndexVector.exprWithIndex1 65536 thrpt 3 64556.552 ± 1126.396 ops/s
>> IndexVector.exprWithIndex2 65536 thrpt 3 22117.050 ± 11452.098 ops/s
>> IndexVector.indexArrayFill 65536 thrpt 3 117776.383 ± 1120.957 ops/s
>>
>> After:
>> Benchmark (count) Mode Cnt Score Error Units
>> IndexVector.exprWithIndex1 65536 thrpt 3 203180.290 ± 2147.807 ops/s
>> IndexVector.exprWithIndex2 65536 thrpt 3 274132.756 ± 6853.393 ops/s
>> IndexVector.indexArrayFill 65536 thrpt 3 374165.202 ± 46930.779 ops/s
>>
>> Please review.
>>
>> Best Regards,
>> Sandhya
>
> Sandhya Viswanathan has updated the pull request incrementally with one additional commit since the last revision:
>
>   review comment resolution

Looks good. Let me test it.
------------- PR: https://git.openjdk.java.net/jdk/pull/8778 From kvn at openjdk.java.net Thu May 19 02:42:39 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Thu, 19 May 2022 02:42:39 GMT Subject: RFR: 8286972: Support the new loop induction variable related PopulateIndex IR node on x86 [v2] In-Reply-To: References: <0gCuEnJHcbS5WwOKcRAt6rZy8bhoX2Rsxpd06GW2-p8=.c87668b8-73a1-42a0-b6d5-9e52ca092cc0@github.com> Message-ID: On Thu, 19 May 2022 00:44:42 GMT, Pengfei Li wrote: >> Sandhya Viswanathan has updated the pull request incrementally with one additional commit since the last revision: >> >> review comment resolution > > src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 2278: > >> 2276: >> 2277: void C2_MacroAssembler::vpadd(BasicType elem_bt, XMMRegister dst, XMMRegister src1, XMMRegister src2, int vlen_enc) { >> 2278: assert(UseAVX >= 2, "required"); > > Why not include this line in #ifdef ? It does not matter since it is assert which add code only in debug VM. I like this way. ------------- PR: https://git.openjdk.java.net/jdk/pull/8778 From kvn at openjdk.java.net Thu May 19 02:42:40 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Thu, 19 May 2022 02:42:40 GMT Subject: RFR: 8286972: Support the new loop induction variable related PopulateIndex IR node on x86 [v2] In-Reply-To: <7iZwwMFnJyA0fZ-exsX3xI3YINMPi8plVG2ycDuE6lA=.0bf546f2-96cf-405b-b58e-5b5689f9b6d8@github.com> References: <7iZwwMFnJyA0fZ-exsX3xI3YINMPi8plVG2ycDuE6lA=.0bf546f2-96cf-405b-b58e-5b5689f9b6d8@github.com> Message-ID: On Thu, 19 May 2022 00:04:05 GMT, Sandhya Viswanathan wrote: >> src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 2282: >> >>> 2280: bool is_bw_supported = VM_Version::supports_avx512bw(); >>> 2281: if (is_bw && !is_bw_supported) { >>> 2282: assert(vlen_enc != Assembler::AVX_512bit, "required"); >> >> What are acceptable values of `vlen_enc`? > > For KNL, PopulateIndex support is limited to 256-bit as we need avx512bw() for the 512-bit support. 
> For other AVX2 and AVX512 architectures, all vector widths up to and including 512-bit are supported. Okay. ------------- PR: https://git.openjdk.java.net/jdk/pull/8778 From kvn at openjdk.java.net Thu May 19 03:02:49 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Thu, 19 May 2022 03:02:49 GMT Subject: RFR: 8286972: Support the new loop induction variable related PopulateIndex IR node on x86 [v2] In-Reply-To: <0gCuEnJHcbS5WwOKcRAt6rZy8bhoX2Rsxpd06GW2-p8=.c87668b8-73a1-42a0-b6d5-9e52ca092cc0@github.com> References: <0gCuEnJHcbS5WwOKcRAt6rZy8bhoX2Rsxpd06GW2-p8=.c87668b8-73a1-42a0-b6d5-9e52ca092cc0@github.com> Message-ID: On Thu, 19 May 2022 00:25:33 GMT, Sandhya Viswanathan wrote: >> This PR adds x86 backend support for the new loop induction variable related PopulateIndex IR node. >> This IR node was added as part of [JDK-8280510](https://bugs.openjdk.java.net/browse/JDK-8280510). >> >> The performance numbers are as follows: >> Before: >> Benchmark (count) Mode Cnt Score Error Units >> IndexVector.exprWithIndex1 65536 thrpt 3 64556.552 ? 1126.396 ops/s >> IndexVector.exprWithIndex2 65536 thrpt 3 22117.050 ? 11452.098 ops/s >> IndexVector.indexArrayFill 65536 thrpt 3 117776.383 ? 1120.957 ops/s >> >> After: >> Benchmark (count) Mode Cnt Score Error Units >> IndexVector.exprWithIndex1 65536 thrpt 3 203180.290 ? 2147.807 ops/s >> IndexVector.exprWithIndex2 65536 thrpt 3 274132.756 ? 6853.393 ops/s >> IndexVector.indexArrayFill 65536 thrpt 3 374165.202 ? 46930.779 ops/s >> >> Please review. >> >> Best Regards, >> Sandhya > > Sandhya Viswanathan has updated the pull request incrementally with one additional commit since the last revision: > > review comment resolution @sviswa7 Can you add IR framework test to verify generation of PopulateIndex node? And regression test. I see that [8280510](https://bugs.openjdk.java.net/browse/JDK-8280510) added only microbenchmark. 
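
For readers who have not seen the microbenchmark, here is a minimal Java sketch of the kind of loop the PopulateIndex node is about: the loop induction variable itself is stored into, or combined with, array elements. The class and method names below are hypothetical and only loosely modelled on the benchmark names quoted in this thread; the actual benchmark bodies are not shown here and may differ. The sketch checks only the functional result. Whether C2 emits PopulateIndex for it depends on platform and flags, and verifying that would need an IR framework test as requested above.

```java
// Hypothetical illustration, not the JDK microbenchmark itself.
final class IndexFillSketch {

    // Stores the induction variable into each element: a[i] = i.
    static int[] indexArrayFill(int count) {
        int[] a = new int[count];
        for (int i = 0; i < count; i++) {
            a[i] = i;
        }
        return a;
    }

    // Combines each input element with the induction variable: out[i] = in[i] * i.
    static int[] exprWithIndex(int[] in) {
        int[] out = new int[in.length];
        for (int i = 0; i < in.length; i++) {
            out[i] = in[i] * i;
        }
        return out;
    }

    public static void main(String[] args) {
        int[] filled = indexArrayFill(8);
        int[] scaled = exprWithIndex(new int[] {3, 3, 3, 3});
        System.out.println(java.util.Arrays.toString(filled));
        System.out.println(java.util.Arrays.toString(scaled));
    }
}
```

A regression test would compare the output of such loops against a scalar reference so that a wrong vector lane value is caught regardless of which vector width is selected.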
------------- PR: https://git.openjdk.java.net/jdk/pull/8778 From sviswanathan at openjdk.java.net Thu May 19 03:15:38 2022 From: sviswanathan at openjdk.java.net (Sandhya Viswanathan) Date: Thu, 19 May 2022 03:15:38 GMT Subject: RFR: 8286972: Support the new loop induction variable related PopulateIndex IR node on x86 [v2] In-Reply-To: References: <0gCuEnJHcbS5WwOKcRAt6rZy8bhoX2Rsxpd06GW2-p8=.c87668b8-73a1-42a0-b6d5-9e52ca092cc0@github.com> Message-ID: On Thu, 19 May 2022 02:59:10 GMT, Vladimir Kozlov wrote: >> Sandhya Viswanathan has updated the pull request incrementally with one additional commit since the last revision: >> >> review comment resolution > > @sviswa7 Can you add IR framework test to verify generation of PopulateIndex node? And regression test. > I see that [8280510](https://bugs.openjdk.java.net/browse/JDK-8280510) added only microbenchmark. @vnkozlov I will look into adding the IR framework and regression test. > src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 2314: > >> 2312: } else { >> 2313: assert(vlen_enc != Assembler::AVX_512bit, "required"); >> 2314: assert((dst->encoding() < 16),"XMM register should be 0-15"); > > The `} else {` case will be also executed on on KNL CPU. Did you tested with `-XX:+UseKNLSetting`? Yes, this part will be executed on KNL CPU. I did run the compiler tests with UseKNLSetting and didn't see any issue. ------------- PR: https://git.openjdk.java.net/jdk/pull/8778 From duke at openjdk.java.net Thu May 19 04:31:57 2022 From: duke at openjdk.java.net (Srinivas Vamsi Parasa) Date: Thu, 19 May 2022 04:31:57 GMT Subject: RFR: 8285868: x86_64 intrinsics for floating point methods isNaN, isFinite and isInfinite [v8] In-Reply-To: References: Message-ID: On Thu, 19 May 2022 02:19:09 GMT, Quan Anh Mai wrote: > Please remove `isNaN` intrinsics in favour of #8525 . 
>
> Also, you should not use `andl(dst, 0xff)` to zero out the upper bits of `dst`, since it is a 32-bit read following an 8-bit write, which constitutes a partial register stall.
>
> Refer to section 3.5.2.4, Partial register stalls, of the Intel® 64 and IA-32 Architectures Optimization Reference Manual:
>
> > A partial register stall happens when an instruction refers to a register, portions of which were previously modified by other instructions.
>
> There are 2 options worth considering here:
>
> * Zeroing the register before the `setb` instruction, referring to the same section:
>
>   > For optimal performance, use of zero idioms, before the use of the register, eliminates the need for partial register merge micro-ops
>
>   This is preferable since it does not contribute an execution uop in the backend (but still consumes a slot in the decoder and uop cache).
> * Zero extending the register after the `setb` instruction. This is less optimal since it adds the latency of the zero extension and a real executed uop in the backend.
>
> Thanks.

#8525 seems to be eliminating the flags register fixup for `IsNaN()`. These intrinsics can show speedup over `ucomiss` instructions. Also, the intrinsic can be used for future vectorization. So, we can keep the `IsNaN()` intrinsic along with your improvement. Both are orthogonal, not mutually exclusive.

Actually, `andl(dst, 0xff)` is giving speedup over zeroing out the register before `setb`. Also, would a 32-bit logical `and` of all bits cause the problem you mentioned?
------------- PR: https://git.openjdk.java.net/jdk/pull/8459 From duke at openjdk.java.net Thu May 19 04:31:58 2022 From: duke at openjdk.java.net (Srinivas Vamsi Parasa) Date: Thu, 19 May 2022 04:31:58 GMT Subject: RFR: 8285868: x86_64 intrinsics for floating point methods isNaN, isFinite and isInfinite [v6] In-Reply-To: References: Message-ID: On Wed, 18 May 2022 19:28:21 GMT, Jatin Bhateja wrote: >> Initially, it was returning a boolean value and storing it in the output buffer. Vladimir suggested that the realword usecase of these methods is in conditions. Hence, the benchmarks were modified. > > Yes, that makes sense, but being a micro benchmark we micro focusing on perf gain due to this particular API, may be you can have one stand alone case also. Sure, will add the standalone case also. ------------- PR: https://git.openjdk.java.net/jdk/pull/8459 From duke at openjdk.java.net Thu May 19 05:24:43 2022 From: duke at openjdk.java.net (Quan Anh Mai) Date: Thu, 19 May 2022 05:24:43 GMT Subject: RFR: 8285868: x86_64 intrinsics for floating point methods isNaN, isFinite and isInfinite [v8] In-Reply-To: References: Message-ID: On Thu, 19 May 2022 00:06:00 GMT, Srinivas Vamsi Parasa wrote: >> We develop optimized x86_64 intrinsics for the floating point class check methods `isNaN()`, `isFinite()` and `IsInfinite()` for Float and Double classes. JMH benchmarks show ~6x improvement for `isNan()`, ~2x improvement for `isInfinite()` and 40% gain for `isFinite()` using` vfpclasss(s/d)` instructions. 
>> >> >> JMH Benchmark (ns/op) Baseline This PR (WITH vfpclassss/sd) Speedup >> >> FloatClassCheck.testIsFinite 0.559 0.4 1.4x >> FloatClassCheck.testIsInfinite 0.828 0.386 2.15x >> FloatClassCheck.testIsNaN 2.589 0.387 6.7x >> DoubleClassCheck.testIsFinite 0.568 0.414 1.37x >> DoubleClassCheck.testIsInfinite 0.836 0.395 2.11x >> DoubleClassCheck.testIsNaN 2.592 0.393 6.6x >> >> JMH Benchmark (ns/op) Baseline This PR (WITHOUT vfpclassss/sd) Speedup >> FloatClassCheck.testIsFinite 0.561 0.468 1.2x >> FloatClassCheck.testIsInfinite 0.793 0.491 1.61x >> FloatClassCheck.testIsNaN 2.587 0.469 5.5x >> DoubleClassCheck.testIsFinite 0.561 0.592 0.94x >> DoubleClassCheck.testIsInfinite 0.828 0.592 1.4x >> DoubleClassCheck.testIsNaN 2.593 0.594 4.4x > > Srinivas Vamsi Parasa has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains ten commits: > > - add comment for vfpclasss/d for isFinite() > - Merge branch 'master' of https://git.openjdk.java.net/jdk into float > - zero out the upper bits not written by setb > - use 0x1 to be simpler > - remove the redundant temp register > - Split the macros using predicate > - update jmh tests > - Merge branch 'master' into float > - 8285868: x86_64 intrinsics for floating point methods isNaN, isFinite and isInfinite It sounds strange, please show the asm of your patch with respects to the benchmark. Also, please try cmoving with other arbitrary values such as 19 and 7 instead of `false` and `true`. The latter may be recognised as simple boolean not operation, remove the real comparison part, which defeats the purpose of Vladimir's suggestion. Regarding vectorisation, `isNaN` is a simple comparison and can be easily auto-vectorised without help from intrinsics. My speculation: A native comparison such as `x != x` can be parsed directly by the compiler. 
As a result, the graph of the expression `if (x != x)` is simply

    CmpF
     |
    Bool
     |
     If

Your intrinsics, on the other hand, do not return the results on the flags, which leads to an extra comparison when used in conditions: `if(isNaN(x))` becomes

    IsNaN  0
       \  /
       CmpI
        |
       Bool
        |
        If

In your benchmark, however, where this comparison is used to cmove between 0 and 1 (`false` and `true`), the compiler recognised the pattern `x != 0 ? 0 : 1` with `x` having the type of `TypeInt::BOOL`. As a result, it reduces the graph into

    IsNaN  1
       \  /
       XorI

Personally, I'm not into this implementation of intrinsics. FYI, gcc and clang both use sequences similar to `x != x` for `std::isnan`, `Math.abs(x) <= MAX_VALUE` for `std::isfinite` and `Math.abs(x) > MAX_VALUE` for `std::isinf`. The first one reduces to a single instruction `ucomiss x, x` so there is no reason to optimise further. The others are compiled down to 2 instructions each `vandpd t, x, [SIGN_ELIMINATE]; ucomiss t, [MAX_VALUE]`, so optimising these further requires careful assessment.

If you feel comfortable I would suggest you build the graph for these intrinsics as

      X
      |
    Bool  1  0
       \  |  /
       CMove

Then we can add ideal rules to `BoolNode` to recognise the patterns

      X
      |
    Bool  1  0
       \  |  /
       CMove  0
          \  /
          CmpI
           |
          Bool

And reduce them to

      X
      |
    Bool

With this, we can have the `Double::isInfinite` intrinsics compiled down to `vfpclass k, x; ktest k`, which is much preferable. For non-AVX512DQ though I would prefer implementing them in Java similar to what is described above. Both abs and comparison nodes are not hard to vectorise so it would not be a problem.

Thanks a lot.
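
The comparison-based formulations mentioned above can be sketched in plain Java as follows. This is only an illustration of the three patterns (`x != x`, `Math.abs(x) <= MAX_VALUE`, `Math.abs(x) > MAX_VALUE`); the class and method names are hypothetical, and nothing here claims to be how the JDK or the PR implements these methods.

```java
// Hypothetical sketch of the comparison-based class checks discussed above.
final class FpClassChecks {

    // NaN is the only value that compares unequal to itself.
    static boolean isNaN(double x) {
        return x != x;
    }

    // Finite values have magnitude at most MAX_VALUE; NaN fails the comparison.
    static boolean isFinite(double x) {
        return Math.abs(x) <= Double.MAX_VALUE;
    }

    // Only +/- infinity has magnitude above MAX_VALUE; NaN fails the comparison.
    static boolean isInfinite(double x) {
        return Math.abs(x) > Double.MAX_VALUE;
    }

    public static void main(String[] args) {
        double[] samples = {
            0.0, -1.5, Double.MIN_VALUE, Double.MAX_VALUE,
            Double.POSITIVE_INFINITY, Double.NEGATIVE_INFINITY, Double.NaN
        };
        for (double x : samples) {
            // Cross-check each sketch against the library method.
            if (isNaN(x) != Double.isNaN(x)
                    || isFinite(x) != Double.isFinite(x)
                    || isInfinite(x) != Double.isInfinite(x)) {
                throw new AssertionError("mismatch for " + x);
            }
        }
        System.out.println("all checks agree");
    }
}
```

Because every comparison involving NaN is false, the `isFinite` and `isInfinite` forms need no separate NaN handling, which is why each can be a single abs followed by one compare.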
-------------

PR: https://git.openjdk.java.net/jdk/pull/8459

From duke at openjdk.java.net Thu May 19 05:51:11 2022
From: duke at openjdk.java.net (yuta)
Date: Thu, 19 May 2022 05:51:11 GMT
Subject: RFR: 8286990: Add compiler name to warning messages in Compiler Directive
Message-ID:

When using a Compiler Directive such as `java -XX:+UnlockDiagnosticVMOptions -XX:CompilerDirectivesFile= ` , the VM prints exactly the same warning for the c1 and c2 compilers, so the user cannot tell which compiler each message refers to.
The messages should include the compiler name so that the user knows which compiler issued them.

With my change the output would look like this:

    OpenJDK 64-Bit Server VM warning: printing of assembly code is enabled; turning on DebugNonSafepoints to gain additional output
    OpenJDK 64-Bit Server VM warning: printing of assembly code is enabled; turning on DebugNonSafepoints to gain additional output

->

    OpenJDK 64-Bit Server VM warning: c1: printing of assembly code is enabled; turning on DebugNonSafepoints to gain additional output
    OpenJDK 64-Bit Server VM warning: c2: printing of assembly code is enabled; turning on DebugNonSafepoints to gain additional output

-------------

Commit messages:
 - Add compiler name to warning messages in Compiler Directive

Changes: https://git.openjdk.java.net/jdk/pull/8591/files
Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=8591&range=00
Issue: https://bugs.openjdk.java.net/browse/JDK-8286990
Stats: 23 lines in 2 files changed: 18 ins; 0 del; 5 mod
Patch: https://git.openjdk.java.net/jdk/pull/8591.diff
Fetch: git fetch https://git.openjdk.java.net/jdk pull/8591/head:pull/8591

PR: https://git.openjdk.java.net/jdk/pull/8591

From thartmann at openjdk.java.net Thu May 19 06:03:27 2022
From: thartmann at openjdk.java.net (Tobias Hartmann)
Date: Thu, 19 May 2022 06:03:27 GMT
Subject: RFR: 8280696: C2 compilation hits assert(is_dominator(c, n_ctrl)) failed [v2]
In-Reply-To:
References:
Message-ID:
<7QkxcgsjuAx6H4eW-MkLgASBmgunUwm3j4YhO0-B4rs=.aa30a84d-8436-4037-b82e-aeb911e37c26@github.com> > We hit an assert when computing early control via `PhaseIdealLoop::compute_early_ctrl` for `1726 AddP` because one of its control inputs `1478 Region` does not dominate current control `1350 Loop` of the AddP: > ![Screenshot from 2022-05-18 15-10-24](https://user-images.githubusercontent.com/5312595/169046641-2ee94257-0aae-4ddf-bb5f-49dc19b466b3.png) > > I.e., current control of the AddP is incorrect. The problem is that the code in `PhaseIdealLoop::has_local_phi_input` that special cases AddP's only checks control of the Address (and Offset) input, assuming that control of the Base input is consistent. > https://github.com/openjdk/jdk/blob/aa7ccdf44549a52cce9e99f6569097d3343d9ee4/src/hotspot/share/opto/loopopts.cpp#L333-L337 > This is not guaranteed though, leading to the AddP ending up with control that is not dominated by control of its base input. > > As described below, this only reproduces with a very specific sequence of optimizations triggered by replay compilation with `-XX:+StressIGVN` and a fixed seed. I was not able to extract a regression test. > > The fix is to also check control of the Base input when moving the AddP up to a dominating point. For testing purposes, I added an `assert(get_ctrl(m->in(1)) != n_ctrl, "sanity")` without the fix to verify that this change does not affect common cases. It triggers in the failing case but not for any test in tier 1 - 5. In addition, I slightly refactored the code of `PhaseIdealLoop::compute_early_ctrl` and added comments. > > Gory details below. > > Relevant graph after parsing: > > 197 CastPP === 1460 60 [[ ... 1724 1725 ]] #java/io/BufferedReader:NotNull * > 1459 CastPP === 1460 60 [[ ... 
1724 1725 ]] #java/io/BufferedReader:NotNull * > 1725 Phi === 1478 1459 197 [[ 1726 ]] #java/io/BufferedReader:NotNull * > 1724 Phi === 1478 1459 197 [[ 1726 ]] #java/io/BufferedReader:NotNull * > 1726 AddP === _ 1724 1725 41 [[ ... ]] Oop:java/io/BufferedReader:NotNull+24 * > > `1724 Phi` is then processed by the following code in `PhiNode::Ideal` that replaces its inputs by a cast of the unique input `1730 CastPP`: > > https://github.com/openjdk/jdk/blob/aa7ccdf44549a52cce9e99f6569097d3343d9ee4/src/hotspot/share/opto/cfgnode.cpp#L2013-L2016 > > ``` > 197 CastPP === 1460 60 [[ ... 1725 ]] #java/io/BufferedReader:NotNull * > 1459 CastPP === 1460 60 [[ ... 1725 ]] #java/io/BufferedReader:NotNull * > 1730 CastPP === 1478 60 [[ ... 1724 1724 ]] #java/io/BufferedReader:NotNull * strong dependency > 1725 Phi === 1478 1459 197 [[ 1726 ]] #java/io/BufferedReader:NotNull * > 1724 Phi === 1478 1730 1730 [[ 1726 ]] #java/io/BufferedReader:NotNull * > 1726 AddP === _ 1724 1725 41 [[ ... ]] Oop:java/io/BufferedReader:NotNull+24 * > ``` > Then `1724 Phi` is replaced by the unique input `1730 CastPP`: > > 197 CastPP === 1460 60 [[ ... 1725 ]] #java/io/BufferedReader:NotNull * > 1459 CastPP === 1460 60 [[ ... 1725 ]] #java/io/BufferedReader:NotNull * > 1725 Phi === 1478 1459 197 [[ 1726 ]] #java/io/BufferedReader:NotNull * > 1730 CastPP === 1478 60 [[ ... 1726 ]] #java/io/BufferedReader:NotNull * strong dependency > 1726 AddP === _ 1730 1725 41 [[ ... ]] Oop:java/io/BufferedReader:NotNull+24 * > > Now `1459 CastPP` is replaced by identical `197 CastPP`: > > 197 CastPP === 1460 60 [[ ... 1725 ]] #java/io/BufferedReader:NotNull * > 1725 Phi === 1478 197 197 [[ 1726 1739 ]] #java/io/BufferedReader:NotNull * > 1730 CastPP === 1478 60 [[ ... 1726 ]] #java/io/BufferedReader:NotNull * strong dependency > 1726 AddP === _ 1730 1725 41 [[ ... 
]] Oop:java/io/BufferedReader:NotNull+24 * > > Finally, `1725 Phi` is replaced by unique input `197 CastPP` and the AddP ends up with two casts with different control of the same oop for Base and Address: > > 197 CastPP === 1460 60 [[ ... 1725 ]] #java/io/BufferedReader:NotNull * > 1730 CastPP === 1478 60 [[ ... 1726 ]] #java/io/BufferedReader:NotNull * strong dependency > 1726 AddP === _ 1730 197 41 [[ ... ]] Oop:java/io/BufferedReader:NotNull+24 * > > > Looking at the above transformation, the root cause is really the `1730 CastPP` added by `PhiNode::Ideal` which is not needed and prevents the two casts from being merged. Is it worth filing a follow-up enhancement to fix this? > > Thanks, > Tobias Tobias Hartmann has updated the pull request incrementally with one additional commit since the last revision: Fixed wrong node input in PhiNode::Ideal ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/8770/files - new: https://git.openjdk.java.net/jdk/pull/8770/files/964d2b77..ed0171ab Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=8770&range=01 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=8770&range=00-01 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.java.net/jdk/pull/8770.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8770/head:pull/8770 PR: https://git.openjdk.java.net/jdk/pull/8770 From thartmann at openjdk.java.net Thu May 19 06:03:27 2022 From: thartmann at openjdk.java.net (Tobias Hartmann) Date: Thu, 19 May 2022 06:03:27 GMT Subject: RFR: 8280696: C2 compilation hits assert(is_dominator(c, n_ctrl)) failed In-Reply-To: References: Message-ID: <1yvajqARaw8SY0645CSZKdtdQ3xXaSJm-v2CygrcFCE=.8513d6df-3220-4ec1-8321-69f77ead2387@github.com> On Wed, 18 May 2022 14:34:58 GMT, Tobias Hartmann wrote: > We hit an assert when computing early control via `PhaseIdealLoop::compute_early_ctrl` for `1726 AddP` because one of its control inputs `1478 Region` does not dominate current 
control `1350 Loop` of the AddP: > ![Screenshot from 2022-05-18 15-10-24](https://user-images.githubusercontent.com/5312595/169046641-2ee94257-0aae-4ddf-bb5f-49dc19b466b3.png) > > I.e., current control of the AddP is incorrect. The problem is that the code in `PhaseIdealLoop::has_local_phi_input` that special cases AddP's only checks control of the Address (and Offset) input, assuming that control of the Base input is consistent. > https://github.com/openjdk/jdk/blob/aa7ccdf44549a52cce9e99f6569097d3343d9ee4/src/hotspot/share/opto/loopopts.cpp#L333-L337 > This is not guaranteed though, leading to the AddP ending up with control that is not dominated by control of its base input. > > As described below, this only reproduces with a very specific sequence of optimizations triggered by replay compilation with `-XX:+StressIGVN` and a fixed seed. I was not able to extract a regression test. > > The fix is to also check control of the Base input when moving the AddP up to a dominating point. For testing purposes, I added an `assert(get_ctrl(m->in(1)) != n_ctrl, "sanity")` without the fix to verify that this change does not affect common cases. It triggers in the failing case but not for any test in tier 1 - 5. In addition, I slightly refactored the code of `PhaseIdealLoop::compute_early_ctrl` and added comments. > > Gory details below. > > Relevant graph after parsing: > > 197 CastPP === 1460 60 [[ ... 1724 1725 ]] #java/io/BufferedReader:NotNull * > 1459 CastPP === 1460 60 [[ ... 1724 1725 ]] #java/io/BufferedReader:NotNull * > 1725 Phi === 1478 1459 197 [[ 1726 ]] #java/io/BufferedReader:NotNull * > 1724 Phi === 1478 1459 197 [[ 1726 ]] #java/io/BufferedReader:NotNull * > 1726 AddP === _ 1724 1725 41 [[ ... 
]] Oop:java/io/BufferedReader:NotNull+24 * > > `1724 Phi` is then processed by the following code in `PhiNode::Ideal` that replaces its inputs by a cast of the unique input `1730 CastPP`: > > https://github.com/openjdk/jdk/blob/aa7ccdf44549a52cce9e99f6569097d3343d9ee4/src/hotspot/share/opto/cfgnode.cpp#L2013-L2016 > > ``` > 197 CastPP === 1460 60 [[ ... 1725 ]] #java/io/BufferedReader:NotNull * > 1459 CastPP === 1460 60 [[ ... 1725 ]] #java/io/BufferedReader:NotNull * > 1730 CastPP === 1478 60 [[ ... 1724 1724 ]] #java/io/BufferedReader:NotNull * strong dependency > 1725 Phi === 1478 1459 197 [[ 1726 ]] #java/io/BufferedReader:NotNull * > 1724 Phi === 1478 1730 1730 [[ 1726 ]] #java/io/BufferedReader:NotNull * > 1726 AddP === _ 1724 1725 41 [[ ... ]] Oop:java/io/BufferedReader:NotNull+24 * > ``` > Then `1724 Phi` is replaced by the unique input `1730 CastPP`: > > 197 CastPP === 1460 60 [[ ... 1725 ]] #java/io/BufferedReader:NotNull * > 1459 CastPP === 1460 60 [[ ... 1725 ]] #java/io/BufferedReader:NotNull * > 1725 Phi === 1478 1459 197 [[ 1726 ]] #java/io/BufferedReader:NotNull * > 1730 CastPP === 1478 60 [[ ... 1726 ]] #java/io/BufferedReader:NotNull * strong dependency > 1726 AddP === _ 1730 1725 41 [[ ... ]] Oop:java/io/BufferedReader:NotNull+24 * > > Now `1459 CastPP` is replaced by identical `197 CastPP`: > > 197 CastPP === 1460 60 [[ ... 1725 ]] #java/io/BufferedReader:NotNull * > 1725 Phi === 1478 197 197 [[ 1726 1739 ]] #java/io/BufferedReader:NotNull * > 1730 CastPP === 1478 60 [[ ... 1726 ]] #java/io/BufferedReader:NotNull * strong dependency > 1726 AddP === _ 1730 1725 41 [[ ... ]] Oop:java/io/BufferedReader:NotNull+24 * > > Finally, `1725 Phi` is replaced by unique input `197 CastPP` and the AddP ends up with two casts with different control of the same oop for Base and Address: > > 197 CastPP === 1460 60 [[ ... 1725 ]] #java/io/BufferedReader:NotNull * > 1730 CastPP === 1478 60 [[ ... 
1726 ]] #java/io/BufferedReader:NotNull * strong dependency > 1726 AddP === _ 1730 197 41 [[ ... ]] Oop:java/io/BufferedReader:NotNull+24 * > > > Looking at the above transformation, the root cause is really the `1730 CastPP` added by `PhiNode::Ideal` which is not needed and prevents the two casts from being merged. Is it worth filing a follow-up enhancement to fix this? > > Thanks, > Tobias Oh, good catch! I fixed that as well (the original issue still reproduces but requires a different seed). ------------- PR: https://git.openjdk.java.net/jdk/pull/8770 From jbhateja at openjdk.java.net Thu May 19 06:27:42 2022 From: jbhateja at openjdk.java.net (Jatin Bhateja) Date: Thu, 19 May 2022 06:27:42 GMT Subject: RFR: 8285868: x86_64 intrinsics for floating point methods isNaN, isFinite and isInfinite [v8] In-Reply-To: References: Message-ID: On Thu, 19 May 2022 04:27:11 GMT, Srinivas Vamsi Parasa wrote: > * Zeroing the register before `setb` instruction, referring to the same section > > For optimal performance, use of zero idioms, before the use of the register, eliminates the need for partial register > > merge micro-ops > > > This is more preferable since it does not contribute an execution uop in the backend (but still consumes a slot in the > decoder and uop cache) Agreed. Even a prior `movl REG, 0` can be used; it never reaches the execution ports. ------------- PR: https://git.openjdk.java.net/jdk/pull/8459 From duke at openjdk.java.net Thu May 19 06:59:19 2022 From: duke at openjdk.java.net (yuta) Date: Thu, 19 May 2022 06:59:19 GMT Subject: RFR: 8287001: Add warning message when fail to load hsdis libraries Message-ID: When the hsdis (HotSpot Disassembler) library fails to load (for example, because the library is missing or hsdis.so is outdated), no warning message is shown; only info-level messages are visible, and only when -Xlog:os=info is set. A warning should tell the user that loading the hsdis library failed, so I added one. e.g. 
` ------------- Commit messages: - add warning message when fail to load hsdis libraries Changes: https://git.openjdk.java.net/jdk/pull/8782/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=8782&range=00 Issue: https://bugs.openjdk.java.net/browse/JDK-8287001 Stats: 2 lines in 1 file changed: 2 ins; 0 del; 0 mod Patch: https://git.openjdk.java.net/jdk/pull/8782.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8782/head:pull/8782 PR: https://git.openjdk.java.net/jdk/pull/8782 From jbhateja at openjdk.java.net Thu May 19 07:12:50 2022 From: jbhateja at openjdk.java.net (Jatin Bhateja) Date: Thu, 19 May 2022 07:12:50 GMT Subject: RFR: 8285868: x86_64 intrinsics for floating point methods isNaN, isFinite and isInfinite [v8] In-Reply-To: References: Message-ID: On Thu, 19 May 2022 00:06:00 GMT, Srinivas Vamsi Parasa wrote: >> We develop optimized x86_64 intrinsics for the floating point class check methods `isNaN()`, `isFinite()` and `isInfinite()` for Float and Double classes. JMH benchmarks show ~6x improvement for `isNaN()`, ~2x improvement for `isInfinite()` and 40% gain for `isFinite()` using `vfpclasss(s/d)` instructions. 
>>
>> JMH Benchmark (ns/op)             Baseline   This PR (WITH vfpclassss/sd)      Speedup
>> FloatClassCheck.testIsFinite      0.559      0.4                               1.4x
>> FloatClassCheck.testIsInfinite    0.828      0.386                             2.15x
>> FloatClassCheck.testIsNaN         2.589      0.387                             6.7x
>> DoubleClassCheck.testIsFinite     0.568      0.414                             1.37x
>> DoubleClassCheck.testIsInfinite   0.836      0.395                             2.11x
>> DoubleClassCheck.testIsNaN        2.592      0.393                             6.6x
>>
>> JMH Benchmark (ns/op)             Baseline   This PR (WITHOUT vfpclassss/sd)   Speedup
>> FloatClassCheck.testIsFinite      0.561      0.468                             1.2x
>> FloatClassCheck.testIsInfinite    0.793      0.491                             1.61x
>> FloatClassCheck.testIsNaN         2.587      0.469                             5.5x
>> DoubleClassCheck.testIsFinite     0.561      0.592                             0.94x
>> DoubleClassCheck.testIsInfinite   0.828      0.592                             1.4x
>> DoubleClassCheck.testIsNaN        2.593      0.594                             4.4x
>
> Srinivas Vamsi Parasa has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains ten commits:
>
> - add comment for vfpclasss/d for isFinite()
> - Merge branch 'master' of https://git.openjdk.java.net/jdk into float
> - zero out the upper bits not written by setb
> - use 0x1 to be simpler
> - remove the redundant temp register
> - Split the macros using predicate
> - update jmh tests
> - Merge branch 'master' into float
> - 8285868: x86_64 intrinsics for floating point methods isNaN, isFinite and isInfinite

src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 4750:

> 4748: movdl(temp, src);
> 4749: andl(temp, KILL_SIGN_MASK);
> 4750: cmpl(temp, POS_INF);

For isNaN, the following sequence will offer better latency: "vucomiss src_xmm, src_xmm" followed by "setp r8"
------------- PR: https://git.openjdk.java.net/jdk/pull/8459 From thartmann at openjdk.java.net Thu May 19 07:19:56 2022 From: thartmann at openjdk.java.net (Tobias Hartmann) Date: Thu, 19 May 2022 07:19:56 GMT Subject: RFR: 8286104: use aggressive liveness for unstable_if traps [v2] In-Reply-To: References: Message-ID: On Thu, 12 May 2022 21:27:30 GMT, Xin Liu wrote: >> I found that some phi nodes are useful because its uses are 
uncommon_trap nodes. In the worst case, this hinders boxing elimination and scalar replacement. Test.java of JDK-8286104 reveals this issue. This patch allows the C2 parser to collect liveness based on the next bci for unstable_if traps. In most cases, the next bci is closer to the exits, so fewer locals are live. This helps reduce the number of inputs of unstable_if traps. >> >> This is not a REDO of Optimization of Box nodes in uncommon_trap (JDK-8261137). The two patches are orthogonal. I adapted the test from [TestEliminateBoxInDebugInfo.java](https://github.com/openjdk/jdk/pull/2401/files#diff-49b2e38825aa4c28ca196bdc70c3cbecc2e835c2899f4f393527df4796b177ea), so part of the credit goes to the original author. I found that scalar replacement can take care of the object `Integer ii = Integer.valueOf(value)` in the original test even if it can't be removed by the later inliner. I tweaked the profiling data of Integer.valueOf() to hinder scalar replacement. >> >> This patch can cover the problem discovered by JDK-8261137. I ran the microbench and got a 9x speedup on x86_64. It's almost the same as JDK-8261137.
>>
>> Before:
>>
>> Benchmark               Mode  Cnt   Score    Error  Units
>> MyBenchmark.testMethod  avgt   10  32.776 ± 0.075  us/op
>>
>> After:
>>
>> Benchmark               Mode  Cnt   Score    Error  Units
>> MyBenchmark.testMethod  avgt   10   3.656 ± 0.006  us/op
>>
>> Testing
>> I ran tier1 test with and without `-XX:+DeoptimizeALot`. No regression has been found yet.
>
> Xin Liu has updated the pull request incrementally with 11 additional commits since the last revision:
>
> - revert code change from 1st revision.
> - Merge branch 'JDK-8276998' into JDK-8286104
> - rule out if a If nodes has 2 branches of unstable_if trap.
> - change the flag to diagnostic.
> - add sanity check for operands if bc is if_acmp_eq/ne and ifnull/nonnull
> - fix release build
> - update unstable_if after igvn.
> - adjust unstable_if after fold_compares
> - disable comparison_folding temporarily. 
> > This feature not only folds two CMPI but also merge two uncommon_traps. > it uses the dominating uncommon_trap and revaluate the two if in > interpreter. currently, aggressiveliveness can't work for that. > - retain bci for unstable_if > - ... and 1 more: https://git.openjdk.java.net/jdk/compare/2c38b87b...2f047457 I ran this through some quick testing and `test/hotspot/jtreg/compiler/rangechecks/TestExplicitRangeChecks.java` fails: java.lang.reflect.InvocationTargetException at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:116) at java.base/java.lang.reflect.Method.invoke(Method.java:578) at compiler.rangechecks.TestExplicitRangeChecks.doTest(TestExplicitRangeChecks.java:441) at compiler.rangechecks.TestExplicitRangeChecks.main(TestExplicitRangeChecks.java:518) at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:104) at java.base/java.lang.reflect.Method.invoke(Method.java:578) at com.sun.javatest.regtest.agent.MainWrapper$MainThread.run(MainWrapper.java:127) at java.base/java.lang.Thread.run(Thread.java:1585) Caused by: java.lang.NullPointerException: Cannot read the array length because "" is null at compiler.rangechecks.TestExplicitRangeChecks.test3_2(TestExplicitRangeChecks.java:113) at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:104) ... 7 more ------------- PR: https://git.openjdk.java.net/jdk/pull/8545 From thartmann at openjdk.java.net Thu May 19 07:23:50 2022 From: thartmann at openjdk.java.net (Tobias Hartmann) Date: Thu, 19 May 2022 07:23:50 GMT Subject: RFR: 8263075: C2: simplify anti-dependence check in PhaseCFG::implicit_null_check() [v2] In-Reply-To: References: Message-ID: On Tue, 17 May 2022 20:19:14 GMT, Brian J. 
Stafford wrote: >> The reporter for this issue (https://bugs.openjdk.java.net/browse/JDK-8263075) indicated that there's an assumption that we can rely on that the while loop in question will run exactly one time. Based on this, I've done the following: >> >> - Asserted the condition that makes sure the code runs at least once >> - Asserted the condition that makes sure the code runs only once >> - Removed the `while` loop >> - Changed a couple of `break` statements into `continue` statements. They no longer need to break out of the `while` loop, now that it's gone. However, they were early exits from the `while` loop that ended up resulting in `continue` statements for the larger enclosing loop. Thus we can just call `continue` directly. >> - Removed the local variable `b`, as we no longer need to traverse the node hierarchy. We can use `mb` directly. >> >> Passes jdk, langtools, and hotspot Tier 1 tests on Linux (x64 and ARM64) and macOS (x64 and ARM64). Most Tier 1 tests pass on Windows (x64 and ARM64), but there are a handful of failures unrelated to this change. > > Brian J. Stafford has updated the pull request incrementally with two additional commits since the last revision: > > - Removed whitespace > - Added braces for if statements Looks good to me and tests passed. @robcasloz should also have a look. ------------- Marked as reviewed by thartmann (Reviewer). PR: https://git.openjdk.java.net/jdk/pull/8684 From roland at openjdk.java.net Thu May 19 07:39:59 2022 From: roland at openjdk.java.net (Roland Westrelin) Date: Thu, 19 May 2022 07:39:59 GMT Subject: RFR: 8275201: C2: hide klass() accessor from TypeOopPtr and typeKlassPtr subclasses [v3] In-Reply-To: References: Message-ID: On Wed, 18 May 2022 17:10:36 GMT, Xin Liu wrote: > I don't mean that you should remove it. The hotspot code style uses "avoid" instead of "forbid". I am curious what we gain when we use this feature. in some scenarios, we are operating subclass pointers. eg. 
in flatten_alias_type() const TypeInstPtr *to = tj->isa_instptr(); > > we can save static_cast<> for them. It is better for expressiveness, doesn't it? The usual pattern in c2 code is to use the isa_xxx()/is_xxx() methods to downcast. The covariant return types made it possible to remove some of those calls, resulting in less clutter. Anyway, it was mostly convenience and it's a part of the change that can easily be adjusted. ------------- PR: https://git.openjdk.java.net/jdk/pull/6717 From rcastanedalo at openjdk.java.net Thu May 19 08:02:48 2022 From: rcastanedalo at openjdk.java.net (Roberto =?UTF-8?B?Q2FzdGHDsWVkYQ==?= Lozano) Date: Thu, 19 May 2022 08:02:48 GMT Subject: RFR: 8263075: C2: simplify anti-dependence check in PhaseCFG::implicit_null_check() [v2] In-Reply-To: References: Message-ID: <9tV7TXcuMDkKb0tIBZnIzzwiAp3knfEnad8t8d5p0Q8=.e7dfb533-e5c1-42ac-b8a6-61ac7b59e1db@github.com> On Thu, 19 May 2022 07:20:43 GMT, Tobias Hartmann wrote: > Looks good to me and tests passed. @robcasloz should also have a look. Running some additional tests, will come back with the results. ------------- PR: https://git.openjdk.java.net/jdk/pull/8684 From njian at openjdk.java.net Thu May 19 08:57:07 2022 From: njian at openjdk.java.net (Ningsheng Jian) Date: Thu, 19 May 2022 08:57:07 GMT Subject: RFR: 8284960: Integration of JEP 426: Vector API (Fourth Incubator) [v6] In-Reply-To: <-lS_36hGarJvCL26lgWyXJd-e2SuLD9g1wWL5PuoLXI=.5ddb3b74-493f-41df-8544-8a963c66fc5d@github.com> References: <15GChtdthFmu9Cup-Ykj5NBvAanOC8QOJsnhH9g20KY=.f35eba31-15f9-40e8-95ce-a54049792840@github.com> <-lS_36hGarJvCL26lgWyXJd-e2SuLD9g1wWL5PuoLXI=.5ddb3b74-493f-41df-8544-8a963c66fc5d@github.com> Message-ID: On Wed, 18 May 2022 23:22:42 GMT, Vladimir Ivanov wrote: > Interesting. But AArch64 code does cover vector cases which just adds confusion. `UsePopCountInstruction` is always true in AArch64. 
@XiaohongGong removed the `predicate` in aarch64 rules, and I think we can even remove the option check in match_rule_supported(). ------------- PR: https://git.openjdk.java.net/jdk/pull/8425 From chagedorn at openjdk.java.net Thu May 19 09:02:55 2022 From: chagedorn at openjdk.java.net (Christian Hagedorn) Date: Thu, 19 May 2022 09:02:55 GMT Subject: RFR: 8280696: C2 compilation hits assert(is_dominator(c, n_ctrl)) failed [v2] In-Reply-To: <7QkxcgsjuAx6H4eW-MkLgASBmgunUwm3j4YhO0-B4rs=.aa30a84d-8436-4037-b82e-aeb911e37c26@github.com> References: <7QkxcgsjuAx6H4eW-MkLgASBmgunUwm3j4YhO0-B4rs=.aa30a84d-8436-4037-b82e-aeb911e37c26@github.com> Message-ID: On Thu, 19 May 2022 06:03:27 GMT, Tobias Hartmann wrote: >> We hit an assert when computing early control via `PhaseIdealLoop::compute_early_ctrl` for `1726 AddP` because one of its control inputs `1478 Region` does not dominate current control `1350 Loop` of the AddP: >> ![Screenshot from 2022-05-18 15-10-24](https://user-images.githubusercontent.com/5312595/169046641-2ee94257-0aae-4ddf-bb5f-49dc19b466b3.png) >> >> I.e., current control of the AddP is incorrect. The problem is that the code in `PhaseIdealLoop::has_local_phi_input` that special cases AddP's only checks control of the Address (and Offset) input, assuming that control of the Base input is consistent. >> https://github.com/openjdk/jdk/blob/aa7ccdf44549a52cce9e99f6569097d3343d9ee4/src/hotspot/share/opto/loopopts.cpp#L333-L337 >> This is not guaranteed though, leading to the AddP ending up with control that is not dominated by control of its base input. >> >> As described below, this only reproduces with a very specific sequence of optimizations triggered by replay compilation with `-XX:+StressIGVN` and a fixed seed. I was not able to extract a regression test. >> >> The fix is to also check control of the Base input when moving the AddP up to a dominating point. 
For testing purposes, I added an `assert(get_ctrl(m->in(1)) != n_ctrl, "sanity")` without the fix to verify that this change does not affect common cases. It triggers in the failing case but not for any test in tier 1 - 5. In addition, I slightly refactored the code of `PhaseIdealLoop::compute_early_ctrl` and added comments. >> >> Gory details below. >> >> Relevant graph after parsing: >> >> 197 CastPP === 1460 60 [[ ... 1724 1725 ]] #java/io/BufferedReader:NotNull * >> 1459 CastPP === 1460 60 [[ ... 1724 1725 ]] #java/io/BufferedReader:NotNull * >> 1725 Phi === 1478 1459 197 [[ 1726 ]] #java/io/BufferedReader:NotNull * >> 1724 Phi === 1478 1459 197 [[ 1726 ]] #java/io/BufferedReader:NotNull * >> 1726 AddP === _ 1724 1725 41 [[ ... ]] Oop:java/io/BufferedReader:NotNull+24 * >> >> `1724 Phi` is then processed by the following code in `PhiNode::Ideal` that replaces its inputs by a cast of the unique input `1730 CastPP`: >> >> https://github.com/openjdk/jdk/blob/aa7ccdf44549a52cce9e99f6569097d3343d9ee4/src/hotspot/share/opto/cfgnode.cpp#L2013-L2016 >> >> ``` >> 197 CastPP === 1460 60 [[ ... 1725 ]] #java/io/BufferedReader:NotNull * >> 1459 CastPP === 1460 60 [[ ... 1725 ]] #java/io/BufferedReader:NotNull * >> 1730 CastPP === 1478 60 [[ ... 1724 1724 ]] #java/io/BufferedReader:NotNull * strong dependency >> 1725 Phi === 1478 1459 197 [[ 1726 ]] #java/io/BufferedReader:NotNull * >> 1724 Phi === 1478 1730 1730 [[ 1726 ]] #java/io/BufferedReader:NotNull * >> 1726 AddP === _ 1724 1725 41 [[ ... ]] Oop:java/io/BufferedReader:NotNull+24 * >> ``` >> Then `1724 Phi` is replaced by the unique input `1730 CastPP`: >> >> 197 CastPP === 1460 60 [[ ... 1725 ]] #java/io/BufferedReader:NotNull * >> 1459 CastPP === 1460 60 [[ ... 1725 ]] #java/io/BufferedReader:NotNull * >> 1725 Phi === 1478 1459 197 [[ 1726 ]] #java/io/BufferedReader:NotNull * >> 1730 CastPP === 1478 60 [[ ... 1726 ]] #java/io/BufferedReader:NotNull * strong dependency >> 1726 AddP === _ 1730 1725 41 [[ ... 
]] Oop:java/io/BufferedReader:NotNull+24 * >> >> Now `1459 CastPP` is replaced by identical `197 CastPP`: >> >> 197 CastPP === 1460 60 [[ ... 1725 ]] #java/io/BufferedReader:NotNull * >> 1725 Phi === 1478 197 197 [[ 1726 1739 ]] #java/io/BufferedReader:NotNull * >> 1730 CastPP === 1478 60 [[ ... 1726 ]] #java/io/BufferedReader:NotNull * strong dependency >> 1726 AddP === _ 1730 1725 41 [[ ... ]] Oop:java/io/BufferedReader:NotNull+24 * >> >> Finally, `1725 Phi` is replaced by unique input `197 CastPP` and the AddP ends up with two casts with different control of the same oop for Base and Address: >> >> 197 CastPP === 1460 60 [[ ... 1725 ]] #java/io/BufferedReader:NotNull * >> 1730 CastPP === 1478 60 [[ ... 1726 ]] #java/io/BufferedReader:NotNull * strong dependency >> 1726 AddP === _ 1730 197 41 [[ ... ]] Oop:java/io/BufferedReader:NotNull+24 * >> >> >> Looking at the above transformation, the root cause is really the `1730 CastPP` added by `PhiNode::Ideal` which is not needed and prevents the two casts from being merged. Is it worth filing a follow-up enhancement to fix this? >> >> Thanks, >> Tobias > > Tobias Hartmann has updated the pull request incrementally with one additional commit since the last revision: > > Fixed wrong node input in PhiNode::Ideal Nice analysis and good catch by Vladimir! Looks good to me, too. ------------- Marked as reviewed by chagedorn (Reviewer). 
PR: https://git.openjdk.java.net/jdk/pull/8770 From xgong at openjdk.java.net Thu May 19 09:06:17 2022 From: xgong at openjdk.java.net (Xiaohong Gong) Date: Thu, 19 May 2022 09:06:17 GMT Subject: RFR: 8284960: Integration of JEP 426: Vector API (Fourth Incubator) [v6] In-Reply-To: References: <15GChtdthFmu9Cup-Ykj5NBvAanOC8QOJsnhH9g20KY=.f35eba31-15f9-40e8-95ce-a54049792840@github.com> <-lS_36hGarJvCL26lgWyXJd-e2SuLD9g1wWL5PuoLXI=.5ddb3b74-493f-41df-8544-8a963c66fc5d@github.com> Message-ID: On Thu, 19 May 2022 08:53:31 GMT, Ningsheng Jian wrote: >>> LUT should be generated only if UsePopCountInstruction is false >> >> Should there be `!UsePopCountInstruction` check then? >> >>> restrict the scope of flag to only scalar popcount operation >> >> Interesting. But AArch64 code does cover vector cases which just adds confusion. > >> Interesting. But AArch64 code does cover vector cases which just adds confusion. > > `UsePopCountInstruction` is always true in AArch64. @XiaohongGong removed the `predicate` in aarch64 rules, and I think we can even remove the option check in match_rule_supported(). OK, I will remove the check for it. Thanks! 
------------- PR: https://git.openjdk.java.net/jdk/pull/8425 From roland at openjdk.java.net Thu May 19 09:09:45 2022 From: roland at openjdk.java.net (Roland Westrelin) Date: Thu, 19 May 2022 09:09:45 GMT Subject: RFR: 8280696: C2 compilation hits assert(is_dominator(c, n_ctrl)) failed [v2] In-Reply-To: <7QkxcgsjuAx6H4eW-MkLgASBmgunUwm3j4YhO0-B4rs=.aa30a84d-8436-4037-b82e-aeb911e37c26@github.com> References: <7QkxcgsjuAx6H4eW-MkLgASBmgunUwm3j4YhO0-B4rs=.aa30a84d-8436-4037-b82e-aeb911e37c26@github.com> Message-ID: On Thu, 19 May 2022 06:03:27 GMT, Tobias Hartmann wrote: > Tobias Hartmann has updated the pull request incrementally with one additional commit since the last revision: > > Fixed wrong node input in PhiNode::Ideal Looks good to me. ------------- Marked as reviewed by roland (Reviewer). 
PR: https://git.openjdk.java.net/jdk/pull/8770 From thartmann at openjdk.java.net Thu May 19 09:42:41 2022 From: thartmann at openjdk.java.net (Tobias Hartmann) Date: Thu, 19 May 2022 09:42:41 GMT Subject: RFR: 8280696: C2 compilation hits assert(is_dominator(c, n_ctrl)) failed [v2] In-Reply-To: <7QkxcgsjuAx6H4eW-MkLgASBmgunUwm3j4YhO0-B4rs=.aa30a84d-8436-4037-b82e-aeb911e37c26@github.com> References: <7QkxcgsjuAx6H4eW-MkLgASBmgunUwm3j4YhO0-B4rs=.aa30a84d-8436-4037-b82e-aeb911e37c26@github.com> Message-ID: On Thu, 19 May 2022 06:03:27 GMT, Tobias Hartmann wrote: > Tobias Hartmann has updated the pull request incrementally with one additional commit since the last revision: > > Fixed wrong node input in PhiNode::Ideal Christian, Roland, thanks for the reviews! As Vladimir requested, I filed a follow-up RFE ([JDK-8287009](https://bugs.openjdk.java.net/browse/JDK-8287009)) for the useless CastPPs. I think the code creating two Phis for the Base and Address inputs is fine because they can be different but I leave it to @rwestrel to comment on that. ------------- PR: https://git.openjdk.java.net/jdk/pull/8770 From rcastanedalo at openjdk.java.net Thu May 19 12:10:40 2022 From: rcastanedalo at openjdk.java.net (Roberto =?UTF-8?B?Q2FzdGHDsWVkYQ==?= Lozano) Date: Thu, 19 May 2022 12:10:40 GMT Subject: RFR: 8263075: C2: simplify anti-dependence check in PhaseCFG::implicit_null_check() [v2] In-Reply-To: References: Message-ID: <3s-MkuRpH7XNy0XyBp0lVdz1LWBs2VzAtASR4YTT9yo=.85632fe2-dcc9-4bae-b4e3-dd011014ab08@github.com> On Tue, 17 May 2022 20:19:14 GMT, Brian J. 
Stafford wrote: >> The reporter for this issue (https://bugs.openjdk.java.net/browse/JDK-8263075) indicated that there's an assumption that we can rely on that the while loop in question will run exactly one time. Based on this, I've done the following: >> >> - Asserted the condition that makes sure the code runs at least once >> - Asserted the condition that makes sure the code runs only once >> - Removed the `while` loop >> - Changed a couple of `break` statements into `continue` statements. They no longer need to break out of the `while` loop, now that it's gone. However, they were early exits from the `while` loop that ended up resulting in `continue` statements for the larger enclosing loop. Thus we can just call `continue` directly. >> - Removed the local variable `b`, as we no longer need to traverse the node hierarchy. We can use `mb` directly. >> >> Passes jdk, langtools, and hotspot Tier 1 tests on Linux (x64 and ARM64) and macOS (x64 and ARM64). Most Tier 1 tests pass on Windows (x64 and ARM64), but there are a handful of failures unrelated to this change. > > Brian J. Stafford has updated the pull request incrementally with two additional commits since the last revision: > > - Removed whitespace > - Added braces for if statements Thanks for working on this cleanup, @brianjstafford! The loop removal looks good to me, and the additional tests passed, I just have a few comments about the assertions and their documentation. src/hotspot/share/opto/lcm.cpp line 330: > 328: // Give up hoisting if we have to move the store past any load. > 329: if (was_store) { > 330: // Start searching here for a local load This comment is obsolete and can be removed, as it implicitly refers to a loop that no longer exists. 
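As a rough illustration of the loop removal discussed in this review, here is a minimal sketch. It is not the actual `lcm.cpp` code: `Block`, `idom`, `has_blocker` and both function names are hypothetical stand-ins, and the immediate dominator is used as a simplification of the predecessor check. It shows how a walk from `mb` up to `block` collapses into a single scan once the walk is known to visit exactly one block, with assertions documenting that assumption.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for a basic block; not HotSpot's Block class.
struct Block {
  Block* idom;              // immediate dominator (simplified model)
  std::vector<bool> nodes;  // true marks a node that blocks hoisting
};

// Returns true if any node in the block prevents hoisting.
static bool has_blocker(const Block* b) {
  for (bool blocks : b->nodes) {
    if (blocks) return true;
  }
  return false;
}

// Before: walk from mb up the dominator chain until block is reached.
static bool can_hoist_with_loop(Block* mb, Block* block) {
  Block* b = mb;
  while (b != block) {
    if (has_blocker(b)) break;  // early exit: give up hoisting
    b = b->idom;
  }
  return b == block;
}

// After: the walk is assumed to visit exactly one block, so the loop
// and the cursor variable b disappear and the assumption is asserted.
static bool can_hoist_single_step(Block* mb, Block* block) {
  assert(mb != block && "scan must run at least once");
  assert(mb->idom == block && "scan must run only once");
  return !has_blocker(mb);
}
```

Under the stated assumption both functions return the same answer; the second one simply makes the "runs exactly once" invariant explicit instead of encoding it in a loop.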
src/hotspot/share/opto/lcm.cpp line 333: > 331: // mach use (faulting) trying to hoist > 332: // n might be blocker to hoisting > 333: // This assert ensures that the following code should be run This comment and the similar one for the second assertion are also obsolete for the same reason. Instead, you could add a comment explaining why we expect `get_block_for_node(mb->pred(1)) == block`. I suggest something along these lines: _`mach` is a store, hence `block` is the immediate dominator of `mb`. Due to the null-check shape of `block` (where its successors cannot re-join), `block` must be the direct predecessor of `mb`._ src/hotspot/share/opto/lcm.cpp line 352: > 350: > 351: // This assert ensures that the above code should only be run once > 352: assert(get_block_for_node(mb->pred(1)) == block, "Unexpected predecessor block"); The single-predecessor test and the assertion `get_block_for_node(mb->pred(1)) == block` can be moved above the for-loop to make the assertion `mb != block` redundant. ------------- Changes requested by rcastanedalo (Committer). PR: https://git.openjdk.java.net/jdk/pull/8684 From mbaesken at openjdk.java.net Thu May 19 13:40:58 2022 From: mbaesken at openjdk.java.net (Matthias Baesken) Date: Thu, 19 May 2022 13:40:58 GMT Subject: RFR: 8285733: [s390] Vector Instruction Emitters for element-wise access are broken In-Reply-To: References: Message-ID: <70rNEIGo3KjU3KSrituzsA0ooWm3IyDqmrgEz-KRp7o=.b7fb34f8-fb8d-4431-86a5-7dfa4ba1ecff@github.com> On Wed, 4 May 2022 14:42:27 GMT, Lutz Schmidt wrote: > Please review this rather simple pull request. It fixes some vector instruction emitters. The bugs had gone unnoticed so far because the emitters had not been used. Therefore, the fix bears no risk. > > Testing was performed with new code currently under development. Marked as reviewed by mbaesken (Reviewer). 
------------- PR: https://git.openjdk.java.net/jdk/pull/8537 From lucy at openjdk.java.net Thu May 19 14:01:03 2022 From: lucy at openjdk.java.net (Lutz Schmidt) Date: Thu, 19 May 2022 14:01:03 GMT Subject: RFR: 8285733: [s390] Vector Instruction Emitters for element-wise access are broken In-Reply-To: <70rNEIGo3KjU3KSrituzsA0ooWm3IyDqmrgEz-KRp7o=.b7fb34f8-fb8d-4431-86a5-7dfa4ba1ecff@github.com> References: <70rNEIGo3KjU3KSrituzsA0ooWm3IyDqmrgEz-KRp7o=.b7fb34f8-fb8d-4431-86a5-7dfa4ba1ecff@github.com> Message-ID: On Thu, 19 May 2022 13:37:51 GMT, Matthias Baesken wrote: >> Please review this rather simple pull request. It fixes some vector instruction emitters. The bugs had gone unnoticed so far because the emitters had not been used. Therefore, the fix bears no risk. >> >> Testing was performed with new code currently under development. > > Marked as reviewed by mbaesken (Reviewer). Thank you very much for reviewing, @MBaesken ! ------------- PR: https://git.openjdk.java.net/jdk/pull/8537 From lucy at openjdk.java.net Thu May 19 14:01:04 2022 From: lucy at openjdk.java.net (Lutz Schmidt) Date: Thu, 19 May 2022 14:01:04 GMT Subject: Integrated: 8285733: [s390] Vector Instruction Emitters for element-wise access are broken In-Reply-To: References: Message-ID: On Wed, 4 May 2022 14:42:27 GMT, Lutz Schmidt wrote: > Please review this rather simple pull request. It fixes some vector instruction emitters. The bugs had gone unnoticed so far because the emitters had not been used. Therefore, the fix bears no risk. > > Testing was performed with new code currently under development. This pull request has now been integrated. 
Changeset: af7cda5d Author: Lutz Schmidt URL: https://git.openjdk.java.net/jdk/commit/af7cda5d8f1f724f183f6ec85ca9edf6afb2d478 Stats: 20 lines in 1 file changed: 1 ins; 0 del; 19 mod 8285733: [s390] Vector Instruction Emitters for element-wise access are broken Reviewed-by: mdoerr, mbaesken ------------- PR: https://git.openjdk.java.net/jdk/pull/8537

From eliu at openjdk.java.net Thu May 19 14:31:41 2022 From: eliu at openjdk.java.net (Eric Liu) Date: Thu, 19 May 2022 14:31:41 GMT Subject: RFR: 8287028: AArch64: [vectorapi] Backend implementation of VectorMask.fromLong with SVE2 Message-ID: <9f4FuUVXKxeO6tC6so96ydn3nss81T7s0KvV03XlnCc=.75152f52-5b9f-4a84-bd36-0547899fa061@github.com>

This patch implements AArch64 codegen for VectorLongToMask using the SVE2 BitPerm feature. With this patch, the final code (generated on a QEMU emulator with a 512-bit SVE vector register size) is shown below:

    mov z17.b, #0
    mov v17.d[0], x13
    sunpklo z17.h, z17.b
    sunpklo z17.s, z17.h
    sunpklo z17.d, z17.s
    mov z16.b, #1
    bdep z17.d, z17.d, z16.d
    cmpne p0.b, p7/z, z17.b, #0

-------------

Commit messages:
 - AArch64: [vectorapi] Backend implementation of VectorMask.fromLong with SVE2

Changes: https://git.openjdk.java.net/jdk/pull/8789/files
Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=8789&range=00
Issue: https://bugs.openjdk.java.net/browse/JDK-8287028
Stats: 134 lines in 8 files changed: 102 ins; 0 del; 32 mod
Patch: https://git.openjdk.java.net/jdk/pull/8789.diff
Fetch: git fetch https://git.openjdk.java.net/jdk pull/8789/head:pull/8789

PR: https://git.openjdk.java.net/jdk/pull/8789

From kvn at openjdk.java.net Thu May 19 14:40:51 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Thu, 19 May 2022 14:40:51 GMT Subject: RFR: 8286972: Support the new loop induction variable related PopulateIndex IR node on x86 [v2] In-Reply-To: <0gCuEnJHcbS5WwOKcRAt6rZy8bhoX2Rsxpd06GW2-p8=.c87668b8-73a1-42a0-b6d5-9e52ca092cc0@github.com> References:
<0gCuEnJHcbS5WwOKcRAt6rZy8bhoX2Rsxpd06GW2-p8=.c87668b8-73a1-42a0-b6d5-9e52ca092cc0@github.com> Message-ID: On Thu, 19 May 2022 00:25:33 GMT, Sandhya Viswanathan wrote:

>> This PR adds x86 backend support for the new loop induction variable related PopulateIndex IR node.
>> This IR node was added as part of [JDK-8280510](https://bugs.openjdk.java.net/browse/JDK-8280510).
>>
>> The performance numbers are as follows:
>> Before:
>> Benchmark                   (count)   Mode  Cnt       Score       Error  Units
>> IndexVector.exprWithIndex1    65536  thrpt    3   64556.552 ±  1126.396  ops/s
>> IndexVector.exprWithIndex2    65536  thrpt    3   22117.050 ± 11452.098  ops/s
>> IndexVector.indexArrayFill    65536  thrpt    3  117776.383 ±  1120.957  ops/s
>>
>> After:
>> Benchmark                   (count)   Mode  Cnt       Score       Error  Units
>> IndexVector.exprWithIndex1    65536  thrpt    3  203180.290 ±  2147.807  ops/s
>> IndexVector.exprWithIndex2    65536  thrpt    3  274132.756 ±  6853.393  ops/s
>> IndexVector.indexArrayFill    65536  thrpt    3  374165.202 ± 46930.779  ops/s
>>
>> Please review.
>>
>> Best Regards,
>> Sandhya
>
> Sandhya Viswanathan has updated the pull request incrementally with one additional commit since the last revision:
>
>   review comment resolution

Tier1-4 testing passed.
------------- PR: https://git.openjdk.java.net/jdk/pull/8778 From kvn at openjdk.java.net Thu May 19 14:46:45 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Thu, 19 May 2022 14:46:45 GMT Subject: RFR: 8280696: C2 compilation hits assert(is_dominator(c, n_ctrl)) failed [v2] In-Reply-To: <7QkxcgsjuAx6H4eW-MkLgASBmgunUwm3j4YhO0-B4rs=.aa30a84d-8436-4037-b82e-aeb911e37c26@github.com> References: <7QkxcgsjuAx6H4eW-MkLgASBmgunUwm3j4YhO0-B4rs=.aa30a84d-8436-4037-b82e-aeb911e37c26@github.com> Message-ID: On Thu, 19 May 2022 06:03:27 GMT, Tobias Hartmann wrote: >> We hit an assert when computing early control via `PhaseIdealLoop::compute_early_ctrl` for `1726 AddP` because one of its control inputs `1478 Region` does not dominate current control `1350 Loop` of the AddP: >> ![Screenshot from 2022-05-18 15-10-24](https://user-images.githubusercontent.com/5312595/169046641-2ee94257-0aae-4ddf-bb5f-49dc19b466b3.png) >> >> I.e., current control of the AddP is incorrect. The problem is that the code in `PhaseIdealLoop::has_local_phi_input` that special cases AddP's only checks control of the Address (and Offset) input, assuming that control of the Base input is consistent. >> https://github.com/openjdk/jdk/blob/aa7ccdf44549a52cce9e99f6569097d3343d9ee4/src/hotspot/share/opto/loopopts.cpp#L333-L337 >> This is not guaranteed though, leading to the AddP ending up with control that is not dominated by control of its base input. >> >> As described below, this only reproduces with a very specific sequence of optimizations triggered by replay compilation with `-XX:+StressIGVN` and a fixed seed. I was not able to extract a regression test. >> >> The fix is to also check control of the Base input when moving the AddP up to a dominating point. For testing purposes, I added an `assert(get_ctrl(m->in(1)) != n_ctrl, "sanity")` without the fix to verify that this change does not affect common cases. It triggers in the failing case but not for any test in tier 1 - 5. 
In addition, I slightly refactored the code of `PhaseIdealLoop::compute_early_ctrl` and added comments. >> >> Gory details below. >> >> Relevant graph after parsing: >> >> 197 CastPP === 1460 60 [[ ... 1724 1725 ]] #java/io/BufferedReader:NotNull * >> 1459 CastPP === 1460 60 [[ ... 1724 1725 ]] #java/io/BufferedReader:NotNull * >> 1725 Phi === 1478 1459 197 [[ 1726 ]] #java/io/BufferedReader:NotNull * >> 1724 Phi === 1478 1459 197 [[ 1726 ]] #java/io/BufferedReader:NotNull * >> 1726 AddP === _ 1724 1725 41 [[ ... ]] Oop:java/io/BufferedReader:NotNull+24 * >> >> `1724 Phi` is then processed by the following code in `PhiNode::Ideal` that replaces its inputs by a cast of the unique input `1730 CastPP`: >> >> https://github.com/openjdk/jdk/blob/aa7ccdf44549a52cce9e99f6569097d3343d9ee4/src/hotspot/share/opto/cfgnode.cpp#L2013-L2016 >> >> ``` >> 197 CastPP === 1460 60 [[ ... 1725 ]] #java/io/BufferedReader:NotNull * >> 1459 CastPP === 1460 60 [[ ... 1725 ]] #java/io/BufferedReader:NotNull * >> 1730 CastPP === 1478 60 [[ ... 1724 1724 ]] #java/io/BufferedReader:NotNull * strong dependency >> 1725 Phi === 1478 1459 197 [[ 1726 ]] #java/io/BufferedReader:NotNull * >> 1724 Phi === 1478 1730 1730 [[ 1726 ]] #java/io/BufferedReader:NotNull * >> 1726 AddP === _ 1724 1725 41 [[ ... ]] Oop:java/io/BufferedReader:NotNull+24 * >> ``` >> Then `1724 Phi` is replaced by the unique input `1730 CastPP`: >> >> 197 CastPP === 1460 60 [[ ... 1725 ]] #java/io/BufferedReader:NotNull * >> 1459 CastPP === 1460 60 [[ ... 1725 ]] #java/io/BufferedReader:NotNull * >> 1725 Phi === 1478 1459 197 [[ 1726 ]] #java/io/BufferedReader:NotNull * >> 1730 CastPP === 1478 60 [[ ... 1726 ]] #java/io/BufferedReader:NotNull * strong dependency >> 1726 AddP === _ 1730 1725 41 [[ ... ]] Oop:java/io/BufferedReader:NotNull+24 * >> >> Now `1459 CastPP` is replaced by identical `197 CastPP`: >> >> 197 CastPP === 1460 60 [[ ... 
1725 ]] #java/io/BufferedReader:NotNull * >> 1725 Phi === 1478 197 197 [[ 1726 1739 ]] #java/io/BufferedReader:NotNull * >> 1730 CastPP === 1478 60 [[ ... 1726 ]] #java/io/BufferedReader:NotNull * strong dependency >> 1726 AddP === _ 1730 1725 41 [[ ... ]] Oop:java/io/BufferedReader:NotNull+24 * >> >> Finally, `1725 Phi` is replaced by unique input `197 CastPP` and the AddP ends up with two casts with different control of the same oop for Base and Address: >> >> 197 CastPP === 1460 60 [[ ... 1725 ]] #java/io/BufferedReader:NotNull * >> 1730 CastPP === 1478 60 [[ ... 1726 ]] #java/io/BufferedReader:NotNull * strong dependency >> 1726 AddP === _ 1730 197 41 [[ ... ]] Oop:java/io/BufferedReader:NotNull+24 * >> >> >> Looking at the above transformation, the root cause is really the `1730 CastPP` added by `PhiNode::Ideal` which is not needed and prevents the two casts from being merged. Is it worth filing a follow-up enhancement to fix this? >> >> Thanks, >> Tobias > > Tobias Hartmann has updated the pull request incrementally with one additional commit since the last revision: > > Fixed wrong node input in PhiNode::Ideal Update looks good. ------------- Marked as reviewed by kvn (Reviewer). 
PR: https://git.openjdk.java.net/jdk/pull/8770 From thartmann at openjdk.java.net Thu May 19 14:50:56 2022 From: thartmann at openjdk.java.net (Tobias Hartmann) Date: Thu, 19 May 2022 14:50:56 GMT Subject: RFR: 8280696: C2 compilation hits assert(is_dominator(c, n_ctrl)) failed [v2] In-Reply-To: <7QkxcgsjuAx6H4eW-MkLgASBmgunUwm3j4YhO0-B4rs=.aa30a84d-8436-4037-b82e-aeb911e37c26@github.com> References: <7QkxcgsjuAx6H4eW-MkLgASBmgunUwm3j4YhO0-B4rs=.aa30a84d-8436-4037-b82e-aeb911e37c26@github.com> Message-ID: On Thu, 19 May 2022 06:03:27 GMT, Tobias Hartmann wrote: >> We hit an assert when computing early control via `PhaseIdealLoop::compute_early_ctrl` for `1726 AddP` because one of its control inputs `1478 Region` does not dominate current control `1350 Loop` of the AddP: >> ![Screenshot from 2022-05-18 15-10-24](https://user-images.githubusercontent.com/5312595/169046641-2ee94257-0aae-4ddf-bb5f-49dc19b466b3.png) >> >> I.e., current control of the AddP is incorrect. The problem is that the code in `PhaseIdealLoop::has_local_phi_input` that special cases AddP's only checks control of the Address (and Offset) input, assuming that control of the Base input is consistent. >> https://github.com/openjdk/jdk/blob/aa7ccdf44549a52cce9e99f6569097d3343d9ee4/src/hotspot/share/opto/loopopts.cpp#L333-L337 >> This is not guaranteed though, leading to the AddP ending up with control that is not dominated by control of its base input. >> >> As described below, this only reproduces with a very specific sequence of optimizations triggered by replay compilation with `-XX:+StressIGVN` and a fixed seed. I was not able to extract a regression test. >> >> The fix is to also check control of the Base input when moving the AddP up to a dominating point. For testing purposes, I added an `assert(get_ctrl(m->in(1)) != n_ctrl, "sanity")` without the fix to verify that this change does not affect common cases. It triggers in the failing case but not for any test in tier 1 - 5. 
In addition, I slightly refactored the code of `PhaseIdealLoop::compute_early_ctrl` and added comments. >> >> Gory details below. >> >> Relevant graph after parsing: >> >> 197 CastPP === 1460 60 [[ ... 1724 1725 ]] #java/io/BufferedReader:NotNull * >> 1459 CastPP === 1460 60 [[ ... 1724 1725 ]] #java/io/BufferedReader:NotNull * >> 1725 Phi === 1478 1459 197 [[ 1726 ]] #java/io/BufferedReader:NotNull * >> 1724 Phi === 1478 1459 197 [[ 1726 ]] #java/io/BufferedReader:NotNull * >> 1726 AddP === _ 1724 1725 41 [[ ... ]] Oop:java/io/BufferedReader:NotNull+24 * >> >> `1724 Phi` is then processed by the following code in `PhiNode::Ideal` that replaces its inputs by a cast of the unique input `1730 CastPP`: >> >> https://github.com/openjdk/jdk/blob/aa7ccdf44549a52cce9e99f6569097d3343d9ee4/src/hotspot/share/opto/cfgnode.cpp#L2013-L2016 >> >> ``` >> 197 CastPP === 1460 60 [[ ... 1725 ]] #java/io/BufferedReader:NotNull * >> 1459 CastPP === 1460 60 [[ ... 1725 ]] #java/io/BufferedReader:NotNull * >> 1730 CastPP === 1478 60 [[ ... 1724 1724 ]] #java/io/BufferedReader:NotNull * strong dependency >> 1725 Phi === 1478 1459 197 [[ 1726 ]] #java/io/BufferedReader:NotNull * >> 1724 Phi === 1478 1730 1730 [[ 1726 ]] #java/io/BufferedReader:NotNull * >> 1726 AddP === _ 1724 1725 41 [[ ... ]] Oop:java/io/BufferedReader:NotNull+24 * >> ``` >> Then `1724 Phi` is replaced by the unique input `1730 CastPP`: >> >> 197 CastPP === 1460 60 [[ ... 1725 ]] #java/io/BufferedReader:NotNull * >> 1459 CastPP === 1460 60 [[ ... 1725 ]] #java/io/BufferedReader:NotNull * >> 1725 Phi === 1478 1459 197 [[ 1726 ]] #java/io/BufferedReader:NotNull * >> 1730 CastPP === 1478 60 [[ ... 1726 ]] #java/io/BufferedReader:NotNull * strong dependency >> 1726 AddP === _ 1730 1725 41 [[ ... ]] Oop:java/io/BufferedReader:NotNull+24 * >> >> Now `1459 CastPP` is replaced by identical `197 CastPP`: >> >> 197 CastPP === 1460 60 [[ ... 
1725 ]] #java/io/BufferedReader:NotNull * >> 1725 Phi === 1478 197 197 [[ 1726 1739 ]] #java/io/BufferedReader:NotNull * >> 1730 CastPP === 1478 60 [[ ... 1726 ]] #java/io/BufferedReader:NotNull * strong dependency >> 1726 AddP === _ 1730 1725 41 [[ ... ]] Oop:java/io/BufferedReader:NotNull+24 * >> >> Finally, `1725 Phi` is replaced by unique input `197 CastPP` and the AddP ends up with two casts with different control of the same oop for Base and Address: >> >> 197 CastPP === 1460 60 [[ ... 1725 ]] #java/io/BufferedReader:NotNull * >> 1730 CastPP === 1478 60 [[ ... 1726 ]] #java/io/BufferedReader:NotNull * strong dependency >> 1726 AddP === _ 1730 197 41 [[ ... ]] Oop:java/io/BufferedReader:NotNull+24 * >> >> >> Looking at the above transformation, the root cause is really the `1730 CastPP` added by `PhiNode::Ideal` which is not needed and prevents the two casts from being merged. Is it worth filing a follow-up enhancement to fix this? >> >> Thanks, >> Tobias > > Tobias Hartmann has updated the pull request incrementally with one additional commit since the last revision: > > Fixed wrong node input in PhiNode::Ideal Thanks, Vladimir! ------------- PR: https://git.openjdk.java.net/jdk/pull/8770 From thartmann at openjdk.java.net Thu May 19 14:56:58 2022 From: thartmann at openjdk.java.net (Tobias Hartmann) Date: Thu, 19 May 2022 14:56:58 GMT Subject: Integrated: 8280696: C2 compilation hits assert(is_dominator(c, n_ctrl)) failed In-Reply-To: References: Message-ID: On Wed, 18 May 2022 14:34:58 GMT, Tobias Hartmann wrote: > We hit an assert when computing early control via `PhaseIdealLoop::compute_early_ctrl` for `1726 AddP` because one of its control inputs `1478 Region` does not dominate current control `1350 Loop` of the AddP: > ![Screenshot from 2022-05-18 15-10-24](https://user-images.githubusercontent.com/5312595/169046641-2ee94257-0aae-4ddf-bb5f-49dc19b466b3.png) > > I.e., current control of the AddP is incorrect. 
The problem is that the code in `PhaseIdealLoop::has_local_phi_input` that special cases AddP's only checks control of the Address (and Offset) input, assuming that control of the Base input is consistent. > https://github.com/openjdk/jdk/blob/aa7ccdf44549a52cce9e99f6569097d3343d9ee4/src/hotspot/share/opto/loopopts.cpp#L333-L337 > This is not guaranteed though, leading to the AddP ending up with control that is not dominated by control of its base input. > > As described below, this only reproduces with a very specific sequence of optimizations triggered by replay compilation with `-XX:+StressIGVN` and a fixed seed. I was not able to extract a regression test. > > The fix is to also check control of the Base input when moving the AddP up to a dominating point. For testing purposes, I added an `assert(get_ctrl(m->in(1)) != n_ctrl, "sanity")` without the fix to verify that this change does not affect common cases. It triggers in the failing case but not for any test in tier 1 - 5. In addition, I slightly refactored the code of `PhaseIdealLoop::compute_early_ctrl` and added comments. > > Gory details below. > > Relevant graph after parsing: > > 197 CastPP === 1460 60 [[ ... 1724 1725 ]] #java/io/BufferedReader:NotNull * > 1459 CastPP === 1460 60 [[ ... 1724 1725 ]] #java/io/BufferedReader:NotNull * > 1725 Phi === 1478 1459 197 [[ 1726 ]] #java/io/BufferedReader:NotNull * > 1724 Phi === 1478 1459 197 [[ 1726 ]] #java/io/BufferedReader:NotNull * > 1726 AddP === _ 1724 1725 41 [[ ... ]] Oop:java/io/BufferedReader:NotNull+24 * > > `1724 Phi` is then processed by the following code in `PhiNode::Ideal` that replaces its inputs by a cast of the unique input `1730 CastPP`: > > https://github.com/openjdk/jdk/blob/aa7ccdf44549a52cce9e99f6569097d3343d9ee4/src/hotspot/share/opto/cfgnode.cpp#L2013-L2016 > > ``` > 197 CastPP === 1460 60 [[ ... 1725 ]] #java/io/BufferedReader:NotNull * > 1459 CastPP === 1460 60 [[ ... 
1725 ]] #java/io/BufferedReader:NotNull * > 1730 CastPP === 1478 60 [[ ... 1724 1724 ]] #java/io/BufferedReader:NotNull * strong dependency > 1725 Phi === 1478 1459 197 [[ 1726 ]] #java/io/BufferedReader:NotNull * > 1724 Phi === 1478 1730 1730 [[ 1726 ]] #java/io/BufferedReader:NotNull * > 1726 AddP === _ 1724 1725 41 [[ ... ]] Oop:java/io/BufferedReader:NotNull+24 * > ``` > Then `1724 Phi` is replaced by the unique input `1730 CastPP`: > > 197 CastPP === 1460 60 [[ ... 1725 ]] #java/io/BufferedReader:NotNull * > 1459 CastPP === 1460 60 [[ ... 1725 ]] #java/io/BufferedReader:NotNull * > 1725 Phi === 1478 1459 197 [[ 1726 ]] #java/io/BufferedReader:NotNull * > 1730 CastPP === 1478 60 [[ ... 1726 ]] #java/io/BufferedReader:NotNull * strong dependency > 1726 AddP === _ 1730 1725 41 [[ ... ]] Oop:java/io/BufferedReader:NotNull+24 * > > Now `1459 CastPP` is replaced by identical `197 CastPP`: > > 197 CastPP === 1460 60 [[ ... 1725 ]] #java/io/BufferedReader:NotNull * > 1725 Phi === 1478 197 197 [[ 1726 1739 ]] #java/io/BufferedReader:NotNull * > 1730 CastPP === 1478 60 [[ ... 1726 ]] #java/io/BufferedReader:NotNull * strong dependency > 1726 AddP === _ 1730 1725 41 [[ ... ]] Oop:java/io/BufferedReader:NotNull+24 * > > Finally, `1725 Phi` is replaced by unique input `197 CastPP` and the AddP ends up with two casts with different control of the same oop for Base and Address: > > 197 CastPP === 1460 60 [[ ... 1725 ]] #java/io/BufferedReader:NotNull * > 1730 CastPP === 1478 60 [[ ... 1726 ]] #java/io/BufferedReader:NotNull * strong dependency > 1726 AddP === _ 1730 197 41 [[ ... ]] Oop:java/io/BufferedReader:NotNull+24 * > > > Looking at the above transformation, the root cause is really the `1730 CastPP` added by `PhiNode::Ideal` which is not needed and prevents the two casts from being merged. Is it worth filing a follow-up enhancement to fix this? > > Thanks, > Tobias This pull request has now been integrated. 
Changeset: fa1b56ed Author: Tobias Hartmann URL: https://git.openjdk.java.net/jdk/commit/fa1b56ede6eed653f70efbbfff3af5ee6b481ee4 Stats: 15 lines in 2 files changed: 4 ins; 3 del; 8 mod 8280696: C2 compilation hits assert(is_dominator(c, n_ctrl)) failed Reviewed-by: kvn, chagedorn, roland ------------- PR: https://git.openjdk.java.net/jdk/pull/8770 From jbhateja at openjdk.java.net Thu May 19 15:37:29 2022 From: jbhateja at openjdk.java.net (Jatin Bhateja) Date: Thu, 19 May 2022 15:37:29 GMT Subject: RFR: 8284960: Integration of JEP 426: Vector API (Fourth Incubator) [v3] In-Reply-To: References: <15GChtdthFmu9Cup-Ykj5NBvAanOC8QOJsnhH9g20KY=.f35eba31-15f9-40e8-95ce-a54049792840@github.com> Message-ID: On Wed, 18 May 2022 23:28:22 GMT, Vladimir Ivanov wrote: >> It's more of a chicken-and-egg problem here: for a masked reverse operation, the Reverse IR node is followed by a Blend node, so an eager Identity transform in Reverse::Identity will not work. Deferring this to the blend may not work either, since it could be a non-masked reverse operation. We could have handled it as a special case in inline_vector_nary_operation, but handling such a special case in final graph reshaping looked more appropriate. >> >> https://github.com/openjdk/panama-vector/pull/182#discussion_r845678080 > > Do you mean it's important to apply the transformation at the right node (pick the right node as the root) and it is hard to make a decision during GVN? Yes, that's what I meant.
------------- PR: https://git.openjdk.java.net/jdk/pull/8425 From jbhateja at openjdk.java.net Thu May 19 15:41:19 2022 From: jbhateja at openjdk.java.net (Jatin Bhateja) Date: Thu, 19 May 2022 15:41:19 GMT Subject: RFR: 8284960: Integration of JEP 426: Vector API (Fourth Incubator) [v3] In-Reply-To: References: <15GChtdthFmu9Cup-Ykj5NBvAanOC8QOJsnhH9g20KY=.f35eba31-15f9-40e8-95ce-a54049792840@github.com> Message-ID: On Wed, 18 May 2022 23:35:54 GMT, Vladimir Ivanov wrote: >> It was an attempt to facilitate inlining of these APIs on targets which do not intrinsify them. I agree it's not a generic fix, since three APIs are piggybacking on the same entry point, and without knowledge of the opcode it would be inappropriate to make any decision at this place. Lazy intrinsification gives an opportunity for some of the predications to concretize, as compilation happens under closed-world assumptions. > > Still not clear why the code is shaped the way it is. > > `Matcher::match_rule_supported_vector()` already checks that there are relevant matching rules. > > The checks require both `CompressM` and `CompressV` to be present to enable the intrinsic. Is it important? > > Also, it doesn't take `EnableVectorSupport` into account while all other vector intrinsics respect it. Yes, the code was modified to accommodate your comments.
https://github.com/openjdk/jdk/pull/8425/files#diff-a9dd7e411772c1ee37b54c5ab868a01fe82af905758350f0ba1c370f422c3fe6R718 ------------- PR: https://git.openjdk.java.net/jdk/pull/8425 From sviswanathan at openjdk.java.net Thu May 19 17:04:40 2022 From: sviswanathan at openjdk.java.net (Sandhya Viswanathan) Date: Thu, 19 May 2022 17:04:40 GMT Subject: RFR: 8286972: Support the new loop induction variable related PopulateIndex IR node on x86 [v2] In-Reply-To: References: <0gCuEnJHcbS5WwOKcRAt6rZy8bhoX2Rsxpd06GW2-p8=.c87668b8-73a1-42a0-b6d5-9e52ca092cc0@github.com> Message-ID: <4-aelr3_TtPuLpRPMX6zTyo9ml46BRnmaZ9ZYqiDn24=.9d281c1c-fb95-41d0-8530-d1159fbbbc7b@github.com> On Thu, 19 May 2022 02:59:10 GMT, Vladimir Kozlov wrote: >> Sandhya Viswanathan has updated the pull request incrementally with one additional commit since the last revision: >> >> review comment resolution > > @sviswa7 Can you add an IR framework test to verify generation of the PopulateIndex node? And a regression test. > I see that [8280510](https://bugs.openjdk.java.net/browse/JDK-8280510) added only a microbenchmark. @vnkozlov thanks a lot. ------------- PR: https://git.openjdk.java.net/jdk/pull/8778 From kvn at openjdk.java.net Thu May 19 19:32:43 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Thu, 19 May 2022 19:32:43 GMT Subject: RFR: 8287001: Add warning message when fail to load hsdis libraries In-Reply-To: References: Message-ID: On Thu, 19 May 2022 06:37:28 GMT, yuta wrote: > When loading the hsdis (HotSpot Disassembler) library fails (because the library is missing, hsdis.so is too old, and so on), > there is no warning message (info-level messages are only visible with -Xlog:os=info). > A warning message should tell the user that loading the hsdis libraries failed. > So I added a warning message to report this. > > e.g. > ` Can you put the warning into `dll_load()`? We already print messages there with `-XX:+Verbose` (unfortunately it is available only in debug VM).
Actually consider replacing print statements and `Verbose` check there with UL. ------------- Changes requested by kvn (Reviewer). PR: https://git.openjdk.java.net/jdk/pull/8782 From vladimir.kozlov at oracle.com Thu May 19 20:00:11 2022 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Thu, 19 May 2022 13:00:11 -0700 Subject: Mismatched ciMethodData in replay file. In-Reply-To: <5cd7f04c-c9ad-bbaa-6cba-0616ca9cae9d@amazon.com> References: <5cd7f04c-c9ad-bbaa-6cba-0616ca9cae9d@amazon.com> Message-ID: <5ab5eda1-388e-9909-bf29-67e445a52ba8@oracle.com> Narrowed to hotspot-compiler list. You are right that it is weird. The dump is done from ciMethodData which is local (for compiler thread) clone of MDO which should not be updated during compilation. Unless there is a place we still go into VM for get some numbers. One explanation is that during dump we hit safepoint and we lost part of output. I think we need a verification mode for replay dump to catch such case (separate count to catch such mismatch). Thanks, Vladimir K On 5/18/22 11:14 PM, Liu, Xin wrote: > hi, > > I get a weird replay, which was generated by 17.0.3+6-LTS. I don't see > relevant code have changed since then, so I think it is still applicable > to the tip of HotSpot. > > A customer shared the replay file > with(https://github.com/corretto/corretto-17/issues/57#issuecomment-1130042063) > and I am trying to reproduce his failure. it is written from > VMError::report_and_die(). > > One obstacle is that weird entries of ciMethodData. eg. line > 14130, It declares that there will 2 non-null oops followed(see '2' > after tag 'oops'. however, one only is recorded. 
> > ciMethodData kotlin/coroutines/jvm/internal/ContinuationImpl > (Lkotlin/coroutines/Continuation;)V 2 21538 orig 80 0 0 0 0 0 0 0 0 0 0 > 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 > 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 data > 26 0x40007 0x402 0x70 0x4e9c 0x70005 0x4d55 0x0 0x7f6a5841c3c0 0xa3 > 0x7f6a5841c470 0xa4 0xc0003 0x4e9c 0x18 0x110002 0x529e 0x0 0x0 0x0 0x0 > 0x0 0x0 0x9 0x2 0x6 0x0 oops 2 7 > com/example/ProductAttRouter$withRequestLoggingContext$1 methods 0 > > Another mismatched entry is at line 14203. it says there are 11 > oops but only 6 are there. > > Those mismatched entries leave uninitialized elements of rec->_classes > and eventually crash ciReplay::initialize(). Have you seen them before? > I can patch up hotspot to handle this mismatch, but I wonder how that > happens? > > ciMethodData::dump_replay_data() iterates > _data 2 rounds. The 1st round counts them and second round dumps them. > https://github.com/openjdk/jdk/blob/master/src/hotspot/share/ci/ciMethodData.cpp#L728 > > Is that possible that the underlying data get updated on the fly? This > case is kotlin coroutine. I am not > sure whether it is same threading environment as classic Java. > > thanks, > --lx > From dlong at openjdk.java.net Thu May 19 20:02:07 2022 From: dlong at openjdk.java.net (Dean Long) Date: Thu, 19 May 2022 20:02:07 GMT Subject: Integrated: 8271078: jdk/incubator/vector/Float128VectorTests.java failed a subtest In-Reply-To: References: Message-ID: On Fri, 13 May 2022 02:21:06 GMT, Dean Long wrote: > This test was failing because the safepoint polling stub was only saving the low 8 bytes of XMM16-XMM31. We need to save all 16 bytes by default. I also added "wide" to the boolean parameter name to better reflect what it controls. And I made the asserts in save_live_registers() match what we have in restore_live_registers(). This pull request has now been integrated. 
Changeset: b0892295 Author: Dean Long URL: https://git.openjdk.java.net/jdk/commit/b0892295ee12c0c58c0121ff7f5f585b32d60eeb Stats: 28 lines in 1 file changed: 2 ins; 0 del; 26 mod 8271078: jdk/incubator/vector/Float128VectorTests.java failed a subtest Reviewed-by: kvn, vlivanov ------------- PR: https://git.openjdk.java.net/jdk/pull/8690 From duke at openjdk.java.net Thu May 19 20:28:48 2022 From: duke at openjdk.java.net (Srinivas Vamsi Parasa) Date: Thu, 19 May 2022 20:28:48 GMT Subject: RFR: 8285868: x86_64 intrinsics for floating point methods isNaN, isFinite and isInfinite [v8] In-Reply-To: References: Message-ID: <8Y7bFHBuB48Jryz-OUFEXL3oqKE6pUzlU4Y9VRclix8=.8d1283ef-959b-43a4-848a-b0268b0978e7@github.com> On Thu, 19 May 2022 05:22:34 GMT, Quan Anh Mai wrote: > It sounds strange, please show the asm of your patch with respects to the benchmark. Also, please try cmoving with other arbitrary values such as 19 and 7 instead of `false` and `true`. The latter may be recognised as simple boolean not operation, remove the real comparison part, which defeats the purpose of Vladimir's suggestion. > > Regarding vectorisation, `isNaN` is a simple comparison and can be easily auto-vectorised without help from intrinsics. > > My speculation: > > A native comparison such as `x != x` can be parsed directly by the compiler. As a result, the graph of the expression `if (x != x)` is simply > > ``` > CmpF > | > Bool > | > If > ``` > > Your intrinsics, on the other hand, do not return the results on the flags, which leads to an extra comparison when using in conditions, `if(isNaN(x))` becomes > > ``` > IsNaN 0 > \ / > CmpI > | > Bool > | > If > ``` > > In your benchmark, however, using this comparison to cmoving between 0 and 1 (`false` and `true`), the compiler recognised the pattern `x != 0 ? 0 : 1` with `x` having the type of `TypeInt::BOOL`. 
As a result, it reduces the graph into
>
> ```
> IsNaN  1
>     \ /
>     XorI
> ```
>
> Personally, I'm not into this implementation of intrinsics. FYI, gcc and clang both use sequences similar to `x != x` for `std::isnan`, `Math.abs(x) <= MAX_VALUE` for `std::isfinite` and `Math.abs(x) > MAX_VALUE` for `std::isinf`. The first one reduces to a single instruction `ucomiss x, x` so there is no reason to optimise further. The others are compiled down to 2 instructions each `vandpd t, x, [SIGN_ELIMINATE]; ucomiss t, [MAX_VALUE]`, so optimising these further requires careful assessment.
>
> If you feel comfortable I would suggest you build the graph for these intrinsics as
>
> ```
>   X
>   |
>  Bool  1  0
>     \  |  /
>      CMove
> ```
>
> Then we can add ideal rules to `BoolNode` to recognise the patterns
>
> ```
>   X
>   |
>  Bool  1  0
>     \  |  /
>      CMove  0
>          \ /
>          CmpI
>           |
>          Bool
> ```
>
> And reduce them to
>
> ```
>   X
>   |
>  Bool
> ```
>
> With this, we can have the `Double::isInfinite` intrinsics compiled down to `vfpclass k, x; ktest k`, which is much preferable. For non-AVX512DQ though I would prefer implementing them in Java similar to what is described above. Both abs and comparison nodes are not hard to vectorise so it would not be a problem.
>
> Thanks a lot.

As per the performance data,
- for `isInfinite()`, the intrinsic is giving 60% speedup for `Float` (and 40% for `Double`) compared to the Java implementation.
- for `isFinite()`, the intrinsic is giving 20% speedup for `Float` (and 6% slowdown for `Double`) instead of the Java implementation (`return Math.abs(f) <= Float.MAX_VALUE`).
- for `isNaN()`, the flags fixup is skewing the results in favor of the intrinsic (5x for `Float` and 4x for `Double`) but you're saying that your patch will make this intrinsic approach unnecessary.

I agree with your logic, but we also need to back it up with performance data. Could you please run my benchmark for `isNaN()` with your patch and provide the performance data?
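The scalar formulations mentioned in the quoted review (`x != x`, `Math.abs(x) <= MAX_VALUE`, `Math.abs(x) > MAX_VALUE`) can be sketched in plain Java as below. This is only an illustration of those expressions, not code from either pull request; the class name `FpClassifySketch` and its method names are hypothetical, and the sketch says nothing about the machine code C2 actually emits for them.

```java
// Hypothetical illustration: pure-Java formulations of the three
// floating-point classification checks discussed in the thread.
final class FpClassifySketch {
    // NaN is the only value that compares unequal to itself.
    static boolean isNaN(float x) {
        return x != x;
    }

    // Finite values have magnitude at most MAX_VALUE; the comparison is
    // false for NaN, so NaN is correctly reported as not finite.
    static boolean isFinite(float x) {
        return Math.abs(x) <= Float.MAX_VALUE;
    }

    // Only the two infinities have magnitude above MAX_VALUE; the
    // comparison is false for NaN, so NaN is not reported as infinite.
    static boolean isInfinite(float x) {
        return Math.abs(x) > Float.MAX_VALUE;
    }

    public static void main(String[] args) {
        float[] samples = {0.0f, -1.5f, Float.MAX_VALUE,
                           Float.POSITIVE_INFINITY, Float.NaN};
        for (float x : samples) {
            System.out.println(x + ": isNaN=" + isNaN(x)
                    + " isFinite=" + isFinite(x)
                    + " isInfinite=" + isInfinite(x));
        }
    }
}
```

The sketch agrees with `Float.isNaN`, `Float.isFinite` and `Float.isInfinite` on the sampled values; the `Double` variants would follow the same shape with `Double.MAX_VALUE`.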
That way we can quantify the performance of your proposed optimization of `isNaN()`. ------------- PR: https://git.openjdk.java.net/jdk/pull/8459 From duke at openjdk.java.net Thu May 19 20:46:40 2022 From: duke at openjdk.java.net (Srinivas Vamsi Parasa) Date: Thu, 19 May 2022 20:46:40 GMT Subject: RFR: 8285973: x86_64: Improve fp comparison and cmove for eq/ne [v2] In-Reply-To: References: Message-ID: On Wed, 18 May 2022 14:59:49 GMT, Quan Anh Mai wrote:

>> Hi,
>>
>> This patch optimises the matching rules for floating-point comparison with respect to eq/ne on x86-64
>>
>> 1, When the inputs of a comparison are the same (i.e `isNaN` patterns), `ZF` is always set, so we don't need `cmpOpUCF2` for the eq/ne cases, which improves the sequence of `If (CmpF x x) (Bool ne)` from
>>
>>     ucomiss xmm0, xmm0
>>     jp label
>>     jne label
>>
>> into
>>
>>     ucomiss xmm0, xmm0
>>     jp label
>>
>> 2, The move rules for `cmpOpUCF2` are missing, which makes patterns such as `x == y ? 1 : 0` fall back to `cmpOpU`, which has a really high cost of fixing the flags, such as
>>
>>     xorl ecx, ecx
>>     ucomiss xmm0, xmm1
>>     jnp done
>>     pushf
>>     andq [rsp], 0xffffff2b
>>     popf
>>     done:
>>     movl eax, 1
>>     cmovel eax, ecx
>>
>> The patch changes this sequence into
>>
>>     xorl ecx, ecx
>>     ucomiss xmm0, xmm1
>>     movl eax, 1
>>     cmovpl eax, ecx
>>     cmovnel eax, ecx
>>
>> 3, The patch also changes the pattern of `isInfinite` to be more optimised by using `Math.abs` to reduce 1 comparison and compares the result with `MAX_VALUE` since `>` is more optimised than `==` for floating-point types.
>>
>> The benchmark results are as follows:
>>
>> Before:
>> Benchmark                      Mode  Cnt     Score     Error  Units
>> FPComparison.equalDouble       avgt    5  2876.242 ±  58.875  ns/op
>> FPComparison.equalFloat        avgt    5  3062.430 ±  31.371  ns/op
>> FPComparison.isFiniteDouble    avgt    5   475.749 ±  19.027  ns/op
>> FPComparison.isFiniteFloat     avgt    5   506.525 ±  14.417  ns/op
>> FPComparison.isInfiniteDouble  avgt    5  1232.800 ±  31.677  ns/op
>> FPComparison.isInfiniteFloat   avgt    5  1234.708 ±  70.239  ns/op
>> FPComparison.isNanDouble       avgt    5  2255.847 ±   7.238  ns/op
>> FPComparison.isNanFloat        avgt    5  2567.044 ±  36.078  ns/op
>>
>> After:
>> Benchmark                      Mode  Cnt     Score     Error  Units
>> FPComparison.equalDouble       avgt    5   594.636 ±   8.922  ns/op
>> FPComparison.equalFloat        avgt    5   663.849 ±   3.656  ns/op
>> FPComparison.isFiniteDouble    avgt    5   518.309 ± 107.352  ns/op
>> FPComparison.isFiniteFloat     avgt    5   515.576 ±  14.669  ns/op
>> FPComparison.isInfiniteDouble  avgt    5   621.185 ±  11.935  ns/op
>> FPComparison.isInfiniteFloat   avgt    5   623.566 ±  15.206  ns/op
>> FPComparison.isNanDouble       avgt    5   400.124 ±   0.762  ns/op
>> FPComparison.isNanFloat        avgt    5   546.486 ±   1.509  ns/op
>>
>> Thank you very much.
>
> Quan Anh Mai has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains eight additional commits since the last revision:
>
>  - incidental ws
>  - add tests
>  - Merge branch 'master' into fpcompare
>  - fix tests
>  - test
>  - improve infinity
>  - remove expensive rules
>  - improve fp comparison

Could you please show the performance delta between the baseline and the patch? Otherwise, people reviewing this PR have to manually compute how much improvement is obtained. For example, `FPComparison.isNanFloat` is showing `4.7x` improvement. Kindly fill the delta column for the rest of the data points.
Instead of two separate tables, the suggested table format for each row would be: ` , , , ` ------------- PR: https://git.openjdk.java.net/jdk/pull/8525 From jbhateja at openjdk.java.net Thu May 19 21:11:41 2022 From: jbhateja at openjdk.java.net (Jatin Bhateja) Date: Thu, 19 May 2022 21:11:41 GMT Subject: RFR: 8284960: Integration of JEP 426: Vector API (Fourth Incubator) [v7] In-Reply-To: References: Message-ID: > Hi All, > > Patch adds the planned support for new vector operations and APIs targeted for [JEP 426: Vector API (Fourth Incubator).](https://bugs.openjdk.java.net/browse/JDK-8280173) > > Following is the brief summary of changes:- > > 1) Extends the scope of existing lanewise API for following new vector operations. > - VectorOperations.BIT_COUNT: counts the number of one-bits > - VectorOperations.LEADING_ZEROS_COUNT: counts the number of leading zero bits > - VectorOperations.TRAILING_ZEROS_COUNT: counts the number of trailing zero bits > - VectorOperations.REVERSE: reversing the order of bits > - VectorOperations.REVERSE_BYTES: reversing the order of bytes > - compress and expand bits: Semantics are based on Hacker's Delight section 7-4 Compress, or Generalized Extract. > > 2) Adds following new APIs to perform cross lane vector compress and expansion operations under the influence of a mask. > - Vector.compress > - Vector.expand > - VectorMask.compress > > 3) Adds predicated and non-predicated versions of following new APIs to load and store the contents of vector from foreign MemorySegments. > - Vector.fromMemorySegment > - Vector.intoMemorySegment > > 4) C2 Compiler IR enhancements and optimized X86 and AARCH64 backend support for each newly added operation. > > > Patch has been regressed over AARCH64 and X86 targets different AVX levels. > > Kindly review and share your feedback. > > Best Regards, > Jatin Jatin Bhateja has updated the pull request with a new target base due to a merge or a rebase. 
The pull request now contains 16 commits: - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8284960 - 8284960: Changes to enable jdk.incubator.vector to be treated as preview participant. Code re-organization related to Reverse/ReverseByte IR transforms. - 8284960: Adding --enable-preview in vectorAPI benchmarks. - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8284960 - 8284960: Review comments resolution. - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8284960 - 8284960: Correcting a typo. - 8284960: Integrating changes from panama-vector (Add @since 19 tags). - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8284960 - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8284960 - ... and 6 more: https://git.openjdk.java.net/jdk/compare/9f562ef7...311f3233 ------------- Changes: https://git.openjdk.java.net/jdk/pull/8425/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=8425&range=06 Stats: 38049 lines in 228 files changed: 16683 ins; 16923 del; 4443 mod Patch: https://git.openjdk.java.net/jdk/pull/8425.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8425/head:pull/8425 PR: https://git.openjdk.java.net/jdk/pull/8425 From jbhateja at openjdk.java.net Thu May 19 21:14:18 2022 From: jbhateja at openjdk.java.net (Jatin Bhateja) Date: Thu, 19 May 2022 21:14:18 GMT Subject: RFR: 8284960: Integration of JEP 426: Vector API (Fourth Incubator) [v3] In-Reply-To: References: <15GChtdthFmu9Cup-Ykj5NBvAanOC8QOJsnhH9g20KY=.f35eba31-15f9-40e8-95ce-a54049792840@github.com> Message-ID: On Thu, 19 May 2022 15:33:49 GMT, Jatin Bhateja wrote: >> Do you mean it's important to apply the transformation at the right node (pick the right node as the root) and it is hard to make a decision during GVN? 
> > Yes, that what I meant, but with recently added Node::Flag_is_predicated_using_blend it could be possible to move this transformation ahead into idealization routines of reverse/reverse bytes IR nodes. Addressed this after internally discussing with Sandhya. Moved the transforms from final graph re-shaping back to vector intrinsic routines. ------------- PR: https://git.openjdk.java.net/jdk/pull/8425 From psandoz at openjdk.java.net Thu May 19 21:23:22 2022 From: psandoz at openjdk.java.net (Paul Sandoz) Date: Thu, 19 May 2022 21:23:22 GMT Subject: RFR: 8284960: Integration of JEP 426: Vector API (Fourth Incubator) [v7] In-Reply-To: References: Message-ID: On Thu, 19 May 2022 21:11:41 GMT, Jatin Bhateja wrote: >> Hi All, >> >> Patch adds the planned support for new vector operations and APIs targeted for [JEP 426: Vector API (Fourth Incubator).](https://bugs.openjdk.java.net/browse/JDK-8280173) >> >> Following is the brief summary of changes:- >> >> 1) Extends the scope of existing lanewise API for following new vector operations. >> - VectorOperations.BIT_COUNT: counts the number of one-bits >> - VectorOperations.LEADING_ZEROS_COUNT: counts the number of leading zero bits >> - VectorOperations.TRAILING_ZEROS_COUNT: counts the number of trailing zero bits >> - VectorOperations.REVERSE: reversing the order of bits >> - VectorOperations.REVERSE_BYTES: reversing the order of bytes >> - compress and expand bits: Semantics are based on Hacker's Delight section 7-4 Compress, or Generalized Extract. >> >> 2) Adds following new APIs to perform cross lane vector compress and expansion operations under the influence of a mask. >> - Vector.compress >> - Vector.expand >> - VectorMask.compress >> >> 3) Adds predicated and non-predicated versions of following new APIs to load and store the contents of vector from foreign MemorySegments. 
>> - Vector.fromMemorySegment >> - Vector.intoMemorySegment >> >> 4) C2 Compiler IR enhancements and optimized X86 and AARCH64 backend support for each newly added operation. >> >> >> Patch has been regressed over AARCH64 and X86 targets different AVX levels. >> >> Kindly review and share your feedback. >> >> Best Regards, >> Jatin > > Jatin Bhateja has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 16 commits: > > - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8284960 > - 8284960: Changes to enable jdk.incubator.vector to be treated as preview participant. Code re-organization related to Reverse/ReverseByte IR transforms. > - 8284960: Adding --enable-preview in vectorAPI benchmarks. > - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8284960 > - 8284960: Review comments resolution. > - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8284960 > - 8284960: Correcting a typo. > - 8284960: Integrating changes from panama-vector (Add @since 19 tags). > - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8284960 > - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8284960 > - ... and 6 more: https://git.openjdk.java.net/jdk/compare/9f562ef7...311f3233 src/jdk.compiler/share/classes/com/sun/tools/javac/code/Preview.java line 50: > 48: import java.util.Set; > 49: > 50: import static com.sun.tools.javac.code.Flags.PREVIEW_API; Suggestion: Redundant import (sorry i should have checked before i sent you updates to this area) ------------- PR: https://git.openjdk.java.net/jdk/pull/8425 From duke at openjdk.java.net Thu May 19 22:46:44 2022 From: duke at openjdk.java.net (Brian J. 
Stafford) Date: Thu, 19 May 2022 22:46:44 GMT Subject: RFR: 8263075: C2: simplify anti-dependence check in PhaseCFG::implicit_null_check() [v3] In-Reply-To: References: Message-ID: > The reporter for this issue (https://bugs.openjdk.java.net/browse/JDK-8263075) indicated that there's an assumption that we can rely on that the while loop in question will run exactly one time. Based on this, I've done the following: > > - Asserted the condition that makes sure the code runs at least once > - Asserted the condition that makes sure the code runs only once > - Removed the `while` loop > - Changed a couple of `break` statements into `continue` statements. They no longer need to break out of the `while` loop, now that it's gone. However, they were early exits from the `while` loop that ended up resulting in `continue` statements for the larger enclosing loop. Thus we can just call `continue` directly. > - Removed the local variable `b`, as we no longer need to traverse the node hierarchy. We can use `mb` directly. > > Passes jdk, langtools, and hotspot Tier 1 tests on Linux (x64 and ARM64) and macOS (x64 and ARM64). Most Tier 1 tests pass on Windows (x64 and ARM64), but there are a handful of failures unrelated to this change. Brian J. 
Stafford has updated the pull request incrementally with one additional commit since the last revision: Updated based on PR feedback ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/8684/files - new: https://git.openjdk.java.net/jdk/pull/8684/files/c055174a..8129fe21 Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=8684&range=02 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=8684&range=01-02 Stats: 14 lines in 1 file changed: 3 ins; 6 del; 5 mod Patch: https://git.openjdk.java.net/jdk/pull/8684.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8684/head:pull/8684 PR: https://git.openjdk.java.net/jdk/pull/8684 From duke at openjdk.java.net Thu May 19 22:54:25 2022 From: duke at openjdk.java.net (Brian J. Stafford) Date: Thu, 19 May 2022 22:54:25 GMT Subject: RFR: 8263075: C2: simplify anti-dependence check in PhaseCFG::implicit_null_check() [v4] In-Reply-To: References: Message-ID: > The reporter for this issue (https://bugs.openjdk.java.net/browse/JDK-8263075) indicated that there's an assumption that we can rely on that the while loop in question will run exactly one time. Based on this, I've done the following: > > - Asserted the condition that makes sure the code runs at least once > - Asserted the condition that makes sure the code runs only once > - Removed the `while` loop > - Changed a couple of `break` statements into `continue` statements. They no longer need to break out of the `while` loop, now that it's gone. However, they were early exits from the `while` loop that ended up resulting in `continue` statements for the larger enclosing loop. Thus we can just call `continue` directly. > - Removed the local variable `b`, as we no longer need to traverse the node hierarchy. We can use `mb` directly. > > Passes jdk, langtools, and hotspot Tier 1 tests on Linux (x64 and ARM64) and macOS (x64 and ARM64). 
Most Tier 1 tests pass on Windows (x64 and ARM64), but there are a handful of failures unrelated to this change.

Brian J. Stafford has updated the pull request incrementally with one additional commit since the last revision:

  Removing whitespace

-------------

Changes:
  - all: https://git.openjdk.java.net/jdk/pull/8684/files
  - new: https://git.openjdk.java.net/jdk/pull/8684/files/8129fe21..df661be7

Webrevs:
 - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=8684&range=03
 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=8684&range=02-03

  Stats: 4 lines in 1 file changed: 0 ins; 0 del; 4 mod
  Patch: https://git.openjdk.java.net/jdk/pull/8684.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk pull/8684/head:pull/8684

PR: https://git.openjdk.java.net/jdk/pull/8684

From duke at openjdk.java.net Thu May 19 22:54:26 2022
From: duke at openjdk.java.net (Brian J. Stafford)
Date: Thu, 19 May 2022 22:54:26 GMT
Subject: RFR: 8263075: C2: simplify anti-dependence check in PhaseCFG::implicit_null_check() [v2]
In-Reply-To: <9tV7TXcuMDkKb0tIBZnIzzwiAp3knfEnad8t8d5p0Q8=.e7dfb533-e5c1-42ac-b8a6-61ac7b59e1db@github.com>
References: <9tV7TXcuMDkKb0tIBZnIzzwiAp3knfEnad8t8d5p0Q8=.e7dfb533-e5c1-42ac-b8a6-61ac7b59e1db@github.com>
Message-ID:

On Thu, 19 May 2022 07:59:05 GMT, Roberto Castañeda Lozano wrote:

>> Looks good to me and tests passed. @robcasloz should also have a look.
>
>> Looks good to me and tests passed. @robcasloz should also have a look.
>
> Running some additional tests, will come back with the results.

Thank you @robcasloz for the suggestions, hopefully I've incorporated them as you expected. Please let me know if I should make further changes.

-------------

PR: https://git.openjdk.java.net/jdk/pull/8684

From duke at openjdk.java.net Thu May 19 22:54:27 2022
From: duke at openjdk.java.net (Brian J. Stafford)
Date: Thu, 19 May 2022 22:54:27 GMT
Subject: RFR: 8263075: C2: simplify anti-dependence check in PhaseCFG::implicit_null_check() [v2]
In-Reply-To: <3s-MkuRpH7XNy0XyBp0lVdz1LWBs2VzAtASR4YTT9yo=.85632fe2-dcc9-4bae-b4e3-dd011014ab08@github.com>
References: <3s-MkuRpH7XNy0XyBp0lVdz1LWBs2VzAtASR4YTT9yo=.85632fe2-dcc9-4bae-b4e3-dd011014ab08@github.com>
Message-ID:

On Thu, 19 May 2022 11:54:59 GMT, Roberto Castañeda Lozano wrote:

>> Brian J. Stafford has updated the pull request incrementally with two additional commits since the last revision:
>>
>>  - Removed whitespace
>>  - Added braces for if statements
>
> src/hotspot/share/opto/lcm.cpp line 330:
>
>> 328: // Give up hoisting if we have to move the store past any load.
>> 329: if (was_store) {
>> 330: // Start searching here for a local load
>
> This comment is obsolete and can be removed, as it implicitly refers to a loop that no longer exists.

Removed.

> src/hotspot/share/opto/lcm.cpp line 333:
>
>> 331: // mach use (faulting) trying to hoist
>> 332: // n might be blocker to hoisting
>> 333: // This assert ensures that the following code should be run
>
> This comment and the similar one for the second assertion are also obsolete for the same reason. Instead, you could add a comment explaining why we expect `get_block_for_node(mb->pred(1)) == block`. I suggest something along these lines: _`mach` is a store, hence `block` is the immediate dominator of `mb`. Due to the null-check shape of `block` (where its successors cannot re-join), `block` must be the direct predecessor of `mb`._

Thank you, I've updated the comments as you've suggested.
> src/hotspot/share/opto/lcm.cpp line 352: > >> 350: >> 351: // This assert ensures that the above code should only be run once >> 352: assert(get_block_for_node(mb->pred(1)) == block, "Unexpected predecessor block"); > > The single-predecessor test and the assertion `get_block_for_node(mb->pred(1)) == block` can be moved above the for-loop to make the assertion `mb != block` redundant. Moved. I added braces around the `continue` to match the expected coding convention. ------------- PR: https://git.openjdk.java.net/jdk/pull/8684 From duke at openjdk.java.net Thu May 19 23:22:41 2022 From: duke at openjdk.java.net (aamarsh) Date: Thu, 19 May 2022 23:22:41 GMT Subject: RFR: 8282024: add EscapeAnalysis statistics under PrintOptoStatistics [v9] In-Reply-To: References: Message-ID: <32hB6Trkk5qBfCSioCnj38-ue_baOYffKkgyl62_aD8=.0e740f8d-37aa-4da1-88d9-2ef2d1dd7b35@github.com> > Escape Analysis and Scalar Replacement statistics were added when the -XX:+PrintOptoStatistics flag is set. All code is placed in `#ifndef Product` block, so this code is only run when creating a debug build. Using renaissance benchmark I ran a few tests to confirm that numbers were printing correctly. 
Below is an example run:
>
>
> No escape = 317, Arg escape = 118, Global escape = 1995 (EA executed in 11.71 seconds) ** EA stats might be slightly off since objects might be double counted due to iterative EA **
> Objects scalar replaced = 201, Monitor objects removed = 23, GC barriers removed = 50, Memory barriers removed = 224

aamarsh has updated the pull request incrementally with one additional commit since the last revision:

  removed iterative fix and refactored

-------------

Changes:
  - all: https://git.openjdk.java.net/jdk/pull/8019/files
  - new: https://git.openjdk.java.net/jdk/pull/8019/files/a2811a8f..0805514a

Webrevs:
 - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=8019&range=08
 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=8019&range=07-08

  Stats: 66 lines in 5 files changed: 13 ins; 28 del; 25 mod
  Patch: https://git.openjdk.java.net/jdk/pull/8019.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk pull/8019/head:pull/8019

PR: https://git.openjdk.java.net/jdk/pull/8019

From sviswanathan at openjdk.java.net Thu May 19 23:34:59 2022
From: sviswanathan at openjdk.java.net (Sandhya Viswanathan)
Date: Thu, 19 May 2022 23:34:59 GMT
Subject: RFR: 8286972: Support the new loop induction variable related PopulateIndex IR node on x86 [v3]
In-Reply-To:
References:
Message-ID: <4FHuZbNaSk6UWguhn3NIoGOhFrkMhfbxkW6kzX8-p34=.b7073ff8-b27c-443c-8b2d-ad2d9db49666@github.com>

> This PR adds x86 backend support for the new loop induction variable related PopulateIndex IR node.
> This IR node was added as part of [JDK-8280510](https://bugs.openjdk.java.net/browse/JDK-8280510).
>
> The performance numbers are as follows:
> Before:
> Benchmark                   (count)   Mode  Cnt       Score       Error  Units
> IndexVector.exprWithIndex1    65536  thrpt    3   64556.552 ±  1126.396  ops/s
> IndexVector.exprWithIndex2    65536  thrpt    3   22117.050 ± 11452.098  ops/s
> IndexVector.indexArrayFill    65536  thrpt    3  117776.383 ±  1120.957  ops/s
>
> After:
> Benchmark                   (count)   Mode  Cnt       Score       Error  Units
> IndexVector.exprWithIndex1    65536  thrpt    3  203180.290 ±  2147.807  ops/s
> IndexVector.exprWithIndex2    65536  thrpt    3  274132.756 ±  6853.393  ops/s
> IndexVector.indexArrayFill    65536  thrpt    3  374165.202 ± 46930.779  ops/s
>
> Please review.
>
> Best Regards,
> Sandhya

Sandhya Viswanathan has updated the pull request incrementally with two additional commits since the last revision:

 - remove warmup
 - Add jtreg test

-------------

Changes:
  - all: https://git.openjdk.java.net/jdk/pull/8778/files
  - new: https://git.openjdk.java.net/jdk/pull/8778/files/ab07fae9..727daec0

Webrevs:
 - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=8778&range=02
 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=8778&range=01-02

  Stats: 114 lines in 1 file changed: 114 ins; 0 del; 0 mod
  Patch: https://git.openjdk.java.net/jdk/pull/8778.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk pull/8778/head:pull/8778

PR: https://git.openjdk.java.net/jdk/pull/8778

From sviswanathan at openjdk.java.net Thu May 19 23:35:02 2022
From: sviswanathan at openjdk.java.net (Sandhya Viswanathan)
Date: Thu, 19 May 2022 23:35:02 GMT
Subject: RFR: 8286972: Support the new loop induction variable related PopulateIndex IR node on x86 [v2]
In-Reply-To:
References: <0gCuEnJHcbS5WwOKcRAt6rZy8bhoX2Rsxpd06GW2-p8=.c87668b8-73a1-42a0-b6d5-9e52ca092cc0@github.com>
Message-ID:

On Thu, 19 May 2022 02:59:10 GMT, Vladimir Kozlov wrote:

>> Sandhya Viswanathan has updated the pull request incrementally with one additional commit since the last revision:
>>
>>   review comment resolution
>
> @sviswa7 Can you add IR framework test to verify generation of PopulateIndex node? And regression test.
> I see that [8280510](https://bugs.openjdk.java.net/browse/JDK-8280510) added only microbenchmark.

@vnkozlov I have added the IR framework jtreg test. Please review.
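
For readers following the thread, here is a minimal sketch of the loop shapes the benchmark names above suggest. It assumes that `indexArrayFill` stores the loop induction variable into an array and that the `exprWithIndex` cases use the induction variable inside an arithmetic expression; the class and method bodies are hypothetical illustrations in plain Java, not the benchmark or the jtreg test from the PR.

```java
import java.util.Arrays;

public class IndexFillSketch {
    // Assumed shape of an "index fill": each element receives its own index,
    // i.e. the loop induction variable is the value being stored.
    static int[] indexArrayFill(int count) {
        int[] a = new int[count];
        for (int i = 0; i < count; i++) {
            a[i] = i;
        }
        return a;
    }

    // Assumed shape of an "expression with index": the induction variable
    // is combined with array data rather than stored directly.
    static int[] exprWithIndex(int[] in) {
        int[] out = new int[in.length];
        for (int i = 0; i < in.length; i++) {
            out[i] = in[i] * i;
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(indexArrayFill(4)));
        System.out.println(Arrays.toString(exprWithIndex(new int[] {5, 5, 5, 5})));
    }
}
```

Whether C2 actually vectorizes such loops depends on the platform and flags; the sketch only shows the source-level pattern, not the generated IR.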
------------- PR: https://git.openjdk.java.net/jdk/pull/8778 From duke at openjdk.java.net Thu May 19 23:37:40 2022 From: duke at openjdk.java.net (aamarsh) Date: Thu, 19 May 2022 23:37:40 GMT Subject: RFR: 8282024: add EscapeAnalysis statistics under PrintOptoStatistics [v10] In-Reply-To: References: Message-ID: <23Qt5RAFxI-R0ZMFRlB7doi2M7RN-Z4hvCdrVl3iWX4=.5fc307d1-daf1-4fb3-b8cb-0f01e3e12d15@github.com> > Escape Analysis and Scalar Replacement statistics were added when the -XX:+PrintOptoStatistics flag is set. All code is placed in `#ifndef Product` block, so this code is only run when creating a debug build. Using renaissance benchmark I ran a few tests to confirm that numbers were printing correctly. Below is an example run: > > > No escape = 317, Arg escape = 118, Global escape = 1995 (EA executed in 11.71 seconds) ** EA stats might be slightly off since objects might be double counted due to iterative EA ** > Objects scalar replaced = 201, Monitor objects removed = 23, GC barriers removed = 50, Memory barriers removed = 224 aamarsh has updated the pull request incrementally with one additional commit since the last revision: fixed up whitespaces and indents ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/8019/files - new: https://git.openjdk.java.net/jdk/pull/8019/files/0805514a..ab33d8a9 Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=8019&range=09 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=8019&range=08-09 Stats: 7 lines in 3 files changed: 0 ins; 5 del; 2 mod Patch: https://git.openjdk.java.net/jdk/pull/8019.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8019/head:pull/8019 PR: https://git.openjdk.java.net/jdk/pull/8019 From duke at openjdk.java.net Thu May 19 23:42:56 2022 From: duke at openjdk.java.net (Cesar Soares) Date: Thu, 19 May 2022 23:42:56 GMT Subject: RFR: 8282024: add EscapeAnalysis statistics under PrintOptoStatistics [v9] In-Reply-To: 
<32hB6Trkk5qBfCSioCnj38-ue_baOYffKkgyl62_aD8=.0e740f8d-37aa-4da1-88d9-2ef2d1dd7b35@github.com> References: <32hB6Trkk5qBfCSioCnj38-ue_baOYffKkgyl62_aD8=.0e740f8d-37aa-4da1-88d9-2ef2d1dd7b35@github.com> Message-ID: On Thu, 19 May 2022 23:22:41 GMT, aamarsh wrote: >> Escape Analysis and Scalar Replacement statistics were added when the -XX:+PrintOptoStatistics flag is set. All code is placed in `#ifndef Product` block, so this code is only run when creating a debug build. Using renaissance benchmark I ran a few tests to confirm that numbers were printing correctly. Below is an example run: >> >> >> No escape = 317, Arg escape = 118, Global escape = 1995 (EA executed in 11.71 seconds) ** EA stats might be slightly off since objects might be double counted due to iterative EA ** >> Objects scalar replaced = 201, Monitor objects removed = 23, GC barriers removed = 50, Memory barriers removed = 224 > > aamarsh has updated the pull request incrementally with one additional commit since the last revision: > > removed iterative fix and refactored Some NITs. src/hotspot/share/opto/compile.cpp line 2179: > 2177: } > 2178: bool progress; > 2179: NIT: empty line. src/hotspot/share/opto/compile.cpp line 2210: > 2208: // by removing some allocations and/or locks. > 2209: } while (progress); > 2210: NIT: empty line. ------------- PR: https://git.openjdk.java.net/jdk/pull/8019 From duke at openjdk.java.net Thu May 19 23:43:03 2022 From: duke at openjdk.java.net (Cesar Soares) Date: Thu, 19 May 2022 23:43:03 GMT Subject: RFR: 8282024: add EscapeAnalysis statistics under PrintOptoStatistics [v8] In-Reply-To: References: Message-ID: On Tue, 10 May 2022 00:21:45 GMT, Vladimir Kozlov wrote: >> aamarsh has refreshed the contents of this pull request, and previous commits have been removed. The incremental views will show differences compared to the previous content of the PR. 
The pull request contains one new commit since the last revision: >> >> adding escape analysis and scalar replacement statistics > > src/hotspot/share/opto/escape.cpp line 248: > >> 246: #ifndef PRODUCT >> 247: escape_state_statistics(java_objects_worklist); >> 248: #endif > > You can use `NOT_PRODUCT()` macro for one line which you have a lot in these changes. NIT: Use NOT_PRODUCT ------------- PR: https://git.openjdk.java.net/jdk/pull/8019 From psandoz at openjdk.java.net Thu May 19 23:51:27 2022 From: psandoz at openjdk.java.net (Paul Sandoz) Date: Thu, 19 May 2022 23:51:27 GMT Subject: RFR: 8284960: Integration of JEP 426: Vector API (Fourth Incubator) [v7] In-Reply-To: References: Message-ID: On Thu, 19 May 2022 21:11:41 GMT, Jatin Bhateja wrote: >> Hi All, >> >> Patch adds the planned support for new vector operations and APIs targeted for [JEP 426: Vector API (Fourth Incubator).](https://bugs.openjdk.java.net/browse/JDK-8280173) >> >> Following is the brief summary of changes:- >> >> 1) Extends the scope of existing lanewise API for following new vector operations. >> - VectorOperations.BIT_COUNT: counts the number of one-bits >> - VectorOperations.LEADING_ZEROS_COUNT: counts the number of leading zero bits >> - VectorOperations.TRAILING_ZEROS_COUNT: counts the number of trailing zero bits >> - VectorOperations.REVERSE: reversing the order of bits >> - VectorOperations.REVERSE_BYTES: reversing the order of bytes >> - compress and expand bits: Semantics are based on Hacker's Delight section 7-4 Compress, or Generalized Extract. >> >> 2) Adds following new APIs to perform cross lane vector compress and expansion operations under the influence of a mask. >> - Vector.compress >> - Vector.expand >> - VectorMask.compress >> >> 3) Adds predicated and non-predicated versions of following new APIs to load and store the contents of vector from foreign MemorySegments. 
>> - Vector.fromMemorySegment >> - Vector.intoMemorySegment >> >> 4) C2 Compiler IR enhancements and optimized X86 and AARCH64 backend support for each newly added operation. >> >> >> Patch has been regressed over AARCH64 and X86 targets different AVX levels. >> >> Kindly review and share your feedback. >> >> Best Regards, >> Jatin > > Jatin Bhateja has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 16 commits: > > - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8284960 > - 8284960: Changes to enable jdk.incubator.vector to be treated as preview participant. Code re-organization related to Reverse/ReverseByte IR transforms. > - 8284960: Adding --enable-preview in vectorAPI benchmarks. > - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8284960 > - 8284960: Review comments resolution. > - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8284960 > - 8284960: Correcting a typo. > - 8284960: Integrating changes from panama-vector (Add @since 19 tags). > - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8284960 > - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8284960 > - ... and 6 more: https://git.openjdk.java.net/jdk/compare/9f562ef7...311f3233 src/jdk.compiler/share/classes/com/sun/tools/javac/code/Preview.java line 132: > 130: * @return true if {@code s} is participating in the preview of {@code previewSymbol} > 131: */ > 132: public boolean isPreviewParticipating(Symbol s, Symbol previewSymbol) { Some feedback from a colleague: Suggestion: /** * Returns true if {@code s} is deemed to participate in the preview of {@code previewSymbol}, and * therefore no warnings or errors will be produced. 
* * @param s the symbol depending on the preview symbol * @param previewSymbol the preview symbol marked with @Preview * @return true if {@code s} is participating in the preview of {@code previewSymbol} */ public boolean participatesInPreview(Symbol s, Symbol previewSymbol) { ------------- PR: https://git.openjdk.java.net/jdk/pull/8425 From duke at openjdk.java.net Thu May 19 23:52:41 2022 From: duke at openjdk.java.net (aamarsh) Date: Thu, 19 May 2022 23:52:41 GMT Subject: RFR: 8282024: add EscapeAnalysis statistics under PrintOptoStatistics [v11] In-Reply-To: References: Message-ID: > Escape Analysis and Scalar Replacement statistics were added when the -XX:+PrintOptoStatistics flag is set. All code is placed in `#ifndef Product` block, so this code is only run when creating a debug build. Using renaissance benchmark I ran a few tests to confirm that numbers were printing correctly. Below is an example run: > > > No escape = 317, Arg escape = 118, Global escape = 1995 (EA executed in 11.71 seconds) ** EA stats might be slightly off since objects might be double counted due to iterative EA ** > Objects scalar replaced = 201, Monitor objects removed = 23, GC barriers removed = 50, Memory barriers removed = 224 aamarsh has updated the pull request incrementally with one additional commit since the last revision: fixed trailing whitespace jcheck error ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/8019/files - new: https://git.openjdk.java.net/jdk/pull/8019/files/ab33d8a9..3e3aaf57 Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=8019&range=10 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=8019&range=09-10 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.java.net/jdk/pull/8019.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8019/head:pull/8019 PR: https://git.openjdk.java.net/jdk/pull/8019 From duke at openjdk.java.net Thu May 19 23:52:43 2022 From: duke at openjdk.java.net (Cesar 
Soares)
Date: Thu, 19 May 2022 23:52:43 GMT
Subject: RFR: 8282024: add EscapeAnalysis statistics under PrintOptoStatistics [v8]
In-Reply-To:
References:
Message-ID:

On Thu, 5 May 2022 19:06:40 GMT, Xin Liu wrote:

>> aamarsh has refreshed the contents of this pull request, and previous commits have been removed. The incremental views will show differences compared to the previous content of the PR. The pull request contains one new commit since the last revision:
>>
>>   adding escape analysis and scalar replacement statistics
>
> src/hotspot/share/opto/escape.cpp line 3794:
>
>> 3792: _compile->_local_arg_escape_ctr++;
>> 3793: }
>> 3794: else if (ptn->escape_state() == PointsToNode::GlobalEscape) {
>
> "else if" style is not consistent with others.

NIT: Move "else if" to line 3793.

-------------

PR: https://git.openjdk.java.net/jdk/pull/8019

From kvn at openjdk.java.net Fri May 20 00:24:43 2022
From: kvn at openjdk.java.net (Vladimir Kozlov)
Date: Fri, 20 May 2022 00:24:43 GMT
Subject: RFR: 8286972: Support the new loop induction variable related PopulateIndex IR node on x86 [v3]
In-Reply-To: <4FHuZbNaSk6UWguhn3NIoGOhFrkMhfbxkW6kzX8-p34=.b7073ff8-b27c-443c-8b2d-ad2d9db49666@github.com>
References: <4FHuZbNaSk6UWguhn3NIoGOhFrkMhfbxkW6kzX8-p34=.b7073ff8-b27c-443c-8b2d-ad2d9db49666@github.com>
Message-ID:

On Thu, 19 May 2022 23:34:59 GMT, Sandhya Viswanathan wrote:

>> This PR adds x86 backend support for the new loop induction variable related PopulateIndex IR node.
>> This IR node was added as part of [JDK-8280510](https://bugs.openjdk.java.net/browse/JDK-8280510).
>>
>> The performance numbers are as follows:
>> Before:
>> Benchmark                   (count)   Mode  Cnt       Score       Error  Units
>> IndexVector.exprWithIndex1    65536  thrpt    3   64556.552 ±  1126.396  ops/s
>> IndexVector.exprWithIndex2    65536  thrpt    3   22117.050 ± 11452.098  ops/s
>> IndexVector.indexArrayFill    65536  thrpt    3  117776.383 ±  1120.957  ops/s
>>
>> After:
>> Benchmark                   (count)   Mode  Cnt       Score       Error  Units
>> IndexVector.exprWithIndex1    65536  thrpt    3  203180.290 ±  2147.807  ops/s
>> IndexVector.exprWithIndex2    65536  thrpt    3  274132.756 ±  6853.393  ops/s
>> IndexVector.indexArrayFill    65536  thrpt    3  374165.202 ± 46930.779  ops/s
>>
>> Please review.
>>
>> Best Regards,
>> Sandhya
>
> Sandhya Viswanathan has updated the pull request incrementally with two additional commits since the last revision:
>
>  - remove warmup
>  - Add jtreg test

Good. You need second review

-------------

Marked as reviewed by kvn (Reviewer).

PR: https://git.openjdk.java.net/jdk/pull/8778

From sviswanathan at openjdk.java.net Fri May 20 00:49:53 2022
From: sviswanathan at openjdk.java.net (Sandhya Viswanathan)
Date: Fri, 20 May 2022 00:49:53 GMT
Subject: RFR: 8286972: Support the new loop induction variable related PopulateIndex IR node on x86 [v2]
In-Reply-To:
References: <0gCuEnJHcbS5WwOKcRAt6rZy8bhoX2Rsxpd06GW2-p8=.c87668b8-73a1-42a0-b6d5-9e52ca092cc0@github.com>
Message-ID:

On Thu, 19 May 2022 02:59:10 GMT, Vladimir Kozlov wrote:

>> Sandhya Viswanathan has updated the pull request incrementally with one additional commit since the last revision:
>>
>>   review comment resolution
>
> @sviswa7 Can you add IR framework test to verify generation of PopulateIndex node? And regression test.
> I see that [8280510](https://bugs.openjdk.java.net/browse/JDK-8280510) added only microbenchmark.

@vnkozlov Thanks a lot for the review.

-------------

PR: https://git.openjdk.java.net/jdk/pull/8778

From dlong at openjdk.java.net Fri May 20 01:10:05 2022
From: dlong at openjdk.java.net (Dean Long)
Date: Fri, 20 May 2022 01:10:05 GMT
Subject: RFR: 8287052: comparing double to max_intx gives unexpected results
Message-ID:

It isn't safe to compare a double to 2^63-1 because the latter will be changed to 2^63 when converting to a double.
This fixes a compiler warning with clang-12 and prevents the code from returning a negative value in some cases (observed on x64 with g++). ------------- Commit messages: - better overflow check Changes: https://git.openjdk.java.net/jdk/pull/8798/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=8798&range=00 Issue: https://bugs.openjdk.java.net/browse/JDK-8287052 Stats: 9 lines in 1 file changed: 6 ins; 2 del; 1 mod Patch: https://git.openjdk.java.net/jdk/pull/8798.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8798/head:pull/8798 PR: https://git.openjdk.java.net/jdk/pull/8798 From pli at openjdk.java.net Fri May 20 02:00:46 2022 From: pli at openjdk.java.net (Pengfei Li) Date: Fri, 20 May 2022 02:00:46 GMT Subject: RFR: 8286972: Support the new loop induction variable related PopulateIndex IR node on x86 [v3] In-Reply-To: <4FHuZbNaSk6UWguhn3NIoGOhFrkMhfbxkW6kzX8-p34=.b7073ff8-b27c-443c-8b2d-ad2d9db49666@github.com> References: <4FHuZbNaSk6UWguhn3NIoGOhFrkMhfbxkW6kzX8-p34=.b7073ff8-b27c-443c-8b2d-ad2d9db49666@github.com> Message-ID: On Thu, 19 May 2022 23:34:59 GMT, Sandhya Viswanathan wrote: >> This PR adds x86 backend support for the new loop induction variable related PopulateIndex IR node. >> This IR node was added as part of [JDK-8280510](https://bugs.openjdk.java.net/browse/JDK-8280510). >> >> The performance numbers are as follows: >> Before: >> Benchmark (count) Mode Cnt Score Error Units >> IndexVector.exprWithIndex1 65536 thrpt 3 64556.552 ± 1126.396 ops/s >> IndexVector.exprWithIndex2 65536 thrpt 3 22117.050 ± 11452.098 ops/s >> IndexVector.indexArrayFill 65536 thrpt 3 117776.383 ± 1120.957 ops/s >> >> After: >> Benchmark (count) Mode Cnt Score Error Units >> IndexVector.exprWithIndex1 65536 thrpt 3 203180.290 ± 2147.807 ops/s >> IndexVector.exprWithIndex2 65536 thrpt 3 274132.756 ± 6853.393 ops/s >> IndexVector.indexArrayFill 65536 thrpt 3 374165.202 ± 46930.779 ops/s >> >> Please review. 
>> >> Best Regards, >> Sandhya > > Sandhya Viswanathan has updated the pull request incrementally with two additional commits since the last revision: > > - remove warmup > - Add jtreg test test/hotspot/jtreg/compiler/vectorization/TestPopulateIndex.java line 29: > 27: * @requires vm.compiler2.enabled > 28: * @requires vm.cpu.features ~= ".*avx2.*" > 29: * @requires os.arch=="amd64" | os.arch=="x86_64" Can we simplify `os.arch=="amd64" | os.arch=="x86_64"` to `os.simpleArch == "x64"`? This test runs on x86 only. It would be nice if it could run on AArch64 as well. So perhaps something like 28 * @requires (os.simpleArch == "x64" & vm.cpu.features ~= ".*avx2.*") | 29 * (os.simpleArch == "aarch64" & vm.cpu.features ~= ".*sve.*") ------------- PR: https://git.openjdk.java.net/jdk/pull/8778 From kvn at openjdk.java.net Fri May 20 02:52:45 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Fri, 20 May 2022 02:52:45 GMT Subject: RFR: 8287052: comparing double to max_intx gives unexpected results In-Reply-To: References: Message-ID: On Fri, 20 May 2022 01:03:02 GMT, Dean Long wrote: > It isn't safe to compare a double to 2^63-1 because the latter will be changed to 2^63 when converting to a double. > This fixes a compiler warning with clang-12 and prevents the code from returning a negative value in some cases (observed on x64 with g++). I think it is **insane** to allow DBL_MAX as the upper limit for CompileThresholdScaling: range(0.0, DBL_MAX) It should be `(double)max_intx`, since a compilation threshold (which is an integer) can't be larger than that. If the range is limited to that, your checks for NAN and INF double values will not be needed, because the largest value you can get is max_intx^2. You can convert them to asserts to catch the case where `scale` is something other than `CompileThresholdScaling`. But I don't see such a path in the code. ------------- Changes requested by kvn (Reviewer). 
PR: https://git.openjdk.java.net/jdk/pull/8798 From duke at openjdk.java.net Fri May 20 03:07:49 2022 From: duke at openjdk.java.net (Haomin) Date: Fri, 20 May 2022 03:07:49 GMT Subject: RFR: 8286847: Rotate vectors don't support byte or short [v3] In-Reply-To: <9zUsdzzbL3abbWSfrBcU3pqp8sPIJhP2B08nKAmMBZE=.381b9c9e-5111-45c5-aa2a-fd3ce502d6b6@github.com> References: <9zUsdzzbL3abbWSfrBcU3pqp8sPIJhP2B08nKAmMBZE=.381b9c9e-5111-45c5-aa2a-fd3ce502d6b6@github.com> Message-ID: On Thu, 19 May 2022 01:48:37 GMT, Xiaohong Gong wrote: > Do you mean the benchmarks? Currently the vectorapi benchmarks only exist on panama-vector, please see: https://github.com/openjdk/panama-vector/tree/vectorIntrinsics/test/micro/org/openjdk/bench/jdk/incubator/vector/operation I have tried to make images on branch vectorIntrinsics of panama-vector, but get an error. warning: unknown enum constant Feature.FOREIGN warning: unknown enum constant Feature.FOREIGN warning: unknown enum constant Feature.FOREIGN error: warnings found and -Werror specified 1 error 7 warnings make[3]: *** [/home/wx/wangxue/panama-vector/build/linux-x86_64-server-release/buildtools/interim_langtools_modules/java.compiler.interim/_the.BUILD_java.compiler.interim_batch] Error 1 make[2]: *** [interim-langtools] Error 2 Is this error related to bootjdk? ------------- PR: https://git.openjdk.java.net/jdk/pull/8740 From xgong at openjdk.java.net Fri May 20 03:20:56 2022 From: xgong at openjdk.java.net (Xiaohong Gong) Date: Fri, 20 May 2022 03:20:56 GMT Subject: RFR: 8286847: Rotate vectors don't support byte or short [v3] In-Reply-To: References: <9zUsdzzbL3abbWSfrBcU3pqp8sPIJhP2B08nKAmMBZE=.381b9c9e-5111-45c5-aa2a-fd3ce502d6b6@github.com> Message-ID: On Fri, 20 May 2022 03:01:03 GMT, Haomin wrote: >> Do you mean the benchmarks? 
Currently the vectorapi benchmarks only exist on panama-vector, please see: https://github.com/openjdk/panama-vector/tree/vectorIntrinsics/test/micro/org/openjdk/bench/jdk/incubator/vector/operation > > I have tried to make images on branch vectorIntrinsics of panama-vector, but get an error. > > > warning: unknown enum constant Feature.FOREIGN > warning: unknown enum constant Feature.FOREIGN > warning: unknown enum constant Feature.FOREIGN > error: warnings found and -Werror specified > 1 error > 7 warnings > make[3]: *** [/home/wx/wangxue/panama-vector/build/linux-x86_64-server-release/buildtools/interim_langtools_modules/java.compiler.interim/_the.BUILD_java.compiler.interim_batch] Error 1 > make[2]: *** [interim-langtools] Error 2 > > > Is this error related to bootjdk? I'm sorry, but I just built the latest vectorIntrinsics branch and didn't hit the error. Did you just cherry-pick your patch on it? Thanks! ------------- PR: https://git.openjdk.java.net/jdk/pull/8740 From duke at openjdk.java.net Fri May 20 03:30:49 2022 From: duke at openjdk.java.net (Haomin) Date: Fri, 20 May 2022 03:30:49 GMT Subject: RFR: 8286847: Rotate vectors don't support byte or short [v3] In-Reply-To: References: <9zUsdzzbL3abbWSfrBcU3pqp8sPIJhP2B08nKAmMBZE=.381b9c9e-5111-45c5-aa2a-fd3ce502d6b6@github.com> Message-ID: On Fri, 20 May 2022 03:17:51 GMT, Xiaohong Gong wrote: >>> Do you mean the benchmarks? Currently the vectorapi benchmarks only exist on panama-vector, please see: https://github.com/openjdk/panama-vector/tree/vectorIntrinsics/test/micro/org/openjdk/bench/jdk/incubator/vector/operation >> >> I have tried to make images on branch vectorIntrinsics of panama-vector, but get an error. 
>> >> >> warning: unknown enum constant Feature.FOREIGN >> warning: unknown enum constant Feature.FOREIGN >> warning: unknown enum constant Feature.FOREIGN >> error: warnings found and -Werror specified >> 1 error >> 7 warnings >> make[3]: *** [/home/wx/wangxue/panama-vector/build/linux-x86_64-server-release/buildtools/interim_langtools_modules/java.compiler.interim/_the.BUILD_java.compiler.interim_batch] Error 1 >> make[2]: *** [interim-langtools] Error 2 >> >> >> Is this error related to bootjdk? > > I?m sorry that I just build the latest vectorIntrinsics branch and didn't met the error. Did you just cherry-pick your patch on it? Thanks! yes, just cherry-pick my patch. And I just build the latest vectorIntrinsics branch, also met the error. ------------- PR: https://git.openjdk.java.net/jdk/pull/8740 From sviswanathan at openjdk.java.net Fri May 20 04:49:50 2022 From: sviswanathan at openjdk.java.net (Sandhya Viswanathan) Date: Fri, 20 May 2022 04:49:50 GMT Subject: RFR: 8286972: Support the new loop induction variable related PopulateIndex IR node on x86 [v4] In-Reply-To: References: Message-ID: <8NyttNa6QYBrNwhpx1lE5F7N3c-MeLcULa7Xh65Brj8=.e208db27-95c3-40e7-a941-74ba3a827d56@github.com> > This PR adds x86 backend support for the new loop induction variable related PopulateIndex IR node. > This IR node was added as part of [JDK-8280510](https://bugs.openjdk.java.net/browse/JDK-8280510). > > The performance numbers are as follows: > Before: > Benchmark (count) Mode Cnt Score Error Units > IndexVector.exprWithIndex1 65536 thrpt 3 64556.552 ? 1126.396 ops/s > IndexVector.exprWithIndex2 65536 thrpt 3 22117.050 ? 11452.098 ops/s > IndexVector.indexArrayFill 65536 thrpt 3 117776.383 ? 1120.957 ops/s > > After: > Benchmark (count) Mode Cnt Score Error Units > IndexVector.exprWithIndex1 65536 thrpt 3 203180.290 ? 2147.807 ops/s > IndexVector.exprWithIndex2 65536 thrpt 3 274132.756 ? 6853.393 ops/s > IndexVector.indexArrayFill 65536 thrpt 3 374165.202 ? 
46930.779 ops/s > > Please review. > > Best Regards, > Sandhya Sandhya Viswanathan has updated the pull request incrementally with one additional commit since the last revision: Change requires to add sve ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/8778/files - new: https://git.openjdk.java.net/jdk/pull/8778/files/727daec0..8c69c7fc Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=8778&range=03 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=8778&range=02-03 Stats: 2 lines in 1 file changed: 0 ins; 0 del; 2 mod Patch: https://git.openjdk.java.net/jdk/pull/8778.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8778/head:pull/8778 PR: https://git.openjdk.java.net/jdk/pull/8778 From sviswanathan at openjdk.java.net Fri May 20 04:50:01 2022 From: sviswanathan at openjdk.java.net (Sandhya Viswanathan) Date: Fri, 20 May 2022 04:50:01 GMT Subject: RFR: 8286972: Support the new loop induction variable related PopulateIndex IR node on x86 [v3] In-Reply-To: References: <4FHuZbNaSk6UWguhn3NIoGOhFrkMhfbxkW6kzX8-p34=.b7073ff8-b27c-443c-8b2d-ad2d9db49666@github.com> Message-ID: On Fri, 20 May 2022 01:57:04 GMT, Pengfei Li wrote: >> Sandhya Viswanathan has updated the pull request incrementally with two additional commits since the last revision: >> >> - remove warmup >> - Add jtreg test > > test/hotspot/jtreg/compiler/vectorization/TestPopulateIndex.java line 29: > >> 27: * @requires vm.compiler2.enabled >> 28: * @requires vm.cpu.features ~= ".*avx2.*" >> 29: * @requires os.arch=="amd64" | os.arch=="x86_64" > > Can we simplify `os.arch=="amd64" | os.arch=="x86_64"` to `os.simpleArch == "x64"` ? > > This test runs on x86 only. It would be nice if it can run on AArch64 as well. So perhaps something like > > 28 * @requires (os.simpleArch == "x64" & vm.cpu.features ~= ".*avx2.*") | > 29 * (os.simpleArch == "aarch64" & vm.cpu.features ~= ".*sve.*") @pfustc Yes, changed the requires to as suggested by you. 
------------- PR: https://git.openjdk.java.net/jdk/pull/8778 From jbhateja at openjdk.java.net Fri May 20 04:49:58 2022 From: jbhateja at openjdk.java.net (Jatin Bhateja) Date: Fri, 20 May 2022 04:49:58 GMT Subject: RFR: 8286972: Support the new loop induction variable related PopulateIndex IR node on x86 [v3] In-Reply-To: <4FHuZbNaSk6UWguhn3NIoGOhFrkMhfbxkW6kzX8-p34=.b7073ff8-b27c-443c-8b2d-ad2d9db49666@github.com> References: <4FHuZbNaSk6UWguhn3NIoGOhFrkMhfbxkW6kzX8-p34=.b7073ff8-b27c-443c-8b2d-ad2d9db49666@github.com> Message-ID: On Thu, 19 May 2022 23:34:59 GMT, Sandhya Viswanathan wrote: >> This PR adds x86 backend support for the new loop induction variable related PopulateIndex IR node. >> This IR node was added as part of [JDK-8280510](https://bugs.openjdk.java.net/browse/JDK-8280510). >> >> The performance numbers are as follows: >> Before: >> Benchmark (count) Mode Cnt Score Error Units >> IndexVector.exprWithIndex1 65536 thrpt 3 64556.552 ? 1126.396 ops/s >> IndexVector.exprWithIndex2 65536 thrpt 3 22117.050 ? 11452.098 ops/s >> IndexVector.indexArrayFill 65536 thrpt 3 117776.383 ? 1120.957 ops/s >> >> After: >> Benchmark (count) Mode Cnt Score Error Units >> IndexVector.exprWithIndex1 65536 thrpt 3 203180.290 ? 2147.807 ops/s >> IndexVector.exprWithIndex2 65536 thrpt 3 274132.756 ? 6853.393 ops/s >> IndexVector.indexArrayFill 65536 thrpt 3 374165.202 ? 46930.779 ops/s >> >> Please review. 
>> >> Best Regards, >> Sandhya > > Sandhya Viswanathan has updated the pull request incrementally with two additional commits since the last revision: > > - remove warmup > - Add jtreg test src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 2311: > 2309: case T_SHORT: evpbroadcastw(dst, src, vlen_enc); return; > 2310: case T_FLOAT: case T_INT: evpbroadcastd(dst, src, vlen_enc); return; > 2311: case T_DOUBLE: case T_LONG: evpbroadcastq(dst, src, vlen_enc); return; Can't we use single and double precision broadcasts for floating point types, like you have done in else part It may save domain switch over penalty (Section 3.5.2.2 Bypass between Execution Domains, Intel? 64 and IA-32 Architectures Optimization Reference Manual) src/hotspot/cpu/x86/x86.ad line 8269: > 8267: format %{ "vector_populate_index $dst $src1 $src2\t! using $vtmp and $scratch as TEMP" %} > 8268: ins_encode %{ > 8269: int vlen_in_bytes = Matcher::vector_length_in_bytes(this); Matcher::vector_length can be directly used instead of following computation in line 8274 vlen_in_bytes/type2aelembytes(elem_bt) src/hotspot/cpu/x86/x86.ad line 8272: > 8270: int vlen_enc = vector_length_encoding(this); > 8271: BasicType elem_bt = Matcher::vector_element_basic_type(this); > 8272: assert($src2$$constant == 1, "required"); Ideally assertion should be the first statement in a block, since they determine the pre-conditions under which code should executed. src/hotspot/cpu/x86/x86.ad line 8288: > 8286: format %{ "vector_populate_index $dst $src1 $src2\t! using $vtmp and $scratch as TEMP" %} > 8287: ins_encode %{ > 8288: int vlen_in_bytes = Matcher::vector_length_in_bytes(this); Same as above. 
src/hotspot/cpu/x86/x86.ad line 8291: > 8289: int vlen_enc = vector_length_encoding(this); > 8290: BasicType elem_bt = Matcher::vector_element_basic_type(this); > 8291: assert($src2$$constant == 1, "required"); Same as above test/hotspot/jtreg/compiler/vectorization/TestPopulateIndex.java line 26: > 24: /** > 25: * @test > 26: * @summary Test vectorization of loop induction variable usage in the loop PR id missing. test/hotspot/jtreg/compiler/vectorization/TestPopulateIndex.java line 39: > 37: > 38: public class TestPopulateIndex { > 39: private static final int count = 65536; Small array size around 10K may work, we can also tune CompileThresholdScaling. test/hotspot/jtreg/compiler/vectorization/TestPopulateIndex.java line 71: > 69: > 70: public void checkResultIndexArrayFill() { > 71: for (int i = 0; i < count; ++i) { post-incrementation for consistency. test/hotspot/jtreg/compiler/vectorization/TestPopulateIndex.java line 107: > 105: > 106: public void checkResultExprWithIndex2() { > 107: for (int i = 0; i < count; ++i) { post-increment induction. ------------- PR: https://git.openjdk.java.net/jdk/pull/8778 From sviswanathan at openjdk.java.net Fri May 20 05:09:41 2022 From: sviswanathan at openjdk.java.net (Sandhya Viswanathan) Date: Fri, 20 May 2022 05:09:41 GMT Subject: RFR: 8286972: Support the new loop induction variable related PopulateIndex IR node on x86 [v5] In-Reply-To: References: Message-ID: > This PR adds x86 backend support for the new loop induction variable related PopulateIndex IR node. > This IR node was added as part of [JDK-8280510](https://bugs.openjdk.java.net/browse/JDK-8280510). > > The performance numbers are as follows: > Before: > Benchmark (count) Mode Cnt Score Error Units > IndexVector.exprWithIndex1 65536 thrpt 3 64556.552 ? 1126.396 ops/s > IndexVector.exprWithIndex2 65536 thrpt 3 22117.050 ? 11452.098 ops/s > IndexVector.indexArrayFill 65536 thrpt 3 117776.383 ? 
1120.957 ops/s > > After: > Benchmark (count) Mode Cnt Score Error Units > IndexVector.exprWithIndex1 65536 thrpt 3 203180.290 ? 2147.807 ops/s > IndexVector.exprWithIndex2 65536 thrpt 3 274132.756 ? 6853.393 ops/s > IndexVector.indexArrayFill 65536 thrpt 3 374165.202 ? 46930.779 ops/s > > Please review. > > Best Regards, > Sandhya Sandhya Viswanathan has updated the pull request incrementally with one additional commit since the last revision: review comment resolution ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/8778/files - new: https://git.openjdk.java.net/jdk/pull/8778/files/8c69c7fc..1b3d0b5a Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=8778&range=04 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=8778&range=03-04 Stats: 13 lines in 2 files changed: 3 ins; 2 del; 8 mod Patch: https://git.openjdk.java.net/jdk/pull/8778.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8778/head:pull/8778 PR: https://git.openjdk.java.net/jdk/pull/8778 From xxinliu at amazon.com Fri May 20 05:14:10 2022 From: xxinliu at amazon.com (Liu, Xin) Date: Thu, 19 May 2022 22:14:10 -0700 Subject: Mismatched ciMethodData in replay file. In-Reply-To: <5ab5eda1-388e-9909-bf29-67e445a52ba8@oracle.com> References: <5cd7f04c-c9ad-bbaa-6cba-0616ca9cae9d@amazon.com> <5ab5eda1-388e-9909-bf29-67e445a52ba8@oracle.com> Message-ID: hi, Vladimir, Thank you for taking a look at this. On 5/19/22 1:00 PM, Vladimir Kozlov wrote: > Narrowed to hotspot-compiler list. > > You are right that it is weird. The dump is done from ciMethodData which is local (for compiler thread) clone of MDO > which should not be updated during compilation. Unless there is a place we still go into VM for get some numbers. >oh, I see! 
The Compile constructor calls ciMethod::ensure_method_data(). It creates a snapshot of the MDO in the compiler's arena. > One explanation is that during dump we hit safepoint and we lost part of output. > As you said, the data are local. The C2 thread sweeps the profile data in two rounds. I am lost here. We write the data to a fileStream, and it looks like we don't yield for safepoint synchronization between the two rounds. How could we lose data here? > I think we need a verification mode for replay dump to catch such case (separate count to catch such mismatch). > > Thanks, > Vladimir K > Do you mean that this mismatched replay data is useless? I am still trying to figure out exactly what went wrong in C2CompilerThread. Among the 26 data fields, ciReplay manages to decode this sequence as "ciVirtualCallData" because header 0x70005 denotes bci=7 and tag = virtual_call_data_tag(5). 0x70005 0x4d55 0x0 0x7f6a5841c3c0 0xa3 0x7f6a5841c470 _data->_cells[4] = 0x7f6a5841c470 is the second profiling receiver's ciKlass. It's an unmapped address. This may lead us to the culprit of the C2 thread crash. If we ditch ill-formed ciReplay files, we may miss the clue. thanks, --lx > On 5/18/22 11:14 PM, Liu, Xin wrote: >> hi, >> >> I get a weird replay, which was generated by 17.0.3+6-LTS. I don't see >> relevant code have changed since then, so I think it is still applicable >> to the tip of HotSpot. >> >> A customer shared the replay file >> with(https://github.com/corretto/corretto-17/issues/57#issuecomment-1130042063) >> and I am trying to reproduce his failure. it is written from >> VMError::report_and_die(). >> >> One obstacle is that weird entries of ciMethodData. eg. line >> 14130, It declares that there will 2 non-null oops followed(see '2' >> after tag 'oops'. however, one only is recorded. 
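For readers following the reply above, the header decoding can be sketched as below. This is an illustration only: the class and method names are hypothetical, and the field layout (tag in the low byte, flags in the next byte, bci in the following 16 bits) is an assumption chosen to match the stated reading of 0x70005 as bci=7 and tag=5. It is not taken from the ciReplay sources.

```java
public class ReplayHeaderDecode {
    // Assumed layout of a profile-data header word:
    // bits 0-7 tag, bits 8-15 flags, bits 16-31 bci.
    static int tag(long header) {
        return (int) (header & 0xFF);
    }

    static int flags(long header) {
        return (int) ((header >>> 8) & 0xFF);
    }

    static int bci(long header) {
        return (int) ((header >>> 16) & 0xFFFF);
    }

    public static void main(String[] args) {
        long header = 0x70005L; // header word quoted in the message above
        System.out.println("tag=" + tag(header)
                + " flags=" + flags(header)
                + " bci=" + bci(header));
    }
}
```

Under that assumption the same helper can be applied to the other header-like words in the dump to see where each profile record starts.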
>> >> ciMethodData kotlin/coroutines/jvm/internal/ContinuationImpl >> (Lkotlin/coroutines/Continuation;)V 2 21538 orig 80 0 0 0 0 0 0 0 0 0 0 >> 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 >> 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 data >> 26 0x40007 0x402 0x70 0x4e9c 0x70005 0x4d55 0x0 0x7f6a5841c3c0 0xa3 >> 0x7f6a5841c470 0xa4 0xc0003 0x4e9c 0x18 0x110002 0x529e 0x0 0x0 0x0 0x0 >> 0x0 0x0 0x9 0x2 0x6 0x0 oops 2 7 >> com/example/ProductAttRouter$withRequestLoggingContext$1 methods 0 >> >> Another mismatched entry is at line 14203. it says there are 11 >> oops but only 6 are there. >> >> Those mismatched entries leave uninitialized elements of rec->_classes >> and eventually crash ciReplay::initialize(). Have you seen them before? >> I can patch up hotspot to handle this mismatch, but I wonder how that >> happens? >> >> ciMethodData::dump_replay_data() iterates >> _data 2 rounds. The 1st round counts them and second round dumps them. >> https://github.com/openjdk/jdk/blob/master/src/hotspot/share/ci/ciMethodData.cpp#L728 >> >> Is that possible that the underlying data get updated on the fly? This >> case is kotlin coroutine. I am not >> sure whether it is same threading environment as classic Java. >> >> thanks, >> --lx >> From sviswanathan at openjdk.java.net Fri May 20 05:14:45 2022 From: sviswanathan at openjdk.java.net (Sandhya Viswanathan) Date: Fri, 20 May 2022 05:14:45 GMT Subject: RFR: 8286972: Support the new loop induction variable related PopulateIndex IR node on x86 [v5] In-Reply-To: References: Message-ID: <3bxMEXGnwmhPILxTS0Va4AS0mYA5mCKPcyXLxMce0D8=.7a5dfa4c-1ca2-45b3-adc6-f74e65eca166@github.com> On Fri, 20 May 2022 05:09:41 GMT, Sandhya Viswanathan wrote: >> This PR adds x86 backend support for the new loop induction variable related PopulateIndex IR node. >> This IR node was added as part of [JDK-8280510](https://bugs.openjdk.java.net/browse/JDK-8280510). 
>> >> The performance numbers are as follows: >> Before: >> Benchmark (count) Mode Cnt Score Error Units >> IndexVector.exprWithIndex1 65536 thrpt 3 64556.552 ? 1126.396 ops/s >> IndexVector.exprWithIndex2 65536 thrpt 3 22117.050 ? 11452.098 ops/s >> IndexVector.indexArrayFill 65536 thrpt 3 117776.383 ? 1120.957 ops/s >> >> After: >> Benchmark (count) Mode Cnt Score Error Units >> IndexVector.exprWithIndex1 65536 thrpt 3 203180.290 ? 2147.807 ops/s >> IndexVector.exprWithIndex2 65536 thrpt 3 274132.756 ? 6853.393 ops/s >> IndexVector.indexArrayFill 65536 thrpt 3 374165.202 ? 46930.779 ops/s >> >> Please review. >> >> Best Regards, >> Sandhya > > Sandhya Viswanathan has updated the pull request incrementally with one additional commit since the last revision: > > review comment resolution @jatin-bhateja Thanks a lot for the review. Your review comments are implemented. ------------- PR: https://git.openjdk.java.net/jdk/pull/8778 From sviswanathan at openjdk.java.net Fri May 20 05:14:46 2022 From: sviswanathan at openjdk.java.net (Sandhya Viswanathan) Date: Fri, 20 May 2022 05:14:46 GMT Subject: RFR: 8286972: Support the new loop induction variable related PopulateIndex IR node on x86 [v3] In-Reply-To: References: <4FHuZbNaSk6UWguhn3NIoGOhFrkMhfbxkW6kzX8-p34=.b7073ff8-b27c-443c-8b2d-ad2d9db49666@github.com> Message-ID: On Fri, 20 May 2022 04:04:42 GMT, Jatin Bhateja wrote: >> Sandhya Viswanathan has updated the pull request incrementally with two additional commits since the last revision: >> >> - remove warmup >> - Add jtreg test > > src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 2311: > >> 2309: case T_SHORT: evpbroadcastw(dst, src, vlen_enc); return; >> 2310: case T_FLOAT: case T_INT: evpbroadcastd(dst, src, vlen_enc); return; >> 2311: case T_DOUBLE: case T_LONG: evpbroadcastq(dst, src, vlen_enc); return; > > Can't we use single and double precision broadcasts for floating point types, like you have done in else part > It may save domain switch over penalty 
(Section 3.5.2.2 Bypass between Execution Domains, Intel® 64 and IA-32 Architectures Optimization Reference Manual) The floating point broadcast doesn't take the gpr as second source. ------------- PR: https://git.openjdk.java.net/jdk/pull/8778 From duke at openjdk.java.net Fri May 20 05:32:50 2022 From: duke at openjdk.java.net (yuta) Date: Fri, 20 May 2022 05:32:50 GMT Subject: RFR: 8287001: Add warning message when fail to load hsdis libraries In-Reply-To: References: Message-ID: On Thu, 19 May 2022 19:29:28 GMT, Vladimir Kozlov wrote: >> When failing to load hsdis(Hot Spot Disassembler) library (because there is no library or hsdis.so is old and so on), >> there is no warning message (only can see info level messages if put -Xlog:os=info). >> This should show a warning message to tell the user that you failed to load libraries for hsdis. >> So I put a warning message to notify this. >> >> e.g. >> ` > > Can you put the warning into `dll_load()`? > We already print messages there with `-XX:+Verbose` (unfortunately it is available only in debug VM). > Actually consider replacing print statements and `Verbose` check there with UL. Hi @vnkozlov, I think that if we put the warning into `Disassembler::dll_load()`, it would print a duplicate error message for each hsdis library path that is tried, as listed in `Disassembler::load_library`. // Find the disassembler shared library. // Search for several paths derived from libjvm, in this order: // 1. /lib//libhsdis-.so (for compatibility) // 2. /lib//hsdis-.so // 3. /lib/hsdis-.so // 4. hsdis-.so (using LD_LIBRARY_PATH) Besides, the message guarded by the `Verbose` check seems to report only whether the hsdis library path exceeds the defined length (it does not cover a missing or outdated library). So it seems reasonable to put the warning message into `Disassembler::load_library`, after all the patterns have been checked. 
------------- PR: https://git.openjdk.java.net/jdk/pull/8782 From jbhateja at openjdk.java.net Fri May 20 06:07:46 2022 From: jbhateja at openjdk.java.net (Jatin Bhateja) Date: Fri, 20 May 2022 06:07:46 GMT Subject: RFR: 8286972: Support the new loop induction variable related PopulateIndex IR node on x86 [v3] In-Reply-To: References: <4FHuZbNaSk6UWguhn3NIoGOhFrkMhfbxkW6kzX8-p34=.b7073ff8-b27c-443c-8b2d-ad2d9db49666@github.com> Message-ID: <3YwtRnsPx58Y2EECjEBuwOIzJoeJhCYvUHiUfbQRC3A=.4d290317-4d99-4e1d-bf7b-cf52582e086c@github.com> On Fri, 20 May 2022 05:10:59 GMT, Sandhya Viswanathan wrote: >> src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 2311: >> >>> 2309: case T_SHORT: evpbroadcastw(dst, src, vlen_enc); return; >>> 2310: case T_FLOAT: case T_INT: evpbroadcastd(dst, src, vlen_enc); return; >>> 2311: case T_DOUBLE: case T_LONG: evpbroadcastq(dst, src, vlen_enc); return; >> >> Can't we use single and double precision broadcasts for floating point types, like you have done in else part >> It may save domain switch over penalty (Section 3.5.2.2 Bypass between Execution Domains, Intel? 64 and IA-32 Architectures Optimization Reference Manual) > > The floating point broadcast doesn't take the gpr as second source. A prior move as in else part may be emitted for consistency or you want to keep floating point broadcasts only for else part. ------------- PR: https://git.openjdk.java.net/jdk/pull/8778 From jbhateja at openjdk.java.net Fri May 20 06:40:53 2022 From: jbhateja at openjdk.java.net (Jatin Bhateja) Date: Fri, 20 May 2022 06:40:53 GMT Subject: RFR: 8286972: Support the new loop induction variable related PopulateIndex IR node on x86 [v5] In-Reply-To: References: Message-ID: On Fri, 20 May 2022 05:09:41 GMT, Sandhya Viswanathan wrote: >> This PR adds x86 backend support for the new loop induction variable related PopulateIndex IR node. >> This IR node was added as part of [JDK-8280510](https://bugs.openjdk.java.net/browse/JDK-8280510). 
>> >> The performance numbers are as follows: >> Before: >> Benchmark (count) Mode Cnt Score Error Units >> IndexVector.exprWithIndex1 65536 thrpt 3 64556.552 ? 1126.396 ops/s >> IndexVector.exprWithIndex2 65536 thrpt 3 22117.050 ? 11452.098 ops/s >> IndexVector.indexArrayFill 65536 thrpt 3 117776.383 ? 1120.957 ops/s >> >> After: >> Benchmark (count) Mode Cnt Score Error Units >> IndexVector.exprWithIndex1 65536 thrpt 3 203180.290 ? 2147.807 ops/s >> IndexVector.exprWithIndex2 65536 thrpt 3 274132.756 ? 6853.393 ops/s >> IndexVector.indexArrayFill 65536 thrpt 3 374165.202 ? 46930.779 ops/s >> >> Please review. >> >> Best Regards, >> Sandhya > > Sandhya Viswanathan has updated the pull request incrementally with one additional commit since the last revision: > > review comment resolution Marked as reviewed by jbhateja (Committer). LGTM. Thanks ------------- PR: https://git.openjdk.java.net/jdk/pull/8778 From fgao at openjdk.java.net Fri May 20 06:45:49 2022 From: fgao at openjdk.java.net (Fei Gao) Date: Fri, 20 May 2022 06:45:49 GMT Subject: RFR: 8283307: Vectorize unsigned shift right on signed subword types [v3] In-Reply-To: References: Message-ID: On Fri, 22 Apr 2022 11:09:09 GMT, Fei Gao wrote: >> public short[] vectorUnsignedShiftRight(short[] shorts) { >> short[] res = new short[SIZE]; >> for (int i = 0; i < SIZE; i++) { >> res[i] = (short) (shorts[i] >>> 3); >> } >> return res; >> } >> >> In C2's SLP, vectorization of unsigned shift right on signed subword types (byte/short) like the case above is intentionally disabled[1]. Because the vector unsigned shift on signed subword types behaves differently from the Java spec. It's worthy to vectorize more cases in quite low cost. Also, unsigned shift right on signed subword is not uncommon and we may find similar cases in Lucene benchmark[2]. 
>> >> Taking unsigned right shift on short type as an example, >> ![image](https://user-images.githubusercontent.com/39403138/160313924-6bded802-c135-48db-98b8-7c5f43d8ff54.png) >> >> when the shift amount is a constant not greater than the number of sign extended bits, 16 higher bits for short type shown like >> above, the unsigned shift on signed subword types can be transformed into a signed shift and hence becomes vectorizable. Here is the transformation: >> ![image](https://user-images.githubusercontent.com/39403138/160314151-30249bfc-bdfc-4700-b4fb-97617b45184b.png) >> >> This patch does the transformation in `SuperWord::implemented()` and `SuperWord::output()`. It helps vectorize the short cases above. We can handle unsigned right shift on byte type in a similar way. The generated assembly code for one iteration on aarch64 is like: >> >> ... >> sbfiz x13, x10, #1, #32 >> add x15, x11, x13 >> ldr q16, [x15, #16] >> sshr v16.8h, v16.8h, #3 >> add x13, x17, x13 >> str q16, [x13, #16] >> ... >> >> >> Here is the performance data for micro-benchmark before and after this patch on both AArch64 and x64 machines. We can observe about ~80% improvement with this patch. >> >> The perf data on AArch64: >> Before the patch: >> Benchmark (SIZE) (shiftCount) Mode Cnt Score Error Units >> urShiftImmByte 1024 3 avgt 5 295.711 ? 0.117 ns/op >> urShiftImmShort 1024 3 avgt 5 284.559 ? 0.148 ns/op >> >> after the patch: >> Benchmark (SIZE) (shiftCount) Mode Cnt Score Error Units >> urShiftImmByte 1024 3 avgt 5 45.111 ? 0.047 ns/op >> urShiftImmShort 1024 3 avgt 5 55.294 ? 0.072 ns/op >> >> The perf data on X86: >> Before the patch: >> Benchmark (SIZE) (shiftCount) Mode Cnt Score Error Units >> urShiftImmByte 1024 3 avgt 5 361.374 ? 4.621 ns/op >> urShiftImmShort 1024 3 avgt 5 365.390 ? 3.595 ns/op >> >> After the patch: >> Benchmark (SIZE) (shiftCount) Mode Cnt Score Error Units >> urShiftImmByte 1024 3 avgt 5 105.489 ? 0.488 ns/op >> urShiftImmShort 1024 3 avgt 5 43.400 ? 
0.394 ns/op >> >> [1] https://github.com/openjdk/jdk/blob/002e3667443d94e2303c875daf72cf1ccbbb0099/src/hotspot/share/opto/vectornode.cpp#L190 >> [2] https://github.com/jpountz/decode-128-ints-benchmark/ > > Fei Gao has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains five additional commits since the last revision: > > - Rewrite the scalar calculation to avoid inline > > Change-Id: I5959d035278097de26ab3dfe6f667d6f7476c723 > - Merge branch 'master' into fg8283307 > > Change-Id: Id3ec8594da49fb4e6c6dcad888bcb1dfc0aac303 > - Remove related comments in some test files > > Change-Id: I5dd1c156bd80221dde53737e718da0254c5381d8 > - Merge branch 'master' into fg8283307 > > Change-Id: Ic4645656ea156e8cac993995a5dc675aa46cb21a > - 8283307: Vectorize unsigned shift right on signed subword types > > ``` > public short[] vectorUnsignedShiftRight(short[] shorts) { > short[] res = new short[SIZE]; > for (int i = 0; i < SIZE; i++) { > res[i] = (short) (shorts[i] >>> 3); > } > return res; > } > ``` > In C2's SLP, vectorization of unsigned shift right on signed > subword types (byte/short) like the case above is intentionally > disabled[1]. Because the vector unsigned shift on signed > subword types behaves differently from the Java spec. It's > worthy to vectorize more cases in quite low cost. Also, > unsigned shift right on signed subword is not uncommon and we > may find similar cases in Lucene benchmark[2]. > > Taking unsigned right shift on short type as an example, > > Short: > | <- 16 bits -> | <- 16 bits -> | > | 1 1 1 ... 1 1 | data | > > when the shift amount is a constant not greater than the number > of sign extended bits, 16 higher bits for short type shown like > above, the unsigned shift on signed subword types can be > transformed into a signed shift and hence becomes vectorizable. 
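The claim that the unsigned shift can be replaced by a signed one for shift amounts up to 16 can be checked exhaustively in scalar Java. This is a minimal sketch with hypothetical class and method names; it only verifies the scalar equivalence for `short` and does not model what SuperWord does internally.

```java
public class SubwordShift {
    // Scalar form of the loop body: unsigned shift on the sign-extended value,
    // then truncation back to short.
    static short unsignedShift(short s, int n) {
        return (short) (s >>> n);
    }

    // The candidate replacement: arithmetic (signed) shift, then truncation.
    static short signedShift(short s, int n) {
        return (short) (s >> n);
    }

    // Returns the smallest shift amount in [0, 31] for which some short value
    // gives different results, or -1 if the two forms always agree.
    static int firstMismatchShift() {
        for (int n = 0; n < 32; n++) {
            for (int v = Short.MIN_VALUE; v <= Short.MAX_VALUE; v++) {
                short s = (short) v;
                if (unsignedShift(s, n) != signedShift(s, n)) {
                    return n;
                }
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // The two forms agree for every short value up to a shift of 16;
        // the first disagreement appears one step later.
        System.out.println("first mismatching shift: " + firstMismatchShift());
    }
}
```

The two shifts differ only in the top `n` bits of the 32-bit intermediate, so the low 16 bits kept by the cast are unaffected as long as `n` does not exceed the 16 sign-extension bits.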
> Here is the transformation: > > For T_SHORT (shift <= 16): > src RShiftCntV shift src RShiftCntV shift > \ / ==> \ / > URShiftVS RShiftVS > > This patch does the transformation in SuperWord::implemented() and > SuperWord::output(). It helps vectorize the short cases above. We > can handle unsigned right shift on byte type in a similar way. The > generated assembly code for one iteration on aarch64 is like: > ``` > ... > sbfiz x13, x10, #1, #32 > add x15, x11, x13 > ldr q16, [x15, #16] > sshr v16.8h, v16.8h, #3 > add x13, x17, x13 > str q16, [x13, #16] > ... > ``` > > Here is the performance data for micro-benchmark before and after > this patch on both AArch64 and x64 machines. We can observe about > ~80% improvement with this patch. > > The perf data on AArch64: > Before the patch: > Benchmark (SIZE) (shiftCount) Mode Cnt Score Error Units > urShiftImmByte 1024 3 avgt 5 295.711 ? 0.117 ns/op > urShiftImmShort 1024 3 avgt 5 284.559 ? 0.148 ns/op > > after the patch: > Benchmark (SIZE) (shiftCount) Mode Cnt Score Error Units > urShiftImmByte 1024 3 avgt 5 45.111 ? 0.047 ns/op > urShiftImmShort 1024 3 avgt 5 55.294 ? 0.072 ns/op > > The perf data on X86: > Before the patch: > Benchmark (SIZE) (shiftCount) Mode Cnt Score Error Units > urShiftImmByte 1024 3 avgt 5 361.374 ? 4.621 ns/op > urShiftImmShort 1024 3 avgt 5 365.390 ? 3.595 ns/op > > After the patch: > Benchmark (SIZE) (shiftCount) Mode Cnt Score Error Units > urShiftImmByte 1024 3 avgt 5 105.489 ? 0.488 ns/op > urShiftImmShort 1024 3 avgt 5 43.400 ? 
0.394 ns/op > > [1] https://github.com/openjdk/jdk/blob/002e3667443d94e2303c875daf72cf1ccbbb0099/src/hotspot/share/opto/vectornode.cpp#L190 > [2] https://github.com/jpountz/decode-128-ints-benchmark/ > > Change-Id: I9bd0cfdfcd9c477e8905a4c877d5e7ff14e39161 Hi @vnkozlov @rwestrel , can I get a second review please :) ------------- PR: https://git.openjdk.java.net/jdk/pull/7979 From duke at openjdk.java.net Fri May 20 07:16:50 2022 From: duke at openjdk.java.net (Haomin) Date: Fri, 20 May 2022 07:16:50 GMT Subject: RFR: 8286847: Rotate vectors don't support byte or short [v3] In-Reply-To: References: <9zUsdzzbL3abbWSfrBcU3pqp8sPIJhP2B08nKAmMBZE=.381b9c9e-5111-45c5-aa2a-fd3ce502d6b6@github.com> Message-ID: On Fri, 20 May 2022 03:27:09 GMT, Haomin wrote: >> I'm sorry, I just built the latest vectorIntrinsics branch and didn't hit the error. Did you just cherry-pick your patch on it? Thanks! > > Yes, I just cherry-picked my patch. I also just built the latest vectorIntrinsics branch and hit the error there too. - CFLAGS_WARNINGS_ARE_ERRORS="-Werror" + CFLAGS_WARNINGS_ARE_ERRORS="" ... - JAVA_WARNINGS_ARE_ERRORS ?= -Werror + JAVA_WARNINGS_ARE_ERRORS ?= ... With the dirty diff above, I have tested Byte128Vector. Yes, the rotate score is lower than before my patch. Could you give me some suggestions? ------------- PR: https://git.openjdk.java.net/jdk/pull/8740 From duke at openjdk.java.net Fri May 20 07:51:40 2022 From: duke at openjdk.java.net (aamarsh) Date: Fri, 20 May 2022 07:51:40 GMT Subject: RFR: 8282024: add EscapeAnalysis statistics under PrintOptoStatistics [v12] In-Reply-To: References: Message-ID: > Escape Analysis and Scalar Replacement statistics are now printed when the -XX:+PrintOptoStatistics flag is set. All code is placed in `#ifndef PRODUCT` blocks, so it is only compiled into non-product (debug) builds. Using the Renaissance benchmark suite, I ran a few tests to confirm that the numbers were printing correctly.
Below is an example run: > > > No escape = 317, Arg escape = 118, Global escape = 1995 (EA executed in 11.71 seconds) ** EA stats might be slightly off since objects might be double counted due to iterative EA ** > Objects scalar replaced = 201, Monitor objects removed = 23, GC barriers removed = 50, Memory barriers removed = 224 aamarsh has updated the pull request incrementally with one additional commit since the last revision: eliminate trailing whitespace ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/8019/files - new: https://git.openjdk.java.net/jdk/pull/8019/files/3e3aaf57..337ab461 Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=8019&range=11 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=8019&range=10-11 Stats: 2 lines in 1 file changed: 0 ins; 0 del; 2 mod Patch: https://git.openjdk.java.net/jdk/pull/8019.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8019/head:pull/8019 PR: https://git.openjdk.java.net/jdk/pull/8019 From jbhateja at openjdk.java.net Fri May 20 08:49:53 2022 From: jbhateja at openjdk.java.net (Jatin Bhateja) Date: Fri, 20 May 2022 08:49:53 GMT Subject: RFR: 8285868: x86_64 intrinsics for floating point methods isNaN, isFinite and isInfinite [v8] In-Reply-To: References: Message-ID: On Thu, 19 May 2022 07:09:33 GMT, Jatin Bhateja wrote: >> Srinivas Vamsi Parasa has updated the pull request with a new target base due to a merge or a rebase. 
The pull request now contains ten commits: >> >> - add comment for vfpclasss/d for isFinite() >> - Merge branch 'master' of https://git.openjdk.java.net/jdk into float >> - zero out the upper bits not written by setb >> - use 0x1 to be simpler >> - remove the redundant temp register >> - Split the macros using predicate >> - update jmh tests >> - Merge branch 'master' into float >> - 8285868: x86_64 intrinsics for floating point methods isNaN, isFinite and isInfinite > > src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 4750: > >> 4748: movdl(temp, src); >> 4749: andl(temp, KILL_SIGN_MASK); >> 4750: cmpl(temp, POS_INF); > > For IsNaN, the following sequence will offer better latency: > "vucomiss src_xmm, src_xmm" > "setp r8" Hi @vamsi-parasa , I can see a roughly 33% improvement in your JMH microbenchmark for IsNaN with the change suggested above. Original score on an AVX2 machine: Benchmark Mode Cnt Score Error Units FloatClassCheck.testIsNaN avgt 2 0.854 ns/op FloatClassCheck.testIsNaN:·asm avgt NaN --- With the new sequence: Benchmark Mode Cnt Score Error Units FloatClassCheck.testIsNaN avgt 2 0.570 ns/op FloatClassCheck.testIsNaN:·asm avgt NaN --- ------------- PR: https://git.openjdk.java.net/jdk/pull/8459 From jbhateja at openjdk.java.net Fri May 20 09:51:24 2022 From: jbhateja at openjdk.java.net (Jatin Bhateja) Date: Fri, 20 May 2022 09:51:24 GMT Subject: RFR: 8284960: Integration of JEP 426: Vector API (Fourth Incubator) [v8] In-Reply-To: References: Message-ID: <-gYfiftVAdAUo-yZv2Y04HhoT7JT5lDcjDjCZ0UvSVc=.aa9d454d-3d6a-458a-997e-9a83951a8fa6@github.com> > Hi All, > > Patch adds the planned support for new vector operations and APIs targeted for [JEP 426: Vector API (Fourth Incubator).](https://bugs.openjdk.java.net/browse/JDK-8280173) > > Following is the brief summary of changes:- > > 1) Extends the scope of existing lanewise API for following new vector operations.
> - VectorOperations.BIT_COUNT: counts the number of one-bits > - VectorOperations.LEADING_ZEROS_COUNT: counts the number of leading zero bits > - VectorOperations.TRAILING_ZEROS_COUNT: counts the number of trailing zero bits > - VectorOperations.REVERSE: reversing the order of bits > - VectorOperations.REVERSE_BYTES: reversing the order of bytes > - compress and expand bits: Semantics are based on Hacker's Delight section 7-4 Compress, or Generalized Extract. > > 2) Adds following new APIs to perform cross lane vector compress and expansion operations under the influence of a mask. > - Vector.compress > - Vector.expand > - VectorMask.compress > > 3) Adds predicated and non-predicated versions of following new APIs to load and store the contents of vector from foreign MemorySegments. > - Vector.fromMemorySegment > - Vector.intoMemorySegment > > 4) C2 Compiler IR enhancements and optimized X86 and AARCH64 backend support for each newly added operation. > > > Patch has been regressed over AARCH64 and X86 targets different AVX levels. > > Kindly review and share your feedback. > > Best Regards, > Jatin Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: 8284960: Integrating incremental patches. 
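As a reading aid for item 2 above, the following is a minimal scalar model of what a cross-lane compress and expand under a mask compute. It deliberately uses plain arrays rather than the `jdk.incubator.vector` classes, and the class and method names are hypothetical; it is a sketch of the lane semantics as I understand them (selected lanes packed into the low lanes, remaining lanes zeroed), not the implementation in this patch.

```java
import java.util.Arrays;

// Illustration only: a plain-array model, not the Vector API itself.
final class LaneCompressModel {
    // Packs the lanes selected by the mask into the lowest lanes; the rest stay zero.
    static int[] compress(int[] lanes, boolean[] mask) {
        int[] out = new int[lanes.length];
        int next = 0;
        for (int i = 0; i < lanes.length; i++) {
            if (mask[i]) {
                out[next++] = lanes[i];
            }
        }
        return out;
    }

    // Inverse direction: spreads the lowest lanes out to the positions selected by the mask.
    static int[] expand(int[] lanes, boolean[] mask) {
        int[] out = new int[lanes.length];
        int next = 0;
        for (int i = 0; i < lanes.length; i++) {
            if (mask[i]) {
                out[i] = lanes[next++];
            }
        }
        return out;
    }

    public static void main(String[] args) {
        int[] v = {10, 20, 30, 40};
        boolean[] m = {false, true, false, true};
        System.out.println(Arrays.toString(compress(v, m))); // [20, 40, 0, 0]
        System.out.println(Arrays.toString(expand(v, m)));   // [0, 10, 0, 20]
    }
}
```

In this model, expanding a compressed vector with the same mask restores the selected lanes and zeroes the others.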
------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/8425/files - new: https://git.openjdk.java.net/jdk/pull/8425/files/311f3233..17a0e38c Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=8425&range=07 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=8425&range=06-07 Stats: 32 lines in 7 files changed: 0 ins; 26 del; 6 mod Patch: https://git.openjdk.java.net/jdk/pull/8425.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8425/head:pull/8425 PR: https://git.openjdk.java.net/jdk/pull/8425 From jbhateja at openjdk.java.net Fri May 20 09:51:27 2022 From: jbhateja at openjdk.java.net (Jatin Bhateja) Date: Fri, 20 May 2022 09:51:27 GMT Subject: RFR: 8284960: Integration of JEP 426: Vector API (Fourth Incubator) [v7] In-Reply-To: References: Message-ID: On Thu, 19 May 2022 21:19:49 GMT, Paul Sandoz wrote: >> Jatin Bhateja has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 16 commits: >> >> - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8284960 >> - 8284960: Changes to enable jdk.incubator.vector to be treated as preview participant. Code re-organization related to Reverse/ReverseByte IR transforms. >> - 8284960: Adding --enable-preview in vectorAPI benchmarks. >> - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8284960 >> - 8284960: Review comments resolution. >> - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8284960 >> - 8284960: Correcting a typo. >> - 8284960: Integrating changes from panama-vector (Add @since 19 tags). >> - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8284960 >> - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8284960 >> - ... 
and 6 more: https://git.openjdk.java.net/jdk/compare/9f562ef7...311f3233 > > src/jdk.compiler/share/classes/com/sun/tools/javac/code/Preview.java line 50: > >> 48: import java.util.Set; >> 49: >> 50: import static com.sun.tools.javac.code.Flags.PREVIEW_API; > > Suggestion: > > > Redundant import (sorry i should have checked before i sent you updates to this area) Merged > src/jdk.compiler/share/classes/com/sun/tools/javac/code/Preview.java line 132: > >> 130: * @return true if {@code s} is participating in the preview of {@code previewSymbol} >> 131: */ >> 132: public boolean isPreviewParticipating(Symbol s, Symbol previewSymbol) { > > Some feedback from a colleague: > Suggestion: > > /** > * Returns true if {@code s} is deemed to participate in the preview of {@code previewSymbol}, and > * therefore no warnings or errors will be produced. > * > * @param s the symbol depending on the preview symbol > * @param previewSymbol the preview symbol marked with @Preview > * @return true if {@code s} is participating in the preview of {@code previewSymbol} > */ > public boolean participatesInPreview(Symbol s, Symbol previewSymbol) { Merged. ------------- PR: https://git.openjdk.java.net/jdk/pull/8425 From chagedorn at openjdk.java.net Fri May 20 09:54:18 2022 From: chagedorn at openjdk.java.net (Christian Hagedorn) Date: Fri, 20 May 2022 09:54:18 GMT Subject: RFR: 8286967: Unproblemlist compiler/c2/irTests/TestSkeletonPredicates.java and add additional test for JDK-8286638 Message-ID: [JDK-8286361](https://bugs.openjdk.java.net/browse/JDK-8286361) could be traced back to the same underlying problem as in [JDK-8286638](https://bugs.openjdk.java.net/browse/JDK-8286638). Pulling in the change fixed the problem. This patch unproblemlists the previously failing test and adds a new test for JDK-8286638 (extracted from compiler/c2/irTests/TestSkeletonPredicates.java) that I've used for analyzing JDK-8286361. 
Testing with latest JDK: - hs-tier1-4 flags for the new test - hs-tier7+8 flags for compiler/c2/irTests/TestSkeletonPredicates.java Thanks, Christian ------------- Commit messages: - remove newline - 8286967: Unproblemlist compiler/c2/irTests/TestSkeletonPredicates.java and add additional test for JDK-8286638 Changes: https://git.openjdk.java.net/jdk/pull/8806/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=8806&range=00 Issue: https://bugs.openjdk.java.net/browse/JDK-8286967 Stats: 79 lines in 2 files changed: 77 ins; 2 del; 0 mod Patch: https://git.openjdk.java.net/jdk/pull/8806.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8806/head:pull/8806 PR: https://git.openjdk.java.net/jdk/pull/8806 From mcimadamore at openjdk.java.net Fri May 20 10:37:20 2022 From: mcimadamore at openjdk.java.net (Maurizio Cimadamore) Date: Fri, 20 May 2022 10:37:20 GMT Subject: RFR: 8284960: Integration of JEP 426: Vector API (Fourth Incubator) [v8] In-Reply-To: <-gYfiftVAdAUo-yZv2Y04HhoT7JT5lDcjDjCZ0UvSVc=.aa9d454d-3d6a-458a-997e-9a83951a8fa6@github.com> References: <-gYfiftVAdAUo-yZv2Y04HhoT7JT5lDcjDjCZ0UvSVc=.aa9d454d-3d6a-458a-997e-9a83951a8fa6@github.com> Message-ID: On Fri, 20 May 2022 09:51:24 GMT, Jatin Bhateja wrote: >> Hi All, >> >> Patch adds the planned support for new vector operations and APIs targeted for [JEP 426: Vector API (Fourth Incubator).](https://bugs.openjdk.java.net/browse/JDK-8280173) >> >> Following is the brief summary of changes:- >> >> 1) Extends the scope of existing lanewise API for following new vector operations. 
>> - VectorOperations.BIT_COUNT: counts the number of one-bits >> - VectorOperations.LEADING_ZEROS_COUNT: counts the number of leading zero bits >> - VectorOperations.TRAILING_ZEROS_COUNT: counts the number of trailing zero bits >> - VectorOperations.REVERSE: reversing the order of bits >> - VectorOperations.REVERSE_BYTES: reversing the order of bytes >> - compress and expand bits: Semantics are based on Hacker's Delight section 7-4 Compress, or Generalized Extract. >> >> 2) Adds following new APIs to perform cross lane vector compress and expansion operations under the influence of a mask. >> - Vector.compress >> - Vector.expand >> - VectorMask.compress >> >> 3) Adds predicated and non-predicated versions of following new APIs to load and store the contents of vector from foreign MemorySegments. >> - Vector.fromMemorySegment >> - Vector.intoMemorySegment >> >> 4) C2 Compiler IR enhancements and optimized X86 and AARCH64 backend support for each newly added operation. >> >> >> Patch has been regressed over AARCH64 and X86 targets different AVX levels. >> >> Kindly review and share your feedback. >> >> Best Regards, >> Jatin > > Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: > > 8284960: Integrating incremental patches. Javac changes look good ------------- Marked as reviewed by mcimadamore (Reviewer). 
PR: https://git.openjdk.java.net/jdk/pull/8425 From jlahoda at openjdk.java.net Fri May 20 11:13:39 2022 From: jlahoda at openjdk.java.net (Jan Lahoda) Date: Fri, 20 May 2022 11:13:39 GMT Subject: RFR: 8284960: Integration of JEP 426: Vector API (Fourth Incubator) [v8] In-Reply-To: <-gYfiftVAdAUo-yZv2Y04HhoT7JT5lDcjDjCZ0UvSVc=.aa9d454d-3d6a-458a-997e-9a83951a8fa6@github.com> References: <-gYfiftVAdAUo-yZv2Y04HhoT7JT5lDcjDjCZ0UvSVc=.aa9d454d-3d6a-458a-997e-9a83951a8fa6@github.com> Message-ID: On Fri, 20 May 2022 09:51:24 GMT, Jatin Bhateja wrote: >> Hi All, >> >> Patch adds the planned support for new vector operations and APIs targeted for [JEP 426: Vector API (Fourth Incubator).](https://bugs.openjdk.java.net/browse/JDK-8280173) >> >> Following is the brief summary of changes:- >> >> 1) Extends the scope of existing lanewise API for following new vector operations. >> - VectorOperations.BIT_COUNT: counts the number of one-bits >> - VectorOperations.LEADING_ZEROS_COUNT: counts the number of leading zero bits >> - VectorOperations.TRAILING_ZEROS_COUNT: counts the number of trailing zero bits >> - VectorOperations.REVERSE: reversing the order of bits >> - VectorOperations.REVERSE_BYTES: reversing the order of bytes >> - compress and expand bits: Semantics are based on Hacker's Delight section 7-4 Compress, or Generalized Extract. >> >> 2) Adds following new APIs to perform cross lane vector compress and expansion operations under the influence of a mask. >> - Vector.compress >> - Vector.expand >> - VectorMask.compress >> >> 3) Adds predicated and non-predicated versions of following new APIs to load and store the contents of vector from foreign MemorySegments. >> - Vector.fromMemorySegment >> - Vector.intoMemorySegment >> >> 4) C2 Compiler IR enhancements and optimized X86 and AARCH64 backend support for each newly added operation. >> >> >> Patch has been regressed over AARCH64 and X86 targets different AVX levels. >> >> Kindly review and share your feedback. 
>> >> Best Regards, >> Jatin > > Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: > > 8284960: Integrating incremental patches. The javac changes look OK to me. ------------- Marked as reviewed by jlahoda (Reviewer). PR: https://git.openjdk.java.net/jdk/pull/8425 From rcastanedalo at openjdk.java.net Fri May 20 11:59:08 2022 From: rcastanedalo at openjdk.java.net (Roberto =?UTF-8?B?Q2FzdGHDsWVkYQ==?= Lozano) Date: Fri, 20 May 2022 11:59:08 GMT Subject: RFR: 8286177: C2: "failed: non-reduction loop contains reduction nodes" assert failure Message-ID: [JDK-8279622](https://bugs.openjdk.java.net/browse/JDK-8279622) introduced an assertion in the SLP analysis verifying that the examined loop does not contain inconsistent reduction information. The assertion is placed at the beginning of the SLP analysis, and can fail even for loops that are not vectorizable by SLP (and hence without risk of being miscompiled). This is the case for the [reported issue](https://bugs.openjdk.java.net/browse/JDK-8286177) and for many other recent failures triggered by JavaFuzzer-generated test cases. This changeset postpones the assertion to the SLP output phase to ensure that it only fails when the program really risks being miscompiled. Given that many transformations can invalidate reduction information (see for example the JBS reports for [JDK-8261147](https://bugs.openjdk.java.net/browse/JDK-8261147), [JDK-8279622](https://bugs.openjdk.java.net/browse/JDK-8279622), and [JDK-8286177](https://bugs.openjdk.java.net/browse/JDK-8286177)), a more fundamental fix would be to remove the possibility of inconsistent reduction information by construction, by running the reduction analysis on-demand. I will file a RFE to explore this idea after JDK 19. #### Testing - hs-tier1-5 (windows-x64, linux-x64, linux-aarch64, and macosx-x64; debug mode). 
- Tested that the assertion, in its new placement, would still have caught the bug reported in [JDK-8279622](https://bugs.openjdk.java.net/browse/JDK-8279622) in the original scenario. - Tested that the assertion, in its new placement, does not fail for 27 JavaFuzzer-generated test cases that trigger the assertion in its original placement. ------------- Commit messages: - Add reduced test case - Postpone consistent reduction information assert to SuperWord::output() Changes: https://git.openjdk.java.net/jdk/pull/8805/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=8805&range=00 Issue: https://bugs.openjdk.java.net/browse/JDK-8286177 Stats: 67 lines in 2 files changed: 65 ins; 2 del; 0 mod Patch: https://git.openjdk.java.net/jdk/pull/8805.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8805/head:pull/8805 PR: https://git.openjdk.java.net/jdk/pull/8805 From rcastanedalo at openjdk.java.net Fri May 20 12:10:49 2022 From: rcastanedalo at openjdk.java.net (Roberto =?UTF-8?B?Q2FzdGHDsWVkYQ==?= Lozano) Date: Fri, 20 May 2022 12:10:49 GMT Subject: RFR: 8263075: C2: simplify anti-dependence check in PhaseCFG::implicit_null_check() [v4] In-Reply-To: References: Message-ID: On Thu, 19 May 2022 22:54:25 GMT, Brian J. Stafford wrote: >> The reporter for this issue (https://bugs.openjdk.java.net/browse/JDK-8263075) indicated that there's an assumption that we can rely on that the while loop in question will run exactly one time. Based on this, I've done the following: >> >> - Asserted the condition that makes sure the code runs at least once >> - Asserted the condition that makes sure the code runs only once >> - Removed the `while` loop >> - Changed a couple of `break` statements into `continue` statements. They no longer need to break out of the `while` loop, now that it's gone. However, they were early exits from the `while` loop that ended up resulting in `continue` statements for the larger enclosing loop. Thus we can just call `continue` directly. 
>> - Removed the local variable `b`, as we no longer need to traverse the node hierarchy. We can use `mb` directly. >> >> Passes jdk, langtools, and hotspot Tier 1 tests on Linux (x64 and ARM64) and macOS (x64 and ARM64). Most Tier 1 tests pass on Windows (x64 and ARM64), but there are a handful of failures unrelated to this change. > > Brian J. Stafford has updated the pull request incrementally with one additional commit since the last revision: > > Removing whitespace Thanks for addressing all my comments! I just have a last, style comment on the updated revision. src/hotspot/share/opto/lcm.cpp line 336: > 334: //mach is a store, hence block is the immediate dominator of mb. > 335: //Due to the null-check shape of block (where its successors cannot re-join), > 336: //block must be the direct predecessor of mb. Please, introduce a single space between each `//` and the comment text. ------------- Changes requested by rcastanedalo (Committer). PR: https://git.openjdk.java.net/jdk/pull/8684 From chagedorn at openjdk.java.net Fri May 20 12:45:15 2022 From: chagedorn at openjdk.java.net (Christian Hagedorn) Date: Fri, 20 May 2022 12:45:15 GMT Subject: RFR: 8286177: C2: "failed: non-reduction loop contains reduction nodes" assert failure In-Reply-To: References: Message-ID: On Fri, 20 May 2022 09:45:19 GMT, Roberto Casta?eda Lozano wrote: > [JDK-8279622](https://bugs.openjdk.java.net/browse/JDK-8279622) introduced an assertion in the SLP analysis verifying that the examined loop does not contain inconsistent reduction information. The assertion is placed at the beginning of the SLP analysis, and can fail even for loops that are not vectorizable by SLP (and hence without risk of being miscompiled). This is the case for the [reported issue](https://bugs.openjdk.java.net/browse/JDK-8286177) and for many other recent failures triggered by JavaFuzzer-generated test cases. 
This changeset postpones the assertion to the SLP output phase to ensure that it only fails when the program really risks being miscompiled. > > Given that many transformations can invalidate reduction information (see for example the JBS reports for [JDK-8261147](https://bugs.openjdk.java.net/browse/JDK-8261147), [JDK-8279622](https://bugs.openjdk.java.net/browse/JDK-8279622), and [JDK-8286177](https://bugs.openjdk.java.net/browse/JDK-8286177)), a more fundamental fix would be to remove the possibility of inconsistent reduction information by construction, by running the reduction analysis on-demand. I will file a RFE to explore this idea after JDK 19. > > #### Testing > > - hs-tier1-5 (windows-x64, linux-x64, linux-aarch64, and macosx-x64; debug mode). > - Tested that the assertion, in its new placement, would still have caught the bug reported in [JDK-8279622](https://bugs.openjdk.java.net/browse/JDK-8279622) in the original scenario. > - Tested that the assertion, in its new placement, does not fail for 27 JavaFuzzer-generated test cases that trigger the assertion in its original placement. That looks reasonable to go with this lower-risk fix for JDK 19 since the assertion warns about a problem that has been around for quite some time. It makes sense to revisit this again later and propose a full fix in JDK 20 to avoid similar problems with wrongly marked reduction nodes in the future. ------------- Marked as reviewed by chagedorn (Reviewer). 
PR: https://git.openjdk.java.net/jdk/pull/8805 From rcastanedalo at openjdk.java.net Fri May 20 13:23:55 2022 From: rcastanedalo at openjdk.java.net (Roberto =?UTF-8?B?Q2FzdGHDsWVkYQ==?= Lozano) Date: Fri, 20 May 2022 13:23:55 GMT Subject: RFR: 8286177: C2: "failed: non-reduction loop contains reduction nodes" assert failure In-Reply-To: References: Message-ID: On Fri, 20 May 2022 12:42:52 GMT, Christian Hagedorn wrote: > That looks reasonable to go with this lower-risk fix for JDK 19 since the assertion warns about a problem that has been around for quite some time. It makes sense to revisit this again later and propose a full fix in JDK 20 to avoid similar problems with wrongly marked reduction nodes in the future. Thanks for the review, Christian! I just filed a [RFE](https://bugs.openjdk.java.net/browse/JDK-8287087) for this. ------------- PR: https://git.openjdk.java.net/jdk/pull/8805 From kvn at openjdk.java.net Fri May 20 15:03:04 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Fri, 20 May 2022 15:03:04 GMT Subject: RFR: 8286972: Support the new loop induction variable related PopulateIndex IR node on x86 [v5] In-Reply-To: References: Message-ID: On Fri, 20 May 2022 05:09:41 GMT, Sandhya Viswanathan wrote: >> This PR adds x86 backend support for the new loop induction variable related PopulateIndex IR node. >> This IR node was added as part of [JDK-8280510](https://bugs.openjdk.java.net/browse/JDK-8280510). >> >> The performance numbers are as follows: >> Before: >> Benchmark (count) Mode Cnt Score Error Units >> IndexVector.exprWithIndex1 65536 thrpt 3 64556.552 ? 1126.396 ops/s >> IndexVector.exprWithIndex2 65536 thrpt 3 22117.050 ? 11452.098 ops/s >> IndexVector.indexArrayFill 65536 thrpt 3 117776.383 ? 1120.957 ops/s >> >> After: >> Benchmark (count) Mode Cnt Score Error Units >> IndexVector.exprWithIndex1 65536 thrpt 3 203180.290 ? 2147.807 ops/s >> IndexVector.exprWithIndex2 65536 thrpt 3 274132.756 ? 
6853.393 ops/s >> IndexVector.indexArrayFill 65536 thrpt 3 374165.202 ± 46930.779 ops/s >> >> Please review. >> >> Best Regards, >> Sandhya > > Sandhya Viswanathan has updated the pull request incrementally with one additional commit since the last revision: > > review comment resolution I have to re-test it because the code in the .ad file changed. ------------- PR: https://git.openjdk.java.net/jdk/pull/8778 From vladimir.kozlov at oracle.com Fri May 20 15:31:51 2022 From: vladimir.kozlov at oracle.com (Vladimir Kozlov) Date: Fri, 20 May 2022 08:31:51 -0700 Subject: [EXTERNAL]Mismatched ciMethodData in replay file. In-Reply-To: References: <5cd7f04c-c9ad-bbaa-6cba-0616ca9cae9d@amazon.com> <5ab5eda1-388e-9909-bf29-67e445a52ba8@oracle.com> Message-ID: On 5/19/22 10:14 PM, Liu, Xin wrote: > hi, Vladimir, > > Thank you for taking a look at this. > > > On 5/19/22 1:00 PM, Vladimir Kozlov wrote: >> Narrowed to hotspot-compiler list. >> >> You are right that it is weird. The dump is done from ciMethodData, which is a local (per compiler thread) clone of the MDO >> and should not be updated during compilation. Unless there is a place where we still go into the VM to get some numbers. >> > Oh, I see! The Compile constructor calls ciMethod::ensure_method_data(), which > creates a snapshot of the MDO in the compiler's arena. > >> One explanation is that we hit a safepoint during the dump and lost part of the output. >> > As you said, the data are local. The C2 thread sweeps the ProfileData in two rounds. > > I am lost here. We write the data to a fileStream. It looks like we don't > yield for safepoint synchronization between the two rounds. How could we > lose data here? > >> I think we need a verification mode for the replay dump to catch such cases (a separate count to catch such a mismatch).
>> >> Thanks, >> Vladimir K >> > > Do you mean that this mismatched replay data is useless? I am still trying > to work out what went wrong in C2CompilerThread. No. I am trying to suggest how we can improve the replay dump code to catch such cases or avoid them. Maybe we can do one pass instead of two by collecting the output in a local buffer while counting. And I agree with your suggestion in the filed RFE. Regards, Vladimir K > > Among the 26 data fields, ciReplay manages to decode this sequence as > "ciVirtualCallData" because the header 0x70005 denotes bci=7 and tag = > virtual_call_data_tag(5). > > > 0x70005 0x4d55 0x0 0x7f6a5841c3c0 0xa3 0x7f6a5841c470 > > > _data->_cells[4] = 0x7f6a5841c470 is the second profiling receiver's > ciKlass. It's an unmapped address. > > This may lead us to the culprit of the C2 thread crash. If we ditch > ill-formed ciReplay files, we may miss the clue. > > thanks, > --lx > > > >> On 5/18/22 11:14 PM, Liu, Xin wrote: >>> hi, >>> >>> I got a weird replay file, which was generated by 17.0.3+6-LTS. I don't see >>> that the relevant code has changed since then, so I think it is still applicable >>> to the tip of HotSpot. >>> >>> A customer shared the replay file >>> with us (https://github.com/corretto/corretto-17/issues/57#issuecomment-1130042063) >>> and I am trying to reproduce his failure. It is written from >>> VMError::report_and_die(). >>> >>> One obstacle is some weird ciMethodData entries, e.g. at line >>> 14130: it declares that 2 non-null oops will follow (see the '2' >>> after the tag 'oops'). However, only one is recorded.
>>> >>> ciMethodData kotlin/coroutines/jvm/internal/ContinuationImpl >>> (Lkotlin/coroutines/Continuation;)V 2 21538 orig 80 0 0 0 0 0 0 0 0 0 0 >>> 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 >>> 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 data >>> 26 0x40007 0x402 0x70 0x4e9c 0x70005 0x4d55 0x0 0x7f6a5841c3c0 0xa3 >>> 0x7f6a5841c470 0xa4 0xc0003 0x4e9c 0x18 0x110002 0x529e 0x0 0x0 0x0 0x0 >>> 0x0 0x0 0x9 0x2 0x6 0x0 oops 2 7 >>> com/example/ProductAttRouter$withRequestLoggingContext$1 methods 0 >>> >>> Another mismatched entry is at line 14203. it says there are 11 >>> oops but only 6 are there. >>> >>> Those mismatched entries leave uninitialized elements of rec->_classes >>> and eventually crash ciReplay::initialize(). Have you seen them before? >>> I can patch up hotspot to handle this mismatch, but I wonder how that >>> happens? >>> >>> ciMethodData::dump_replay_data() iterates >>> _data 2 rounds. The 1st round counts them and second round dumps them. >>> https://github.com/openjdk/jdk/blob/master/src/hotspot/share/ci/ciMethodData.cpp#L728 >>> >>> Is that possible that the underlying data get updated on the fly? This >>> case is kotlin coroutine. I am not >>> sure whether it is same threading environment as classic Java. >>> >>> thanks, >>> --lx >> > From kvn at openjdk.java.net Fri May 20 16:06:07 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Fri, 20 May 2022 16:06:07 GMT Subject: RFR: 8286177: C2: "failed: non-reduction loop contains reduction nodes" assert failure In-Reply-To: References: Message-ID: On Fri, 20 May 2022 09:45:19 GMT, Roberto Casta?eda Lozano wrote: > [JDK-8279622](https://bugs.openjdk.java.net/browse/JDK-8279622) introduced an assertion in the SLP analysis verifying that the examined loop does not contain inconsistent reduction information. 
The assertion is placed at the beginning of the SLP analysis, and can fail even for loops that are not vectorizable by SLP (and hence without risk of being miscompiled). This is the case for the [reported issue](https://bugs.openjdk.java.net/browse/JDK-8286177) and for many other recent failures triggered by JavaFuzzer-generated test cases. This changeset postpones the assertion to the SLP output phase to ensure that it only fails when the program really risks being miscompiled. > > Given that many transformations can invalidate reduction information (see for example the JBS reports for [JDK-8261147](https://bugs.openjdk.java.net/browse/JDK-8261147), [JDK-8279622](https://bugs.openjdk.java.net/browse/JDK-8279622), and [JDK-8286177](https://bugs.openjdk.java.net/browse/JDK-8286177)), a more fundamental fix would be to remove the possibility of inconsistent reduction information by construction, by running the reduction analysis on-demand. I have filed a [RFE](https://bugs.openjdk.java.net/browse/JDK-8287087) to explore this idea after JDK 19. > > #### Testing > > - hs-tier1-5 (windows-x64, linux-x64, linux-aarch64, and macosx-x64; debug mode). > - Tested that the assertion, in its new placement, would still have caught the bug reported in [JDK-8279622](https://bugs.openjdk.java.net/browse/JDK-8279622) in the original scenario. > - Tested that the assertion, in its new placement, does not fail for 27 JavaFuzzer-generated test cases that trigger the assertion in its original placement. I agree with suggested change. ------------- Marked as reviewed by kvn (Reviewer). 
PR: https://git.openjdk.java.net/jdk/pull/8805 From kvn at openjdk.java.net Fri May 20 16:19:55 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Fri, 20 May 2022 16:19:55 GMT Subject: RFR: 8286967: Unproblemlist compiler/c2/irTests/TestSkeletonPredicates.java and add additional test for JDK-8286638 In-Reply-To: References: Message-ID: On Fri, 20 May 2022 09:47:57 GMT, Christian Hagedorn wrote: > [JDK-8286361](https://bugs.openjdk.java.net/browse/JDK-8286361) could be traced back to the same underlying problem as in [JDK-8286638](https://bugs.openjdk.java.net/browse/JDK-8286638). Pulling in the change fixed the problem. > > This patch unproblemlists the previously failing test and adds a new test for JDK-8286638 (extracted from compiler/c2/irTests/TestSkeletonPredicates.java) that I've used for analyzing JDK-8286361. > > Testing with latest JDK: > - hs-tier1-4 flags for the new test > - hs-tier7+8 flags for compiler/c2/irTests/TestSkeletonPredicates.java > > Thanks, > Christian Okay. ------------- Marked as reviewed by kvn (Reviewer). PR: https://git.openjdk.java.net/jdk/pull/8806 From duke at openjdk.java.net Fri May 20 16:20:24 2022 From: duke at openjdk.java.net (Evgeny Astigeevich) Date: Fri, 20 May 2022 16:20:24 GMT Subject: RFR: 8280481: Duplicated static stubs in NMethod Stub Code section [v7] In-Reply-To: References: Message-ID: > Calls of Java methods have stubs to the interpreter for the cases when an invoked Java method is not compiled. Calls of static Java methods and final Java methods have statically bound information about a callee during compilation. Such calls can share stubs to the interpreter. > > Each stub to the interpreter has a relocation record (accessed via `relocInfo`) which provides an address of the stub and an address of its owner. `relocInfo` has an offset which is an offset from the previously known relocatable address. The address of a stub is calculated as the address provided by the previous `relocInfo` plus the offset. 
> > Each Java call has: > - A relocation for a call site. > - A relocation for a stub to the interpreter. > - A stub to the interpreter. > - If far jumps are used (arm64 case): > - A trampoline relocation. > - A trampoline. > > We cannot avoid creating relocations. They are needed to support patching call sites and stubs. > > One approach to create shared stubs is to keep track of created stubs. If the needed stub exists, we use its address and create only the needed relocation information. The `relocInfo` for a created stub will have a positive offset. As relocations for different stubs can be created after that, a relocation for a shared stub will have a negative offset relative to the address provided by the previous relocation: > > reloc1 ---> 0x0: stub1 > reloc2 ---> 0x4: stub2 (reloc2.addr = reloc1.addr + reloc2.offset = 0x0 + 4) > reloc3 ---> 0x0: stub1 (reloc3.addr = reloc2.addr + reloc3.offset = 0x4 - 4) > > According to [relocInfo.hpp](https://github.com/openjdk/jdk/blob/master/src/hotspot/share/code/relocInfo.hpp#L237): > > // [About Offsets] Relative offsets are supplied to this module as > // positive byte offsets, but they may be internally stored scaled > // and/or negated, depending on what is most compact for the target > // system. Since the object pointed to by the offset typically > // precedes the relocation address, it is profitable to store > // these negative offsets as positive numbers, but this decision > // is internal to the relocation information abstractions. > > However, `CodeSection` does not support negative offsets. It [assumes](https://github.com/openjdk/jdk/blob/master/src/hotspot/share/asm/codeBuffer.hpp#L195) that the addresses relocations point at grow upward: > > class CodeSection { > ... > private: > ... > address _locs_point; // last relocated position (grows upward) > ... 
> void set_locs_point(address pc) { > assert(pc >= locs_point(), "relocation addr may not decrease"); > assert(allocates2(pc), "relocation addr must be in this section"); > _locs_point = pc; > } > > Negative offsets reduce the offset range by half. This can increase the number of filler records, the empty `relocInfo` records used to reduce offset values. Also, negative offsets are only needed for `static_stub_type`; the other 13 types don't need them. > > This PR implements another approach: postponed creation of stubs. First we collect requests for creating shared stubs. Then we have the finalisation phase, where shared stubs are created in `CodeBuffer`. This approach does not need negative offsets. Supported platforms are x86, x86_64 and aarch64. > > There is a new diagnostic option: `UseSharedStubs`. Its default value for x86, x86_64 and aarch64 is true. > > **Results from [Renaissance 0.14.0](https://github.com/renaissance-benchmarks/renaissance/releases/tag/v0.14.0)** > Note: 'Nmethods with shared stubs' is the total number of nmethods counted during the benchmark's run. 'Final # of nmethods' is the number of nmethods in CodeCache when the JVM exited. 
> - AArch64 > > +------------------+-------------+----------------------------+---------------------+ > | Benchmark | Saved bytes | Nmethods with shared stubs | Final # of nmethods | > +------------------+-------------+----------------------------+---------------------+ > | dotty | 820544 | 4592 | 18872 | > | dec-tree | 405280 | 2580 | 22335 | > | naive-bayes | 392384 | 2586 | 21184 | > | log-regression | 362208 | 2450 | 20325 | > | als | 306048 | 2226 | 18161 | > | finagle-chirper | 262304 | 2087 | 12675 | > | movie-lens | 250112 | 1937 | 13617 | > | gauss-mix | 173792 | 1262 | 10304 | > | finagle-http | 164320 | 1392 | 11269 | > | page-rank | 155424 | 1175 | 10330 | > | chi-square | 140384 | 1028 | 9480 | > | akka-uct | 115136 | 541 | 3941 | > | reactors | 43264 | 335 | 2503 | > | scala-stm-bench7 | 42656 | 326 | 3310 | > | philosophers | 36576 | 256 | 2902 | > | scala-doku | 35008 | 231 | 2695 | > | rx-scrabble | 32416 | 273 | 2789 | > | future-genetic | 29408 | 260 | 2339 | > | scrabble | 27968 | 225 | 2477 | > | par-mnemonics | 19584 | 168 | 1689 | > | fj-kmeans | 19296 | 156 | 1647 | > | scala-kmeans | 18080 | 140 | 1629 | > | mnemonics | 17408 | 143 | 1512 | > +------------------+-------------+----------------------------+---------------------+ > > - X86_64 > > +------------------+-------------+----------------------------+---------------------+ > | Benchmark | Saved bytes | Nmethods with shared stubs | Final # of nmethods | > +------------------+-------------+----------------------------+---------------------+ > | dotty | 337065 | 4403 | 19135 | > | dec-tree | 183045 | 2559 | 22071 | > | naive-bayes | 176460 | 2450 | 19782 | > | log-regression | 162555 | 2410 | 20648 | > | als | 121275 | 1980 | 17179 | > | movie-lens | 111915 | 1842 | 13020 | > | finagle-chirper | 106350 | 1947 | 12726 | > | gauss-mix | 81975 | 1251 | 10474 | > | finagle-http | 80895 | 1523 | 12294 | > | page-rank | 68940 | 1146 | 10124 | > | chi-square | 62130 | 974 | 9315 | > | akka-uct | 
50220 | 555 | 4263 | > | reactors | 23385 | 371 | 2544 | > | philosophers | 17625 | 259 | 2865 | > | scala-stm-bench7 | 17235 | 295 | 3230 | > | scala-doku | 15600 | 214 | 2698 | > | rx-scrabble | 14190 | 262 | 2770 | > | future-genetic | 13155 | 253 | 2318 | > | scrabble | 12300 | 217 | 2352 | > | fj-kmeans | 8985 | 157 | 1616 | > | par-mnemonics | 8535 | 155 | 1684 | > | scala-kmeans | 8250 | 138 | 1624 | > | mnemonics | 7485 | 134 | 1522 | > +------------------+-------------+----------------------------+---------------------+ > > > **Testing: fastdebug and release builds for x86, x86_64 and aarch64** > - `tier1`...`tier4`: Passed > - `hotspot/jtreg/compiler/sharedstubs`: Passed Evgeny Astigeevich has updated the pull request incrementally with 522 additional commits since the last revision: - Remove non-existing option - Use call offset instead of caller pc - Add UseSharedStubs option - Add a test and implementation fixes - 8280481: Duplicated static stubs in NMethod Stub Code section - 8286858: Remove dead code in sun.reflect.misc.MethodUtil Reviewed-by: mchung, iris - 8285962: NimbusDefaults has a typo in a L&F property Reviewed-by: prr - 8287013: StringConcatFactory: remove short and byte mixers/prependers Reviewed-by: jlaskey - 8286893: G1: Recent card set coarsening statistics wrong Reviewed-by: tschatzl, ayang - 8286943: G1: With virtualized remembered sets, maximum number of cards configured is wrong Reviewed-by: ayang, iwalulya - ... 
and 512 more: https://git.openjdk.java.net/jdk/compare/718f3b05...3db0f157 ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/8024/files - new: https://git.openjdk.java.net/jdk/pull/8024/files/718f3b05..3db0f157 Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=8024&range=06 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=8024&range=05-06 Stats: 260933 lines in 4174 files changed: 191483 ins; 46818 del; 22632 mod Patch: https://git.openjdk.java.net/jdk/pull/8024.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8024/head:pull/8024 PR: https://git.openjdk.java.net/jdk/pull/8024 From duke at openjdk.java.net Fri May 20 16:32:02 2022 From: duke at openjdk.java.net (Evgeny Astigeevich) Date: Fri, 20 May 2022 16:32:02 GMT Subject: RFR: 8280481: Duplicated static stubs in NMethod Stub Code section [v7] In-Reply-To: References: Message-ID: On Fri, 20 May 2022 16:20:24 GMT, Evgeny Astigeevich wrote: >> Calls of Java methods have stubs to the interpreter for the cases when an invoked Java method is not compiled. Calls of static Java methods and final Java methods have statically bound information about a callee during compilation. Such calls can share stubs to the interpreter. >> >> Each stub to the interpreter has a relocation record (accessed via `relocInfo`) which provides an address of the stub and an address of its owner. `relocInfo` has an offset which is an offset from the previously known relocatable address. The address of a stub is calculated as the address provided by the previous `relocInfo` plus the offset. >> >> Each Java call has: >> - A relocation for a call site. >> - A relocation for a stub to the interpreter. >> - A stub to the interpreter. >> - If far jumps are used (arm64 case): >> - A trampoline relocation. >> - A trampoline. >> >> We cannot avoid creating relocations. They are needed to support patching call sites and stubs. >> >> One approach to create shared stubs is to keep track of created stubs. 
If the needed stub exists, we use its address and create only the needed relocation information. The `relocInfo` for a created stub will have a positive offset. As relocations for different stubs can be created after that, a relocation for a shared stub will have a negative offset relative to the address provided by the previous relocation: >> >> reloc1 ---> 0x0: stub1 >> reloc2 ---> 0x4: stub2 (reloc2.addr = reloc1.addr + reloc2.offset = 0x0 + 4) >> reloc3 ---> 0x0: stub1 (reloc3.addr = reloc2.addr + reloc3.offset = 0x4 - 4) >> >> According to [relocInfo.hpp](https://github.com/openjdk/jdk/blob/master/src/hotspot/share/code/relocInfo.hpp#L237): >> >> // [About Offsets] Relative offsets are supplied to this module as >> // positive byte offsets, but they may be internally stored scaled >> // and/or negated, depending on what is most compact for the target >> // system. Since the object pointed to by the offset typically >> // precedes the relocation address, it is profitable to store >> // these negative offsets as positive numbers, but this decision >> // is internal to the relocation information abstractions. >> >> However, `CodeSection` does not support negative offsets. It [assumes](https://github.com/openjdk/jdk/blob/master/src/hotspot/share/asm/codeBuffer.hpp#L195) that the addresses relocations point at grow upward: >> >> class CodeSection { >> ... >> private: >> ... >> address _locs_point; // last relocated position (grows upward) >> ... >> void set_locs_point(address pc) { >> assert(pc >= locs_point(), "relocation addr may not decrease"); >> assert(allocates2(pc), "relocation addr must be in this section"); >> _locs_point = pc; >> } >> >> Negative offsets reduce the offset range by half. This can increase the number of filler records, the empty `relocInfo` records used to reduce offset values. Also, negative offsets are only needed for `static_stub_type`; the other 13 types don't need them. >> >> This PR implements another approach: postponed creation of stubs. 
First we collect requests for creating shared stubs. Then we have the finalisation phase, where shared stubs are created in `CodeBuffer`. This approach does not need negative offsets. Supported platforms are x86, x86_64 and aarch64. >> >> There is a new diagnostic option: `UseSharedStubs`. Its default value for x86, x86_64 and aarch64 is set true. >> >> **Results from [Renaissance 0.14.0](https://github.com/renaissance-benchmarks/renaissance/releases/tag/v0.14.0)** >> Note: 'Nmethods with shared stubs' is the total number of nmethods counted during benchmark's run. 'Final # of nmethods' is a number of nmethods in CodeCache when JVM exited. >> - AArch64 >> >> +------------------+-------------+----------------------------+---------------------+ >> | Benchmark | Saved bytes | Nmethods with shared stubs | Final # of nmethods | >> +------------------+-------------+----------------------------+---------------------+ >> | dotty | 820544 | 4592 | 18872 | >> | dec-tree | 405280 | 2580 | 22335 | >> | naive-bayes | 392384 | 2586 | 21184 | >> | log-regression | 362208 | 2450 | 20325 | >> | als | 306048 | 2226 | 18161 | >> | finagle-chirper | 262304 | 2087 | 12675 | >> | movie-lens | 250112 | 1937 | 13617 | >> | gauss-mix | 173792 | 1262 | 10304 | >> | finagle-http | 164320 | 1392 | 11269 | >> | page-rank | 155424 | 1175 | 10330 | >> | chi-square | 140384 | 1028 | 9480 | >> | akka-uct | 115136 | 541 | 3941 | >> | reactors | 43264 | 335 | 2503 | >> | scala-stm-bench7 | 42656 | 326 | 3310 | >> | philosophers | 36576 | 256 | 2902 | >> | scala-doku | 35008 | 231 | 2695 | >> | rx-scrabble | 32416 | 273 | 2789 | >> | future-genetic | 29408 | 260 | 2339 | >> | scrabble | 27968 | 225 | 2477 | >> | par-mnemonics | 19584 | 168 | 1689 | >> | fj-kmeans | 19296 | 156 | 1647 | >> | scala-kmeans | 18080 | 140 | 1629 | >> | mnemonics | 17408 | 143 | 1512 | >> +------------------+-------------+----------------------------+---------------------+ >> >> - X86_64 >> >> 
+------------------+-------------+----------------------------+---------------------+ >> | Benchmark | Saved bytes | Nmethods with shared stubs | Final # of nmethods | >> +------------------+-------------+----------------------------+---------------------+ >> | dotty | 337065 | 4403 | 19135 | >> | dec-tree | 183045 | 2559 | 22071 | >> | naive-bayes | 176460 | 2450 | 19782 | >> | log-regression | 162555 | 2410 | 20648 | >> | als | 121275 | 1980 | 17179 | >> | movie-lens | 111915 | 1842 | 13020 | >> | finagle-chirper | 106350 | 1947 | 12726 | >> | gauss-mix | 81975 | 1251 | 10474 | >> | finagle-http | 80895 | 1523 | 12294 | >> | page-rank | 68940 | 1146 | 10124 | >> | chi-square | 62130 | 974 | 9315 | >> | akka-uct | 50220 | 555 | 4263 | >> | reactors | 23385 | 371 | 2544 | >> | philosophers | 17625 | 259 | 2865 | >> | scala-stm-bench7 | 17235 | 295 | 3230 | >> | scala-doku | 15600 | 214 | 2698 | >> | rx-scrabble | 14190 | 262 | 2770 | >> | future-genetic | 13155 | 253 | 2318 | >> | scrabble | 12300 | 217 | 2352 | >> | fj-kmeans | 8985 | 157 | 1616 | >> | par-mnemonics | 8535 | 155 | 1684 | >> | scala-kmeans | 8250 | 138 | 1624 | >> | mnemonics | 7485 | 134 | 1522 | >> +------------------+-------------+----------------------------+---------------------+ >> >> >> **Testing: fastdebug and release builds for x86, x86_64 and aarch64** >> - `tier1`...`tier4`: Passed >> - `hotspot/jtreg/compiler/sharedstubs`: Passed > > Evgeny Astigeevich has updated the pull request incrementally with 522 additional commits since the last revision: > > - Remove non-existing option > - Use call offset instead of caller pc > - Add UseSharedStubs option > - Add a test and implementation fixes > - 8280481: Duplicated static stubs in NMethod Stub Code section > - 8286858: Remove dead code in sun.reflect.misc.MethodUtil > > Reviewed-by: mchung, iris > - 8285962: NimbusDefaults has a typo in a L&F property > > Reviewed-by: prr > - 8287013: StringConcatFactory: remove short and byte 
mixers/prependers > > Reviewed-by: jlaskey > - 8286893: G1: Recent card set coarsening statistics wrong > > Reviewed-by: tschatzl, ayang > - 8286943: G1: With virtualized remembered sets, maximum number of cards configured is wrong > > Reviewed-by: ayang, iwalulya > - ... and 512 more: https://git.openjdk.java.net/jdk/compare/718f3b05...3db0f157 Create a new PR after synchronizing with tip: https://github.com/openjdk/jdk/pull/8816 ------------- PR: https://git.openjdk.java.net/jdk/pull/8024 From duke at openjdk.java.net Fri May 20 16:32:04 2022 From: duke at openjdk.java.net (Evgeny Astigeevich) Date: Fri, 20 May 2022 16:32:04 GMT Subject: Withdrawn: 8280481: Duplicated static stubs in NMethod Stub Code section In-Reply-To: References: Message-ID: <3KMqX6sUCG4MrxHzHsnqBnB1k4LVyRjPJ45Jl_bz40c=.b365c8aa-ad0b-4c5a-af89-b11d53af1a4b@github.com> On Tue, 29 Mar 2022 22:09:34 GMT, Evgeny Astigeevich wrote: > Calls of Java methods have stubs to the interpreter for the cases when an invoked Java method is not compiled. Calls of static Java methods and final Java methods have statically bound information about a callee during compilation. Such calls can share stubs to the interpreter. > > Each stub to the interpreter has a relocation record (accessed via `relocInfo`) which provides an address of the stub and an address of its owner. `relocInfo` has an offset which is an offset from the previously known relocatable address. The address of a stub is calculated as the address provided by the previous `relocInfo` plus the offset. > > Each Java call has: > - A relocation for a call site. > - A relocation for a stub to the interpreter. > - A stub to the interpreter. > - If far jumps are used (arm64 case): > - A trampoline relocation. > - A trampoline. > > We cannot avoid creating relocations. They are needed to support patching call sites and stubs. > > One approach to create shared stubs is to keep track of created stubs. 
If the needed stub exists, we use its address and create only the needed relocation information. The `relocInfo` for a created stub will have a positive offset. As relocations for different stubs can be created after that, a relocation for a shared stub will have a negative offset relative to the address provided by the previous relocation: > > reloc1 ---> 0x0: stub1 > reloc2 ---> 0x4: stub2 (reloc2.addr = reloc1.addr + reloc2.offset = 0x0 + 4) > reloc3 ---> 0x0: stub1 (reloc3.addr = reloc2.addr + reloc3.offset = 0x4 - 4) > > According to [relocInfo.hpp](https://github.com/openjdk/jdk/blob/master/src/hotspot/share/code/relocInfo.hpp#L237): > > // [About Offsets] Relative offsets are supplied to this module as > // positive byte offsets, but they may be internally stored scaled > // and/or negated, depending on what is most compact for the target > // system. Since the object pointed to by the offset typically > // precedes the relocation address, it is profitable to store > // these negative offsets as positive numbers, but this decision > // is internal to the relocation information abstractions. > > However, `CodeSection` does not support negative offsets. It [assumes](https://github.com/openjdk/jdk/blob/master/src/hotspot/share/asm/codeBuffer.hpp#L195) that the addresses relocations point at grow upward: > > class CodeSection { > ... > private: > ... > address _locs_point; // last relocated position (grows upward) > ... > void set_locs_point(address pc) { > assert(pc >= locs_point(), "relocation addr may not decrease"); > assert(allocates2(pc), "relocation addr must be in this section"); > _locs_point = pc; > } > > Negative offsets reduce the offset range by half. This can increase the number of filler records, the empty `relocInfo` records used to reduce offset values. Also, negative offsets are only needed for `static_stub_type`; the other 13 types don't need them. > > This PR implements another approach: postponed creation of stubs. 
First we collect requests for creating shared stubs. Then we have the finalisation phase, where shared stubs are created in `CodeBuffer`. This approach does not need negative offsets. Supported platforms are x86, x86_64 and aarch64. > > There is a new diagnostic option: `UseSharedStubs`. Its default value for x86, x86_64 and aarch64 is set true. > > **Results from [Renaissance 0.14.0](https://github.com/renaissance-benchmarks/renaissance/releases/tag/v0.14.0)** > Note: 'Nmethods with shared stubs' is the total number of nmethods counted during benchmark's run. 'Final # of nmethods' is a number of nmethods in CodeCache when JVM exited. > - AArch64 > > +------------------+-------------+----------------------------+---------------------+ > | Benchmark | Saved bytes | Nmethods with shared stubs | Final # of nmethods | > +------------------+-------------+----------------------------+---------------------+ > | dotty | 820544 | 4592 | 18872 | > | dec-tree | 405280 | 2580 | 22335 | > | naive-bayes | 392384 | 2586 | 21184 | > | log-regression | 362208 | 2450 | 20325 | > | als | 306048 | 2226 | 18161 | > | finagle-chirper | 262304 | 2087 | 12675 | > | movie-lens | 250112 | 1937 | 13617 | > | gauss-mix | 173792 | 1262 | 10304 | > | finagle-http | 164320 | 1392 | 11269 | > | page-rank | 155424 | 1175 | 10330 | > | chi-square | 140384 | 1028 | 9480 | > | akka-uct | 115136 | 541 | 3941 | > | reactors | 43264 | 335 | 2503 | > | scala-stm-bench7 | 42656 | 326 | 3310 | > | philosophers | 36576 | 256 | 2902 | > | scala-doku | 35008 | 231 | 2695 | > | rx-scrabble | 32416 | 273 | 2789 | > | future-genetic | 29408 | 260 | 2339 | > | scrabble | 27968 | 225 | 2477 | > | par-mnemonics | 19584 | 168 | 1689 | > | fj-kmeans | 19296 | 156 | 1647 | > | scala-kmeans | 18080 | 140 | 1629 | > | mnemonics | 17408 | 143 | 1512 | > +------------------+-------------+----------------------------+---------------------+ > > - X86_64 > > 
+------------------+-------------+----------------------------+---------------------+ > | Benchmark | Saved bytes | Nmethods with shared stubs | Final # of nmethods | > +------------------+-------------+----------------------------+---------------------+ > | dotty | 337065 | 4403 | 19135 | > | dec-tree | 183045 | 2559 | 22071 | > | naive-bayes | 176460 | 2450 | 19782 | > | log-regression | 162555 | 2410 | 20648 | > | als | 121275 | 1980 | 17179 | > | movie-lens | 111915 | 1842 | 13020 | > | finagle-chirper | 106350 | 1947 | 12726 | > | gauss-mix | 81975 | 1251 | 10474 | > | finagle-http | 80895 | 1523 | 12294 | > | page-rank | 68940 | 1146 | 10124 | > | chi-square | 62130 | 974 | 9315 | > | akka-uct | 50220 | 555 | 4263 | > | reactors | 23385 | 371 | 2544 | > | philosophers | 17625 | 259 | 2865 | > | scala-stm-bench7 | 17235 | 295 | 3230 | > | scala-doku | 15600 | 214 | 2698 | > | rx-scrabble | 14190 | 262 | 2770 | > | future-genetic | 13155 | 253 | 2318 | > | scrabble | 12300 | 217 | 2352 | > | fj-kmeans | 8985 | 157 | 1616 | > | par-mnemonics | 8535 | 155 | 1684 | > | scala-kmeans | 8250 | 138 | 1624 | > | mnemonics | 7485 | 134 | 1522 | > +------------------+-------------+----------------------------+---------------------+ > > > **Testing: fastdebug and release builds for x86, x86_64 and aarch64** > - `tier1`...`tier4`: Passed > - `hotspot/jtreg/compiler/sharedstubs`: Passed This pull request has been closed without being integrated. ------------- PR: https://git.openjdk.java.net/jdk/pull/8024 From duke at openjdk.java.net Fri May 20 17:11:55 2022 From: duke at openjdk.java.net (aamarsh) Date: Fri, 20 May 2022 17:11:55 GMT Subject: RFR: 8282024: add EscapeAnalysis statistics under PrintOptoStatistics [v8] In-Reply-To: References: Message-ID: On Mon, 9 May 2022 23:48:58 GMT, Vladimir Kozlov wrote: >> aamarsh has refreshed the contents of this pull request, and previous commits have been removed. 
The incremental views will show differences compared to the previous content of the PR. The pull request contains one new commit since the last revision: >> >> adding escape analysis and scalar replacement statistics > > src/hotspot/share/opto/compile.cpp line 2217: > >> 2215: Atomic::add(&ConnectionGraph::_arg_escape_counter, _local_arg_escape_ctr); >> 2216: Atomic::add(&ConnectionGraph::_global_escape_counter, _local_global_escape_ctr); >> 2217: #endif > > These should be done inside `ConnectionGraph` - don't expose EA counters to another class. You can use a static method in `ConnectionGraph` to do that. > > `_no_escape_counter, _local_no_escape_ctr + total_scalar_replaced` is wrong. You are doubling the number because `total_scalar_replaced` is part of `_local_no_escape_ctr`. Keep these numbers separate. Also `mexp._local_scalar_replaced` could be updated later during the `PhaseMacroExpand::expand_macro_nodes()` call after loop optimizations. > > Such a collection is also not accurate (it over-counts) due to EA iterations - each iteration may add the same numbers. That could be fine if you say so in comments so people know. @vnkozlov Thank you for the feedback! I removed the complicated iterative feedback fix and simply left a comment in the output. My example output above is updated. I added the PrintOptoStatistics flag where it felt necessary. I also pushed my commits incrementally as opposed to squashing them and force pushing as @navyxliu suggested. 
------------- PR: https://git.openjdk.java.net/jdk/pull/8019 From sviswanathan at openjdk.java.net Fri May 20 18:28:02 2022 From: sviswanathan at openjdk.java.net (Sandhya Viswanathan) Date: Fri, 20 May 2022 18:28:02 GMT Subject: RFR: 8286972: Support the new loop induction variable related PopulateIndex IR node on x86 [v5] In-Reply-To: References: Message-ID: On Fri, 20 May 2022 05:09:41 GMT, Sandhya Viswanathan wrote: >> This PR adds x86 backend support for the new loop induction variable related PopulateIndex IR node. 
>> This IR node was added as part of [JDK-8280510](https://bugs.openjdk.java.net/browse/JDK-8280510). >> >> The performance numbers are as follows: >> Before: >> Benchmark (count) Mode Cnt Score Error Units >> IndexVector.exprWithIndex1 65536 thrpt 3 64556.552 ± 1126.396 ops/s >> IndexVector.exprWithIndex2 65536 thrpt 3 22117.050 ± 11452.098 ops/s >> IndexVector.indexArrayFill 65536 thrpt 3 117776.383 ± 1120.957 ops/s >> >> After: >> Benchmark (count) Mode Cnt Score Error Units >> IndexVector.exprWithIndex1 65536 thrpt 3 203180.290 ± 2147.807 ops/s >> IndexVector.exprWithIndex2 65536 thrpt 3 274132.756 ± 6853.393 ops/s >> IndexVector.indexArrayFill 65536 thrpt 3 374165.202 ± 46930.779 ops/s >> >> Please review. >> >> Best Regards, >> Sandhya > > Sandhya Viswanathan has updated the pull request incrementally with one additional commit since the last revision: > > review comment resolution @vnkozlov Thanks a lot, I will wait for your go ahead. ------------- PR: https://git.openjdk.java.net/jdk/pull/8778 From kvn at openjdk.java.net Fri May 20 18:57:54 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Fri, 20 May 2022 18:57:54 GMT Subject: RFR: 8282024: add EscapeAnalysis statistics under PrintOptoStatistics [v12] In-Reply-To: References: Message-ID: On Fri, 20 May 2022 07:51:40 GMT, aamarsh wrote: >> Escape Analysis and Scalar Replacement statistics were added when the -XX:+PrintOptoStatistics flag is set. All code is placed in `#ifndef Product` block, so this code is only run when creating a debug build. Using renaissance benchmark I ran a few tests to confirm that numbers were printing correctly. 
Below is an example run: >> >> >> No escape = 317, Arg escape = 118, Global escape = 1995 (EA executed in 11.71 seconds) ** EA stats might be slightly off since objects might be double counted due to iterative EA ** >> Objects scalar replaced = 201, Monitor objects removed = 23, GC barriers removed = 50, Memory barriers removed = 224 > > aamarsh has updated the pull request incrementally with one additional commit since the last revision: > > eliminate trailing whitespace Thank you for addressing my comment. I have only one left. I started testing. src/hotspot/share/opto/escape.cpp line 3773: > 3771: tty->print("No escape = %d, Arg escape = %d, Global escape = %d", Atomic::load(&_no_escape_counter), Atomic::load(&_arg_escape_counter), Atomic::load(&_global_escape_counter)); > 3772: tty->print(" (EA executed in %7.2f seconds)", Atomic::load(&_time_elapsed) * 0.001); > 3773: tty->print_cr(" ** EA stats might be slightly off since objects might be double counted due to iterative EA **"); I don't think we need to print this warning. I suggest to put it as comment in this method. ------------- PR: https://git.openjdk.java.net/jdk/pull/8019 From kvn at openjdk.java.net Fri May 20 19:13:52 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Fri, 20 May 2022 19:13:52 GMT Subject: RFR: 8282024: add EscapeAnalysis statistics under PrintOptoStatistics [v12] In-Reply-To: References: Message-ID: On Fri, 20 May 2022 07:51:40 GMT, aamarsh wrote: >> Escape Analysis and Scalar Replacement statistics were added when the -XX:+PrintOptoStatistics flag is set. All code is placed in `#ifndef Product` block, so this code is only run when creating a debug build. Using renaissance benchmark I ran a few tests to confirm that numbers were printing correctly. 
Below is an example run: >> >> >> No escape = 317, Arg escape = 118, Global escape = 1995 (EA executed in 11.71 seconds) ** EA stats might be slightly off since objects might be double counted due to iterative EA ** >> Objects scalar replaced = 201, Monitor objects removed = 23, GC barriers removed = 50, Memory barriers removed = 224 > > aamarsh has updated the pull request incrementally with one additional commit since the last revision: > > eliminate trailing whitespace About time statistic. There is flag -XX:+CITime which prints times for all phases in C2 (and C1). For EA it is `_t_escapeAnalysis`. And I see difference in reported times (0.02 vs 0.059). Your output: (EA executed in 0.02 seconds) CITime: Escape Analysis: 0.076 s Conn Graph: 0.059 s Macro Eliminate: 0.027 s I simply run fastdebug VM with -Xcomp: java -XX:+PrintOptoStatistics -XX:+CITime -Xcomp t ------------- PR: https://git.openjdk.java.net/jdk/pull/8019 From kvn at openjdk.java.net Fri May 20 19:24:54 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Fri, 20 May 2022 19:24:54 GMT Subject: RFR: 8282024: add EscapeAnalysis statistics under PrintOptoStatistics [v12] In-Reply-To: References: Message-ID: On Fri, 20 May 2022 07:51:40 GMT, aamarsh wrote: >> Escape Analysis and Scalar Replacement statistics were added when the -XX:+PrintOptoStatistics flag is set. All code is placed in `#ifndef Product` block, so this code is only run when creating a debug build. Using renaissance benchmark I ran a few tests to confirm that numbers were printing correctly. 
Below is an example run: >> >> >> No escape = 317, Arg escape = 118, Global escape = 1995 (EA executed in 11.71 seconds) ** EA stats might be slightly off since objects might be double counted due to iterative EA ** >> Objects scalar replaced = 201, Monitor objects removed = 23, GC barriers removed = 50, Memory barriers removed = 224 > > aamarsh has updated the pull request incrementally with one additional commit since the last revision: > > eliminate trailing whitespace It is up to you but I would suggest to remove your change to collect time since we have it already with CITime. ------------- PR: https://git.openjdk.java.net/jdk/pull/8019 From duke at openjdk.java.net Fri May 20 19:47:39 2022 From: duke at openjdk.java.net (aamarsh) Date: Fri, 20 May 2022 19:47:39 GMT Subject: RFR: 8282024: add EscapeAnalysis statistics under PrintOptoStatistics [v13] In-Reply-To: References: Message-ID: > Escape Analysis and Scalar Replacement statistics were added when the -XX:+PrintOptoStatistics flag is set. All code is placed in `#ifndef Product` block, so this code is only run when creating a debug build. Using renaissance benchmark I ran a few tests to confirm that numbers were printing correctly. 
Below is an example run: > > > No escape = 337, Arg escape = 125, Global escape = 2025 > Objects scalar replaced = 217, Monitor objects removed = 29, GC barriers removed = 33, Memory barriers removed = 246 aamarsh has updated the pull request incrementally with one additional commit since the last revision: nit picks & removing time ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/8019/files - new: https://git.openjdk.java.net/jdk/pull/8019/files/337ab461..56d6317d Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=8019&range=12 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=8019&range=11-12 Stats: 34 lines in 4 files changed: 0 ins; 17 del; 17 mod Patch: https://git.openjdk.java.net/jdk/pull/8019.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8019/head:pull/8019 PR: https://git.openjdk.java.net/jdk/pull/8019 From duke at openjdk.java.net Fri May 20 19:51:37 2022 From: duke at openjdk.java.net (aamarsh) Date: Fri, 20 May 2022 19:51:37 GMT Subject: RFR: 8282024: add EscapeAnalysis statistics under PrintOptoStatistics [v14] In-Reply-To: References: Message-ID: <7ffUzFtOeZBXuqaImQG0u9CLQTGp2u4sVxPtFtxBxck=.42e904a4-ba7b-46df-9da1-a4bddd8a1cab@github.com> > Escape Analysis and Scalar Replacement statistics were added when the -XX:+PrintOptoStatistics flag is set. All code is placed in `#ifndef Product` block, so this code is only run when creating a debug build. Using renaissance benchmark I ran a few tests to confirm that numbers were printing correctly. 
Below is an example run: > > > No escape = 337, Arg escape = 125, Global escape = 2025 > Objects scalar replaced = 217, Monitor objects removed = 29, GC barriers removed = 33, Memory barriers removed = 246 aamarsh has updated the pull request incrementally with one additional commit since the last revision: make count_MemBar static ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/8019/files - new: https://git.openjdk.java.net/jdk/pull/8019/files/56d6317d..8c394555 Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=8019&range=13 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=8019&range=12-13 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.java.net/jdk/pull/8019.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8019/head:pull/8019 PR: https://git.openjdk.java.net/jdk/pull/8019 From dlong at openjdk.java.net Fri May 20 19:59:49 2022 From: dlong at openjdk.java.net (Dean Long) Date: Fri, 20 May 2022 19:59:49 GMT Subject: RFR: 8287052: comparing double to max_intx gives unexpected results In-Reply-To: References: Message-ID: On Fri, 20 May 2022 01:03:02 GMT, Dean Long wrote: > It isn't safe to compare a double to 2^63-1 because the latter will be changed to 2^63 when converting to a double. > This fixes a compiler warning with clang-12 and prevents the code from returning a negative value in some cases (observed on x64 with g++). The java.1 man page says: "The CompileThresholdScaling option has a floating point value between 0 and +Inf", which is not quite correct. Maybe we should change this to be less specific, or remove this part completely. Also, I'm wonder if these kinds of changes will require a CSR. 
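The rounding described in the quoted fix can be reproduced in a few lines. This is a minimal Java sketch, not the C++ code changed in the PR; it assumes `max_intx` is 2^63-1 (a 64-bit VM), and `DoubleLimitDemo`, `naiveFits` and `fits` are hypothetical names used only for illustration. `Long.MAX_VALUE` has no exact `double` representation, so widening it for the comparison yields 2^63, and a value of exactly 2^63 passes a `<=` check even though it does not fit in a 64-bit signed integer.

```java
final class DoubleLimitDemo {
    // 2^63 as a double. Long.MAX_VALUE (2^63 - 1) is not representable and rounds up to this value.
    static final double TWO_POW_63 = 0x1.0p63;

    // Naive check: the long operand is widened to double first, so the limit silently becomes 2^63.
    static boolean naiveFits(double v) {
        return v <= Long.MAX_VALUE;
    }

    // Safer check: compare against 2^63 with a strict inequality (also false for NaN).
    static boolean fits(double v) {
        return v < TWO_POW_63;
    }

    public static void main(String[] args) {
        System.out.println((double) Long.MAX_VALUE == TWO_POW_63); // the two limits are the same double
        System.out.println(naiveFits(TWO_POW_63));                 // accepted by the naive check
        System.out.println(fits(TWO_POW_63));                      // rejected by the strict check
    }
}
```

In Java the out-of-range narrowing conversion saturates to `Long.MAX_VALUE`; in C++ it is undefined behavior, which is consistent with the negative result reported for x64 above.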
------------- PR: https://git.openjdk.java.net/jdk/pull/8798 From kvn at openjdk.java.net Fri May 20 20:15:54 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Fri, 20 May 2022 20:15:54 GMT Subject: RFR: 8287052: comparing double to max_intx gives unexpected results In-Reply-To: References: Message-ID: On Fri, 20 May 2022 01:03:02 GMT, Dean Long wrote: > It isn't safe to compare a double to 2^63-1 because the latter will be changed to 2^63 when converting to a double. > This fixes a compiler warning with clang-12 and prevents the code from returning a negative value in some cases (observed on x64 with g++). I approve the current fix because changing the value limits of a product flag requires a CSR. ------------- Marked as reviewed by kvn (Reviewer). PR: https://git.openjdk.java.net/jdk/pull/8798 From kvn at openjdk.java.net Fri May 20 20:15:56 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Fri, 20 May 2022 20:15:56 GMT Subject: RFR: 8287052: comparing double to max_intx gives unexpected results In-Reply-To: References: Message-ID: On Fri, 20 May 2022 19:56:50 GMT, Dean Long wrote: > The java.1 man page says: "The CompileThresholdScaling option has a floating point value between 0 and +Inf", which is not quite correct. Maybe we should change this to be less specific, or remove this part completely. Also, I'm wonder if these kinds of changes will require a CSR. Yes, it requires a CSR: it is a documented product flag. Let's do it in the next release and for now push what you currently have. I approve it.
------------- PR: https://git.openjdk.java.net/jdk/pull/8798 From dlong at openjdk.java.net Fri May 20 20:24:56 2022 From: dlong at openjdk.java.net (Dean Long) Date: Fri, 20 May 2022 20:24:56 GMT Subject: RFR: 8287052: comparing double to max_intx gives unexpected results In-Reply-To: References: Message-ID: On Fri, 20 May 2022 02:49:15 GMT, Vladimir Kozlov wrote: > I think it is **insane** to allow DBL_MAX as upper limit for CompileThresholdScaling: > > ``` > range(0.0, DBL_MAX) > ``` > > It should be `(double)max_intx` since we can't have compilation threshold (which is integer) more than that. > > If limited to that, your checks for NAN and INF double values will not be needed because you will only can get max_intx^2 value. You can convert them to assert to catch case if `scale` is something different from `CompileThresholdScaling`. But I don't see such path in code. It looks like we can still get large (or infinite?) values because we don't seem to do a range check when using -XX:CompileCommand=CompileThresholdScaling,.... ------------- PR: https://git.openjdk.java.net/jdk/pull/8798 From dlong at openjdk.java.net Fri May 20 20:26:55 2022 From: dlong at openjdk.java.net (Dean Long) Date: Fri, 20 May 2022 20:26:55 GMT Subject: Integrated: 8287052: comparing double to max_intx gives unexpected results In-Reply-To: References: Message-ID: On Fri, 20 May 2022 01:03:02 GMT, Dean Long wrote: > It isn't safe to compare a double to 2^63-1 because the latter will be changed to 2^63 when converting to a double. > This fixes a compiler warning with clang-12 and prevents the code from returning a negative value in some cases (observed on x64 with g++). This pull request has now been integrated. 
Changeset: ba23f140
Author: Dean Long
URL: https://git.openjdk.java.net/jdk/commit/ba23f14025f42bdb3bc831782b2f11443d1c572c
Stats: 9 lines in 1 file changed: 6 ins; 2 del; 1 mod

8287052: comparing double to max_intx gives unexpected results

Reviewed-by: kvn

-------------

PR: https://git.openjdk.java.net/jdk/pull/8798

From sviswanathan at openjdk.java.net Fri May 20 22:01:48 2022
From: sviswanathan at openjdk.java.net (Sandhya Viswanathan)
Date: Fri, 20 May 2022 22:01:48 GMT
Subject: RFR: 8285973: x86_64: Improve fp comparison and cmove for eq/ne
In-Reply-To:
References:
Message-ID:

On Wed, 18 May 2022 14:59:33 GMT, Quan Anh Mai wrote:

>> Hi,
>>
>> This patch optimises the matching rules for floating-point comparison with respects to eq/ne on x86-64
>>
>> 1, When the inputs of a comparison is the same (i.e `isNaN` patterns), `ZF` is always set, so we don't need `cmpOpUCF2` for the eq/ne cases, which improves the sequence of `If (CmpF x x) (Bool ne)` from
>>
>>     ucomiss xmm0, xmm0
>>     jp label
>>     jne label
>>
>> into
>>
>>     ucomiss xmm0, xmm0
>>     jp label
>>
>> 2, The move rules for `cmpOpUCF2` is missing, which makes patterns such as `x == y ? 1 : 0` to fall back to `cmpOpU`, which have a really high cost of fixing the flags, such as
>>
>>     xorl ecx, ecx
>>     ucomiss xmm0, xmm1
>>     jnp done
>>     pushf
>>     andq [rsp], 0xffffff2b
>>     popf
>>     done:
>>     movl eax, 1
>>     cmovel eax, ecx
>>
>> The patch changes this sequence into
>>
>>     xorl ecx, ecx
>>     ucomiss xmm0, xmm1
>>     movl eax, 1
>>     cmovpl eax, ecx
>>     cmovnel eax, ecx
>>
>> 3, The patch also changes the pattern of `isInfinite` to be more optimised by using `Math.abs` to reduce 1 comparison and compares the result with `MAX_VALUE` since `>` is more optimised than `==` for floating-point types.
>>
>> The benchmark results are as follow:
>>
>> Before:
>> Benchmark                      Mode  Cnt     Score     Error  Units
>> FPComparison.equalDouble       avgt    5  2876.242 ±  58.875  ns/op
>> FPComparison.equalFloat        avgt    5  3062.430 ±  31.371  ns/op
>> FPComparison.isFiniteDouble    avgt    5   475.749 ±  19.027  ns/op
>> FPComparison.isFiniteFloat     avgt    5   506.525 ±  14.417  ns/op
>> FPComparison.isInfiniteDouble  avgt    5  1232.800 ±  31.677  ns/op
>> FPComparison.isInfiniteFloat   avgt    5  1234.708 ±  70.239  ns/op
>> FPComparison.isNanDouble       avgt    5  2255.847 ±   7.238  ns/op
>> FPComparison.isNanFloat        avgt    5  2567.044 ±  36.078  ns/op
>>
>> After:
>> Benchmark                      Mode  Cnt     Score     Error  Units
>> FPComparison.equalDouble       avgt    5   594.636 ±   8.922  ns/op
>> FPComparison.equalFloat        avgt    5   663.849 ±   3.656  ns/op
>> FPComparison.isFiniteDouble    avgt    5   518.309 ± 107.352  ns/op
>> FPComparison.isFiniteFloat     avgt    5   515.576 ±  14.669  ns/op
>> FPComparison.isInfiniteDouble  avgt    5   621.185 ±  11.935  ns/op
>> FPComparison.isInfiniteFloat   avgt    5   623.566 ±  15.206  ns/op
>> FPComparison.isNanDouble       avgt    5   400.124 ±   0.762  ns/op
>> FPComparison.isNanFloat        avgt    5   546.486 ±   1.509  ns/op
>>
>> Thank you very much.
>
> I have reverted the changes to `java.lang.Float` and `java.lang.Double` to not interfere with the intrinsic PR. More tests are added to cover all cases regarding floating-point comparison of compiled code.
>
> The rules for fp comparison that output the result to `rFlagRegsU` are expensive and should be avoided. As a result, I removed the shortcut rules with memory or constant operands to reduce the number of match rules. Only the basic rules are kept.
>
> Thanks.

@merykitty Very nice work! The patch looks good to me.
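The source-level shapes behind items 1 and 3 of the quoted description can be sketched as follows. This is a standalone illustration, not the library code (the quoted text says the `java.lang.Float` and `java.lang.Double` changes were reverted); `FpComparePatterns` and its method names are hypothetical. It only shows that the single-comparison forms agree with the `java.lang.Float` methods on representative inputs, and says nothing about the machine code C2 emits for them.

```java
final class FpComparePatterns {
    // NaN is the only floating-point value that compares unequal to itself.
    static boolean isNaN(float v) {
        return v != v;
    }

    // Two equality comparisons, one per infinity.
    static boolean isInfiniteTwoCompares(float v) {
        return v == Float.POSITIVE_INFINITY || v == Float.NEGATIVE_INFINITY;
    }

    // Variant from item 3: clear the sign, then use one ordered comparison.
    // Math.abs(NaN) is NaN and NaN > x is false, so NaN is not reported as infinite.
    static boolean isInfiniteAbs(float v) {
        return Math.abs(v) > Float.MAX_VALUE;
    }

    public static void main(String[] args) {
        float[] samples = {0.0f, -0.0f, 1.5f, -Float.MAX_VALUE, Float.MIN_VALUE,
                           Float.POSITIVE_INFINITY, Float.NEGATIVE_INFINITY, Float.NaN};
        for (float v : samples) {
            boolean ok = isNaN(v) == Float.isNaN(v)
                    && isInfiniteAbs(v) == Float.isInfinite(v)
                    && isInfiniteTwoCompares(v) == Float.isInfinite(v);
            System.out.println(v + " -> agrees with java.lang.Float: " + ok);
        }
    }
}
```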
------------- PR: https://git.openjdk.java.net/jdk/pull/8525 From sviswanathan at openjdk.java.net Fri May 20 22:26:55 2022 From: sviswanathan at openjdk.java.net (Sandhya Viswanathan) Date: Fri, 20 May 2022 22:26:55 GMT Subject: RFR: 8285973: x86_64: Improve fp comparison and cmove for eq/ne [v2] In-Reply-To: <6hRgLEWJfB8OHOYNJReUaMac469hvDuemoURK-aMy4Y=.963b7561-33e0-4069-8d89-8b447d4e0f0f@github.com> References: <6hRgLEWJfB8OHOYNJReUaMac469hvDuemoURK-aMy4Y=.963b7561-33e0-4069-8d89-8b447d4e0f0f@github.com> Message-ID: <4YL6D_2LfQ55wX12ygOKNUCqZo_okfJr0vUcumw5RBk=.6ddd5d8f-9c58-40d6-b6d8-43a65ec57828@github.com> On Wed, 4 May 2022 23:16:41 GMT, Vladimir Kozlov wrote: >> src/hotspot/cpu/x86/x86_64.ad line 6998: >> >>> 6996: ins_encode %{ >>> 6997: __ cmovl(Assembler::parity, $dst$$Register, $src$$Register); >>> 6998: __ cmovl(Assembler::notEqual, $dst$$Register, $src$$Register); >> >> Should this be `equal`? > > I see that you swapped `src, dst` in `match()` but `format` is sill incorrect and the code is confusing. I agree with @vnkozlov that this needs explanation. Could you please add comments here with IR and example code generated for both the eq case and ne case? You have some explanation in the PR description but not in the code. The description needs to be in the code as well for maintenance. 
------------- PR: https://git.openjdk.java.net/jdk/pull/8525 From kvn at openjdk.java.net Fri May 20 22:52:47 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Fri, 20 May 2022 22:52:47 GMT Subject: RFR: 8285973: x86_64: Improve fp comparison and cmove for eq/ne [v2] In-Reply-To: <4YL6D_2LfQ55wX12ygOKNUCqZo_okfJr0vUcumw5RBk=.6ddd5d8f-9c58-40d6-b6d8-43a65ec57828@github.com> References: <6hRgLEWJfB8OHOYNJReUaMac469hvDuemoURK-aMy4Y=.963b7561-33e0-4069-8d89-8b447d4e0f0f@github.com> <4YL6D_2LfQ55wX12ygOKNUCqZo_okfJr0vUcumw5RBk=.6ddd5d8f-9c58-40d6-b6d8-43a65ec57828@github.com> Message-ID: On Fri, 20 May 2022 22:22:43 GMT, Sandhya Viswanathan wrote: >> I see that you swapped `src, dst` in `match()` but `format` is sill incorrect and the code is confusing. > > I agree with @vnkozlov that this needs explanation. Could you please add comments here with IR and example code generated for both the eq case and ne case? You have some explanation in the PR description but not in the code. The description needs to be in the code as well for maintenance. Right, I missed that. Then you can use `expand %{` to avoid duplication (keep format). ------------- PR: https://git.openjdk.java.net/jdk/pull/8525 From xliu at openjdk.java.net Fri May 20 23:37:58 2022 From: xliu at openjdk.java.net (Xin Liu) Date: Fri, 20 May 2022 23:37:58 GMT Subject: RFR: 8282024: add EscapeAnalysis statistics under PrintOptoStatistics [v14] In-Reply-To: <7ffUzFtOeZBXuqaImQG0u9CLQTGp2u4sVxPtFtxBxck=.42e904a4-ba7b-46df-9da1-a4bddd8a1cab@github.com> References: <7ffUzFtOeZBXuqaImQG0u9CLQTGp2u4sVxPtFtxBxck=.42e904a4-ba7b-46df-9da1-a4bddd8a1cab@github.com> Message-ID: <9FUQAr0xmQBG3wOpjJeBKaDQYmUA5Cd7liqZ0HhHCbo=.9ea5cc3f-0250-4698-99bf-681e285696bd@github.com> On Fri, 20 May 2022 19:51:37 GMT, aamarsh wrote: >> Escape Analysis and Scalar Replacement statistics were added when the -XX:+PrintOptoStatistics flag is set. 
All code is placed in `#ifndef Product` block, so this code is only run when creating a debug build. Using renaissance benchmark I ran a few tests to confirm that numbers were printing correctly. Below is an example run: >> >> >> No escape = 337, Arg escape = 125, Global escape = 2025 >> Objects scalar replaced = 217, Monitor objects removed = 29, GC barriers removed = 33, Memory barriers removed = 246 > > aamarsh has updated the pull request incrementally with one additional commit since the last revision: > > make count_MemBar static src/hotspot/share/opto/escape.cpp line 3756: > 3754: > 3755: void ConnectionGraph::print_statistics() { > 3756: // EA stats might be slightly off since objects might be double counted due to iterative EA I don't understand. your approach almost worked in last revision. you all need to do is to adjust for the last iteration. This revision drop it. I don't think it would be "slightly" off, even double counted is optimistic. A java object will be counted repeat if iterEA iterates N times. my concern is the final statistical counters become incomparable. Why did you drop snapshot approach in https://github.com/openjdk/jdk/pull/8019/commits/0805514aec4c3d0bd5ec935c089e315e4b37c7fa? 
------------- PR: https://git.openjdk.java.net/jdk/pull/8019 From kvn at openjdk.java.net Sat May 21 00:03:54 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Sat, 21 May 2022 00:03:54 GMT Subject: RFR: 8282024: add EscapeAnalysis statistics under PrintOptoStatistics [v14] In-Reply-To: <9FUQAr0xmQBG3wOpjJeBKaDQYmUA5Cd7liqZ0HhHCbo=.9ea5cc3f-0250-4698-99bf-681e285696bd@github.com> References: <7ffUzFtOeZBXuqaImQG0u9CLQTGp2u4sVxPtFtxBxck=.42e904a4-ba7b-46df-9da1-a4bddd8a1cab@github.com> <9FUQAr0xmQBG3wOpjJeBKaDQYmUA5Cd7liqZ0HhHCbo=.9ea5cc3f-0250-4698-99bf-681e285696bd@github.com> Message-ID: On Fri, 20 May 2022 23:34:25 GMT, Xin Liu wrote: >> aamarsh has updated the pull request incrementally with one additional commit since the last revision: >> >> make count_MemBar static > > src/hotspot/share/opto/escape.cpp line 3756: > >> 3754: >> 3755: void ConnectionGraph::print_statistics() { >> 3756: // EA stats might be slightly off since objects might be double counted due to iterative EA > > I don't understand. your approach almost worked in last revision. All you need to do is to adjust for the last iteration. > > This revision drops it. I don't think it would be "slightly" off, even double counted is optimistic. A java object will be counted repeat if iterEA iterates N times. My concern is the final statistical counters become incomparable. > > Why did you drop snapshot approach in https://github.com/openjdk/jdk/pull/8019/commits/0805514aec4c3d0bd5ec935c089e315e4b37c7fa? @navyxliu, please, explain what do you mean "snapshot approach" and suggest how we should do it. 
------------- PR: https://git.openjdk.java.net/jdk/pull/8019 From xliu at openjdk.java.net Sat May 21 05:15:45 2022 From: xliu at openjdk.java.net (Xin Liu) Date: Sat, 21 May 2022 05:15:45 GMT Subject: RFR: 8282024: add EscapeAnalysis statistics under PrintOptoStatistics [v14] In-Reply-To: References: <7ffUzFtOeZBXuqaImQG0u9CLQTGp2u4sVxPtFtxBxck=.42e904a4-ba7b-46df-9da1-a4bddd8a1cab@github.com> <9FUQAr0xmQBG3wOpjJeBKaDQYmUA5Cd7liqZ0HhHCbo=.9ea5cc3f-0250-4698-99bf-681e285696bd@github.com> Message-ID: On Fri, 20 May 2022 23:59:58 GMT, Vladimir Kozlov wrote: >> src/hotspot/share/opto/escape.cpp line 3756: >> >>> 3754: >>> 3755: void ConnectionGraph::print_statistics() { >>> 3756: // EA stats might be slightly off since objects might be double counted due to iterative EA >> >> I don't understand. your approach almost worked in last revision. All you need to do is to adjust for the last iteration. >> >> This revision drops it. I don't think it would be "slightly" off, even double counted is optimistic. A java object will be counted repeat if iterEA iterates N times. My concern is the final statistical counters become incomparable. >> >> Why did you drop snapshot approach in https://github.com/openjdk/jdk/pull/8019/commits/0805514aec4c3d0bd5ec935c089e315e4b37c7fa? > > @navyxliu, please, explain what do you mean "snapshot approach" and suggest how we should do it. In previous revision, @aamarsh used 3 data members of Compile. This tuple is a snapshot. _local_no_escape_ctr _local_arg_escape_ctr _local_global_escape_ctr ConnectionGraph::escape_state_statistics() initializes them all zeros and categorize Java objects of the connection graph. here is the framework. 

    do {
      EscapeAnalysis();
      MacroExpand.eliminate_macro_nodes();
    } while (progress?);

    #ifndef PRODUCT
    Atomic::add(&ConnectionGraph::_no_escape_counter, _local_no_escape_ctr + total_scalar_replaced);
    Atomic::add(&ConnectionGraph::_arg_escape_counter, _local_arg_escape_ctr);
    Atomic::add(&ConnectionGraph::_global_escape_counter, _local_global_escape_ctr);
    #endif

Both you and @JohnTortugo pointed out that there was a bug in the previous revision. We overlooked that non-escaped objects are double-counted in the last iteration. I think it can be amended.

I call this the snapshot approach because it uses the snapshot of the last iteration to update the global statistical counters. All intermediate snapshots are dropped.

Allow me to write down @aamarsh's approach. The number of Java objects `JO` comes from the user program by nature. EscapeAnalysis breaks them down into 3 categories: non-escaped, arg-escaped and global-escaped. Without iterative EA, we can report this snapshot to the statistical counters.

With iterative EA, a problem arises. Some objects were eliminated in previous iterations of MacroExpansion. If we use the last snapshot, we have to add those eliminated Java objects back. Let's say iterative EA iterates N times in total. `E` is the number of Java objects eliminated in iterations 1 to N-1 (excluding the last iteration). Since all eliminated objects must be non-escaped, we add E back to _no_escape_counter. We then account for all Java objects of this compilation unit.

    Atomic::add(&ConnectionGraph::_no_escape_counter, _local_no_escape_ctr + E);
    Atomic::add(&ConnectionGraph::_arg_escape_counter, _local_arg_escape_ctr);
    Atomic::add(&ConnectionGraph::_global_escape_counter, _local_global_escape_ctr);

In this way, we keep track of all Java objects of this CU for iterative EA. I think it's accurate.

    JO = _local_no_escape_ctr + E + _local_arg_escape_ctr + _local_global_escape_ctr

I suggest hoisting the variable `PhaseMacroExpand mexp` out of the loop and making it stateful to track E.
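The accounting identity above can be checked with a toy model. The sketch below is Java rather than HotSpot C++, and everything in it is hypothetical: `EscapeStatsSketch`, the `budget` that limits how many non-escaping objects one pass removes, and the `maxIterations` cap are stand-ins chosen only to force several iterations. It compares the "last snapshot plus E" totals with the totals obtained by naively adding every iteration's snapshot.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

final class EscapeStatsSketch {
    enum State { NO_ESCAPE, ARG_ESCAPE, GLOBAL_ESCAPE }

    // Builds a hypothetical compilation unit with the given number of objects per category.
    static List<State> sample(int no, int arg, int global) {
        List<State> objects = new ArrayList<>();
        for (int i = 0; i < no; i++) objects.add(State.NO_ESCAPE);
        for (int i = 0; i < arg; i++) objects.add(State.ARG_ESCAPE);
        for (int i = 0; i < global; i++) objects.add(State.GLOBAL_ESCAPE);
        return objects;
    }

    // Stand-in for one escape analysis pass: classify the objects still in the graph.
    static int[] snapshot(List<State> live) {
        int[] counts = new int[State.values().length];
        for (State s : live) {
            counts[s.ordinal()]++;
        }
        return counts;
    }

    // Stand-in for macro elimination: remove up to `budget` non-escaping objects.
    static int eliminate(List<State> live, int budget) {
        int removed = 0;
        for (Iterator<State> it = live.iterator(); it.hasNext() && removed < budget; ) {
            if (it.next() == State.NO_ESCAPE) {
                it.remove();
                removed++;
            }
        }
        return removed;
    }

    // Snapshot approach: last iteration's snapshot, plus E added to the non-escaping count.
    static int[] snapshotTotals(List<State> objects, int budget, int maxIterations) {
        List<State> live = new ArrayList<>(objects);
        int eliminatedEarlier = 0; // E: objects removed in iterations before the last one
        int removed = 0;
        int iterations = 0;
        int[] last;
        do {
            eliminatedEarlier += removed; // the previous iteration is no longer the last one
            last = snapshot(live);
            removed = eliminate(live, budget);
            iterations++;
        } while (removed > 0 && iterations < maxIterations);
        last[State.NO_ESCAPE.ordinal()] += eliminatedEarlier;
        return last;
    }

    // Naive approach: add every iteration's snapshot, which counts surviving objects repeatedly.
    static int[] accumulatedTotals(List<State> objects, int budget, int maxIterations) {
        List<State> live = new ArrayList<>(objects);
        int[] totals = new int[State.values().length];
        int removed;
        int iterations = 0;
        do {
            int[] snap = snapshot(live);
            for (int i = 0; i < totals.length; i++) {
                totals[i] += snap[i];
            }
            removed = eliminate(live, budget);
            iterations++;
        } while (removed > 0 && iterations < maxIterations);
        return totals;
    }

    public static void main(String[] args) {
        List<State> objects = sample(4, 2, 3);
        System.out.println(java.util.Arrays.toString(snapshotTotals(objects, 2, 10)));
        System.out.println(java.util.Arrays.toString(accumulatedTotals(objects, 2, 10)));
    }
}
```

With 4 non-escaping, 2 arg-escaping and 3 global-escaping objects, the snapshot totals add up to the 9 objects in the model, while the accumulated totals grow with the number of iterations.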
------------- PR: https://git.openjdk.java.net/jdk/pull/8019 From kvn at openjdk.java.net Sat May 21 05:20:40 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Sat, 21 May 2022 05:20:40 GMT Subject: RFR: 8286990: Add compiler name to warning messages in Compiler Directive In-Reply-To: References: Message-ID: On Mon, 9 May 2022 05:23:14 GMT, yuta wrote: > When using Compiler Directive such as `java -XX:+UnlockDiagnosticVMOptions -XX:CompilerDirectivesFile= ` , > it shows totally the same message for c1 and c2 compiler and the user would be confused about > which compiler is affected by this message. > This should show messages with their compiler name so that the user knows which compiler shows this message. > > My change result would be like the below. > > > OpenJDK 64-Bit Server VM warning: printing of assembly code is enabled; turning on DebugNonSafepoints to gain additional output > OpenJDK 64-Bit Server VM warning: printing of assembly code is enabled; turning on DebugNonSafepoints to gain additional output > > -> > > OpenJDK 64-Bit Server VM warning: c1: printing of assembly code is enabled; turning on DebugNonSafepoints to gain additional output > OpenJDK 64-Bit Server VM warning: c2: printing of assembly code is enabled; turning on DebugNonSafepoints to gain additional output Seems reasonable. ------------- Marked as reviewed by kvn (Reviewer). PR: https://git.openjdk.java.net/jdk/pull/8591 From duke at openjdk.java.net Sat May 21 07:46:51 2022 From: duke at openjdk.java.net (Srinivas Vamsi Parasa) Date: Sat, 21 May 2022 07:46:51 GMT Subject: RFR: 8285868: x86_64 intrinsics for floating point methods isNaN, isFinite and isInfinite [v8] In-Reply-To: References: Message-ID: On Fri, 29 Apr 2022 00:36:18 GMT, Vladimir Kozlov wrote: >> Srinivas Vamsi Parasa has updated the pull request with a new target base due to a merge or a rebase. 
The pull request now contains ten commits: >> >> - add comment for vfpclasss/d for isFinite() >> - Merge branch 'master' of https://git.openjdk.java.net/jdk into float >> - zero out the upper bits not written by setb >> - use 0x1 to be simpler >> - remove the redundant temp register >> - Split the macros using predicate >> - update jmh tests >> - Merge branch 'master' into float >> - 8285868: x86_64 intrinsics for floating point methods isNaN, isFinite and isInfinite > > Impressive. Few comments. > > You are testing performance of storing `boolean` results into array but usually these Java methods used in conditions. Measuring that will be more real word case. For both case: with `avx512dq` On and OFF. > > And you need to post you perf results at least in RFE. Please, also show what instructions are currently generated vs your changes. I don't get how you made `isNaN()` faster - you generate more instructions is seems. > > Instead of 3 new Ideal nodes per type you can use one and store instrinsic id (or other enum) in its field which you can read in `.ad` file instructions. Instead I suggest to split those mach instructions based on `avx512dq` support to avoid unused registers killing. > > Why Double type support is limited to LP64? Why there is no `x86_32.ad` changes? > > You can reuse `tmp1` in `double_class_check()`. Hi Vladimir (@vnkozlov) For 32bit, in the case of double, we see performance improvement using `vfpclasssd` instruction but **without** `vfpclassd`, we see **40% decrease** in performance for `isFinite()` compared to the original Java code. Below, is the code which implements the intrinsic using SSE. Is it Ok to skip support for **non** `vfpclassd` for 32bit? 

void C2_MacroAssembler::double_class_check_sse(int opcode, XMMRegister src, Register dst, Register temp, Register temp1) {
  int32_t POS_INF_HI = 0x7ff00000; // hi 32bits
  int32_t KILL_SIGN_MASK_HI = 0x7fffffff; // hi 32 bits

  pshuflw(src, src, 0x4e); //switch hi to lo
  movdl(temp, src);
  movl(temp1, KILL_SIGN_MASK_HI);
  andl(temp, temp1);
  movl(temp1, POS_INF_HI);
  cmpl(temp, temp1);
  switch (opcode) {
    case Op_IsFiniteD:
      setb(Assembler::below, dst);
      break;
    case Op_IsInfiniteD:
      setb(Assembler::equal, dst);
      break;
    case Op_IsNaND:
      setb(Assembler::above, dst);
      break;
    default:
      assert(false, "%s", NodeClassNames[opcode]);
  }
  andl(dst, 0xff);
}

-------------

PR: https://git.openjdk.java.net/jdk/pull/8459

From duke at openjdk.java.net Sat May 21 07:46:54 2022
From: duke at openjdk.java.net (Srinivas Vamsi Parasa)
Date: Sat, 21 May 2022 07:46:54 GMT
Subject: RFR: 8285868: x86_64 intrinsics for floating point methods isNaN, isFinite and isInfinite [v8]
In-Reply-To:
References:
Message-ID: <3NkjeRn9vwaUa2t6-oVuunKBRhR539zc0K-Xjw7MWZ0=.3e5111c3-c932-4a28-b5c1-8ac223c5c380@github.com>

On Fri, 20 May 2022 08:46:06 GMT, Jatin Bhateja wrote:

>> src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 4750:
>>
>>> 4748: movdl(temp, src);
>>> 4749: andl(temp, KILL_SIGN_MASK);
>>> 4750: cmpl(temp, POS_INF);
>>
>> For IsNaN following sequence will offer better latency
>> "vucomiss src_xmm, src_xmm"
>> "setp r8"
>
> Hi @vamsi-parasa ,
>
> I can see almost 30% improvement in you JMH micro for IsNaN through above suggested change.
> Original score over AVX2 machine:
>
> Benchmark                       Mode  Cnt  Score  Error  Units
> FloatClassCheck.testIsNaN       avgt    2  0.854         ns/op
> FloatClassCheck.testIsNaN:·asm  avgt         NaN           ---
>
> With new sequence:
> Benchmark                       Mode  Cnt  Score  Error  Units
> FloatClassCheck.testIsNaN       avgt    2  0.570         ns/op
> FloatClassCheck.testIsNaN:·asm  avgt         NaN           ---

Thanks Jatin! Will test it on my machine as well...
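The bit test performed by the `double_class_check_sse` sequence quoted earlier in this thread can be modelled on raw IEEE 754 bit patterns. The following is a Java sketch under the assumption that it mirrors that sequence faithfully (sign cleared, high 32 bits compared against `0x7ff00000`); `DoubleClassCheckSketch`, `classifyHighWord` and `classifyFull` are hypothetical names. If the model is accurate, a NaN whose payload sits only in the low 32 bits would be classified as infinite by the high-word comparison, which may be worth checking against the real code.

```java
final class DoubleClassCheckSketch {
    enum FpClass { FINITE, INFINITE, NAN }

    static final int POS_INF_HI = 0x7ff00000;         // high 32 bits of +Infinity
    static final int KILL_SIGN_MASK_HI = 0x7fffffff;  // clears the sign bit of the high word
    static final long POS_INF = 0x7ff0000000000000L;  // all 64 bits of +Infinity
    static final long KILL_SIGN_MASK = 0x7fffffffffffffffL;

    // Models the quoted sequence: only the high 32 bits take part in the comparison.
    static FpClass classifyHighWord(long bits) {
        int hi = (int) (bits >>> 32) & KILL_SIGN_MASK_HI;
        int cmp = Integer.compare(hi, POS_INF_HI); // both operands are non-negative here
        return cmp < 0 ? FpClass.FINITE : cmp == 0 ? FpClass.INFINITE : FpClass.NAN;
    }

    // Reference classification that uses all 64 bits.
    static FpClass classifyFull(long bits) {
        long abs = bits & KILL_SIGN_MASK;
        int cmp = Long.compare(abs, POS_INF); // both operands are non-negative here
        return cmp < 0 ? FpClass.FINITE : cmp == 0 ? FpClass.INFINITE : FpClass.NAN;
    }

    public static void main(String[] args) {
        long[] patterns = {
            Double.doubleToRawLongBits(1.0),
            Double.doubleToRawLongBits(Double.NEGATIVE_INFINITY),
            Double.doubleToRawLongBits(Double.NaN),
            0x7ff0000000000001L // NaN with a payload only in the low 32 bits
        };
        for (long bits : patterns) {
            System.out.println(Long.toHexString(bits) + ": high word says "
                    + classifyHighWord(bits) + ", full width says " + classifyFull(bits));
        }
    }
}
```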
------------- PR: https://git.openjdk.java.net/jdk/pull/8459 From duke at openjdk.java.net Sat May 21 09:47:55 2022 From: duke at openjdk.java.net (Quan Anh Mai) Date: Sat, 21 May 2022 09:47:55 GMT Subject: RFR: 8285868: x86_64 intrinsics for floating point methods isNaN, isFinite and isInfinite [v8] In-Reply-To: References: Message-ID: On Sat, 21 May 2022 07:40:20 GMT, Srinivas Vamsi Parasa wrote: >> Impressive. Few comments. >> >> You are testing performance of storing `boolean` results into array but usually these Java methods used in conditions. Measuring that will be more real word case. For both case: with `avx512dq` On and OFF. >> >> And you need to post you perf results at least in RFE. Please, also show what instructions are currently generated vs your changes. I don't get how you made `isNaN()` faster - you generate more instructions is seems. >> >> Instead of 3 new Ideal nodes per type you can use one and store instrinsic id (or other enum) in its field which you can read in `.ad` file instructions. Instead I suggest to split those mach instructions based on `avx512dq` support to avoid unused registers killing. >> >> Why Double type support is limited to LP64? Why there is no `x86_32.ad` changes? >> >> You can reuse `tmp1` in `double_class_check()`. > > Hi Vladimir (@vnkozlov) > > For 32bit, in the case of double, we see performance improvement using `vfpclasssd` instruction but **without** `vfpclassd`, we see **40% decrease** in performance for `isFinite()` compared to the original Java code. Below, is the code which implements the intrinsic using SSE. > > Is it Ok to skip support for **non** `vfpclassd` for 32bit? 
>
> void C2_MacroAssembler::double_class_check_sse(int opcode, XMMRegister src, Register dst, Register temp, Register temp1) {
>   int32_t POS_INF_HI = 0x7ff00000; // hi 32bits
>   int32_t KILL_SIGN_MASK_HI = 0x7fffffff; // hi 32 bits
>
>   pshuflw(src, src, 0x4e); //switch hi to lo
>   movdl(temp, src);
>   movl(temp1, KILL_SIGN_MASK_HI);
>   andl(temp, temp1);
>   movl(temp1, POS_INF_HI);
>   cmpl(temp, temp1);
>   switch (opcode) {
>     case Op_IsFiniteD:
>       setb(Assembler::below, dst);
>       break;
>     case Op_IsInfiniteD:
>       setb(Assembler::equal, dst);
>       break;
>     case Op_IsNaND:
>       setb(Assembler::above, dst);
>       break;
>     default:
>       assert(false, "%s", NodeClassNames[opcode]);
>   }
>   andl(dst, 0xff);
> }

@vamsi-parasa I modified your benchmark to emulate more use cases of these functions and run it on the baseline, #8525 with modified `isInfinite` (to use `Math.abs(v) > MAX_VALUE` instead) and this patch. The result is as follows, the source code and the assembly for the interesting parts will be shown later

                                                    Baseline         #8459           #8525
Benchmark                              Mode  Cnt  Score   Error   Score   Error   Score   Error  Units
FloatClassCheck.testIsFiniteBranch     avgt    5  2.522 ± 0.094   2.564 ± 0.187   2.512 ± 0.137  ns/op
FloatClassCheck.testIsFiniteCMov       avgt    5  0.479 ± 0.014   0.786 ± 0.009   0.475 ± 0.005  ns/op
FloatClassCheck.testIsFiniteStore      avgt    5  0.482 ± 0.010   0.603 ± 0.026   0.480 ± 0.006  ns/op
FloatClassCheck.testIsInfiniteBranch   avgt    5  1.921 ± 0.043   1.778 ± 0.023   1.767 ± 0.039  ns/op
FloatClassCheck.testIsInfiniteCMov     avgt    5  1.124 ± 0.045   0.787 ± 0.013   0.622 ± 0.019  ns/op
FloatClassCheck.testIsInfiniteStore    avgt    5  1.195 ± 0.033   0.602 ± 0.015   0.625 ± 0.033  ns/op
FloatClassCheck.testIsNaNBranch        avgt    5  1.896 ± 0.182   2.097 ± 0.216   1.725 ± 0.222  ns/op
FloatClassCheck.testIsNaNCMov          avgt    5  2.956 ± 0.021   0.856 ± 0.003   0.390 ± 0.006  ns/op
FloatClassCheck.testIsNaNStore         avgt    5  3.024 ± 0.071   0.741 ± 0.139   0.410 ± 0.008  ns/op

                                                    Baseline         #8459           #8525
Benchmark                              Mode  Cnt  Score   Error   Score   Error   Score   Error  Units
DoubleClassCheck.testIsFiniteBranch    avgt    5  2.566 ± 0.105   3.023 ± 0.117   2.603 ± 0.137  ns/op
DoubleClassCheck.testIsFiniteCMov      avgt    5  0.481 ± 0.010   0.978 ± 0.011   0.485 ± 0.018  ns/op
DoubleClassCheck.testIsFiniteStore     avgt    5  0.480 ± 0.012   0.943 ± 0.012   0.486 ± 0.011  ns/op
DoubleClassCheck.testIsInfiniteBranch  avgt    5  1.907 ± 0.081   1.917 ± 0.065   1.808 ± 0.039  ns/op
DoubleClassCheck.testIsInfiniteCMov    avgt    5  1.111 ± 0.028   0.982 ± 0.019   0.630 ± 0.017  ns/op
DoubleClassCheck.testIsInfiniteStore   avgt    5  1.134 ± 0.011   0.944 ± 0.017   0.630 ± 0.009  ns/op
DoubleClassCheck.testIsNaNBranch       avgt    5  1.926 ± 0.218   2.193 ± 0.045   1.767 ± 0.142  ns/op
DoubleClassCheck.testIsNaNCMov         avgt    5  2.944 ± 0.020   1.047 ± 0.012   0.392 ± 0.009  ns/op
DoubleClassCheck.testIsNaNStore        avgt    5  3.011 ± 0.065   0.946 ± 0.029   0.411 ± 0.004  ns/op

The source code for `FloatClassCheck`, that of `DoubleClassCheck` is similar

    RandomGenerator rng;
    static final int BUFFER_SIZE = 1024;
    float[] inputs;
    boolean[] storeOutputs;
    int[] cmovOutputs;
    int[] branchOutputs;

    @CompilerControl(CompilerControl.Mode.DONT_INLINE)
    static int call() {
        return 1;
    }

    @Setup
    public void setup() {
        storeOutputs = new boolean[BUFFER_SIZE];
        cmovOutputs = new int[BUFFER_SIZE];
        branchOutputs = new int[BUFFER_SIZE];
        inputs = new float[BUFFER_SIZE];
        RandomGenerator rng = RandomGeneratorFactory.getDefault().create(0);
        float input;
        for (int i = 0; i < BUFFER_SIZE; i++) {
            if (i % 5 == 0) {
                input = (i%2 == 0) ? Float.NEGATIVE_INFINITY : Float.POSITIVE_INFINITY;
            }
            else if (i % 3 == 0) input = Float.NaN;
            else input = rng.nextFloat();
            inputs[i] = input;
        }
    }

    @Benchmark
    @OperationsPerInvocation(BUFFER_SIZE)
    public void testIsFiniteStore() {
        for (int i = 0; i < BUFFER_SIZE; i++) {
            storeOutputs[i] = Float.isFinite(inputs[i]);
        }
    }

    @Benchmark
    @OperationsPerInvocation(BUFFER_SIZE)
    public void testIsInfiniteStore() {
        for (int i = 0; i < BUFFER_SIZE; i++) {
            storeOutputs[i] = Float.isInfinite(inputs[i]);
        }
    }

    @Benchmark
    @OperationsPerInvocation(BUFFER_SIZE)
    public void testIsNaNStore() {
        for (int i = 0; i < BUFFER_SIZE; i++) {
            storeOutputs[i] = Float.isNaN(inputs[i]);
        }
    }

    @Benchmark
    @OperationsPerInvocation(BUFFER_SIZE)
    public void testIsFiniteCMov() {
        for (int i = 0; i < BUFFER_SIZE; i++) {
            cmovOutputs[i] = Float.isFinite(inputs[i]) ? 9 : 7;
        }
    }

    @Benchmark
    @OperationsPerInvocation(BUFFER_SIZE)
    public void testIsInfiniteCMov() {
        for (int i = 0; i < BUFFER_SIZE; i++) {
            cmovOutputs[i] = Float.isInfinite(inputs[i]) ? 9 : 7;
        }
    }

    @Benchmark
    @OperationsPerInvocation(BUFFER_SIZE)
    public void testIsNaNCMov() {
        for (int i = 0; i < BUFFER_SIZE; i++) {
            cmovOutputs[i] = Float.isNaN(inputs[i]) ? 9 : 7;
        }
    }

    @Benchmark
    @OperationsPerInvocation(BUFFER_SIZE)
    public void testIsFiniteBranch() {
        for (int i = 0; i < BUFFER_SIZE; i++) {
            cmovOutputs[i] = Float.isFinite(inputs[i]) ? call() : 7;
        }
    }

    @Benchmark
    @OperationsPerInvocation(BUFFER_SIZE)
    public void testIsInfiniteBranch() {
        for (int i = 0; i < BUFFER_SIZE; i++) {
            cmovOutputs[i] = Float.isInfinite(inputs[i]) ? call() : 7;
        }
    }

    @Benchmark
    @OperationsPerInvocation(BUFFER_SIZE)
    public void testIsNaNBranch() {
        for (int i = 0; i < BUFFER_SIZE; i++) {
            cmovOutputs[i] = Float.isNaN(inputs[i]) ? call() : 7;
        }
    }

The assembly of the interesting parts of the executions:

FloatClassCheck::testIsFiniteBranch:
Baseline, #8525:
    vandps -0xd4791(%rip), %xmm0, %xmm1
    vucomiss %xmm1, %xmm2
    jae -0x77
#8459:
    vmovd %xmm0, %r11d
    andl $0x7fffffff, %r11d
    cmpl $0x7f800000, %r11d
    setb %r10b
    andl $0xff, %r10d
    testl %r10d, %r10d
    jne -0x90

FloatClassCheck::testIsFiniteCMov:
Baseline, #8525:
    vandps -0xcdcf6(%rip), %xmm6, %xmm7
    vucomiss %xmm7, %xmm10
    movl $0x9, %r10d
    cmovbl %r14d, %r10d
#8459:
    vmovd %xmm4, %r9d
    andl $0x7fffffff, %r9d
    cmpl $0x7f800000, %r9d
    setb %r8b
    andl $0xff, %r8d
    testl %r8d, %r8d
    movl $0x7, %ebx
    cmovnel %r14d, %ebx

FloatClassCheck::isFiniteStore:
Baseline, #8525:
    vandps -0xcfd74(%rip), %xmm3, %xmm3
    movl $0x1, %r10d
    vucomiss %xmm3, %xmm0
    cmovbl %r9d, %r10d
#8459:
    vmovd %xmm6, %edi
    andl $0x7fffffff, %edi
    cmpl $0x7f800000, %edi
    setb %dil
    andl $0xff, %edi

FloatClassCheck::isInfiniteBranch:
Baseline:
    vucomiss -0xc8(%rip), %xmm1
    jp 0x2
    je 0x20
    vucomiss -0xd0(%rip), %xmm1
    nopl (%rax,%rax)
    nop
    jp -0x86
    jne -0x8c
#8459:
    vmovd %xmm1, %r10d
    andl $0x7fffffff, %r10d
    cmpl $0x7f800000, %r10d
    sete %r11b
    andl $0xff, %r11d
    testl %r11d, %r11d
    je -0x87
#8525:
    vandps -0xce478(%rip), %xmm1, %xmm0
    nopl (%rax,%rax)
    vucomiss -0xc8(%rip), %xmm0
    jbe -0x76

FloatClassCheck::isInfiniteCMov:
Baseline:
    vucomiss -0x128(%rip), %xmm1
    jp 0x2
    je 0x16
    vucomiss -0x130(%rip), %xmm1
    jp 0x2
    je 0xa
#8459:
    vmovd %xmm5, %eax
    andl $0x7fffffff, %eax
    cmpl $0x7f800000, %eax
    sete %bpl
    andl $0xff, %ebp
    testl %ebp, %ebp
    movl $0x7, %eax
    cmovnel %ebx, %eax
#8525:
    vandps -0xcefc3(%rip), %xmm0, %xmm0
    vucomiss -0x12b(%rip), %xmm0
    movl $0x9, %esi
    cmovbel %r8d, %esi

FloatClassCheck::isInfiniteStore:
Baseline:
    vucomiss -0x128(%rip), %xmm0
    jp 0x2
    je 0x11
    vucomiss -0x130(%rip), %xmm0
    jp 0x2
    je 0x5
#8459:
    vmovd %xmm2, %r8d
    andl $0x7fffffff, %r8d
    cmpl $0x7f800000, %r8d
    sete %r8b
    andl $0xff, %r8d
#8525:
    vandps -0xcf2b9(%rip), %xmm0, %xmm0
    vucomiss -0x121(%rip), %xmm0
    movl $0x1, %r11d
    cmovbel %esi, %r11d

FloatClassCheck::isNaNBranch: Baseline: vucomiss %xmm0, %xmm0 jp 0x2 je -0x64 #8459: vmovd %xmm1, %r10d andl $0x7fffffff, %r10d cmpl $0x7f800000, %r10d seta %r11b andl $0xff, %r11d testl %r11d, %r11d je -0x87 #8525: vucomiss %xmm1, %xmm1 jnp -0x62 FloatClassCheck::isNaNCMov: Baseline: vucomiss %xmm5, %xmm5 jnp 0xa pushfq andq $-0xd5, (%rsp) popfq movl $0x7, %r9d cmovnel %r8d, %r9d #8459: vmovd %xmm4, %ebp andl $0x7fffffff, %ebp cmpl $0x7f800000, %ebp seta %al andl $0xff, %eax testl %eax, %eax movl $0x7, %ebp cmovnel %ebx, %ebp #8525: vucomiss %xmm4, %xmm4 movl $0x7, %r9d cmovpl %r8d, %r9d FloatClassCheck::isNaNStore: Baseline: vucomiss %xmm3, %xmm3 jnp 0xa pushfq andq $-0xd5, (%rsp) popfq movl $0x1, %ebx cmovel %eax, %ebx #8459: vmovd %xmm6, %edi andl $0x7fffffff, %edi cmpl $0x7f800000, %edi seta %dil andl $0xff, %edi #8525: movl $0x1, %r9d vucomiss %xmm0, %xmm0 cmovnpl %r10d, %r9d The assembly output for `DoubleClassCheck` is similar. Thanks. ------------- PR: https://git.openjdk.java.net/jdk/pull/8459 From duke at openjdk.java.net Sat May 21 10:31:25 2022 From: duke at openjdk.java.net (Quan Anh Mai) Date: Sat, 21 May 2022 10:31:25 GMT Subject: RFR: 8285973: x86_64: Improve fp comparison and cmove for eq/ne [v3] In-Reply-To: References: Message-ID: > Hi, > > This patch optimises the matching rules for floating-point comparison with respects to eq/ne on x86-64 > > 1, When the inputs of a comparison is the same (i.e `isNaN` patterns), `ZF` is always set, so we don't need `cmpOpUCF2` for the eq/ne cases, which improves the sequence of `If (CmpF x x) (Bool ne)` from > > ucomiss xmm0, xmm0 > jp label > jne label > > into > > ucomiss xmm0, xmm0 > jp label > > 2, The move rules for `cmpOpUCF2` is missing, which makes patterns such as `x == y ? 
1 : 0` to fall back to `cmpOpU`, which has a really high cost of fixing the flags, such as
> 
>     xorl    ecx, ecx
>     ucomiss xmm0, xmm1
>     jnp     done
>     pushf
>     andq    [rsp], 0xffffff2b
>     popf
> done:
>     movl    eax, 1
>     cmovel  eax, ecx
> 
> The patch changes this sequence into
> 
>     xorl    ecx, ecx
>     ucomiss xmm0, xmm1
>     movl    eax, 1
>     cmovpl  eax, ecx
>     cmovnel eax, ecx
> 
> 3, The patch also changes the pattern of `isInfinite` to be more optimised by using `Math.abs` to reduce 1 comparison and compares the result with `MAX_VALUE` since `>` is more optimised than `==` for floating-point types.
> 
> The benchmark results are as follows:
> 
>                                        Before                  After
> Benchmark                      Mode  Cnt      Score     Error      Score     Error  Unit   Ratio
> FPComparison.equalDouble       avgt    5   2876.242 ±  58.875    594.636 ±   8.922  ns/op  4.84
> FPComparison.equalFloat        avgt    5   3062.430 ±  31.371    663.849 ±   3.656  ns/op  4.61
> FPComparison.isFiniteDouble    avgt    5    475.749 ±  19.027    518.309 ± 107.352  ns/op  0.92
> FPComparison.isFiniteFloat     avgt    5    506.525 ±  14.417    515.576 ±  14.669  ns/op  0.98
> FPComparison.isInfiniteDouble  avgt    5   1232.800 ±  31.677    621.185 ±  11.935  ns/op  1.98
> FPComparison.isInfiniteFloat   avgt    5   1234.708 ±  70.239    623.566 ±  15.206  ns/op  1.98
> FPComparison.isNanDouble       avgt    5   2255.847 ±   7.238    400.124 ±   0.762  ns/op  5.64
> FPComparison.isNanFloat        avgt    5   2567.044 ±  36.078    546.486 ±   1.509  ns/op  4.70
> 
> Thank you very much.
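The flag reasoning in the quoted description can be checked with a small model. The following is only a sketch in Java, not code from the patch: `UcomissCmovSketch`, `Flags` and the method names are made up for illustration, and `ucomiss` here merely mimics the documented ZF/PF/CF results of the instruction (all three set for an unordered compare).

```java
// Toy model of the flags set by ucomiss and of the sequences discussed above.
// All names are hypothetical; this is not HotSpot code.
final class UcomissCmovSketch {
    /** ZF, PF and CF as produced by "ucomiss a, b". */
    static final class Flags {
        final boolean zf, pf, cf;
        Flags(boolean zf, boolean pf, boolean cf) {
            this.zf = zf;
            this.pf = pf;
            this.cf = cf;
        }
    }

    static Flags ucomiss(float a, float b) {
        if (Float.isNaN(a) || Float.isNaN(b)) {
            return new Flags(true, true, true);   // unordered
        }
        if (a == b) {
            return new Flags(true, false, false); // equal
        }
        if (a < b) {
            return new Flags(false, false, true); // less than
        }
        return new Flags(false, false, false);    // greater than
    }

    /** Models: xorl ecx,ecx; ucomiss x,y; movl eax,1; cmovpl eax,ecx; cmovnel eax,ecx. */
    static int equalViaTwoCmov(float x, float y) {
        int ecx = 0;
        Flags f = ucomiss(x, y);
        int eax = 1;
        if (f.pf) {
            eax = ecx;  // cmovp: an unordered compare is never "equal"
        }
        if (!f.zf) {
            eax = ecx;  // cmovne: ordered and different
        }
        return eax;
    }

    /** Point 1: for a self-comparison ZF is always set, so PF alone decides. */
    static boolean isNaNViaParity(float x) {
        return ucomiss(x, x).pf;
    }

    /** Point 3: isInfinite expressed as Math.abs(v) > MAX_VALUE. */
    static boolean isInfiniteViaAbs(float x) {
        return Math.abs(x) > Float.MAX_VALUE;
    }
}
```

For every pair of inputs `equalViaTwoCmov(x, y)` should agree with `x == y ? 1 : 0`, including the NaN cases that force the extra `cmovp`.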
Quan Anh Mai has updated the pull request incrementally with one additional commit since the last revision: comments ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/8525/files - new: https://git.openjdk.java.net/jdk/pull/8525/files/ba93dcf2..7fcfe4a3 Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=8525&range=02 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=8525&range=01-02 Stats: 11 lines in 1 file changed: 10 ins; 0 del; 1 mod Patch: https://git.openjdk.java.net/jdk/pull/8525.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8525/head:pull/8525 PR: https://git.openjdk.java.net/jdk/pull/8525 From duke at openjdk.java.net Sat May 21 10:46:48 2022 From: duke at openjdk.java.net (Quan Anh Mai) Date: Sat, 21 May 2022 10:46:48 GMT Subject: RFR: 8285973: x86_64: Improve fp comparison and cmove for eq/ne [v3] In-Reply-To: References: Message-ID: On Wed, 4 May 2022 23:27:45 GMT, Vladimir Kozlov wrote: >> The changes to `Float` and `Double` look good. I don't think we need additional tests, see test/jdk/java/lang/Math/IeeeRecommendedTests.java. >> >> At first i thought we no longer need PR #8459 but it seems both PRs are complimentary, albeit PR #8459 has more modest performance gains for the intrinsics. > >> The changes to `Float` and `Double` look good. I don't think we need additional tests, see test/jdk/java/lang/Math/IeeeRecommendedTests.java. > > Thank you, Paul for pointing the test. It means we need to run tier4 (which runs these tests with -Xcomp) to make sure methods are compiled by C2. @vnkozlov I have added comments to describe the changes in `cmpOpUCF` and the reasons behind the `cmov_regUCF2_eq` match rules. Using `expand` broke the build with `Syntax Error: :For expand in cmovI_regUCF2_eq to work, parameter declaration order in cmovI_regUCF2_ne must follow matchrule`. @sviswa7 (x != y) ? a : b can be calculated using pseudocode as follow: res = (!ZF || PF) ? a : b = !ZF ? a : (PF ? 
a : b) which can be calculated using cmovp rb, ra // rb1 = PF ? ra : rb cmovne rb, ra // rb2 = !ZF ? ra : rb1 = !ZF ? ra : (PF ? ra : rb) Furthermore, since `(x == y) == !(x != y)`, we have `((x == y) ? a : b) == ((x != y) ? b : a)`, which explains the implementation of `cmov_regUCF2_eq`. @vamsi-parasa Thanks a lot for your suggestion, I have modified the PR description as you say. ------------- PR: https://git.openjdk.java.net/jdk/pull/8525 From duke at openjdk.java.net Sat May 21 12:16:23 2022 From: duke at openjdk.java.net (Quan Anh Mai) Date: Sat, 21 May 2022 12:16:23 GMT Subject: RFR: 8283699: Improve the peephole mechanism of hotspot [v2] In-Reply-To: References: Message-ID: > Hi, > > The current peephole mechanism has several drawbacks: > - Can only match and remove adjacent instructions. > - Cannot match machine ideal nodes (e.g MachSpillCopyNode). > - Can only replace 1 instruction, the position of insertion is limited to the position at which the matched nodes reside. > - Is actually broken since the nodes are not connected properly and OptoScheduling requires true dependencies between nodes. > > The patch proposes to enhance the peephole mechanism by allowing a peep rule to call into a dedicated function, which takes the responsibility to perform all required transformations on the basic block. This allows the peephole mechanism to perform several transformations effectively in a more fine-grain manner. > > The patch uses the peephole optimisation to perform some classic peepholes, transforming on x86 the sequences: > > mov r1, r2 -> lea r1, [r2 + r3/i] > add r1, r3/i > > and > > mov r1, r2 -> lea r1, [r2 << i], with i = 1, 2, 3 > shl r1, i > > On the added benchmarks, the transformations show positive results: > > Benchmark Mode Cnt Score Error Units > LeaPeephole.B_D_int avgt 5 1200.490 ? 104.662 ns/op > LeaPeephole.B_D_long avgt 5 1211.439 ? 30.196 ns/op > LeaPeephole.B_I_int avgt 5 1118.831 ? 7.995 ns/op > LeaPeephole.B_I_long avgt 5 1112.389 ? 
15.838 ns/op > LeaPeephole.I_S_int avgt 5 1262.528 ? 7.293 ns/op > LeaPeephole.I_S_long avgt 5 1223.820 ? 17.777 ns/op > > Benchmark Mode Cnt Score Error Units > LeaPeephole.B_D_int avgt 5 860.889 ? 6.089 ns/op > LeaPeephole.B_D_long avgt 5 945.455 ? 21.603 ns/op > LeaPeephole.B_I_int avgt 5 849.109 ? 9.809 ns/op > LeaPeephole.B_I_long avgt 5 851.283 ? 16.921 ns/op > LeaPeephole.I_S_int avgt 5 976.594 ? 23.004 ns/op > LeaPeephole.I_S_long avgt 5 936.984 ? 9.601 ns/op > > A following patch would add IR tests for these transformations since the IR framework has not been able to parse the ideal scheduling yet although printing the scheduling itself has been made possible recently. > > Thank you very much. Quan Anh Mai has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 23 commits: - Merge branch 'master' into peephole - some fix - add benchmark - Merge branch 'master' into peephole - refactor - fix? - refactor - attempt - attempt - build fix - ... and 13 more: https://git.openjdk.java.net/jdk/compare/72bd41b8...78b4a3f2 ------------- Changes: https://git.openjdk.java.net/jdk/pull/8025/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=8025&range=01 Stats: 1002 lines in 22 files changed: 856 ins; 24 del; 122 mod Patch: https://git.openjdk.java.net/jdk/pull/8025.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8025/head:pull/8025 PR: https://git.openjdk.java.net/jdk/pull/8025 From kvn at openjdk.java.net Sat May 21 15:27:53 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Sat, 21 May 2022 15:27:53 GMT Subject: RFR: 8286972: Support the new loop induction variable related PopulateIndex IR node on x86 [v5] In-Reply-To: References: Message-ID: On Fri, 20 May 2022 05:09:41 GMT, Sandhya Viswanathan wrote: >> This PR adds x86 backend support for the new loop induction variable related PopulateIndex IR node. 
>> This IR node was added as part of [JDK-8280510](https://bugs.openjdk.java.net/browse/JDK-8280510). >> >> The performance numbers are as follows: >> Before: >> Benchmark (count) Mode Cnt Score Error Units >> IndexVector.exprWithIndex1 65536 thrpt 3 64556.552 ? 1126.396 ops/s >> IndexVector.exprWithIndex2 65536 thrpt 3 22117.050 ? 11452.098 ops/s >> IndexVector.indexArrayFill 65536 thrpt 3 117776.383 ? 1120.957 ops/s >> >> After: >> Benchmark (count) Mode Cnt Score Error Units >> IndexVector.exprWithIndex1 65536 thrpt 3 203180.290 ? 2147.807 ops/s >> IndexVector.exprWithIndex2 65536 thrpt 3 274132.756 ? 6853.393 ops/s >> IndexVector.indexArrayFill 65536 thrpt 3 374165.202 ? 46930.779 ops/s >> >> Please review. >> >> Best Regards, >> Sandhya > > Sandhya Viswanathan has updated the pull request incrementally with one additional commit since the last revision: > > review comment resolution My testing passed. You are good to push. ------------- PR: https://git.openjdk.java.net/jdk/pull/8778 From kvn at openjdk.java.net Sat May 21 15:33:41 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Sat, 21 May 2022 15:33:41 GMT Subject: RFR: 8285973: x86_64: Improve fp comparison and cmove for eq/ne [v3] In-Reply-To: References: Message-ID: On Sat, 21 May 2022 10:31:25 GMT, Quan Anh Mai wrote: >> Hi, >> >> This patch optimises the matching rules for floating-point comparison with respects to eq/ne on x86-64 >> >> 1, When the inputs of a comparison is the same (i.e `isNaN` patterns), `ZF` is always set, so we don't need `cmpOpUCF2` for the eq/ne cases, which improves the sequence of `If (CmpF x x) (Bool ne)` from >> >> ucomiss xmm0, xmm0 >> jp label >> jne label >> >> into >> >> ucomiss xmm0, xmm0 >> jp label >> >> 2, The move rules for `cmpOpUCF2` is missing, which makes patterns such as `x == y ? 
1 : 0` to fall back to `cmpOpU`, which has a really high cost of fixing the flags, such as
>> 
>>     xorl    ecx, ecx
>>     ucomiss xmm0, xmm1
>>     jnp     done
>>     pushf
>>     andq    [rsp], 0xffffff2b
>>     popf
>> done:
>>     movl    eax, 1
>>     cmovel  eax, ecx
>> 
>> The patch changes this sequence into
>> 
>>     xorl    ecx, ecx
>>     ucomiss xmm0, xmm1
>>     movl    eax, 1
>>     cmovpl  eax, ecx
>>     cmovnel eax, ecx
>> 
>> 3, The patch also changes the pattern of `isInfinite` to be more optimised by using `Math.abs` to reduce 1 comparison and compares the result with `MAX_VALUE` since `>` is more optimised than `==` for floating-point types.
>> 
>> The benchmark results are as follows:
>> 
>>                                        Before                  After
>> Benchmark                      Mode  Cnt      Score     Error      Score     Error  Unit   Ratio
>> FPComparison.equalDouble       avgt    5   2876.242 ±  58.875    594.636 ±   8.922  ns/op  4.84
>> FPComparison.equalFloat        avgt    5   3062.430 ±  31.371    663.849 ±   3.656  ns/op  4.61
>> FPComparison.isFiniteDouble    avgt    5    475.749 ±  19.027    518.309 ± 107.352  ns/op  0.92
>> FPComparison.isFiniteFloat     avgt    5    506.525 ±  14.417    515.576 ±  14.669  ns/op  0.98
>> FPComparison.isInfiniteDouble  avgt    5   1232.800 ±  31.677    621.185 ±  11.935  ns/op  1.98
>> FPComparison.isInfiniteFloat   avgt    5   1234.708 ±  70.239    623.566 ±  15.206  ns/op  1.98
>> FPComparison.isNanDouble       avgt    5   2255.847 ±   7.238    400.124 ±   0.762  ns/op  5.64
>> FPComparison.isNanFloat        avgt    5   2567.044 ±  36.078    546.486 ±   1.509  ns/op  4.70
>> 
>> Thank you very much.
> 
> Quan Anh Mai has updated the pull request incrementally with one additional commit since the last revision:
> 
>   comments

Thank you for trying my suggestion and for the new comments. Approved.
I started our testing. Please wait for the results.

-------------

Marked as reviewed by kvn (Reviewer).
PR: https://git.openjdk.java.net/jdk/pull/8525

From kvn at openjdk.java.net Sat May 21 15:45:52 2022
From: kvn at openjdk.java.net (Vladimir Kozlov)
Date: Sat, 21 May 2022 15:45:52 GMT
Subject: RFR: 8285868: x86_64 intrinsics for floating point methods isNaN, isFinite and isInfinite [v8]
In-Reply-To: 
References: 
Message-ID: 

On Sat, 21 May 2022 07:40:20 GMT, Srinivas Vamsi Parasa wrote:

> For 32bit, in the case of double, we see performance improvement using `vfpclasssd` instruction but **without** `vfpclasssd`, we see **40% decrease** in performance for `isFinite()` compared to the original Java code. Below, is the code which implements the intrinsic using SSE.
> 
> Is it Ok to skip support for **non** `vfpclasssd` for 32bit?

Yes, but add a comment about that. Also for 32-bit you need to check SSE2 support which is required by `pshuflw`.

-------------

PR: https://git.openjdk.java.net/jdk/pull/8459

From kvn at openjdk.java.net Sat May 21 15:58:52 2022
From: kvn at openjdk.java.net (Vladimir Kozlov)
Date: Sat, 21 May 2022 15:58:52 GMT
Subject: RFR: 8282024: add EscapeAnalysis statistics under PrintOptoStatistics [v14]
In-Reply-To: 
References: <7ffUzFtOeZBXuqaImQG0u9CLQTGp2u4sVxPtFtxBxck=.42e904a4-ba7b-46df-9da1-a4bddd8a1cab@github.com> <9FUQAr0xmQBG3wOpjJeBKaDQYmUA5Cd7liqZ0HhHCbo=.9ea5cc3f-0250-4698-99bf-681e285696bd@github.com>
Message-ID: 

On Sat, 21 May 2022 05:12:08 GMT, Xin Liu wrote:

>> @navyxliu, please, explain what do you mean "snapshot approach" and suggest how we should do it.
> 
> In previous revision, @aamarsh used 3 data members of Compile. This tuple is a snapshot.
> 
> 
> _local_no_escape_ctr
> _local_arg_escape_ctr
> _local_global_escape_ctr
> 
> 
> ConnectionGraph::escape_state_statistics() initializes them all to zero and categorizes Java objects of the connection graph.
> 
> here is the framework.
> > do { > EscapeAnalysis(); > MacroExpand.eliminate_macro_nodes(); > } while (progress?); > > #ifndef PRODUCT > Atomic::add(&ConnectionGraph::_no_escape_counter, _local_no_escape_ctr + total_scalar_replaced); > Atomic::add(&ConnectionGraph::_arg_escape_counter, _local_arg_escape_ctr); > Atomic::add(&ConnectionGraph::_global_escape_counter, _local_global_escape_ctr); > #endif > > Both you and @JohnTortugo pointed out that there was a bug in previous revision. We overlook that non-escaped objects are double-counted in last iteration. I think it's amendable. > > I call this snapshot approach because it uses the snapshot of last iteration to update global statistical counters. all intermediate snapshots are drop. Allow me to write down @aamarsh 's approach. > > The number of Java object `JO` is from user-program by nature. EscapeAnalysis breaks them down into 3 categories. non-escaped, arg-escaped and global escaped. Without iterative EA, we can report this snapshot to statistical counters. > > With iterative EA, a problem arise. Some objects elided in previous iteration of MacroExpansion. if we use last snapshot, we have to add those eliminated java objects. Let's say iterative EA iterates N times in total. `E` is the number of eliminated java object from 1 to N-1 iterations (exclude the last iteration here). > > Since all eliminated objects must be non-escaped, so we add E back to _no_escape_counter. We account for all java objects of this compilation unit. > > Atomic::add(&ConnectionGraph::_no_escape_counter, _local_no_escape_ctr + E); > Atomic::add(&ConnectionGraph::_arg_escape_counter, _local_arg_escape_ctr); > Atomic::add(&ConnectionGraph::_global_escape_counter, _local_global_escape_ctr); > > > In this way, We keep track of all java objects of this CU for iterative EA. I think it's accurate. 
> > JO = _local_no_escape_ctr + E + _local_arg_escape_ctr + _local_global_escape_ctr > > > I suggest to hoist the variable `PhaseMacroExpand mexp` out of loop and make it stateful to track E. Got it. Thank for explaining - you simply used data from last iteration. I think you can do similar with much less complexity if you used data from **first** iteration and current code change: void ConnectionGraph::escape_state_statistics(GrowableArray& java_objects_worklist) { if (!PrintOptoStatistics || (_invocation > 0)) { // Collect data only for first invocation return; } for (int next = 0; next < java_objects_worklist.length(); ++next) { ------------- PR: https://git.openjdk.java.net/jdk/pull/8019 From kvn at openjdk.java.net Sat May 21 15:58:53 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Sat, 21 May 2022 15:58:53 GMT Subject: RFR: 8282024: add EscapeAnalysis statistics under PrintOptoStatistics [v14] In-Reply-To: References: <7ffUzFtOeZBXuqaImQG0u9CLQTGp2u4sVxPtFtxBxck=.42e904a4-ba7b-46df-9da1-a4bddd8a1cab@github.com> <9FUQAr0xmQBG3wOpjJeBKaDQYmUA5Cd7liqZ0HhHCbo=.9ea5cc3f-0250-4698-99bf-681e285696bd@github.com> Message-ID: On Sat, 21 May 2022 15:52:09 GMT, Vladimir Kozlov wrote: >> In previous revision, @aamarsh used 3 data members of Compile. This tuple is a snapshot. >> >> >> _local_no_escape_ctr >> _local_arg_escape_ctr >> _local_global_escape_ctr >> >> >> ConnectionGraph::escape_state_statistics() initializes them all zeros and categorize Java objects of the connection graph. >> >> here is the framework. 
>> >> do { >> EscapeAnalysis(); >> MacroExpand.eliminate_macro_nodes(); >> } while (progress?); >> >> #ifndef PRODUCT >> Atomic::add(&ConnectionGraph::_no_escape_counter, _local_no_escape_ctr + total_scalar_replaced); >> Atomic::add(&ConnectionGraph::_arg_escape_counter, _local_arg_escape_ctr); >> Atomic::add(&ConnectionGraph::_global_escape_counter, _local_global_escape_ctr); >> #endif >> >> Both you and @JohnTortugo pointed out that there was a bug in previous revision. We overlook that non-escaped objects are double-counted in last iteration. I think it's amendable. >> >> I call this snapshot approach because it uses the snapshot of last iteration to update global statistical counters. all intermediate snapshots are drop. Allow me to write down @aamarsh 's approach. >> >> The number of Java object `JO` is from user-program by nature. EscapeAnalysis breaks them down into 3 categories. non-escaped, arg-escaped and global escaped. Without iterative EA, we can report this snapshot to statistical counters. >> >> With iterative EA, a problem arise. Some objects elided in previous iteration of MacroExpansion. if we use last snapshot, we have to add those eliminated java objects. Let's say iterative EA iterates N times in total. `E` is the number of eliminated java object from 1 to N-1 iterations (exclude the last iteration here). >> >> Since all eliminated objects must be non-escaped, so we add E back to _no_escape_counter. We account for all java objects of this compilation unit. >> >> Atomic::add(&ConnectionGraph::_no_escape_counter, _local_no_escape_ctr + E); >> Atomic::add(&ConnectionGraph::_arg_escape_counter, _local_arg_escape_ctr); >> Atomic::add(&ConnectionGraph::_global_escape_counter, _local_global_escape_ctr); >> >> >> In this way, We keep track of all java objects of this CU for iterative EA. I think it's accurate. 
>> >> JO = _local_no_escape_ctr + E + _local_arg_escape_ctr + _local_global_escape_ctr >> >> >> I suggest to hoist the variable `PhaseMacroExpand mexp` out of loop and make it stateful to track E. > > Got it. Thank for explaining - you simply used data from last iteration. > I think you can do similar with much less complexity if you used data from **first** iteration and current code change: > > void ConnectionGraph::escape_state_statistics(GrowableArray& java_objects_worklist) { > if (!PrintOptoStatistics || (_invocation > 0)) { // Collect data only for first invocation > return; > } > for (int next = 0; next < java_objects_worklist.length(); ++next) { You need to revert `escape_state_statistics` back to non static for this of cause. ------------- PR: https://git.openjdk.java.net/jdk/pull/8019 From kvn at openjdk.java.net Sat May 21 16:46:58 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Sat, 21 May 2022 16:46:58 GMT Subject: RFR: 8283699: Improve the peephole mechanism of hotspot [v2] In-Reply-To: References: Message-ID: <8gOmfsbM9h5kbBCOw7pJjIJGHY6pXECXFn9f7aG2Els=.69417724-bdcd-4259-a209-37b3f7d6f790@github.com> On Sat, 21 May 2022 12:16:23 GMT, Quan Anh Mai wrote: >> Hi, >> >> The current peephole mechanism has several drawbacks: >> - Can only match and remove adjacent instructions. >> - Cannot match machine ideal nodes (e.g MachSpillCopyNode). >> - Can only replace 1 instruction, the position of insertion is limited to the position at which the matched nodes reside. >> - Is actually broken since the nodes are not connected properly and OptoScheduling requires true dependencies between nodes. >> >> The patch proposes to enhance the peephole mechanism by allowing a peep rule to call into a dedicated function, which takes the responsibility to perform all required transformations on the basic block. This allows the peephole mechanism to perform several transformations effectively in a more fine-grain manner. 
>> 
>> The patch uses the peephole optimisation to perform some classic peepholes, transforming on x86 the sequences:
>> 
>>     mov r1, r2   ->   lea r1, [r2 + r3/i]
>>     add r1, r3/i
>> 
>> and
>> 
>>     mov r1, r2   ->   lea r1, [r2 << i], with i = 1, 2, 3
>>     shl r1, i
>> 
>> On the added benchmarks, the transformations show positive results:
>> 
>> Benchmark             Mode  Cnt     Score     Error  Units
>> LeaPeephole.B_D_int   avgt    5  1200.490 ± 104.662  ns/op
>> LeaPeephole.B_D_long  avgt    5  1211.439 ±  30.196  ns/op
>> LeaPeephole.B_I_int   avgt    5  1118.831 ±   7.995  ns/op
>> LeaPeephole.B_I_long  avgt    5  1112.389 ±  15.838  ns/op
>> LeaPeephole.I_S_int   avgt    5  1262.528 ±   7.293  ns/op
>> LeaPeephole.I_S_long  avgt    5  1223.820 ±  17.777  ns/op
>> 
>> Benchmark             Mode  Cnt     Score     Error  Units
>> LeaPeephole.B_D_int   avgt    5   860.889 ±   6.089  ns/op
>> LeaPeephole.B_D_long  avgt    5   945.455 ±  21.603  ns/op
>> LeaPeephole.B_I_int   avgt    5   849.109 ±   9.809  ns/op
>> LeaPeephole.B_I_long  avgt    5   851.283 ±  16.921  ns/op
>> LeaPeephole.I_S_int   avgt    5   976.594 ±  23.004  ns/op
>> LeaPeephole.I_S_long  avgt    5   936.984 ±   9.601  ns/op
>> 
>> A following patch would add IR tests for these transformations since the IR framework has not been able to parse the ideal scheduling yet although printing the scheduling itself has been made possible recently.
>> 
>> Thank you very much.
> 
> Quan Anh Mai has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 23 commits:
> 
>  - Merge branch 'master' into peephole
>  - some fix
>  - add benchmark
>  - Merge branch 'master' into peephole
>  - refactor
>  - fix?
>  - refactor
>  - attempt
>  - attempt
>  - build fix
>  - ... and 13 more: https://git.openjdk.java.net/jdk/compare/72bd41b8...78b4a3f2

Very interesting. Let me test it.
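For readers who want to see the shape of the two rewrites quoted above, here is a toy sketch in Java. It is not how the patch works: the real rules operate on C2 machine nodes inside a basic block, while `LeaPeepholeSketch` and `Insn` below are hypothetical names working on plain strings. The sketch also assumes the flags written by `add`/`shl` are dead, since `lea` does not set flags; a real pass would have to verify that.

```java
import java.util.ArrayList;
import java.util.List;

// Toy peephole over a straight-line instruction list (names are illustrative).
final class LeaPeepholeSketch {
    static final class Insn {
        final String op, dst, src;
        Insn(String op, String dst, String src) {
            this.op = op;
            this.dst = dst;
            this.src = src;
        }
        @Override
        public String toString() {
            return op + " " + dst + ", " + src;
        }
    }

    /** Shift counts that an x86 lea scale can express, as listed in the mail. */
    static boolean isLeaShift(String count) {
        return count.equals("1") || count.equals("2") || count.equals("3");
    }

    static List<Insn> run(List<Insn> block) {
        List<Insn> out = new ArrayList<>();
        int i = 0;
        while (i < block.size()) {
            Insn a = block.get(i);
            Insn b = (i + 1 < block.size()) ? block.get(i + 1) : null;
            // Match "mov r1, r2" followed by an update of the same r1 whose
            // second operand is not r1 itself (kept out to stay simple).
            if (b != null && a.op.equals("mov") && b.dst.equals(a.dst) && !b.src.equals(a.dst)) {
                if (b.op.equals("add")) {
                    out.add(new Insn("lea", a.dst, "[" + a.src + " + " + b.src + "]"));
                    i += 2;
                    continue;
                }
                if (b.op.equals("shl") && isLeaShift(b.src)) {
                    out.add(new Insn("lea", a.dst, "[" + a.src + " << " + b.src + "]"));
                    i += 2;
                    continue;
                }
            }
            out.add(a); // no match: keep the instruction unchanged
            i++;
        }
        return out;
    }
}
```

Feeding it `mov rax, rbx; add rax, rcx` yields a single `lea rax, [rbx + rcx]`, while a `shl` by 4 is left alone because it has no matching `lea` form.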
-------------

PR: https://git.openjdk.java.net/jdk/pull/8025

From kvn at openjdk.java.net Sat May 21 19:20:50 2022
From: kvn at openjdk.java.net (Vladimir Kozlov)
Date: Sat, 21 May 2022 19:20:50 GMT
Subject: RFR: 8285973: x86_64: Improve fp comparison and cmove for eq/ne [v3]
In-Reply-To: 
References: 
Message-ID: 

On Sat, 21 May 2022 15:30:29 GMT, Vladimir Kozlov wrote:

> I started our testing. Please wait for the results.

Testing passed clean.

-------------

PR: https://git.openjdk.java.net/jdk/pull/8525

From xliu at openjdk.java.net Sun May 22 03:08:59 2022
From: xliu at openjdk.java.net (Xin Liu)
Date: Sun, 22 May 2022 03:08:59 GMT
Subject: RFR: 8282024: add EscapeAnalysis statistics under PrintOptoStatistics [v14]
In-Reply-To: 
References: <7ffUzFtOeZBXuqaImQG0u9CLQTGp2u4sVxPtFtxBxck=.42e904a4-ba7b-46df-9da1-a4bddd8a1cab@github.com> <9FUQAr0xmQBG3wOpjJeBKaDQYmUA5Cd7liqZ0HhHCbo=.9ea5cc3f-0250-4698-99bf-681e285696bd@github.com>
Message-ID: <2qasptODq41KF224gC9HuP1q4Yju1guGRjOmUK2H2tk=.536ac560-d9be-4022-b028-716c10d2cb7a@github.com>

On Sat, 21 May 2022 15:55:44 GMT, Vladimir Kozlov wrote:

>> Got it. Thanks for explaining - you simply used data from last iteration.
>> I think you can do similar with much less complexity if you used data from **first** iteration and current code change:
>> 
>> void ConnectionGraph::escape_state_statistics(GrowableArray& java_objects_worklist) {
>>   if (!PrintOptoStatistics || (_invocation > 0)) { // Collect data only for first invocation
>>     return;
>>   }
>>   for (int next = 0; next < java_objects_worklist.length(); ++next) {
> 
> You need to revert `escape_state_statistics` back to non static for this of course.

So you use the snapshot of the first iteration. We assume Iterative EA can elide some non-escaped objects but can't change any object's category. I think it is correct and indeed much simpler!
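The equivalence of the two counting schemes discussed in this thread can be illustrated with a small simulation. This is a sketch in Java under the assumption stated above (iterative EA may eliminate non-escaped objects but never moves an object to another category); `EaStatsSketch`, `Obj` and `eliminatedInRound` are hypothetical names and do not mirror the C2 data structures.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of per-compilation escape statistics under iterative EA.
final class EaStatsSketch {
    enum Escape { NO_ESCAPE, ARG_ESCAPE, GLOBAL_ESCAPE }

    static final class Obj {
        final Escape state;
        final int eliminatedInRound; // -1 means the object is never eliminated

        Obj(Escape state, int eliminatedInRound) {
            if (eliminatedInRound >= 0 && state != Escape.NO_ESCAPE) {
                throw new IllegalArgumentException("only non-escaped objects can be eliminated");
            }
            this.state = state;
            this.eliminatedInRound = eliminatedInRound;
        }
    }

    /** Counts the live objects per escape state, indexed by Escape.ordinal(). */
    static int[] snapshot(List<Obj> live) {
        int[] counts = new int[Escape.values().length];
        for (Obj o : live) {
            counts[o.state.ordinal()]++;
        }
        return counts;
    }

    /** Last-iteration snapshot plus E, the objects eliminated in rounds 1..N-1. */
    static int[] lastSnapshotPlusEliminated(List<Obj> all, int rounds) {
        List<Obj> live = new ArrayList<>(all);
        int[] last = new int[Escape.values().length];
        int eliminated = 0;
        for (int round = 0; round < rounds; round++) {
            last = snapshot(live);
            if (round == rounds - 1) {
                break; // E excludes the last iteration
            }
            final int r = round;
            int before = live.size();
            live.removeIf(o -> o.eliminatedInRound == r);
            eliminated += before - live.size();
        }
        last[Escape.NO_ESCAPE.ordinal()] += eliminated;
        return last;
    }

    /** First-iteration snapshot only, like the `_invocation > 0` early return. */
    static int[] firstSnapshotOnly(List<Obj> all, int rounds) {
        List<Obj> live = new ArrayList<>(all);
        int[] counts = new int[Escape.values().length];
        for (int round = 0; round < rounds; round++) {
            if (round == 0) {
                counts = snapshot(live);
            }
            final int r = round;
            live.removeIf(o -> o.eliminatedInRound == r);
        }
        return counts;
    }
}
```

With the same object list and round count, both methods should return the same three totals, which is the argument for preferring the simpler first-iteration variant.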
------------- PR: https://git.openjdk.java.net/jdk/pull/8019 From duke at openjdk.java.net Mon May 23 04:19:50 2022 From: duke at openjdk.java.net (Cesar Soares) Date: Mon, 23 May 2022 04:19:50 GMT Subject: RFR: 8282024: add EscapeAnalysis statistics under PrintOptoStatistics [v14] In-Reply-To: <2qasptODq41KF224gC9HuP1q4Yju1guGRjOmUK2H2tk=.536ac560-d9be-4022-b028-716c10d2cb7a@github.com> References: <7ffUzFtOeZBXuqaImQG0u9CLQTGp2u4sVxPtFtxBxck=.42e904a4-ba7b-46df-9da1-a4bddd8a1cab@github.com> <9FUQAr0xmQBG3wOpjJeBKaDQYmUA5Cd7liqZ0HhHCbo=.9ea5cc3f-0250-4698-99bf-681e285696bd@github.com> <2qasptODq41KF224gC9HuP1q4Yju1guGRjOmUK2H2tk=.536ac560-d9be-4022-b028-716c10d2cb7a@github.com> Message-ID: On Sun, 22 May 2022 03:05:30 GMT, Xin Liu wrote: >> You need to revert `escape_state_statistics` back to non static for this of cause. > > so you use the snapshot of first iteration. We assume Iterative EA can elide some non-escaped objects but can't change any object's category. I think it is correct and indeed much simpler! That was a great idea. Thank you all. I'll work with @aamarsh to do the changes. ------------- PR: https://git.openjdk.java.net/jdk/pull/8019 From duke at openjdk.java.net Mon May 23 04:57:02 2022 From: duke at openjdk.java.net (Srinivas Vamsi Parasa) Date: Mon, 23 May 2022 04:57:02 GMT Subject: RFR: 8285868: x86_64 intrinsics for floating point methods isNaN, isFinite and isInfinite [v8] In-Reply-To: References: Message-ID: <36atcyahkqDuS1uJui1trz8Zq-8PiySyaDDHEJ9wd48=.695e9883-d1f8-4a86-9aa8-0482381aaf0f@github.com> On Sat, 21 May 2022 15:42:34 GMT, Vladimir Kozlov wrote: > > For 32bit, in the case of double, we see performance improvement using `vfpclasssd` instruction but **without** `vfpclassd`, we see **40% decrease** in performance for `isFinite()` compared to the original Java code. Below, is the code which implements the intrinsic using SSE. > > Is it Ok to skip support for **non** `vfpclassd` for 32bit? > > Yes, but add comment about that. 
Also for 32-bit you need to check SSE2 support which is required by `pshuflw`. Thanks Vladimir! Will add a comment that the intrinsic doesn't give speedup without `vfpclasssd`. Yes, the check for `predicate(UseSSE>=2)` was added in the macro shown below. instruct DoubleClassCheck_reg_reg_sse(rRegI dst, regD src, rRegI tmp, rRegI tmp1, rFlagsReg cr) %{ predicate(UseSSE>=2); match(Set dst (IsInfiniteD src)); match(Set dst (IsNaND src)); match(Set dst (IsFiniteD src)); effect(TEMP tmp, TEMP tmp1, KILL cr); format %{ "double_class_check $dst, $src" %} ins_encode %{ int opcode = this->ideal_Opcode(); __ double_class_check_sse(opcode, $src$$XMMRegister, $dst$$Register, $tmp$$Register, $tmp1$$Register); %} ins_pipe(pipe_slow); %} ------------- PR: https://git.openjdk.java.net/jdk/pull/8459 From duke at openjdk.java.net Mon May 23 05:15:52 2022 From: duke at openjdk.java.net (Srinivas Vamsi Parasa) Date: Mon, 23 May 2022 05:15:52 GMT Subject: RFR: 8285868: x86_64 intrinsics for floating point methods isNaN, isFinite and isInfinite [v8] In-Reply-To: <36atcyahkqDuS1uJui1trz8Zq-8PiySyaDDHEJ9wd48=.695e9883-d1f8-4a86-9aa8-0482381aaf0f@github.com> References: <36atcyahkqDuS1uJui1trz8Zq-8PiySyaDDHEJ9wd48=.695e9883-d1f8-4a86-9aa8-0482381aaf0f@github.com> Message-ID: On Mon, 23 May 2022 04:52:41 GMT, Srinivas Vamsi Parasa wrote: >>> For 32bit, in the case of double, we see performance improvement using `vfpclasssd` instruction but **without** `vfpclassd`, we see **40% decrease** in performance for `isFinite()` compared to the original Java code. Below, is the code which implements the intrinsic using SSE. >>> >>> Is it Ok to skip support for **non** `vfpclassd` for 32bit? >> >> Yes, but add comment about that. Also for 32-bit you need to check SSE2 support which is required by `pshuflw`. 
> >> > For 32bit, in the case of double, we see performance improvement using `vfpclasssd` instruction but **without** `vfpclassd`, we see **40% decrease** in performance for `isFinite()` compared to the original Java code. Below, is the code which implements the intrinsic using SSE. >> > Is it Ok to skip support for **non** `vfpclassd` for 32bit? >> >> Yes, but add comment about that. Also for 32-bit you need to check SSE2 support which is required by `pshuflw`. > > Thanks Vladimir! Will add a comment that the intrinsic doesn't give speedup without `vfpclasssd`. > Yes, the check for `predicate(UseSSE>=2)` was added in the macro shown below. > > instruct DoubleClassCheck_reg_reg_sse(rRegI dst, regD src, rRegI tmp, rRegI tmp1, rFlagsReg cr) > %{ > predicate(UseSSE>=2); > match(Set dst (IsInfiniteD src)); > match(Set dst (IsNaND src)); > match(Set dst (IsFiniteD src)); > effect(TEMP tmp, TEMP tmp1, KILL cr); > format %{ "double_class_check $dst, $src" %} > ins_encode %{ > int opcode = this->ideal_Opcode(); > __ double_class_check_sse(opcode, $src$$XMMRegister, $dst$$Register, $tmp$$Register, > $tmp1$$Register); > %} > ins_pipe(pipe_slow); > %} > @vamsi-parasa I modified your benchmark to emulate more use cases of these functions and run it on the baseline, #8525 with modified `isInfinite` (to use `Math.abs(v) > MAX_VALUE` instead) and this patch. The result is as follows, the source code and the assembly for the interesting parts will be shown later > > ``` > Baseline #8459 #8525 > Benchmark Mode Cnt Score Error Score Error Score Error Units > FloatClassCheck.testIsFiniteBranch avgt 5 2.522 ? 0.094 2.564 ? 0.187 2.512 ? 0.137 ns/op > FloatClassCheck.testIsFiniteCMov avgt 5 0.479 ? 0.014 0.786 ? 0.009 0.475 ? 0.005 ns/op > FloatClassCheck.testIsFiniteStore avgt 5 0.482 ? 0.010 0.603 ? 0.026 0.480 ? 0.006 ns/op > FloatClassCheck.testIsInfiniteBranch avgt 5 1.921 ? 0.043 1.778 ? 0.023 1.767 ? 0.039 ns/op > FloatClassCheck.testIsInfiniteCMov avgt 5 1.124 ? 0.045 0.787 ? 
0.013 0.622 ? 0.019 ns/op > FloatClassCheck.testIsInfiniteStore avgt 5 1.195 ? 0.033 0.602 ? 0.015 0.625 ? 0.033 ns/op > FloatClassCheck.testIsNaNBranch avgt 5 1.896 ? 0.182 2.097 ? 0.216 1.725 ? 0.222 ns/op > FloatClassCheck.testIsNaNCMov avgt 5 2.956 ? 0.021 0.856 ? 0.003 0.390 ? 0.006 ns/op > FloatClassCheck.testIsNaNStore avgt 5 3.024 ? 0.071 0.741 ? 0.139 0.410 ? 0.008 ns/op > > Baseline #8459 #8525 > Benchmark Mode Cnt Score Error Score Error Score Error Units > DoubleClassCheck.testIsFiniteBranch avgt 5 2.566 ? 0.105 3.023 ? 0.117 2.603 ? 0.137 ns/op > DoubleClassCheck.testIsFiniteCMov avgt 5 0.481 ? 0.010 0.978 ? 0.011 0.485 ? 0.018 ns/op > DoubleClassCheck.testIsFiniteStore avgt 5 0.480 ? 0.012 0.943 ? 0.012 0.486 ? 0.011 ns/op > DoubleClassCheck.testIsInfiniteBranch avgt 5 1.907 ? 0.081 1.917 ? 0.065 1.808 ? 0.039 ns/op > DoubleClassCheck.testIsInfiniteCMov avgt 5 1.111 ? 0.028 0.982 ? 0.019 0.630 ? 0.017 ns/op > DoubleClassCheck.testIsInfiniteStore avgt 5 1.134 ? 0.011 0.944 ? 0.017 0.630 ? 0.009 ns/op > DoubleClassCheck.testIsNaNBranch avgt 5 1.926 ? 0.218 2.193 ? 0.045 1.767 ? 0.142 ns/op > DoubleClassCheck.testIsNaNCMov avgt 5 2.944 ? 0.020 1.047 ? 0.012 0.392 ? 0.009 ns/op > DoubleClassCheck.testIsNaNStore avgt 5 3.011 ? 0.065 0.946 ? 0.029 0.411 ? 0.004 ns/op > ``` > > The source code for `FloatClassCheck`, that of `DoubleClassCheck` is similar > > ``` > RandomGenerator rng; > static final int BUFFER_SIZE = 1024; > float[] inputs; > boolean[] storeOutputs; > int[] cmovOutputs; > int[] branchOutputs; > > @CompilerControl(CompilerControl.Mode.DONT_INLINE) > static int call() { > return 1; > } > > @Setup > public void setup() { > storeOutputs = new boolean[BUFFER_SIZE]; > cmovOutputs = new int[BUFFER_SIZE]; > branchOutputs = new int[BUFFER_SIZE]; > inputs = new float[BUFFER_SIZE]; > RandomGenerator rng = RandomGeneratorFactory.getDefault().create(0); > float input; > for (int i = 0; i < BUFFER_SIZE; i++) { > if (i % 5 == 0) { > input = (i%2 == 0) ? 
Float.NEGATIVE_INFINITY : Float.POSITIVE_INFINITY; > } > else if (i % 3 == 0) input = Float.NaN; > else input = rng.nextFloat(); > inputs[i] = input; > } > } > > @Benchmark > @OperationsPerInvocation(BUFFER_SIZE) > public void testIsFiniteStore() { > for (int i = 0; i < BUFFER_SIZE; i++) { > storeOutputs[i] = Float.isFinite(inputs[i]); > } > } > > @Benchmark > @OperationsPerInvocation(BUFFER_SIZE) > public void testIsInfiniteStore() { > for (int i = 0; i < BUFFER_SIZE; i++) { > storeOutputs[i] = Float.isInfinite(inputs[i]); > } > } > > @Benchmark > @OperationsPerInvocation(BUFFER_SIZE) > public void testIsNaNStore() { > for (int i = 0; i < BUFFER_SIZE; i++) { > storeOutputs[i] = Float.isNaN(inputs[i]); > } > } > > @Benchmark > @OperationsPerInvocation(BUFFER_SIZE) > public void testIsFiniteCMov() { > for (int i = 0; i < BUFFER_SIZE; i++) { > cmovOutputs[i] = Float.isFinite(inputs[i]) ? 9 : 7; > } > } > > @Benchmark > @OperationsPerInvocation(BUFFER_SIZE) > public void testIsInfiniteCMov() { > for (int i = 0; i < BUFFER_SIZE; i++) { > cmovOutputs[i] = Float.isInfinite(inputs[i]) ? 9 : 7; > } > } > > @Benchmark > @OperationsPerInvocation(BUFFER_SIZE) > public void testIsNaNCMov() { > for (int i = 0; i < BUFFER_SIZE; i++) { > cmovOutputs[i] = Float.isNaN(inputs[i]) ? 9 : 7; > } > } > > @Benchmark > @OperationsPerInvocation(BUFFER_SIZE) > public void testIsFiniteBranch() { > for (int i = 0; i < BUFFER_SIZE; i++) { > cmovOutputs[i] = Float.isFinite(inputs[i]) ? call() : 7; > } > } > > @Benchmark > @OperationsPerInvocation(BUFFER_SIZE) > public void testIsInfiniteBranch() { > for (int i = 0; i < BUFFER_SIZE; i++) { > cmovOutputs[i] = Float.isInfinite(inputs[i]) ? call() : 7; > } > } > > @Benchmark > @OperationsPerInvocation(BUFFER_SIZE) > public void testIsNaNBranch() { > for (int i = 0; i < BUFFER_SIZE; i++) { > cmovOutputs[i] = Float.isNaN(inputs[i]) ? 
call() : 7; > } > } > ``` > > The assembly of the interesting parts of the executions: > > ``` > FloatClassCheck::testIsFiniteBranch: > Baseline, #8525: > vandps -0xd4791(%rip), %xmm0, %xmm1 > vucomiss %xmm1, %xmm2 > jae -0x77 > > #8459 > vmovd %xmm0, %r11d > andl $0x7fffffff, %r11d > cmpl $0x7f800000, %r11d > setb %r10b > andl $0xff, %r10d > testl %r10d, %r10d > jne -0x90 > > FloatClassCheck::testIsFiniteCMov: > Baseline, #8525: > vandps -0xcdcf6(%rip), %xmm6, %xmm7 > vucomiss %xmm7, %xmm10 > movl $0x9, %r10d > cmovbl %r14d, %r10d > > #8459: > vmovd %xmm4, %r9d > andl $0x7fffffff, %r9d > cmpl $0x7f800000, %r9d > setb %r8b > andl $0xff, %r8d > testl %r8d, %r8d > movl $0x7, %ebx > cmovnel %r14d, %ebx > > FloatClassCheck::isFiniteStore: > Baseline, #8525: > vandps -0xcfd74(%rip), %xmm3, %xmm3 > movl $0x1, %r10d > vucomiss %xmm3, %xmm0 > cmovbl %r9d, %r10d > > #8459: > vmovd %xmm6, %edi > andl $0x7fffffff, %edi > cmpl $0x7f800000, %edi > setb %dil > andl $0xff, %edi > > FloatClassCheck::isInfiniteBranch: > Baseline: > vucomiss -0xc8(%rip), %xmm1 > jp 0x2 > je 0x20 > vucomiss -0xd0(%rip), %xmm1 > nopl (%rax,%rax) > nop > jp -0x86 > jne -0x8c > > #8459: > vmovd %xmm1, %r10d > andl $0x7fffffff, %r10d > cmpl $0x7f800000, %r10d > sete %r11b > andl $0xff, %r11d > testl %r11d, %r11d > je -0x87 > > #8525: > vandps -0xce478(%rip), %xmm1, %xmm0 > nopl (%rax,%rax) > vucomiss -0xc8(%rip), %xmm0 > jbe -0x76 > > FloatClassCheck::isInfiniteCMov: > Baseline: > vucomiss -0x128(%rip), %xmm1 > jp 0x2 > je 0x16 > vucomiss -0x130(%rip), %xmm1 > jp 0x2 > je 0xa > > #8459: > vmovd %xmm5, %eax > andl $0x7fffffff, %eax > cmpl $0x7f800000, %eax > sete %bpl > andl $0xff, %ebp > testl %ebp, %ebp > movl $0x7, %eax > cmovnel %ebx, %eax > > #8525: > vandps -0xcefc3(%rip), %xmm0, %xmm0 > vucomiss -0x12b(%rip), %xmm0 > movl $0x9, %esi > cmovbel %r8d, %esi > > FloatClassCheck::isInfiniteStore: > Baseline: > vucomiss -0x128(%rip), %xmm0 > jp 0x2 > je 0x11 > vucomiss -0x130(%rip), %xmm0 > jp 0x2 > je 
0x5 > > #8459: > vmovd %xmm2, %r8d > andl $0x7fffffff, %r8d > cmpl $0x7f800000, %r8d > sete %r8b > andl $0xff, %r8d > > #8525: > vandps -0xcf2b9(%rip), %xmm0, %xmm0 > vucomiss -0x121(%rip), %xmm0 > movl $0x1, %r11d > cmovbel %esi, %r11d > > FloatClassCheck::isNaNBranch: > Baseline: > vucomiss %xmm0, %xmm0 > jp 0x2 > je -0x64 > > #8459: > vmovd %xmm1, %r10d > andl $0x7fffffff, %r10d > cmpl $0x7f800000, %r10d > seta %r11b > andl $0xff, %r11d > testl %r11d, %r11d > je -0x87 > > #8525: > vucomiss %xmm1, %xmm1 > jnp -0x62 > > FloatClassCheck::isNaNCMov: > Baseline: > vucomiss %xmm5, %xmm5 > jnp 0xa > pushfq > andq $-0xd5, (%rsp) > popfq > movl $0x7, %r9d > cmovnel %r8d, %r9d > > #8459: > vmovd %xmm4, %ebp > andl $0x7fffffff, %ebp > cmpl $0x7f800000, %ebp > seta %al > andl $0xff, %eax > testl %eax, %eax > movl $0x7, %ebp > cmovnel %ebx, %ebp > > #8525: > vucomiss %xmm4, %xmm4 > movl $0x7, %r9d > cmovpl %r8d, %r9d > > FloatClassCheck::isNaNStore: > Baseline: > vucomiss %xmm3, %xmm3 > jnp 0xa > pushfq > andq $-0xd5, (%rsp) > popfq > movl $0x1, %ebx > cmovel %eax, %ebx > > #8459: > vmovd %xmm6, %edi > andl $0x7fffffff, %edi > cmpl $0x7f800000, %edi > seta %dil > andl $0xff, %edi > > #8525: > movl $0x1, %r9d > vucomiss %xmm0, %xmm0 > cmovnpl %r10d, %r9d > ``` > > The assembly output for `DoubleClassCheck` is similar. Thanks. Thanks for sharing the performance data. Your patch shows a ~2.5x improvement over the intrinsic for the case of `{Float/Double}.testIsNaNCMov`. ------------- PR: https://git.openjdk.java.net/jdk/pull/8459 From rcastanedalo at openjdk.java.net Mon May 23 06:30:49 2022 From: rcastanedalo at openjdk.java.net (Roberto Castañeda Lozano) Date: Mon, 23 May 2022 06:30:49 GMT Subject: RFR: 8286177: C2: "failed: non-reduction loop contains reduction nodes" assert failure In-Reply-To: References: Message-ID: On Fri, 20 May 2022 16:03:21 GMT, Vladimir Kozlov wrote: > I agree with suggested change. Thanks for reviewing, Vladimir! 
------------- PR: https://git.openjdk.java.net/jdk/pull/8805 From xgong at openjdk.java.net Mon May 23 09:04:02 2022 From: xgong at openjdk.java.net (Xiaohong Gong) Date: Mon, 23 May 2022 09:04:02 GMT Subject: RFR: 8286847: Rotate vectors don't support byte or short [v3] In-Reply-To: References: <9zUsdzzbL3abbWSfrBcU3pqp8sPIJhP2B08nKAmMBZE=.381b9c9e-5111-45c5-aa2a-fd3ce502d6b6@github.com> Message-ID: On Fri, 20 May 2022 07:12:59 GMT, Haomin wrote: >> yes, just cherry-pick my patch. And I just build the latest vectorIntrinsics branch, also met the error. > > - CFLAGS_WARNINGS_ARE_ERRORS="-Werror" > + CFLAGS_WARNINGS_ARE_ERRORS="" > ... > - JAVA_WARNINGS_ARE_ERRORS ?= -Werror > + JAVA_WARNINGS_ARE_ERRORS ?= > ... > > with the dirty diff, I have tested Byte128Vector. > Yes, the Score about rotate is lower than before my patch. > Could you give me some suggestions? Since `"VectorNode::implemented"` is only used for auto-vect, I think we can do the check in `"VectorNode::is_vector_rotate_supported" `and return false for byte and short type. ------------- PR: https://git.openjdk.java.net/jdk/pull/8740 From xgong at openjdk.java.net Mon May 23 09:04:02 2022 From: xgong at openjdk.java.net (Xiaohong Gong) Date: Mon, 23 May 2022 09:04:02 GMT Subject: RFR: 8286847: Rotate vectors don't support byte or short [v3] In-Reply-To: References: <9zUsdzzbL3abbWSfrBcU3pqp8sPIJhP2B08nKAmMBZE=.381b9c9e-5111-45c5-aa2a-fd3ce502d6b6@github.com> Message-ID: On Mon, 23 May 2022 08:58:19 GMT, Xiaohong Gong wrote: >> - CFLAGS_WARNINGS_ARE_ERRORS="-Werror" >> + CFLAGS_WARNINGS_ARE_ERRORS="" >> ... >> - JAVA_WARNINGS_ARE_ERRORS ?= -Werror >> + JAVA_WARNINGS_ARE_ERRORS ?= >> ... >> >> with the dirty diff, I have tested Byte128Vector. >> Yes, the Score about rotate is lower than before my patch. >> Could you give me some suggestions? 
> > Since `"VectorNode::implemented"` is only used for auto-vect, I think we can do the check in `"VectorNode::is_vector_rotate_supported" `and return false for byte and short type. Or you can try to add the right support for byte/short for some cases like https://github.com/openjdk/jdk/pull/7979 ------------- PR: https://git.openjdk.java.net/jdk/pull/8740 From jbhateja at openjdk.java.net Mon May 23 09:27:04 2022 From: jbhateja at openjdk.java.net (Jatin Bhateja) Date: Mon, 23 May 2022 09:27:04 GMT Subject: RFR: 8283894: Intrinsify compress and expand bits on x86 [v6] In-Reply-To: References: Message-ID: > Summary of changes: > > - Patch intrinsifies following newly added Java SE APIs > - Integer.compress > - Integer.expand > - Long.compress > - Long.expand > > - Adds C2 IR nodes and corresponding ideal transformations for new operations. > - We see around ~10x performance speedup due to intrinsification over X86 target. > - Adds an IR framework based test to validate newly introduced IR transformations. > > Kindly review and share your feedback. > > Best Regards, > Jatin Jatin Bhateja has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains ten additional commits since the last revision: - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8283894 - 8283894: Removing CompressExpandSanityTest from problem list. - 8283894: Updating test tag spec. - 8283894: Review comments resolved. - 8283894: Add missing -XX:+UnlockDiagnosticVMOptions. - 8283894: Review comments resolutions. - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8283894 - 8283894: Extending IR framework testcase with some functional test points. 
- 8283894: Intrinsify compress and expand bits on x86 ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/8498/files - new: https://git.openjdk.java.net/jdk/pull/8498/files/666e8589..6bb6d343 Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=8498&range=05 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=8498&range=04-05 Stats: 19773 lines in 673 files changed: 9031 ins; 6800 del; 3942 mod Patch: https://git.openjdk.java.net/jdk/pull/8498.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8498/head:pull/8498 PR: https://git.openjdk.java.net/jdk/pull/8498 From duke at openjdk.java.net Mon May 23 11:19:39 2022 From: duke at openjdk.java.net (Tobias Holenstein) Date: Mon, 23 May 2022 11:19:39 GMT Subject: RFR: JDK-8284944: assert(cnt++ < 40) failed: infinite cycle in loop optimization [v2] In-Reply-To: References: <6tYUlU6To3dIk5NZNcyu2PI8m72uLsw09qO_5ca4GBY=.97d63197-96e1-4f4f-b854-0d54c1628267@github.com> <048Bn62N56IG6Z-1e1POyo-gjogYkq9lDy4I5TQXOLk=.f5a67ec6-22f6-4541-b41b-7920851b1bd5@github.com> Message-ID: On Wed, 18 May 2022 15:12:59 GMT, Vladimir Kozlov wrote: > Having > 40 cycles in loopopts is bug - something wrong gone there. That is what this assert for. We should detect such case in loop opts and stop early. > I don't think removing this assert is correct thing. Thanks for your comment! I have attached the output of `TestMaxLoopOptsCountReached.java` run with `-XX:+TraceLoopOpts` in JIRA: https://bugs.openjdk.java.net/secure/attachment/99052/Output%20TraceLoopOpts.txt It does a lot of loop unswitching, but it seems that all loop optimizations are valid. The way the assert is implemented, it can only be triggered if `PartialPeelLoop` is turned off. When `PartialPeelLoop` is turned on, the assert can never be triggered and the loop optimizations are stopped after `LoopOptsCount` iterations (see PR description). 
I don't think it necessarily means that there is a bug if `LoopOptsCount` many loop optimizations are performed. I agree that this could help detect potential bugs, but there can also be false positives (e.g. TestMaxLoopOptsCountReached.java). I am not sure if there is a reliable way to detect such false positives. What do you think? ------------- PR: https://git.openjdk.java.net/jdk/pull/8767 From duke at openjdk.java.net Mon May 23 12:07:06 2022 From: duke at openjdk.java.net (Raffaello Giulietti) Date: Mon, 23 May 2022 12:07:06 GMT Subject: RFR: 8287139: aarch64 intrinsic for unsignedMultiplyHigh Message-ID: Adds aarch64 intrinsic support for `Math.unsignedMultiplyHigh()` ------------- Commit messages: - 8287139: aarch64 intrinsic for unsignedMultiplyHigh Changes: https://git.openjdk.java.net/jdk/pull/8840/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=8840&range=00 Issue: https://bugs.openjdk.java.net/browse/JDK-8287139 Stats: 16 lines in 1 file changed: 16 ins; 0 del; 0 mod Patch: https://git.openjdk.java.net/jdk/pull/8840.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8840/head:pull/8840 PR: https://git.openjdk.java.net/jdk/pull/8840 From duke at openjdk.java.net Mon May 23 12:07:08 2022 From: duke at openjdk.java.net (Raffaello Giulietti) Date: Mon, 23 May 2022 12:07:08 GMT Subject: RFR: 8287139: aarch64 intrinsic for unsignedMultiplyHigh In-Reply-To: References: Message-ID: On Mon, 23 May 2022 11:59:02 GMT, Raffaello Giulietti wrote: > Adds aarch64 intrinsic support for `Math.unsignedMultiplyHigh()` Verified with hsdis 0x00000001163fa830: umulh x10, x10, x10 ;*invokestatic unsignedMultiplyHigh {reexecute=0 rethrow=0 return_oop=0} ; - Umul::main at 17 (line 5) ------------- PR: https://git.openjdk.java.net/jdk/pull/8840 From aph at openjdk.java.net Mon May 23 12:12:36 2022 From: aph at openjdk.java.net (Andrew Haley) Date: Mon, 23 May 2022 12:12:36 GMT Subject: RFR: 8287139: aarch64 intrinsic for unsignedMultiplyHigh In-Reply-To: 
References: Message-ID: On Mon, 23 May 2022 11:59:02 GMT, Raffaello Giulietti wrote: > Adds aarch64 intrinsic support for `Math.unsignedMultiplyHigh()` Marked as reviewed by aph (Reviewer). ------------- PR: https://git.openjdk.java.net/jdk/pull/8840 From duke at openjdk.java.net Mon May 23 12:15:43 2022 From: duke at openjdk.java.net (Raffaello Giulietti) Date: Mon, 23 May 2022 12:15:43 GMT Subject: RFR: 8287139: aarch64 intrinsic for unsignedMultiplyHigh In-Reply-To: References: Message-ID: On Mon, 23 May 2022 12:09:00 GMT, Andrew Haley wrote: >> Adds aarch64 intrinsic support for `Math.unsignedMultiplyHigh()` > > Marked as reviewed by aph (Reviewer). @theRealAph Thanks Andrew, will wait 24 hours for all timezones reviewers. ------------- PR: https://git.openjdk.java.net/jdk/pull/8840 From ngasson at openjdk.java.net Mon May 23 12:46:49 2022 From: ngasson at openjdk.java.net (Nick Gasson) Date: Mon, 23 May 2022 12:46:49 GMT Subject: RFR: 8287139: aarch64 intrinsic for unsignedMultiplyHigh In-Reply-To: References: Message-ID: On Mon, 23 May 2022 11:59:02 GMT, Raffaello Giulietti wrote: > Adds aarch64 intrinsic support for `Math.unsignedMultiplyHigh()` Looks fine apart from the tiny cosmetic issue. Did you try the `MathBench.unsignedMultiplyHighLongLong` JMH benchmark introduced with the x86 implementation? src/hotspot/cpu/aarch64/aarch64.ad line 11151: > 11149: > 11150: ins_cost(INSN_COST * 7); > 11151: format %{ "umulh $dst, $src1, $src2, \t# umulhi" %} There's an unnecessary trailing comma here and in the `smulh` pattern above. 
------------- PR: https://git.openjdk.java.net/jdk/pull/8840 From njian at openjdk.java.net Mon May 23 12:53:47 2022 From: njian at openjdk.java.net (Ningsheng Jian) Date: Mon, 23 May 2022 12:53:47 GMT Subject: RFR: 8287139: aarch64 intrinsic for unsignedMultiplyHigh In-Reply-To: References: Message-ID: On Mon, 23 May 2022 11:59:02 GMT, Raffaello Giulietti wrote: > Adds aarch64 intrinsic support for `Math.unsignedMultiplyHigh()` src/hotspot/cpu/aarch64/aarch64.ad line 11146: > 11144: %} > 11145: > 11146: instruct umulHiL_rReg(iRegLNoSp dst, iRegL src1, iRegL src2, rFlagsReg cr) `rFlagsReg cr` is not used. ------------- PR: https://git.openjdk.java.net/jdk/pull/8840 From duke at openjdk.java.net Mon May 23 13:13:44 2022 From: duke at openjdk.java.net (Raffaello Giulietti) Date: Mon, 23 May 2022 13:13:44 GMT Subject: RFR: 8287139: aarch64 intrinsic for unsignedMultiplyHigh In-Reply-To: References: Message-ID: On Mon, 23 May 2022 12:41:25 GMT, Nick Gasson wrote: >> Adds aarch64 intrinsic support for `Math.unsignedMultiplyHigh()` > > src/hotspot/cpu/aarch64/aarch64.ad line 11151: > >> 11149: >> 11150: ins_cost(INSN_COST * 7); >> 11151: format %{ "umulh $dst, $src1, $src2, \t# umulhi" %} > > There's an unnecessary trailing comma here and in the `smulh` pattern above. I didn't run the benchmark. Isn't a single instruction supposed to be faster anyway, as opposed to the java method? ------------- PR: https://git.openjdk.java.net/jdk/pull/8840 From duke at openjdk.java.net Mon May 23 13:25:40 2022 From: duke at openjdk.java.net (kristylee88) Date: Mon, 23 May 2022 13:25:40 GMT Subject: RFR: 8287139: aarch64 intrinsic for unsignedMultiplyHigh In-Reply-To: References: Message-ID: On Mon, 23 May 2022 11:59:02 GMT, Raffaello Giulietti wrote: > Adds aarch64 intrinsic support for `Math.unsignedMultiplyHigh()` Marked as reviewed by kristylee88 at github.com (no known OpenJDK username). 
------------- PR: https://git.openjdk.java.net/jdk/pull/8840 From duke at openjdk.java.net Mon May 23 13:25:40 2022 From: duke at openjdk.java.net (Raffaello Giulietti) Date: Mon, 23 May 2022 13:25:40 GMT Subject: RFR: 8287139: aarch64 intrinsic for unsignedMultiplyHigh In-Reply-To: References: Message-ID: On Mon, 23 May 2022 12:50:28 GMT, Ningsheng Jian wrote: >> Adds aarch64 intrinsic support for `Math.unsignedMultiplyHigh()` > > src/hotspot/cpu/aarch64/aarch64.ad line 11146: > >> 11144: %} >> 11145: >> 11146: instruct umulHiL_rReg(iRegLNoSp dst, iRegL src1, iRegL src2, rFlagsReg cr) > > `rFlagsReg cr` is not used. Yes, I noticed, but since there are many more unused `rFlagsReg cr` in the file and since I don't know its semantics, I prefer to just leave it as it is. ------------- PR: https://git.openjdk.java.net/jdk/pull/8840 From duke at openjdk.java.net Mon May 23 14:34:42 2022 From: duke at openjdk.java.net (Bhavana-Kilambi) Date: Mon, 23 May 2022 14:34:42 GMT Subject: RFR: 8287139: aarch64 intrinsic for unsignedMultiplyHigh In-Reply-To: References: Message-ID: On Mon, 23 May 2022 13:10:52 GMT, Raffaello Giulietti wrote: >> src/hotspot/cpu/aarch64/aarch64.ad line 11151: >> >>> 11149: >>> 11150: ins_cost(INSN_COST * 7); >>> 11151: format %{ "umulh $dst, $src1, $src2, \t# umulhi" %} >> >> There's an unnecessary trailing comma here and in the `smulh` pattern above. > > I didn't run the benchmark. > Isn't a single instruction supposed to be faster anyway, as opposed to the java method? Hi, I tried the JMH benchmark with this patch, it does generate the "umulh" instruction instead of smulh. 
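For readers comparing the two instructions: `smulh` yields the high 64 bits of the signed 128-bit product, while `umulh` yields the high 64 bits of the unsigned one. The sketch below is illustrative only and is not taken from the patch. It derives the unsigned high word from `Math.multiplyHigh` plus two sign corrections and cross-checks it against `BigInteger`. The names `UmulhSketch`, `unsignedHigh`, and `referenceHigh` are made up for this example.

```java
import java.math.BigInteger;

final class UmulhSketch {
    // High 64 bits of the unsigned 128-bit product x * y, derived from the
    // signed high word. Read as unsigned, a negative x is x + 2^64, so the
    // signed product falls short by y * 2^64 (and symmetrically for y).
    static long unsignedHigh(long x, long y) {
        long hi = Math.multiplyHigh(x, y);
        hi += y & (x >> 63);   // adds y only when x is negative
        hi += x & (y >> 63);   // adds x only when y is negative
        return hi;
    }

    // Slow reference using arbitrary-precision arithmetic.
    static long referenceHigh(long x, long y) {
        BigInteger ux = new BigInteger(Long.toUnsignedString(x));
        BigInteger uy = new BigInteger(Long.toUnsignedString(y));
        return ux.multiply(uy).shiftRight(64).longValue();
    }

    public static void main(String[] args) {
        long[] samples = {0L, 1L, -1L, Long.MIN_VALUE, Long.MAX_VALUE, 0x123456789ABCDEFL};
        for (long x : samples) {
            for (long y : samples) {
                if (unsignedHigh(x, y) != referenceHigh(x, y)) {
                    throw new AssertionError(x + " * " + y);
                }
            }
        }
        // (2^64 - 1)^2 >> 64 == 2^64 - 2
        System.out.println(Long.toUnsignedString(unsignedHigh(-1L, -1L)));
    }
}
```

A single `umulh` replaces this whole sequence in compiled code, which is presumably why the intrinsic is expected to win, though only a benchmark run would confirm the size of the gain.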
------------- PR: https://git.openjdk.java.net/jdk/pull/8840 From duke at openjdk.java.net Mon May 23 14:34:42 2022 From: duke at openjdk.java.net (Raffaello Giulietti) Date: Mon, 23 May 2022 14:34:42 GMT Subject: RFR: 8287139: aarch64 intrinsic for unsignedMultiplyHigh In-Reply-To: References: Message-ID: <_d4FVxEB4Z96B1dz4xAW-xJi_y6rBTFPkuB5s_K1glU=.9eccd7d4-7813-4b5d-9b0a-b506f1bbf864@github.com> On Mon, 23 May 2022 14:28:00 GMT, Bhavana-Kilambi wrote: >> I didn't run the benchmark. >> Isn't a single instruction supposed to be faster anyway, as opposed to the java method? > > Hi, I tried the JMH benchmark with this patch, it does generate the "umulh" instruction instead of smulh. Thanks @Bhavana-Kilambi, as the `hsdis` output above shows, this is indeed the case. ------------- PR: https://git.openjdk.java.net/jdk/pull/8840 From duke at openjdk.java.net Mon May 23 14:56:19 2022 From: duke at openjdk.java.net (Raffaello Giulietti) Date: Mon, 23 May 2022 14:56:19 GMT Subject: RFR: 8287139: aarch64 intrinsic for unsignedMultiplyHigh [v2] In-Reply-To: References: Message-ID: > Adds aarch64 intrinsic support for `Math.unsignedMultiplyHigh()` Raffaello Giulietti has updated the pull request incrementally with one additional commit since the last revision: 8287139: aarch64 intrinsic for unsignedMultiplyHigh ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/8840/files - new: https://git.openjdk.java.net/jdk/pull/8840/files/14905156..07226edf Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=8840&range=01 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=8840&range=00-01 Stats: 2 lines in 1 file changed: 0 ins; 0 del; 2 mod Patch: https://git.openjdk.java.net/jdk/pull/8840.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8840/head:pull/8840 PR: https://git.openjdk.java.net/jdk/pull/8840 From aph at openjdk.java.net Mon May 23 15:00:29 2022 From: aph at openjdk.java.net (Andrew Haley) Date: Mon, 23 May 2022 15:00:29 GMT 
Subject: RFR: 8287091: aarch64 : guarantee(val < (1ULL << nbits)) failed: Field too big for insn Message-ID: <2GEDEyjpLiWL1yS00lHxVP8SXrWIIYx07wPZ3xU_yeA=.7f12ae38-3aa7-48d3-ba8d-732e606c470a@github.com> This is fallout from the patch for JDK-8285923. The root cause of this bug is that there is a template definition of `cmp(register, immediate)` but there is not a template definition of `cmn(register, immediate)`. Given that we are close to rampdown, this patch fixes the bug in the most minimal way possible, by using `adds(zr, register, immediate)`, which correctly handles 64-bit operands. In the next release cycle we should tidy up `cmn()` in the same way that was done for JDK-8206895. Alternatively, we could back out JDK-8285923. I'd rather not, given that it fixes a real (if latent) bug, but if needs be I'll do so. ------------- Commit messages: - 8287091: aarch64 : guarantee(val < (1ULL << nbits)) failed: Field too big for insn - 8287091: aarch64 : guarantee(val < (1ULL << nbits)) failed: Field too big for insn Changes: https://git.openjdk.java.net/jdk/pull/8845/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=8845&range=00 Issue: https://bugs.openjdk.java.net/browse/JDK-8287091 Stats: 2 lines in 1 file changed: 0 ins; 0 del; 2 mod Patch: https://git.openjdk.java.net/jdk/pull/8845.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8845/head:pull/8845 PR: https://git.openjdk.java.net/jdk/pull/8845 From sviswanathan at openjdk.java.net Mon May 23 15:31:03 2022 From: sviswanathan at openjdk.java.net (Sandhya Viswanathan) Date: Mon, 23 May 2022 15:31:03 GMT Subject: Integrated: 8286972: Support the new loop induction variable related PopulateIndex IR node on x86 In-Reply-To: References: Message-ID: On Wed, 18 May 2022 17:25:38 GMT, Sandhya Viswanathan wrote: > This PR adds x86 backend support for the new loop induction variable related PopulateIndex IR node. 
> This IR node was added as part of [JDK-8280510](https://bugs.openjdk.java.net/browse/JDK-8280510). > > The performance numbers are as follows: > Before: > Benchmark (count) Mode Cnt Score Error Units > IndexVector.exprWithIndex1 65536 thrpt 3 64556.552 ± 1126.396 ops/s > IndexVector.exprWithIndex2 65536 thrpt 3 22117.050 ± 11452.098 ops/s > IndexVector.indexArrayFill 65536 thrpt 3 117776.383 ± 1120.957 ops/s > > After: > Benchmark (count) Mode Cnt Score Error Units > IndexVector.exprWithIndex1 65536 thrpt 3 203180.290 ± 2147.807 ops/s > IndexVector.exprWithIndex2 65536 thrpt 3 274132.756 ± 6853.393 ops/s > IndexVector.indexArrayFill 65536 thrpt 3 374165.202 ± 46930.779 ops/s > > Please review. > > Best Regards, > Sandhya This pull request has now been integrated. Changeset: 5d8d6da3 Author: Sandhya Viswanathan URL: https://git.openjdk.java.net/jdk/commit/5d8d6da36aeb3bd4f6238cfac509d0e481fa5d1e Stats: 270 lines in 4 files changed: 248 ins; 21 del; 1 mod 8286972: Support the new loop induction variable related PopulateIndex IR node on x86 Reviewed-by: kvn, jbhateja ------------- PR: https://git.openjdk.java.net/jdk/pull/8778 From ngasson at openjdk.java.net Mon May 23 15:35:59 2022 From: ngasson at openjdk.java.net (Nick Gasson) Date: Mon, 23 May 2022 15:35:59 GMT Subject: RFR: 8287139: aarch64 intrinsic for unsignedMultiplyHigh [v2] In-Reply-To: References: Message-ID: On Mon, 23 May 2022 14:56:19 GMT, Raffaello Giulietti wrote: >> Adds aarch64 intrinsic support for `Math.unsignedMultiplyHigh()` > > Raffaello Giulietti has updated the pull request incrementally with one additional commit since the last revision: > > 8287139: aarch64 intrinsic for unsignedMultiplyHigh Marked as reviewed by ngasson (Reviewer). 
------------- PR: https://git.openjdk.java.net/jdk/pull/8840 From jbhateja at openjdk.java.net Mon May 23 16:25:02 2022 From: jbhateja at openjdk.java.net (Jatin Bhateja) Date: Mon, 23 May 2022 16:25:02 GMT Subject: RFR: 8283894: Intrinsify compress and expand bits on x86 [v6] In-Reply-To: References: Message-ID: On Mon, 2 May 2022 16:12:41 GMT, Paul Sandoz wrote: >> Jatin Bhateja has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains ten additional commits since the last revision: >> >> - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8283894 >> - 8283894: Removing CompressExpandSanityTest from problem list. >> - 8283894: Updating test tag spec. >> - 8283894: Review comments resolved. >> - 8283894: Add missing -XX:+UnlockDiagnosticVMOptions. >> - 8283894: Review comments resolutions. >> - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8283894 >> - 8283894: Extending IR framework testcase with some functional test points. >> - 8283894: Intrinsify compress and expand bits on x86 > > Can you update the jtreg tests: > 1. Modify `CompressExpandTest` to run with and without the intrinsic enabled > 2. Disable (by default) `CompressExpandSanityTest` > ? Hi @PaulSandoz , @rose00 , your comments have been addressed. ------------- PR: https://git.openjdk.java.net/jdk/pull/8498 From jbhateja at openjdk.java.net Mon May 23 16:28:23 2022 From: jbhateja at openjdk.java.net (Jatin Bhateja) Date: Mon, 23 May 2022 16:28:23 GMT Subject: RFR: 8284960: Integration of JEP 426: Vector API (Fourth Incubator) [v3] In-Reply-To: References: <15GChtdthFmu9Cup-Ykj5NBvAanOC8QOJsnhH9g20KY=.f35eba31-15f9-40e8-95ce-a54049792840@github.com> Message-ID: On Thu, 12 May 2022 23:56:49 GMT, Vladimir Ivanov wrote: >> Jatin Bhateja has updated the pull request with a new target base due to a merge or a rebase. 
The pull request now contains 11 commits: >> >> - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8284960 >> - 8284960: Correcting a typo. >> - 8284960: Integrating changes from panama-vector (Add @since 19 tags). >> - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8284960 >> - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8284960 >> - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8284960 >> - 8284960: AARCH64 backend changes. >> - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8284960 >> - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8284960 >> - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8284960 >> - ... and 1 more: https://git.openjdk.java.net/jdk/compare/3fa1c404...b021e082 > > Overall, looks good. > > Some minor questions/suggestions follow. Hi @iwanowww , your comments have been addressed. kindly let me know if you have other comments on x86 side changes. ------------- PR: https://git.openjdk.java.net/jdk/pull/8425 From shade at openjdk.java.net Mon May 23 17:10:13 2022 From: shade at openjdk.java.net (Aleksey Shipilev) Date: Mon, 23 May 2022 17:10:13 GMT Subject: RFR: 8287169: compiler/arguments/TestCompileThresholdScaling.java fails on x86_32 after JDK-8287052 Message-ID: <7Z-Js1_CRnu_t7PahDNez-IVrtDAjsRVcJaHJnGkBR4=.f9fa2299-5177-468e-9447-7c4aa497d6a8@github.com> See the bug report, recent regression. I believe the code makes the unwarranted assumption that we run on 64-bit platform, and thus caps at > 2^63 only. It should also cap at > 2^31 for 32-bit platforms. Attn @dean-long. 
Testing: - [x] Affected test on Linux x86_64 fastdebug (still passes) - [x] Affected test on Linux x86_32 fastdebug (now passes) ------------- Commit messages: - Fix Changes: https://git.openjdk.java.net/jdk/pull/8851/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=8851&range=00 Issue: https://bugs.openjdk.java.net/browse/JDK-8287169 Stats: 4 lines in 1 file changed: 2 ins; 0 del; 2 mod Patch: https://git.openjdk.java.net/jdk/pull/8851.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8851/head:pull/8851 PR: https://git.openjdk.java.net/jdk/pull/8851 From kvn at openjdk.java.net Mon May 23 18:15:50 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Mon, 23 May 2022 18:15:50 GMT Subject: RFR: 8287169: compiler/arguments/TestCompileThresholdScaling.java fails on x86_32 after JDK-8287052 In-Reply-To: <7Z-Js1_CRnu_t7PahDNez-IVrtDAjsRVcJaHJnGkBR4=.f9fa2299-5177-468e-9447-7c4aa497d6a8@github.com> References: <7Z-Js1_CRnu_t7PahDNez-IVrtDAjsRVcJaHJnGkBR4=.f9fa2299-5177-468e-9447-7c4aa497d6a8@github.com> Message-ID: On Mon, 23 May 2022 17:03:38 GMT, Aleksey Shipilev wrote: > See the bug report, recent regression. I believe the code makes the unwarranted assumption that we run on 64-bit platform, and thus caps at > 2^63 only. It should also cap at > 2^31 for 32-bit platforms. > > Attn @dean-long. > > Testing: > - [x] Affected test on Linux x86_64 fastdebug (still passes) > - [x] Affected test on Linux x86_32 fastdebug (now passes) src/hotspot/share/compiler/compilerDefinitions.cpp line 139: > 137: int exp; > 138: (void) frexp(v, &exp); > 139: if (exp > LP64_ONLY(63) NOT_LP64(31)) { How about `(exp > (sizeof(intx)*BitsPerByte-1))`? 
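The exponent check under discussion can be illustrated outside HotSpot. The sketch below is a Java analogue, not the patched C++. It assumes a non-negative product, uses `Math.getExponent(v) + 1` as a stand-in for the exponent that C's `frexp` reports, and takes the word width as a parameter so that the 32-bit and 64-bit cases can be compared side by side. The names `ThresholdCapSketch`, `frexpExponent`, and `scaleAndCap` are hypothetical.

```java
final class ThresholdCapSketch {
    // Exponent e such that v = f * 2^e with 0.5 <= f < 1, matching C's frexp
    // for normal, nonzero, finite v. NaN and infinity map to 1025 here, so
    // they end up capped as well.
    static int frexpExponent(double v) {
        return Math.getExponent(v) + 1;
    }

    // Scales a non-negative threshold and saturates at the largest signed
    // value representable in wordBits bits (32 or 64).
    static long scaleAndCap(long threshold, double scale, int wordBits) {
        double v = threshold * scale;
        int maxExp = wordBits - 1;  // analogue of sizeof(intx) * BitsPerByte - 1
        long maxValue = (wordBits == 64) ? Long.MAX_VALUE : (1L << maxExp) - 1;
        if (frexpExponent(v) > maxExp) {
            return maxValue;        // v >= 2^maxExp would not fit in the word
        }
        return (long) v;
    }

    public static void main(String[] args) {
        // 10000 * 1e9 = 1e13 fits in 64 bits but not in 32 bits.
        System.out.println(scaleAndCap(10_000L, 1e9, 64));
        System.out.println(scaleAndCap(10_000L, 1e9, 32));
    }
}
```

Deriving the limit from the word width, rather than hard-coding 63 and 31, keeps the two platforms on one code path, which is the point of the suggestion quoted above.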
------------- PR: https://git.openjdk.java.net/jdk/pull/8851 From duke at openjdk.java.net Mon May 23 18:18:39 2022 From: duke at openjdk.java.net (aamarsh) Date: Mon, 23 May 2022 18:18:39 GMT Subject: RFR: 8282024: add EscapeAnalysis statistics under PrintOptoStatistics [v15] In-Reply-To: References: Message-ID: <1U5ThvC2Kp9pmy2KP_WkP5Qsv4ZmomA093MBjNzJho0=.5efe9861-32eb-4eb2-9dc9-5a6871fe6cc7@github.com> > Escape Analysis and Scalar Replacement statistics were added when the -XX:+PrintOptoStatistics flag is set. All code is placed in `#ifndef Product` block, so this code is only run when creating a debug build. Using renaissance benchmark I ran a few tests to confirm that numbers were printing correctly. Below is an example run: > > > No escape = 263, Arg escape = 87, Global escape = 1628 > Objects scalar replaced = 193, Monitor objects removed = 32, GC barriers removed = 38, Memory barriers removed = 225 aamarsh has updated the pull request incrementally with two additional commits since the last revision: - delete iterative EA comment - account for iterative EA ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/8019/files - new: https://git.openjdk.java.net/jdk/pull/8019/files/8c394555..fe3448eb Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=8019&range=14 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=8019&range=13-14 Stats: 3 lines in 2 files changed: 0 ins; 1 del; 2 mod Patch: https://git.openjdk.java.net/jdk/pull/8019.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8019/head:pull/8019 PR: https://git.openjdk.java.net/jdk/pull/8019 From shade at openjdk.java.net Mon May 23 18:28:45 2022 From: shade at openjdk.java.net (Aleksey Shipilev) Date: Mon, 23 May 2022 18:28:45 GMT Subject: RFR: 8287169: compiler/arguments/TestCompileThresholdScaling.java fails on x86_32 after JDK-8287052 In-Reply-To: References: <7Z-Js1_CRnu_t7PahDNez-IVrtDAjsRVcJaHJnGkBR4=.f9fa2299-5177-468e-9447-7c4aa497d6a8@github.com> Message-ID: 
On Mon, 23 May 2022 18:12:23 GMT, Vladimir Kozlov wrote: >> See the bug report, recent regression. I believe the code makes the unwarranted assumption that we run on 64-bit platform, and thus caps at > 2^63 only. It should also cap at > 2^31 for 32-bit platforms. >> >> Attn @dean-long. >> >> Testing: >> - [x] Affected test on Linux x86_64 fastdebug (still passes) >> - [x] Affected test on Linux x86_32 fastdebug (now passes) > > src/hotspot/share/compiler/compilerDefinitions.cpp line 139: > >> 137: int exp; >> 138: (void) frexp(v, &exp); >> 139: if (exp > LP64_ONLY(63) NOT_LP64(31)) { > > How about `(exp > (sizeof(intx)*BitsPerByte-1))`? `(exp > BitsPerWord-1)` then? ------------- PR: https://git.openjdk.java.net/jdk/pull/8851 From kvn at openjdk.java.net Mon May 23 18:51:45 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Mon, 23 May 2022 18:51:45 GMT Subject: RFR: 8287169: compiler/arguments/TestCompileThresholdScaling.java fails on x86_32 after JDK-8287052 In-Reply-To: References: <7Z-Js1_CRnu_t7PahDNez-IVrtDAjsRVcJaHJnGkBR4=.f9fa2299-5177-468e-9447-7c4aa497d6a8@github.com> Message-ID: On Mon, 23 May 2022 18:25:24 GMT, Aleksey Shipilev wrote: >> src/hotspot/share/compiler/compilerDefinitions.cpp line 139: >> >>> 137: int exp; >>> 138: (void) frexp(v, &exp); >>> 139: if (exp > LP64_ONLY(63) NOT_LP64(31)) { >> >> How about `(exp > (sizeof(intx)*BitsPerByte-1))`? > > `(exp > BitsPerWord-1)` then? It is not clear how `intx` relates to `BitsPerWord`? I would prefer `sizeof(intx)` in expression. 
------------- PR: https://git.openjdk.java.net/jdk/pull/8851 From shade at openjdk.java.net Mon May 23 19:12:27 2022 From: shade at openjdk.java.net (Aleksey Shipilev) Date: Mon, 23 May 2022 19:12:27 GMT Subject: RFR: 8287169: compiler/arguments/TestCompileThresholdScaling.java fails on x86_32 after JDK-8287052 [v2] In-Reply-To: <7Z-Js1_CRnu_t7PahDNez-IVrtDAjsRVcJaHJnGkBR4=.f9fa2299-5177-468e-9447-7c4aa497d6a8@github.com> References: <7Z-Js1_CRnu_t7PahDNez-IVrtDAjsRVcJaHJnGkBR4=.f9fa2299-5177-468e-9447-7c4aa497d6a8@github.com> Message-ID: > See the bug report, recent regression. I believe the code makes the unwarranted assumption that we run on 64-bit platform, and thus caps at > 2^63 only. It should also cap at > 2^31 for 32-bit platforms. > > Attn @dean-long. > > Testing: > - [x] Affected test on Linux x86_64 fastdebug (still passes) > - [x] Affected test on Linux x86_32 fastdebug (now passes) Aleksey Shipilev has updated the pull request incrementally with one additional commit since the last revision: Review comments ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/8851/files - new: https://git.openjdk.java.net/jdk/pull/8851/files/57f7e7a6..e82eea34 Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=8851&range=01 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=8851&range=00-01 Stats: 2 lines in 1 file changed: 1 ins; 0 del; 1 mod Patch: https://git.openjdk.java.net/jdk/pull/8851.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8851/head:pull/8851 PR: https://git.openjdk.java.net/jdk/pull/8851 From shade at openjdk.java.net Mon May 23 19:12:28 2022 From: shade at openjdk.java.net (Aleksey Shipilev) Date: Mon, 23 May 2022 19:12:28 GMT Subject: RFR: 8287169: compiler/arguments/TestCompileThresholdScaling.java fails on x86_32 after JDK-8287052 [v2] In-Reply-To: References: <7Z-Js1_CRnu_t7PahDNez-IVrtDAjsRVcJaHJnGkBR4=.f9fa2299-5177-468e-9447-7c4aa497d6a8@github.com> Message-ID: On Mon, 23 May 2022 18:48:30 GMT, 
Vladimir Kozlov wrote: >> `(exp > BitsPerWord-1)` then? > > It is not clear how `intx` relates to `BitsPerWord`? I would prefer `sizeof(intx)` in expression. Right. See new commit, I introduced a variable to implicitly cast to `int` and avoid signed-unsigned comparison. ------------- PR: https://git.openjdk.java.net/jdk/pull/8851 From kvn at openjdk.java.net Mon May 23 19:23:55 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Mon, 23 May 2022 19:23:55 GMT Subject: RFR: 8287169: compiler/arguments/TestCompileThresholdScaling.java fails on x86_32 after JDK-8287052 [v2] In-Reply-To: References: <7Z-Js1_CRnu_t7PahDNez-IVrtDAjsRVcJaHJnGkBR4=.f9fa2299-5177-468e-9447-7c4aa497d6a8@github.com> Message-ID: On Mon, 23 May 2022 19:12:27 GMT, Aleksey Shipilev wrote: >> See the bug report, recent regression. I believe the code makes the unwarranted assumption that we run on 64-bit platform, and thus caps at > 2^63 only. It should also cap at > 2^31 for 32-bit platforms. >> >> Attn @dean-long. >> >> Testing: >> - [x] Affected test on Linux x86_64 fastdebug (still passes) >> - [x] Affected test on Linux x86_32 fastdebug (now passes) > > Aleksey Shipilev has updated the pull request incrementally with one additional commit since the last revision: > > Review comments Good. ------------- Marked as reviewed by kvn (Reviewer). 
PR: https://git.openjdk.java.net/jdk/pull/8851

From sviswanathan at openjdk.java.net Mon May 23 20:11:49 2022
From: sviswanathan at openjdk.java.net (Sandhya Viswanathan)
Date: Mon, 23 May 2022 20:11:49 GMT
Subject: RFR: 8285973: x86_64: Improve fp comparison and cmove for eq/ne [v3]
In-Reply-To:
References:
Message-ID: <5iVxsWodIft_Qbj8ZKCGmz-M4JFQebYwApckPrr5C6o=.77ea11c8-345a-477b-83f0-b1e5da16b7db@github.com>

On Sat, 21 May 2022 10:31:25 GMT, Quan Anh Mai wrote:

>> Hi,
>>
>> This patch optimises the matching rules for floating-point comparison with respects to eq/ne on x86-64
>>
>> 1, When the inputs of a comparison is the same (i.e `isNaN` patterns), `ZF` is always set, so we don't need `cmpOpUCF2` for the eq/ne cases, which improves the sequence of `If (CmpF x x) (Bool ne)` from
>>
>>     ucomiss xmm0, xmm0
>>     jp label
>>     jne label
>>
>> into
>>
>>     ucomiss xmm0, xmm0
>>     jp label
>>
>> 2, The move rules for `cmpOpUCF2` is missing, which makes patterns such as `x == y ? 1 : 0` to fall back to `cmpOpU`, which have a really high cost of fixing the flags, such as
>>
>>     xorl ecx, ecx
>>     ucomiss xmm0, xmm1
>>     jnp done
>>     pushf
>>     andq [rsp], 0xffffff2b
>>     popf
>>     done:
>>     movl eax, 1
>>     cmovel eax, ecx
>>
>> The patch changes this sequence into
>>
>>     xorl ecx, ecx
>>     ucomiss xmm0, xmm1
>>     movl eax, 1
>>     cmovpl eax, ecx
>>     cmovnel eax, ecx
>>
>> 3, The patch also changes the pattern of `isInfinite` to be more optimised by using `Math.abs` to reduce 1 comparison and compares the result with `MAX_VALUE` since `>` is more optimised than `==` for floating-point types.
>>
>> The benchmark results are as follow:
>>
>>                                             Before              After
>> Benchmark                      Mode  Cnt     Score     Error     Score     Error  Unit   Ratio
>> FPComparison.equalDouble       avgt    5  2876.242 ±  58.875   594.636 ±   8.922  ns/op   4.84
>> FPComparison.equalFloat        avgt    5  3062.430 ±  31.371   663.849 ±   3.656  ns/op   4.61
>> FPComparison.isFiniteDouble    avgt    5   475.749 ±  19.027   518.309 ± 107.352  ns/op   0.92
>> FPComparison.isFiniteFloat     avgt    5   506.525 ±  14.417   515.576 ±  14.669  ns/op   0.98
>> FPComparison.isInfiniteDouble  avgt    5  1232.800 ±  31.677   621.185 ±  11.935  ns/op   1.98
>> FPComparison.isInfiniteFloat   avgt    5  1234.708 ±  70.239   623.566 ±  15.206  ns/op   1.98
>> FPComparison.isNanDouble       avgt    5  2255.847 ±   7.238   400.124 ±   0.762  ns/op   5.64
>> FPComparison.isNanFloat        avgt    5  2567.044 ±  36.078   546.486 ±   1.509  ns/op   4.70
>>
>> Thank you very much.
>
> Quan Anh Mai has updated the pull request incrementally with one additional commit since the last revision:
>
>   comments

Marked as reviewed by sviswanathan (Reviewer).

-------------

PR: https://git.openjdk.java.net/jdk/pull/8525

From dlong at openjdk.java.net Mon May 23 20:39:55 2022
From: dlong at openjdk.java.net (Dean Long)
Date: Mon, 23 May 2022 20:39:55 GMT
Subject: RFR: 8287169: compiler/arguments/TestCompileThresholdScaling.java fails on x86_32 after JDK-8287052 [v2]
In-Reply-To:
References: <7Z-Js1_CRnu_t7PahDNez-IVrtDAjsRVcJaHJnGkBR4=.f9fa2299-5177-468e-9447-7c4aa497d6a8@github.com>
Message-ID:

On Mon, 23 May 2022 19:12:27 GMT, Aleksey Shipilev wrote:

>> See the bug report, recent regression. I believe the code makes the unwarranted assumption that we run on 64-bit platform, and thus caps at > 2^63 only. It should also cap at > 2^31 for 32-bit platforms.
>>
>> Attn @dean-long.
>>
>> Testing:
>> - [x] Affected test on Linux x86_64 fastdebug (still passes)
>> - [x] Affected test on Linux x86_32 fastdebug (now passes)
>
> Aleksey Shipilev has updated the pull request incrementally with one additional commit since the last revision:
>
>   Review comments

Thanks for fixing this. Sorry for the breakage.

-------------

Marked as reviewed by dlong (Reviewer).
PR: https://git.openjdk.java.net/jdk/pull/8851 From vlivanov at openjdk.java.net Mon May 23 22:09:27 2022 From: vlivanov at openjdk.java.net (Vladimir Ivanov) Date: Mon, 23 May 2022 22:09:27 GMT Subject: RFR: 8284960: Integration of JEP 426: Vector API (Fourth Incubator) [v8] In-Reply-To: <-gYfiftVAdAUo-yZv2Y04HhoT7JT5lDcjDjCZ0UvSVc=.aa9d454d-3d6a-458a-997e-9a83951a8fa6@github.com> References: <-gYfiftVAdAUo-yZv2Y04HhoT7JT5lDcjDjCZ0UvSVc=.aa9d454d-3d6a-458a-997e-9a83951a8fa6@github.com> Message-ID: On Fri, 20 May 2022 09:51:24 GMT, Jatin Bhateja wrote: >> Hi All, >> >> Patch adds the planned support for new vector operations and APIs targeted for [JEP 426: Vector API (Fourth Incubator).](https://bugs.openjdk.java.net/browse/JDK-8280173) >> >> Following is the brief summary of changes:- >> >> 1) Extends the scope of existing lanewise API for following new vector operations. >> - VectorOperations.BIT_COUNT: counts the number of one-bits >> - VectorOperations.LEADING_ZEROS_COUNT: counts the number of leading zero bits >> - VectorOperations.TRAILING_ZEROS_COUNT: counts the number of trailing zero bits >> - VectorOperations.REVERSE: reversing the order of bits >> - VectorOperations.REVERSE_BYTES: reversing the order of bytes >> - compress and expand bits: Semantics are based on Hacker's Delight section 7-4 Compress, or Generalized Extract. >> >> 2) Adds following new APIs to perform cross lane vector compress and expansion operations under the influence of a mask. >> - Vector.compress >> - Vector.expand >> - VectorMask.compress >> >> 3) Adds predicated and non-predicated versions of following new APIs to load and store the contents of vector from foreign MemorySegments. >> - Vector.fromMemorySegment >> - Vector.intoMemorySegment >> >> 4) C2 Compiler IR enhancements and optimized X86 and AARCH64 backend support for each newly added operation. >> >> >> Patch has been regressed over AARCH64 and X86 targets different AVX levels. >> >> Kindly review and share your feedback. 
>> >> Best Regards, >> Jatin > > Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: > > 8284960: Integrating incremental patches. Looks good! ------------- Marked as reviewed by vlivanov (Reviewer). PR: https://git.openjdk.java.net/jdk/pull/8425 From psandoz at openjdk.java.net Mon May 23 22:32:53 2022 From: psandoz at openjdk.java.net (Paul Sandoz) Date: Mon, 23 May 2022 22:32:53 GMT Subject: RFR: 8283894: Intrinsify compress and expand bits on x86 [v6] In-Reply-To: References: Message-ID: On Mon, 23 May 2022 09:27:04 GMT, Jatin Bhateja wrote: >> Summary of changes: >> >> - Patch intrinsifies following newly added Java SE APIs >> - Integer.compress >> - Integer.expand >> - Long.compress >> - Long.expand >> >> - Adds C2 IR nodes and corresponding ideal transformations for new operations. >> - We see around ~10x performance speedup due to intrinsification over X86 target. >> - Adds an IR framework based test to validate newly introduced IR transformations. >> >> Kindly review and share your feedback. >> >> Best Regards, >> Jatin > > Jatin Bhateja has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains ten additional commits since the last revision: > > - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8283894 > - 8283894: Removing CompressExpandSanityTest from problem list. > - 8283894: Updating test tag spec. > - 8283894: Review comments resolved. > - 8283894: Add missing -XX:+UnlockDiagnosticVMOptions. > - 8283894: Review comments resolutions. > - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8283894 > - 8283894: Extending IR framework testcase with some functional test points. 
> - 8283894: Intrinsify compress and expand bits on x86 test/jdk/java/lang/CompressExpandSanityTest.java line 29: > 27: * @key randomness > 28: * @run testng/othervm -XX:+UnlockDiagnosticVMOptions -XX:DisableIntrinsic=_expand_i,_expand_l,_compress_i,_compress_l CompressExpandSanityTest > 29: * @run testng CompressExpandSanityTest Can we comment out the annotations so this test is not run by default (i don't know of a better way) ------------- PR: https://git.openjdk.java.net/jdk/pull/8498 From kvn at openjdk.java.net Mon May 23 23:11:36 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Mon, 23 May 2022 23:11:36 GMT Subject: RFR: 8284960: Integration of JEP 426: Vector API (Fourth Incubator) [v8] In-Reply-To: <-gYfiftVAdAUo-yZv2Y04HhoT7JT5lDcjDjCZ0UvSVc=.aa9d454d-3d6a-458a-997e-9a83951a8fa6@github.com> References: <-gYfiftVAdAUo-yZv2Y04HhoT7JT5lDcjDjCZ0UvSVc=.aa9d454d-3d6a-458a-997e-9a83951a8fa6@github.com> Message-ID: On Fri, 20 May 2022 09:51:24 GMT, Jatin Bhateja wrote: >> Hi All, >> >> Patch adds the planned support for new vector operations and APIs targeted for [JEP 426: Vector API (Fourth Incubator).](https://bugs.openjdk.java.net/browse/JDK-8280173) >> >> Following is the brief summary of changes:- >> >> 1) Extends the scope of existing lanewise API for following new vector operations. >> - VectorOperations.BIT_COUNT: counts the number of one-bits >> - VectorOperations.LEADING_ZEROS_COUNT: counts the number of leading zero bits >> - VectorOperations.TRAILING_ZEROS_COUNT: counts the number of trailing zero bits >> - VectorOperations.REVERSE: reversing the order of bits >> - VectorOperations.REVERSE_BYTES: reversing the order of bytes >> - compress and expand bits: Semantics are based on Hacker's Delight section 7-4 Compress, or Generalized Extract. >> >> 2) Adds following new APIs to perform cross lane vector compress and expansion operations under the influence of a mask. 
>> - Vector.compress
>> - Vector.expand
>> - VectorMask.compress
>>
>> 3) Adds predicated and non-predicated versions of following new APIs to load and store the contents of vector from foreign MemorySegments.
>> - Vector.fromMemorySegment
>> - Vector.intoMemorySegment
>>
>> 4) C2 Compiler IR enhancements and optimized X86 and AARCH64 backend support for each newly added operation.
>>
>>
>> Patch has been regressed over AARCH64 and X86 targets different AVX levels.
>>
>> Kindly review and share your feedback.
>>
>> Best Regards,
>> Jatin
>
> Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision:
>
>   8284960: Integrating incremental patches.

src/hotspot/cpu/x86/assembler_x86.cpp line 7934:

> 7932:
> 7933: void Assembler::evplzcntd(XMMRegister dst, KRegister mask, XMMRegister src, bool merge, int vector_len) {
> 7934: assert(VM_Version::supports_avx512cd() && (vector_len == AVX_512bit || VM_Version::supports_avx512vl()), "");

Please, split assert as in other instructions - it will help to understand failure better.

src/hotspot/cpu/x86/assembler_x86.cpp line 7946:

> 7944:
> 7945: void Assembler::evplzcntq(XMMRegister dst, KRegister mask, XMMRegister src, bool merge, int vector_len) {
> 7946: assert(VM_Version::supports_avx512cd() && (vector_len == AVX_512bit || VM_Version::supports_avx512vl()), "");

Split assert.

src/hotspot/cpu/x86/assembler_x86.cpp line 8173:

> 8171:
> 8172: void Assembler::vinsertf32x4(XMMRegister dst, XMMRegister nds, XMMRegister src, uint8_t imm8) {
> 8173: assert(VM_Version::supports_evex(), "");

Hmm, did we never trigger this wrong assert because the use was guarded by correct check?
src/hotspot/cpu/x86/assembler_x86.cpp line 11720:

> 11718:
> 11719: void Assembler::evpcompressb(XMMRegister dst, KRegister mask, XMMRegister src, bool merge, int vector_len) {
> 11720: assert(VM_Version::supports_avx512_vbmi2() && (vector_len == AVX_512bit || VM_Version::supports_avx512vl()), "");

Split assert in this and following new instructions.

src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 4455:

> 4453: break;
> 4454: default:
> 4455: fatal("Unsupported type");

Print wrong type: `fatal("Unsupported type : %s", type2name(type));` Below too.

src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 4561:

> 4559: case 4 : evpbroadcastd(dst, rtmp, vec_enc); break;
> 4560: case 8 : evpbroadcastq(dst, rtmp, vec_enc); break;
> 4561: default : ShouldNotReachHere(); break;

`ShouldNotReachHere` does not give any information in case of failure. Use `fatal()` which prints wrong `lane_size`. Same below.

src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 4666:

> 4664: break;
> 4665: default:
> 4666: ShouldNotReachHere();

Use `fatal()`.

src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 4693:

> 4691: break;
> 4692: default:
> 4693: ShouldNotReachHere();

Use `fatal()`.

src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 4732:

> 4730: vector_reverse_byte(bt, dst, xtmp2, rtmp, vec_enc);
> 4731:
> 4732: } else if(!VM_Version::supports_avx512vlbw() && vec_enc == Assembler::AVX_512bit) {

No need to check `!VM_Version::supports_avx512vlbw()`.

src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 4759:

> 4757: vpandn(xtmp2, xtmp2, xtmp1, vec_enc);
> 4758: vpsrlq(xtmp2, xtmp2, 1, vec_enc);
> 4759: vporq(xtmp1, dst, xtmp2, vec_enc);

All 3 code snippets are the same except constants. Also similar code in `vector_reverse_byte64` for `short` type. Consider factoring it out into a separate method.

src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 4819:

> 4817: break;
> 4818: default:
> 4819: fatal("Unsupported type");

Print wrong type.
src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 4846:

> 4844: break;
> 4845: default:
> 4846: fatal("Unsupported type");

Print wrong type.

src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 4892:

> 4890: break;
> 4891: default:
> 4892: ShouldNotReachHere();

Use `fatal` and print type.

src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 5018:

> 5016: break;
> 5017: default:
> 5018: ShouldNotReachHere();

Use fatal and print type.

src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 5037:

> 5035: break;
> 5036: default:
> 5037: ShouldNotReachHere();

Use fatal and print type.

src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 5056:

> 5054: break;
> 5055: default:
> 5056: ShouldNotReachHere();

Use fatal and print type.

src/hotspot/cpu/x86/x86.ad line 8879:

> 8877: // special handling should be removed.
> 8878: if (bt == T_LONG && rbt == T_INT) {
> 8879: if (VM_Version::supports_avx512vl()) {

Predicate says `!VM_Version::supports_avx512vl()`

src/hotspot/share/opto/node.hpp line 1006:

> 1004:
> 1005: // The node is a CountedLoopEnd with a mask annotation so as to emit a restore context
> 1006: bool has_vector_mask_set() const { return (_flags & Flag_has_vector_mask_set) != 0; }

I don't see use of this flag.

src/hotspot/share/opto/vectorIntrinsics.cpp line 86:

> 84: if ((mask_use_type & VecMaskUseLoad) != 0) {
> 85: if (!Matcher::match_rule_supported_vector(Op_VectorLoadMask, num_elem, elem_bt) ||
> 86: !Matcher::match_rule_supported_vector(Op_LoadVector, num_elem, T_BOOLEAN)) {

Add comment explaining new check. In following places too.

src/hotspot/share/runtime/vmStructs.cpp line 1779:

> 1777: declare_c2_type(CMoveVDNode, VectorNode) \
> 1778: declare_c2_type(CompressVNode, VectorNode) \
> 1779: declare_c2_type(ExpandVNode, VectorNode) \

Not all new nodes listed.
-------------

PR: https://git.openjdk.java.net/jdk/pull/8425

From kvn at openjdk.java.net Mon May 23 23:44:05 2022
From: kvn at openjdk.java.net (Vladimir Kozlov)
Date: Mon, 23 May 2022 23:44:05 GMT
Subject: RFR: 8283699: Improve the peephole mechanism of hotspot [v2]
In-Reply-To:
References:
Message-ID:

On Sat, 21 May 2022 12:16:23 GMT, Quan Anh Mai wrote:

>> Hi,
>>
>> The current peephole mechanism has several drawbacks:
>> - Can only match and remove adjacent instructions.
>> - Cannot match machine ideal nodes (e.g MachSpillCopyNode).
>> - Can only replace 1 instruction, the position of insertion is limited to the position at which the matched nodes reside.
>> - Is actually broken since the nodes are not connected properly and OptoScheduling requires true dependencies between nodes.
>>
>> The patch proposes to enhance the peephole mechanism by allowing a peep rule to call into a dedicated function, which takes the responsibility to perform all required transformations on the basic block. This allows the peephole mechanism to perform several transformations effectively in a more fine-grain manner.
>>
>> The patch uses the peephole optimisation to perform some classic peepholes, transforming on x86 the sequences:
>>
>>     mov r1, r2   ->   lea r1, [r2 + r3/i]
>>     add r1, r3/i
>>
>> and
>>
>>     mov r1, r2   ->   lea r1, [r2 << i], with i = 1, 2, 3
>>     shl r1, i
>>
>> On the added benchmarks, the transformations show positive results:
>>
>> Benchmark             Mode  Cnt     Score     Error  Units
>> LeaPeephole.B_D_int   avgt    5  1200.490 ± 104.662  ns/op
>> LeaPeephole.B_D_long  avgt    5  1211.439 ±  30.196  ns/op
>> LeaPeephole.B_I_int   avgt    5  1118.831 ±   7.995  ns/op
>> LeaPeephole.B_I_long  avgt    5  1112.389 ±  15.838  ns/op
>> LeaPeephole.I_S_int   avgt    5  1262.528 ±   7.293  ns/op
>> LeaPeephole.I_S_long  avgt    5  1223.820 ±  17.777  ns/op
>>
>> Benchmark             Mode  Cnt     Score     Error  Units
>> LeaPeephole.B_D_int   avgt    5   860.889 ±   6.089  ns/op
>> LeaPeephole.B_D_long  avgt    5   945.455 ±  21.603  ns/op
>> LeaPeephole.B_I_int   avgt    5   849.109 ±   9.809  ns/op
>> LeaPeephole.B_I_long  avgt    5   851.283 ±  16.921  ns/op
>> LeaPeephole.I_S_int   avgt    5   976.594 ±  23.004  ns/op
>> LeaPeephole.I_S_long  avgt    5   936.984 ±   9.601  ns/op
>>
>> A following patch would add IR tests for these transformations since the IR framework has not been able to parse the ideal scheduling yet although printing the scheduling itself has been made possible recently.
>>
>> Thank you very much.
>
> Quan Anh Mai has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 23 commits:
>
> - Merge branch 'master' into peephole
> - some fix
> - add benchmark
> - Merge branch 'master' into peephole
> - refactor
> - fix?
> - refactor
> - attempt
> - attempt
> - build fix
> - ... and 13 more: https://git.openjdk.java.net/jdk/compare/72bd41b8...78b4a3f2

There was a failure in testing. I reported it in JBS.

-------------

PR: https://git.openjdk.java.net/jdk/pull/8025

From duke at openjdk.java.net Mon May 23 23:49:41 2022
From: duke at openjdk.java.net (Quan Anh Mai)
Date: Mon, 23 May 2022 23:49:41 GMT
Subject: RFR: 8285973: x86_64: Improve fp comparison and cmove for eq/ne [v3]
In-Reply-To:
References:
Message-ID:

On Sat, 21 May 2022 10:31:25 GMT, Quan Anh Mai wrote:

>> Hi,
>>
>> This patch optimises the matching rules for floating-point comparison with respects to eq/ne on x86-64
>>
>> 1, When the inputs of a comparison is the same (i.e `isNaN` patterns), `ZF` is always set, so we don't need `cmpOpUCF2` for the eq/ne cases, which improves the sequence of `If (CmpF x x) (Bool ne)` from
>>
>>     ucomiss xmm0, xmm0
>>     jp label
>>     jne label
>>
>> into
>>
>>     ucomiss xmm0, xmm0
>>     jp label
>>
>> 2, The move rules for `cmpOpUCF2` is missing, which makes patterns such as `x == y ? 1 : 0` to fall back to `cmpOpU`, which have a really high cost of fixing the flags, such as
>>
>>     xorl ecx, ecx
>>     ucomiss xmm0, xmm1
>>     jnp done
>>     pushf
>>     andq [rsp], 0xffffff2b
>>     popf
>>     done:
>>     movl eax, 1
>>     cmovel eax, ecx
>>
>> The patch changes this sequence into
>>
>>     xorl ecx, ecx
>>     ucomiss xmm0, xmm1
>>     movl eax, 1
>>     cmovpl eax, ecx
>>     cmovnel eax, ecx
>>
>> 3, The patch also changes the pattern of `isInfinite` to be more optimised by using `Math.abs` to reduce 1 comparison and compares the result with `MAX_VALUE` since `>` is more optimised than `==` for floating-point types.
>>
>> The benchmark results are as follow:
>>
>>                                             Before              After
>> Benchmark                      Mode  Cnt     Score     Error     Score     Error  Unit   Ratio
>> FPComparison.equalDouble       avgt    5  2876.242 ±  58.875   594.636 ±   8.922  ns/op   4.84
>> FPComparison.equalFloat        avgt    5  3062.430 ±  31.371   663.849 ±   3.656  ns/op   4.61
>> FPComparison.isFiniteDouble    avgt    5   475.749 ±  19.027   518.309 ± 107.352  ns/op   0.92
>> FPComparison.isFiniteFloat     avgt    5   506.525 ±  14.417   515.576 ±  14.669  ns/op   0.98
>> FPComparison.isInfiniteDouble  avgt    5  1232.800 ±  31.677   621.185 ±  11.935  ns/op   1.98
>> FPComparison.isInfiniteFloat   avgt    5  1234.708 ±  70.239   623.566 ±  15.206  ns/op   1.98
>> FPComparison.isNanDouble       avgt    5  2255.847 ±   7.238   400.124 ±   0.762  ns/op   5.64
>> FPComparison.isNanFloat        avgt    5  2567.044 ±  36.078   546.486 ±   1.509  ns/op   4.70
>>
>> Thank you very much.
>
> Quan Anh Mai has updated the pull request incrementally with one additional commit since the last revision:
>
>   comments

Thank you very much for your reviews and testing.
-------------

PR: https://git.openjdk.java.net/jdk/pull/8525

From duke at openjdk.java.net Tue May 24 00:18:42 2022
From: duke at openjdk.java.net (Quan Anh Mai)
Date: Tue, 24 May 2022 00:18:42 GMT
Subject: Integrated: 8285973: x86_64: Improve fp comparison and cmove for eq/ne
In-Reply-To:
References:
Message-ID:

On Wed, 4 May 2022 01:59:17 GMT, Quan Anh Mai wrote:

> Hi,
>
> This patch optimises the matching rules for floating-point comparison with respects to eq/ne on x86-64
>
> 1, When the inputs of a comparison is the same (i.e `isNaN` patterns), `ZF` is always set, so we don't need `cmpOpUCF2` for the eq/ne cases, which improves the sequence of `If (CmpF x x) (Bool ne)` from
>
>     ucomiss xmm0, xmm0
>     jp label
>     jne label
>
> into
>
>     ucomiss xmm0, xmm0
>     jp label
>
> 2, The move rules for `cmpOpUCF2` is missing, which makes patterns such as `x == y ? 1 : 0` to fall back to `cmpOpU`, which have a really high cost of fixing the flags, such as
>
>     xorl ecx, ecx
>     ucomiss xmm0, xmm1
>     jnp done
>     pushf
>     andq [rsp], 0xffffff2b
>     popf
>     done:
>     movl eax, 1
>     cmovel eax, ecx
>
> The patch changes this sequence into
>
>     xorl ecx, ecx
>     ucomiss xmm0, xmm1
>     movl eax, 1
>     cmovpl eax, ecx
>     cmovnel eax, ecx
>
> 3, The patch also changes the pattern of `isInfinite` to be more optimised by using `Math.abs` to reduce 1 comparison and compares the result with `MAX_VALUE` since `>` is more optimised than `==` for floating-point types.
>
> The benchmark results are as follow:
>
>                                             Before              After
> Benchmark                      Mode  Cnt     Score     Error     Score     Error  Unit   Ratio
> FPComparison.equalDouble       avgt    5  2876.242 ±  58.875   594.636 ±   8.922  ns/op   4.84
> FPComparison.equalFloat        avgt    5  3062.430 ±  31.371   663.849 ±   3.656  ns/op   4.61
> FPComparison.isFiniteDouble    avgt    5   475.749 ±  19.027   518.309 ± 107.352  ns/op   0.92
> FPComparison.isFiniteFloat     avgt    5   506.525 ±  14.417   515.576 ±  14.669  ns/op   0.98
> FPComparison.isInfiniteDouble  avgt    5  1232.800 ±  31.677   621.185 ±  11.935  ns/op   1.98
> FPComparison.isInfiniteFloat   avgt    5  1234.708 ±  70.239   623.566 ±  15.206  ns/op   1.98
> FPComparison.isNanDouble       avgt    5  2255.847 ±   7.238   400.124 ±   0.762  ns/op   5.64
> FPComparison.isNanFloat        avgt    5  2567.044 ±  36.078   546.486 ±   1.509  ns/op   4.70
>
> Thank you very much.

This pull request has now been integrated.

Changeset: c1db70d8
Author:    Quan Anh Mai
Committer: Sandhya Viswanathan
URL:       https://git.openjdk.java.net/jdk/commit/c1db70d827f7ac81aa6c6646e2431f672c71c8dc
Stats:     704 lines in 4 files changed: 620 ins; 70 del; 14 mod

8285973: x86_64: Improve fp comparison and cmove for eq/ne

Reviewed-by: kvn, sviswanathan

-------------

PR: https://git.openjdk.java.net/jdk/pull/8525

From kvn at openjdk.java.net Tue May 24 00:22:56 2022
From: kvn at openjdk.java.net (Vladimir Kozlov)
Date: Tue, 24 May 2022 00:22:56 GMT
Subject: RFR: JDK-8284944: assert(cnt++ < 40) failed: infinite cycle in loop optimization [v2]
In-Reply-To: <048Bn62N56IG6Z-1e1POyo-gjogYkq9lDy4I5TQXOLk=.f5a67ec6-22f6-4541-b41b-7920851b1bd5@github.com>
References: <6tYUlU6To3dIk5NZNcyu2PI8m72uLsw09qO_5ca4GBY=.97d63197-96e1-4f4f-b854-0d54c1628267@github.com> <048Bn62N56IG6Z-1e1POyo-gjogYkq9lDy4I5TQXOLk=.f5a67ec6-22f6-4541-b41b-7920851b1bd5@github.com>
Message-ID:

On Wed, 18 May 2022 13:06:45 GMT, Tobias Holenstein wrote:

>> `_loop_opts_cnt` is set to `LoopOptsCount` which can have a maximum value of 43. `_loop_opts_cnt` is decremented in `PHASE_PHASEIDEALLOOP1`, `PHASE_PHASEIDEALLOOP2` and `PHASE_PHASEIDEALLOOP3` before it reaches `PHASE_PHASEIDEALLOOP_ITERATIONS` where it is decremented further in a loop until `_loop_opts_cnt` is 0. The assert assumes that `_loop_opts_cnt` has max. value 40 in `PHASE_PHASEIDEALLOOP_ITERATIONS`. But when `PartialPeelLoop` is turned off `PHASE_PHASEIDEALLOOP2` is skipped and `_loop_opts_cnt` can have max. value 41 in `PHASE_PHASEIDEALLOOP_ITERATIONS`. Therefore the assert is wrong.
>>
>> I propose to remove the assert entirely since the loop already has a condition `_loop_opts_cnt > 0` and `_loop_opts_cnt` is decremented in every iteration.
>
> Tobias Holenstein has updated the pull request incrementally with one additional commit since the last revision:
>
>   reformat spaces in test

As I commented in the bug report, it took 54 (cnt == 52) iterations to finish compilation of the TestMaxLoopOptsCountReached::test method. Should we just raise the default value of the LoopOptsCount flag and the limit in the assert?

-------------

PR: https://git.openjdk.java.net/jdk/pull/8767

From kvn at openjdk.java.net Tue May 24 00:41:51 2022
From: kvn at openjdk.java.net (Vladimir Kozlov)
Date: Tue, 24 May 2022 00:41:51 GMT
Subject: RFR: JDK-8284944: assert(cnt++ < 40) failed: infinite cycle in loop optimization [v2]
In-Reply-To: <048Bn62N56IG6Z-1e1POyo-gjogYkq9lDy4I5TQXOLk=.f5a67ec6-22f6-4541-b41b-7920851b1bd5@github.com>
References: <6tYUlU6To3dIk5NZNcyu2PI8m72uLsw09qO_5ca4GBY=.97d63197-96e1-4f4f-b854-0d54c1628267@github.com> <048Bn62N56IG6Z-1e1POyo-gjogYkq9lDy4I5TQXOLk=.f5a67ec6-22f6-4541-b41b-7920851b1bd5@github.com>
Message-ID:

On Wed, 18 May 2022 13:06:45 GMT, Tobias Holenstein wrote:

>> `_loop_opts_cnt` is set to `LoopOptsCount` which can have a maximum value of 43. `_loop_opts_cnt` is decremented in `PHASE_PHASEIDEALLOOP1`, `PHASE_PHASEIDEALLOOP2` and `PHASE_PHASEIDEALLOOP3` before it reaches `PHASE_PHASEIDEALLOOP_ITERATIONS` where it is decremented further in a loop until `_loop_opts_cnt` is 0. The assert assumes that `_loop_opts_cnt` has max. value 40 in `PHASE_PHASEIDEALLOOP_ITERATIONS`. But when `PartialPeelLoop` is turned off `PHASE_PHASEIDEALLOOP2` is skipped and `_loop_opts_cnt` can have max. value 41 in `PHASE_PHASEIDEALLOOP_ITERATIONS`. Therefore the assert is wrong.
>>
>> I propose to remove the assert entirely since the loop already has a condition `_loop_opts_cnt > 0` and `_loop_opts_cnt` is decremented in every iteration.
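The counting argument in the quoted description can be replayed with a small stand-alone model. It is a sketch under assumptions, not the HotSpot code: each of the three named phases is assumed to consume exactly one unit of the budget, major progress is assumed to be reported in every round, and both function names are invented for this example.

```cpp
// Simplified model of the budget arithmetic, not HotSpot code.
// budget_at_iterations and old_assert_would_fail are hypothetical names.

// Budget left when the iteration phase starts, assuming each earlier phase
// that actually runs consumes exactly one unit.
int budget_at_iterations(int loop_opts_count, bool partial_peel_enabled) {
  int remaining = loop_opts_count;
  remaining--;                    // PHASEIDEALLOOP1
  if (partial_peel_enabled) {
    remaining--;                  // PHASEIDEALLOOP2, skipped when partial peeling is off
  }
  remaining--;                    // PHASEIDEALLOOP3
  return remaining;
}

// Returns true if a check shaped like assert(cnt++ < 40) would fail before
// the budget is used up, assuming every round reports major progress.
bool old_assert_would_fail(int remaining) {
  int cnt = 0;
  while (remaining > 0) {
    if (!(cnt++ < 40)) {
      return true;                // the 41st round reaches the check with cnt == 40
    }
    remaining--;
  }
  return false;
}
```

With the maximum of 43 quoted above, the model leaves 40 rounds when partial peeling is enabled and 41 when it is disabled, and only the second case trips the check. The budget condition alone still ends the loop in both cases.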
> > Tobias Holenstein has updated the pull request incrementally with one additional commit since the last revision: > > reformat spaces in test The original Test.java did not finish after 1000 iterations and I don't see what loop opts it executed. ------------- PR: https://git.openjdk.java.net/jdk/pull/8767 From kvn at openjdk.java.net Tue May 24 00:48:39 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Tue, 24 May 2022 00:48:39 GMT Subject: RFR: JDK-8284944: assert(cnt++ < 40) failed: infinite cycle in loop optimization [v2] In-Reply-To: <048Bn62N56IG6Z-1e1POyo-gjogYkq9lDy4I5TQXOLk=.f5a67ec6-22f6-4541-b41b-7920851b1bd5@github.com> References: <6tYUlU6To3dIk5NZNcyu2PI8m72uLsw09qO_5ca4GBY=.97d63197-96e1-4f4f-b854-0d54c1628267@github.com> <048Bn62N56IG6Z-1e1POyo-gjogYkq9lDy4I5TQXOLk=.f5a67ec6-22f6-4541-b41b-7920851b1bd5@github.com> Message-ID: On Wed, 18 May 2022 13:06:45 GMT, Tobias Holenstein wrote: >> `_loop_opts_cnt` is set to `LoopOptsCount` which can have a maximum value of 43. `_loop_opts_cnt` is decremented in `PHASE_PHASEIDEALLOOP1`, `PHASE_PHASEIDEALLOOP2` and `PHASE_PHASEIDEALLOOP3` before it reaches `PHASE_PHASEIDEALLOOP_ITERATIONS` where it is decremented further in a loop until `_loop_opts_cnt` is 0. The assert assumes that `_loop_opts_cnt` has max. value 40 in `PHASE_PHASEIDEALLOOP_ITERATIONS`. But when `PartialPeelLoop` is turned off `PHASE_PHASEIDEALLOOP2` is skipped and `_loop_opts_cnt` can have max. value 41 in `PHASE_PHASEIDEALLOOP_ITERATIONS`. Therefore the assert is wrong. >> >> I propose to remove the assert entirely since the loop already has a condition `_loop_opts_cnt > 0` and `_loop_opts_cnt` is decremented in every iteration. > > Tobias Holenstein has updated the pull request incrementally with one additional commit since the last revision: > > reformat spaces in test I think first thing to do is to make sure we got report about what cause `major_progress` to be `true` in those loop opts iteration. 
Currently I see only repeated output in Test.java. I also found that without "Strip Mining" (I used -XX:+UseParallelGC because I can't find how to switch it off otherwise) the Test.java finished in 9 iterations (cnt == 7) !!! So it is definitely something wrong when we use "Strip Mining" in this test. ------------- PR: https://git.openjdk.java.net/jdk/pull/8767 From kvn at openjdk.java.net Tue May 24 01:00:49 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Tue, 24 May 2022 01:00:49 GMT Subject: RFR: JDK-8284944: assert(cnt++ < 40) failed: infinite cycle in loop optimization [v2] In-Reply-To: <048Bn62N56IG6Z-1e1POyo-gjogYkq9lDy4I5TQXOLk=.f5a67ec6-22f6-4541-b41b-7920851b1bd5@github.com> References: <6tYUlU6To3dIk5NZNcyu2PI8m72uLsw09qO_5ca4GBY=.97d63197-96e1-4f4f-b854-0d54c1628267@github.com> <048Bn62N56IG6Z-1e1POyo-gjogYkq9lDy4I5TQXOLk=.f5a67ec6-22f6-4541-b41b-7920851b1bd5@github.com> Message-ID: On Wed, 18 May 2022 13:06:45 GMT, Tobias Holenstein wrote: >> `_loop_opts_cnt` is set to `LoopOptsCount` which can have a maximum value of 43. `_loop_opts_cnt` is decremented in `PHASE_PHASEIDEALLOOP1`, `PHASE_PHASEIDEALLOOP2` and `PHASE_PHASEIDEALLOOP3` before it reaches `PHASE_PHASEIDEALLOOP_ITERATIONS` where it is decremented further in a loop until `_loop_opts_cnt` is 0. The assert assumes that `_loop_opts_cnt` has max. value 40 in `PHASE_PHASEIDEALLOOP_ITERATIONS`. But when `PartialPeelLoop` is turned off `PHASE_PHASEIDEALLOOP2` is skipped and `_loop_opts_cnt` can have max. value 41 in `PHASE_PHASEIDEALLOOP_ITERATIONS`. Therefore the assert is wrong. >> >> I propose to remove the assert entirely since the loop already has a condition `_loop_opts_cnt > 0` and `_loop_opts_cnt` is decremented in every iteration. > > Tobias Holenstein has updated the pull request incrementally with one additional commit since the last revision: > > reformat spaces in test And I don't know how they reproduced failure with JDK 8 and 11. 
I run it with `java -XX:+TraceLoopOpts -Xcomp -XX:-PartialPeelLoop -XX:+PrintCompilation -XX:+UseParallelGC Test` ------------- PR: https://git.openjdk.java.net/jdk/pull/8767 From fjiang at openjdk.java.net Tue May 24 02:40:05 2022 From: fjiang at openjdk.java.net (Feilong Jiang) Date: Tue, 24 May 2022 02:40:05 GMT Subject: RFR: JDK-8287194: build failure on riscv after JDK-8286825 Message-ID: [JDK-8286825](https://bugs.openjdk.java.net/browse/JDK-8286825) made some renaming of hotspot file and method, the following naming changes are missing in riscv: - universalNativeInvoker*--> downcallLinker* - 'native invoker' -> 'downcall stub' Additional testing: - [x] riscv release build ------------- Commit messages: - build failure on riscv after JDK-8286825 Changes: https://git.openjdk.java.net/jdk/pull/8859/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=8859&range=00 Issue: https://bugs.openjdk.java.net/browse/JDK-8287194 Stats: 9 lines in 1 file changed: 0 ins; 0 del; 9 mod Patch: https://git.openjdk.java.net/jdk/pull/8859.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8859/head:pull/8859 PR: https://git.openjdk.java.net/jdk/pull/8859 From fyang at openjdk.java.net Tue May 24 02:48:39 2022 From: fyang at openjdk.java.net (Fei Yang) Date: Tue, 24 May 2022 02:48:39 GMT Subject: RFR: JDK-8287194: build failure on riscv after JDK-8286825 In-Reply-To: References: Message-ID: On Tue, 24 May 2022 02:32:04 GMT, Feilong Jiang wrote: > [JDK-8286825](https://bugs.openjdk.java.net/browse/JDK-8286825) made some renaming of hotspot file and method, the following naming changes are missing in riscv: > - universalNativeInvoker*--> downcallLinker* > - 'native invoker' -> 'downcall stub' > > Additional testing: > - [x] riscv release build Looks good. ------------- Marked as reviewed by fyang (Reviewer). 
PR: https://git.openjdk.java.net/jdk/pull/8859 From duke at openjdk.java.net Tue May 24 04:34:54 2022 From: duke at openjdk.java.net (Yuta Sato) Date: Tue, 24 May 2022 04:34:54 GMT Subject: RFR: 8286990: Add compiler name to warning messages in Compiler Directive [v2] In-Reply-To: References: Message-ID: > When using Compiler Directive such as `java -XX:+UnlockDiagnosticVMOptions -XX:CompilerDirectivesFile= ` , > it shows totally the same message for c1 and c2 compiler and the user would be confused about > which compiler is affected by this message. > This should show messages with their compiler name so that the user knows which compiler shows this message. > > My change result would be like the below. > > > OpenJDK 64-Bit Server VM warning: printing of assembly code is enabled; turning on DebugNonSafepoints to gain additional output > OpenJDK 64-Bit Server VM warning: printing of assembly code is enabled; turning on DebugNonSafepoints to gain additional output > > -> > > OpenJDK 64-Bit Server VM warning: c1: printing of assembly code is enabled; turning on DebugNonSafepoints to gain additional output > OpenJDK 64-Bit Server VM warning: c2: printing of assembly code is enabled; turning on DebugNonSafepoints to gain additional output Yuta Sato has updated the pull request incrementally with one additional commit since the last revision: Update full name ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/8591/files - new: https://git.openjdk.java.net/jdk/pull/8591/files/e4f4cc84..64256f06 Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=8591&range=01 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=8591&range=00-01 Stats: 0 lines in 0 files changed: 0 ins; 0 del; 0 mod Patch: https://git.openjdk.java.net/jdk/pull/8591.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8591/head:pull/8591 PR: https://git.openjdk.java.net/jdk/pull/8591 From xliu at openjdk.java.net Tue May 24 04:59:42 2022 From: xliu at openjdk.java.net 
(Xin Liu) Date: Tue, 24 May 2022 04:59:42 GMT Subject: RFR: 8282024: add EscapeAnalysis statistics under PrintOptoStatistics [v15] In-Reply-To: <1U5ThvC2Kp9pmy2KP_WkP5Qsv4ZmomA093MBjNzJho0=.5efe9861-32eb-4eb2-9dc9-5a6871fe6cc7@github.com> References: <1U5ThvC2Kp9pmy2KP_WkP5Qsv4ZmomA093MBjNzJho0=.5efe9861-32eb-4eb2-9dc9-5a6871fe6cc7@github.com> Message-ID: On Mon, 23 May 2022 18:18:39 GMT, aamarsh wrote: >> Escape Analysis and Scalar Replacement statistics were added when the -XX:+PrintOptoStatistics flag is set. All code is placed in `#ifndef Product` block, so this code is only run when creating a debug build. Using renaissance benchmark I ran a few tests to confirm that numbers were printing correctly. Below is an example run: >> >> >> No escape = 263, Arg escape = 87, Global escape = 1628 >> Objects scalar replaced = 193, Monitor objects removed = 32, GC barriers removed = 38, Memory barriers removed = 225 > > aamarsh has updated the pull request incrementally with two additional commits since the last revision: > > - delete iterative EA comment > - account for iterative EA LGTM. I am not a reviewer. we need other reviewers to approve this. ------------- Marked as reviewed by xliu (Committer). PR: https://git.openjdk.java.net/jdk/pull/8019 From shade at openjdk.java.net Tue May 24 07:06:40 2022 From: shade at openjdk.java.net (Aleksey Shipilev) Date: Tue, 24 May 2022 07:06:40 GMT Subject: RFR: JDK-8287194: build failure on riscv after JDK-8286825 In-Reply-To: References: Message-ID: On Tue, 24 May 2022 02:32:04 GMT, Feilong Jiang wrote: > [JDK-8286825](https://bugs.openjdk.java.net/browse/JDK-8286825) made some renaming of hotspot file and method, the following naming changes are missing in riscv: > - universalNativeInvoker*--> downcallLinker* > - 'native invoker' -> 'downcall stub' > > Additional testing: > - [x] riscv release build Marked as reviewed by shade (Reviewer). Looks fine and trivial. 
------------- PR: https://git.openjdk.java.net/jdk/pull/8859 From fjiang at openjdk.java.net Tue May 24 07:12:58 2022 From: fjiang at openjdk.java.net (Feilong Jiang) Date: Tue, 24 May 2022 07:12:58 GMT Subject: RFR: JDK-8287194: build failure on riscv after JDK-8286825 In-Reply-To: References: Message-ID: <04YfTN-Csvn11239oIdPOoIZEgDmGqGX1iKrajfhVQE=.64197b52-0dd4-47cf-b9d7-fffd00403ce0@github.com> On Tue, 24 May 2022 02:45:29 GMT, Fei Yang wrote: >> [JDK-8286825](https://bugs.openjdk.java.net/browse/JDK-8286825) made some renaming of hotspot file and method, the following naming changes are missing in riscv: >> - universalNativeInvoker*--> downcallLinker* >> - 'native invoker' -> 'downcall stub' >> >> Additional testing: >> - [x] riscv release build > > Looks good. @RealFYang @shipilev Thanks for the review. ------------- PR: https://git.openjdk.java.net/jdk/pull/8859 From fjiang at openjdk.java.net Tue May 24 07:15:10 2022 From: fjiang at openjdk.java.net (Feilong Jiang) Date: Tue, 24 May 2022 07:15:10 GMT Subject: Integrated: JDK-8287194: build failure on riscv after JDK-8286825 In-Reply-To: References: Message-ID: On Tue, 24 May 2022 02:32:04 GMT, Feilong Jiang wrote: > [JDK-8286825](https://bugs.openjdk.java.net/browse/JDK-8286825) made some renaming of hotspot file and method, the following naming changes are missing in riscv: > - universalNativeInvoker*--> downcallLinker* > - 'native invoker' -> 'downcall stub' > > Additional testing: > - [x] riscv release build This pull request has now been integrated.
Changeset: 1cd7850f Author: Feilong Jiang Committer: Aleksey Shipilev URL: https://git.openjdk.java.net/jdk/commit/1cd7850f8745dc92d78e46f11856dd74dd8a66d1 Stats: 9 lines in 1 file changed: 0 ins; 0 del; 9 mod 8287194: build failure on riscv after JDK-8286825 Reviewed-by: fyang, shade ------------- PR: https://git.openjdk.java.net/jdk/pull/8859 From rcastanedalo at openjdk.java.net Tue May 24 07:23:02 2022 From: rcastanedalo at openjdk.java.net (Roberto Castañeda Lozano) Date: Tue, 24 May 2022 07:23:02 GMT Subject: Integrated: 8286177: C2: "failed: non-reduction loop contains reduction nodes" assert failure In-Reply-To: References: Message-ID: On Fri, 20 May 2022 09:45:19 GMT, Roberto Castañeda Lozano wrote: > [JDK-8279622](https://bugs.openjdk.java.net/browse/JDK-8279622) introduced an assertion in the SLP analysis verifying that the examined loop does not contain inconsistent reduction information. The assertion is placed at the beginning of the SLP analysis, and can fail even for loops that are not vectorizable by SLP (and hence without risk of being miscompiled). This is the case for the [reported issue](https://bugs.openjdk.java.net/browse/JDK-8286177) and for many other recent failures triggered by JavaFuzzer-generated test cases. This changeset postpones the assertion to the SLP output phase to ensure that it only fails when the program really risks being miscompiled. > > Given that many transformations can invalidate reduction information (see for example the JBS reports for [JDK-8261147](https://bugs.openjdk.java.net/browse/JDK-8261147), [JDK-8279622](https://bugs.openjdk.java.net/browse/JDK-8279622), and [JDK-8286177](https://bugs.openjdk.java.net/browse/JDK-8286177)), a more fundamental fix would be to remove the possibility of inconsistent reduction information by construction, by running the reduction analysis on-demand. I have filed a [RFE](https://bugs.openjdk.java.net/browse/JDK-8287087) to explore this idea after JDK 19.
> > #### Testing > > - hs-tier1-5 (windows-x64, linux-x64, linux-aarch64, and macosx-x64; debug mode). > - Tested that the assertion, in its new placement, would still have caught the bug reported in [JDK-8279622](https://bugs.openjdk.java.net/browse/JDK-8279622) in the original scenario. > - Tested that the assertion, in its new placement, does not fail for 27 JavaFuzzer-generated test cases that trigger the assertion in its original placement. This pull request has now been integrated. Changeset: 6458a56e Author: Roberto Castañeda Lozano URL: https://git.openjdk.java.net/jdk/commit/6458a56e60472fb2fbe8fa60bbc856dc95f50f07 Stats: 67 lines in 2 files changed: 65 ins; 2 del; 0 mod 8286177: C2: "failed: non-reduction loop contains reduction nodes" assert failure Reviewed-by: chagedorn, kvn ------------- PR: https://git.openjdk.java.net/jdk/pull/8805 From ngasson at openjdk.java.net Tue May 24 08:13:54 2022 From: ngasson at openjdk.java.net (Nick Gasson) Date: Tue, 24 May 2022 08:13:54 GMT Subject: RFR: 8287091: aarch64 : guarantee(val < (1ULL << nbits)) failed: Field too big for insn In-Reply-To: <2GEDEyjpLiWL1yS00lHxVP8SXrWIIYx07wPZ3xU_yeA=.7f12ae38-3aa7-48d3-ba8d-732e606c470a@github.com> References: <2GEDEyjpLiWL1yS00lHxVP8SXrWIIYx07wPZ3xU_yeA=.7f12ae38-3aa7-48d3-ba8d-732e606c470a@github.com> Message-ID: <2kLj4fmnDXg0mcARpR1ozapqcZL6dtHOE0JoGjugy-E=.5a1e295f-e66d-460d-b28e-795c5a704ef1@github.com> On Mon, 23 May 2022 14:55:34 GMT, Andrew Haley wrote: > This is fallout from the patch for JDK-8285923. > > The root cause of this bug is that there is a template definition of `cmp(register, immediate)` but there is not a template definition of `cmn(register, immediate)`. Given that we are close to rampdown, this patch fixes the bug in the most minimal way possible, by using `adds(zr, register, immediate)`, which correctly handles 64-bit operands. > > In the next release cycle we should tidy up `cmn()` in the same way that was done for JDK-8206895.
> > Alternatively, we could back out JDK-8285923. I'd rather not, given that it fixes a real (if latent) bug, but if needs be I'll do so. Marked as reviewed by ngasson (Reviewer). ------------- PR: https://git.openjdk.java.net/jdk/pull/8845 From shade at openjdk.java.net Tue May 24 08:39:39 2022 From: shade at openjdk.java.net (Aleksey Shipilev) Date: Tue, 24 May 2022 08:39:39 GMT Subject: RFR: 8287091: aarch64 : guarantee(val < (1ULL << nbits)) failed: Field too big for insn In-Reply-To: <2GEDEyjpLiWL1yS00lHxVP8SXrWIIYx07wPZ3xU_yeA=.7f12ae38-3aa7-48d3-ba8d-732e606c470a@github.com> References: <2GEDEyjpLiWL1yS00lHxVP8SXrWIIYx07wPZ3xU_yeA=.7f12ae38-3aa7-48d3-ba8d-732e606c470a@github.com> Message-ID: On Mon, 23 May 2022 14:55:34 GMT, Andrew Haley wrote: > This is fallout from the patch for JDK-8285923. > > The root cause of this bug is that there is a template definition of `cmp(register, immediate)` but there is not a template definition of `cmn(register, immediate)`. Given that we are close to rampdown, this patch fixes the bug in the most minimal way possible, by using `adds(zr, register, immediate)`, which correctly handles 64-bit operands. > > In the next release cycle we should tidy up `cmn()` in the same way that was done for JDK-8206895. > > Alternatively, we could back out JDK-8285923. I'd rather not, given that it fixes a real (if latent) bug, but if needs be I'll do so. All right, so we just store the result to `zr`, which is effectively no-op, and ride on the updated flags done by `add*s*`. Looks fine. ------------- Marked as reviewed by shade (Reviewer). 
PR: https://git.openjdk.java.net/jdk/pull/8845 From ngasson at openjdk.java.net Tue May 24 09:06:50 2022 From: ngasson at openjdk.java.net (Nick Gasson) Date: Tue, 24 May 2022 09:06:50 GMT Subject: RFR: 8287091: aarch64 : guarantee(val < (1ULL << nbits)) failed: Field too big for insn In-Reply-To: References: <2GEDEyjpLiWL1yS00lHxVP8SXrWIIYx07wPZ3xU_yeA=.7f12ae38-3aa7-48d3-ba8d-732e606c470a@github.com> Message-ID: On Tue, 24 May 2022 08:36:49 GMT, Aleksey Shipilev wrote: > All right, so we just store the result to `zr`, which is effectively no-op, and ride on the updated flags done by `add*s*`. It's actually the same instruction: `cmn reg, imm` is an alias for `adds zr, reg, imm`. ------------- PR: https://git.openjdk.java.net/jdk/pull/8845 From shade at openjdk.java.net Tue May 24 09:06:50 2022 From: shade at openjdk.java.net (Aleksey Shipilev) Date: Tue, 24 May 2022 09:06:50 GMT Subject: RFR: 8287091: aarch64 : guarantee(val < (1ULL << nbits)) failed: Field too big for insn In-Reply-To: References: <2GEDEyjpLiWL1yS00lHxVP8SXrWIIYx07wPZ3xU_yeA=.7f12ae38-3aa7-48d3-ba8d-732e606c470a@github.com> Message-ID: On Tue, 24 May 2022 09:03:32 GMT, Nick Gasson wrote: > > All right, so we just store the result to `zr`, which is effectively no-op, and ride on the updated flags done by `add*s*`. > > It's actually the same instruction: `cmn reg, imm` is an alias for `adds zr, reg, imm`. Nice. Even better then. 
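The alias mentioned above can be illustrated outside the assembler. The following is a minimal Java sketch, assuming only the architectural definition that `cmn reg, imm` sets the condition flags from `reg + imm`. The class and method names are hypothetical, this is not HotSpot code, and only the Z and N flags are modelled.

```java
// Hypothetical model of the AArch64 flag-setting add; not HotSpot code.
final class CmnModel {
    // Z flag after `adds zr, reg, imm`: the 64-bit sum is zero.
    static boolean zeroFlagAfterAdds(long reg, long imm) {
        return reg + imm == 0L; // Java long addition wraps like a 64-bit register
    }

    // N flag after `adds zr, reg, imm`: the 64-bit sum is negative.
    static boolean negativeFlagAfterAdds(long reg, long imm) {
        return reg + imm < 0L;
    }

    // `cmn reg, imm` is the same instruction, so it yields the same flags.
    static boolean zeroFlagAfterCmn(long reg, long imm) {
        return zeroFlagAfterAdds(reg, imm);
    }

    public static void main(String[] args) {
        // Through the Z flag, cmn answers "is reg equal to -imm?".
        System.out.println(zeroFlagAfterCmn(-5L, 5L));
        System.out.println(zeroFlagAfterCmn(-4L, 5L));
    }
}
```

The result of the addition is discarded (written to `zr`); only the flags are of interest, which is why the two spellings are interchangeable.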
------------- PR: https://git.openjdk.java.net/jdk/pull/8845 From chagedorn at openjdk.java.net Tue May 24 09:28:38 2022 From: chagedorn at openjdk.java.net (Christian Hagedorn) Date: Tue, 24 May 2022 09:28:38 GMT Subject: RFR: JDK-8284944: assert(cnt++ < 40) failed: infinite cycle in loop optimization [v2] In-Reply-To: References: <6tYUlU6To3dIk5NZNcyu2PI8m72uLsw09qO_5ca4GBY=.97d63197-96e1-4f4f-b854-0d54c1628267@github.com> <048Bn62N56IG6Z-1e1POyo-gjogYkq9lDy4I5TQXOLk=.f5a67ec6-22f6-4541-b41b-7920851b1bd5@github.com> Message-ID: On Tue, 24 May 2022 00:19:35 GMT, Vladimir Kozlov wrote: > As I commented in the bug report, it took 54 (cnt == 52) iterations to finish compiling the TestMaxLoopOptsCountReached::test method. Should we just raise the default value of the LoopOptsCount flag and the limit in the assert? I think the problem is to determine what a good value should be. I'm afraid we could probably come up with some other cases where we can reach the limit again and hit the assert with a larger value. I agree that there might be a problem with `major_progress` and we should not do so many optimizations. But in general, I'm not sure how we can prove that we will only hit the assert in case of a real bug and not just a false positive. Maybe we would need some additional heuristics to make a decision (like handling the live node limit)? This would need some more careful investigation. The way the assert is currently implemented does not help us. It will never be hit without explicitly disabling partial peeling. So, I think we should either remove it or change its value to 39 to possibly let it fail on the very last loop opts iteration. However, since we already have some different cases where we would fail with 39, this is currently not a good option. But I agree with Vladimir that it would be good to have some mechanism to warn us about such problems in the future.
So, I think there are 2 problems to solve: - Investigate the known cases so far and figure out why they need so many loop opts iterations. - Make the assert useful again (blocked by the first problem). I'd suggest to investigate the cases we've found so far and file separate bugs for them if they turn out to be real bugs. Additionally, I suggest the following 2 options: - Remove the assert for now (given that it has never worked as it was supposed to work) and file an RFE to check if we can reintroduce it once the known cases/bugs are fixed and we have some confidence to not hit it again (could include tweaking the `LoopOptsCount` max value or introducing some other heuristics to avoid false positives). - Keep the assert as we have it today and just defer this bug and treat it like the RFE above. I would opt for option 1 given that the assert has no real value as of today. What do you think? Thanks, Christian ------------- PR: https://git.openjdk.java.net/jdk/pull/8767 From eosterlund at openjdk.java.net Tue May 24 11:27:55 2022 From: eosterlund at openjdk.java.net (Erik Österlund) Date: Tue, 24 May 2022 11:27:55 GMT Subject: RFR: 8284404: Too aggressive sweeping with Loom In-Reply-To: References: Message-ID: On Thu, 12 May 2022 16:07:11 GMT, Vladimir Kozlov wrote: >> The normal sweeping heuristics trigger sweeping whenever 0.5% of the reserved code cache could have died. Normally that is fine, but with loom such sweeping requires a full GC cycle, as stacks can now be in the Java heap as well. In that context, 0.5% does seem to be a bit too trigger happy. So this patch adjusts that default when using loom to 10x higher. >> If you run something like jython which spins up a lot of code, it unsurprisingly triggers a lot less GCs due to code cache pressure. > > Did you run our regular performance testing with loom to see how this change affect performance? > Why 10x and not other number?
@vnkozlov Is your concern that a user explicitly overrides the default to a value that ends up not being good? If so, I'm not sure why we would be in the business of preventing the user from shooting itself in the foot and guessing what the user really wanted here. Maybe I missed something. ------------- PR: https://git.openjdk.java.net/jdk/pull/8673 From rcastanedalo at openjdk.java.net Tue May 24 13:27:44 2022 From: rcastanedalo at openjdk.java.net (Roberto Castañeda Lozano) Date: Tue, 24 May 2022 13:27:44 GMT Subject: RFR: 8287091: aarch64 : guarantee(val < (1ULL << nbits)) failed: Field too big for insn In-Reply-To: <2GEDEyjpLiWL1yS00lHxVP8SXrWIIYx07wPZ3xU_yeA=.7f12ae38-3aa7-48d3-ba8d-732e606c470a@github.com> References: <2GEDEyjpLiWL1yS00lHxVP8SXrWIIYx07wPZ3xU_yeA=.7f12ae38-3aa7-48d3-ba8d-732e606c470a@github.com> Message-ID: On Mon, 23 May 2022 14:55:34 GMT, Andrew Haley wrote: > This is fallout from the patch for JDK-8285923. > > The root cause of this bug is that there is a template definition of `cmp(register, immediate)` but there is not a template definition of `cmn(register, immediate)`. Given that we are close to rampdown, this patch fixes the bug in the most minimal way possible, by using `adds(zr, register, immediate)`, which correctly handles 64-bit operands. > > In the next release cycle we should tidy up `cmn()` in the same way that was done for JDK-8206895. > > Alternatively, we could back out JDK-8285923. I'd rather not, given that it fixes a real (if latent) bug, but if needs be I'll do so. Internal testing (hs-tier1-5 and affected JCK test case) looks good.
------------- PR: https://git.openjdk.java.net/jdk/pull/8845 From kvn at openjdk.java.net Tue May 24 14:13:04 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Tue, 24 May 2022 14:13:04 GMT Subject: RFR: 8282024: add EscapeAnalysis statistics under PrintOptoStatistics [v15] In-Reply-To: <1U5ThvC2Kp9pmy2KP_WkP5Qsv4ZmomA093MBjNzJho0=.5efe9861-32eb-4eb2-9dc9-5a6871fe6cc7@github.com> References: <1U5ThvC2Kp9pmy2KP_WkP5Qsv4ZmomA093MBjNzJho0=.5efe9861-32eb-4eb2-9dc9-5a6871fe6cc7@github.com> Message-ID: On Mon, 23 May 2022 18:18:39 GMT, aamarsh wrote: >> Escape Analysis and Scalar Replacement statistics were added when the -XX:+PrintOptoStatistics flag is set. All code is placed in `#ifndef Product` block, so this code is only run when creating a debug build. Using renaissance benchmark I ran a few tests to confirm that numbers were printing correctly. Below is an example run: >> >> >> No escape = 263, Arg escape = 87, Global escape = 1628 >> Objects scalar replaced = 193, Monitor objects removed = 32, GC barriers removed = 38, Memory barriers removed = 225 > > aamarsh has updated the pull request incrementally with two additional commits since the last revision: > > - delete iterative EA comment > - account for iterative EA Looks good. I verified output. ------------- Marked as reviewed by kvn (Reviewer). 
PR: https://git.openjdk.java.net/jdk/pull/8019 From kvn at openjdk.java.net Tue May 24 14:14:07 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Tue, 24 May 2022 14:14:07 GMT Subject: RFR: JDK-8284944: assert(cnt++ < 40) failed: infinite cycle in loop optimization [v2] In-Reply-To: <048Bn62N56IG6Z-1e1POyo-gjogYkq9lDy4I5TQXOLk=.f5a67ec6-22f6-4541-b41b-7920851b1bd5@github.com> References: <6tYUlU6To3dIk5NZNcyu2PI8m72uLsw09qO_5ca4GBY=.97d63197-96e1-4f4f-b854-0d54c1628267@github.com> <048Bn62N56IG6Z-1e1POyo-gjogYkq9lDy4I5TQXOLk=.f5a67ec6-22f6-4541-b41b-7920851b1bd5@github.com> Message-ID: <77eWB32c9JUewXFDUrQBfyWRT0rnPCDKuxLIkNxY9QM=.1e42be0f-9c81-4749-9b6b-baacc5b5e39d@github.com> On Wed, 18 May 2022 13:06:45 GMT, Tobias Holenstein wrote: >> `_loop_opts_cnt` is set to `LoopOptsCount` which can have a maximum value of 43. `_loop_opts_cnt` is decremented in `PHASE_PHASEIDEALLOOP1`, `PHASE_PHASEIDEALLOOP2` and `PHASE_PHASEIDEALLOOP3` before it reaches `PHASE_PHASEIDEALLOOP_ITERATIONS` where it is decremented further in a loop until `_loop_opts_cnt` is 0. The assert assumes that `_loop_opts_cnt` has max. value 40 in `PHASE_PHASEIDEALLOOP_ITERATIONS`. But when `PartialPeelLoop` is turned off `PHASE_PHASEIDEALLOOP2` is skipped and `_loop_opts_cnt` can have max. value 41 in `PHASE_PHASEIDEALLOOP_ITERATIONS`. Therefore the assert is wrong. >> >> I propose to remove the assert entirely since the loop already has a condition `_loop_opts_cnt > 0` and `_loop_opts_cnt` is decremented in every iteration. > > Tobias Holenstein has updated the pull request incrementally with one additional commit since the last revision: > > reformat spaces in test Marked as reviewed by kvn (Reviewer). 
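The counting argument quoted above can be replayed with a small model. This is a sketch only, assuming that each of the three named phases decrements the counter exactly once, which is what the quoted values 43, 40 and 41 imply. The class, constants and method names are hypothetical and do not appear in HotSpot.

```java
// Hypothetical model of the counter described in the PR text; not HotSpot code.
final class LoopOptsBudget {
    static final int MAX_LOOP_OPTS_COUNT = 43; // maximum LoopOptsCount per the description
    static final int ASSERT_LIMIT = 40;        // limit used by assert(cnt++ < 40)

    // Rounds left when PHASE_PHASEIDEALLOOP_ITERATIONS is entered.
    static int roundsLeft(int loopOptsCount, boolean partialPeelLoop) {
        int remaining = loopOptsCount;
        remaining--;                  // PHASE_PHASEIDEALLOOP1
        if (partialPeelLoop) {
            remaining--;              // PHASE_PHASEIDEALLOOP2, skipped with -XX:-PartialPeelLoop
        }
        remaining--;                  // PHASE_PHASEIDEALLOOP3
        return remaining;
    }

    // True if assert(cnt++ < 40) would fire when every remaining round is used.
    static boolean assertWouldFire(int loopOptsCount, boolean partialPeelLoop) {
        int cnt = 0;
        for (int remaining = roundsLeft(loopOptsCount, partialPeelLoop); remaining > 0; remaining--) {
            if (!(cnt++ < ASSERT_LIMIT)) {
                return true;          // the 41st round sees cnt == 40
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(assertWouldFire(MAX_LOOP_OPTS_COUNT, true));
        System.out.println(assertWouldFire(MAX_LOOP_OPTS_COUNT, false));
    }
}
```

In this model the loop condition alone bounds the number of rounds, so the assert adds no protection beyond it, which matches the argument for removing it.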
------------- PR: https://git.openjdk.java.net/jdk/pull/8767 From kvn at openjdk.java.net Tue May 24 14:14:09 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Tue, 24 May 2022 14:14:09 GMT Subject: RFR: JDK-8284944: assert(cnt++ < 40) failed: infinite cycle in loop optimization [v2] In-Reply-To: References: <6tYUlU6To3dIk5NZNcyu2PI8m72uLsw09qO_5ca4GBY=.97d63197-96e1-4f4f-b854-0d54c1628267@github.com> <048Bn62N56IG6Z-1e1POyo-gjogYkq9lDy4I5TQXOLk=.f5a67ec6-22f6-4541-b41b-7920851b1bd5@github.com> Message-ID: On Tue, 24 May 2022 09:23:31 GMT, Christian Hagedorn wrote: > * Remove the assert for now (given that it has never worked as it was supposed to work) and file an RFE to check if we can reintroduce it once the known cases/bugs are fixed and we have some confidence to not hit it again (could include tweaking the `LoopOptsCount` max value or introducing some other heuristics to avoid false positives). > * Keep the assert as we have it today and just defer this bug and treat it like the RFE above. > > I would opt for option 1 given that the assert has no real value as of today. Okay, I agree. ------------- PR: https://git.openjdk.java.net/jdk/pull/8767 From kvn at openjdk.java.net Tue May 24 14:27:51 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Tue, 24 May 2022 14:27:51 GMT Subject: RFR: 8284404: Too aggressive sweeping with Loom In-Reply-To: References: Message-ID: On Thu, 12 May 2022 07:30:39 GMT, Erik Österlund wrote: > The normal sweeping heuristics trigger sweeping whenever 0.5% of the reserved code cache could have died. Normally that is fine, but with loom such sweeping requires a full GC cycle, as stacks can now be in the Java heap as well. In that context, 0.5% does seem to be a bit too trigger happy. So this patch adjusts that default when using loom to 10x higher. > If you run something like jython which spins up a lot of code, it unsurprisingly triggers a lot less GCs due to code cache pressure. Marked as reviewed by kvn (Reviewer).
------------- PR: https://git.openjdk.java.net/jdk/pull/8673 From kvn at openjdk.java.net Tue May 24 14:27:52 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Tue, 24 May 2022 14:27:52 GMT Subject: RFR: 8284404: Too aggressive sweeping with Loom In-Reply-To: References: Message-ID: On Thu, 12 May 2022 16:07:11 GMT, Vladimir Kozlov wrote: >> The normal sweeping heuristics trigger sweeping whenever 0.5% of the reserved code cache could have died. Normally that is fine, but with loom such sweeping requires a full GC cycle, as stacks can now be in the Java heap as well. In that context, 0.5% does seem to be a bit too trigger happy. So this patch adjusts that default when using loom to 10x higher. >> If you run something like jython which spins up a lot of code, it unsurprisingly triggers a lot less GCs due to code cache pressure. > > Did you run our regular performance testing with loom to see how this change affect performance? > Why 10x and not other number? > @vnkozlov Is your concern that a user explicitly overrides the default to a value that ends up not being good? If so, I'm not sure why we would be in the business of preventing the user from shooting itself in the foot and guessing what the user really wanted here. Maybe I missed something. Yes, it was my concern which was unjustifiable because I missed that this code is guarded by `FLAG_IS_DEFAULT(SweeperThreshold)`. So you simply set `SweeperThreshold` to 5% (default is 0.5) which is fine. 
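The guard discussed above follows a common pattern: adjust a flag only when the user has not set it. Below is a minimal Java sketch of that pattern, assuming the values given in this thread (default 0.5, a 10x factor with Loom). The class, the `loomEnabled` parameter and the use of `null` to stand in for `FLAG_IS_DEFAULT(SweeperThreshold)` are all hypothetical; the real check lives in HotSpot's C++ code.

```java
// Hypothetical model of a default-guarded flag adjustment; not HotSpot code.
final class SweeperThresholdModel {
    static final double DEFAULT_THRESHOLD_PERCENT = 0.5; // default per the thread
    static final double LOOM_FACTOR = 10.0;              // "10x higher" per the thread

    // userValue == null models FLAG_IS_DEFAULT(SweeperThreshold) being true.
    static double effectiveThreshold(Double userValue, boolean loomEnabled) {
        if (userValue != null) {
            return userValue; // an explicit user setting is left untouched
        }
        return loomEnabled
                ? DEFAULT_THRESHOLD_PERCENT * LOOM_FACTOR
                : DEFAULT_THRESHOLD_PERCENT;
    }

    public static void main(String[] args) {
        System.out.println(effectiveThreshold(null, true));
        System.out.println(effectiveThreshold(null, false));
        System.out.println(effectiveThreshold(2.0, true));
    }
}
```

Because the adjustment applies only to the untouched default, a user who passes an explicit value keeps exactly what was requested, which is the point that resolved the review concern.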
------------- PR: https://git.openjdk.java.net/jdk/pull/8673 From chagedorn at openjdk.java.net Tue May 24 14:30:43 2022 From: chagedorn at openjdk.java.net (Christian Hagedorn) Date: Tue, 24 May 2022 14:30:43 GMT Subject: RFR: JDK-8284944: assert(cnt++ < 40) failed: infinite cycle in loop optimization [v2] In-Reply-To: <048Bn62N56IG6Z-1e1POyo-gjogYkq9lDy4I5TQXOLk=.f5a67ec6-22f6-4541-b41b-7920851b1bd5@github.com> References: <6tYUlU6To3dIk5NZNcyu2PI8m72uLsw09qO_5ca4GBY=.97d63197-96e1-4f4f-b854-0d54c1628267@github.com> <048Bn62N56IG6Z-1e1POyo-gjogYkq9lDy4I5TQXOLk=.f5a67ec6-22f6-4541-b41b-7920851b1bd5@github.com> Message-ID: On Wed, 18 May 2022 13:06:45 GMT, Tobias Holenstein wrote: >> `_loop_opts_cnt` is set to `LoopOptsCount` which can have a maximum value of 43. `_loop_opts_cnt` is decremented in `PHASE_PHASEIDEALLOOP1`, `PHASE_PHASEIDEALLOOP2` and `PHASE_PHASEIDEALLOOP3` before it reaches `PHASE_PHASEIDEALLOOP_ITERATIONS` where it is decremented further in a loop until `_loop_opts_cnt` is 0. The assert assumes that `_loop_opts_cnt` has max. value 40 in `PHASE_PHASEIDEALLOOP_ITERATIONS`. But when `PartialPeelLoop` is turned off `PHASE_PHASEIDEALLOOP2` is skipped and `_loop_opts_cnt` can have max. value 41 in `PHASE_PHASEIDEALLOOP_ITERATIONS`. Therefore the assert is wrong. >> >> I propose to remove the assert entirely since the loop already has a condition `_loop_opts_cnt > 0` and `_loop_opts_cnt` is decremented in every iteration. > > Tobias Holenstein has updated the pull request incrementally with one additional commit since the last revision: > > reformat spaces in test Marked as reviewed by chagedorn (Reviewer). 
------------- PR: https://git.openjdk.java.net/jdk/pull/8767 From chagedorn at openjdk.java.net Tue May 24 14:30:43 2022 From: chagedorn at openjdk.java.net (Christian Hagedorn) Date: Tue, 24 May 2022 14:30:43 GMT Subject: RFR: JDK-8284944: assert(cnt++ < 40) failed: infinite cycle in loop optimization [v2] In-Reply-To: References: <6tYUlU6To3dIk5NZNcyu2PI8m72uLsw09qO_5ca4GBY=.97d63197-96e1-4f4f-b854-0d54c1628267@github.com> <048Bn62N56IG6Z-1e1POyo-gjogYkq9lDy4I5TQXOLk=.f5a67ec6-22f6-4541-b41b-7920851b1bd5@github.com> Message-ID: On Tue, 24 May 2022 14:11:13 GMT, Vladimir Kozlov wrote: > > * Remove the assert for now (given that it has never worked as it was supposed to work) and file an RFE to check if we can reintroduce it once the known cases/bugs are fixed and we have some confidence to not hit it again (could include tweaking the `LoopOptsCount` max value or introducing some other heuristics to avoid false positives). > > * Keep the assert as we have it today and just defer this bug and treat it like the RFE above. > > > > I would opt for option 1 given that the assert has no real value as of today. > > Okay, I agree. Thanks for your feedback, Vladimir. If @tobiasholenstein also agrees, we can file the follow-up RFE and integrate this change. ------------- PR: https://git.openjdk.java.net/jdk/pull/8767 From eosterlund at openjdk.java.net Tue May 24 15:11:49 2022 From: eosterlund at openjdk.java.net (Erik Österlund) Date: Tue, 24 May 2022 15:11:49 GMT Subject: RFR: 8284404: Too aggressive sweeping with Loom In-Reply-To: References: Message-ID: On Tue, 24 May 2022 14:23:43 GMT, Vladimir Kozlov wrote: > > @vnkozlov Is your concern that a user explicitly overrides the default to a value that ends up not being good? If so, I'm not sure why we would be in the business of preventing the user from shooting itself in the foot and guessing what the user really wanted here. Maybe I missed something.
> > > > Yes, it was my concern which was unjustifiable because I missed that this code is guarded by `FLAG_IS_DEFAULT(SweeperThreshold)`. So you simply set `SweeperThreshold` to 5% (default is 0.5) which is fine. Okay great - thanks for the review! ------------- PR: https://git.openjdk.java.net/jdk/pull/8673 From duke at openjdk.java.net Tue May 24 15:13:59 2022 From: duke at openjdk.java.net (Raffaello Giulietti) Date: Tue, 24 May 2022 15:13:59 GMT Subject: RFR: 8287139: aarch64 intrinsic for unsignedMultiplyHigh [v2] In-Reply-To: References: Message-ID: On Mon, 23 May 2022 14:56:19 GMT, Raffaello Giulietti wrote: >> Adds aarch64 intrinsic support for `Math.unsignedMultiplyHigh()` > > Raffaello Giulietti has updated the pull request incrementally with one additional commit since the last revision: > > 8287139: aarch64 intrinsic for unsignedMultiplyHigh Hi, any committer willing to sponsor the integration? I don't have committer status. ------------- PR: https://git.openjdk.java.net/jdk/pull/8840 From duke at openjdk.java.net Tue May 24 15:54:53 2022 From: duke at openjdk.java.net (Raffaello Giulietti) Date: Tue, 24 May 2022 15:54:53 GMT Subject: RFR: 8287139: aarch64 intrinsic for unsignedMultiplyHigh [v2] In-Reply-To: References: Message-ID: On Mon, 23 May 2022 15:32:18 GMT, Nick Gasson wrote: >> Raffaello Giulietti has updated the pull request incrementally with one additional commit since the last revision: >> >> 8287139: aarch64 intrinsic for unsignedMultiplyHigh > > Marked as reviewed by ngasson (Reviewer). Sorry @nick-arm to bother again, but the /sponsor command provoked nothing, even after waiting for 21 min. Could you please re-issue it again? 
Thanks ------------- PR: https://git.openjdk.java.net/jdk/pull/8840 From duke at openjdk.java.net Tue May 24 15:54:55 2022 From: duke at openjdk.java.net (Raffaello Giulietti) Date: Tue, 24 May 2022 15:54:55 GMT Subject: Integrated: 8287139: aarch64 intrinsic for unsignedMultiplyHigh In-Reply-To: References: Message-ID: <5ofC8b6UBfuyHtH8Jchy7zTfih79IpwVvmPkzxmDKtU=.ccb1be4a-9909-42f6-8458-3f2ab298ec2e@github.com> On Mon, 23 May 2022 11:59:02 GMT, Raffaello Giulietti wrote: > Adds aarch64 intrinsic support for `Math.unsignedMultiplyHigh()` This pull request has now been integrated. Changeset: fdc147e3 Author: Raffaello Giulietti Committer: Nick Gasson URL: https://git.openjdk.java.net/jdk/commit/fdc147e3540801822f5b15c9c5a76cacc92c4fd2 Stats: 17 lines in 1 file changed: 16 ins; 0 del; 1 mod 8287139: aarch64 intrinsic for unsignedMultiplyHigh Reviewed-by: aph, ngasson ------------- PR: https://git.openjdk.java.net/jdk/pull/8840 From kvn at openjdk.java.net Tue May 24 16:06:13 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Tue, 24 May 2022 16:06:13 GMT Subject: RFR: 8283091: Support type conversion between different data sizes in SLP [v3] In-Reply-To: References: <4qFrtxwRHZfnzJbdVl2z7l6XENvxR3N8Sruc2fraKkw=.1391f639-1514-4e50-899a-d56cdd4afa99@github.com> <3_-2N1Kf4WIryx7eFIrXomabZJTeVNvSJ10joWdzN4s=.a16c8b8e-0834-48f8-9eac-6aaf07822ad5@github.com> Message-ID: On Fri, 8 Apr 2022 07:15:03 GMT, Fei Gao wrote: >>> costing more than scalar instructions, as we know that there are only two elements for VectorCastD2I on 128-bit NEON machine. >> >> So shall we disable `vcvt2Dto2I` for NEON? > >> So shall we disable `vcvt2Dto2I` for NEON? > > I'm afraid we can't. We still need to support it in VectorAPI. @fg1417 Thank you for suggesting this optimization. I see that it was not updated for some time. Do you still intend to work on it? Please update to latest JDK (Loom was integrated) and run performance again. Also include % of changes. 
I have the same concern as @DamonFool about regressions when vectorizing some conversions. Maybe we should have an additional `Matcher` property we could consult when trying to **auto-vectorize**. I understand that we need `vcvt2Dto2I` when the Vector API specifically asks to generate it, but we should not enforce auto-generation.

-------------

PR: https://git.openjdk.java.net/jdk/pull/7806

From kvn at openjdk.java.net Tue May 24 16:06:18 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Tue, 24 May 2022 16:06:18 GMT Subject: RFR: 8283091: Support type conversion between different data sizes in SLP [v5] In-Reply-To: <1B3doPHzaGFUCT_qkYIrlYzBgvs_nbEzjKcAlPSZTeM=.15f00dcc-9f2d-45fa-834b-2c2a129b149e@github.com> References: <4qFrtxwRHZfnzJbdVl2z7l6XENvxR3N8Sruc2fraKkw=.1391f639-1514-4e50-899a-d56cdd4afa99@github.com> <1B3doPHzaGFUCT_qkYIrlYzBgvs_nbEzjKcAlPSZTeM=.15f00dcc-9f2d-45fa-834b-2c2a129b149e@github.com> Message-ID:

On Thu, 12 May 2022 07:11:20 GMT, Fei Gao wrote:

>> After JDK-8275317, C2's SLP vectorizer has supported type conversion between the same data size. We can also support conversions between different data sizes like:
>> int <-> double
>> float <-> long
>> int <-> long
>> float <-> double
>>
>> A typical test case:
>>
>> int[] a;
>> double[] b;
>> for (int i = start; i < limit; i++) {
>>   b[i] = (double) a[i];
>> }
>>
>> Our expected OptoAssembly code for one iteration is like below:
>>
>> add R12, R2, R11, LShiftL #2
>> vector_load V16,[R12, #16]
>> vectorcast_i2d V16, V16 # convert I to D vector
>> add R11, R1, R11, LShiftL #3 # ptr
>> add R13, R11, #16 # ptr
>> vector_store [R13], V16
>>
>> To enable the vectorization, the patch solves the following problems in the SLP.
>>
>> There are three main operations in the case above, LoadI, ConvI2D and StoreD. Assuming that the vector length is 128 bits, how many scalar nodes should be packed together into a vector?
If we decide it separately for each operation node, like what we did before the patch in SuperWord::combine_packs(), a 128-bit vector will support 4 LoadI or 2 ConvI2D or 2 StoreD nodes. However, if we put these packed nodes in a vector node sequence, like loading 4 elements to a vector, then typecasting 2 elements and lastly storing these 2 elements, they become invalid. As a result, we should look through the whole def-use chain >> and then pick up the minimum of these element sizes, like function SuperWord::max_vector_size_in_ud_chain() do in the superword.cpp. In this case, we pack 2 LoadI, 2 ConvI2D and 2 StoreD nodes, and then generate valid vector node sequence, like loading 2 elements, converting the 2 elements to another type and storing the 2 elements with new type. >> >> After this, LoadI nodes don't make full use of the whole vector and only occupy part of it. So we adapt the code in SuperWord::get_vw_bytes_special() to the situation. >> >> In SLP, we calculate a kind of alignment as position trace for each scalar node in the whole vector. In this case, the alignments for 2 LoadI nodes are 0, 4 while the alignment for 2 ConvI2D nodes are 0, 8. Sometimes, 4 for LoadI and 8 for ConvI2D work the same, both of which mark that this node is the second node in the whole vector, while the difference between 4 and 8 are just because of their own data sizes. In this situation, we should try to remove the impact caused by different data size in SLP. For example, in the stage of SuperWord::extend_packlist(), while determining if it's potential to pack a pair of def nodes in the function SuperWord::follow_use_defs(), we remove the side effect of different data size by transforming the target alignment from the use node. Because we believe that, assuming that the vector length is 512 bits, if the ConvI2D use nodes have alignments of 16-24 and their def nodes, LoadI, have alignments of 8-12, these two LoadI nodes should be packed a s a pair as well. 
>> >> Similarly, when determining if the vectorization is profitable, type conversion between different data size takes a type of one size and produces a type of another size, hence the special checks on alignment and size should be applied, like what we do in SuperWord::is_vector_use(). >> >> After solving these problems, we successfully implemented the vectorization of type conversion between different data sizes. >> >> Here is the test data (-XX:+UseSuperWord) on NEON: >> >> Before the patch: >> Benchmark (length) Mode Cnt Score Error Units >> convertD2F 523 avgt 15 216.431 ? 0.131 ns/op >> convertD2I 523 avgt 15 220.522 ? 0.311 ns/op >> convertF2D 523 avgt 15 217.034 ? 0.292 ns/op >> convertF2L 523 avgt 15 231.634 ? 1.881 ns/op >> convertI2D 523 avgt 15 229.538 ? 0.095 ns/op >> convertI2L 523 avgt 15 214.822 ? 0.131 ns/op >> convertL2F 523 avgt 15 230.188 ? 0.217 ns/op >> convertL2I 523 avgt 15 162.234 ? 0.235 ns/op >> >> After the patch: >> Benchmark (length) Mode Cnt Score Error Units >> convertD2F 523 avgt 15 124.352 ? 1.079 ns/op >> convertD2I 523 avgt 15 557.388 ? 8.166 ns/op >> convertF2D 523 avgt 15 118.082 ? 4.026 ns/op >> convertF2L 523 avgt 15 225.810 ? 11.180 ns/op >> convertI2D 523 avgt 15 166.247 ? 0.120 ns/op >> convertI2L 523 avgt 15 119.699 ? 2.925 ns/op >> convertL2F 523 avgt 15 220.847 ? 0.053 ns/op >> convertL2I 523 avgt 15 122.339 ? 2.738 ns/op >> >> perf data on X86: >> Before the patch: >> Benchmark (length) Mode Cnt Score Error Units >> convertD2F 523 avgt 15 279.466 ? 0.069 ns/op >> convertD2I 523 avgt 15 551.009 ? 7.459 ns/op >> convertF2D 523 avgt 15 276.066 ? 0.117 ns/op >> convertF2L 523 avgt 15 545.108 ? 5.697 ns/op >> convertI2D 523 avgt 15 745.303 ? 0.185 ns/op >> convertI2L 523 avgt 15 260.878 ? 0.044 ns/op >> convertL2F 523 avgt 15 502.016 ? 0.172 ns/op >> convertL2I 523 avgt 15 261.654 ? 3.326 ns/op >> >> After the patch: >> Benchmark (length) Mode Cnt Score Error Units >> convertD2F 523 avgt 15 106.975 ? 
0.045 ns/op >> convertD2I 523 avgt 15 546.866 ? 9.287 ns/op >> convertF2D 523 avgt 15 82.414 ? 0.340 ns/op >> convertF2L 523 avgt 15 542.235 ? 2.785 ns/op >> convertI2D 523 avgt 15 92.966 ? 1.400 ns/op >> convertI2L 523 avgt 15 79.960 ? 0.528 ns/op >> convertL2F 523 avgt 15 504.712 ? 4.794 ns/op >> convertL2I 523 avgt 15 129.753 ? 0.094 ns/op >> >> perf data on AVX512: >> Before the patch: >> Benchmark (length) Mode Cnt Score Error Units >> convertD2F 523 avgt 15 282.984 ? 4.022 ns/op >> convertD2I 523 avgt 15 543.080 ? 3.873 ns/op >> convertF2D 523 avgt 15 273.950 ? 0.131 ns/op >> convertF2L 523 avgt 15 539.568 ? 2.747 ns/op >> convertI2D 523 avgt 15 745.238 ? 0.069 ns/op >> convertI2L 523 avgt 15 260.935 ? 0.169 ns/op >> convertL2F 523 avgt 15 501.870 ? 0.359 ns/op >> convertL2I 523 avgt 15 257.508 ? 0.174 ns/op >> >> After the patch: >> Benchmark (length) Mode Cnt Score Error Units >> convertD2F 523 avgt 15 76.687 ? 0.530 ns/op >> convertD2I 523 avgt 15 545.408 ? 4.657 ns/op >> convertF2D 523 avgt 15 273.935 ? 0.099 ns/op >> convertF2L 523 avgt 15 540.534 ? 3.032 ns/op >> convertI2D 523 avgt 15 745.234 ? 0.053 ns/op >> convertI2L 523 avgt 15 260.865 ? 0.104 ns/op >> convertL2F 523 avgt 15 63.834 ? 4.777 ns/op >> convertL2I 523 avgt 15 48.183 ? 0.990 ns/op > > Fei Gao has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains five commits: > > - Merge branch 'master' into fg8283091 > > Change-Id: I8deeae48449f1fc159c9bb5f82773e1bc6b5105f > - Merge branch 'master' into fg8283091 > > Change-Id: I1dfb4a6092302267e3796e08d411d0241b23df83 > - Add micro-benchmark cases > > Change-Id: I3c741255804ce410c8b6dcbdec974fa2c9051fd8 > - Merge branch 'master' into fg8283091 > > Change-Id: I674581135fd0844accc65520574fcef161eededa > - 8283091: Support type conversion between different data sizes in SLP > > After JDK-8275317, C2's SLP vectorizer has supported type conversion > between the same data size. 
We can also support conversions between > different data sizes like: > int <-> double > float <-> long > int <-> long > float <-> double > > A typical test case: > > int[] a; > double[] b; > for (int i = start; i < limit; i++) { > b[i] = (double) a[i]; > } > > Our expected OptoAssembly code for one iteration is like below: > > add R12, R2, R11, LShiftL #2 > vector_load V16,[R12, #16] > vectorcast_i2d V16, V16 # convert I to D vector > add R11, R1, R11, LShiftL #3 # ptr > add R13, R11, #16 # ptr > vector_store [R13], V16 > > To enable the vectorization, the patch solves the following problems > in the SLP. > > There are three main operations in the case above, LoadI, ConvI2D and > StoreD. Assuming that the vector length is 128 bits, how many scalar > nodes should be packed together to a vector? If we decide it > separately for each operation node, like what we did before the patch > in SuperWord::combine_packs(), a 128-bit vector will support 4 LoadI > or 2 ConvI2D or 2 StoreD nodes. However, if we put these packed nodes > in a vector node sequence, like loading 4 elements to a vector, then > typecasting 2 elements and lastly storing these 2 elements, they become > invalid. As a result, we should look through the whole def-use chain > and then pick up the minimum of these element sizes, like function > SuperWord::max_vector_size_in_ud_chain() do in the superword.cpp. > In this case, we pack 2 LoadI, 2 ConvI2D and 2 StoreD nodes, and then > generate valid vector node sequence, like loading 2 elements, > converting the 2 elements to another type and storing the 2 elements > with new type. > > After this, LoadI nodes don't make full use of the whole vector and > only occupy part of it. So we adapt the code in > SuperWord::get_vw_bytes_special() to the situation. > > In SLP, we calculate a kind of alignment as position trace for each > scalar node in the whole vector. 
In this case, the alignments for 2 > LoadI nodes are 0, 4 while the alignment for 2 ConvI2D nodes are 0, 8. > Sometimes, 4 for LoadI and 8 for ConvI2D work the same, both of which > mark that this node is the second node in the whole vector, while the > difference between 4 and 8 are just because of their own data sizes. In > this situation, we should try to remove the impact caused by different > data size in SLP. For example, in the stage of > SuperWord::extend_packlist(), while determining if it's potential to > pack a pair of def nodes in the function SuperWord::follow_use_defs(), > we remove the side effect of different data size by transforming the > target alignment from the use node. Because we believe that, assuming > that the vector length is 512 bits, if the ConvI2D use nodes have > alignments of 16-24 and their def nodes, LoadI, have alignments of 8-12, > these two LoadI nodes should be packed as a pair as well. > > Similarly, when determining if the vectorization is profitable, type > conversion between different data size takes a type of one size and > produces a type of another size, hence the special checks on alignment > and size should be applied, like what we do in SuperWord::is_vector_use. > > After solving these problems, we successfully implemented the > vectorization of type conversion between different data sizes. > > Here is the test data on NEON: > > Before the patch: > Benchmark (length) Mode Cnt Score Error Units > VectorLoop.convertD2F 523 avgt 15 216.431 ? 0.131 ns/op > VectorLoop.convertD2I 523 avgt 15 220.522 ? 0.311 ns/op > VectorLoop.convertF2D 523 avgt 15 217.034 ? 0.292 ns/op > VectorLoop.convertF2L 523 avgt 15 231.634 ? 1.881 ns/op > VectorLoop.convertI2D 523 avgt 15 229.538 ? 0.095 ns/op > VectorLoop.convertI2L 523 avgt 15 214.822 ? 0.131 ns/op > VectorLoop.convertL2F 523 avgt 15 230.188 ? 0.217 ns/op > VectorLoop.convertL2I 523 avgt 15 162.234 ? 
0.235 ns/op > > After the patch: > Benchmark (length) Mode Cnt Score Error Units > VectorLoop.convertD2F 523 avgt 15 124.352 ? 1.079 ns/op > VectorLoop.convertD2I 523 avgt 15 557.388 ? 8.166 ns/op > VectorLoop.convertF2D 523 avgt 15 118.082 ? 4.026 ns/op > VectorLoop.convertF2L 523 avgt 15 225.810 ? 11.180 ns/op > VectorLoop.convertI2D 523 avgt 15 166.247 ? 0.120 ns/op > VectorLoop.convertI2L 523 avgt 15 119.699 ? 2.925 ns/op > VectorLoop.convertL2F 523 avgt 15 220.847 ? 0.053 ns/op > VectorLoop.convertL2I 523 avgt 15 122.339 ? 2.738 ns/op > > perf data on X86: > Before the patch: > Benchmark (length) Mode Cnt Score Error Units > VectorLoop.convertD2F 523 avgt 15 279.466 ? 0.069 ns/op > VectorLoop.convertD2I 523 avgt 15 551.009 ? 7.459 ns/op > VectorLoop.convertF2D 523 avgt 15 276.066 ? 0.117 ns/op > VectorLoop.convertF2L 523 avgt 15 545.108 ? 5.697 ns/op > VectorLoop.convertI2D 523 avgt 15 745.303 ? 0.185 ns/op > VectorLoop.convertI2L 523 avgt 15 260.878 ? 0.044 ns/op > VectorLoop.convertL2F 523 avgt 15 502.016 ? 0.172 ns/op > VectorLoop.convertL2I 523 avgt 15 261.654 ? 3.326 ns/op > > After the patch: > Benchmark (length) Mode Cnt Score Error Units > VectorLoop.convertD2F 523 avgt 15 106.975 ? 0.045 ns/op > VectorLoop.convertD2I 523 avgt 15 546.866 ? 9.287 ns/op > VectorLoop.convertF2D 523 avgt 15 82.414 ? 0.340 ns/op > VectorLoop.convertF2L 523 avgt 15 542.235 ? 2.785 ns/op > VectorLoop.convertI2D 523 avgt 15 92.966 ? 1.400 ns/op > VectorLoop.convertI2L 523 avgt 15 79.960 ? 0.528 ns/op > VectorLoop.convertL2F 523 avgt 15 504.712 ? 4.794 ns/op > VectorLoop.convertL2I 523 avgt 15 129.753 ? 0.094 ns/op > > perf data on AVX512: > Before the patch: > Benchmark (length) Mode Cnt Score Error Units > VectorLoop.convertD2F 523 avgt 15 282.984 ? 4.022 ns/op > VectorLoop.convertD2I 523 avgt 15 543.080 ? 3.873 ns/op > VectorLoop.convertF2D 523 avgt 15 273.950 ? 0.131 ns/op > VectorLoop.convertF2L 523 avgt 15 539.568 ? 
2.747 ns/op > VectorLoop.convertI2D 523 avgt 15 745.238 ? 0.069 ns/op > VectorLoop.convertI2L 523 avgt 15 260.935 ? 0.169 ns/op > VectorLoop.convertL2F 523 avgt 15 501.870 ? 0.359 ns/op > VectorLoop.convertL2I 523 avgt 15 257.508 ? 0.174 ns/op > > After the patch: > Benchmark (length) Mode Cnt Score Error Units > VectorLoop.convertD2F 523 avgt 15 76.687 ? 0.530 ns/op > VectorLoop.convertD2I 523 avgt 15 545.408 ? 4.657 ns/op > VectorLoop.convertF2D 523 avgt 15 273.935 ? 0.099 ns/op > VectorLoop.convertF2L 523 avgt 15 540.534 ? 3.032 ns/op > VectorLoop.convertI2D 523 avgt 15 745.234 ? 0.053 ns/op > VectorLoop.convertI2L 523 avgt 15 260.865 ? 0.104 ns/op > VectorLoop.convertL2F 523 avgt 15 63.834 ? 4.777 ns/op > VectorLoop.convertL2I 523 avgt 15 48.183 ? 0.990 ns/op > > Change-Id: I93e60fd956547dad9204ceec90220145c58a72ef And separate thank you for updating tests to verify correctness of this optimization! ------------- PR: https://git.openjdk.java.net/jdk/pull/7806 From ngasson at openjdk.java.net Tue May 24 16:07:59 2022 From: ngasson at openjdk.java.net (Nick Gasson) Date: Tue, 24 May 2022 16:07:59 GMT Subject: RFR: 8287139: aarch64 intrinsic for unsignedMultiplyHigh [v2] In-Reply-To: References: Message-ID: On Mon, 23 May 2022 15:32:18 GMT, Nick Gasson wrote: >> Raffaello Giulietti has updated the pull request incrementally with one additional commit since the last revision: >> >> 8287139: aarch64 intrinsic for unsignedMultiplyHigh > > Marked as reviewed by ngasson (Reviewer). > Sorry @nick-arm to bother again, but the /sponsor command provoked nothing, even after waiting for 21 min. Could you please re-issue it again? 
Thanks Looks like the bot was just running slow :-) ------------- PR: https://git.openjdk.java.net/jdk/pull/8840 From shade at openjdk.java.net Tue May 24 16:33:59 2022 From: shade at openjdk.java.net (Aleksey Shipilev) Date: Tue, 24 May 2022 16:33:59 GMT Subject: RFR: 8287169: compiler/arguments/TestCompileThresholdScaling.java fails on x86_32 after JDK-8287052 [v2] In-Reply-To: References: <7Z-Js1_CRnu_t7PahDNez-IVrtDAjsRVcJaHJnGkBR4=.f9fa2299-5177-468e-9447-7c4aa497d6a8@github.com> Message-ID: <_BGnJA2MSjSQ-pDoDZhjkvO4QzCm1RYxQhDIHweEQQY=.09a6eb83-a13e-441f-8890-74fbbbdf5394@github.com> On Mon, 23 May 2022 19:12:27 GMT, Aleksey Shipilev wrote: >> See the bug report, recent regression. I believe the code makes the unwarranted assumption that we run on 64-bit platform, and thus caps at > 2^63 only. It should also cap at > 2^31 for 32-bit platforms. >> >> Attn @dean-long. >> >> Testing: >> - [x] Affected test on Linux x86_64 fastdebug (still passes) >> - [x] Affected test on Linux x86_32 fastdebug (now passes) > > Aleksey Shipilev has updated the pull request incrementally with one additional commit since the last revision: > > Review comments Thanks for reviews. No problem on regressing non-`tier1` stuff every once in a while! ------------- PR: https://git.openjdk.java.net/jdk/pull/8851 From shade at openjdk.java.net Tue May 24 16:34:01 2022 From: shade at openjdk.java.net (Aleksey Shipilev) Date: Tue, 24 May 2022 16:34:01 GMT Subject: Integrated: 8287169: compiler/arguments/TestCompileThresholdScaling.java fails on x86_32 after JDK-8287052 In-Reply-To: <7Z-Js1_CRnu_t7PahDNez-IVrtDAjsRVcJaHJnGkBR4=.f9fa2299-5177-468e-9447-7c4aa497d6a8@github.com> References: <7Z-Js1_CRnu_t7PahDNez-IVrtDAjsRVcJaHJnGkBR4=.f9fa2299-5177-468e-9447-7c4aa497d6a8@github.com> Message-ID: <7gfxLPJKQ63Kv81mCpOVwyelMgH1k7F_l4ZnLaYujxk=.e58593ce-8c9d-4178-9d70-cd2bba4e840b@github.com> On Mon, 23 May 2022 17:03:38 GMT, Aleksey Shipilev wrote: > See the bug report, recent regression. 
I believe the code makes the unwarranted assumption that we run on 64-bit platform, and thus caps at > 2^63 only. It should also cap at > 2^31 for 32-bit platforms. > > Attn @dean-long. > > Testing: > - [x] Affected test on Linux x86_64 fastdebug (still passes) > - [x] Affected test on Linux x86_32 fastdebug (now passes) This pull request has now been integrated. Changeset: fdece9ac Author: Aleksey Shipilev URL: https://git.openjdk.java.net/jdk/commit/fdece9ac71e865371ef7e348c54bca21235efdb3 Stats: 5 lines in 1 file changed: 3 ins; 0 del; 2 mod 8287169: compiler/arguments/TestCompileThresholdScaling.java fails on x86_32 after JDK-8287052 Reviewed-by: kvn, dlong ------------- PR: https://git.openjdk.java.net/jdk/pull/8851 From duke at openjdk.java.net Tue May 24 16:42:13 2022 From: duke at openjdk.java.net (Brian J. Stafford) Date: Tue, 24 May 2022 16:42:13 GMT Subject: RFR: 8263075: C2: simplify anti-dependence check in PhaseCFG::implicit_null_check() [v5] In-Reply-To: References: Message-ID: > The reporter for this issue (https://bugs.openjdk.java.net/browse/JDK-8263075) indicated that there's an assumption that we can rely on that the while loop in question will run exactly one time. Based on this, I've done the following: > > - Asserted the condition that makes sure the code runs at least once > - Asserted the condition that makes sure the code runs only once > - Removed the `while` loop > - Changed a couple of `break` statements into `continue` statements. They no longer need to break out of the `while` loop, now that it's gone. However, they were early exits from the `while` loop that ended up resulting in `continue` statements for the larger enclosing loop. Thus we can just call `continue` directly. > - Removed the local variable `b`, as we no longer need to traverse the node hierarchy. We can use `mb` directly. > > Passes jdk, langtools, and hotspot Tier 1 tests on Linux (x64 and ARM64) and macOS (x64 and ARM64). 
Most Tier 1 tests pass on Windows (x64 and ARM64), but there are a handful of failures unrelated to this change. Brian J. Stafford has updated the pull request incrementally with one additional commit since the last revision: Added some whitespace ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/8684/files - new: https://git.openjdk.java.net/jdk/pull/8684/files/df661be7..7abf914a Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=8684&range=04 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=8684&range=03-04 Stats: 3 lines in 1 file changed: 0 ins; 0 del; 3 mod Patch: https://git.openjdk.java.net/jdk/pull/8684.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8684/head:pull/8684 PR: https://git.openjdk.java.net/jdk/pull/8684 From jrose at openjdk.java.net Tue May 24 17:22:59 2022 From: jrose at openjdk.java.net (John R Rose) Date: Tue, 24 May 2022 17:22:59 GMT Subject: RFR: 8283894: Intrinsify compress and expand bits on x86 [v6] In-Reply-To: References: Message-ID: On Mon, 16 May 2022 15:20:08 GMT, Jatin Bhateja wrote: >> src/hotspot/share/opto/intrinsicnode.cpp line 213: >> >>> 211: } >>> 212: >>> 213: Node* ExpandBitsNode::Identity(PhaseGVN* phase) { >> >> I also suggest adding a boolean if `compress_expand_identity` if you add rules which don't apply to both equally. 
>>
>> Here is possible type-propagation logic for compress and expand:
>>
>>
>> let SIGN_BIT = (((IntOrLong)-1)>>>1)+1 (bit 31 or 63)
>> let MAX_POS = (((IntOrLong)-1)>>>1)
>> let BITS = 1+bitCount(MAX_POS) (32 or 64)
>> if (both x, m are con) {
>>   // maybe use these rules, by porting the Java code to C++
>>   compress(CON[x], CON[m]) ] = CON[portable_compress(x,m)]
>>   expand(CON[x], CON[m]) ] = CON[portable_expand(x,m)]
>>   // see also https://stackoverflow.com/questions/38938911/portable-efficient-alternative-to-pdep-without-using-bmi2
>> } else if (m is CON[m] && m != -1) {
>>   //compress(x, -1) = x //identity handled elsewhere
>>   //expand(x, -1) = x //identity handled elsewhere
>>   let bitc = bitCount(m)
>>   LO[ compress(x, CON[m]) ] = 0 //sign bit is never set
>>   HI[ compress(x, CON[m]) ] = ((1L<<bitc)-1)
>>   LO[ expand(x, CON[m]) ] = (m >= 0) ? 0 : SIGN_BIT //sign bit might be set alone
>>   HI[ expand(x, CON[m]) ] = (m >= 0) ? m : m ^ SIGN_BIT
>>   // could improve a little by looking TYPE[x], but do not bother
>> } else {
>>   // estimate maximum possible weight of m (in 0..63)
>>   let maxbitc = BITS if (LO[m] < 0 && HI[m] >= -1) // could be -1
>>   else maxbitc = BITS-1 if (LO[m] < 0 || HI[m] == MAX_POS) // <0 or maxint
>>   else maxbitc = BITS-1 - numberOfLeadingZeros(HI[m])
>>   LO[ compress(x, m) ] = (maxbitc == 64 && LO[x] < 0) ? SIGN_BIT : 0
>>   HI[ compress(x, m) ] = (maxbitc >= 63) ? HI[x] : MIN(HI[x], (1L<<maxbitc)-1)
>>   LO[ expand(x, m) ] = (LO[m] >= 0) ? 0 : SIGN_BIT
>>   HI[ expand(x, m) ] = (LO[m] >= 0) ? HI[m] : MAX_POS
>> }
>>
>>
>> The operands of compress and expand are inherently unsigned bitmasks, so the signed type system of C2 gets in the way. In the future, a somewhat more thorough job could be done if we had bitwise types as well in C2. For what that would mean, see https://bugs.openjdk.java.net/browse/JDK-8001436
>
> I have handled these transformation separately in ideal/identity and value routines.

OK. I see you did just the constant-folding part of `Value` which is reasonable.
Please file a followup bug to capture the more elaborate type inferencing proposal, for later use if warranted.

-------------

PR: https://git.openjdk.java.net/jdk/pull/8498

From duke at openjdk.java.net Tue May 24 20:31:53 2022 From: duke at openjdk.java.net (Srinivas Vamsi Parasa) Date: Tue, 24 May 2022 20:31:53 GMT Subject: RFR: 8285868: x86_64 intrinsics for floating point methods isNaN, isFinite and isInfinite [v9] In-Reply-To: References: Message-ID:

> We develop optimized x86_64 intrinsics for the floating point class check methods `isNaN()`, `isFinite()` and `isInfinite()` for Float and Double classes. JMH benchmarks show ~6x improvement for `isNaN()`, ~2x improvement for `isInfinite()` and 40% gain for `isFinite()` using `vfpclasss(s/d)` instructions.
>
>
> JMH Benchmark (ns/op)            Baseline  This PR (WITH vfpclassss/sd)     Speedup
>
> FloatClassCheck.testIsFinite     0.559     0.4                              1.4x
> FloatClassCheck.testIsInfinite   0.828     0.386                            2.15x
> FloatClassCheck.testIsNaN        2.589     0.387                            6.7x
> DoubleClassCheck.testIsFinite    0.568     0.414                            1.37x
> DoubleClassCheck.testIsInfinite  0.836     0.395                            2.11x
> DoubleClassCheck.testIsNaN       2.592     0.393                            6.6x
>
> JMH Benchmark (ns/op)            Baseline  This PR (WITHOUT vfpclassss/sd)  Speedup
> FloatClassCheck.testIsFinite     0.561     0.468                            1.2x
> FloatClassCheck.testIsInfinite   0.793     0.491                            1.61x
> FloatClassCheck.testIsNaN        2.587     0.469                            5.5x
> DoubleClassCheck.testIsFinite    0.561     0.592                            0.94x
> DoubleClassCheck.testIsInfinite  0.828     0.592                            1.4x
> DoubleClassCheck.testIsNaN       2.593     0.594                            4.4x

Srinivas Vamsi Parasa has updated the pull request with a new target base due to a merge or a rebase.
The pull request now contains 11 commits: - Remove support for non vfpclasss/d based intrinsics - Merge branch 'master' of https://git.openjdk.java.net/jdk into float - add comment for vfpclasss/d for isFinite() - Merge branch 'master' of https://git.openjdk.java.net/jdk into float - zero out the upper bits not written by setb - use 0x1 to be simpler - remove the redundant temp register - Split the macros using predicate - update jmh tests - Merge branch 'master' into float - ... and 1 more: https://git.openjdk.java.net/jdk/compare/c1db70d8...70bba0fe ------------- Changes: https://git.openjdk.java.net/jdk/pull/8459/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=8459&range=08 Stats: 663 lines in 19 files changed: 661 ins; 0 del; 2 mod Patch: https://git.openjdk.java.net/jdk/pull/8459.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8459/head:pull/8459 PR: https://git.openjdk.java.net/jdk/pull/8459 From duke at openjdk.java.net Tue May 24 20:56:57 2022 From: duke at openjdk.java.net (Srinivas Vamsi Parasa) Date: Tue, 24 May 2022 20:56:57 GMT Subject: RFR: 8285868: x86_64 intrinsics for floating point methods isNaN, isFinite and isInfinite [v9] In-Reply-To: References: Message-ID: On Sat, 21 May 2022 15:42:34 GMT, Vladimir Kozlov wrote: >> Hi Vladimir (@vnkozlov) >> >> For 32bit, in the case of double, we see performance improvement using `vfpclasssd` instruction but **without** `vfpclassd`, we see **40% decrease** in performance for `isFinite()` compared to the original Java code. Below, is the code which implements the intrinsic using SSE. >> >> Is it Ok to skip support for **non** `vfpclassd` for 32bit? 
>>
>>
>> void C2_MacroAssembler::double_class_check_sse(int opcode, XMMRegister src, Register dst, Register temp, Register temp1) {
>>   int32_t POS_INF_HI = 0x7ff00000; // hi 32bits
>>   int32_t KILL_SIGN_MASK_HI = 0x7fffffff; // hi 32 bits
>>
>>   pshuflw(src, src, 0x4e); //switch hi to lo
>>   movdl(temp, src);
>>   movl(temp1, KILL_SIGN_MASK_HI);
>>   andl(temp, temp1);
>>   movl(temp1, POS_INF_HI);
>>   cmpl(temp, temp1);
>>   switch (opcode) {
>>     case Op_IsFiniteD:
>>       setb(Assembler::below, dst);
>>       break;
>>     case Op_IsInfiniteD:
>>       setb(Assembler::equal, dst);
>>       break;
>>     case Op_IsNaND:
>>       setb(Assembler::above, dst);
>>       break;
>>     default:
>>       assert(false, "%s", NodeClassNames[opcode]);
>>   }
>>   andl(dst, 0xff);
>> }
>
>> For 32bit, in the case of double, we see performance improvement using `vfpclasssd` instruction but **without** `vfpclassd`, we see **40% decrease** in performance for `isFinite()` compared to the original Java code. Below, is the code which implements the intrinsic using SSE.
>>
>> Is it Ok to skip support for **non** `vfpclassd` for 32bit?
>
> Yes, but add comment about that. Also for 32-bit you need to check SSE2 support which is required by `pshuflw`.

Hi Vladimir (@vnkozlov),

Could you please review this updated PR? In this updated patch, we **removed** the intrinsics using **non**-`vfpclasss/d` instructions.

- Got the new performance data for the `vfpclasss/d` intrinsics after rebasing onto the latest changes (which include #8525 submitted by @merykitty).
- Using the `vfpclasss/d` instruction gives up to `70%` speedup over the existing baseline.
- This works for both 64 bit and 32 bit.
Please see the updated data below (the RFE main text has been updated as well):

Benchmark (ns/op)                Baseline  Intrinsic(vfpclasss/d)  Speedup(%)
FloatClassCheck.testIsFinite     0.562     0.406                   28%
FloatClassCheck.testIsInfinite   0.815     0.383                   53%
FloatClassCheck.testIsNaN        0.63      0.382                   39%
DoubleClassCheck.testIsFinite    0.565     0.409                   28%
DoubleClassCheck.testIsInfinite  0.812     0.375                   54%
DoubleClassCheck.testIsNaN       0.631     0.38                    40%
FPComparison.isFiniteDouble      332.638   272.577                 18%
FPComparison.isFiniteFloat       413.217   331.825                 20%
FPComparison.isInfiniteDouble    874.897   240.632                 72%
FPComparison.isInfiniteFloat     872.279   321.269                 63%
FPComparison.isNanDouble         286.566   240.36                  16%
FPComparison.isNanFloat          346.123   316.923                 8%

Thanks,
Vamsi

-------------

PR: https://git.openjdk.java.net/jdk/pull/8459

From kvn at openjdk.java.net Tue May 24 22:04:05 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Tue, 24 May 2022 22:04:05 GMT Subject: RFR: 8285868: x86 intrinsics for floating point methods isNaN, isFinite and isInfinite [v9] In-Reply-To: References: Message-ID: <3nPsl4E5pwB2poYUil8N39ZnfxLlKP4KE8JYumTb0Mc=.8de322fc-0074-416f-a074-6b1d2c712d9b@github.com>

On Tue, 24 May 2022 20:31:53 GMT, Srinivas Vamsi Parasa wrote:

>> We develop optimized x86 intrinsics for the floating point class check methods `isNaN()`, `isFinite()` and `isInfinite()` for Float and Double classes. JMH benchmarks show up to `~70%` improvement using `vfpclasss(s/d)` instructions.
>>
>>
>> Benchmark (ns/op)                Baseline  Intrinsic(vfpclasss/d)  Speedup(%)
>> FloatClassCheck.testIsFinite     0.562     0.406                   28%
>> FloatClassCheck.testIsInfinite   0.815     0.383                   53%
>> FloatClassCheck.testIsNaN        0.63      0.382                   39%
>> DoubleClassCheck.testIsFinite    0.565     0.409                   28%
>> DoubleClassCheck.testIsInfinite  0.812     0.375                   54%
>> DoubleClassCheck.testIsNaN       0.631     0.38                    40%
>> FPComparison.isFiniteDouble      332.638   272.577                 18%
>> FPComparison.isFiniteFloat       413.217   331.825                 20%
>> FPComparison.isInfiniteDouble    874.897   240.632                 72%
>> FPComparison.isInfiniteFloat     872.279   321.269                 63%
>> FPComparison.isNanDouble         286.566   240.36                  16%
>> FPComparison.isNanFloat          346.123   316.923                 8%
>
> Srinivas Vamsi Parasa has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 11 commits:
>
> - Remove support for non vfpclasss/d based intrinsics
> - Merge branch 'master' of https://git.openjdk.java.net/jdk into float
> - add comment for vfpclasss/d for isFinite()
> - Merge branch 'master' of https://git.openjdk.java.net/jdk into float
> - zero out the upper bits not written by setb
> - use 0x1 to be simpler
> - remove the redundant temp register
> - Split the macros using predicate
> - update jmh tests
> - Merge branch 'master' into float
> - ... and 1 more: https://git.openjdk.java.net/jdk/compare/c1db70d8...70bba0fe

Looks good.

Please, adopt @merykitty suggested changes for micro-benchmarks to include `store, cmove, branch` cases. Show performance results for all of them.

Also adapt your regression tests for all 3 cases too.

I assume `Baseline` data includes #8525 changes. Right?

src/hotspot/share/runtime/vmStructs.cpp line 1847:

> 1845: declare_c2_type(SignumDNode, Node) \
> 1846: declare_c2_type(SignumFNode, Node) \
> 1847: declare_c2_type(IsInfiniteFNode, Node) \

Where are other new nodes?
-------------

PR: https://git.openjdk.java.net/jdk/pull/8459

From duke at openjdk.java.net Tue May 24 22:09:16 2022 From: duke at openjdk.java.net (Srinivas Vamsi Parasa) Date: Tue, 24 May 2022 22:09:16 GMT Subject: RFR: 8285868: x86 intrinsics for floating point methods isNaN, isFinite and isInfinite [v9] In-Reply-To: <3nPsl4E5pwB2poYUil8N39ZnfxLlKP4KE8JYumTb0Mc=.8de322fc-0074-416f-a074-6b1d2c712d9b@github.com> References: <3nPsl4E5pwB2poYUil8N39ZnfxLlKP4KE8JYumTb0Mc=.8de322fc-0074-416f-a074-6b1d2c712d9b@github.com> Message-ID:

On Tue, 24 May 2022 22:00:20 GMT, Vladimir Kozlov wrote:

> I assume `Baseline` data includes #8525 changes. Right?

Yes, the baseline data includes #8525 changes.

> Looks good.
>
> Please, adopt @merykitty suggested changes for micro-benchmarks to include `store, cmove, branch` cases. Show performance results for all of them.
>
> Also adapt your regression tests for all 3 cases too.

Sure, I will update the microbenchmarks to include `store, cmove, branch` cases and post the performance data.

> src/hotspot/share/runtime/vmStructs.cpp line 1847:
>
>> 1845: declare_c2_type(SignumDNode, Node) \
>> 1846: declare_c2_type(SignumFNode, Node) \
>> 1847: declare_c2_type(IsInfiniteFNode, Node) \
>
> Where are other new nodes?

I did not know how to combine the 3 nodes into one. Could you please give a few pointers to refer to for implementing them?

-------------

PR: https://git.openjdk.java.net/jdk/pull/8459

From duke at openjdk.java.net Tue May 24 22:53:20 2022 From: duke at openjdk.java.net (Srinivas Vamsi Parasa) Date: Tue, 24 May 2022 22:53:20 GMT Subject: RFR: 8285868: x86 intrinsics for floating point methods isNaN, isFinite and isInfinite [v10] In-Reply-To: References: Message-ID:

> We develop optimized x86 intrinsics for the floating point class check methods `isNaN()`, `isFinite()` and `isInfinite()` for Float and Double classes. JMH benchmarks show up to `~70%` improvement using `vfpclasss(s/d)` instructions.
>
>
> Benchmark (ns/op)                Baseline  Intrinsic(vfpclasss/d)  Speedup(%)
> FloatClassCheck.testIsFinite        0.562                   0.406         28%
> FloatClassCheck.testIsInfinite      0.815                   0.383         53%
> FloatClassCheck.testIsNaN            0.63                   0.382         39%
> DoubleClassCheck.testIsFinite       0.565                   0.409         28%
> DoubleClassCheck.testIsInfinite     0.812                   0.375         54%
> DoubleClassCheck.testIsNaN          0.631                    0.38         40%
> FPComparison.isFiniteDouble       332.638                 272.577         18%
> FPComparison.isFiniteFloat        413.217                 331.825         20%
> FPComparison.isInfiniteDouble     874.897                 240.632         72%
> FPComparison.isInfiniteFloat      872.279                 321.269         63%
> FPComparison.isNanDouble          286.566                  240.36         16%
> FPComparison.isNanFloat           346.123                 316.923          8%

Srinivas Vamsi Parasa has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 13 commits:

 - update vmstructs
 - Merge branch 'master' of https://git.openjdk.java.net/jdk into float
 - Remove support for non vfpclasss/d based intrinsics
 - Merge branch 'master' of https://git.openjdk.java.net/jdk into float
 - add comment for vfpclasss/d for isFinite()
 - Merge branch 'master' of https://git.openjdk.java.net/jdk into float
 - zero out the upper bits not written by setb
 - use 0x1 to be simpler
 - remove the redundant temp register
 - Split the macros using predicate
 - ...
and 3 more: https://git.openjdk.java.net/jdk/compare/9b7e42c0...dca4ec4c ------------- Changes: https://git.openjdk.java.net/jdk/pull/8459/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=8459&range=09 Stats: 668 lines in 19 files changed: 666 ins; 0 del; 2 mod Patch: https://git.openjdk.java.net/jdk/pull/8459.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8459/head:pull/8459 PR: https://git.openjdk.java.net/jdk/pull/8459 From duke at openjdk.java.net Tue May 24 22:53:22 2022 From: duke at openjdk.java.net (Srinivas Vamsi Parasa) Date: Tue, 24 May 2022 22:53:22 GMT Subject: RFR: 8285868: x86 intrinsics for floating point methods isNaN, isFinite and isInfinite [v9] In-Reply-To: <3nPsl4E5pwB2poYUil8N39ZnfxLlKP4KE8JYumTb0Mc=.8de322fc-0074-416f-a074-6b1d2c712d9b@github.com> References: <3nPsl4E5pwB2poYUil8N39ZnfxLlKP4KE8JYumTb0Mc=.8de322fc-0074-416f-a074-6b1d2c712d9b@github.com> Message-ID: On Tue, 24 May 2022 21:52:34 GMT, Vladimir Kozlov wrote: >> Srinivas Vamsi Parasa has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 11 commits: >> >> - Remove support for non vfpclasss/d based intrinsics >> - Merge branch 'master' of https://git.openjdk.java.net/jdk into float >> - add comment for vfpclasss/d for isFinite() >> - Merge branch 'master' of https://git.openjdk.java.net/jdk into float >> - zero out the upper bits not written by setb >> - use 0x1 to be simpler >> - remove the redundant temp register >> - Split the macros using predicate >> - update jmh tests >> - Merge branch 'master' into float >> - ... and 1 more: https://git.openjdk.java.net/jdk/compare/c1db70d8...70bba0fe > > src/hotspot/share/runtime/vmStructs.cpp line 1847: > >> 1845: declare_c2_type(SignumDNode, Node) \ >> 1846: declare_c2_type(SignumFNode, Node) \ >> 1847: declare_c2_type(IsInfiniteFNode, Node) \ > > Where are other new nodes? Fixed it and pushed a new update. Sorry, that got missed! 
------------- PR: https://git.openjdk.java.net/jdk/pull/8459 From duke at openjdk.java.net Tue May 24 22:53:24 2022 From: duke at openjdk.java.net (Srinivas Vamsi Parasa) Date: Tue, 24 May 2022 22:53:24 GMT Subject: RFR: 8285868: x86 intrinsics for floating point methods isNaN, isFinite and isInfinite [v6] In-Reply-To: References: Message-ID: <0HKJDxPhDCx0omsyTSz9iM_SReaTx73b3VlCjhrWjl4=.b125ea18-cc74-4ae7-b494-39a29aedd631@github.com> On Wed, 18 May 2022 06:21:16 GMT, Jatin Bhateja wrote: >> Srinivas Vamsi Parasa has updated the pull request incrementally with one additional commit since the last revision: >> >> use 0x1 to be simpler > > test/micro/org/openjdk/bench/java/lang/FloatClassCheck.java line 45: > >> 43: public class FloatClassCheck { >> 44: >> 45: RandomGenerator rng; > > Just a suggestion: we can also create one benchmark to handle both the floating point types. True, I thought keeping them separate would allow future tests to be added for each type. ------------- PR: https://git.openjdk.java.net/jdk/pull/8459 From kvn at openjdk.java.net Tue May 24 23:09:57 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Tue, 24 May 2022 23:09:57 GMT Subject: RFR: 8285868: x86 intrinsics for floating point methods isNaN, isFinite and isInfinite [v9] In-Reply-To: References: <3nPsl4E5pwB2poYUil8N39ZnfxLlKP4KE8JYumTb0Mc=.8de322fc-0074-416f-a074-6b1d2c712d9b@github.com> Message-ID: On Tue, 24 May 2022 22:46:34 GMT, Srinivas Vamsi Parasa wrote: >> src/hotspot/share/runtime/vmStructs.cpp line 1847: >> >>> 1845: declare_c2_type(SignumDNode, Node) \ >>> 1846: declare_c2_type(SignumFNode, Node) \ >>> 1847: declare_c2_type(IsInfiniteFNode, Node) \ >> >> Where are other new nodes? > > Fixed it and pushed a new update. Sorry, that got missed! Good.
------------- PR: https://git.openjdk.java.net/jdk/pull/8459 From kvn at openjdk.java.net Tue May 24 23:10:00 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Tue, 24 May 2022 23:10:00 GMT Subject: RFR: 8285868: x86 intrinsics for floating point methods isNaN, isFinite and isInfinite [v6] In-Reply-To: <0HKJDxPhDCx0omsyTSz9iM_SReaTx73b3VlCjhrWjl4=.b125ea18-cc74-4ae7-b494-39a29aedd631@github.com> References: <0HKJDxPhDCx0omsyTSz9iM_SReaTx73b3VlCjhrWjl4=.b125ea18-cc74-4ae7-b494-39a29aedd631@github.com> Message-ID: <2gY1gTh2HZtCkKJLd5b3QM-o1YYp8E6JDLrWXQrI9rw=.c6ce5803-813d-428c-9187-738b91c0e1fb@github.com> On Tue, 24 May 2022 22:49:34 GMT, Srinivas Vamsi Parasa wrote: >> test/micro/org/openjdk/bench/java/lang/FloatClassCheck.java line 45: >> >>> 43: public class FloatClassCheck { >>> 44: >>> 45: RandomGenerator rng; >> >> Just a suggestion: we can also create one benchmark to handle both the floating point types. > > True, I thought keeping them separate would allow future tests to be added for each type. Yes, please keep them separate. ------------- PR: https://git.openjdk.java.net/jdk/pull/8459 From duke at openjdk.java.net Wed May 25 00:33:07 2022 From: duke at openjdk.java.net (Brian J. Stafford) Date: Wed, 25 May 2022 00:33:07 GMT Subject: RFR: 8263075: C2: simplify anti-dependence check in PhaseCFG::implicit_null_check() [v4] In-Reply-To: References: Message-ID: On Fri, 20 May 2022 12:05:44 GMT, Roberto Castañeda Lozano wrote: >> Brian J. Stafford has updated the pull request incrementally with one additional commit since the last revision: >> >> Removing whitespace > > src/hotspot/share/opto/lcm.cpp line 336: > >> 334: //mach is a store, hence block is the immediate dominator of mb. >> 335: //Due to the null-check shape of block (where its successors cannot re-join), >> 336: //block must be the direct predecessor of mb. > > Please, introduce a single space between each `//` and the comment text. Added, thank you!
------------- PR: https://git.openjdk.java.net/jdk/pull/8684 From fgao at openjdk.java.net Wed May 25 01:16:55 2022 From: fgao at openjdk.java.net (Fei Gao) Date: Wed, 25 May 2022 01:16:55 GMT Subject: RFR: 8283091: Support type conversion between different data sizes in SLP [v3] In-Reply-To: References: <4qFrtxwRHZfnzJbdVl2z7l6XENvxR3N8Sruc2fraKkw=.1391f639-1514-4e50-899a-d56cdd4afa99@github.com> <3_-2N1Kf4WIryx7eFIrXomabZJTeVNvSJ10joWdzN4s=.a16c8b8e-0834-48f8-9eac-6aaf07822ad5@github.com> Message-ID: On Tue, 24 May 2022 16:02:54 GMT, Vladimir Kozlov wrote: > Please update to latest JDK (Loom was integrated) and run performance again. Also include % of changes. > > I have the same concern as @DamonFool about regression when vectorizing some conversions. Maybe we should have an additional `Matcher` property we could consult when trying to **auto-vectorize**. I understand that we need `vcvt2Dto2I` when VectorAPI is specifically asking to generate it but we should not enforce auto-generation. @vnkozlov thanks for your review and kind suggestion! I'll update the patch to resolve the potential performance regression. ------------- PR: https://git.openjdk.java.net/jdk/pull/7806 From duke at openjdk.java.net Wed May 25 04:06:33 2022 From: duke at openjdk.java.net (Srinivas Vamsi Parasa) Date: Wed, 25 May 2022 04:06:33 GMT Subject: RFR: 8285868: x86 intrinsics for floating point methods isNaN, isFinite and isInfinite [v11] In-Reply-To: References: Message-ID: <141nY9CEDjk6ZULB8iCbdvIs16hxf39cxdNhtmwMLkA=.1fa60774-1ebe-428b-9462-5b95f44040f7@github.com> > We develop optimized x86 intrinsics for the floating point class check methods `isNaN()`, `isFinite()` and `isInfinite()` for Float and Double classes. JMH benchmarks show up to ~70% improvement using `vfpclasss(s/d)` instructions.
> > > Benchmark (ns/op) Baseline Intrinsic(vfpclasss/d) Speedup(%) > FloatClassCheck.testIsFinite 0.562 0.406 28% > FloatClassCheck.testIsInfinite 0.815 0.383 53% > FloatClassCheck.testIsNaN 0.63 0.382 39% > DoubleClassCheck.testIsFinite 0.565 0.409 28% > DoubleClassCheck.testIsInfinite 0.812 0.375 54% > DoubleClassCheck.testIsNaN 0.631 0.38 40% > FPComparison.isFiniteDouble 332.638 272.577 18% > FPComparison.isFiniteFloat 413.217 331.825 20% > FPComparison.isInfiniteDouble 874.897 240.632 72% > FPComparison.isInfiniteFloat 872.279 321.269 63% > FPComparison.isNanDouble 286.566 240.36 16% > FPComparison.isNanFloat 346.123 316.923 8% Srinivas Vamsi Parasa has updated the pull request incrementally with one additional commit since the last revision: update jmh bechmarks and jtreg tests ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/8459/files - new: https://git.openjdk.java.net/jdk/pull/8459/files/dca4ec4c..5789d1df Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=8459&range=10 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=8459&range=09-10 Stats: 141 lines in 4 files changed: 119 ins; 0 del; 22 mod Patch: https://git.openjdk.java.net/jdk/pull/8459.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8459/head:pull/8459 PR: https://git.openjdk.java.net/jdk/pull/8459 From jbhateja at openjdk.java.net Wed May 25 05:50:23 2022 From: jbhateja at openjdk.java.net (Jatin Bhateja) Date: Wed, 25 May 2022 05:50:23 GMT Subject: RFR: 8284960: Integration of JEP 426: Vector API (Fourth Incubator) [v9] In-Reply-To: References: Message-ID: > Hi All, > > Patch adds the planned support for new vector operations and APIs targeted for [JEP 426: Vector API (Fourth Incubator).](https://bugs.openjdk.java.net/browse/JDK-8280173) > > Following is the brief summary of changes:- > > 1) Extends the scope of existing lanewise API for following new vector operations. 
> - VectorOperations.BIT_COUNT: counts the number of one-bits > - VectorOperations.LEADING_ZEROS_COUNT: counts the number of leading zero bits > - VectorOperations.TRAILING_ZEROS_COUNT: counts the number of trailing zero bits > - VectorOperations.REVERSE: reversing the order of bits > - VectorOperations.REVERSE_BYTES: reversing the order of bytes > - compress and expand bits: Semantics are based on Hacker's Delight section 7-4 Compress, or Generalized Extract. > > 2) Adds following new APIs to perform cross lane vector compress and expansion operations under the influence of a mask. > - Vector.compress > - Vector.expand > - VectorMask.compress > > 3) Adds predicated and non-predicated versions of following new APIs to load and store the contents of vector from foreign MemorySegments. > - Vector.fromMemorySegment > - Vector.intoMemorySegment > > 4) C2 Compiler IR enhancements and optimized X86 and AARCH64 backend support for each newly added operation. > > > Patch has been regressed over AARCH64 and X86 targets different AVX levels. > > Kindly review and share your feedback. > > Best Regards, > Jatin Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: 8284960: Review comments resolved. 
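
As a rough scalar picture of item 1 in the summary above, each new lanewise operation applies a familiar `java.lang.Integer` bit method to every lane. The sketch assumes `int` lanes and uses a plain `int[]` as a stand-in for a vector; `LanewiseScalarModel` and its method names are hypothetical, and nothing here touches the incubating Vector API itself.

```java
import java.util.function.IntUnaryOperator;

// Hypothetical scalar model of the new lanewise operations for int lanes.
final class LanewiseScalarModel {
    // Applies one scalar operation independently to every lane.
    static int[] lanewise(int[] lanes, IntUnaryOperator op) {
        int[] out = new int[lanes.length];
        for (int i = 0; i < lanes.length; i++) {
            out[i] = op.applyAsInt(lanes[i]);
        }
        return out;
    }

    // Number of one-bits per lane.
    static int[] bitCount(int[] v)           { return lanewise(v, Integer::bitCount); }
    // Number of leading zero bits per lane (32 for a zero lane).
    static int[] leadingZerosCount(int[] v)  { return lanewise(v, Integer::numberOfLeadingZeros); }
    // Number of trailing zero bits per lane (32 for a zero lane).
    static int[] trailingZerosCount(int[] v) { return lanewise(v, Integer::numberOfTrailingZeros); }
    // Bit order reversed within each lane.
    static int[] reverse(int[] v)            { return lanewise(v, Integer::reverse); }
    // Byte order reversed within each lane.
    static int[] reverseBytes(int[] v)       { return lanewise(v, Integer::reverseBytes); }
}
```

Lanes of other widths would need the matching `Long`, `Short` or `Byte` helpers; that mapping is an assumption of this sketch rather than something stated in the thread.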
------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/8425/files - new: https://git.openjdk.java.net/jdk/pull/8425/files/17a0e38c..a2c9673d Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=8425&range=08 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=8425&range=07-08 Stats: 110 lines in 7 files changed: 42 ins; 31 del; 37 mod Patch: https://git.openjdk.java.net/jdk/pull/8425.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8425/head:pull/8425 PR: https://git.openjdk.java.net/jdk/pull/8425 From jbhateja at openjdk.java.net Wed May 25 06:20:43 2022 From: jbhateja at openjdk.java.net (Jatin Bhateja) Date: Wed, 25 May 2022 06:20:43 GMT Subject: RFR: 8283894: Intrinsify compress and expand bits on x86 [v7] In-Reply-To: References: Message-ID: > Summary of changes: > > - Patch intrinsifies following newly added Java SE APIs > - Integer.compress > - Integer.expand > - Long.compress > - Long.expand > > - Adds C2 IR nodes and corresponding ideal transformations for new operations. > - We see around ~10x performance speedup due to intrinsification over X86 target. > - Adds an IR framework based test to validate newly introduced IR transformations. > > Kindly review and share your feedback. > > Best Regards, > Jatin Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: 8283894: Disabling sanity test as per review suggestion. 
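
The semantics of the four APIs listed in the summary above can be illustrated with a bit-by-bit reference model. This is a sketch assuming the usual compress/expand definition, in which the bits selected by the mask are packed into, or unpacked from, the low-order end of the word. `CompressExpandModel` is a hypothetical name, only the `int` variants are shown, and the loops are unrelated to whatever code the intrinsic emits.

```java
// Hypothetical reference model for the int variants; not the patch's code.
final class CompressExpandModel {
    // Gathers the bits of x at positions where mask is set and packs them,
    // in order, into the low-order bits of the result.
    static int compress(int x, int mask) {
        int result = 0;
        int outBit = 0;
        for (int i = 0; i < 32; i++) {
            if (((mask >>> i) & 1) != 0) {
                result |= ((x >>> i) & 1) << outBit;
                outBit++;
            }
        }
        return result;
    }

    // Scatters the low-order bits of x, in order, to the positions where
    // mask is set; all other result bits are zero.
    static int expand(int x, int mask) {
        int result = 0;
        int inBit = 0;
        for (int i = 0; i < 32; i++) {
            if (((mask >>> i) & 1) != 0) {
                result |= ((x >>> inBit) & 1) << i;
                inBit++;
            }
        }
        return result;
    }
}
```

In this model `compress(0xCAFEBABE, 0xFF00FFF0)` gives `0x000CABAB`, and `expand(0x000CABAB, 0xFF00FFF0)` gives `0xCA00BAB0`, so expanding a compressed value recovers the masked bits of the input.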
------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/8498/files - new: https://git.openjdk.java.net/jdk/pull/8498/files/6bb6d343..553c3c39 Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=8498&range=06 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=8498&range=05-06 Stats: 5 lines in 1 file changed: 0 ins; 0 del; 5 mod Patch: https://git.openjdk.java.net/jdk/pull/8498.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8498/head:pull/8498 PR: https://git.openjdk.java.net/jdk/pull/8498 From jbhateja at openjdk.java.net Wed May 25 06:29:23 2022 From: jbhateja at openjdk.java.net (Jatin Bhateja) Date: Wed, 25 May 2022 06:29:23 GMT Subject: RFR: 8284960: Integration of JEP 426: Vector API (Fourth Incubator) [v10] In-Reply-To: References: Message-ID: > Hi All, > > Patch adds the planned support for new vector operations and APIs targeted for [JEP 426: Vector API (Fourth Incubator).](https://bugs.openjdk.java.net/browse/JDK-8280173) > > Following is the brief summary of changes:- > > 1) Extends the scope of existing lanewise API for following new vector operations. > - VectorOperations.BIT_COUNT: counts the number of one-bits > - VectorOperations.LEADING_ZEROS_COUNT: counts the number of leading zero bits > - VectorOperations.TRAILING_ZEROS_COUNT: counts the number of trailing zero bits > - VectorOperations.REVERSE: reversing the order of bits > - VectorOperations.REVERSE_BYTES: reversing the order of bytes > - compress and expand bits: Semantics are based on Hacker's Delight section 7-4 Compress, or Generalized Extract. > > 2) Adds following new APIs to perform cross lane vector compress and expansion operations under the influence of a mask. > - Vector.compress > - Vector.expand > - VectorMask.compress > > 3) Adds predicated and non-predicated versions of following new APIs to load and store the contents of vector from foreign MemorySegments. 
> - Vector.fromMemorySegment > - Vector.intoMemorySegment > > 4) C2 Compiler IR enhancements and optimized X86 and AARCH64 backend support for each newly added operation. > > > Patch has been regressed over AARCH64 and X86 targets different AVX levels. > > Kindly review and share your feedback. > > Best Regards, > Jatin Jatin Bhateja has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 20 commits: - 8284960: Post merge cleanups. - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8284960 - 8284960: Review comments resolved. - 8284960: Integrating incremental patches. - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8284960 - 8284960: Changes to enable jdk.incubator.vector to be treated as preview participant. Code re-organization related to Reverse/ReverseByte IR transforms. - 8284960: Adding --enable-preview in vectorAPI benchmarks. - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8284960 - 8284960: Review comments resolution. - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8284960 - ... 
and 10 more: https://git.openjdk.java.net/jdk/compare/742644e2...0f6e1584 ------------- Changes: https://git.openjdk.java.net/jdk/pull/8425/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=8425&range=09 Stats: 38021 lines in 228 files changed: 16652 ins; 16924 del; 4445 mod Patch: https://git.openjdk.java.net/jdk/pull/8425.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8425/head:pull/8425 PR: https://git.openjdk.java.net/jdk/pull/8425 From jbhateja at openjdk.java.net Wed May 25 06:29:24 2022 From: jbhateja at openjdk.java.net (Jatin Bhateja) Date: Wed, 25 May 2022 06:29:24 GMT Subject: RFR: 8284960: Integration of JEP 426: Vector API (Fourth Incubator) [v8] In-Reply-To: References: <-gYfiftVAdAUo-yZv2Y04HhoT7JT5lDcjDjCZ0UvSVc=.aa9d454d-3d6a-458a-997e-9a83951a8fa6@github.com> Message-ID: <4u_PL8-QxIYVBgJ23LTdoPWZrTKh70K-beMYXOXZgQQ=.a650a280-dc75-46cd-8449-00c38ddd91ea@github.com> On Mon, 23 May 2022 22:17:40 GMT, Vladimir Kozlov wrote: >> Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: >> >> 8284960: Integrating incremental patches. > > src/hotspot/cpu/x86/assembler_x86.cpp line 8173: > >> 8171: >> 8172: void Assembler::vinsertf32x4(XMMRegister dst, XMMRegister nds, XMMRegister src, uint8_t imm8) { >> 8173: assert(VM_Version::supports_evex(), ""); > > Hmm, did we never trigger this wrong assert because the use was guarded by correct check? Yes. 
------------- PR: https://git.openjdk.java.net/jdk/pull/8425 From jbhateja at openjdk.java.net Wed May 25 06:32:19 2022 From: jbhateja at openjdk.java.net (Jatin Bhateja) Date: Wed, 25 May 2022 06:32:19 GMT Subject: RFR: 8284960: Integration of JEP 426: Vector API (Fourth Incubator) [v9] In-Reply-To: References: Message-ID: On Wed, 25 May 2022 05:50:23 GMT, Jatin Bhateja wrote: >> Hi All, >> >> Patch adds the planned support for new vector operations and APIs targeted for [JEP 426: Vector API (Fourth Incubator).](https://bugs.openjdk.java.net/browse/JDK-8280173) >> >> Following is the brief summary of changes:- >> >> 1) Extends the scope of existing lanewise API for following new vector operations. >> - VectorOperations.BIT_COUNT: counts the number of one-bits >> - VectorOperations.LEADING_ZEROS_COUNT: counts the number of leading zero bits >> - VectorOperations.TRAILING_ZEROS_COUNT: counts the number of trailing zero bits >> - VectorOperations.REVERSE: reversing the order of bits >> - VectorOperations.REVERSE_BYTES: reversing the order of bytes >> - compress and expand bits: Semantics are based on Hacker's Delight section 7-4 Compress, or Generalized Extract. >> >> 2) Adds following new APIs to perform cross lane vector compress and expansion operations under the influence of a mask. >> - Vector.compress >> - Vector.expand >> - VectorMask.compress >> >> 3) Adds predicated and non-predicated versions of following new APIs to load and store the contents of vector from foreign MemorySegments. >> - Vector.fromMemorySegment >> - Vector.intoMemorySegment >> >> 4) C2 Compiler IR enhancements and optimized X86 and AARCH64 backend support for each newly added operation. >> >> >> Patch has been regressed over AARCH64 and X86 targets different AVX levels. >> >> Kindly review and share your feedback. 
>> >> Best Regards, >> Jatin > > Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: > > 8284960: Review comments resolved. Hi @vnkozlov , Your comments have been addressed. ------------- PR: https://git.openjdk.java.net/jdk/pull/8425 From jbhateja at openjdk.java.net Wed May 25 06:34:10 2022 From: jbhateja at openjdk.java.net (Jatin Bhateja) Date: Wed, 25 May 2022 06:34:10 GMT Subject: RFR: 8283894: Intrinsify compress and expand bits on x86 [v6] In-Reply-To: References: Message-ID: On Mon, 23 May 2022 22:29:02 GMT, Paul Sandoz wrote: >> Jatin Bhateja has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains ten additional commits since the last revision: >> >> - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8283894 >> - 8283894: Removing CompressExpandSanityTest from problem list. >> - 8283894: Updating test tag spec. >> - 8283894: Review comments resolved. >> - 8283894: Add missing -XX:+UnlockDiagnosticVMOptions. >> - 8283894: Review comments resolutions. >> - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8283894 >> - 8283894: Extending IR framework testcase with some functional test points. 
>> - 8283894: Intrinsify compress and expand bits on x86 > > test/jdk/java/lang/CompressExpandSanityTest.java line 29: > >> 27: * @key randomness >> 28: * @run testng/othervm -XX:+UnlockDiagnosticVMOptions -XX:DisableIntrinsic=_expand_i,_expand_l,_compress_i,_compress_l CompressExpandSanityTest >> 29: * @run testng CompressExpandSanityTest > > Can we comment out the annotations so this test is not run by default (I don't know of a better way)? @PaulSandoz, I have commented out the test tag for the sanity test. ------------- PR: https://git.openjdk.java.net/jdk/pull/8498 From xliu at openjdk.java.net Wed May 25 06:40:59 2022 From: xliu at openjdk.java.net (Xin Liu) Date: Wed, 25 May 2022 06:40:59 GMT Subject: RFR: 8286104: use aggressive liveness for unstable_if traps [v3] In-Reply-To: References: Message-ID: > I found that some phi nodes are considered useful only because their uses are uncommon_trap nodes. In the worst case, this hinders boxing elimination and scalar replacement. Test.java of JDK-8286104 reveals this issue. This patch allows the C2 parser to collect liveness based on the next bci for unstable_if traps. In most cases, the next bci is closer to the exits, so fewer locals are live there. This helps to reduce the number of inputs of unstable_if traps. > > This is not a REDO of Optimization of Box nodes in uncommon_trap (JDK-8261137). The two patches are orthogonal. I adapted the test from [TestEliminateBoxInDebugInfo.java](https://github.com/openjdk/jdk/pull/2401/files#diff-49b2e38825aa4c28ca196bdc70c3cbecc2e835c2899f4f393527df4796b177ea), so part of the credit goes to the original author. I found that scalar replacement can take care of the object `Integer ii = Integer.valueOf(value)` in the original test even if it can't be removed by the later inliner. I tweaked the profiling data of Integer.valueOf() to hinder scalar replacement. > > This patch can cover the problem discovered by JDK-8261137. I ran the microbenchmark and got a 9x speedup on x86_64. It's almost the same as JDK-8261137.
>
> Before:
>
> Benchmark               Mode  Cnt   Score   Error  Units
> MyBenchmark.testMethod  avgt   10  32.776 ± 0.075  us/op
>
> After:
>
> Benchmark               Mode  Cnt   Score   Error  Units
> MyBenchmark.testMethod  avgt   10   3.656 ± 0.006  us/op
> ```
>
> Testing
> I ran tier1 test with and without `-XX:+DeoptimizeALot`. No regression has been found yet.

Xin Liu has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains 16 additional commits since the last revision:

 - reimplement process_unstable_ifs
 - remember UnstableIfTrap in parser. Also add a statistical counter for trivial counter
 - Merge branch 'master' into JDK-8286104
 - bail out a corner case that ifnode postpones fold-compares after loop optimization.
 - revert code change from 1st revision.
 - Merge branch 'JDK-8276998' into JDK-8286104
 - rule out if a If nodes has 2 branches of unstable_if trap.
 - change the flag to diagnostic.
 - add sanity check for operands if bc is if_acmp_eq/ne and ifnull/nonnull
 - fix release build
 - ...
and 6 more: https://git.openjdk.java.net/jdk/compare/13ace249...3eae8c5e ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/8545/files - new: https://git.openjdk.java.net/jdk/pull/8545/files/2f047457..3eae8c5e Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=8545&range=02 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=8545&range=01-02 Stats: 214960 lines in 2764 files changed: 160786 ins; 40446 del; 13728 mod Patch: https://git.openjdk.java.net/jdk/pull/8545.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8545/head:pull/8545 PR: https://git.openjdk.java.net/jdk/pull/8545 From duke at openjdk.java.net Wed May 25 07:25:58 2022 From: duke at openjdk.java.net (Tobias Holenstein) Date: Wed, 25 May 2022 07:25:58 GMT Subject: RFR: JDK-8284944: assert(cnt++ < 40) failed: infinite cycle in loop optimization [v2] In-Reply-To: References: <6tYUlU6To3dIk5NZNcyu2PI8m72uLsw09qO_5ca4GBY=.97d63197-96e1-4f4f-b854-0d54c1628267@github.com> <048Bn62N56IG6Z-1e1POyo-gjogYkq9lDy4I5TQXOLk=.f5a67ec6-22f6-4541-b41b-7920851b1bd5@github.com> Message-ID: On Tue, 24 May 2022 14:11:13 GMT, Vladimir Kozlov wrote: >>> As I commented in bug report, it took 54 (cnt == 52) iterations to finish compilation TestMaxLoopOptsCountReached::test method. Should we just rise default LoopOptsCount flag's value and limit in the assert? >> >> I think the problem is to determine what a good value should be. I'm afraid we could probably come up with some other cases where we can reach the limit again and hit the assert with a larger value. I agree that there might be a problem with `major_progess` and we should not do so many optimizations. But in general, I'm not sure how we can prove that we will only hit the assert in case of a real bug and not just a false positive. Maybe we would need some additional heuristics to make a decision (like handling the live node limit)? This would need some more careful investigation. 
>> >> The way the assert is currently implemented does not help us. It will never be hit without explicitly disabling partial peeling. So, I think we should either remove it or change its value to 39 to possibly let it fail on the very last loop opts iteration. However, since we already have some different cases where we would fail with 39, this is currently not a good option. But I agree with Vladimir that it would be good to have some mechanism to warn us about such problems in the future. >> >> So, I think there are 2 problems to solve: >> - Investigating the known cases so far and figure out why they need so many loop opts iterations. >> - Make the assert useful again (blocked by the first problem). >> >> I'd suggest to investigate the cases we've found so far and file separate bugs for them if they turn out to be real bugs. Additionally, I suggest the following 2 options: >> - Remove the assert for now (given that it has never worked as it was supposed to work) and file an RFE to check if we can reintroduce it once the known cases/bugs are fixed and we have some confidence to not hit it again (could include tweaking the `LoopOptsCount` max value or introducing some other heuristics to avoid false positives). >> - Keep the assert as we have it today and just defer this bug and treat it like the RFE above. >> >> I would opt for option 1 given that the assert has no real value as of today. >> >> What do you think? >> >> Thanks, >> Christian > >> * Remove the assert for now (given that it has never worked as it was supposed to work) and file an RFE to check if we can reintroduce it once the known cases/bugs are fixed and we have some confidence to not hit it again (could include tweaking the `LoopOptsCount` max value or introducing some other heuristics to avoid false positives). >> * Keep the assert as we have it today and just defer this bug and treat it like the RFE above. >> >> I would opt for option 1 given that the assert has no real value as of today. 
> > Okay, I agree. Thanks @vnkozlov and @chhagedorn for your feedback. I will file a follow-up RFE to reintroduce a better check for loop optimization progress and a bug for the infinite loop in loop optimizations.
Changeset: 593d2b7d Author: Andrew Haley URL: https://git.openjdk.java.net/jdk/commit/593d2b7dab934875527249be6840f328147b72b3 Stats: 2 lines in 1 file changed: 0 ins; 0 del; 2 mod 8287091: aarch64 : guarantee(val < (1ULL << nbits)) failed: Field too big for insn Reviewed-by: ngasson, shade ------------- PR: https://git.openjdk.java.net/jdk/pull/8845 From chagedorn at openjdk.java.net Wed May 25 08:25:31 2022 From: chagedorn at openjdk.java.net (Christian Hagedorn) Date: Wed, 25 May 2022 08:25:31 GMT Subject: RFR: 8286940: [IR Framework] Allow IR tests to build and use Whitebox without -DSkipWhiteBoxInstall=true Message-ID: Currently, the IR framework always tries to install the Whitebox by moving the Whitebox class file to the JTreg class path. However, when a test already builds the Whitebox and uses it as part of the test, we cannot access it on certain platforms. On Windows, for example, we'll get the following exception: Caused by: java.nio.file.FileSystemException: sun\hotspot\WhiteBox.class: The process cannot access the file because it is being used by another process To mitigate this problem, one can specify `-DSkipWhiteBoxInstall=true` which was already done in [JDK-8283187](https://bugs.openjdk.java.net/browse/JDK-8283187). But this is not a good solution as the user should not need to worry about the inner workings of the IR framework. I propose to get rid of this flag by reworking the Whitebox installation process. 
Thanks, Christian ------------- Commit messages: - formatting - 8286940: [IR Framework] Allow IR tests to build and use Whitebox without -DSkipWhiteBoxInstall=true Changes: https://git.openjdk.java.net/jdk/pull/8879/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=8879&range=00 Issue: https://bugs.openjdk.java.net/browse/JDK-8286940 Stats: 36 lines in 5 files changed: 28 ins; 1 del; 7 mod Patch: https://git.openjdk.java.net/jdk/pull/8879.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8879/head:pull/8879 PR: https://git.openjdk.java.net/jdk/pull/8879 From rcastanedalo at openjdk.java.net Wed May 25 09:08:12 2022 From: rcastanedalo at openjdk.java.net (Roberto =?UTF-8?B?Q2FzdGHDsWVkYQ==?= Lozano) Date: Wed, 25 May 2022 09:08:12 GMT Subject: RFR: 8263075: C2: simplify anti-dependence check in PhaseCFG::implicit_null_check() [v5] In-Reply-To: References: Message-ID: On Tue, 24 May 2022 16:42:13 GMT, Brian J. Stafford wrote: >> The reporter for this issue (https://bugs.openjdk.java.net/browse/JDK-8263075) indicated that there's an assumption that we can rely on that the while loop in question will run exactly one time. Based on this, I've done the following: >> >> - Asserted the condition that makes sure the code runs at least once >> - Asserted the condition that makes sure the code runs only once >> - Removed the `while` loop >> - Changed a couple of `break` statements into `continue` statements. They no longer need to break out of the `while` loop, now that it's gone. However, they were early exits from the `while` loop that ended up resulting in `continue` statements for the larger enclosing loop. Thus we can just call `continue` directly. >> - Removed the local variable `b`, as we no longer need to traverse the node hierarchy. We can use `mb` directly. >> >> Passes jdk, langtools, and hotspot Tier 1 tests on Linux (x64 and ARM64) and macOS (x64 and ARM64). 
Most Tier 1 tests pass on Windows (x64 and ARM64), but there are a handful of failures unrelated to this change. > > Brian J. Stafford has updated the pull request incrementally with one additional commit since the last revision: > > Added some whitespace Thanks for addressing my last comment! ------------- Marked as reviewed by rcastanedalo (Committer). PR: https://git.openjdk.java.net/jdk/pull/8684 From rrich at openjdk.java.net Wed May 25 11:07:24 2022 From: rrich at openjdk.java.net (Richard Reingruber) Date: Wed, 25 May 2022 11:07:24 GMT Subject: RFR: 8287205: generate_cont_thaw generates dead code after jump to exception handler Message-ID: This fix avoids generating unreachable instructions after the jump to the exception handler in `generate_cont_thaw()` Testing: jtreg:test/hotspot/jtreg:hotspot_loom jtreg:test/jdk:jdk_loom On linux x86_64 and aarch64. On aarch64 the test `jdk/java/lang/management/ThreadMXBean/VirtualThreadDeadlocks.java` had a timeout. The aarch64 machine I used is very slow. This might have caused the timeout.
------------- Commit messages: - Merge branch 'master' into 8287205_remove_dead_code_from_generate_cont_thaw - Remove dead code from generate_cont_thaw Changes: https://git.openjdk.java.net/jdk/pull/8863/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=8863&range=00 Issue: https://bugs.openjdk.java.net/browse/JDK-8287205 Stats: 16 lines in 2 files changed: 8 ins; 8 del; 0 mod Patch: https://git.openjdk.java.net/jdk/pull/8863.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8863/head:pull/8863 PR: https://git.openjdk.java.net/jdk/pull/8863 From shade at openjdk.java.net Wed May 25 11:07:24 2022 From: shade at openjdk.java.net (Aleksey Shipilev) Date: Wed, 25 May 2022 11:07:24 GMT Subject: RFR: 8287205: generate_cont_thaw generates dead code after jump to exception handler In-Reply-To: References: Message-ID: On Tue, 24 May 2022 07:33:02 GMT, Richard Reingruber wrote: > This fix avoids generating unreachable instructions after the jump to the exception handler in `generate_cont_thaw()` > > Testing: > > jtreg:test/hotspot/jtreg:hotspot_loom > jtreg:test/jdk:jdk_loom > > On linux x86_64 and aarch64. > > On aarch64 the test `jdk/java/lang/management/ThreadMXBean/VirtualThreadDeadlocks.java` had a timeout. The aarch64 machine I used is very slow. This might have caused the timeout. I think this is still `hotspot-compiler`. I also suggest pulling from master and running `make test TEST="hotspot_loom jdk_loom"`. ------------- PR: https://git.openjdk.java.net/jdk/pull/8863 From rrich at openjdk.java.net Wed May 25 11:07:25 2022 From: rrich at openjdk.java.net (Richard Reingruber) Date: Wed, 25 May 2022 11:07:25 GMT Subject: RFR: 8287205: generate_cont_thaw generates dead code after jump to exception handler In-Reply-To: References: Message-ID: On Tue, 24 May 2022 14:15:03 GMT, Aleksey Shipilev wrote: > I think this is still `hotspot-compiler` Ok.
I will revert to `hotspot-compiler` > I also suggest to pull from master and run `make test TEST="hotspot_loom jdk_loom"`. Will do. I'll await our nightly tests before marking this PR as ready for review. ------------- PR: https://git.openjdk.java.net/jdk/pull/8863 From shade at openjdk.java.net Wed May 25 11:11:55 2022 From: shade at openjdk.java.net (Aleksey Shipilev) Date: Wed, 25 May 2022 11:11:55 GMT Subject: RFR: 8287205: generate_cont_thaw generates dead code after jump to exception handler In-Reply-To: References: Message-ID: On Tue, 24 May 2022 07:33:02 GMT, Richard Reingruber wrote: > This fix avoids generating unreachable instructions after the jump to the exception handler in `generate_cont_thaw()` > > Testing: > > jtreg:test/hotspot/jtreg:hotspot_loom > jtreg:test/jdk:jdk_loom > > On linux x86_64 and aarch64. > > On aarch64 the test `jdk/java/lang/management/ThreadMXBean/VirtualThreadDeadlocks.java` had a timeout. The aarch64 machine I used is very slow. This might have caused the timeout. Looks fine to me. I missed this while refactoring x86_64 generate_cont_thaw :) ------------- Marked as reviewed by shade (Reviewer). PR: https://git.openjdk.java.net/jdk/pull/8863 From duke at openjdk.java.net Wed May 25 12:29:03 2022 From: duke at openjdk.java.net (aamarsh) Date: Wed, 25 May 2022 12:29:03 GMT Subject: RFR: 8282024: add EscapeAnalysis statistics under PrintOptoStatistics [v12] In-Reply-To: References: Message-ID: On Fri, 20 May 2022 19:21:21 GMT, Vladimir Kozlov wrote: >> aamarsh has updated the pull request incrementally with one additional commit since the last revision: >> >> eliminate trailing whitespace > > It is up to you but I would suggest to remove your change to collect time since we have it already with CITime. @vnkozlov It looks like there are two Windows x64 tests failing and after a re-run they have been stuck on the task for 20 hours.
The failure happened at: tools/javac/Paths/MineField.sh tools/javac/Paths/wcMineField.sh with the message: `STDERR: Main.java:1: error: cannot find symbol public class Main {public static void main(String[] a) {Lib.f();}}` It does not appear that my code would affect these tests, but I don't want to integrate if the checks are not all passing. ------------- PR: https://git.openjdk.java.net/jdk/pull/8019 From alanb at openjdk.java.net Wed May 25 12:48:44 2022 From: alanb at openjdk.java.net (Alan Bateman) Date: Wed, 25 May 2022 12:48:44 GMT Subject: RFR: 8287205: generate_cont_thaw generates dead code after jump to exception handler In-Reply-To: References: Message-ID: On Tue, 24 May 2022 07:33:02 GMT, Richard Reingruber wrote: > On aarch64 the test `jdk/java/lang/management/ThreadMXBean/VirtualThreadDeadlocks.java` had a timeout. The aarch64 machine I used is very slow. This might have caused the timeout. There's an issue with that test, it's tracked by JDK-8287200. Leonid has a PR open to fix it. ------------- PR: https://git.openjdk.java.net/jdk/pull/8863 From rcastanedalo at openjdk.java.net Wed May 25 12:49:55 2022 From: rcastanedalo at openjdk.java.net (Roberto Castañeda Lozano) Date: Wed, 25 May 2022 12:49:55 GMT Subject: RFR: 8263075: C2: simplify anti-dependence check in PhaseCFG::implicit_null_check() [v2] In-Reply-To: References: <9tV7TXcuMDkKb0tIBZnIzzwiAp3knfEnad8t8d5p0Q8=.e7dfb533-e5c1-42ac-b8a6-61ac7b59e1db@github.com> Message-ID: On Thu, 19 May 2022 22:51:26 GMT, Brian J. Stafford wrote: >>> Looks good to me and tests passed. @robcasloz should also have a look. >> >> Running some additional tests, will come back with the results. > > Thank you @robcasloz for the suggestions, hopefully I've incorporated them as you expected. Please let me know if I should make further changes. @brianjstafford feel free to mark this PR as ready for integration (`/integrate`) when you think it is.
------------- PR: https://git.openjdk.java.net/jdk/pull/8684 From duke at openjdk.java.net Wed May 25 13:51:19 2022 From: duke at openjdk.java.net (Tobias Holenstein) Date: Wed, 25 May 2022 13:51:19 GMT Subject: Integrated: JDK-8284944: assert(cnt++ < 40) failed: infinite cycle in loop optimization In-Reply-To: <6tYUlU6To3dIk5NZNcyu2PI8m72uLsw09qO_5ca4GBY=.97d63197-96e1-4f4f-b854-0d54c1628267@github.com> References: <6tYUlU6To3dIk5NZNcyu2PI8m72uLsw09qO_5ca4GBY=.97d63197-96e1-4f4f-b854-0d54c1628267@github.com> Message-ID: On Wed, 18 May 2022 12:33:24 GMT, Tobias Holenstein wrote: > `_loop_opts_cnt` is set to `LoopOptsCount` which can have a maximum value of 43. `_loop_opts_cnt` is decremented in `PHASE_PHASEIDEALLOOP1`, `PHASE_PHASEIDEALLOOP2` and `PHASE_PHASEIDEALLOOP3` before it reaches `PHASE_PHASEIDEALLOOP_ITERATIONS` where it is decremented further in a loop until `_loop_opts_cnt` is 0. The assert assumes that `_loop_opts_cnt` has max. value 40 in `PHASE_PHASEIDEALLOOP_ITERATIONS`. But when `PartialPeelLoop` is turned off `PHASE_PHASEIDEALLOOP2` is skipped and `_loop_opts_cnt` can have max. value 41 in `PHASE_PHASEIDEALLOOP_ITERATIONS`. Therefore the assert is wrong. > > I propose to remove the assert entirely since the loop already has a condition `_loop_opts_cnt > 0` and `_loop_opts_cnt` is decremented in every iteration. This pull request has now been integrated. Changeset: 796494d0 Author: Tobias Holenstein Committer: Christian Hagedorn URL: https://git.openjdk.java.net/jdk/commit/796494d0fecfb9587e8b68ff1d5c09411cb82f89 Stats: 127 lines in 2 files changed: 125 ins; 2 del; 0 mod 8284944: assert(cnt++ < 40) failed: infinite cycle in loop optimization Reviewed-by: kvn, chagedorn ------------- PR: https://git.openjdk.java.net/jdk/pull/8767 From duke at openjdk.java.net Wed May 25 15:29:06 2022 From: duke at openjdk.java.net (Brian J. 
Stafford) Date: Wed, 25 May 2022 15:29:06 GMT Subject: RFR: 8263075: C2: simplify anti-dependence check in PhaseCFG::implicit_null_check() [v5] In-Reply-To: References: Message-ID: <0lXgMa9Y8UBvYJvY09n8q4IAz4rrabc8qY0TJfuwNi4=.881c7ffe-aabd-4ef1-baf3-d86fbc257983@github.com> On Tue, 24 May 2022 16:42:13 GMT, Brian J. Stafford wrote: >> The reporter for this issue (https://bugs.openjdk.java.net/browse/JDK-8263075) indicated that there's an assumption that we can rely on that the while loop in question will run exactly one time. Based on this, I've done the following: >> >> - Asserted the condition that makes sure the code runs at least once >> - Asserted the condition that makes sure the code runs only once >> - Removed the `while` loop >> - Changed a couple of `break` statements into `continue` statements. They no longer need to break out of the `while` loop, now that it's gone. However, they were early exits from the `while` loop that ended up resulting in `continue` statements for the larger enclosing loop. Thus we can just call `continue` directly. >> - Removed the local variable `b`, as we no longer need to traverse the node hierarchy. We can use `mb` directly. >> >> Passes jdk, langtools, and hotspot Tier 1 tests on Linux (x64 and ARM64) and macOS (x64 and ARM64). Most Tier 1 tests pass on Windows (x64 and ARM64), but there are a handful of failures unrelated to this change. > > Brian J. Stafford has updated the pull request incrementally with one additional commit since the last revision: > > Added some whitespace The langtools/tier1 failures on Windows x64 are not related to this recent whitespace change, and I'm seeing similar failures on other outstanding PRs. Might be an infrastructure issue. However, I'm not positive if the failures will prevent integration. 
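For readers following the quoted description, the general shape of that refactoring can be sketched in plain Java. This is only an illustration under stated assumptions: `Block`, `pred`, `hasConflict`, `selectWithLoop` and `selectWithAsserts` are hypothetical names, and none of this is the HotSpot code in `PhaseCFG::implicit_null_check()`. It shows a `while` loop that is known to run exactly once being replaced by two checks, with the inner `break` becoming a `continue` of the enclosing loop.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; not the actual C2 code.
class SingleIterationLoopSketch {
    // Stand-in for a CFG block with one predecessor and a "conflict" flag.
    static final class Block {
        final Block pred;
        final boolean hasConflict;

        Block(Block pred, boolean hasConflict) {
            this.pred = pred;
            this.hasConflict = hasConflict;
        }
    }

    // Before: walk from mb towards target, leaving the walk early on a conflict.
    static List<Block> selectWithLoop(List<Block> candidates, Block target) {
        List<Block> accepted = new ArrayList<>();
        for (Block mb : candidates) {
            Block b = mb;
            while (b != target) {
                if (b.hasConflict) {
                    break; // early exit from the walk
                }
                b = b.pred;
            }
            if (b != target) {
                continue; // the early exit turns into skipping this candidate
            }
            accepted.add(mb);
        }
        return accepted;
    }

    // After: the walk is assumed to take exactly one step, so it is replaced
    // by two checks and the local variable b is no longer needed.
    static List<Block> selectWithAsserts(List<Block> candidates, Block target) {
        List<Block> accepted = new ArrayList<>();
        for (Block mb : candidates) {
            if (mb == target) {
                throw new AssertionError("walk must run at least once");
            }
            if (mb.pred != target) {
                throw new AssertionError("walk must run only once");
            }
            if (mb.hasConflict) {
                continue; // was: break out of the walk, then continue
            }
            accepted.add(mb);
        }
        return accepted;
    }

    public static void main(String[] args) {
        Block target = new Block(null, false);
        List<Block> candidates = new ArrayList<>();
        candidates.add(new Block(target, false));
        candidates.add(new Block(target, true));
        System.out.println(selectWithLoop(candidates, target).size());
        System.out.println(selectWithAsserts(candidates, target).size());
    }
}
```

Both variants accept the same candidates as long as the single-step assumption holds; the second one fails loudly if it does not.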
------------- PR: https://git.openjdk.java.net/jdk/pull/8684 From duke at openjdk.java.net Wed May 25 15:54:59 2022 From: duke at openjdk.java.net (duke) Date: Wed, 25 May 2022 15:54:59 GMT Subject: Withdrawn: 8282470: Eliminate useless sign extension before some subword integer operations In-Reply-To: References: Message-ID: On Fri, 25 Mar 2022 07:43:25 GMT, Fei Gao wrote: > Some loop cases of subword types, including byte and short, can't be vectorized by C2's SLP. Here is an example: > > short[] addShort(short[] a, short[] b, short[] c) { > for (int i = 0; i < SIZE; i++) { > b[i] = (short) (a[i] + 8); // line A > sres[i] = (short) (b[i] + c[i]); // line B > } > } > > However, similar cases of int/float/double/long/char type can be vectorized successfully. > > The reason why SLP can't vectorize the short case above is that, as illustrated here[1], the result of the scalar add operation on *line A* has been promoted to int type. It needs to be narrowed to short type first before it can work as one of source operands of addition on *line B*. The demotion is done by left-shifting 16 bits then right-shifting 16 bits. The ideal graph for the process is showed like below. > ![image](https://user-images.githubusercontent.com/39403138/160074255-c751f84b-6511-4b56-927b-53fb512cf51b.png) > > In SLP, for most short-type cases, we can determine the precise type of the scalar int-type operation and finally execute it with short-type vector operations[2], except rshift opcode and abs in some situations[3]. But in this case, the source operand of RShiftI is from LShiftI rather than from any LoadS[4], so we can't determine its real type and conservatively assign it with int type rather than real short type. The int-type opearation RShiftI here can't be vectorized together with other short-type operations, like AddI(line B). The reason for byte loop cases is the same. 
Similar loop cases of char type could be vectorized because its demotion from int to char is done by `and` with mask rather than `lshift_rshift`. > > Therefore, we try to remove the patterns like `RShiftI _ (LShiftI _ valIn1 conIL ) conIR` in the byte/short cases, to vectorize more scenarios. Optimizing it in the mid-end by i-GVN is more reasonable. > > What we do in the mid-end is eliminating the sign extension before some subword integer operations like: > > > int x, y; > short s = (short) (((x << Imm) >> Imm) OP y); // Imm <= 16 > > to > > short s = (short) (x OP y); > > > In the patch, assuming that `x` can be any int number, we need guarantee that the optimization doesn't have any impact on result. Not all arithmetic logic OPs meet the requirements. For example, assuming that `Imm` equals `16`, `x` equals `131068`, > `y` equals `50` and `OP` is division`/`, `short s = (short) (((131068 << 16) >> 16) / 50)` is not equal to `short s = (short) (131068 / 50)`. When OP is division, we may get different result with or without demotion before OP, because the upper 16 bits of division may have influence on the lower 16 bits of result, which can't be optimized. All optimizable opcodes are listed in StoreNode::no_need_sign_extension(), whose upper 16 bits of src operands don't influence the lower 16 bits of result for short > type and upper 24 bits of src operand don't influence the lower 8 bits of dst operand for byte. > > After the patch, the short loop case above can be vectorized as: > > movi v18.8h, #0x8 > ... > ldr q16, [x14, #32] // vector load a[i] > // vector add, a[i] + 8, no promotion or demotion > add v17.8h, v16.8h, v18.8h > str q17, [x6, #32] // vector store a[i] + 8, b[i] > ldr q17, [x0, #32] // vector load c[i] > // vector add, a[i] + c[i], no promotion or demotion > add v16.8h, v17.8h, v16.8h > // vector add, a[i] + c[i] + 8, no promotion or demotion > add v16.8h, v16.8h, v18.8h > str q16, [x11, #32] //vector store sres[i] > ... 
> > > The patch works for byte cases as well. > > Here is the performance data for micro-benchmark before and after this patch on both AArch64 and x64 machines. We can observe about ~83% improvement with this patch. > > on AArch64: > Before the patch: > Benchmark (length) Mode Cnt Score Error Units > addB 523 avgt 15 401.521 ± 0.033 ns/op > addS 523 avgt 15 401.512 ± 0.021 ns/op > > After the patch: > Benchmark (length) Mode Cnt Score Error Units > addB 523 avgt 15 68.444 ± 0.318 ns/op > addS 523 avgt 15 69.847 ± 0.043 ns/op > > on x86: > Before the patch: > Benchmark (length) Mode Cnt Score Error Units > addB 523 avgt 15 454.102 ± 36.180 ns/op > addS 523 avgt 15 432.245 ± 22.640 ns/op > > After the patch: > Benchmark (length) Mode Cnt Score Error Units > addB 523 avgt 15 75.812 ± 5.063 ns/op > addS 523 avgt 15 72.839 ± 10.109 ns/op > > [1]: https://github.com/openjdk/jdk/blob/6013d09e82693a1c07cf0bf6daffd95114b3cbfa/src/hotspot/share/opto/superword.cpp#L3241 > [2]: https://github.com/openjdk/jdk/blob/6013d09e82693a1c07cf0bf6daffd95114b3cbfa/src/hotspot/share/opto/superword.cpp#L3206 > [3]: https://github.com/openjdk/jdk/blob/6013d09e82693a1c07cf0bf6daffd95114b3cbfa/src/hotspot/share/opto/superword.cpp#L3249 > [4]: https://github.com/openjdk/jdk/blob/6013d09e82693a1c07cf0bf6daffd95114b3cbfa/src/hotspot/share/opto/superword.cpp#L3251 This pull request has been closed without being integrated.
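The division counterexample in the quoted description can be checked with a few lines of plain Java. The sketch below uses the values from the mail (`x = 131068`, `y = 50`, `Imm = 16`); the class and method names are made up for illustration and are not part of the patch. It compares the narrowed result with and without the shift pair for addition, where dropping the sign extension is harmless, and for division, where it is not.

```java
// Hypothetical demo class; names are illustrative only.
class SubwordSignExtensionDemo {
    // The shift pair from the description, with Imm = 16.
    static int signExtend16(int x) {
        return (x << 16) >> 16;
    }

    static short addWithExtension(int x, int y) {
        return (short) (signExtend16(x) + y);
    }

    static short addWithoutExtension(int x, int y) {
        return (short) (x + y);
    }

    static short divWithExtension(int x, int y) {
        return (short) (signExtend16(x) / y);
    }

    static short divWithoutExtension(int x, int y) {
        return (short) (x / y);
    }

    public static void main(String[] args) {
        int x = 131068; // 0x1FFFC, so the low 16 bits are 0xFFFC, i.e. -4 as a short
        int y = 50;
        // Addition: the low 16 bits of the sum depend only on the low 16 bits of x.
        System.out.println(addWithExtension(x, y) + " vs " + addWithoutExtension(x, y));
        // Division: the upper bits of x change the low 16 bits of the quotient.
        System.out.println(divWithExtension(x, y) + " vs " + divWithoutExtension(x, y));
    }
}
```

With these inputs the two additions agree (both give 46), while the divisions give 0 and 2621, which is why division cannot be among the opcodes for which the sign extension is dropped.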
------------- PR: https://git.openjdk.java.net/jdk/pull/7954 From kvn at openjdk.java.net Wed May 25 16:28:27 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Wed, 25 May 2022 16:28:27 GMT Subject: RFR: 8284960: Integration of JEP 426: Vector API (Fourth Incubator) [v9] In-Reply-To: References: Message-ID: <7J9iS711pZ787GWlBS6vmDen-X_YVULXxkR-cPRuYQs=.467502a6-5dd2-40dd-9393-9d685c24e1a1@github.com> On Wed, 25 May 2022 06:29:06 GMT, Jatin Bhateja wrote: >> Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: >> >> 8284960: Review comments resolved. > > Hi @vnkozlov , Your comments have been addressed. @jatin-bhateja something is wrong with the merge. `vpadd()` is removed. It was added by #8778 and is still used in `x86.ad`. ------------- PR: https://git.openjdk.java.net/jdk/pull/8425 From kvn at openjdk.java.net Wed May 25 17:02:28 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Wed, 25 May 2022 17:02:28 GMT Subject: RFR: 8282024: add EscapeAnalysis statistics under PrintOptoStatistics [v12] In-Reply-To: References: Message-ID: On Wed, 25 May 2022 12:24:55 GMT, aamarsh wrote: >> It is up to you but I would suggest to remove your change to collect time since we have it already with CITime. > > @vnkozlov It looks like there are two Windows x64 tests failing and after a re-run they have been stuck on the task for 20 hours. The failure happened at: > > tools/javac/Paths/MineField.sh > tools/javac/Paths/wcMineField.sh > > with the message: > > `STDERR: > Main.java:1: error: cannot find symbol > public class Main {public static void main(String[] a) {Lib.f();}}` > > It does not appear that my code would affect these tests, but I don't want to integrate if the checks are not all passing. @aamarsh Do not worry. I see that other recent PRs have the same failure. Your changes should not affect the product version of the JDK.
------------- PR: https://git.openjdk.java.net/jdk/pull/8019 From kvn at openjdk.java.net Wed May 25 17:24:42 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Wed, 25 May 2022 17:24:42 GMT Subject: RFR: 8286940: [IR Framework] Allow IR tests to build and use Whitebox without -DSkipWhiteBoxInstall=true In-Reply-To: References: Message-ID: On Wed, 25 May 2022 08:17:17 GMT, Christian Hagedorn wrote: > Currently, the IR framework always tries to install the Whitebox by moving the Whitebox class file to the JTreg class path. However, when a test already builds the Whitebox and uses it as part of the test, we cannot access it on certain platforms. On Windows, for example, we'll get the following exception: > > Caused by: java.nio.file.FileSystemException: sun\hotspot\WhiteBox.class: The process cannot access the file because it is being used by another process > > To mitigate this problem, one can specify `-DSkipWhiteBoxInstall=true` which was already done in [JDK-8283187](https://bugs.openjdk.java.net/browse/JDK-8283187). But this is not a good solution as the user should not need to worry about the inner workings of the IR framework. > > I propose to get rid of this flag by reworking the Whitebox installation process. > > Thanks, > Christian Good. ------------- Marked as reviewed by kvn (Reviewer). PR: https://git.openjdk.java.net/jdk/pull/8879 From kvn at openjdk.java.net Wed May 25 17:24:51 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Wed, 25 May 2022 17:24:51 GMT Subject: RFR: 8282470: Eliminate useless sign extension before some subword integer operations [v2] In-Reply-To: References: Message-ID: On Wed, 27 Apr 2022 08:15:21 GMT, Fei Gao wrote: >> Some loop cases of subword types, including byte and short, can't be vectorized by C2's SLP. 
Here is an example: >> >> short[] addShort(short[] a, short[] b, short[] c) { >> for (int i = 0; i < SIZE; i++) { >> b[i] = (short) (a[i] + 8); // line A >> sres[i] = (short) (b[i] + c[i]); // line B >> } >> } >> >> However, similar cases of int/float/double/long/char type can be vectorized successfully. >> >> The reason why SLP can't vectorize the short case above is that, as illustrated here[1], the result of the scalar add operation on *line A* has been promoted to int type. It needs to be narrowed to short type first before it can work as one of source operands of addition on *line B*. The demotion is done by left-shifting 16 bits then right-shifting 16 bits. The ideal graph for the process is showed like below. >> ![image](https://user-images.githubusercontent.com/39403138/160074255-c751f84b-6511-4b56-927b-53fb512cf51b.png) >> >> In SLP, for most short-type cases, we can determine the precise type of the scalar int-type operation and finally execute it with short-type vector operations[2], except rshift opcode and abs in some situations[3]. But in this case, the source operand of RShiftI is from LShiftI rather than from any LoadS[4], so we can't determine its real type and conservatively assign it with int type rather than real short type. The int-type opearation RShiftI here can't be vectorized together with other short-type operations, like AddI(line B). The reason for byte loop cases is the same. Similar loop cases of char type could be vectorized because its demotion from int to char is done by `and` with mask rather than `lshift_rshift`. >> >> Therefore, we try to remove the patterns like `RShiftI _ (LShiftI _ valIn1 conIL ) conIR` in the byte/short cases, to vectorize more scenarios. Optimizing it in the mid-end by i-GVN is more reasonable. 
>> >> What we do in the mid-end is eliminating the sign extension before some subword integer operations like: >> >> >> int x, y; >> short s = (short) (((x << Imm) >> Imm) OP y); // Imm <= 16 >> >> to >> >> short s = (short) (x OP y); >> >> >> In the patch, assuming that `x` can be any int number, we need guarantee that the optimization doesn't have any impact on result. Not all arithmetic logic OPs meet the requirements. For example, assuming that `Imm` equals `16`, `x` equals `131068`, >> `y` equals `50` and `OP` is division`/`, `short s = (short) (((131068 << 16) >> 16) / 50)` is not equal to `short s = (short) (131068 / 50)`. When OP is division, we may get different result with or without demotion before OP, because the upper 16 bits of division may have influence on the lower 16 bits of result, which can't be optimized. All optimizable opcodes are listed in StoreNode::no_need_sign_extension(), whose upper 16 bits of src operands don't influence the lower 16 bits of result for short >> type and upper 24 bits of src operand don't influence the lower 8 bits of dst operand for byte. >> >> After the patch, the short loop case above can be vectorized as: >> >> movi v18.8h, #0x8 >> ... >> ldr q16, [x14, #32] // vector load a[i] >> // vector add, a[i] + 8, no promotion or demotion >> add v17.8h, v16.8h, v18.8h >> str q17, [x6, #32] // vector store a[i] + 8, b[i] >> ldr q17, [x0, #32] // vector load c[i] >> // vector add, a[i] + c[i], no promotion or demotion >> add v16.8h, v17.8h, v16.8h >> // vector add, a[i] + c[i] + 8, no promotion or demotion >> add v16.8h, v16.8h, v18.8h >> str q16, [x11, #32] //vector store sres[i] >> ... >> >> >> The patch works for byte cases as well. >> >> Here is the performance data for micro-benchmark before and after this patch on both AArch64 and x64 machines. We can observe about ~83% improvement with this patch. >> >> on AArch64: >> Before the patch: >> Benchmark (length) Mode Cnt Score Error Units >> addB 523 avgt 15 401.521 ±
0.033 ns/op >> addS 523 avgt 15 401.512 ± 0.021 ns/op >> >> After the patch: >> Benchmark (length) Mode Cnt Score Error Units >> addB 523 avgt 15 68.444 ± 0.318 ns/op >> addS 523 avgt 15 69.847 ± 0.043 ns/op >> >> on x86: >> Before the patch: >> Benchmark (length) Mode Cnt Score Error Units >> addB 523 avgt 15 454.102 ± 36.180 ns/op >> addS 523 avgt 15 432.245 ± 22.640 ns/op >> >> After the patch: >> Benchmark (length) Mode Cnt Score Error Units >> addB 523 avgt 15 75.812 ± 5.063 ns/op >> addS 523 avgt 15 72.839 ± 10.109 ns/op >> >> [1]: https://github.com/openjdk/jdk/blob/6013d09e82693a1c07cf0bf6daffd95114b3cbfa/src/hotspot/share/opto/superword.cpp#L3241 >> [2]: https://github.com/openjdk/jdk/blob/6013d09e82693a1c07cf0bf6daffd95114b3cbfa/src/hotspot/share/opto/superword.cpp#L3206 >> [3]: https://github.com/openjdk/jdk/blob/6013d09e82693a1c07cf0bf6daffd95114b3cbfa/src/hotspot/share/opto/superword.cpp#L3249 >> [4]: https://github.com/openjdk/jdk/blob/6013d09e82693a1c07cf0bf6daffd95114b3cbfa/src/hotspot/share/opto/superword.cpp#L3251 > > Fei Gao has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains two additional commits since the last revision: > > - Merge branch 'master' into fg8282470 > > Change-Id: I877ba1e9a82c0dbef04df08070223c02400eeec7 > - 8282470: Eliminate useless sign extension before some subword integer operations > > Some loop cases of subword types, including byte and > short, can't be vectorized by C2's SLP. Here is an example: > ``` > short[] addShort(short[] a, short[] b, short[] c) { > for (int i = 0; i < SIZE; i++) { > b[i] = (short) (a[i] + 8); // *line A* > sres[i] = (short) (b[i] + c[i]); // *line B* > } > } > ``` > However, similar cases of int/float/double/long/char type can > be vectorized successfully.
> > The reason why SLP can't vectorize the short case above is > that, as illustrated here[1], the result of the scalar add > operation on *line A* has been promoted to int type. It needs > to be narrowed to short type first before it can work as one > of source operands of addition on *line B*. The demotion is > done by left-shifting 16 bits then right-shifting 16 bits. > The ideal graph for the process is showed like below. > > LoadS a[i] 8 > \ / > AddI (line A) > / \ > StoreC b[i] Lshift 16bits > \ > RShiftI 16 bits LoadS c[i] > \ / > AddI (line B) > \ > StoreC sres[i] > > In SLP, for most short-type cases, we can determine the precise > type of the scalar int-type operation and finally execute it > with short-type vector operations[2], except rshift opcode and > abs in some situations[3]. But in this case, the source operand > of RShiftI is from LShiftI rather than from any LoadS[4], so we > can't determine its real type and conservatively assign it with > int type rather than real short type. The int-type opearation > RShiftI here can't be vectorized together with other short-type > operations, like AddI(line B). The reason for byte loop cases > is the same. Similar loop cases of char type could be > vectorized because its demotion from int to char is done by > `and` with mask rather than `lshift_rshift`. > > Therefore, we try to remove the patterns like > `RShiftI _ (LShiftI _ valIn1 conIL ) conIR` in the byte/short > cases, to vectorize more scenarios. Optimizing it in the > mid-end by i-GVN is more reasonable. > > What we do in the mid-end is eliminating the sign extension > before some subword integer operations like: > > ``` > int x, y; > short s = (short) (((x << Imm) >> Imm) OP y); // Imm <= 16 > ``` > to > ``` > short s = (short) (x OP y); > ``` > > In the patch, assuming that `x` can be any int number, we need > guarantee that the optimization doesn't have any impact on > result. Not all arithmetic logic OPs meet the requirements. 
For > example, assuming that `Imm` equals `16`, `x` equals `131068`, > `y` equals `50` and `OP` is division`/`, > `short s = (short) (((131068 << 16) >> 16) / 50)` is not > equal to `short s = (short) (131068 / 50)`. When OP is division, > we may get different result with or without demotion > before OP, because the upper 16 bits of division may have > influence on the lower 16 bits of result, which can't be > optimized. All optimizable opcodes are listed in > StoreNode::no_need_sign_extension(), whose upper 16 bits of src > operands don't influence the lower 16 bits of result for short > type and upper 24 bits of src operand don't influence the lower > 8 bits of dst operand for byte. > > After the patch, the short loop case above can be vectorized as: > ``` > movi v18.8h, #0x8 > ... > ldr q16, [x14, #32] // vector load a[i] > // vector add, a[i] + 8, no promotion or demotion > add v17.8h, v16.8h, v18.8h > str q17, [x6, #32] // vector store a[i] + 8, b[i] > ldr q17, [x0, #32] // vector load c[i] > // vector add, a[i] + c[i], no promotion or demotion > add v16.8h, v17.8h, v16.8h > // vector add, a[i] + c[i] + 8, no promotion or demotion > add v16.8h, v16.8h, v18.8h > str q16, [x11, #32] //vector store sres[i] > ... > ``` > > The patch works for byte cases as well. > > Here is the performance data for micro-benchmark before > and after this patch on both AArch64 and x64 machines. > We can observe about ~83% improvement with this patch. > > on AArch64: > Before the patch: > Benchmark (length) Mode Cnt Score Error Units > addB 523 avgt 15 401.521 ± 0.033 ns/op > addS 523 avgt 15 401.512 ± 0.021 ns/op > > After the patch: > Benchmark (length) Mode Cnt Score Error Units > addB 523 avgt 15 68.444 ± 0.318 ns/op > addS 523 avgt 15 69.847 ± 0.043 ns/op > > on x86: > Before the patch: > Benchmark (length) Mode Cnt Score Error Units > addB 523 avgt 15 454.102 ± 36.180 ns/op > addS 523 avgt 15 432.245 ±
22.640 ns/op > > After the patch: > Benchmark (length) Mode Cnt Score Error Units > addB 523 avgt 15 75.812 ± 5.063 ns/op > addS 523 avgt 15 72.839 ± 10.109 ns/op > > [1]: https://github.com/openjdk/jdk/blob/6013d09e82693a1c07cf0bf6daffd95114b3cbfa/src/hotspot/share/opto/superword.cpp#L3241 > [2]: https://github.com/openjdk/jdk/blob/6013d09e82693a1c07cf0bf6daffd95114b3cbfa/src/hotspot/share/opto/superword.cpp#L3206 > [3]: https://github.com/openjdk/jdk/blob/6013d09e82693a1c07cf0bf6daffd95114b3cbfa/src/hotspot/share/opto/superword.cpp#L3249 > [4]: https://github.com/openjdk/jdk/blob/6013d09e82693a1c07cf0bf6daffd95114b3cbfa/src/hotspot/share/opto/superword.cpp#L3251 > > Change-Id: I92ce42b550ef057964a3b58716436735275d8d31 @fg1417 Thank you for proposing these changes. I hope you still want to proceed with it. ------------- PR: https://git.openjdk.java.net/jdk/pull/7954 From duke at openjdk.java.net Wed May 25 17:32:46 2022 From: duke at openjdk.java.net (Brian J. Stafford) Date: Wed, 25 May 2022 17:32:46 GMT Subject: Integrated: 8263075: C2: simplify anti-dependence check in PhaseCFG::implicit_null_check() In-Reply-To: References: Message-ID: On Thu, 12 May 2022 16:48:40 GMT, Brian J. Stafford wrote: > The reporter for this issue (https://bugs.openjdk.java.net/browse/JDK-8263075) indicated that there's an assumption that we can rely on that the while loop in question will run exactly one time. Based on this, I've done the following: > > - Asserted the condition that makes sure the code runs at least once > - Asserted the condition that makes sure the code runs only once > - Removed the `while` loop > - Changed a couple of `break` statements into `continue` statements. They no longer need to break out of the `while` loop, now that it's gone. However, they were early exits from the `while` loop that ended up resulting in `continue` statements for the larger enclosing loop. Thus we can just call `continue` directly.
> - Removed the local variable `b`, as we no longer need to traverse the node hierarchy. We can use `mb` directly. > > Passes jdk, langtools, and hotspot Tier 1 tests on Linux (x64 and ARM64) and macOS (x64 and ARM64). Most Tier 1 tests pass on Windows (x64 and ARM64), but there are a handful of failures unrelated to this change. This pull request has now been integrated. Changeset: c6743489 Author: Brian J. Stafford Committer: Vladimir Kozlov URL: https://git.openjdk.java.net/jdk/commit/c6743489d2fb65f3fe05b403ae66ac30e6aa4846 Stats: 21 lines in 1 file changed: 2 ins; 0 del; 19 mod 8263075: C2: simplify anti-dependence check in PhaseCFG::implicit_null_check() Reviewed-by: kvn, thartmann, rcastanedalo ------------- PR: https://git.openjdk.java.net/jdk/pull/8684 From psandoz at openjdk.java.net Wed May 25 17:49:19 2022 From: psandoz at openjdk.java.net (Paul Sandoz) Date: Wed, 25 May 2022 17:49:19 GMT Subject: RFR: 8283894: Intrinsify compress and expand bits on x86 [v7] In-Reply-To: References: Message-ID: <3IxIo48KHW-MPCs1br-dyl9Pelp0s9yyNcmdELlaEII=.57ce264e-4c61-4ca1-8951-2e5006194b47@github.com> On Wed, 25 May 2022 06:20:43 GMT, Jatin Bhateja wrote: >> Summary of changes: >> >> - Patch intrinsifies following newly added Java SE APIs >> - Integer.compress >> - Integer.expand >> - Long.compress >> - Long.expand >> >> - Adds C2 IR nodes and corresponding ideal transformations for new operations. >> - We see around ~10x performance speedup due to intrinsification over X86 target. >> - Adds an IR framework based test to validate newly introduced IR transformations. >> >> Kindly review and share your feedback. >> >> Best Regards, >> Jatin > > Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision: > > 8283894: Disabling sanity test as per review suggestion. 
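As background for the compress and expand operations named in the quoted summary, here is a bit-by-bit reference sketch in Java. It is an assumption-laden illustration: `compressRef` and `expandRef` are hypothetical helper names, they deliberately do not call the JDK methods under review, and they are not taken from the patch or from `CompressExpandSanityTest`. They follow the usual definition: compress gathers the bits of the value selected by the mask into the low-order bits, and expand scatters the low-order bits back to the mask positions.

```java
// Hypothetical reference helpers; not the intrinsic or the JDK implementation.
class CompressExpandSketch {
    // Gather the bits of i at the positions set in mask into the low-order bits.
    static int compressRef(int i, int mask) {
        int result = 0;
        int out = 0; // next free bit position in the result
        for (int bit = 0; bit < 32; bit++) {
            if (((mask >>> bit) & 1) != 0) {
                result |= ((i >>> bit) & 1) << out;
                out++;
            }
        }
        return result;
    }

    // Scatter the low-order bits of i to the positions set in mask.
    static int expandRef(int i, int mask) {
        int result = 0;
        int in = 0; // next low-order bit of i to consume
        for (int bit = 0; bit < 32; bit++) {
            if (((mask >>> bit) & 1) != 0) {
                result |= ((i >>> in) & 1) << bit;
                in++;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        int mask = 0xFF00FFF0;
        int packed = compressRef(0xCAFEBABE, mask);
        System.out.println(Integer.toHexString(packed));
        System.out.println(Integer.toHexString(expandRef(packed, mask)));
    }
}
```

A loop like this is slow, which is the motivation for intrinsifying the operations, but it is easy to compare an optimized result against.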
test/jdk/java/lang/CompressExpandSanityTest.java line 29: > 27: * //@key randomness > 28: * //@run testng/othervm -XX:+UnlockDiagnosticVMOptions -XX:DisableIntrinsic=_expand_i,_expand_l,_compress_i,_compress_l CompressExpandSanityTest > 29: * //@run testng CompressExpandSanityTest Suggestion: // Disabled by default // @test /* * @summary Test compress expand as if the test methods are the implementation methods * @key randomness * @run testng/othervm -XX:+UnlockDiagnosticVMOptions -XX:DisableIntrinsic=_expand_i,_expand_l,_compress_i,_compress_l CompressExpandSanityTest * @run testng CompressExpandSanityTest */ I think we need to do it as above; otherwise it will induce a test error. ------------- PR: https://git.openjdk.java.net/jdk/pull/8498 From xliu at openjdk.java.net Wed May 25 19:31:00 2022 From: xliu at openjdk.java.net (Xin Liu) Date: Wed, 25 May 2022 19:31:00 GMT Subject: RFR: 8286104: use aggressive liveness for unstable_if traps [v4] In-Reply-To: References: Message-ID: <7xo0NJ9ltG3wShCImn_fTnRei0zTluWIJiHDenrMhuE=.5d9da4cc-550d-4531-8e8e-730018ef0879@github.com> > I found that some phi nodes are useful because their uses are uncommon_trap nodes. In the worst case, this hinders boxing-object elimination and scalar replacement. Test.java of JDK-8286104 reveals this issue. This patch allows the C2 parser to collect liveness based on the next bci for unstable_if traps. In most cases, the next bci is closer to exits, so fewer locals are live. This helps to reduce the number of inputs of unstable_if traps. > > This is not a REDO of Optimization of Box nodes in uncommon_trap (JDK-8261137). The two patches are orthogonal. I adapted the test from [TestEliminateBoxInDebugInfo.java](https://github.com/openjdk/jdk/pull/2401/files#diff-49b2e38825aa4c28ca196bdc70c3cbecc2e835c2899f4f393527df4796b177ea), so part of the credit goes to the original author.
I found that scalar replacement can take care of the object `Integer ii = Integer.valueOf(value)` in the original test even if it can't be removed by the later inliner. I tweaked the profiling data of `Integer.valueOf()` to hinder scalar replacement. > > This patch can cover the problem discovered by JDK-8261137. I ran the microbenchmark and got a 9x speedup on x86_64. That is almost the same as JDK-8261137. > > Before: > > Benchmark Mode Cnt Score Error Units > MyBenchmark.testMethod avgt 10 32.776 ± 0.075 us/op > > After: > > Benchmark Mode Cnt Score Error Units > MyBenchmark.testMethod avgt 10 3.656 ± 0.006 us/op > > Testing > I ran the tier1 tests with and without `-XX:+DeoptimizeALot`. No regression has been found yet. Xin Liu has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains 17 additional commits since the last revision: - Merge branch 'master' into JDK-8286104 - reimplement process_unstable_ifs - remember UnstableIfTrap in parser. Also add a statistical counter for trivial counter - Merge branch 'master' into JDK-8286104 - bail out a corner case that ifnode postpones fold-compares after loop optimization. - revert code change from 1st revision. - Merge branch 'JDK-8276998' into JDK-8286104 - rule out if a If nodes has 2 branches of unstable_if trap. - change the flag to diagnostic. - add sanity check for operands if bc is if_acmp_eq/ne and ifnull/nonnull - ...
and 7 more: https://git.openjdk.java.net/jdk/compare/ffb7cc59...849cdec3 ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/8545/files - new: https://git.openjdk.java.net/jdk/pull/8545/files/3eae8c5e..849cdec3 Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=8545&range=03 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=8545&range=02-03 Stats: 36442 lines in 602 files changed: 8041 ins; 27017 del; 1384 mod Patch: https://git.openjdk.java.net/jdk/pull/8545.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8545/head:pull/8545 PR: https://git.openjdk.java.net/jdk/pull/8545 From vlivanov at openjdk.java.net Wed May 25 22:34:21 2022 From: vlivanov at openjdk.java.net (Vladimir Ivanov) Date: Wed, 25 May 2022 22:34:21 GMT Subject: RFR: 8287223: C1: Inlining attempt through MH::invokeBasic() with null receiver Message-ID: Inlining attempt through `MH::invokeBasic()` when receiver is null. It triggers an assert when attempting to extract a `Method*` from a null constant. Proposed fix bails out inlining attempt when receiver is null constant. C2 has a similar issue, but the particular bytecode shape of `MH::invokeExact()` invoker hides the bug (dominating `MH::type()` call involves a null check and problematic call isn't compiled at all). 
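As a purely illustrative aside (this is not the regression test added by the PR), the Java-level shape of a null-receiver method handle call can be sketched as below. The class `NullReceiverDemo` and its members are hypothetical names. The sketch only shows the required Java semantics, namely that the call must raise `NullPointerException`; whether this exact shape reaches the C1 inlining path described above depends on JIT heuristics and is not claimed here.

```java
import java.lang.invoke.MethodHandle;

final class NullReceiverDemo {
    // Hypothetical constant: a MethodHandle that is always null,
    // i.e. the "null receiver" situation from the description.
    static final MethodHandle NULL_MH = null;

    // Calls through the null handle and reports whether the JVM
    // raised NullPointerException, as Java semantics require.
    static boolean callThrowsNpe() {
        try {
            NULL_MH.invokeExact();
            return false; // not expected: the call cannot succeed
        } catch (NullPointerException expected) {
            return true;
        } catch (Throwable other) {
            return false; // any other failure would be surprising
        }
    }

    public static void main(String[] args) {
        // Repeat the call so that a JIT compiler gets a chance to
        // compile the call site; the observable result must not change.
        boolean alwaysNpe = true;
        for (int i = 0; i < 20_000; i++) {
            alwaysNpe &= callThrowsNpe();
        }
        System.out.println("always NPE: " + alwaysNpe);
    }
}
```

The point of the loop is only that compiled and interpreted execution have to agree; a compiler that trips over the null constant while trying to inline violates that.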
Testing: hs-tier1 - hs-tier4 ------------- Commit messages: - Whitespace - 8287223: C1: Inlining attempt through MH::invokeBasic() with null receiver Changes: https://git.openjdk.java.net/jdk/pull/8894/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=8894&range=00 Issue: https://bugs.openjdk.java.net/browse/JDK-8287223 Stats: 101 lines in 3 files changed: 77 ins; 8 del; 16 mod Patch: https://git.openjdk.java.net/jdk/pull/8894.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8894/head:pull/8894 PR: https://git.openjdk.java.net/jdk/pull/8894 From kvn at openjdk.java.net Wed May 25 23:04:35 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Wed, 25 May 2022 23:04:35 GMT Subject: RFR: 8287223: C1: Inlining attempt through MH::invokeBasic() with null receiver In-Reply-To: References: Message-ID: On Wed, 25 May 2022 21:58:01 GMT, Vladimir Ivanov wrote: > Inlining attempt through `MH::invokeBasic()` when receiver is null. It triggers an assert when attempting to extract a `Method*` from a null constant. > > Proposed fix bails out inlining attempt when receiver is null constant. > > C2 has a similar issue, but the particular bytecode shape of `MH::invokeExact()` invoker hides the bug (dominating `MH::type()` call involves a null check and problematic call isn't compiled at all). > > Testing: hs-tier1 - hs-tier4 src/hotspot/share/opto/callGenerator.cpp line 1021: > 1019: if (receiver->Opcode() == Op_ConP) { > 1020: input_not_const = false; > 1021: ciObject* recv_obj = receiver->bottom_type()->is_oopptr()->const_oop(); In general ConP could be `NULL_PTR` (AnyPtr). I think you need to use `isa_oopptr()` and check for `NULL` here. Or it can't be `NULL_PTR` in this case? 
------------- PR: https://git.openjdk.java.net/jdk/pull/8894 From dholmes at openjdk.java.net Thu May 26 02:52:37 2022 From: dholmes at openjdk.java.net (David Holmes) Date: Thu, 26 May 2022 02:52:37 GMT Subject: RFR: JDK-8287288: Fix some typos in C1 In-Reply-To: References: Message-ID: On Wed, 25 May 2022 09:11:23 GMT, Zhuojun Miao wrote: > This is a trivial patch to fix some typos in C1. > e.g. https://github.com/openjdk/jdk/blob/master/src/hotspot/share/c1/c1_LIRGenerator.cpp#L1190 Adding hotspot-compiler-dev to see if anyone there would like to chime in. ------------- PR: https://git.openjdk.java.net/jdk/pull/8880 From dholmes at openjdk.java.net Thu May 26 02:52:40 2022 From: dholmes at openjdk.java.net (David Holmes) Date: Thu, 26 May 2022 02:52:40 GMT Subject: RFR: JDK-8287288: Fix some typos in C1 In-Reply-To: <4aStkDt-jXraLpwGg7SrudAS_S7WDQq0Pz62dRt9MEo=.38b77871-4a77-4af0-ac09-f44df6ff475f@github.com> References: <4aStkDt-jXraLpwGg7SrudAS_S7WDQq0Pz62dRt9MEo=.38b77871-4a77-4af0-ac09-f44df6ff475f@github.com> Message-ID: On Thu, 26 May 2022 02:15:08 GMT, Zhuojun Miao wrote: >> src/hotspot/share/c1/c1_LIR.hpp line 203: >> >>> 201: // data opr-type opr-kind >>> 202: // +-----------+----------+-------+ >>> 203: // [max........|6 5 4 3|2 1 0] >> >> I don't think your change is right. The code below indicates there are 3 bits for kind. So I think the original should actually look like this: >> >> >> // data opr-type opr-kind >> // +--------------+-------+--------+--+ >> // [max...........|7 6 5 4| 3 2 1 | 0] > > According to the definitions of `Opr_kind` and `kind_mask` below, it can be seen that `opr-kind` uses the lowest 3 bits, and if the lowest bit is 0, it means that this is a pointer. I was reading: enum OprBits { pointer_bits = 1 , kind_bits = 3 , type_bits = 4 , size_bits = 2 , destroys_bits = 1 as a bitfield description so 1 pointer bit followed by 3 kind bits, followed by 4 type bits etc. But I agree the later mask value only does a shift of 3 not 4. 
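The two readings of the layout discussed in this thread can be compared with a small Java sketch. It is not the HotSpot code: the field widths are the ones quoted from `enum OprBits` above, while the class `OprBitsSketch`, its method names, and the `separatePointerBit` switch are assumptions made only for illustration.

```java
final class OprBitsSketch {
    // Field widths as quoted from enum OprBits in the thread.
    static final int POINTER_BITS = 1;
    static final int KIND_BITS = 3;
    static final int TYPE_BITS = 4;

    // Builds a mask of `bits` ones starting at bit position `shift`.
    static int mask(int bits, int shift) {
        return ((1 << bits) - 1) << shift;
    }

    // Reading 1 (separatePointerBit == false): kind starts at bit 0 and
    // the pointer bit is simply the lowest kind bit.
    // Reading 2 (separatePointerBit == true): one pointer bit comes first
    // and the kind field follows it.
    static int kindShift(boolean separatePointerBit) {
        return separatePointerBit ? POINTER_BITS : 0;
    }

    // The type field directly follows the kind field in both readings.
    static int typeShift(boolean separatePointerBit) {
        return kindShift(separatePointerBit) + KIND_BITS;
    }

    static int kindMask(boolean separatePointerBit) {
        return mask(KIND_BITS, kindShift(separatePointerBit));
    }

    static int typeMask(boolean separatePointerBit) {
        return mask(TYPE_BITS, typeShift(separatePointerBit));
    }

    public static void main(String[] args) {
        for (boolean separate : new boolean[] {false, true}) {
            System.out.printf("separate pointer bit=%b: kind mask=0x%x, type mask=0x%x%n",
                    separate, kindMask(separate), typeMask(separate));
        }
    }
}
```

With the overlapping reading the type field occupies bits 6..3, which matches the `[max........|6 5 4 3|2 1 0]` diagram; with a separate pointer bit it would occupy bits 7..4, which matches the alternative diagram and would require a shift of 4 rather than 3.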
Also the code indicates there are a lot more non-data bits before we get to the data so the diagram is incomplete in other ways. ------------- PR: https://git.openjdk.java.net/jdk/pull/8880 From zmiao at openjdk.java.net Thu May 26 03:39:41 2022 From: zmiao at openjdk.java.net (Zhuojun Miao) Date: Thu, 26 May 2022 03:39:41 GMT Subject: RFR: JDK-8287288: Fix some typos in C1 In-Reply-To: References: <4aStkDt-jXraLpwGg7SrudAS_S7WDQq0Pz62dRt9MEo=.38b77871-4a77-4af0-ac09-f44df6ff475f@github.com> Message-ID: On Thu, 26 May 2022 02:47:54 GMT, David Holmes wrote: >> According to the definitions of `Opr_kind` and `kind_mask` below, it can be seen that `opr-kind` uses the lowest 3 bits, and if the lowest bit is 0, it means that this is a pointer. > > I was reading: > > enum OprBits { > pointer_bits = 1 > , kind_bits = 3 > , type_bits = 4 > , size_bits = 2 > , destroys_bits = 1 > > as a bitfield description so 1 pointer bit followed by 3 kind bits, followed by 4 type bits etc. But I agree the later mask value only does a shift of 3 not 4. > > Also the code indicates there are a lot more non-data bits before we get to the data so the diagram is incomplete in other ways. I don't think `pointer_bits` should be added to `non_data_bits`: , non_data_bits = pointer_bits + kind_bits + type_bits + size_bits + destroys_bits + virtual_bits + is_xmm_bits + last_use_bits + is_fpu_stack_offset_bits , data_bits = BitsPerInt - non_data_bits , reg_bits = data_bits / 2 // for two registers in one value encoding ------------- PR: https://git.openjdk.java.net/jdk/pull/8880 From fgao at openjdk.java.net Thu May 26 06:18:33 2022 From: fgao at openjdk.java.net (Fei Gao) Date: Thu, 26 May 2022 06:18:33 GMT Subject: RFR: 8282470: Eliminate useless sign extension before some subword integer operations [v3] In-Reply-To: References: Message-ID: > Some loop cases of subword types, including byte and short, can't be vectorized by C2's SLP. 
Here is an example: > > short[] addShort(short[] a, short[] b, short[] c) { > for (int i = 0; i < SIZE; i++) { > b[i] = (short) (a[i] + 8); // line A > sres[i] = (short) (b[i] + c[i]); // line B > } > } > > However, similar cases of int/float/double/long/char type can be vectorized successfully. > > The reason why SLP can't vectorize the short case above is that, as illustrated here[1], the result of the scalar add operation on *line A* has been promoted to int type. It needs to be narrowed to short type before it can serve as one of the source operands of the addition on *line B*. The demotion is done by left-shifting by 16 bits and then right-shifting by 16 bits. The ideal graph for the process is shown below. > ![image](https://user-images.githubusercontent.com/39403138/160074255-c751f84b-6511-4b56-927b-53fb512cf51b.png) > > In SLP, for most short-type cases, we can determine the precise type of the scalar int-type operation and finally execute it with short-type vector operations[2], except for the rshift opcode and abs in some situations[3]. But in this case, the source operand of RShiftI is from LShiftI rather than from any LoadS[4], so we can't determine its real type and conservatively assign it int type rather than its real short type. The int-type operation RShiftI here can't be vectorized together with other short-type operations, like AddI (line B). The reason for byte loop cases is the same. Similar loop cases of char type can be vectorized because their demotion from int to char is done by `and` with a mask rather than `lshift_rshift`. > > Therefore, we try to remove patterns like `RShiftI _ (LShiftI _ valIn1 conIL ) conIR` in the byte/short cases, to vectorize more scenarios. Optimizing it in the mid-end by i-GVN is more reasonable.
> > What we do in the mid-end is to eliminate the sign extension before some subword integer operations like: > > > int x, y; > short s = (short) (((x << Imm) >> Imm) OP y); // Imm <= 16 > > to > > short s = (short) (x OP y); > > > In the patch, assuming that `x` can be any int number, we need to guarantee that the optimization doesn't have any impact on the result. Not all arithmetic and logical OPs meet the requirements. For example, assuming that `Imm` equals `16`, `x` equals `131068`, > `y` equals `50` and `OP` is division (`/`), `short s = (short) (((131068 << 16) >> 16) / 50)` is not equal to `short s = (short) (131068 / 50)`. When OP is division, we may get a different result with or without demotion before OP, because the upper 16 bits of the operands can influence the lower 16 bits of the result, so division can't be optimized. All optimizable opcodes are listed in StoreNode::no_need_sign_extension(); for these, the upper 16 bits of the source operands don't influence the lower 16 bits of the result for short > type, and the upper 24 bits of the source operands don't influence the lower 8 bits of the result for byte. > > After the patch, the short loop case above can be vectorized as: > > movi v18.8h, #0x8 > ... > ldr q16, [x14, #32] // vector load a[i] > // vector add, a[i] + 8, no promotion or demotion > add v17.8h, v16.8h, v18.8h > str q17, [x6, #32] // vector store a[i] + 8, b[i] > ldr q17, [x0, #32] // vector load c[i] > // vector add, a[i] + c[i], no promotion or demotion > add v16.8h, v17.8h, v16.8h > // vector add, a[i] + c[i] + 8, no promotion or demotion > add v16.8h, v16.8h, v18.8h > str q16, [x11, #32] //vector store sres[i] > ... > > > The patch works for byte cases as well. > > Here is the performance data for the micro-benchmark before and after this patch on both AArch64 and x64 machines. We can observe about 83% improvement with this patch. > > on AArch64: > Before the patch: > Benchmark (length) Mode Cnt Score Error Units > addB 523 avgt 15 401.521 ± 0.033 ns/op > addS 523 avgt 15 401.512 ±
0.021 ns/op > > After the patch: > Benchmark (length) Mode Cnt Score Error Units > addB 523 avgt 15 68.444 ± 0.318 ns/op > addS 523 avgt 15 69.847 ± 0.043 ns/op > > on x86: > Before the patch: > Benchmark (length) Mode Cnt Score Error Units > addB 523 avgt 15 454.102 ± 36.180 ns/op > addS 523 avgt 15 432.245 ± 22.640 ns/op > > After the patch: > Benchmark (length) Mode Cnt Score Error Units > addB 523 avgt 15 75.812 ± 5.063 ns/op > addS 523 avgt 15 72.839 ± 10.109 ns/op > > [1]: https://github.com/openjdk/jdk/blob/6013d09e82693a1c07cf0bf6daffd95114b3cbfa/src/hotspot/share/opto/superword.cpp#L3241 > [2]: https://github.com/openjdk/jdk/blob/6013d09e82693a1c07cf0bf6daffd95114b3cbfa/src/hotspot/share/opto/superword.cpp#L3206 > [3]: https://github.com/openjdk/jdk/blob/6013d09e82693a1c07cf0bf6daffd95114b3cbfa/src/hotspot/share/opto/superword.cpp#L3249 > [4]: https://github.com/openjdk/jdk/blob/6013d09e82693a1c07cf0bf6daffd95114b3cbfa/src/hotspot/share/opto/superword.cpp#L3251 Fei Gao has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains three additional commits since the last revision: - Merge branch 'master' into fg8282470 Change-Id: I180f1c85bd407b3d7e05937450c5fc0f81e6d70b - Merge branch 'master' into fg8282470 Change-Id: I877ba1e9a82c0dbef04df08070223c02400eeec7 - 8282470: Eliminate useless sign extension before some subword integer operations Some loop cases of subword types, including byte and short, can't be vectorized by C2's SLP. Here is an example: ``` short[] addShort(short[] a, short[] b, short[] c) { for (int i = 0; i < SIZE; i++) { b[i] = (short) (a[i] + 8); // *line A* sres[i] = (short) (b[i] + c[i]); // *line B* } } ``` However, similar cases of int/float/double/long/char type can be vectorized successfully.
The reason why SLP can't vectorize the short case above is that, as illustrated here[1], the result of the scalar add operation on *line A* has been promoted to int type. It needs to be narrowed to short type before it can serve as one of the source operands of the addition on *line B*. The demotion is done by left-shifting by 16 bits and then right-shifting by 16 bits. The ideal graph for the process is shown below. LoadS a[i] 8 \ / AddI (line A) / \ StoreC b[i] Lshift 16bits \ RShiftI 16 bits LoadS c[i] \ / AddI (line B) \ StoreC sres[i] In SLP, for most short-type cases, we can determine the precise type of the scalar int-type operation and finally execute it with short-type vector operations[2], except for the rshift opcode and abs in some situations[3]. But in this case, the source operand of RShiftI is from LShiftI rather than from any LoadS[4], so we can't determine its real type and conservatively assign it int type rather than its real short type. The int-type operation RShiftI here can't be vectorized together with other short-type operations, like AddI (line B). The reason for byte loop cases is the same. Similar loop cases of char type can be vectorized because their demotion from int to char is done by `and` with a mask rather than `lshift_rshift`. Therefore, we try to remove patterns like `RShiftI _ (LShiftI _ valIn1 conIL ) conIR` in the byte/short cases, to vectorize more scenarios. Optimizing it in the mid-end by i-GVN is more reasonable. What we do in the mid-end is to eliminate the sign extension before some subword integer operations like: ``` int x, y; short s = (short) (((x << Imm) >> Imm) OP y); // Imm <= 16 ``` to ``` short s = (short) (x OP y); ``` In the patch, assuming that `x` can be any int number, we need to guarantee that the optimization doesn't have any impact on the result. Not all arithmetic and logical OPs meet the requirements.
For example, assuming that `Imm` equals `16`, `x` equals `131068`, `y` equals `50` and `OP` is division (`/`), `short s = (short) (((131068 << 16) >> 16) / 50)` is not equal to `short s = (short) (131068 / 50)`. When OP is division, we may get a different result with or without demotion before OP, because the upper 16 bits of the operands can influence the lower 16 bits of the result, so division can't be optimized. All optimizable opcodes are listed in StoreNode::no_need_sign_extension(); for these, the upper 16 bits of the source operands don't influence the lower 16 bits of the result for short type, and the upper 24 bits of the source operands don't influence the lower 8 bits of the result for byte. After the patch, the short loop case above can be vectorized as: ``` movi v18.8h, #0x8 ... ldr q16, [x14, #32] // vector load a[i] // vector add, a[i] + 8, no promotion or demotion add v17.8h, v16.8h, v18.8h str q17, [x6, #32] // vector store a[i] + 8, b[i] ldr q17, [x0, #32] // vector load c[i] // vector add, a[i] + c[i], no promotion or demotion add v16.8h, v17.8h, v16.8h // vector add, a[i] + c[i] + 8, no promotion or demotion add v16.8h, v16.8h, v18.8h str q16, [x11, #32] //vector store sres[i] ... ``` The patch works for byte cases as well. Here is the performance data for the micro-benchmark before and after this patch on both AArch64 and x64 machines. We can observe about 83% improvement with this patch. on AArch64: Before the patch: Benchmark (length) Mode Cnt Score Error Units addB 523 avgt 15 401.521 ± 0.033 ns/op addS 523 avgt 15 401.512 ± 0.021 ns/op After the patch: Benchmark (length) Mode Cnt Score Error Units addB 523 avgt 15 68.444 ± 0.318 ns/op addS 523 avgt 15 69.847 ± 0.043 ns/op on x86: Before the patch: Benchmark (length) Mode Cnt Score Error Units addB 523 avgt 15 454.102 ± 36.180 ns/op addS 523 avgt 15 432.245 ± 22.640 ns/op After the patch: Benchmark (length) Mode Cnt Score Error Units addB 523 avgt 15 75.812 ± 5.063 ns/op addS 523 avgt 15 72.839 ±
10.109 ns/op [1]: https://github.com/openjdk/jdk/blob/6013d09e82693a1c07cf0bf6daffd95114b3cbfa/src/hotspot/share/opto/superword.cpp#L3241 [2]: https://github.com/openjdk/jdk/blob/6013d09e82693a1c07cf0bf6daffd95114b3cbfa/src/hotspot/share/opto/superword.cpp#L3206 [3]: https://github.com/openjdk/jdk/blob/6013d09e82693a1c07cf0bf6daffd95114b3cbfa/src/hotspot/share/opto/superword.cpp#L3249 [4]: https://github.com/openjdk/jdk/blob/6013d09e82693a1c07cf0bf6daffd95114b3cbfa/src/hotspot/share/opto/superword.cpp#L3251 Change-Id: I92ce42b550ef057964a3b58716436735275d8d31 ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/7954/files - new: https://git.openjdk.java.net/jdk/pull/7954/files/863b2a1a..1a5ecbd8 Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=7954&range=02 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=7954&range=01-02 Stats: 281001 lines in 3827 files changed: 189607 ins; 71760 del; 19634 mod Patch: https://git.openjdk.java.net/jdk/pull/7954.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/7954/head:pull/7954 PR: https://git.openjdk.java.net/jdk/pull/7954 From fgao at openjdk.java.net Thu May 26 06:20:28 2022 From: fgao at openjdk.java.net (Fei Gao) Date: Thu, 26 May 2022 06:20:28 GMT Subject: RFR: 8282470: Eliminate useless sign extension before some subword integer operations [v2] In-Reply-To: References: Message-ID: On Wed, 25 May 2022 17:14:46 GMT, Vladimir Kozlov wrote: > @fg1417 Thank you for proposing these changes. I hope you still want to proceed with it. I updated the patch to latest JDK. Could you please help review it, @vnkozlov ? Thanks. 
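For readers following the 8282470 thread, the division counterexample from the PR description can be checked directly in Java. The following is a minimal sketch, not part of the patch; the class `SignExtensionDemo` and its method names are made up for illustration, and only the values `x = 131068`, `y = 50` and the shift amount 16 come from the description.

```java
final class SignExtensionDemo {
    // Sign-extends the low 16 bits of x, which is what the
    // (x << 16) >> 16 pattern does on a Java int.
    static int signExtend16(int x) {
        return (x << 16) >> 16;
    }

    // True if dropping the sign extension before an addition leaves
    // the narrowed short result unchanged.
    static boolean addIsUnaffected(int x, int y) {
        return (short) (signExtend16(x) + y) == (short) (x + y);
    }

    // True if dropping the sign extension before a division leaves
    // the narrowed short result unchanged.
    static boolean divIsUnaffected(int x, int y) {
        return (short) (signExtend16(x) / y) == (short) (x / y);
    }

    public static void main(String[] args) {
        int x = 131068; // 0x1FFFC: the low 16 bits are 0xFFFC, i.e. -4 as a short
        int y = 50;
        System.out.println("sign-extended x   = " + signExtend16(x));
        System.out.println("add unaffected    = " + addIsUnaffected(x, y));
        System.out.println("divide unaffected = " + divIsUnaffected(x, y));
    }
}
```

Addition is unaffected because the low 16 bits of a sum depend only on the low 16 bits of its operands. Division is affected because the quotient depends on the full value of the dividend: here `-4 / 50` is `0`, while `131068 / 50` is `2621`.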
------------- PR: https://git.openjdk.java.net/jdk/pull/7954 From jbhateja at openjdk.java.net Thu May 26 06:22:09 2022 From: jbhateja at openjdk.java.net (Jatin Bhateja) Date: Thu, 26 May 2022 06:22:09 GMT Subject: RFR: 8284960: Integration of JEP 426: Vector API (Fourth Incubator) [v8] In-Reply-To: <4u_PL8-QxIYVBgJ23LTdoPWZrTKh70K-beMYXOXZgQQ=.a650a280-dc75-46cd-8449-00c38ddd91ea@github.com> References: <-gYfiftVAdAUo-yZv2Y04HhoT7JT5lDcjDjCZ0UvSVc=.aa9d454d-3d6a-458a-997e-9a83951a8fa6@github.com> <4u_PL8-QxIYVBgJ23LTdoPWZrTKh70K-beMYXOXZgQQ=.a650a280-dc75-46cd-8449-00c38ddd91ea@github.com> Message-ID: On Wed, 25 May 2022 06:25:53 GMT, Jatin Bhateja wrote: >> src/hotspot/cpu/x86/assembler_x86.cpp line 8173: >> >>> 8171: >>> 8172: void Assembler::vinsertf32x4(XMMRegister dst, XMMRegister nds, XMMRegister src, uint8_t imm8) { >>> 8173: assert(VM_Version::supports_evex(), ""); >> >> Hmm, did we never trigger this wrong assert because the use was guarded by correct check? > > Yes. > @jatin-bhateja something is wrong with the merge. `vpadd()` is removed. It was added by #8778 and is still used in `x86.ad`. Hi @vnkozlov , after integration of PR 8778 there were two copies of `vpadd` with the same signature, so I removed one of them. ------------- PR: https://git.openjdk.java.net/jdk/pull/8425 From xliu at openjdk.java.net Thu May 26 06:28:35 2022 From: xliu at openjdk.java.net (Xin Liu) Date: Thu, 26 May 2022 06:28:35 GMT Subject: RFR: 8286104: use aggressive liveness for unstable_if traps [v5] In-Reply-To: References: Message-ID: > I found that some phi nodes are useful because their uses are uncommon_trap nodes. In the worst case, this hinders boxing-object elimination and scalar replacement. Test.java of JDK-8286104 reveals this issue. This patch allows the C2 parser to collect liveness based on the next bci for unstable_if traps. In most cases, the next bci is closer to exits, so fewer locals are live. This helps to reduce the number of inputs of unstable_if traps.
> > This is not a REDO of Optimization of Box nodes in uncommon_trap (JDK-8261137). The two patches are orthogonal. I adapted the test from [TestEliminateBoxInDebugInfo.java](https://github.com/openjdk/jdk/pull/2401/files#diff-49b2e38825aa4c28ca196bdc70c3cbecc2e835c2899f4f393527df4796b177ea), so part of the credit goes to the original author. I found that scalar replacement can take care of the object `Integer ii = Integer.valueOf(value)` in the original test even if it can't be removed by the later inliner. I tweaked the profiling data of `Integer.valueOf()` to hinder scalar replacement. > > This patch can cover the problem discovered by JDK-8261137. I ran the microbenchmark and got a 9x speedup on x86_64. That is almost the same as JDK-8261137. > > Before: > > Benchmark Mode Cnt Score Error Units > MyBenchmark.testMethod avgt 10 32.776 ± 0.075 us/op > > After: > > Benchmark Mode Cnt Score Error Units > MyBenchmark.testMethod avgt 10 3.656 ± 0.006 us/op > > Testing > I ran the tier1 tests with and without `-XX:+DeoptimizeALot`. No regression has been found yet. Xin Liu has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains 19 additional commits since the last revision: - Merge branch 'master' into JDK-8286104 - update comments. - Merge branch 'master' into JDK-8286104 - reimplement process_unstable_ifs - remember UnstableIfTrap in parser. Also add a statistical counter for trivial counter - Merge branch 'master' into JDK-8286104 - bail out a corner case that ifnode postpones fold-compares after loop optimization. - revert code change from 1st revision. - Merge branch 'JDK-8276998' into JDK-8286104 - rule out if a If nodes has 2 branches of unstable_if trap. - ...
and 9 more: https://git.openjdk.java.net/jdk/compare/01d2c486...a30e0ef3 ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/8545/files - new: https://git.openjdk.java.net/jdk/pull/8545/files/849cdec3..a30e0ef3 Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=8545&range=04 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=8545&range=03-04 Stats: 152 lines in 8 files changed: 124 ins; 18 del; 10 mod Patch: https://git.openjdk.java.net/jdk/pull/8545.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8545/head:pull/8545 PR: https://git.openjdk.java.net/jdk/pull/8545 From xliu at openjdk.java.net Thu May 26 06:45:32 2022 From: xliu at openjdk.java.net (Xin Liu) Date: Thu, 26 May 2022 06:45:32 GMT Subject: RFR: 8286104: use aggressive liveness for unstable_if traps [v2] In-Reply-To: References: Message-ID: On Tue, 17 May 2022 19:03:04 GMT, Vladimir Kozlov wrote: > I think you also need to call inline_incrementally_cleanup() on exit from process_for_unstable_ifs() if it made progress (found dead locals). Or do a similar thing to clean up the *_late_inlines lists. Placing the IF node on the worklist is not enough. Hi, @vnkozlov , I don't understand why we need to call inline_incrementally_cleanup(), which combines `PhaseRemoveUseless` and IGVN. It looks to me like C2 only needs `PhaseRemoveUseless` after parsing. I decided to only call process_for_unstable_ifs() after the first IGVN and before the incremental inliner. We may call it after the incremental inliner once we can prove it is stable. If `process_for_unstable_ifs` does kill a value, `igvn.replace_input_of()` puts unc and unc->in(idx) onto the worklist. `igvn.optimize()` in the last statement should handle them properly. I just followed the recipe of `process_for_post_loop_opts_igvn()`.
thanks, --lx ------------- PR: https://git.openjdk.java.net/jdk/pull/8545 From xliu at openjdk.java.net Thu May 26 07:00:37 2022 From: xliu at openjdk.java.net (Xin Liu) Date: Thu, 26 May 2022 07:00:37 GMT Subject: RFR: 8286104: use aggressive liveness for unstable_if traps [v6] In-Reply-To: References: Message-ID: > I found that some phi nodes are useful because their uses are uncommon_trap nodes. In the worst case, this hinders boxing-object elimination and scalar replacement. Test.java of JDK-8286104 reveals this issue. This patch allows the C2 parser to collect liveness based on the next bci for unstable_if traps. In most cases, the next bci is closer to exits, so fewer locals are live. This helps to reduce the number of inputs of unstable_if traps. > > This is not a REDO of Optimization of Box nodes in uncommon_trap (JDK-8261137). The two patches are orthogonal. I adapted the test from [TestEliminateBoxInDebugInfo.java](https://github.com/openjdk/jdk/pull/2401/files#diff-49b2e38825aa4c28ca196bdc70c3cbecc2e835c2899f4f393527df4796b177ea), so part of the credit goes to the original author. I found that scalar replacement can take care of the object `Integer ii = Integer.valueOf(value)` in the original test even if it can't be removed by the later inliner. I tweaked the profiling data of `Integer.valueOf()` to hinder scalar replacement. > > This patch can cover the problem discovered by JDK-8261137. I ran the microbenchmark and got a 9x speedup on x86_64. That is almost the same as JDK-8261137. > > Before: > > Benchmark Mode Cnt Score Error Units > MyBenchmark.testMethod avgt 10 32.776 ± 0.075 us/op > > After: > > Benchmark Mode Cnt Score Error Units > MyBenchmark.testMethod avgt 10 3.656 ± 0.006 us/op > > Testing > I ran the tier1 tests with and without `-XX:+DeoptimizeALot`. No regression has been found yet.
Xin Liu has updated the pull request incrementally with one additional commit since the last revision: support option AggressiveLivessForUnstableIf ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/8545/files - new: https://git.openjdk.java.net/jdk/pull/8545/files/a30e0ef3..4fdd1c88 Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=8545&range=05 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=8545&range=04-05 Stats: 6 lines in 1 file changed: 5 ins; 0 del; 1 mod Patch: https://git.openjdk.java.net/jdk/pull/8545.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8545/head:pull/8545 PR: https://git.openjdk.java.net/jdk/pull/8545 From duke at openjdk.java.net Thu May 26 10:33:46 2022 From: duke at openjdk.java.net (Emanuel Peter) Date: Thu, 26 May 2022 10:33:46 GMT Subject: RFR: 8283775: VM support for graph querying in debugger with BFS traversal and node filtering [v11] In-Reply-To: References: Message-ID: > **Note: Refactoring and extension still in progress** > > I implemented VM support for BFS traversal of IR nodes, where one can filter for input/output edges, and the node-type (control / data / memory). If one specifies the `target` node, we find a shortest path between `this` and `target`. With the `options` string one can easily select which node types to traverse. > > `void Node::print_bfs(const uint max_distance, Node* target, const char* options)` > > While a sufficient summary and description is in the comments above the function definition, I want to explain some useful use-cases here. > > Please let me know if you would find this helpful, or if you have any feedback to improve it. > Thanks, Emanuel > > **1. Better dump()** > The function is similar to `dump()`, we can also follow input / output edges up to a certain distance. There are a few improvements/added features: > > 1. Display the distance to the specified node. 
This way I can quickly assess how many nodes I have at a certain distance, and I can traverse edges from one distance to the next. The parent column shows the node one step closer to the BFS root (this). > 2. Node filtering with types taken from `type->category()` help one limit traversal to control, data or memory flow. The categories are the same as in IGV. > 3. Probably controversial: coloring of node types in terminal. By default it is off. May not work on all terminals. But I find it very useful to quickly separate memory, data and control - just as in IGV - but in my terminal! > 4. After matching, we create Mach nodes from the IR nodes. For most Mach nodes, there is an associated old node. Often an assert is triggered for Mach nodes, but the graph already breaks in the IR graph. It is thus very helpful to have the old node idx displayed. > 5. I also display the head node idx of the block a Mach node is scheduled for. This can be helpful if one gets bad dominance asserts, and needs to see why the blocks were scheduled in a bad way. > > Example: > > (rr) p find_node(35)->print_bfs(2, 0, "cdmox+") > No target: perform BFS. > dis par c dump > --------------------------------------------- > 0 35 d 35 CmpP === _ 34 25 [[ 36 ]] > 1 35 d 34 LoadP === _ 31 33 [[ 35 ]] > 1 35 d 25 ConP === 0 [[ 26 27 31 35 41 ]] #NULL > 2 34 m 31 StoreP === 20 27 29 25 [[ 23 34 41 42 ]] > 2 34 d 33 AddP === _ 1 12 32 [[ 34 ]] > > > Example with Mach nodes: > > (rr) p ctrl->print_bfs(4, 0, "cdmox+OB") > No target: perform BFS. 
> dis [head idom d] old par c dump > --------------------------------------------- > 0 159 147 6 _ 159 c 159 Region === 159 57 [[ 159 158 59 ]] > 1 147 148 5 o183 159 c 57 IfTrue === 8 [[ 159 ]] > 2 147 148 5 o182 57 c 8 jmpConU === 147 9 [[ 7 57 ]] > 3 147 148 5 _ 8 c 147 Region === 147 14 [[ 147 8 ]] > 3 147 148 5 o180 8 d 9 compUL_rReg === _ 10 13 [[ 8 ]] > 4 148 149 4 o174 147 c 14 IfTrue === 15 [[ 147 ]] > 4 147 148 5 o203 9 d 10 decL_rReg === _ 11 [[ 12 9 ]] > 4 147 148 5 o179 9 d 13 convI2L_reg_reg === _ 28 [[ 9 ]] > > > **2. Find loop body** > When I find myself in a loop, I try to localize the loop head and end, and map out at least one path between them. > `loop_end->print_bfs(20, loop_head, "cox+")` > This provides us with a shortest path, given this path has a distance of at most 20. > > Example: > > (rr) p find_node(158)->print_bfs(20, find_node(160), "cox+") > Find shortest path: 158 -> 160. > > Backtrace target. > dis c dump > --------------------------------------------- > 9 c 160 OuterStripMinedLoop === 160 339 159 [[ 160 358 ]] > 8 c 358 CountedLoop === 358 160 143 [[ 358 362 363 ]] > 7 c 363 If === 358 351 [[ 364 367 ]] > 6 c 364 IfTrue === 363 [[ 128 ]] > 5 c 128 If === 364 127 [[ 129 130 ]] > 4 c 129 IfTrue === 128 [[ 155 ]] > 3 c 155 CountedLoopEnd === 129 154 [[ 157 143 ]] [lt] > 2 c 157 IfFalse === 155 [[ 162 163 ]] > 1 c 162 SafePoint === 157 1 7 1 1 163 100 1 1 13 27 133 [[ 158 ]] > 0 c 158 OuterStripMinedLoopEnd === 162 156 [[ 159 227 ]] > > Example with Mach nodes: > > (rr) p ctrl->print_bfs(10, val, "cdmox-+OB") > Find shortest path: 159 -> 27. > > Backtrace target. 
> dis [head idom d] old e c dump > --------------------------------------------- > 2 24 1 2 o10 + d 27 MachProj === 24 [[ 19 28 4 59 95 99 118 ]] > 1 56 159 7 o239 - d 59 loadB === 159 29 27 60 [[ 55 ]] > 0 159 147 6 _ c 159 Region === 159 57 [[ 159 158 59 ]] Emanuel Peter has updated the pull request incrementally with five additional commits since the last revision: - remove Node::related, dump_related, dump_related_compact - fixing some comments, fix alignment, don't traverse over root or constants - align (and optionally color) dump - filtering extended to visit/boundary query pattern - refactored into pipeline. sort for input/output. ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/8468/files - new: https://git.openjdk.java.net/jdk/pull/8468/files/5b0f43e6..d78fb82e Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=8468&range=10 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=8468&range=09-10 Stats: 681 lines in 13 files changed: 239 ins; 323 del; 119 mod Patch: https://git.openjdk.java.net/jdk/pull/8468.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8468/head:pull/8468 PR: https://git.openjdk.java.net/jdk/pull/8468 From ngasson at openjdk.java.net Thu May 26 19:46:57 2022 From: ngasson at openjdk.java.net (Nick Gasson) Date: Thu, 26 May 2022 19:46:57 GMT Subject: RFR: 8287195: AArch64: Client VM build failure after JDK-8283689 Message-ID: The client build fails because foreignGlobals_aarch64.cpp uses Matcher without `#ifdef COMPILER2`. However there's a latent bug here on SVE machines where `RegSpiller::pd_reg_size()` returns the SVE register size but other code that uses the register spill area (e.g. `DowncallStubGenerator::generate()` and `AArch64Architecture.VECTOR_REG_SIZE`) assume that we always save 16 bytes. Since we don't support any calling conventions where arguments/results are passed in long vectors we should just save the first 128 bits of the register, like x86 does. 
------------- Commit messages: - 8287195: AArch64: Client VM build failure after JDK-8283689 Changes: https://git.openjdk.java.net/jdk/pull/8908/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=8908&range=00 Issue: https://bugs.openjdk.java.net/browse/JDK-8287195 Stats: 18 lines in 1 file changed: 0 ins; 15 del; 3 mod Patch: https://git.openjdk.java.net/jdk/pull/8908.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8908/head:pull/8908 PR: https://git.openjdk.java.net/jdk/pull/8908 From kvn at openjdk.java.net Thu May 26 20:18:40 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Thu, 26 May 2022 20:18:40 GMT Subject: RFR: 8286104: use aggressive liveness for unstable_if traps [v2] In-Reply-To: References: Message-ID: On Thu, 26 May 2022 06:42:09 GMT, Xin Liu wrote: > > I think you also need to call inline_incrementally_cleanup() on exit from process_for_unstable_ifs() if it made progress (found dead locals). Or do similar thing to cleanup *_late_inlines lists. Placing IF node on work list is not enough. > > Hi, @vnkozlov , I don't understand why we need to call inline_incrementally_cleanup(), which combines `PhaseRemoveUseless` and IGVN. It looks for me C2 only needs `PhaseRemoveUseless` after parsing. My concern was that `igvn.optimize()` call in `process_for_unstable_ifs()` will not clean up lists (`_late_inlines`, `_string_late_inlines`, `_boxing_late_inlines`) used by `inline_incrementally()` and following code. It is in case when `process_for_unstable_ifs()` removes uses to these calls (removes reference to result of call). You need to run `PhaseRemoveUseless` to determine that calls are useless. I don't think `igvn.optimize()` can mark call node as dead in this case because it still connected to graph through control edges. You need to verify that calls are removed from lists (for example call `valueOf()` from `_boxing_late_inlines` list). 
If calls are indeed removed from lists by calling `igvn.optimize()` then I will be fine with your code. > > I decide to only call process_for_unstable_ifs() after IGVN(1st) and before incremental inliner. we may call it after incremental inliner after we can prove it is stable. Okay. > > If `process_for_unstable_ifs` does kill a value, `igvn.replace_input_of()` puts unc and unc->in(idx) into worklist. `igvn.optimize()` in last statement should handle them properly. I just follow the recipe of `process_for_post_loop_opts_igvn()`. `process_for_post_loop_opts_igvn()` is called when all late inlines are done and corresponding lists are empty. > > thanks, --lx Thanks, Vladimir K ------------- PR: https://git.openjdk.java.net/jdk/pull/8545 From kvn at openjdk.java.net Thu May 26 21:04:14 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Thu, 26 May 2022 21:04:14 GMT Subject: RFR: 8284960: Integration of JEP 426: Vector API (Fourth Incubator) [v10] In-Reply-To: References: Message-ID: On Wed, 25 May 2022 06:29:23 GMT, Jatin Bhateja wrote: >> Hi All, >> >> Patch adds the planned support for new vector operations and APIs targeted for [JEP 426: Vector API (Fourth Incubator).](https://bugs.openjdk.java.net/browse/JDK-8280173) >> >> Following is the brief summary of changes:- >> >> 1) Extends the scope of existing lanewise API for following new vector operations. >> - VectorOperations.BIT_COUNT: counts the number of one-bits >> - VectorOperations.LEADING_ZEROS_COUNT: counts the number of leading zero bits >> - VectorOperations.TRAILING_ZEROS_COUNT: counts the number of trailing zero bits >> - VectorOperations.REVERSE: reversing the order of bits >> - VectorOperations.REVERSE_BYTES: reversing the order of bytes >> - compress and expand bits: Semantics are based on Hacker's Delight section 7-4 Compress, or Generalized Extract. >> >> 2) Adds following new APIs to perform cross lane vector compress and expansion operations under the influence of a mask. 
>> - Vector.compress >> - Vector.expand >> - VectorMask.compress >> >> 3) Adds predicated and non-predicated versions of following new APIs to load and store the contents of vector from foreign MemorySegments. >> - Vector.fromMemorySegment >> - Vector.intoMemorySegment >> >> 4) C2 Compiler IR enhancements and optimized X86 and AARCH64 backend support for each newly added operation. >> >> >> Patch has been regressed over AARCH64 and X86 targets different AVX levels. >> >> Kindly review and share your feedback. >> >> Best Regards, >> Jatin > > Jatin Bhateja has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 20 commits: > > - 8284960: Post merge cleanups. > - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8284960 > - 8284960: Review comments resolved. > - 8284960: Integrating incremental patches. > - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8284960 > - 8284960: Changes to enable jdk.incubator.vector to be treated as preview participant. Code re-organization related to Reverse/ReverseByte IR transforms. > - 8284960: Adding --enable-preview in vectorAPI benchmarks. > - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8284960 > - 8284960: Review comments resolution. > - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8284960 > - ... and 10 more: https://git.openjdk.java.net/jdk/compare/742644e2...0f6e1584 Good. ------------- Marked as reviewed by kvn (Reviewer). 
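For readers skimming the operation list in the quoted description, here is a rough scalar model of two of the additions: the lanewise `BIT_COUNT` and the cross-lane `compress`/`expand`. It deliberately does not use `jdk.incubator.vector` (which needs the incubator module enabled). The class `LanewiseModel` and its method names are hypothetical, and the zero-filling of unselected lanes is an assumption based on my reading of the Vector API documentation rather than something stated in this thread.

```java
import java.util.Arrays;

public class LanewiseModel {
    // Scalar model of a lanewise BIT_COUNT: count the one-bits of every lane independently.
    static int[] bitCount(int[] lanes) {
        int[] out = new int[lanes.length];
        for (int i = 0; i < lanes.length; i++) {
            out[i] = Integer.bitCount(lanes[i]);
        }
        return out;
    }

    // Scalar model of a cross-lane compress: lanes whose mask bit is set are packed
    // contiguously starting at lane 0; the remaining upper lanes stay zero (assumed).
    static int[] compress(int[] lanes, boolean[] mask) {
        int[] out = new int[lanes.length];
        int next = 0;
        for (int i = 0; i < lanes.length; i++) {
            if (mask[i]) {
                out[next++] = lanes[i];
            }
        }
        return out;
    }

    // Scalar model of expand, the inverse direction: consecutive source lanes starting
    // at lane 0 are scattered to the lanes whose mask bit is set; other lanes stay zero.
    static int[] expand(int[] lanes, boolean[] mask) {
        int[] out = new int[lanes.length];
        int next = 0;
        for (int i = 0; i < lanes.length; i++) {
            if (mask[i]) {
                out[i] = lanes[next++];
            }
        }
        return out;
    }

    public static void main(String[] args) {
        boolean[] mask = {false, true, false, true};
        int[] packed = compress(new int[] {10, 20, 30, 40}, mask);
        System.out.println(Arrays.toString(packed));
        System.out.println(Arrays.toString(expand(packed, mask)));
        System.out.println(Arrays.toString(bitCount(new int[] {0, 1, 3, 255})));
    }
}
```

The other lanewise operations in the list can be modelled the same way with `Integer.numberOfLeadingZeros`, `Integer.numberOfTrailingZeros`, `Integer.reverse` and `Integer.reverseBytes`.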
PR: https://git.openjdk.java.net/jdk/pull/8425 From kvn at openjdk.java.net Thu May 26 21:04:15 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Thu, 26 May 2022 21:04:15 GMT Subject: RFR: 8284960: Integration of JEP 426: Vector API (Fourth Incubator) [v8] In-Reply-To: References: <-gYfiftVAdAUo-yZv2Y04HhoT7JT5lDcjDjCZ0UvSVc=.aa9d454d-3d6a-458a-997e-9a83951a8fa6@github.com> <4u_PL8-QxIYVBgJ23LTdoPWZrTKh70K-beMYXOXZgQQ=.a650a280-dc75-46cd-8449-00c38ddd91ea@github.com> Message-ID: On Thu, 26 May 2022 06:19:40 GMT, Jatin Bhateja wrote: >> Yes. > >> @jatin-bhateja something wrong with merge. `vpadd()` is removed. It was added by #8778 and still is used in `x86.ad`. > > Hi @vnkozlov , after integration of PR 8778 there were two copies of vpadd with the same signature, so I removed one of them. Okay. Got it. ------------- PR: https://git.openjdk.java.net/jdk/pull/8425 From jvernee at openjdk.java.net Thu May 26 21:46:27 2022 From: jvernee at openjdk.java.net (Jorn Vernee) Date: Thu, 26 May 2022 21:46:27 GMT Subject: RFR: 8287195: AArch64: Client VM build failure after JDK-8283689 In-Reply-To: References: Message-ID: On Thu, 26 May 2022 19:39:42 GMT, Nick Gasson wrote: > The client build fails because foreignGlobals_aarch64.cpp uses Matcher without `#ifdef COMPILER2`. However there's a latent bug here on SVE machines where `RegSpiller::pd_reg_size()` returns the SVE register size but other code that uses the register spill area (e.g. `DowncallStubGenerator::generate()` and `AArch64Architecture.VECTOR_REG_SIZE`) assumes that we always save 16 bytes. Since we don't support any calling conventions where arguments/results are passed in long vectors we should > just save the first 128 bits of the register, like x86 does. Marked as reviewed by jvernee (Reviewer). Thanks for fixing.
------------- PR: https://git.openjdk.java.net/jdk/pull/8908 From dlong at openjdk.java.net Thu May 26 22:12:27 2022 From: dlong at openjdk.java.net (Dean Long) Date: Thu, 26 May 2022 22:12:27 GMT Subject: RFR: JDK-8287288: Fix some typos in C1 In-Reply-To: References: Message-ID: On Wed, 25 May 2022 09:11:23 GMT, Zhuojun Miao wrote: > This is a trivial patch to fix some typos in C1. > e.g. https://github.com/openjdk/jdk/blob/master/src/hotspot/share/c1/c1_LIRGenerator.cpp#L1190 I don't think the non_data_bits issue should be addressed as part of this typo fix. Please file a separate bug for that. ------------- Marked as reviewed by dlong (Reviewer). PR: https://git.openjdk.java.net/jdk/pull/8880 From dlong at openjdk.java.net Thu May 26 22:12:27 2022 From: dlong at openjdk.java.net (Dean Long) Date: Thu, 26 May 2022 22:12:27 GMT Subject: RFR: JDK-8287288: Fix some typos in C1 In-Reply-To: References: <4aStkDt-jXraLpwGg7SrudAS_S7WDQq0Pz62dRt9MEo=.38b77871-4a77-4af0-ac09-f44df6ff475f@github.com> Message-ID: On Thu, 26 May 2022 03:36:11 GMT, Zhuojun Miao wrote: >> I was reading: >> >> enum OprBits { >> pointer_bits = 1 >> , kind_bits = 3 >> , type_bits = 4 >> , size_bits = 2 >> , destroys_bits = 1 >> >> as a bitfield description so 1 pointer bit followed by 3 kind bits, followed by 4 type bits etc. But I agree the later mask value only does a shift of 3 not 4. >> >> Also the code indicates there are a lot more non-data bits before we get to the data so the diagram is incomplete in other ways. > > I don't think `pointer_bits` should be added to `non_data_bits`: > > > , non_data_bits = pointer_bits + kind_bits + type_bits + size_bits + destroys_bits + virtual_bits > + is_xmm_bits + last_use_bits + is_fpu_stack_offset_bits > , data_bits = BitsPerInt - non_data_bits > , reg_bits = data_bits / 2 // for two registers in one value encoding I agree. This was changed as part of JDK-8261235. @chhagedorn, I think there must be a different issue with vreg_max in JDK-8261235. 
I wonder if it should be using reg_bits instead of data_bits. ------------- PR: https://git.openjdk.java.net/jdk/pull/8880 From dlong at openjdk.java.net Thu May 26 23:07:33 2022 From: dlong at openjdk.java.net (Dean Long) Date: Thu, 26 May 2022 23:07:33 GMT Subject: RFR: JDK-8287288: Fix some typos in C1 In-Reply-To: References: Message-ID: On Wed, 25 May 2022 09:11:23 GMT, Zhuojun Miao wrote: > This is a trivial patch to fix some typos in C1. > e.g. https://github.com/openjdk/jdk/blob/master/src/hotspot/share/c1/c1_LIRGenerator.cpp#L1190 I filed JDK-8287396 for the non_data_bits issue. ------------- PR: https://git.openjdk.java.net/jdk/pull/8880 From duke at openjdk.java.net Thu May 26 23:21:43 2022 From: duke at openjdk.java.net (duke) Date: Thu, 26 May 2022 23:21:43 GMT Subject: Withdrawn: 8282182: Document algorithm used to encode aarch64 logical immediate operands. In-Reply-To: References: Message-ID: <9IKupK8c2tDhcoI6RfmyAl4ldspP0l7o2hWQ3i967qM=.075b96b3-13c9-4b07-8d69-33485c5258d9@github.com> On Fri, 18 Feb 2022 17:12:08 GMT, Andrew Dinn wrote: > This *documentation only* change explains how logical immediate mask values are derived from valid logical instruction operands. The encoding function is used to populate a sparse array that maps valid masks to a unique set of input operand values and a reverse lookup array that maps inputs to the associated mask. This pull request has been closed without being integrated. ------------- PR: https://git.openjdk.java.net/jdk/pull/7536 From kvn at openjdk.java.net Fri May 27 00:30:50 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Fri, 27 May 2022 00:30:50 GMT Subject: RFR: 8282470: Eliminate useless sign extension before some subword integer operations [v3] In-Reply-To: References: Message-ID: On Thu, 26 May 2022 06:18:33 GMT, Fei Gao wrote: >> Some loop cases of subword types, including byte and short, can't be vectorized by C2's SLP.
Here is an example: >> >> short[] addShort(short[] a, short[] b, short[] c) { >> for (int i = 0; i < SIZE; i++) { >> b[i] = (short) (a[i] + 8); // line A >> sres[i] = (short) (b[i] + c[i]); // line B >> } >> } >> >> However, similar cases of int/float/double/long/char type can be vectorized successfully. >> >> The reason why SLP can't vectorize the short case above is that, as illustrated here[1], the result of the scalar add operation on *line A* has been promoted to int type. It needs to be narrowed to short type first before it can work as one of source operands of addition on *line B*. The demotion is done by left-shifting 16 bits then right-shifting 16 bits. The ideal graph for the process is shown below. >> ![image](https://user-images.githubusercontent.com/39403138/160074255-c751f84b-6511-4b56-927b-53fb512cf51b.png) >> >> In SLP, for most short-type cases, we can determine the precise type of the scalar int-type operation and finally execute it with short-type vector operations[2], except rshift opcode and abs in some situations[3]. But in this case, the source operand of RShiftI is from LShiftI rather than from any LoadS[4], so we can't determine its real type and conservatively assign it with int type rather than real short type. The int-type operation RShiftI here can't be vectorized together with other short-type operations, like AddI (line B). The reason for byte loop cases is the same. Similar loop cases of char type could be vectorized because its demotion from int to char is done by `and` with mask rather than `lshift_rshift`. >> >> Therefore, we try to remove the patterns like `RShiftI _ (LShiftI _ valIn1 conIL ) conIR` in the byte/short cases, to vectorize more scenarios. Optimizing it in the mid-end by i-GVN is more reasonable.
>> >> What we do in the mid-end is eliminating the sign extension before some subword integer operations like: >> >> >> int x, y; >> short s = (short) (((x << Imm) >> Imm) OP y); // Imm <= 16 >> >> to >> >> short s = (short) (x OP y); >> >> >> In the patch, assuming that `x` can be any int number, we need to guarantee that the optimization doesn't have any impact on the result. Not all arithmetic logic OPs meet the requirements. For example, assuming that `Imm` equals `16`, `x` equals `131068`, >> `y` equals `50` and `OP` is division (`/`), `short s = (short) (((131068 << 16) >> 16) / 50)` is not equal to `short s = (short) (131068 / 50)`. When OP is division, we may get a different result with or without demotion before OP, because the upper 16 bits of the operands may influence the lower 16 bits of the result, so division can't be optimized. All optimizable opcodes are listed in StoreNode::no_need_sign_extension(); for these, the upper 16 bits of the src operands don't influence the lower 16 bits of the result for short >> type and the upper 24 bits of the src operand don't influence the lower 8 bits of the dst operand for byte. >> >> After the patch, the short loop case above can be vectorized as: >> >> movi v18.8h, #0x8 >> ... >> ldr q16, [x14, #32] // vector load a[i] >> // vector add, a[i] + 8, no promotion or demotion >> add v17.8h, v16.8h, v18.8h >> str q17, [x6, #32] // vector store a[i] + 8, b[i] >> ldr q17, [x0, #32] // vector load c[i] >> // vector add, a[i] + c[i], no promotion or demotion >> add v16.8h, v17.8h, v16.8h >> // vector add, a[i] + c[i] + 8, no promotion or demotion >> add v16.8h, v16.8h, v18.8h >> str q16, [x11, #32] //vector store sres[i] >> ... >> >> >> The patch works for byte cases as well. >> >> Here is the performance data for micro-benchmark before and after this patch on both AArch64 and x64 machines. We can observe about 83% improvement with this patch. >> >> on AArch64: >> Before the patch: >> Benchmark (length) Mode Cnt Score Error Units >> addB 523 avgt 15 401.521 ±
0.033 ns/op >> addS 523 avgt 15 401.512 ± 0.021 ns/op >> >> After the patch: >> Benchmark (length) Mode Cnt Score Error Units >> addB 523 avgt 15 68.444 ± 0.318 ns/op >> addS 523 avgt 15 69.847 ± 0.043 ns/op >> >> on x86: >> Before the patch: >> Benchmark (length) Mode Cnt Score Error Units >> addB 523 avgt 15 454.102 ± 36.180 ns/op >> addS 523 avgt 15 432.245 ± 22.640 ns/op >> >> After the patch: >> Benchmark (length) Mode Cnt Score Error Units >> addB 523 avgt 15 75.812 ± 5.063 ns/op >> addS 523 avgt 15 72.839 ± 10.109 ns/op >> >> [1]: https://github.com/openjdk/jdk/blob/6013d09e82693a1c07cf0bf6daffd95114b3cbfa/src/hotspot/share/opto/superword.cpp#L3241 >> [2]: https://github.com/openjdk/jdk/blob/6013d09e82693a1c07cf0bf6daffd95114b3cbfa/src/hotspot/share/opto/superword.cpp#L3206 >> [3]: https://github.com/openjdk/jdk/blob/6013d09e82693a1c07cf0bf6daffd95114b3cbfa/src/hotspot/share/opto/superword.cpp#L3249 >> [4]: https://github.com/openjdk/jdk/blob/6013d09e82693a1c07cf0bf6daffd95114b3cbfa/src/hotspot/share/opto/superword.cpp#L3251 > > Fei Gao has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains three additional commits since the last revision: > > - Merge branch 'master' into fg8282470 > > Change-Id: I180f1c85bd407b3d7e05937450c5fc0f81e6d70b > - Merge branch 'master' into fg8282470 > > Change-Id: I877ba1e9a82c0dbef04df08070223c02400eeec7 > - 8282470: Eliminate useless sign extension before some subword integer operations > > Some loop cases of subword types, including byte and > short, can't be vectorized by C2's SLP. Here is an example: > ``` > short[] addShort(short[] a, short[] b, short[] c) { > for (int i = 0; i < SIZE; i++) { > b[i] = (short) (a[i] + 8); // *line A* > sres[i] = (short) (b[i] + c[i]); // *line B* > } > } > ``` > However, similar cases of int/float/double/long/char type can > be vectorized successfully.
> > The reason why SLP can't vectorize the short case above is > that, as illustrated here[1], the result of the scalar add > operation on *line A* has been promoted to int type. It needs > to be narrowed to short type first before it can work as one > of source operands of addition on *line B*. The demotion is > done by left-shifting 16 bits then right-shifting 16 bits. > The ideal graph for the process is shown below. > > LoadS a[i] 8 > \ / > AddI (line A) > / \ > StoreC b[i] Lshift 16bits > \ > RShiftI 16 bits LoadS c[i] > \ / > AddI (line B) > \ > StoreC sres[i] > > In SLP, for most short-type cases, we can determine the precise > type of the scalar int-type operation and finally execute it > with short-type vector operations[2], except rshift opcode and > abs in some situations[3]. But in this case, the source operand > of RShiftI is from LShiftI rather than from any LoadS[4], so we > can't determine its real type and conservatively assign it with > int type rather than real short type. The int-type operation > RShiftI here can't be vectorized together with other short-type > operations, like AddI (line B). The reason for byte loop cases > is the same. Similar loop cases of char type could be > vectorized because its demotion from int to char is done by > `and` with mask rather than `lshift_rshift`. > > Therefore, we try to remove the patterns like > `RShiftI _ (LShiftI _ valIn1 conIL ) conIR` in the byte/short > cases, to vectorize more scenarios. Optimizing it in the > mid-end by i-GVN is more reasonable. > > What we do in the mid-end is eliminating the sign extension > before some subword integer operations like: > > ``` > int x, y; > short s = (short) (((x << Imm) >> Imm) OP y); // Imm <= 16 > ``` > to > ``` > short s = (short) (x OP y); > ``` > > In the patch, assuming that `x` can be any int number, we need to > guarantee that the optimization doesn't have any impact on the > result. Not all arithmetic logic OPs meet the requirements.
For > example, assuming that `Imm` equals `16`, `x` equals `131068`, > `y` equals `50` and `OP` is division (`/`), > `short s = (short) (((131068 << 16) >> 16) / 50)` is not > equal to `short s = (short) (131068 / 50)`. When OP is division, > we may get a different result with or without demotion > before OP, because the upper 16 bits of the operands may > influence the lower 16 bits of the result, so division can't be > optimized. All optimizable opcodes are listed in > StoreNode::no_need_sign_extension(); for these, the upper 16 bits of the src > operands don't influence the lower 16 bits of the result for short > type and the upper 24 bits of the src operand don't influence the lower > 8 bits of the dst operand for byte. > > After the patch, the short loop case above can be vectorized as: > ``` > movi v18.8h, #0x8 > ... > ldr q16, [x14, #32] // vector load a[i] > // vector add, a[i] + 8, no promotion or demotion > add v17.8h, v16.8h, v18.8h > str q17, [x6, #32] // vector store a[i] + 8, b[i] > ldr q17, [x0, #32] // vector load c[i] > // vector add, a[i] + c[i], no promotion or demotion > add v16.8h, v17.8h, v16.8h > // vector add, a[i] + c[i] + 8, no promotion or demotion > add v16.8h, v16.8h, v18.8h > str q16, [x11, #32] //vector store sres[i] > ... > ``` > > The patch works for byte cases as well. > > Here is the performance data for micro-benchmark before > and after this patch on both AArch64 and x64 machines. > We can observe about 83% improvement with this patch. > > on AArch64: > Before the patch: > Benchmark (length) Mode Cnt Score Error Units > addB 523 avgt 15 401.521 ± 0.033 ns/op > addS 523 avgt 15 401.512 ± 0.021 ns/op > > After the patch: > Benchmark (length) Mode Cnt Score Error Units > addB 523 avgt 15 68.444 ± 0.318 ns/op > addS 523 avgt 15 69.847 ± 0.043 ns/op > > on x86: > Before the patch: > Benchmark (length) Mode Cnt Score Error Units > addB 523 avgt 15 454.102 ± 36.180 ns/op > addS 523 avgt 15 432.245 ±
22.640 ns/op > > After the patch: > Benchmark (length) Mode Cnt Score Error Units > addB 523 avgt 15 75.812 ± 5.063 ns/op > addS 523 avgt 15 72.839 ± 10.109 ns/op > > [1]: https://github.com/openjdk/jdk/blob/6013d09e82693a1c07cf0bf6daffd95114b3cbfa/src/hotspot/share/opto/superword.cpp#L3241 > [2]: https://github.com/openjdk/jdk/blob/6013d09e82693a1c07cf0bf6daffd95114b3cbfa/src/hotspot/share/opto/superword.cpp#L3206 > [3]: https://github.com/openjdk/jdk/blob/6013d09e82693a1c07cf0bf6daffd95114b3cbfa/src/hotspot/share/opto/superword.cpp#L3249 > [4]: https://github.com/openjdk/jdk/blob/6013d09e82693a1c07cf0bf6daffd95114b3cbfa/src/hotspot/share/opto/superword.cpp#L3251 > > Change-Id: I92ce42b550ef057964a3b58716436735275d8d31 Looks reasonable. Let me test it. ------------- PR: https://git.openjdk.java.net/jdk/pull/7954 From kvn at openjdk.java.net Fri May 27 00:40:31 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Fri, 27 May 2022 00:40:31 GMT Subject: RFR: 8282470: Eliminate useless sign extension before some subword integer operations [v3] In-Reply-To: References: Message-ID: On Thu, 26 May 2022 06:18:33 GMT, Fei Gao wrote: >> Some loop cases of subword types, including byte and short, can't be vectorized by C2's SLP. Here is an example: >> >> short[] addShort(short[] a, short[] b, short[] c) { >> for (int i = 0; i < SIZE; i++) { >> b[i] = (short) (a[i] + 8); // line A >> sres[i] = (short) (b[i] + c[i]); // line B >> } >> } >> >> However, similar cases of int/float/double/long/char type can be vectorized successfully. >> >> The reason why SLP can't vectorize the short case above is that, as illustrated here[1], the result of the scalar add operation on *line A* has been promoted to int type. It needs to be narrowed to short type first before it can work as one of source operands of addition on *line B*. The demotion is done by left-shifting 16 bits then right-shifting 16 bits. The ideal graph for the process is shown below.
>> ![image](https://user-images.githubusercontent.com/39403138/160074255-c751f84b-6511-4b56-927b-53fb512cf51b.png) >> >> In SLP, for most short-type cases, we can determine the precise type of the scalar int-type operation and finally execute it with short-type vector operations[2], except rshift opcode and abs in some situations[3]. But in this case, the source operand of RShiftI is from LShiftI rather than from any LoadS[4], so we can't determine its real type and conservatively assign it with int type rather than real short type. The int-type operation RShiftI here can't be vectorized together with other short-type operations, like AddI (line B). The reason for byte loop cases is the same. Similar loop cases of char type could be vectorized because its demotion from int to char is done by `and` with mask rather than `lshift_rshift`. >> >> Therefore, we try to remove the patterns like `RShiftI _ (LShiftI _ valIn1 conIL ) conIR` in the byte/short cases, to vectorize more scenarios. Optimizing it in the mid-end by i-GVN is more reasonable. >> >> What we do in the mid-end is eliminating the sign extension before some subword integer operations like: >> >> >> int x, y; >> short s = (short) (((x << Imm) >> Imm) OP y); // Imm <= 16 >> >> to >> >> short s = (short) (x OP y); >> >> >> In the patch, assuming that `x` can be any int number, we need to guarantee that the optimization doesn't have any impact on the result. Not all arithmetic logic OPs meet the requirements. For example, assuming that `Imm` equals `16`, `x` equals `131068`, >> `y` equals `50` and `OP` is division (`/`), `short s = (short) (((131068 << 16) >> 16) / 50)` is not equal to `short s = (short) (131068 / 50)`. When OP is division, we may get a different result with or without demotion before OP, because the upper 16 bits of the operands may influence the lower 16 bits of the result, so division can't be optimized.
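A minimal sketch of the counterexample above, for anyone who wants to check the arithmetic. The class and method names (`SignExtDemo`, `narrow16`, and so on) are made up for illustration and are not part of the patch; the inputs `x = 131068`, `y = 50` and `Imm = 16` are the ones quoted in the description.

```java
public class SignExtDemo {
    // Sign-extend the low 16 bits of x: the (x << 16) >> 16 pattern from the description.
    static int narrow16(int x) {
        return (x << 16) >> 16;
    }

    // Addition, narrowed to short, with and without the explicit sign extension.
    static short addWithExt(int x, int y) { return (short) (narrow16(x) + y); }
    static short addPlain(int x, int y)   { return (short) (x + y); }

    // Division has the same shape, but the upper bits of x reach the low 16 bits of the quotient.
    static short divWithExt(int x, int y) { return (short) (narrow16(x) / y); }
    static short divPlain(int x, int y)   { return (short) (x / y); }

    public static void main(String[] args) {
        int x = 131068, y = 50;
        System.out.println("add: " + addWithExt(x, y) + " vs " + addPlain(x, y));
        System.out.println("div: " + divWithExt(x, y) + " vs " + divPlain(x, y));
    }
}
```

With these inputs the two additions agree (46 and 46) while the two divisions do not (0 and 2621), which matches the claim that dropping the shift pair is safe for addition but not for division.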
All optimizable opcodes are listed in StoreNode::no_need_sign_extension(); for these, the upper 16 bits of the src operands don't influence the lower 16 bits of the result for short >> type and the upper 24 bits of the src operand don't influence the lower 8 bits of the dst operand for byte. >> >> After the patch, the short loop case above can be vectorized as: >> >> movi v18.8h, #0x8 >> ... >> ldr q16, [x14, #32] // vector load a[i] >> // vector add, a[i] + 8, no promotion or demotion >> add v17.8h, v16.8h, v18.8h >> str q17, [x6, #32] // vector store a[i] + 8, b[i] >> ldr q17, [x0, #32] // vector load c[i] >> // vector add, a[i] + c[i], no promotion or demotion >> add v16.8h, v17.8h, v16.8h >> // vector add, a[i] + c[i] + 8, no promotion or demotion >> add v16.8h, v16.8h, v18.8h >> str q16, [x11, #32] //vector store sres[i] >> ... >> >> >> The patch works for byte cases as well. >> >> Here is the performance data for micro-benchmark before and after this patch on both AArch64 and x64 machines. We can observe about 83% improvement with this patch. >> >> on AArch64: >> Before the patch: >> Benchmark (length) Mode Cnt Score Error Units >> addB 523 avgt 15 401.521 ± 0.033 ns/op >> addS 523 avgt 15 401.512 ± 0.021 ns/op >> >> After the patch: >> Benchmark (length) Mode Cnt Score Error Units >> addB 523 avgt 15 68.444 ± 0.318 ns/op >> addS 523 avgt 15 69.847 ± 0.043 ns/op >> >> on x86: >> Before the patch: >> Benchmark (length) Mode Cnt Score Error Units >> addB 523 avgt 15 454.102 ± 36.180 ns/op >> addS 523 avgt 15 432.245 ± 22.640 ns/op >> >> After the patch: >> Benchmark (length) Mode Cnt Score Error Units >> addB 523 avgt 15 75.812 ± 5.063 ns/op >> addS 523 avgt 15 72.839 ±
10.109 ns/op >> >> [1]: https://github.com/openjdk/jdk/blob/6013d09e82693a1c07cf0bf6daffd95114b3cbfa/src/hotspot/share/opto/superword.cpp#L3241 >> [2]: https://github.com/openjdk/jdk/blob/6013d09e82693a1c07cf0bf6daffd95114b3cbfa/src/hotspot/share/opto/superword.cpp#L3206 >> [3]: https://github.com/openjdk/jdk/blob/6013d09e82693a1c07cf0bf6daffd95114b3cbfa/src/hotspot/share/opto/superword.cpp#L3249 >> [4]: https://github.com/openjdk/jdk/blob/6013d09e82693a1c07cf0bf6daffd95114b3cbfa/src/hotspot/share/opto/superword.cpp#L3251 > > Fei Gao has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains three additional commits since the last revision: > > - Merge branch 'master' into fg8282470 > > Change-Id: I180f1c85bd407b3d7e05937450c5fc0f81e6d70b > - Merge branch 'master' into fg8282470 > > Change-Id: I877ba1e9a82c0dbef04df08070223c02400eeec7 > - 8282470: Eliminate useless sign extension before some subword integer operations > > Some loop cases of subword types, including byte and > short, can't be vectorized by C2's SLP. Here is an example: > ``` > short[] addShort(short[] a, short[] b, short[] c) { > for (int i = 0; i < SIZE; i++) { > b[i] = (short) (a[i] + 8); // *line A* > sres[i] = (short) (b[i] + c[i]); // *line B* > } > } > ``` > However, similar cases of int/float/double/long/char type can > be vectorized successfully. > > The reason why SLP can't vectorize the short case above is > that, as illustrated here[1], the result of the scalar add > operation on *line A* has been promoted to int type. It needs > to be narrowed to short type first before it can work as one > of source operands of addition on *line B*. The demotion is > done by left-shifting 16 bits then right-shifting 16 bits. > The ideal graph for the process is shown below.
> > LoadS a[i] 8 > \ / > AddI (line A) > / \ > StoreC b[i] Lshift 16bits > \ > RShiftI 16 bits LoadS c[i] > \ / > AddI (line B) > \ > StoreC sres[i] > > In SLP, for most short-type cases, we can determine the precise > type of the scalar int-type operation and finally execute it > with short-type vector operations[2], except rshift opcode and > abs in some situations[3]. But in this case, the source operand > of RShiftI is from LShiftI rather than from any LoadS[4], so we > can't determine its real type and conservatively assign it with > int type rather than real short type. The int-type operation > RShiftI here can't be vectorized together with other short-type > operations, like AddI (line B). The reason for byte loop cases > is the same. Similar loop cases of char type could be > vectorized because its demotion from int to char is done by > `and` with mask rather than `lshift_rshift`. > > Therefore, we try to remove the patterns like > `RShiftI _ (LShiftI _ valIn1 conIL ) conIR` in the byte/short > cases, to vectorize more scenarios. Optimizing it in the > mid-end by i-GVN is more reasonable. > > What we do in the mid-end is eliminating the sign extension > before some subword integer operations like: > > ``` > int x, y; > short s = (short) (((x << Imm) >> Imm) OP y); // Imm <= 16 > ``` > to > ``` > short s = (short) (x OP y); > ``` > > In the patch, assuming that `x` can be any int number, we need to > guarantee that the optimization doesn't have any impact on the > result. Not all arithmetic logic OPs meet the requirements. For > example, assuming that `Imm` equals `16`, `x` equals `131068`, > `y` equals `50` and `OP` is division (`/`), > `short s = (short) (((131068 << 16) >> 16) / 50)` is not > equal to `short s = (short) (131068 / 50)`. When OP is division, > we may get a different result with or without demotion > before OP, because the upper 16 bits of the operands may > influence the lower 16 bits of the result, so division can't be > optimized.
All optimizable opcodes are listed in
> StoreNode::no_need_sign_extension(). For these opcodes, the upper 16 bits
> of the src operands don't influence the lower 16 bits of the result for
> short type, and the upper 24 bits of the src operands don't influence the
> lower 8 bits of the dst operand for byte.
>
> After the patch, the short loop case above can be vectorized as:
> ```
> movi    v18.8h, #0x8
> ...
> ldr     q16, [x14, #32]          // vector load a[i]
> // vector add, a[i] + 8, no promotion or demotion
> add     v17.8h, v16.8h, v18.8h
> str     q17, [x6, #32]           // vector store a[i] + 8, b[i]
> ldr     q17, [x0, #32]           // vector load c[i]
> // vector add, a[i] + c[i], no promotion or demotion
> add     v16.8h, v17.8h, v16.8h
> // vector add, a[i] + c[i] + 8, no promotion or demotion
> add     v16.8h, v16.8h, v18.8h
> str     q16, [x11, #32]          // vector store sres[i]
> ...
> ```
>
> The patch works for byte cases as well.
>
> Here is the performance data for the micro-benchmark before
> and after this patch on both AArch64 and x64 machines.
> We can observe about 83% improvement with this patch.
>
> on AArch64:
> Before the patch:
> Benchmark  (length)  Mode  Cnt    Score    Error  Units
> addB            523  avgt   15  401.521 ±  0.033  ns/op
> addS            523  avgt   15  401.512 ±  0.021  ns/op
>
> After the patch:
> Benchmark  (length)  Mode  Cnt    Score    Error  Units
> addB            523  avgt   15   68.444 ±  0.318  ns/op
> addS            523  avgt   15   69.847 ±  0.043  ns/op
>
> on x86:
> Before the patch:
> Benchmark  (length)  Mode  Cnt    Score    Error  Units
> addB            523  avgt   15  454.102 ± 36.180  ns/op
> addS            523  avgt   15  432.245 ± 22.640  ns/op
>
> After the patch:
> Benchmark  (length)  Mode  Cnt    Score    Error  Units
> addB            523  avgt   15   75.812 ±  5.063  ns/op
> addS            523  avgt   15   72.839 ±
10.109 ns/op > > [1]: https://github.com/openjdk/jdk/blob/6013d09e82693a1c07cf0bf6daffd95114b3cbfa/src/hotspot/share/opto/superword.cpp#L3241 > [2]: https://github.com/openjdk/jdk/blob/6013d09e82693a1c07cf0bf6daffd95114b3cbfa/src/hotspot/share/opto/superword.cpp#L3206 > [3]: https://github.com/openjdk/jdk/blob/6013d09e82693a1c07cf0bf6daffd95114b3cbfa/src/hotspot/share/opto/superword.cpp#L3249 > [4]: https://github.com/openjdk/jdk/blob/6013d09e82693a1c07cf0bf6daffd95114b3cbfa/src/hotspot/share/opto/superword.cpp#L3251 > > Change-Id: I92ce42b550ef057964a3b58716436735275d8d31 Note, these changes need second review (it is not trivial). ------------- PR: https://git.openjdk.java.net/jdk/pull/7954 From dlong at openjdk.java.net Fri May 27 03:31:55 2022 From: dlong at openjdk.java.net (Dean Long) Date: Fri, 27 May 2022 03:31:55 GMT Subject: RFR: 8287396 LIR_Opr::vreg_number() and data() can return negative number Message-ID: This PR does two things: - reverts the incorrect change to non_data_bits that included pointer_bits - treats the data() as an unsigned int to prevent a high bit being treated as a negative number ------------- Commit messages: - remove pointer_bits from non_data_bits Changes: https://git.openjdk.java.net/jdk/pull/8912/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=8912&range=00 Issue: https://bugs.openjdk.java.net/browse/JDK-8287396 Stats: 6 lines in 1 file changed: 0 ins; 1 del; 5 mod Patch: https://git.openjdk.java.net/jdk/pull/8912.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8912/head:pull/8912 PR: https://git.openjdk.java.net/jdk/pull/8912 From kvn at openjdk.java.net Fri May 27 04:17:34 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Fri, 27 May 2022 04:17:34 GMT Subject: RFR: 8287396 LIR_Opr::vreg_number() and data() can return negative number In-Reply-To: References: Message-ID: <8IPJd2xLcoUi99wxpiURdovkBxgGcDwgupCn__4dndg=.307fd3dd-3dee-401c-8f50-8adfb759f3ca@github.com> On Fri, 27 May 2022 03:23:47 GMT, 
Dean Long wrote:

> This PR does two things:
> - reverts the incorrect change to non_data_bits that included pointer_bits
> - treats the data() as an unsigned int to prevent a high bit being treated as a negative number

Looks good.

-------------

Marked as reviewed by kvn (Reviewer).

PR: https://git.openjdk.java.net/jdk/pull/8912

From xlinzheng at openjdk.java.net Fri May 27 04:46:46 2022
From: xlinzheng at openjdk.java.net (Xiaolin Zheng)
Date: Fri, 27 May 2022 04:46:46 GMT
Subject: RFR: 8287418: riscv: Fix correctness issue of MacroAssembler::movptr
Message-ID:

Hi team,

`MacroAssembler::movptr()` is designed to load a 47-bit (unsigned) address constant in the range `[0x0, 0x7FFF_FFFF_FFFF]`, plus the special case -1 (the `Universe::non_oop_word()`, as we know, which is `0xFFFF_FFFF_FFFF_FFFF`). The former lie within the sv48 address space range[1]. Please note that under sv48 a valid user-space address has bit 47 equal to 0, so `MacroAssembler::movptr()` could cover all cases under sv48. However, when loading an immediate value in the range `[0x7FFF_8000_0000, 0x7FFF_FFFF_FFFF]` with it, the results would wrongly become `[0xFFFF_7FFF_8000_0000, 0xFFFF_7FFF_FFFF_FFFF]`, which indicates that the MSB has polluted the high bits in rare cases.

`MacroAssembler::movptr()` is a composition of `lui+addi+slli+addi+slli+addi`, and all of them are signed operations, as on MIPS.
Precisely, the first `lui+addi` loads the first 32 bits; then the `slli+addi` loads the next 11 bits; finally the last `slli+addi` loads the remaining 5 bits.

To deal with this, there are two approaches:

(a) Use an `addiw` to replace the first `addi`. `addiw` has nearly the same semantics as `addi`, but after the operation the result is sign-extended from bit 31. Thanks to this, we could use it to clean up the dirty high bits at all times. This could also handle the (-1) case.
However, `Assembler::li32()`, which is composed of `lui+addiw`, would conflict with the new implementation and need further adaptation. (Personally, I dislike that a bit.)

(b) As in V8's implementation [2], the trick is to load only the first 31 bits using `lui+addi`, with a leading 0 as bit 31. This prevents the issue from the start. As a trade-off, we need to shift by one more bit because the leading 0 occupies one bit. This approach can also handle the (-1) case after minor adaptations. (I like this one.)

This problem can be reproduced using `-XX:CompressedClassSpaceBaseAddress=0x7FFFF8000000 -XX:CompressedClassSpaceSize=40M -Xshare:off` with a fastdebug build, and on Qemu only, because I currently have no access to hardware that supports sv48, and the kernel that Ubuntu[3] relies on is Linux 5.15. The kernel (TIP) first checks whether the hardware supports sv57; if not, it falls back to sv48, and so on. sv48 has only been supported since Linux 5.17[4], so this issue could never be reproduced on my boards. Fortunately Qemu supports this, because one can mmap an address in the 48-bit address space even in a user-level Qemu.

Tested with `-XX:CompressedClassSpaceBaseAddress=0x7FFFF8000000 -XX:CompressedClassSpaceSize=40M -Xshare:off` (reproducible) on Qemu with hotspot tier1 (we should ignore OOMs caused by the compressed class space); other tiers are on the way. Sanity testing of hotspot tier1~tier4 (where the issue cannot be reproduced) is also in progress. Tier1 has finished without new failures.
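The failure mode described above can be sketched with a small Java model of the RV64 sign-extension rules for `lui`, `addi` and `addiw`. This is only an illustration, not the HotSpot code: the class `MovptrSignExtension`, its helper names, and the way the constant is split are assumptions that follow the 32 + 11 + 5 bit decomposition mentioned in this mail. It only models approach (a); approach (b) is not shown.

```java
// Hypothetical model for illustration; not taken from the JDK sources.
class MovptrSignExtension {
    // Sign-extend the low `bits` bits of v to 64 bits.
    static long sext(long v, int bits) {
        int shift = 64 - bits;
        return (v << shift) >> shift;
    }

    // RV64 lui: 20-bit immediate placed in bits [31:12], sign-extended from bit 31.
    static long lui(long imm20) {
        return sext((imm20 & 0xFFFFF) << 12, 32);
    }

    // RV64 addi: 64-bit add of a sign-extended 12-bit immediate.
    static long addi(long rs, long imm12) {
        return rs + sext(imm12, 12);
    }

    // RV64 addiw: 32-bit add whose result is sign-extended from bit 31.
    static long addiw(long rs, long imm12) {
        return sext(rs + sext(imm12, 12), 32);
    }

    // Assumed shape of the lui+addi+slli+addi+slli+addi sequence.
    // useAddiw selects approach (a): the first addi becomes addiw.
    static long movptr(long imm64, boolean useAddiw) {
        long hi32 = imm64 >>> 16;            // first 32 bits of the 48-bit constant
        long lower = sext(hi32, 12);         // signed 12-bit part for addi
        long upper = (hi32 - lower) >>> 12;  // compensated 20-bit part for lui
        long r = lui(upper);
        r = useAddiw ? addiw(r, lower) : addi(r, lower);
        r = (r << 11) + ((imm64 >>> 5) & 0x7FF); // slli 11, then the next 11 bits
        r = (r << 5) + (imm64 & 0x1F);           // slli 5, then the last 5 bits
        return r;
    }

    public static void main(String[] args) {
        long addr = 0x7FFFF8000000L; // the base address used in the reproducer flags
        System.out.printf("addi : 0x%016X%n", movptr(addr, false));
        System.out.printf("addiw: 0x%016X%n", movptr(addr, true));
    }
}
```

In this model, the `addi` variant returns `0xFFFF7FFFF8000000` for the reproducer address because `lui` receives an immediate with bit 31 set and sign-extends it, while the `addiw` variant returns the constant unchanged.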
Thanks, Xiaolin [1] https://github.com/riscv/riscv-isa-manual/blob/9ec8c0105dbf1492b57f6cafdb90a268628f476a/src/supervisor.tex#L1999-L2006 [2] https://github.com/v8/v8/blob/main/src/codegen/riscv64/assembler-riscv64.cc#L3479-L3495 [3] https://cdimage.ubuntu.com/releases/22.04/release/ [4] https://www.phoronix.com/scan.php?page=news_item&px=Linux-5.17-RISC-V-sv48 ------------- Commit messages: - Fix MSB overflow in signed lui operations Changes: https://git.openjdk.java.net/jdk/pull/8913/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=8913&range=00 Issue: https://bugs.openjdk.java.net/browse/JDK-8287418 Stats: 19 lines in 5 files changed: 1 ins; 0 del; 18 mod Patch: https://git.openjdk.java.net/jdk/pull/8913.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8913/head:pull/8913 PR: https://git.openjdk.java.net/jdk/pull/8913 From zmiao at openjdk.java.net Fri May 27 06:13:35 2022 From: zmiao at openjdk.java.net (Zhuojun Miao) Date: Fri, 27 May 2022 06:13:35 GMT Subject: RFR: JDK-8287288: Fix some typos in C1 [v2] In-Reply-To: References: Message-ID: > This is a trivial patch to fix some typos in C1. > e.g. https://github.com/openjdk/jdk/blob/master/src/hotspot/share/c1/c1_LIRGenerator.cpp#L1190 Zhuojun Miao has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. 
The pull request contains two additional commits since the last revision: - Complete the diagram with other non-data bits - JDK-8287288: Fix some typos in C1 ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/8880/files - new: https://git.openjdk.java.net/jdk/pull/8880/files/dc7e9dd7..2e385bcb Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=8880&range=01 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=8880&range=00-01 Stats: 3834 lines in 120 files changed: 3154 ins; 248 del; 432 mod Patch: https://git.openjdk.java.net/jdk/pull/8880.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8880/head:pull/8880 PR: https://git.openjdk.java.net/jdk/pull/8880 From fjiang at openjdk.java.net Fri May 27 06:44:43 2022 From: fjiang at openjdk.java.net (Feilong Jiang) Date: Fri, 27 May 2022 06:44:43 GMT Subject: RFR: 8287418: riscv: Fix correctness issue of MacroAssembler::movptr In-Reply-To: References: Message-ID: On Fri, 27 May 2022 04:37:01 GMT, Xiaolin Zheng wrote: > Hi team, > > `MacroAssembler::movptr()` is designed to load a 47-bit (unsigned) address constant, ranging `[0x0, 0x7FFF_FFFF_FFFF]`, and a special case -1 (`the Universe::non_oop_word()` as we know, which is `0xFFFF_FFFF_FFFF_FFFF`). The former ones are inside a sv48 address space range[1]. Please note that under sv48 a valid address has the bit 47 equal to 0 in user space, so that `MacroAssembler::movptr()` could cover all cases under sv48. However, when loading an immediate value ranging `[0x7FFF_8000_0000, 0x7FFF_FFFF_FFFF]` using it, the results would wrongly become `[0xFFFF_7FFF_8000_0000, 0xFFFF_7FFF_FFFF_FFFF]`, which indicates the MSB has polluted high bits in rare cases. > > `MacroAssembler::movptr()` is a composition of `lui+addi+slli+addi+slli+addi`, and all of them are signed operations, MIPS alike. 
> Precisely, the first `lui+addi` aims to load the first `32-bit`; then the `slli+addi` would load the `11-bit`; finally the last `slli+addi` is going to load the remaining `5-bit`. > > To deal with this, there are two approaches: > > (a) Use an `addiw` to replace the first `addi`. `addiw` has nearly the same semantics as `addi`, but after the operation the result would be sign-extended according to the bit 31. Due to this feature, we could use this to clean up the dirty high bits at all times. This could also handle the (-1) case. However, `Assembler::li32()`, which is composed of `lui+addiw`, will conflict with the new implementation, needing further adaptations. (Personally I a bit dislike of that) > > (b) Alike V8's implementation [2], the trick here is it loads only the first 31-bit using `lui+addi`, with a leading 0 as the bit 31. So this one could prevent this issue at the beginning. As a trade-off, we need to shift one another bit because the leading 0 occupies one bit. Also this one could also handle the (-1) case as well after minor adaptations. (I like this one) > > This problem could be reproduced using `-XX:CompressedClassSpaceBaseAddress=0x7FFFF8000000 -XX:CompressedClassSpaceSize=40M -Xshare:off` with fastdebug build, and on Qemu only, for currently I have no access to hardware that supports sv48, and the kernel Ubuntu[3] relies on is Linux 5.15. The kernel (TIP) would first check if hardware sponsors sv57, if not then fall back to sv48, and so on. It is not until Linux 5.17 that sv48 is supported[4]. So this issue could never be reproduced on my boards. But fortunately Qemu could sponsor this, because one could mmap an address in 48-bit address space even in a user-level Qemu. > > Tested with `-XX:CompressedClassSpaceBaseAddress=0x7FFFF8000000 -XX:CompressedClassSpaceSize=40M -Xshare:off` (reproducible) on Qemu with hotspot tier1 (we should ignore OOM caused the compressed class space), and other tiers are on the way. 
> Testing sanity hotspot tier1~tier4 (could not reproduce). Tier1 is finished without new failures. > > Thanks, > Xiaolin > > [1] https://github.com/riscv/riscv-isa-manual/blob/9ec8c0105dbf1492b57f6cafdb90a268628f476a/src/supervisor.tex#L1999-L2006 > [2] https://github.com/v8/v8/blob/main/src/codegen/riscv64/assembler-riscv64.cc#L3479-L3495 > [3] https://cdimage.ubuntu.com/releases/22.04/release/ > [4] https://www.phoronix.com/scan.php?page=news_item&px=Linux-5.17-RISC-V-sv48 Changes requested by fjiang (Author). src/hotspot/cpu/riscv/macroAssembler_riscv.cpp line 1187: > 1185: int64_t upper = ((intptr_t)target - lower) >> 29; > 1186: Assembler::patch(branch + 0, 31, 12, upper & 0xfffff); // Lui. target[47:28] + target[27] ==> branch[31:12] > 1187: Assembler::patch(branch + 4, 31, 20, (lower >> 17) & 0xfff); // Addiw. target[27:16] ==> branch[31:20] `Addiw` -> `Addi` ------------- PR: https://git.openjdk.java.net/jdk/pull/8913 From xlinzheng at openjdk.java.net Fri May 27 06:54:27 2022 From: xlinzheng at openjdk.java.net (Xiaolin Zheng) Date: Fri, 27 May 2022 06:54:27 GMT Subject: RFR: 8287418: riscv: Fix correctness issue of MacroAssembler::movptr [v2] In-Reply-To: References: Message-ID: <4ZWdHk5sOewdPNA1_Veskubb-dHc4g9DhrBd8boMX-o=.bf87d51e-e2af-431d-bd04-c25f443a0e0a@github.com> > Hi team, > > `MacroAssembler::movptr()` is designed to load a 47-bit (unsigned) address constant, ranging `[0x0, 0x7FFF_FFFF_FFFF]`, and a special case -1 (`the Universe::non_oop_word()` as we know, which is `0xFFFF_FFFF_FFFF_FFFF`). The former ones are inside a sv48 address space range[1]. Please note that under sv48 a valid address has the bit 47 equal to 0 in user space, so that `MacroAssembler::movptr()` could cover all cases under sv48. However, when loading an immediate value ranging `[0x7FFF_8000_0000, 0x7FFF_FFFF_FFFF]` using it, the results would wrongly become `[0xFFFF_7FFF_8000_0000, 0xFFFF_7FFF_FFFF_FFFF]`, which indicates the MSB has polluted high bits in rare cases. 
> > `MacroAssembler::movptr()` is a composition of `lui+addi+slli+addi+slli+addi`, and all of them are signed operations, MIPS alike. > Precisely, the first `lui+addi` aims to load the first `32-bit`; then the `slli+addi` would load the `11-bit`; finally the last `slli+addi` is going to load the remaining `5-bit`. > > To deal with this, there are two approaches: > > (a) Use an `addiw` to replace the first `addi`. `addiw` has nearly the same semantics as `addi`, but after the operation the result would be sign-extended according to the bit 31. Due to this feature, we could use this to clean up the dirty high bits at all times. This could also handle the (-1) case. However, `Assembler::li32()`, which is composed of `lui+addiw`, will conflict with the new implementation, needing further adaptations. (Personally I a bit dislike of that) > > (b) Alike V8's implementation [2], the trick here is it loads only the first 31-bit using `lui+addi`, with a leading 0 as the bit 31. So this one could prevent this issue at the beginning. As a trade-off, we need to shift one another bit because the leading 0 occupies one bit. Also this one could also handle the (-1) case as well after minor adaptations. (I like this one) > > This problem could be reproduced using `-XX:CompressedClassSpaceBaseAddress=0x7FFFF8000000 -XX:CompressedClassSpaceSize=40M -Xshare:off` with fastdebug build, and on Qemu only, for currently I have no access to hardware that supports sv48, and the kernel Ubuntu[3] relies on is Linux 5.15. The kernel (TIP) would first check if hardware sponsors sv57, if not then fall back to sv48, and so on. It is not until Linux 5.17 that sv48 is supported[4]. So this issue could never be reproduced on my boards. But fortunately Qemu could sponsor this, because one could mmap an address in 48-bit address space even in a user-level Qemu. 
> > Tested with `-XX:CompressedClassSpaceBaseAddress=0x7FFFF8000000 -XX:CompressedClassSpaceSize=40M -Xshare:off` (reproducible) on Qemu with hotspot tier1 (we should ignore OOM caused the compressed class space), and other tiers are on the way. > Testing sanity hotspot tier1~tier4 (could not reproduce). Tier1 is finished without new failures. > > Thanks, > Xiaolin > > [1] https://github.com/riscv/riscv-isa-manual/blob/9ec8c0105dbf1492b57f6cafdb90a268628f476a/src/supervisor.tex#L1999-L2006 > [2] https://github.com/v8/v8/blob/main/src/codegen/riscv64/assembler-riscv64.cc#L3479-L3495 > [3] https://cdimage.ubuntu.com/releases/22.04/release/ > [4] https://www.phoronix.com/scan.php?page=news_item&px=Linux-5.17-RISC-V-sv48 Xiaolin Zheng has updated the pull request incrementally with one additional commit since the last revision: Fix a typo in comments ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/8913/files - new: https://git.openjdk.java.net/jdk/pull/8913/files/38bc9264..3268630c Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=8913&range=01 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=8913&range=00-01 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.java.net/jdk/pull/8913.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8913/head:pull/8913 PR: https://git.openjdk.java.net/jdk/pull/8913 From yadongwang at openjdk.java.net Fri May 27 06:57:36 2022 From: yadongwang at openjdk.java.net (Yadong Wang) Date: Fri, 27 May 2022 06:57:36 GMT Subject: RFR: 8287418: riscv: Fix correctness issue of MacroAssembler::movptr [v2] In-Reply-To: <4ZWdHk5sOewdPNA1_Veskubb-dHc4g9DhrBd8boMX-o=.bf87d51e-e2af-431d-bd04-c25f443a0e0a@github.com> References: <4ZWdHk5sOewdPNA1_Veskubb-dHc4g9DhrBd8boMX-o=.bf87d51e-e2af-431d-bd04-c25f443a0e0a@github.com> Message-ID: On Fri, 27 May 2022 06:54:27 GMT, Xiaolin Zheng wrote: >> Hi team, >> >> `MacroAssembler::movptr()` is designed to load a 47-bit (unsigned) address 
constant, ranging `[0x0, 0x7FFF_FFFF_FFFF]`, and a special case -1 (`the Universe::non_oop_word()` as we know, which is `0xFFFF_FFFF_FFFF_FFFF`). The former ones are inside a sv48 address space range[1]. Please note that under sv48 a valid address has the bit 47 equal to 0 in user space, so that `MacroAssembler::movptr()` could cover all cases under sv48. However, when loading an immediate value ranging `[0x7FFF_8000_0000, 0x7FFF_FFFF_FFFF]` using it, the results would wrongly become `[0xFFFF_7FFF_8000_0000, 0xFFFF_7FFF_FFFF_FFFF]`, which indicates the MSB has polluted high bits in rare cases. >> >> `MacroAssembler::movptr()` is a composition of `lui+addi+slli+addi+slli+addi`, and all of them are signed operations, MIPS alike. >> Precisely, the first `lui+addi` aims to load the first `32-bit`; then the `slli+addi` would load the `11-bit`; finally the last `slli+addi` is going to load the remaining `5-bit`. >> >> To deal with this, there are two approaches: >> >> (a) Use an `addiw` to replace the first `addi`. `addiw` has nearly the same semantics as `addi`, but after the operation the result would be sign-extended according to the bit 31. Due to this feature, we could use this to clean up the dirty high bits at all times. This could also handle the (-1) case. However, `Assembler::li32()`, which is composed of `lui+addiw`, will conflict with the new implementation, needing further adaptations. (Personally I a bit dislike of that) >> >> (b) Alike V8's implementation [2], the trick here is it loads only the first 31-bit using `lui+addi`, with a leading 0 as the bit 31. So this one could prevent this issue at the beginning. As a trade-off, we need to shift one another bit because the leading 0 occupies one bit. Also this one could also handle the (-1) case as well after minor adaptations. 
(I like this one) >> >> This problem could be reproduced using `-XX:CompressedClassSpaceBaseAddress=0x7FFFF8000000 -XX:CompressedClassSpaceSize=40M -Xshare:off` with fastdebug build, and on Qemu only, for currently I have no access to hardware that supports sv48, and the kernel Ubuntu[3] relies on is Linux 5.15. The kernel (TIP) would first check if hardware sponsors sv57, if not then fall back to sv48, and so on. It is not until Linux 5.17 that sv48 is supported[4]. So this issue could never be reproduced on my boards. But fortunately Qemu could sponsor this, because one could mmap an address in 48-bit address space even in a user-level Qemu. >> >> Tested with `-XX:CompressedClassSpaceBaseAddress=0x7FFFF8000000 -XX:CompressedClassSpaceSize=40M -Xshare:off` (reproducible) on Qemu with hotspot tier1 (we should ignore OOM caused the compressed class space), and other tiers are on the way. >> Testing sanity hotspot tier1~tier4 (could not reproduce). Tier1 is finished without new failures. >> >> Thanks, >> Xiaolin >> >> [1] https://github.com/riscv/riscv-isa-manual/blob/9ec8c0105dbf1492b57f6cafdb90a268628f476a/src/supervisor.tex#L1999-L2006 >> [2] https://github.com/v8/v8/blob/main/src/codegen/riscv64/assembler-riscv64.cc#L3479-L3495 >> [3] https://cdimage.ubuntu.com/releases/22.04/release/ >> [4] https://www.phoronix.com/scan.php?page=news_item&px=Linux-5.17-RISC-V-sv48 > > Xiaolin Zheng has updated the pull request incrementally with one additional commit since the last revision: > > Fix a typo in comments lgtm ------------- Marked as reviewed by yadongwang (Author). 
PR: https://git.openjdk.java.net/jdk/pull/8913 From dholmes at openjdk.java.net Fri May 27 06:59:42 2022 From: dholmes at openjdk.java.net (David Holmes) Date: Fri, 27 May 2022 06:59:42 GMT Subject: RFR: JDK-8287288: Fix some typos in C1 [v2] In-Reply-To: References: Message-ID: <4gS045MAg0SY1UDFBb1DpCqrRSCpjEyQl25H0MPpEMs=.fd950026-0d72-4f71-9dd0-5de0a6dc199a@github.com> On Fri, 27 May 2022 06:13:35 GMT, Zhuojun Miao wrote: >> This is a trivial patch to fix some typos in C1. >> e.g. https://github.com/openjdk/jdk/blob/master/src/hotspot/share/c1/c1_LIRGenerator.cpp#L1190 > > Zhuojun Miao has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains two additional commits since the last revision: > > - Complete the diagram with other non-data bits > - JDK-8287288: Fix some typos in C1 I'm okay with this as-is as we risk going too far for a "typo" cleanup. Thanks. @miaozhuojun please do not force-push to an active PR as it destroys the context for previous review comments. You can simply merge and push the merge changeset to the PR, it will all be flattened to a single commit when integrated. Thank you. ------------- Marked as reviewed by dholmes (Reviewer). 
PR: https://git.openjdk.java.net/jdk/pull/8880 From xlinzheng at openjdk.java.net Fri May 27 07:03:21 2022 From: xlinzheng at openjdk.java.net (Xiaolin Zheng) Date: Fri, 27 May 2022 07:03:21 GMT Subject: RFR: 8287418: riscv: Fix correctness issue of MacroAssembler::movptr [v2] In-Reply-To: References: Message-ID: On Fri, 27 May 2022 06:41:07 GMT, Feilong Jiang wrote: >> Xiaolin Zheng has updated the pull request incrementally with one additional commit since the last revision: >> >> Fix a typo in comments > > src/hotspot/cpu/riscv/macroAssembler_riscv.cpp line 1187: > >> 1185: int64_t upper = ((intptr_t)target - lower) >> 29; >> 1186: Assembler::patch(branch + 0, 31, 12, upper & 0xfffff); // Lui. target[47:28] + target[27] ==> branch[31:12] >> 1187: Assembler::patch(branch + 4, 31, 20, (lower >> 17) & 0xfff); // Addiw. target[27:16] ==> branch[31:20] > > `Addiw` -> `Addi` > > Are these comments still right? > > target[27:16] ==> branch[31:20] > target[15: 5] ==> branch[31:20] > target[ 4: 0] ==> branch[31:20] Oh, thank you. My IDE prevents me from seeing the latter comments (the lines are a bit long). I would change that. ------------- PR: https://git.openjdk.java.net/jdk/pull/8913 From xlinzheng at openjdk.java.net Fri May 27 07:24:32 2022 From: xlinzheng at openjdk.java.net (Xiaolin Zheng) Date: Fri, 27 May 2022 07:24:32 GMT Subject: RFR: 8287418: riscv: Fix correctness issue of MacroAssembler::movptr [v3] In-Reply-To: References: Message-ID: > Hi team, > > `MacroAssembler::movptr()` is designed to load a 47-bit (unsigned) address constant, ranging `[0x0, 0x7FFF_FFFF_FFFF]`, and a special case -1 (`the Universe::non_oop_word()` as we know, which is `0xFFFF_FFFF_FFFF_FFFF`). The former ones are inside a sv48 address space range[1]. Please note that under sv48 a valid address has the bit 47 equal to 0 in user space, so that `MacroAssembler::movptr()` could cover all cases under sv48. 
However, when loading an immediate value ranging `[0x7FFF_8000_0000, 0x7FFF_FFFF_FFFF]` using it, the results would wrongly become `[0xFFFF_7FFF_8000_0000, 0xFFFF_7FFF_FFFF_FFFF]`, which indicates the MSB has polluted high bits in rare cases. > > `MacroAssembler::movptr()` is a composition of `lui+addi+slli+addi+slli+addi`, and all of them are signed operations, MIPS alike. > Precisely, the first `lui+addi` aims to load the first `32-bit`; then the `slli+addi` would load the `11-bit`; finally the last `slli+addi` is going to load the remaining `5-bit`. > > To deal with this, there are two approaches: > > (a) Use an `addiw` to replace the first `addi`. `addiw` has nearly the same semantics as `addi`, but after the operation the result would be sign-extended according to the bit 31. Due to this feature, we could use this to clean up the dirty high bits at all times. This could also handle the (-1) case. However, `Assembler::li32()`, which is composed of `lui+addiw`, will conflict with the new implementation, needing further adaptations. (Personally I a bit dislike of that) > > (b) Alike V8's implementation [2], the trick here is it loads only the first 31-bit using `lui+addi`, with a leading 0 as the bit 31. So this one could prevent this issue at the beginning. As a trade-off, we need to shift one another bit because the leading 0 occupies one bit. Also this one could also handle the (-1) case as well after minor adaptations. (I like this one) > > This problem could be reproduced using `-XX:CompressedClassSpaceBaseAddress=0x7FFFF8000000 -XX:CompressedClassSpaceSize=40M -Xshare:off` with fastdebug build, and on Qemu only, for currently I have no access to hardware that supports sv48, and the kernel Ubuntu[3] relies on is Linux 5.15. The kernel (TIP) would first check if hardware sponsors sv57, if not then fall back to sv48, and so on. It is not until Linux 5.17 that sv48 is supported[4]. So this issue could never be reproduced on my boards. 
But fortunately Qemu could sponsor this, because one could mmap an address in 48-bit address space even in a user-level Qemu. > > Tested with `-XX:CompressedClassSpaceBaseAddress=0x7FFFF8000000 -XX:CompressedClassSpaceSize=40M -Xshare:off` (reproducible) on Qemu with hotspot tier1 (we should ignore OOM caused the compressed class space), and other tiers are on the way. > Testing sanity hotspot tier1~tier4 (could not reproduce). Tier1 is finished without new failures. > > Thanks, > Xiaolin > > [1] https://github.com/riscv/riscv-isa-manual/blob/9ec8c0105dbf1492b57f6cafdb90a268628f476a/src/supervisor.tex#L1999-L2006 > [2] https://github.com/v8/v8/blob/main/src/codegen/riscv64/assembler-riscv64.cc#L3479-L3495 > [3] https://cdimage.ubuntu.com/releases/22.04/release/ > [4] https://www.phoronix.com/scan.php?page=news_item&px=Linux-5.17-RISC-V-sv48 Xiaolin Zheng has updated the pull request incrementally with one additional commit since the last revision: Fix comments in `patch_addr_in_movptr` after this change ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/8913/files - new: https://git.openjdk.java.net/jdk/pull/8913/files/3268630c..9d45be1b Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=8913&range=02 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=8913&range=01-02 Stats: 4 lines in 1 file changed: 0 ins; 0 del; 4 mod Patch: https://git.openjdk.java.net/jdk/pull/8913.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8913/head:pull/8913 PR: https://git.openjdk.java.net/jdk/pull/8913 From xlinzheng at openjdk.java.net Fri May 27 07:24:49 2022 From: xlinzheng at openjdk.java.net (Xiaolin Zheng) Date: Fri, 27 May 2022 07:24:49 GMT Subject: RFR: 8287418: riscv: Fix correctness issue of MacroAssembler::movptr [v3] In-Reply-To: References: Message-ID: <3rGKkasj2xCO49abJu6SZdfv48q6bvjWWJputr0rL3Y=.023888af-fd85-4d2a-9644-bb4d83a03224@github.com> On Fri, 27 May 2022 06:59:04 GMT, Xiaolin Zheng wrote: >> 
src/hotspot/cpu/riscv/macroAssembler_riscv.cpp line 1187: >> >>> 1185: int64_t upper = ((intptr_t)target - lower) >> 29; >>> 1186: Assembler::patch(branch + 0, 31, 12, upper & 0xfffff); // Lui. target[47:28] + target[27] ==> branch[31:12] >>> 1187: Assembler::patch(branch + 4, 31, 20, (lower >> 17) & 0xfff); // Addiw. target[27:16] ==> branch[31:20] >> >> `Addiw` -> `Addi` >> >> Are these comments still right? >> >> target[27:16] ==> branch[31:20] >> target[15: 5] ==> branch[31:20] >> target[ 4: 0] ==> branch[31:20] > > Oh, thank you. My IDE prevents me from seeing the latter comments (the lines are a bit long). I would change that. Comments fixed. Would you please have another review of the change? ------------- PR: https://git.openjdk.java.net/jdk/pull/8913 From fjiang at openjdk.java.net Fri May 27 07:36:45 2022 From: fjiang at openjdk.java.net (Feilong Jiang) Date: Fri, 27 May 2022 07:36:45 GMT Subject: RFR: 8287418: riscv: Fix correctness issue of MacroAssembler::movptr [v3] In-Reply-To: References: Message-ID: On Fri, 27 May 2022 07:24:32 GMT, Xiaolin Zheng wrote: >> Hi team, >> >> `MacroAssembler::movptr()` is designed to load a 47-bit (unsigned) address constant, ranging `[0x0, 0x7FFF_FFFF_FFFF]`, and a special case -1 (`the Universe::non_oop_word()` as we know, which is `0xFFFF_FFFF_FFFF_FFFF`). The former ones are inside a sv48 address space range[1]. Please note that under sv48 a valid address has the bit 47 equal to 0 in user space, so that `MacroAssembler::movptr()` could cover all cases under sv48. However, when loading an immediate value ranging `[0x7FFF_8000_0000, 0x7FFF_FFFF_FFFF]` using it, the results would wrongly become `[0xFFFF_7FFF_8000_0000, 0xFFFF_7FFF_FFFF_FFFF]`, which indicates the MSB has polluted high bits in rare cases. >> >> `MacroAssembler::movptr()` is a composition of `lui+addi+slli+addi+slli+addi`, and all of them are signed operations, MIPS alike. 
>> Precisely, the first `lui+addi` aims to load the first `32-bit`; then the `slli+addi` would load the `11-bit`; finally the last `slli+addi` is going to load the remaining `5-bit`. >> >> To deal with this, there are two approaches: >> >> (a) Use an `addiw` to replace the first `addi`. `addiw` has nearly the same semantics as `addi`, but after the operation the result would be sign-extended according to the bit 31. Due to this feature, we could use this to clean up the dirty high bits at all times. This could also handle the (-1) case. However, `Assembler::li32()`, which is composed of `lui+addiw`, will conflict with the new implementation, needing further adaptations. (Personally I a bit dislike of that) >> >> (b) Alike V8's implementation [2], the trick here is it loads only the first 31-bit using `lui+addi`, with a leading 0 as the bit 31. So this one could prevent this issue at the beginning. As a trade-off, we need to shift one another bit because the leading 0 occupies one bit. Also this one could also handle the (-1) case as well after minor adaptations. (I like this one) >> >> This problem could be reproduced using `-XX:CompressedClassSpaceBaseAddress=0x7FFFF8000000 -XX:CompressedClassSpaceSize=40M -Xshare:off` with fastdebug build, and on Qemu only, for currently I have no access to hardware that supports sv48, and the kernel Ubuntu[3] relies on is Linux 5.15. The kernel (TIP) would first check if hardware sponsors sv57, if not then fall back to sv48, and so on. It is not until Linux 5.17 that sv48 is supported[4]. So this issue could never be reproduced on my boards. But fortunately Qemu could sponsor this, because one could mmap an address in 48-bit address space even in a user-level Qemu. >> >> Tested with `-XX:CompressedClassSpaceBaseAddress=0x7FFFF8000000 -XX:CompressedClassSpaceSize=40M -Xshare:off` (reproducible) on Qemu with hotspot tier1 (we should ignore OOM caused the compressed class space), and other tiers are on the way. 
>> Testing sanity hotspot tier1~tier4 (could not reproduce). Tier1 is finished without new failures. >> >> Thanks, >> Xiaolin >> >> [1] https://github.com/riscv/riscv-isa-manual/blob/9ec8c0105dbf1492b57f6cafdb90a268628f476a/src/supervisor.tex#L1999-L2006 >> [2] https://github.com/v8/v8/blob/main/src/codegen/riscv64/assembler-riscv64.cc#L3479-L3495 >> [3] https://cdimage.ubuntu.com/releases/22.04/release/ >> [4] https://www.phoronix.com/scan.php?page=news_item&px=Linux-5.17-RISC-V-sv48 > > Xiaolin Zheng has updated the pull request incrementally with one additional commit since the last revision: > > Fix comments in `patch_addr_in_movptr` after this change New changes look good, thanks. ------------- Marked as reviewed by fjiang (Author). PR: https://git.openjdk.java.net/jdk/pull/8913 From fjiang at openjdk.java.net Fri May 27 07:36:47 2022 From: fjiang at openjdk.java.net (Feilong Jiang) Date: Fri, 27 May 2022 07:36:47 GMT Subject: RFR: 8287418: riscv: Fix correctness issue of MacroAssembler::movptr [v3] In-Reply-To: <3rGKkasj2xCO49abJu6SZdfv48q6bvjWWJputr0rL3Y=.023888af-fd85-4d2a-9644-bb4d83a03224@github.com> References: <3rGKkasj2xCO49abJu6SZdfv48q6bvjWWJputr0rL3Y=.023888af-fd85-4d2a-9644-bb4d83a03224@github.com> Message-ID: On Fri, 27 May 2022 07:21:11 GMT, Xiaolin Zheng wrote: >> Oh, thank you. My IDE prevents me from seeing the latter comments (the lines are a bit long). I would change that. > > Comments fixed. Would you please have another review of the change? Looks good now, thanks. 
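The sign-extension problem discussed in this thread can be reproduced in plain integer arithmetic. The following is only a rough model, not the HotSpot code: it assumes the pre-fix split is 32 + 11 + 5 bits (as the description says) and that approach (b) uses a 31 + 11 + 6 split; the class and method names are hypothetical and chosen just for this illustration.

```java
public class MovptrModel {
    // Sign-extend the low 12 bits of v, as `addi` does with its immediate.
    static long signExtend12(long v) {
        return (v << 52) >> 52;
    }

    // Model of `lui+addi`: `lui` produces a 32-bit value that is
    // sign-extended from bit 31, then `addi` adds a signed 12-bit immediate.
    static long luiAddi(long imm) {
        long lower = signExtend12(imm);
        long upper = (int) (imm - lower); // wraps to 32 bits, then sign-extends
        return upper + lower;
    }

    // Assumed pre-fix layout: bits 47:16 via lui+addi, then 11 bits, then 5 bits.
    static long movptrOld(long addr) {
        long rd = luiAddi(addr >> 16);
        rd = (rd << 11) + ((addr >> 5) & 0x7ff);
        rd = (rd << 5) + (addr & 0x1f);
        return rd;
    }

    // Assumed approach (b) layout: bits 47:17 via lui+addi (bit 31 stays 0),
    // then 11 bits, then 6 bits.
    static long movptrNew(long addr) {
        long rd = luiAddi(addr >> 17);
        rd = (rd << 11) + ((addr >> 6) & 0x7ff);
        rd = (rd << 6) + (addr & 0x3f);
        return rd;
    }

    public static void main(String[] args) {
        long addr = 0x7FFFF8000000L; // the address used in the reproducer flags
        System.out.printf("old: 0x%016X%n", movptrOld(addr)); // 0xFFFF7FFFF8000000
        System.out.printf("new: 0x%016X%n", movptrNew(addr)); // 0x00007FFFF8000000
    }
}
```

In this model the old split fails for the reproducer address because the rounded-up `lui` part reaches `0x80000000`, which sign-extends to a negative value; with only 31 bits loaded first, the `lui` part never exceeds `0x40000000`.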
------------- PR: https://git.openjdk.java.net/jdk/pull/8913 From dlong at openjdk.java.net Fri May 27 07:37:51 2022 From: dlong at openjdk.java.net (Dean Long) Date: Fri, 27 May 2022 07:37:51 GMT Subject: RFR: 8287396 LIR_Opr::vreg_number() and data() can return negative number In-Reply-To: References: Message-ID: On Fri, 27 May 2022 03:23:47 GMT, Dean Long wrote: > This PR does two things: > - reverts the incorrect change to non_data_bits that included pointer_bits > - treats the data() as an unsigned int to prevent a high bit being treated as a negative number I got one timeout in the new test compiler/c1/TestTooManyVirtualRegistersMain.java from JDK-8261235. My change doubles the maximum number of virtual registers allowed, from 0x20000 to 0x40000. That probably explains why the test runs so long. Both numbers seem insanely large. Maybe we should set vreg_max to something more reasonable? ------------- PR: https://git.openjdk.java.net/jdk/pull/8912 From zmiao at openjdk.java.net Fri May 27 08:52:41 2022 From: zmiao at openjdk.java.net (Zhuojun Miao) Date: Fri, 27 May 2022 08:52:41 GMT Subject: RFR: JDK-8287288: Fix some typos in C1 [v2] In-Reply-To: References: Message-ID: On Wed, 25 May 2022 11:02:59 GMT, Andrew Haley wrote: >> Zhuojun Miao has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains two commits: >> >> - Complete the diagram with other non-data bits >> - JDK-8287288: Fix some typos in C1 > > Marked as reviewed by aph (Reviewer). Thanks @theRealAph @dholmes-ora and @dean-long for the review. 
------------- PR: https://git.openjdk.java.net/jdk/pull/8880 From xlinzheng at openjdk.java.net Fri May 27 09:26:00 2022 From: xlinzheng at openjdk.java.net (Xiaolin Zheng) Date: Fri, 27 May 2022 09:26:00 GMT Subject: RFR: 8287425: Remove unnecessary register push for MacroAssembler::check_klass_subtype_slow_path Message-ID: Hi team, ![AE98A8E7-9F6F-4722-B310-299A9A96A957](https://user-images.githubusercontent.com/38156692/170670906-2ce37a13-af21-4cf8-acbd-ca24528bc3a9.png) Some perf results show unnecessary pushes in `MacroAssembler::check_klass_subtype_slow_path()` under `UseCompressedOops`. History logs show the original code is like [1], and it gets refactored in [JDK-6813212](https://bugs.openjdk.java.net/browse/JDK-6813212), and the counterparts of the `UseCompressedOops` in the diff are at [2] and [3], and we could see the push of rax is just because `encode_heap_oop_not_null()` would kill it, so a push and restore are needed here. After that, [JDK-6964458](https://bugs.openjdk.java.net/browse/JDK-6964458) (removal of perm gen) at [4] removed [3] so that there is no need to do UseCompressedOops work in `MacroAssembler::check_klass_subtype_slow_path()`; but in that patch [2] didn't get removed, so we finally come here. As a result, [2] could also be safely removed. I was wondering if this minor change could be sponsored? This enhancement is raised on behalf of Wei Kuai . Tested x86_64 hotspot tier1~tier4 twice, aarch64 hotspot tier1~tier4 once with another jdk tier1 once, and riscv64 hotspot tier1~tier4 once. 
Thanks, Xiaolin [1] https://github.com/openjdk/jdk/blob/de67e5294982ce197f2abd051cbb1c8aa6c29499/hotspot/src/cpu/x86/vm/interp_masm_x86_64.cpp#L273-L284 [2] https://github.com/openjdk/jdk/commit/b8dbe8d8f650124b61a4ce8b70286b5b444a3316#diff-beb6684583b0a552a99bbe4b5a21828489a6d689b32a05e1a9af8c3be9f463c3R7441-R7444 [3] https://github.com/openjdk/jdk/commit/b8dbe8d8f650124b61a4ce8b70286b5b444a3316#diff-beb6684583b0a552a99bbe4b5a21828489a6d689b32a05e1a9af8c3be9f463c3R7466-R7477 [4] https://github.com/openjdk/jdk/commit/5c58d27aac7b291b879a7a3ff6f39fca25619103#diff-beb6684583b0a552a99bbe4b5a21828489a6d689b32a05e1a9af8c3be9f463c3L9347-L9361 ------------- Commit messages: - Enhancement Changes: https://git.openjdk.java.net/jdk/pull/8915/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=8915&range=00 Issue: https://bugs.openjdk.java.net/browse/JDK-8287425 Stats: 3 lines in 3 files changed: 0 ins; 0 del; 3 mod Patch: https://git.openjdk.java.net/jdk/pull/8915.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8915/head:pull/8915 PR: https://git.openjdk.java.net/jdk/pull/8915 From adinn at openjdk.java.net Fri May 27 10:09:54 2022 From: adinn at openjdk.java.net (Andrew Dinn) Date: Fri, 27 May 2022 10:09:54 GMT Subject: RFR: 8287195: AArch64: Client VM build failure after JDK-8283689 In-Reply-To: References: Message-ID: On Thu, 26 May 2022 19:39:42 GMT, Nick Gasson wrote: > The client build fails because foreignGlobals_aarch64.cpp uses Matcher without `#ifdef COMPILER2`. However there's a latent bug here on SVE machines where `RegSpiller::pd_reg_size()` returns the SVE register size but other code that uses the register spill area (e.g. `DowncallStubGenerator::generate()` and `AArch64Architecture.VECTOR_REG_SIZE`) assume that we always save 16 bytes. Since we don't support any calling conventions where arguments/results are passed in long vectors we should > just save the first 128 bits of the register, like x86 does. 
Marked as reviewed by adinn (Reviewer). ------------- PR: https://git.openjdk.java.net/jdk/pull/8908 From adinn at openjdk.java.net Fri May 27 10:13:45 2022 From: adinn at openjdk.java.net (Andrew Dinn) Date: Fri, 27 May 2022 10:13:45 GMT Subject: Integrated: 8282182: Document algorithm used to encode aarch64 logical immediate operands. In-Reply-To: References: Message-ID: On Fri, 18 Feb 2022 17:12:08 GMT, Andrew Dinn wrote: > This *documentation only* change explains how logical immediate mask values are derived from valid logical instruction operands. The encoding function is used to populate a sparse array that maps valid masks to a unique set of input operand values and a reverse lookup array that maps inputs to the associated mask. This pull request has now been integrated. Changeset: 22e20673 Author: Andrew Dinn URL: https://git.openjdk.java.net/jdk/commit/22e2067349fc8a82bea214a30f5e975bbebcb44b Stats: 84 lines in 1 file changed: 77 ins; 1 del; 6 mod 8282182: Document algorithm used to encode aarch64 logical immediate operands. Reviewed-by: ngasson, aph ------------- PR: https://git.openjdk.java.net/jdk/pull/7536 From rcastanedalo at openjdk.java.net Fri May 27 10:19:08 2022 From: rcastanedalo at openjdk.java.net (Roberto =?UTF-8?B?Q2FzdGHDsWVkYQ==?= Lozano) Date: Fri, 27 May 2022 10:19:08 GMT Subject: RFR: 8285558: IGV: scheduling crashes on control-unreachable CFG nodes Message-ID: This changeset relaxes IGV's schedule approximation algorithm to handle CFG nodes that are not reachable from the root node via a control path. This is done by 1) removing the assumption that, after `ServerCompilerScheduler::buildBlocks()`, `Node::block` is non-null for CFG nodes; and 2) leaving the assignment of blockless nodes to an artificial "no block" to `InputGraph::ensureNodesInBlocks()`, which is always called after running the schedule approximation algorithm. 
Additionally, the changeset marks unreachable CFG nodes with a warning, making it easier to identify ill-formed graphs: ![screenshot of IGV graph where some nodes are marked with a warning](https://user-images.githubusercontent.com/8792647/170678998-bffc293f-90ce-45e7-9aea-7cc5ab184026.png) #### Testing - Tested manually on the two graphs reported in the [JBS issue](https://bugs.openjdk.java.net/browse/JDK-8285558). - Tested automatically that scheduling tens of thousands of graphs (by instrumenting IGV to schedule parsed graphs eagerly and running `java -Xcomp -XX:-TieredCompilation -XX:PrintIdealGraphLevel=4`) does not introduce any exception or assertion failure. ------------- Commit messages: - Warn about control-unreachable CFG nodes - Handle control-unreachable nodes Changes: https://git.openjdk.java.net/jdk/pull/8916/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=8916&range=00 Issue: https://bugs.openjdk.java.net/browse/JDK-8285558 Stats: 24 lines in 1 file changed: 10 ins; 14 del; 0 mod Patch: https://git.openjdk.java.net/jdk/pull/8916.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8916/head:pull/8916 PR: https://git.openjdk.java.net/jdk/pull/8916 From rcastanedalo at openjdk.java.net Fri May 27 12:57:28 2022 From: rcastanedalo at openjdk.java.net (Roberto =?UTF-8?B?Q2FzdGHDsWVkYQ==?= Lozano) Date: Fri, 27 May 2022 12:57:28 GMT Subject: RFR: 8287438: IGV: scheduling crashes on non-block-start Region with multiple predecessors Message-ID: IGV scheduling crashes when breaking critical edges that target Region nodes not marked with the `is_block_start` property, by failing to create an appropriate basic block between the source and the destination of the critical edge (see the JBS bug report for more detail). This changeset ensures that such a basic block is created even when the destination node is not marked with `is_block_start`. 
#### Testing - Tested manually on the [graph](https://bugs.openjdk.java.net/secure/attachment/99125/failure.zip) reported in the JBS issue. - Tested automatically that scheduling tens of thousands of graphs (by instrumenting IGV to schedule parsed graphs eagerly and running `java -Xcomp -XX:-TieredCompilation -XX:PrintIdealGraphLevel=4`) does not trigger any exception or assertion failure. ------------- Commit messages: - Always start a new block after a dummy node is visited Changes: https://git.openjdk.java.net/jdk/pull/8921/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=8921&range=00 Issue: https://bugs.openjdk.java.net/browse/JDK-8287438 Stats: 8 lines in 1 file changed: 6 ins; 0 del; 2 mod Patch: https://git.openjdk.java.net/jdk/pull/8921.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8921/head:pull/8921 PR: https://git.openjdk.java.net/jdk/pull/8921 From kvn at openjdk.java.net Fri May 27 15:35:46 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Fri, 27 May 2022 15:35:46 GMT Subject: RFR: 8282470: Eliminate useless sign extension before some subword integer operations [v3] In-Reply-To: References: Message-ID: On Thu, 26 May 2022 06:18:33 GMT, Fei Gao wrote: >> Some loop cases of subword types, including byte and short, can't be vectorized by C2's SLP. Here is an example: >> >> short[] addShort(short[] a, short[] b, short[] c) { >> for (int i = 0; i < SIZE; i++) { >> b[i] = (short) (a[i] + 8); // line A >> sres[i] = (short) (b[i] + c[i]); // line B >> } >> } >> >> However, similar cases of int/float/double/long/char type can be vectorized successfully. >> >> The reason why SLP can't vectorize the short case above is that, as illustrated here[1], the result of the scalar add operation on *line A* has been promoted to int type. It needs to be narrowed to short type first before it can work as one of the source operands of the addition on *line B*. The demotion is done by left-shifting 16 bits then right-shifting 16 bits. 
The ideal graph for the process is shown below. >> ![image](https://user-images.githubusercontent.com/39403138/160074255-c751f84b-6511-4b56-927b-53fb512cf51b.png) >> >> In SLP, for most short-type cases, we can determine the precise type of the scalar int-type operation and finally execute it with short-type vector operations[2], except rshift opcode and abs in some situations[3]. But in this case, the source operand of RShiftI is from LShiftI rather than from any LoadS[4], so we can't determine its real type and conservatively assign it with int type rather than real short type. The int-type operation RShiftI here can't be vectorized together with other short-type operations, like AddI(line B). The reason for byte loop cases is the same. Similar loop cases of char type could be vectorized because its demotion from int to char is done by `and` with mask rather than `lshift_rshift`. >> >> Therefore, we try to remove the patterns like `RShiftI _ (LShiftI _ valIn1 conIL ) conIR` in the byte/short cases, to vectorize more scenarios. Optimizing it in the mid-end by i-GVN is more reasonable. >> >> What we do in the mid-end is eliminating the sign extension before some subword integer operations like: >> >> >> int x, y; >> short s = (short) (((x << Imm) >> Imm) OP y); // Imm <= 16 >> >> to >> >> short s = (short) (x OP y); >> >> >> In the patch, assuming that `x` can be any int number, we need to guarantee that the optimization doesn't have any impact on the result. Not all arithmetic logic OPs meet the requirements. For example, assuming that `Imm` equals `16`, `x` equals `131068`, >> `y` equals `50` and `OP` is division (`/`), `short s = (short) (((131068 << 16) >> 16) / 50)` is not equal to `short s = (short) (131068 / 50)`. When OP is division, we may get a different result with or without demotion before OP, because the upper 16 bits of division may have influence on the lower 16 bits of result, which can't be optimized. 
All optimizable opcodes are listed in StoreNode::no_need_sign_extension(), whose upper 16 bits of src operands don't influence the lower 16 bits of result for short >> type and upper 24 bits of src operand don't influence the lower 8 bits of dst operand for byte. >> >> After the patch, the short loop case above can be vectorized as: >> >> movi v18.8h, #0x8 >> ... >> ldr q16, [x14, #32] // vector load a[i] >> // vector add, a[i] + 8, no promotion or demotion >> add v17.8h, v16.8h, v18.8h >> str q17, [x6, #32] // vector store a[i] + 8, b[i] >> ldr q17, [x0, #32] // vector load c[i] >> // vector add, a[i] + c[i], no promotion or demotion >> add v16.8h, v17.8h, v16.8h >> // vector add, a[i] + c[i] + 8, no promotion or demotion >> add v16.8h, v16.8h, v18.8h >> str q16, [x11, #32] //vector store sres[i] >> ... >> >> >> The patch works for byte cases as well. >> >> Here is the performance data for micro-benchmark before and after this patch on both AArch64 and x64 machines. We can observe about ~83% improvement with this patch. >> >> on AArch64: >> Before the patch: >> Benchmark (length) Mode Cnt Score Error Units >> addB 523 avgt 15 401.521 ± 0.033 ns/op >> addS 523 avgt 15 401.512 ± 0.021 ns/op >> >> After the patch: >> Benchmark (length) Mode Cnt Score Error Units >> addB 523 avgt 15 68.444 ± 0.318 ns/op >> addS 523 avgt 15 69.847 ± 0.043 ns/op >> >> on x86: >> Before the patch: >> Benchmark (length) Mode Cnt Score Error Units >> addB 523 avgt 15 454.102 ± 36.180 ns/op >> addS 523 avgt 15 432.245 ± 22.640 ns/op >> >> After the patch: >> Benchmark (length) Mode Cnt Score Error Units >> addB 523 avgt 15 75.812 ± 5.063 ns/op >> addS 523 avgt 15 72.839 ± 
10.109 ns/op >> >> [1]: https://github.com/openjdk/jdk/blob/6013d09e82693a1c07cf0bf6daffd95114b3cbfa/src/hotspot/share/opto/superword.cpp#L3241 >> [2]: https://github.com/openjdk/jdk/blob/6013d09e82693a1c07cf0bf6daffd95114b3cbfa/src/hotspot/share/opto/superword.cpp#L3206 >> [3]: https://github.com/openjdk/jdk/blob/6013d09e82693a1c07cf0bf6daffd95114b3cbfa/src/hotspot/share/opto/superword.cpp#L3249 >> [4]: https://github.com/openjdk/jdk/blob/6013d09e82693a1c07cf0bf6daffd95114b3cbfa/src/hotspot/share/opto/superword.cpp#L3251 > > Fei Gao has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains three additional commits since the last revision: > > - Merge branch 'master' into fg8282470 > > Change-Id: I180f1c85bd407b3d7e05937450c5fc0f81e6d70b > - Merge branch 'master' into fg8282470 > > Change-Id: I877ba1e9a82c0dbef04df08070223c02400eeec7 > - 8282470: Eliminate useless sign extension before some subword integer operations > > Some loop cases of subword types, including byte and > short, can't be vectorized by C2's SLP. Here is an example: > ``` > short[] addShort(short[] a, short[] b, short[] c) { > for (int i = 0; i < SIZE; i++) { > b[i] = (short) (a[i] + 8); // *line A* > sres[i] = (short) (b[i] + c[i]); // *line B* > } > } > ``` > However, similar cases of int/float/double/long/char type can > be vectorized successfully. > > The reason why SLP can't vectorize the short case above is > that, as illustrated here[1], the result of the scalar add > operation on *line A* has been promoted to int type. It needs > to be narrowed to short type first before it can work as one > of source operands of addition on *line B*. The demotion is > done by left-shifting 16 bits then right-shifting 16 bits. > The ideal graph for the process is showed like below. 
>
>     LoadS a[i]   8
>          \      /
>         AddI (line A)
>         /        \
> StoreC b[i]    Lshift 16bits
>                    \
>                RShiftI 16 bits   LoadS c[i]
>                        \           /
>                        AddI (line B)
>                            \
>                          StoreC sres[i]
>
> In SLP, for most short-type cases, we can determine the precise > type of the scalar int-type operation and finally execute it > with short-type vector operations[2], except rshift opcode and > abs in some situations[3]. But in this case, the source operand > of RShiftI is from LShiftI rather than from any LoadS[4], so we > can't determine its real type and conservatively assign it with > int type rather than real short type. The int-type operation > RShiftI here can't be vectorized together with other short-type > operations, like AddI(line B). The reason for byte loop cases > is the same. Similar loop cases of char type could be > vectorized because its demotion from int to char is done by > `and` with mask rather than `lshift_rshift`. > > Therefore, we try to remove the patterns like > `RShiftI _ (LShiftI _ valIn1 conIL ) conIR` in the byte/short > cases, to vectorize more scenarios. Optimizing it in the > mid-end by i-GVN is more reasonable. > > What we do in the mid-end is eliminating the sign extension > before some subword integer operations like: > > ``` > int x, y; > short s = (short) (((x << Imm) >> Imm) OP y); // Imm <= 16 > ``` > to > ``` > short s = (short) (x OP y); > ``` > > In the patch, assuming that `x` can be any int number, we need to > guarantee that the optimization doesn't have any impact on the > result. Not all arithmetic logic OPs meet the requirements. For > example, assuming that `Imm` equals `16`, `x` equals `131068`, > `y` equals `50` and `OP` is division (`/`), > `short s = (short) (((131068 << 16) >> 16) / 50)` is not > equal to `short s = (short) (131068 / 50)`. When OP is division, > we may get a different result with or without demotion > before OP, because the upper 16 bits of division may have > influence on the lower 16 bits of result, which can't be > optimized. 
All optimizable opcodes are listed in > StoreNode::no_need_sign_extension(), whose upper 16 bits of src > operands don't influence the lower 16 bits of result for short > type and upper 24 bits of src operand don't influence the lower > 8 bits of dst operand for byte. > > After the patch, the short loop case above can be vectorized as: > ``` > movi v18.8h, #0x8 > ... > ldr q16, [x14, #32] // vector load a[i] > // vector add, a[i] + 8, no promotion or demotion > add v17.8h, v16.8h, v18.8h > str q17, [x6, #32] // vector store a[i] + 8, b[i] > ldr q17, [x0, #32] // vector load c[i] > // vector add, a[i] + c[i], no promotion or demotion > add v16.8h, v17.8h, v16.8h > // vector add, a[i] + c[i] + 8, no promotion or demotion > add v16.8h, v16.8h, v18.8h > str q16, [x11, #32] //vector store sres[i] > ... > ``` > > The patch works for byte cases as well. > > Here is the performance data for micro-benchmark before > and after this patch on both AArch64 and x64 machines. > We can observe about ~83% improvement with this patch. > > on AArch64: > Before the patch: > Benchmark (length) Mode Cnt Score Error Units > addB 523 avgt 15 401.521 ± 0.033 ns/op > addS 523 avgt 15 401.512 ± 0.021 ns/op > > After the patch: > Benchmark (length) Mode Cnt Score Error Units > addB 523 avgt 15 68.444 ± 0.318 ns/op > addS 523 avgt 15 69.847 ± 0.043 ns/op > > on x86: > Before the patch: > Benchmark (length) Mode Cnt Score Error Units > addB 523 avgt 15 454.102 ± 36.180 ns/op > addS 523 avgt 15 432.245 ± 22.640 ns/op > > After the patch: > Benchmark (length) Mode Cnt Score Error Units > addB 523 avgt 15 75.812 ± 5.063 ns/op > addS 523 avgt 15 72.839 ± 
10.109 ns/op > > [1]: https://github.com/openjdk/jdk/blob/6013d09e82693a1c07cf0bf6daffd95114b3cbfa/src/hotspot/share/opto/superword.cpp#L3241 > [2]: https://github.com/openjdk/jdk/blob/6013d09e82693a1c07cf0bf6daffd95114b3cbfa/src/hotspot/share/opto/superword.cpp#L3206 > [3]: https://github.com/openjdk/jdk/blob/6013d09e82693a1c07cf0bf6daffd95114b3cbfa/src/hotspot/share/opto/superword.cpp#L3249 > [4]: https://github.com/openjdk/jdk/blob/6013d09e82693a1c07cf0bf6daffd95114b3cbfa/src/hotspot/share/opto/superword.cpp#L3251 > > Change-Id: I92ce42b550ef057964a3b58716436735275d8d31 Marked as reviewed by kvn (Reviewer). Testing tier1-4 passed. ------------- PR: https://git.openjdk.java.net/jdk/pull/7954 From kvn at openjdk.java.net Fri May 27 15:46:32 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Fri, 27 May 2022 15:46:32 GMT Subject: RFR: 8287396 LIR_Opr::vreg_number() and data() can return negative number In-Reply-To: References: Message-ID: On Fri, 27 May 2022 03:23:47 GMT, Dean Long wrote: > This PR does two things: > - reverts the incorrect change to non_data_bits that included pointer_bits > - treats the data() as an unsigned int to prevent a high bit being treated as a negative number Christian said in #2543: "There is also a second issue that LIR_OprDesc::vreg_max is too big. It is only used in this bailout code. OprBits::vreg_max is defined over OprBits::data_bits which uses OprBits::non_data_bits. But OprBits::non_data_bits does not consider OprBits::pointer_bits which results in a too large value for LIR_OprDesc::vreg_max and the assertion is hit because we don't bail out, yet. This needs to be fixed as well." So yes, fix vreg_max. 
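The division counterexample from the description (`x = 131068`, `y = 50`, `Imm = 16`) can be checked directly in Java. This is a small sketch with hypothetical helper names; it only demonstrates the arithmetic argument and says nothing about how `StoreNode::no_need_sign_extension()` is implemented.

```java
public class SignExtensionDemo {
    // Narrow to 16 bits and sign-extend back to int, i.e. (x << 16) >> 16.
    static int sext16(int x) {
        return (x << 16) >> 16;
    }

    // Addition: the low 16 bits of the sum depend only on the low 16 bits
    // of the operands, so the sign extension can be dropped.
    static short addWithExt(int x, int y) { return (short) (sext16(x) + y); }
    static short addNoExt(int x, int y)   { return (short) (x + y); }

    // Division: the upper bits of the dividend influence the low 16 bits
    // of the quotient, so the sign extension must stay.
    static short divWithExt(int x, int y) { return (short) (sext16(x) / y); }
    static short divNoExt(int x, int y)   { return (short) (x / y); }

    public static void main(String[] args) {
        int x = 131068, y = 50; // values taken from the PR description
        System.out.println(addWithExt(x, y) + " " + addNoExt(x, y)); // 46 46
        System.out.println(divWithExt(x, y) + " " + divNoExt(x, y)); // 0 2621
    }
}
```

Here `131068` is `0x1FFFC`, so the sign-extended low half is `-4`: the two additions agree after narrowing to `short`, while the two divisions give `0` and `2621`.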
------------- PR: https://git.openjdk.java.net/jdk/pull/8912 From kvn at openjdk.java.net Fri May 27 15:48:48 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Fri, 27 May 2022 15:48:48 GMT Subject: RFR: 8285558: IGV: scheduling crashes on control-unreachable CFG nodes In-Reply-To: References: Message-ID: On Fri, 27 May 2022 10:08:22 GMT, Roberto Castañeda Lozano wrote: > This changeset relaxes IGV's schedule approximation algorithm to handle CFG nodes that are not reachable from the root node via a control path. This is done by 1) removing the assumption that, after `ServerCompilerScheduler::buildBlocks()`, `Node::block` is non-null for CFG nodes; and 2) leaving the assignment of blockless nodes to an artificial "no block" to `InputGraph::ensureNodesInBlocks()`, which is always called after running the schedule approximation algorithm. > > Additionally, the changeset marks unreachable CFG nodes with a warning, making it easier to identify ill-formed graphs: > > ![screenshot of IGV graph where some nodes are marked with a warning](https://user-images.githubusercontent.com/8792647/170678998-bffc293f-90ce-45e7-9aea-7cc5ab184026.png) > > #### Testing > > - Tested manually on the two graphs reported in the [JBS issue](https://bugs.openjdk.java.net/browse/JDK-8285558). > > - Tested automatically that scheduling tens of thousands of graphs (by instrumenting IGV to schedule parsed graphs eagerly and running `java -Xcomp -XX:-TieredCompilation -XX:PrintIdealGraphLevel=4`) does not introduce any exception or assertion failure. Good. ------------- Marked as reviewed by kvn (Reviewer). 
PR: https://git.openjdk.java.net/jdk/pull/8916 From kvn at openjdk.java.net Fri May 27 15:49:44 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Fri, 27 May 2022 15:49:44 GMT Subject: RFR: 8287438: IGV: scheduling crashes on non-block-start Region with multiple predecessors In-Reply-To: References: Message-ID: On Fri, 27 May 2022 12:28:53 GMT, Roberto Castañeda Lozano wrote: > IGV scheduling crashes when breaking critical edges that target Region nodes not marked with the `is_block_start` property, by failing to create an appropriate basic block between the source and the destination of the critical edge (see the JBS bug report for more detail). This changeset ensures that such a basic block is created even when the destination node is not marked with `is_block_start`. > > #### Testing > > - Tested manually on the [graph](https://bugs.openjdk.java.net/secure/attachment/99125/failure.zip) reported in the JBS issue. > > - Tested automatically that scheduling tens of thousands of graphs (by instrumenting IGV to schedule parsed graphs eagerly and running `java -Xcomp -XX:-TieredCompilation -XX:PrintIdealGraphLevel=4`) does not trigger any exception or assertion failure. Okay. ------------- Marked as reviewed by kvn (Reviewer). PR: https://git.openjdk.java.net/jdk/pull/8921 From duke at openjdk.java.net Fri May 27 15:53:52 2022 From: duke at openjdk.java.net (Quan Anh Mai) Date: Fri, 27 May 2022 15:53:52 GMT Subject: RFR: 8283699: Improve the peephole mechanism of hotspot [v2] In-Reply-To: References: Message-ID: On Sat, 21 May 2022 12:16:23 GMT, Quan Anh Mai wrote: >> Hi, >> >> The current peephole mechanism has several drawbacks: >> - Can only match and remove adjacent instructions. >> - Cannot match machine ideal nodes (e.g MachSpillCopyNode). >> - Can only replace 1 instruction, the position of insertion is limited to the position at which the matched nodes reside. 
>> - Is actually broken since the nodes are not connected properly and OptoScheduling requires true dependencies between nodes. >> >> The patch proposes to enhance the peephole mechanism by allowing a peep rule to call into a dedicated function, which takes the responsibility to perform all required transformations on the basic block. This allows the peephole mechanism to perform several transformations effectively in a more fine-grain manner. >> >> The patch uses the peephole optimisation to perform some classic peepholes, transforming on x86 the sequences: >> >> mov r1, r2 -> lea r1, [r2 + r3/i] >> add r1, r3/i >> >> and >> >> mov r1, r2 -> lea r1, [r2 << i], with i = 1, 2, 3 >> shl r1, i >> >> On the added benchmarks, the transformations show positive results: >> >> Benchmark Mode Cnt Score Error Units >> LeaPeephole.B_D_int avgt 5 1200.490 ± 104.662 ns/op >> LeaPeephole.B_D_long avgt 5 1211.439 ± 30.196 ns/op >> LeaPeephole.B_I_int avgt 5 1118.831 ± 7.995 ns/op >> LeaPeephole.B_I_long avgt 5 1112.389 ± 15.838 ns/op >> LeaPeephole.I_S_int avgt 5 1262.528 ± 7.293 ns/op >> LeaPeephole.I_S_long avgt 5 1223.820 ± 17.777 ns/op >> >> Benchmark Mode Cnt Score Error Units >> LeaPeephole.B_D_int avgt 5 860.889 ± 6.089 ns/op >> LeaPeephole.B_D_long avgt 5 945.455 ± 21.603 ns/op >> LeaPeephole.B_I_int avgt 5 849.109 ± 9.809 ns/op >> LeaPeephole.B_I_long avgt 5 851.283 ± 16.921 ns/op >> LeaPeephole.I_S_int avgt 5 976.594 ± 23.004 ns/op >> LeaPeephole.I_S_long avgt 5 936.984 ± 9.601 ns/op >> >> A following patch would add IR tests for these transformations since the IR framework has not been able to parse the ideal scheduling yet although printing the scheduling itself has been made possible recently. >> >> Thank you very much. > > Quan Anh Mai has updated the pull request with a new target base due to a merge or a rebase. 
The pull request now contains 23 commits: > > - Merge branch 'master' into peephole > - some fix > - add benchmark > - Merge branch 'master' into peephole > - refactor > - fix? > - refactor > - attempt > - attempt > - build fix > - ... and 13 more: https://git.openjdk.java.net/jdk/compare/72bd41b8...78b4a3f2 Thanks very much for the testing. May I have some more detailed information please? I can't seem to replicate the failures on my machine. ------------- PR: https://git.openjdk.java.net/jdk/pull/8025 From kvn at openjdk.java.net Fri May 27 16:11:38 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Fri, 27 May 2022 16:11:38 GMT Subject: RFR: 8287425: Remove unnecessary register push for MacroAssembler::check_klass_subtype_slow_path In-Reply-To: References: Message-ID: On Fri, 27 May 2022 09:10:48 GMT, Xiaolin Zheng wrote: > Hi team, > > ![AE98A8E7-9F6F-4722-B310-299A9A96A957](https://user-images.githubusercontent.com/38156692/170670906-2ce37a13-af21-4cf8-acbd-ca24528bc3a9.png) > > Some perf results show unnecessary pushes in `MacroAssembler::check_klass_subtype_slow_path()` under `UseCompressedOops`. History logs show the original code is like [1], and it gets refactored in [JDK-6813212](https://bugs.openjdk.java.net/browse/JDK-6813212), and the counterparts of the `UseCompressedOops` in the diff are at [2] and [3], and we could see the push of rax is just because `encode_heap_oop_not_null()` would kill it, so a push and restore are needed here. After that, [JDK-6964458](https://bugs.openjdk.java.net/browse/JDK-6964458) (removal of perm gen) at [4] removed [3] so that there is no need to do UseCompressedOops work in `MacroAssembler::check_klass_subtype_slow_path()`; but in that patch [2] didn't get removed, so we finally come here. As a result, [2] could also be safely removed. > > I was wondering if this minor change could be sponsored? > > This enhancement is raised on behalf of Wei Kuai . 
> > Tested x86_64 hotspot tier1~tier4 twice, aarch64 hotspot tier1~tier4 once with another jdk tier1 once, and riscv64 hotspot tier1~tier4 once. > > Thanks, > Xiaolin > > [1] https://github.com/openjdk/jdk/blob/de67e5294982ce197f2abd051cbb1c8aa6c29499/hotspot/src/cpu/x86/vm/interp_masm_x86_64.cpp#L273-L284 > [2] https://github.com/openjdk/jdk/commit/b8dbe8d8f650124b61a4ce8b70286b5b444a3316#diff-beb6684583b0a552a99bbe4b5a21828489a6d689b32a05e1a9af8c3be9f463c3R7441-R7444 > [3] https://github.com/openjdk/jdk/commit/b8dbe8d8f650124b61a4ce8b70286b5b444a3316#diff-beb6684583b0a552a99bbe4b5a21828489a6d689b32a05e1a9af8c3be9f463c3R7466-R7477 > [4] https://github.com/openjdk/jdk/commit/5c58d27aac7b291b879a7a3ff6f39fca25619103#diff-beb6684583b0a552a99bbe4b5a21828489a6d689b32a05e1a9af8c3be9f463c3L9347-L9361 Looks good. I would consider it as trivial fix. ------------- Marked as reviewed by kvn (Reviewer). PR: https://git.openjdk.java.net/jdk/pull/8915 From kvn at openjdk.java.net Fri May 27 16:52:49 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Fri, 27 May 2022 16:52:49 GMT Subject: RFR: 8283699: Improve the peephole mechanism of hotspot [v2] In-Reply-To: References: Message-ID: On Sat, 21 May 2022 12:16:23 GMT, Quan Anh Mai wrote: >> Hi, >> >> The current peephole mechanism has several drawbacks: >> - Can only match and remove adjacent instructions. >> - Cannot match machine ideal nodes (e.g MachSpillCopyNode). >> - Can only replace 1 instruction, the position of insertion is limited to the position at which the matched nodes reside. >> - Is actually broken since the nodes are not connected properly and OptoScheduling requires true dependencies between nodes. >> >> The patch proposes to enhance the peephole mechanism by allowing a peep rule to call into a dedicated function, which takes the responsibility to perform all required transformations on the basic block. 
This allows the peephole mechanism to perform several transformations effectively in a more fine-grained manner. >> >> The patch uses the peephole optimisation to perform some classic peepholes, transforming on x86 the sequences:
>>
>>     mov r1, r2      ->      lea r1, [r2 + r3/i]
>>     add r1, r3/i
>>
>> and
>>
>>     mov r1, r2      ->      lea r1, [r2 << i], with i = 1, 2, 3
>>     shl r1, i
>>
>> On the added benchmarks, the transformations show positive results:
>>
>>     Benchmark             Mode  Cnt     Score     Error  Units
>>     LeaPeephole.B_D_int   avgt    5  1200.490 ± 104.662  ns/op
>>     LeaPeephole.B_D_long  avgt    5  1211.439 ±  30.196  ns/op
>>     LeaPeephole.B_I_int   avgt    5  1118.831 ±   7.995  ns/op
>>     LeaPeephole.B_I_long  avgt    5  1112.389 ±  15.838  ns/op
>>     LeaPeephole.I_S_int   avgt    5  1262.528 ±   7.293  ns/op
>>     LeaPeephole.I_S_long  avgt    5  1223.820 ±  17.777  ns/op
>>
>>     Benchmark             Mode  Cnt     Score     Error  Units
>>     LeaPeephole.B_D_int   avgt    5   860.889 ±   6.089  ns/op
>>     LeaPeephole.B_D_long  avgt    5   945.455 ±  21.603  ns/op
>>     LeaPeephole.B_I_int   avgt    5   849.109 ±   9.809  ns/op
>>     LeaPeephole.B_I_long  avgt    5   851.283 ±  16.921  ns/op
>>     LeaPeephole.I_S_int   avgt    5   976.594 ±  23.004  ns/op
>>     LeaPeephole.I_S_long  avgt    5   936.984 ±   9.601  ns/op
>>
>> A following patch would add IR tests for these transformations since the IR framework has not been able to parse the ideal scheduling yet although printing the scheduling itself has been made possible recently. >> >> Thank you very much. > > Quan Anh Mai has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 23 commits: > > - Merge branch 'master' into peephole > - some fix > - add benchmark > - Merge branch 'master' into peephole > - refactor > - fix? > - refactor > - attempt > - attempt > - build fix > - ... and 13 more: https://git.openjdk.java.net/jdk/compare/72bd41b8...78b4a3f2 Unfortunately the only information I have is in the JBS comment. I would look only at the TestBase64.java failure. It was run on Oracle Linux on an Intel Ice Lake CPU. 
It could be intermittent failure which is not related to your changes but I don't see any previous records similar to this. ------------- PR: https://git.openjdk.java.net/jdk/pull/8025 From kvn at openjdk.java.net Fri May 27 17:38:57 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Fri, 27 May 2022 17:38:57 GMT Subject: RFR: 8283699: Improve the peephole mechanism of hotspot [v2] In-Reply-To: References: Message-ID: On Sat, 21 May 2022 12:16:23 GMT, Quan Anh Mai wrote: > Quan Anh Mai has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 23 commits: > > - Merge branch 'master' into peephole > - some fix > - add benchmark > - Merge branch 'master' into peephole > - refactor > - fix? > - refactor > - attempt > - attempt > - build fix > - ... and 13 more: https://git.openjdk.java.net/jdk/compare/72bd41b8...78b4a3f2 I will rerun TestBase64.java to see if I can reproduce failure. Please update to latest JDK sources. I can't apply patch. ------------- PR: https://git.openjdk.java.net/jdk/pull/8025 From kvn at openjdk.java.net Fri May 27 20:25:50 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Fri, 27 May 2022 20:25:50 GMT Subject: RFR: 8283699: Improve the peephole mechanism of hotspot [v2] In-Reply-To: References: Message-ID: On Sat, 21 May 2022 12:16:23 GMT, Quan Anh Mai wrote: > Quan Anh Mai has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 23 commits: > > - Merge branch 'master' into peephole > - some fix > - add benchmark > - Merge branch 'master' into peephole > - refactor > - fix? > - refactor > - attempt > - attempt > - build fix > - ... and 13 more: https://git.openjdk.java.net/jdk/compare/72bd41b8...78b4a3f2 I ran previously built binaries and TestBase64.java failure happened 10 out of 10 runs. It is not intermittent. I attached hs_err file to RFE. I created it with `-XX:AbortVMOnException=java.lang.IndexOutOfBoundsException` ------------- PR: https://git.openjdk.java.net/jdk/pull/8025 From vlivanov at openjdk.java.net Fri May 27 20:36:27 2022 From: vlivanov at openjdk.java.net (Vladimir Ivanov) Date: Fri, 27 May 2022 20:36:27 GMT Subject: RFR: 8287223: C1: Inlining attempt through MH::invokeBasic() with null receiver [v2] In-Reply-To: References: Message-ID: > Inlining attempt through `MH::invokeBasic()` when receiver is null. It triggers an assert when attempting to extract a `Method*` from a null constant. > > Proposed fix bails out inlining attempt when receiver is null constant. > > C2 has a similar issue, but the particular bytecode shape of `MH::invokeExact()` invoker hides the bug (dominating `MH::type()` call involves a null check and problematic call isn't compiled at all). 
> > Testing: hs-tier1 - hs-tier4 Vladimir Ivanov has updated the pull request incrementally with one additional commit since the last revision: Fix C2 part ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/8894/files - new: https://git.openjdk.java.net/jdk/pull/8894/files/0b35ff5d..eaae7537 Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=8894&range=01 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=8894&range=00-01 Stats: 12 lines in 2 files changed: 3 ins; 1 del; 8 mod Patch: https://git.openjdk.java.net/jdk/pull/8894.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8894/head:pull/8894 PR: https://git.openjdk.java.net/jdk/pull/8894 From vlivanov at openjdk.java.net Fri May 27 20:41:35 2022 From: vlivanov at openjdk.java.net (Vladimir Ivanov) Date: Fri, 27 May 2022 20:41:35 GMT Subject: RFR: 8287223: C1: Inlining attempt through MH::invokeBasic() with null receiver [v3] In-Reply-To: References: Message-ID: > Inlining attempt through `MH::invokeBasic()` when receiver is null. It triggers an assert when attempting to extract a `Method*` from a null constant. > > Proposed fix bails out inlining attempt when receiver is null constant. > > C2 has a similar issue, but the particular bytecode shape of `MH::invokeExact()` invoker hides the bug (dominating `MH::type()` call involves a null check and problematic call isn't compiled at all). 
> > Testing: hs-tier1 - hs-tier4 Vladimir Ivanov has updated the pull request incrementally with one additional commit since the last revision: Redundant import ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/8894/files - new: https://git.openjdk.java.net/jdk/pull/8894/files/eaae7537..7ff642f4 Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=8894&range=02 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=8894&range=01-02 Stats: 1 line in 1 file changed: 0 ins; 1 del; 0 mod Patch: https://git.openjdk.java.net/jdk/pull/8894.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8894/head:pull/8894 PR: https://git.openjdk.java.net/jdk/pull/8894 From vlivanov at openjdk.java.net Fri May 27 20:41:36 2022 From: vlivanov at openjdk.java.net (Vladimir Ivanov) Date: Fri, 27 May 2022 20:41:36 GMT Subject: RFR: 8287223: C1: Inlining attempt through MH::invokeBasic() with null receiver [v3] In-Reply-To: References: Message-ID: <3p_fgtXIlFqHhDc3FVoQXTPvFgVhxFJ3Dq6zLaU-N6g=.d99640bb-53b6-4415-9e04-21729b6ce2e6@github.com> On Wed, 25 May 2022 23:01:16 GMT, Vladimir Kozlov wrote: >> Vladimir Ivanov has updated the pull request incrementally with one additional commit since the last revision: >> >> Redundant import > > src/hotspot/share/opto/callGenerator.cpp line 1021: > >> 1019: if (receiver->Opcode() == Op_ConP) { >> 1020: input_not_const = false; >> 1021: ciObject* recv_obj = receiver->bottom_type()->is_oopptr()->const_oop(); > > In general ConP could be `NULL_PTR` (AnyPtr). I think you need to use `isa_oopptr()` and check for `NULL` here. > Or it can't be `NULL_PTR` in this case? Good catch, Vladimir. I fixed the code and enhanced the test to exercise problematic case by C2. 
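The review point above (query the type with an `isa_`-style call and check the result for null, rather than casting unconditionally) can be sketched in isolation. This is only a minimal illustration with mock types: `MockType`, `MockOopPtr`, `MockNullPtr` and `constant_receiver_or_null` are hypothetical names invented for the sketch and are not the HotSpot classes or the code in the patch.

```cpp
#include <cassert>

// Mock stand-ins for the compiler's type lattice (hypothetical, NOT HotSpot code).
struct MockOopPtr;

struct MockType {
    virtual ~MockType() = default;
    // Checked query: returns nullptr when this type is not an oop pointer,
    // instead of asserting the way an unconditional cast would.
    virtual const MockOopPtr* isa_oopptr() const { return nullptr; }
};

struct MockOopPtr : MockType {
    const void* _const_oop;
    explicit MockOopPtr(const void* o) : _const_oop(o) {}
    const MockOopPtr* isa_oopptr() const override { return this; }
    const void* const_oop() const { return _const_oop; }
};

// Stands in for a null constant that is typed as a plain pointer, not an oop.
struct MockNullPtr : MockType {};

// Returns the constant receiver, or nullptr meaning "bail out, do not inline".
inline const void* constant_receiver_or_null(const MockType* t) {
    const MockOopPtr* oop = t->isa_oopptr();
    if (oop == nullptr) {
        return nullptr;          // not an oop pointer: nothing to extract
    }
    return oop->const_oop();     // may itself be null for a non-constant oop
}
```

A caller in this sketch would treat a null result as the signal to skip the inlining attempt, which is the shape of the fix described in the thread.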
------------- PR: https://git.openjdk.java.net/jdk/pull/8894 From kvn at openjdk.java.net Fri May 27 20:49:35 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Fri, 27 May 2022 20:49:35 GMT Subject: RFR: 8287223: C1: Inlining attempt through MH::invokeBasic() with null receiver [v3] In-Reply-To: References: Message-ID: On Fri, 27 May 2022 20:41:35 GMT, Vladimir Ivanov wrote: >> Inlining attempt through `MH::invokeBasic()` when receiver is null. It triggers an assert when attempting to extract a `Method*` from a null constant. >> >> Proposed fix bails out inlining attempt when receiver is null constant. >> >> C2 has a similar issue, but the particular bytecode shape of `MH::invokeExact()` invoker hides the bug (dominating `MH::type()` call involves a null check and problematic call isn't compiled at all). >> >> Testing: hs-tier1 - hs-tier4 > > Vladimir Ivanov has updated the pull request incrementally with one additional commit since the last revision: > > Redundant import Good. ------------- Marked as reviewed by kvn (Reviewer). PR: https://git.openjdk.java.net/jdk/pull/8894 From vlivanov at openjdk.java.net Fri May 27 21:26:56 2022 From: vlivanov at openjdk.java.net (Vladimir Ivanov) Date: Fri, 27 May 2022 21:26:56 GMT Subject: RFR: 8287223: C1: Inlining attempt through MH::invokeBasic() with null receiver [v3] In-Reply-To: References: Message-ID: On Fri, 27 May 2022 20:41:35 GMT, Vladimir Ivanov wrote: >> Inlining attempt through `MH::invokeBasic()` when receiver is null. It triggers an assert when attempting to extract a `Method*` from a null constant. >> >> Proposed fix bails out inlining attempt when receiver is null constant. >> >> C2 has a similar issue, but the particular bytecode shape of `MH::invokeExact()` invoker hides the bug (dominating `MH::type()` call involves a null check and problematic call isn't compiled at all). 
>> >> Testing: hs-tier1 - hs-tier4 > > Vladimir Ivanov has updated the pull request incrementally with one additional commit since the last revision: > > Redundant import Thanks for the review, Vladimir. ------------- PR: https://git.openjdk.java.net/jdk/pull/8894 From vlivanov at openjdk.java.net Fri May 27 21:26:57 2022 From: vlivanov at openjdk.java.net (Vladimir Ivanov) Date: Fri, 27 May 2022 21:26:57 GMT Subject: Integrated: 8287223: C1: Inlining attempt through MH::invokeBasic() with null receiver In-Reply-To: References: Message-ID: <_El1AbtKnRVOu1uP08aS_hRx6YAniQ-8kjauPT0dRHk=.54f8fe48-433a-4c5a-8d64-8c0b369b00b4@github.com> On Wed, 25 May 2022 21:58:01 GMT, Vladimir Ivanov wrote: > Inlining attempt through `MH::invokeBasic()` when receiver is null. It triggers an assert when attempting to extract a `Method*` from a null constant. > > Proposed fix bails out inlining attempt when receiver is null constant. > > C2 has a similar issue, but the particular bytecode shape of `MH::invokeExact()` invoker hides the bug (dominating `MH::type()` call involves a null check and problematic call isn't compiled at all). > > Testing: hs-tier1 - hs-tier4 This pull request has now been integrated. 
Changeset: d3e781de Author: Vladimir Ivanov URL: https://git.openjdk.java.net/jdk/commit/d3e781de086d557a88105da965ff8a7f9126019c Stats: 102 lines in 3 files changed: 78 ins; 8 del; 16 mod 8287223: C1: Inlining attempt through MH::invokeBasic() with null receiver Reviewed-by: kvn ------------- PR: https://git.openjdk.java.net/jdk/pull/8894 From kvn at openjdk.java.net Fri May 27 22:21:40 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Fri, 27 May 2022 22:21:40 GMT Subject: RFR: 8283091: Support type conversion between different data sizes in SLP [v3] In-Reply-To: References: <4qFrtxwRHZfnzJbdVl2z7l6XENvxR3N8Sruc2fraKkw=.1391f639-1514-4e50-899a-d56cdd4afa99@github.com> <3_-2N1Kf4WIryx7eFIrXomabZJTeVNvSJ10joWdzN4s=.a16c8b8e-0834-48f8-9eac-6aaf07822ad5@github.com> Message-ID: On Wed, 25 May 2022 01:13:36 GMT, Fei Gao wrote: >> @fg1417 Thank you for suggesting this optimization. I see that it was not updated for some time. Do you still intend to work on it? >> >> Please update to latest JDK (Loom was integrated) and run performance again. Also include % of changes. >> >> I have the same concern as @DamonFool about regression when vectorizing some conversions. May be we should have additional `Matcher` property we could consult when trying to **auto-vectorize**. I understand that we need `vcvt2Dto2I` when VectorAPI specifically asking to generate it but we should not enforce auto-generation. > >> Please update to latest JDK (Loom was integrated) and run performance again. Also include % of changes. >> >> I have the same concern as @DamonFool about regression when vectorizing some conversions. May be we should have additional `Matcher` property we could consult when trying to **auto-vectorize**. I understand that we need `vcvt2Dto2I` when VectorAPI specifically asking to generate it but we should not enforce auto-generation. > > @vnkozlov thanks for your review and kind suggestion! I'll update the patch to resolve the potential performance regression. 
@fg1417 I don't see new update in this PR. Please also show performance numbers with new changes ------------- PR: https://git.openjdk.java.net/jdk/pull/7806 From dlong at openjdk.java.net Sat May 28 02:21:35 2022 From: dlong at openjdk.java.net (Dean Long) Date: Sat, 28 May 2022 02:21:35 GMT Subject: RFR: 8287396 LIR_Opr::vreg_number() and data() can return negative number [v2] In-Reply-To: References: Message-ID: <8TqlrqeRZn0ptrN5Qb-cq_0mh3bskNwMVgSi94TRclI=.44c3f2ac-de40-4127-bffb-0ed5a3d74f5f@github.com> > This PR does two things: > - reverts the incorrect change to non_data_bits that included pointer_bits > - treats the data() as an unsigned int to prevent a high bit being treated as a negative number Dean Long has updated the pull request incrementally with one additional commit since the last revision: set vreg_max to a more reasonable limit (10000) ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/8912/files - new: https://git.openjdk.java.net/jdk/pull/8912/files/3bc73a2e..5e19c9f7 Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=8912&range=01 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=8912&range=00-01 Stats: 3 lines in 1 file changed: 2 ins; 0 del; 1 mod Patch: https://git.openjdk.java.net/jdk/pull/8912.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8912/head:pull/8912 PR: https://git.openjdk.java.net/jdk/pull/8912 From dlong at openjdk.java.net Sat May 28 02:25:36 2022 From: dlong at openjdk.java.net (Dean Long) Date: Sat, 28 May 2022 02:25:36 GMT Subject: RFR: 8287396 LIR_Opr::vreg_number() and data() can return negative number [v2] In-Reply-To: <8TqlrqeRZn0ptrN5Qb-cq_0mh3bskNwMVgSi94TRclI=.44c3f2ac-de40-4127-bffb-0ed5a3d74f5f@github.com> References: <8TqlrqeRZn0ptrN5Qb-cq_0mh3bskNwMVgSi94TRclI=.44c3f2ac-de40-4127-bffb-0ed5a3d74f5f@github.com> Message-ID: On Sat, 28 May 2022 02:21:35 GMT, Dean Long wrote: >> This PR does two things: >> - reverts the incorrect change to non_data_bits that included 
pointer_bits >> - treats the data() as an unsigned int to prevent a high bit being treated as a negative number > > Dean Long has updated the pull request incrementally with one additional commit since the last revision: > > set vreg_max to a more reasonable limit (10000) I set vreg_max to 10,000. I tried 1,000 first, but there were a few cases during testing where this would cause a bailout on hot methods. With 10,000, only CTW (forced compile of cold methods) and the TestTooManyVirtualRegistersMain.java stress test hit the limit. At 10,000 virtual registers, bailout happens after 8 seconds on a aarch64 debug build. ------------- PR: https://git.openjdk.java.net/jdk/pull/8912 From kvn at openjdk.java.net Sat May 28 02:59:28 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Sat, 28 May 2022 02:59:28 GMT Subject: RFR: 8287396 LIR_Opr::vreg_number() and data() can return negative number [v2] In-Reply-To: <8TqlrqeRZn0ptrN5Qb-cq_0mh3bskNwMVgSi94TRclI=.44c3f2ac-de40-4127-bffb-0ed5a3d74f5f@github.com> References: <8TqlrqeRZn0ptrN5Qb-cq_0mh3bskNwMVgSi94TRclI=.44c3f2ac-de40-4127-bffb-0ed5a3d74f5f@github.com> Message-ID: On Sat, 28 May 2022 02:21:35 GMT, Dean Long wrote: >> This PR does two things: >> - reverts the incorrect change to non_data_bits that included pointer_bits >> - treats the data() as an unsigned int to prevent a high bit being treated as a negative number > > Dean Long has updated the pull request incrementally with one additional commit since the last revision: > > set vreg_max to a more reasonable limit (10000) Good. ------------- Marked as reviewed by kvn (Reviewer). 
PR: https://git.openjdk.java.net/jdk/pull/8912 From fyang at openjdk.java.net Sat May 28 04:43:25 2022 From: fyang at openjdk.java.net (Fei Yang) Date: Sat, 28 May 2022 04:43:25 GMT Subject: RFR: 8287418: riscv: Fix correctness issue of MacroAssembler::movptr [v3] In-Reply-To: References: Message-ID: On Fri, 27 May 2022 07:24:32 GMT, Xiaolin Zheng wrote: >> Hi team, >> >> `MacroAssembler::movptr()` is designed to load a 47-bit (unsigned) address constant, ranging `[0x0, 0x7FFF_FFFF_FFFF]`, and a special case -1 (`the Universe::non_oop_word()` as we know, which is `0xFFFF_FFFF_FFFF_FFFF`). The former ones are inside a sv48 address space range[1]. Please note that under sv48 a valid address has the bit 47 equal to 0 in user space, so that `MacroAssembler::movptr()` could cover all cases under sv48. However, when loading an immediate value ranging `[0x7FFF_8000_0000, 0x7FFF_FFFF_FFFF]` using it, the results would wrongly become `[0xFFFF_7FFF_8000_0000, 0xFFFF_7FFF_FFFF_FFFF]`, which indicates the MSB has polluted high bits in rare cases. >> >> `MacroAssembler::movptr()` is a composition of `lui+addi+slli+addi+slli+addi`, and all of them are signed operations, MIPS alike. >> Precisely, the first `lui+addi` aims to load the first `32-bit`; then the `slli+addi` would load the `11-bit`; finally the last `slli+addi` is going to load the remaining `5-bit`. >> >> To deal with this, there are two approaches: >> >> (a) Use an `addiw` to replace the first `addi`. `addiw` has nearly the same semantics as `addi`, but after the operation the result would be sign-extended according to the bit 31. Due to this feature, we could use this to clean up the dirty high bits at all times. This could also handle the (-1) case. However, `Assembler::li32()`, which is composed of `lui+addiw`, will conflict with the new implementation, needing further adaptations. 
(Personally I dislike that a bit.) >> >> (b) Like V8's implementation [2], the trick here is to load only the first 31 bits using `lui+addi`, with a leading 0 as bit 31. This prevents the issue from the start. As a trade-off, we need to shift by one more bit because the leading 0 occupies one bit. This approach can also handle the (-1) case after minor adaptations. (I like this one.) >> >> This problem can be reproduced using `-XX:CompressedClassSpaceBaseAddress=0x7FFFF8000000 -XX:CompressedClassSpaceSize=40M -Xshare:off` with a fastdebug build, and on Qemu only, because I currently have no access to hardware that supports sv48, and the kernel that Ubuntu [3] relies on is Linux 5.15. The kernel (TIP) first checks whether the hardware supports sv57; if not, it falls back to sv48, and so on. sv48 is not supported until Linux 5.17 [4], so this issue could never be reproduced on my boards. Fortunately Qemu supports this, because one can mmap an address in the 48-bit address space even in a user-level Qemu. >> >> Tested with `-XX:CompressedClassSpaceBaseAddress=0x7FFFF8000000 -XX:CompressedClassSpaceSize=40M -Xshare:off` (reproducible) on Qemu with hotspot tier1 (we should ignore OOMs caused by the compressed class space); other tiers are on the way. >> Sanity testing with hotspot tier1~tier4 (could not reproduce); tier1 has finished without new failures. >> >> Thanks, >> Xiaolin >> >> [1] https://github.com/riscv/riscv-isa-manual/blob/9ec8c0105dbf1492b57f6cafdb90a268628f476a/src/supervisor.tex#L1999-L2006 >> [2] https://github.com/v8/v8/blob/main/src/codegen/riscv64/assembler-riscv64.cc#L3479-L3495 >> [3] https://cdimage.ubuntu.com/releases/22.04/release/ >> [4] https://www.phoronix.com/scan.php?page=news_item&px=Linux-5.17-RISC-V-sv48 > > Xiaolin Zheng has updated the pull request incrementally with one additional commit since the last revision: > > Fix comments in `patch_addr_in_movptr` after this change Looks fine. 
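The sign-extension effect described in the quoted message can be modelled outside the VM. The following is a rough sketch under stated assumptions, not the HotSpot implementation: `lui`, `addi` and `slli` are small models of the RV64 instructions, and the two helpers `movptr_32_11_5` and `movptr_31_11_6` are hypothetical names for a 32+11+5-bit split and for the 31-bit variant of approach (b), where a 31+11+6 split is assumed.

```cpp
#include <cassert>
#include <cstdint>

// Sign-extend the low `bits` bits of v to 64 bits (two's complement in a uint64_t).
inline uint64_t sext(uint64_t v, unsigned bits) {
    const uint64_t sign = 1ULL << (bits - 1);
    v &= (sign << 1) - 1;
    return (v ^ sign) - sign;
}

// Models of the RV64 instructions used by the sequence.
inline uint64_t lui(uint64_t imm20) { return sext((imm20 & 0xFFFFF) << 12, 32); }
inline uint64_t addi(uint64_t rs, uint64_t imm12) { return rs + sext(imm12, 12); }
inline uint64_t slli(uint64_t rs, unsigned sh) { return rs << sh; }

// lui+addi pair that tries to materialise the 32-bit value w.
inline uint64_t lui_addi(uint64_t w) {
    const uint64_t lo = sext(w, 12);                  // signed low 12 bits
    const uint64_t hi20 = ((w - lo) >> 12) & 0xFFFFF; // compensate for lo's sign
    return addi(lui(hi20), lo);
}

// Assumed split 32 + 11 + 5 bits: bit 31 of the first chunk can be set,
// so lui may sign-extend and pollute bits 63..48 of the result.
inline uint64_t movptr_32_11_5(uint64_t addr) {
    uint64_t r = lui_addi(addr >> 16);
    r = addi(slli(r, 11), (addr >> 5) & 0x7FF);
    return addi(slli(r, 5), addr & 0x1F);
}

// Assumed split 31 + 11 + 6 bits: for addresses below 2^47 the first chunk
// stays below 2^30, so the lui+addi pair never produces a negative value.
inline uint64_t movptr_31_11_6(uint64_t addr) {
    uint64_t r = lui_addi(addr >> 17);
    r = addi(slli(r, 11), (addr >> 6) & 0x7FF);
    return addi(slli(r, 6), addr & 0x3F);
}
```

In this model the address used in the reproduction flags, `0x7FFFF8000000`, comes out of the 32+11+5 variant with its upper 16 bits set, while the 31+11+6 variant returns it unchanged; the special (-1) case is not modelled.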
This won't be reproduced on RV boards like Unmatched which only implement SV39. But we should fix this for future RV hardwares which will likely to have SV48. ------------- Marked as reviewed by fyang (Reviewer). PR: https://git.openjdk.java.net/jdk/pull/8913 From zmiao at openjdk.java.net Sat May 28 06:38:11 2022 From: zmiao at openjdk.java.net (Zhuojun Miao) Date: Sat, 28 May 2022 06:38:11 GMT Subject: RFR: JDK-8287349: Merge LDR instructions to improve C1 OSR performance Message-ID: Since MacroAssembler added merge_ldst, we can use different destination registers for contiguous-memory LDR instructions to improve performance. ------------- Commit messages: - JDK-8287349: Merge LDR instructions to improve C1 OSR performance Changes: https://git.openjdk.java.net/jdk/pull/8933/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=8933&range=00 Issue: https://bugs.openjdk.java.net/browse/JDK-8287349 Stats: 3 lines in 1 file changed: 1 ins; 1 del; 1 mod Patch: https://git.openjdk.java.net/jdk/pull/8933.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8933/head:pull/8933 PR: https://git.openjdk.java.net/jdk/pull/8933 From aph at openjdk.java.net Sat May 28 06:44:34 2022 From: aph at openjdk.java.net (Andrew Haley) Date: Sat, 28 May 2022 06:44:34 GMT Subject: RFR: JDK-8287349: Merge LDR instructions to improve C1 OSR performance In-Reply-To: References: Message-ID: On Sat, 28 May 2022 06:28:57 GMT, Zhuojun Miao wrote: > Since MacroAssembler added merge_ldst, we can use different > destination registers for contiguous-memory LDR instructions to improve performance. src/hotspot/cpu/aarch64/c1_LIRAssembler_aarch64.cpp line 288: > 286: __ ldr(r20, Address(OSR_buf, slot_offset + 1*BytesPerWord)); > 287: __ str(r19, frame_map()->address_for_monitor_lock(i)); > 288: __ str(r20, frame_map()->address_for_monitor_object(i)); I think it would be better to use `ldp` explicitly here, rather than relying on the assembler to do the merge. 
------------- PR: https://git.openjdk.java.net/jdk/pull/8933 From aph at openjdk.java.net Sat May 28 06:50:27 2022 From: aph at openjdk.java.net (Andrew Haley) Date: Sat, 28 May 2022 06:50:27 GMT Subject: RFR: JDK-8287349: Merge LDR instructions to improve C1 OSR performance In-Reply-To: References: Message-ID: On Sat, 28 May 2022 06:28:57 GMT, Zhuojun Miao wrote: > Since MacroAssembler added merge_ldst, we can use different > destination registers for contiguous-memory LDR instructions to improve performance. For AArch64-specific PRs, it's a good idea to use a title like "JDK-8287349: AArch64: Merge LDR instructions to improve C1 OSR performance" . That way AArch64 maintainers notice them in the mailing lists. The title of the PR and the bug database entry must match. ------------- PR: https://git.openjdk.java.net/jdk/pull/8933 From dlong at openjdk.java.net Sat May 28 07:51:41 2022 From: dlong at openjdk.java.net (Dean Long) Date: Sat, 28 May 2022 07:51:41 GMT Subject: RFR: 8287396 LIR_Opr::vreg_number() and data() can return negative number [v2] In-Reply-To: <8TqlrqeRZn0ptrN5Qb-cq_0mh3bskNwMVgSi94TRclI=.44c3f2ac-de40-4127-bffb-0ed5a3d74f5f@github.com> References: <8TqlrqeRZn0ptrN5Qb-cq_0mh3bskNwMVgSi94TRclI=.44c3f2ac-de40-4127-bffb-0ed5a3d74f5f@github.com> Message-ID: On Sat, 28 May 2022 02:21:35 GMT, Dean Long wrote: >> This PR does two things: >> - reverts the incorrect change to non_data_bits that included pointer_bits >> - treats the data() as an unsigned int to prevent a high bit being treated as a negative number > > Dean Long has updated the pull request incrementally with one additional commit since the last revision: > > set vreg_max to a more reasonable limit (10000) Thanks Vladimir. 
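The second point of the 8287396 change quoted above (treating `data()` as unsigned so that a set high bit is not read as a negative number) can be illustrated with a small stand-alone sketch. The names `pack`, `data_signed`, `data_unsigned` and the constant `kNonDataBits` are hypothetical and chosen only for the example; they do not reflect the actual `LIR_Opr` layout.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical layout: the low kNonDataBits bits hold flags, the rest hold data.
constexpr int kNonDataBits = 8;

// Pack a data value above the flag bits of a signed, pointer-sized word.
inline int64_t pack(uint64_t data) {
    return static_cast<int64_t>(data << kNonDataBits);
}

// Extracting with a signed shift: if the top data bit is set, the word is
// negative and the shift is arithmetic (guaranteed since C++20, and the
// behaviour of mainstream compilers before that), so the result is negative.
inline int64_t data_signed(int64_t opr) {
    return opr >> kNonDataBits;
}

// Extracting after converting to unsigned: the shift is logical and the
// original data value is recovered.
inline uint64_t data_unsigned(int64_t opr) {
    return static_cast<uint64_t>(opr) >> kNonDataBits;
}
```

With a data value whose top bit lands in bit 63 of the word, `data_signed` yields a negative number while `data_unsigned` returns the value that was packed; for small values the two agree.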
------------- PR: https://git.openjdk.java.net/jdk/pull/8912 From duke at openjdk.java.net Sat May 28 08:56:41 2022 From: duke at openjdk.java.net (kristylee88) Date: Sat, 28 May 2022 08:56:41 GMT Subject: RFR: JDK-8287349: Merge LDR instructions to improve C1 OSR performance In-Reply-To: References: Message-ID: On Sat, 28 May 2022 06:28:57 GMT, Zhuojun Miao wrote: > Since MacroAssembler added merge_ldst, we can use different > destination registers for contiguous-memory LDR instructions to improve performance. Marked as reviewed by kristylee88 at github.com (no known OpenJDK username). ------------- PR: https://git.openjdk.java.net/jdk/pull/8933 From duke at openjdk.java.net Sun May 29 00:07:25 2022 From: duke at openjdk.java.net (Cesar Soares) Date: Sun, 29 May 2022 00:07:25 GMT Subject: RFR: 8286990: Add compiler name to warning messages in Compiler Directive [v2] In-Reply-To: References: Message-ID: On Tue, 24 May 2022 04:34:54 GMT, Yuta Sato wrote: >> When using Compiler Directive such as `java -XX:+UnlockDiagnosticVMOptions -XX:CompilerDirectivesFile= ` , >> it shows totally the same message for c1 and c2 compiler and the user would be confused about >> which compiler is affected by this message. >> This should show messages with their compiler name so that the user knows which compiler shows this message. >> >> My change result would be like the below. 
>> >> >> OpenJDK 64-Bit Server VM warning: printing of assembly code is enabled; turning on DebugNonSafepoints to gain additional output >> OpenJDK 64-Bit Server VM warning: printing of assembly code is enabled; turning on DebugNonSafepoints to gain additional output >> >> -> >> >> OpenJDK 64-Bit Server VM warning: c1: printing of assembly code is enabled; turning on DebugNonSafepoints to gain additional output >> OpenJDK 64-Bit Server VM warning: c2: printing of assembly code is enabled; turning on DebugNonSafepoints to gain additional output > > Yuta Sato has updated the pull request incrementally with one additional commit since the last revision: > > Update full name src/hotspot/share/compiler/compilerDirectives.hpp line 131: > 129: static ccstrlist canonicalize_control_intrinsic(ccstrlist option_value); > 130: void finalize(outputStream* st); > 131: bool is_c1(CompilerDirectives* directive); NIT: these methods can be made "const". ------------- PR: https://git.openjdk.java.net/jdk/pull/8591 From duke at openjdk.java.net Sun May 29 10:05:48 2022 From: duke at openjdk.java.net (Emanuel Peter) Date: Sun, 29 May 2022 10:05:48 GMT Subject: RFR: 8283775: VM support for graph querying in debugger with BFS traversal and node filtering [v12] In-Reply-To: References: Message-ID: <5QMyywdjzD2-s8KWs9JEZ8Y9vFbeo7jpYUbk6NWI9eU=.96a08769-3cdb-4dc8-b70c-2bd67a98f43c@github.com> > **Note: Refactoring and extension still in progress** > > I implemented VM support for BFS traversal of IR nodes, where one can filter for input/output edges, and the node-type (control / data / memory). If one specifies the `target` node, we find a shortest path between `this` and `target`. With the `options` string one can easily select which node types to traverse. > > `void Node::print_bfs(const uint max_distance, Node* target, const char* options)` > > While a sufficient summary and description is in the comments above the function definition, I want to explain some useful use-cases here. 
> > Please let me know if you would find this helpful, or if you have any feedback to improve it. > Thanks, Emanuel > > **1. Better dump()** > The function is similar to `dump()`, we can also follow input / output edges up to a certain distance. There are a few improvements/added features: > > 1. Display the distance to the specified node. This way I can quickly assess how many nodes I have at a certain distance, and I can traverse edges from one distance to the next. The parent column shows the node one step closer to the BFS root (this). > 2. Node filtering with types taken from `type->category()` help one limit traversal to control, data or memory flow. The categories are the same as in IGV. > 3. Probably controversial: coloring of node types in terminal. By default it is off. May not work on all terminals. But I find it very useful to quickly separate memory, data and control - just as in IGV - but in my terminal! > 4. After matching, we create Mach nodes from the IR nodes. For most Mach nodes, there is an associated old node. Often an assert is triggered for Mach nodes, but the graph already breaks in the IR graph. It is thus very helpful to have the old node idx displayed. > 5. I also display the head node idx of the block a Mach node is scheduled for. This can be helpful if one gets bad dominance asserts, and needs to see why the blocks were scheduled in a bad way. > > Example: > > (rr) p find_node(35)->print_bfs(2, 0, "cdmox+") > No target: perform BFS. > dis par c dump > --------------------------------------------- > 0 35 d 35 CmpP === _ 34 25 [[ 36 ]] > 1 35 d 34 LoadP === _ 31 33 [[ 35 ]] > 1 35 d 25 ConP === 0 [[ 26 27 31 35 41 ]] #NULL > 2 34 m 31 StoreP === 20 27 29 25 [[ 23 34 41 42 ]] > 2 34 d 33 AddP === _ 1 12 32 [[ 34 ]] > > > Example with Mach nodes: > > (rr) p ctrl->print_bfs(4, 0, "cdmox+OB") > No target: perform BFS. 
> dis [head idom d] old par c dump > --------------------------------------------- > 0 159 147 6 _ 159 c 159 Region === 159 57 [[ 159 158 59 ]] > 1 147 148 5 o183 159 c 57 IfTrue === 8 [[ 159 ]] > 2 147 148 5 o182 57 c 8 jmpConU === 147 9 [[ 7 57 ]] > 3 147 148 5 _ 8 c 147 Region === 147 14 [[ 147 8 ]] > 3 147 148 5 o180 8 d 9 compUL_rReg === _ 10 13 [[ 8 ]] > 4 148 149 4 o174 147 c 14 IfTrue === 15 [[ 147 ]] > 4 147 148 5 o203 9 d 10 decL_rReg === _ 11 [[ 12 9 ]] > 4 147 148 5 o179 9 d 13 convI2L_reg_reg === _ 28 [[ 9 ]] > > > **2. Find loop body** > When I find myself in a loop, I try to localize the loop head and end, and map out at least one path between them. > `loop_end->print_bfs(20, loop_head, "cox+")` > This provides us with a shortest path, given this path has a distance of at most 20. > > Example: > > (rr) p find_node(158)->print_bfs(20, find_node(160), "cox+") > Find shortest path: 158 -> 160. > > Backtrace target. > dis c dump > --------------------------------------------- > 9 c 160 OuterStripMinedLoop === 160 339 159 [[ 160 358 ]] > 8 c 358 CountedLoop === 358 160 143 [[ 358 362 363 ]] > 7 c 363 If === 358 351 [[ 364 367 ]] > 6 c 364 IfTrue === 363 [[ 128 ]] > 5 c 128 If === 364 127 [[ 129 130 ]] > 4 c 129 IfTrue === 128 [[ 155 ]] > 3 c 155 CountedLoopEnd === 129 154 [[ 157 143 ]] [lt] > 2 c 157 IfFalse === 155 [[ 162 163 ]] > 1 c 162 SafePoint === 157 1 7 1 1 163 100 1 1 13 27 133 [[ 158 ]] > 0 c 158 OuterStripMinedLoopEnd === 162 156 [[ 159 227 ]] > > Example with Mach nodes: > > (rr) p ctrl->print_bfs(10, val, "cdmox-+OB") > Find shortest path: 159 -> 27. > > Backtrace target. 
> dis [head idom d] old e c dump > --------------------------------------------- > 2 24 1 2 o10 + d 27 MachProj === 24 [[ 19 28 4 59 95 99 118 ]] > 1 56 159 7 o239 - d 59 loadB === 159 29 27 60 [[ 55 ]] > 0 159 147 6 _ c 159 Region === 159 57 [[ 159 158 59 ]] Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: remove collection functions for Node::related, removed in last commit ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/8468/files - new: https://git.openjdk.java.net/jdk/pull/8468/files/d78fb82e..20e01120 Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=8468&range=11 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=8468&range=10-11 Stats: 102 lines in 2 files changed: 0 ins; 102 del; 0 mod Patch: https://git.openjdk.java.net/jdk/pull/8468.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8468/head:pull/8468 PR: https://git.openjdk.java.net/jdk/pull/8468 From xlinzheng at openjdk.java.net Mon May 30 02:54:55 2022 From: xlinzheng at openjdk.java.net (Xiaolin Zheng) Date: Mon, 30 May 2022 02:54:55 GMT Subject: RFR: 8287418: riscv: Fix correctness issue of MacroAssembler::movptr [v3] In-Reply-To: References: Message-ID: <952ycGkzxLUMyVL0RTXHaWBsIm-71ze-m3VUU7cUEGM=.ef9d9a1f-45e9-4d75-b06f-657495450118@github.com> On Fri, 27 May 2022 07:24:32 GMT, Xiaolin Zheng wrote: >> Hi team, >> >> `MacroAssembler::movptr()` is designed to load a 47-bit (unsigned) address constant, ranging `[0x0, 0x7FFF_FFFF_FFFF]`, and a special case -1 (`the Universe::non_oop_word()` as we know, which is `0xFFFF_FFFF_FFFF_FFFF`). The former ones are inside a sv48 address space range[1]. Please note that under sv48 a valid address has the bit 47 equal to 0 in user space, so that `MacroAssembler::movptr()` could cover all cases under sv48. 
However, when loading an immediate value ranging `[0x7FFF_8000_0000, 0x7FFF_FFFF_FFFF]` using it, the results would wrongly become `[0xFFFF_7FFF_8000_0000, 0xFFFF_7FFF_FFFF_FFFF]`, which indicates the MSB has polluted high bits in rare cases. >> >> `MacroAssembler::movptr()` is a composition of `lui+addi+slli+addi+slli+addi`, and all of them are signed operations, MIPS alike. >> Precisely, the first `lui+addi` aims to load the first `32-bit`; then the `slli+addi` would load the `11-bit`; finally the last `slli+addi` is going to load the remaining `5-bit`. >> >> To deal with this, there are two approaches: >> >> (a) Use an `addiw` to replace the first `addi`. `addiw` has nearly the same semantics as `addi`, but after the operation the result would be sign-extended according to the bit 31. Due to this feature, we could use this to clean up the dirty high bits at all times. This could also handle the (-1) case. However, `Assembler::li32()`, which is composed of `lui+addiw`, will conflict with the new implementation, needing further adaptations. (Personally I a bit dislike of that) >> >> (b) Alike V8's implementation [2], the trick here is it loads only the first 31-bit using `lui+addi`, with a leading 0 as the bit 31. So this one could prevent this issue at the beginning. As a trade-off, we need to shift one another bit because the leading 0 occupies one bit. Also this one could also handle the (-1) case as well after minor adaptations. (I like this one) >> >> This problem could be reproduced using `-XX:CompressedClassSpaceBaseAddress=0x7FFFF8000000 -XX:CompressedClassSpaceSize=40M -Xshare:off` with fastdebug build, and on Qemu only, for currently I have no access to hardware that supports sv48, and the kernel Ubuntu[3] relies on is Linux 5.15. The kernel (TIP) would first check if hardware sponsors sv57, if not then fall back to sv48, and so on. It is not until Linux 5.17 that sv48 is supported[4]. So this issue could never be reproduced on my boards. 
But fortunately Qemu could sponsor this, because one could mmap an address in 48-bit address space even in a user-level Qemu. >> >> Tested with `-XX:CompressedClassSpaceBaseAddress=0x7FFFF8000000 -XX:CompressedClassSpaceSize=40M -Xshare:off` (reproducible) on Qemu with hotspot tier1 (we should ignore OOM caused the compressed class space), and other tiers are on the way. >> Testing sanity hotspot tier1~tier4 (could not reproduce). Tier1 is finished without new failures. >> >> Thanks, >> Xiaolin >> >> [1] https://github.com/riscv/riscv-isa-manual/blob/9ec8c0105dbf1492b57f6cafdb90a268628f476a/src/supervisor.tex#L1999-L2006 >> [2] https://github.com/v8/v8/blob/main/src/codegen/riscv64/assembler-riscv64.cc#L3479-L3495 >> [3] https://cdimage.ubuntu.com/releases/22.04/release/ >> [4] https://www.phoronix.com/scan.php?page=news_item&px=Linux-5.17-RISC-V-sv48 > > Xiaolin Zheng has updated the pull request incrementally with one additional commit since the last revision: > > Fix comments in `patch_addr_in_movptr` after this change Yes, it won't be reproduced on a physical board for now. Test results from the last weekend seem okay. Also thank you all for the reviews! Going to push this one. ------------- PR: https://git.openjdk.java.net/jdk/pull/8913 From jbhateja at openjdk.java.net Mon May 30 06:56:27 2022 From: jbhateja at openjdk.java.net (Jatin Bhateja) Date: Mon, 30 May 2022 06:56:27 GMT Subject: RFR: 8283894: Intrinsify compress and expand bits on x86 [v8] In-Reply-To: References: Message-ID: > Summary of changes: > > - Patch intrinsifies following newly added Java SE APIs > - Integer.compress > - Integer.expand > - Long.compress > - Long.expand > > - Adds C2 IR nodes and corresponding ideal transformations for new operations. > - We see around ~10x performance speedup due to intrinsification over X86 target. > - Adds an IR framework based test to validate newly introduced IR transformations. > > Kindly review and share your feedback. 
> > Best Regards, > Jatin Jatin Bhateja has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains 12 additional commits since the last revision: - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8283894 - 8283894: Extending new IR value routines with value propagation logic. - 8283894: Disabling sanity test as per review suggestion. - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8283894 - 8283894: Removing CompressExpandSanityTest from problem list. - 8283894: Updating test tag spec. - 8283894: Review comments resolved. - 8283894: Add missing -XX:+UnlockDiagnosticVMOptions. - 8283894: Review comments resolutions. - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8283894 - ... and 2 more: https://git.openjdk.java.net/jdk/compare/34cb64d1...a36dba2e ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/8498/files - new: https://git.openjdk.java.net/jdk/pull/8498/files/553c3c39..a36dba2e Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=8498&range=07 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=8498&range=06-07 Stats: 39261 lines in 654 files changed: 9828 ins; 27372 del; 2061 mod Patch: https://git.openjdk.java.net/jdk/pull/8498.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8498/head:pull/8498 PR: https://git.openjdk.java.net/jdk/pull/8498 From jbhateja at openjdk.java.net Mon May 30 06:56:28 2022 From: jbhateja at openjdk.java.net (Jatin Bhateja) Date: Mon, 30 May 2022 06:56:28 GMT Subject: RFR: 8283894: Intrinsify compress and expand bits on x86 [v8] In-Reply-To: References: Message-ID: On Mon, 2 May 2022 16:12:41 GMT, Paul Sandoz wrote: >> Jatin Bhateja has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. 
The pull request contains 12 additional commits since the last revision: >> >> - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8283894 >> - 8283894: Extending new IR value routines with value propagation logic. >> - 8283894: Disabling sanity test as per review suggestion. >> - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8283894 >> - 8283894: Removing CompressExpandSanityTest from problem list. >> - 8283894: Updating test tag spec. >> - 8283894: Review comments resolved. >> - 8283894: Add missing -XX:+UnlockDiagnosticVMOptions. >> - 8283894: Review comments resolutions. >> - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8283894 >> - ... and 2 more: https://git.openjdk.java.net/jdk/compare/34cb64d1...a36dba2e > > Can you update the jtreg tests: > 1. Modify `CompressExpandTest` to run with and without the intrinsic enabled > 2. Disable (by default) `CompressExpandSanityTest` > ? Hi @PaulSandoz , @rose00 , your comments have been addressed. ------------- PR: https://git.openjdk.java.net/jdk/pull/8498 From jbhateja at openjdk.java.net Mon May 30 06:56:28 2022 From: jbhateja at openjdk.java.net (Jatin Bhateja) Date: Mon, 30 May 2022 06:56:28 GMT Subject: RFR: 8283894: Intrinsify compress and expand bits on x86 [v8] In-Reply-To: References: Message-ID: On Tue, 24 May 2022 17:18:53 GMT, John R Rose wrote: >> I have handled these transformation separately in ideal/identity and value routines. > > OK. I see you did just the constant-folding part of `Value` which is reasonable. > > Please file a followup bug to capture the more elaborate type inferencing proposal, for later use if warranted. Hi @rose00 , I have extended the value routines with value propagation logic. 
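For reference, the bit-level behavior that the compress and expand intrinsics discussed in this thread have to preserve can be sketched in plain Java. This is a minimal loop-based model written for illustration only: the class and method names (`CompressExpandSketch`, `compress`, `expand`) are assumptions and are not taken from the patch, which works on the new `Integer`/`Long` methods and their C2 nodes instead.

```java
// Illustrative reference model only; not code from the patch under review.
final class CompressExpandSketch {

    // Gathers the bits of 'value' selected by 'mask' into the low-order bits of the result.
    static int compress(int value, int mask) {
        int result = 0;
        int outBit = 0;
        for (int bit = 0; bit < Integer.SIZE; bit++) {
            if (((mask >>> bit) & 1) != 0) {
                result |= ((value >>> bit) & 1) << outBit;
                outBit++;
            }
        }
        return result;
    }

    // Scatters the low-order bits of 'value' to the bit positions that are set in 'mask'.
    static int expand(int value, int mask) {
        int result = 0;
        int inBit = 0;
        for (int bit = 0; bit < Integer.SIZE; bit++) {
            if (((mask >>> bit) & 1) != 0) {
                result |= ((value >>> inBit) & 1) << bit;
                inBit++;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // prints "cabab"
        System.out.println(Integer.toHexString(compress(0xCAFEBABE, 0xFF00FFF0)));
        // prints "ca00bab0"
        System.out.println(Integer.toHexString(expand(0x000CABAB, 0xFF00FFF0)));
    }
}
```

In this model a mask of 0 always yields 0 and a mask of -1 returns the value unchanged, for both operations. Facts of that kind are what constant folding or an identity transformation can exploit; which of them the patch actually implements is not stated in this message.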
------------- PR: https://git.openjdk.java.net/jdk/pull/8498 From rcastanedalo at openjdk.java.net Mon May 30 07:09:54 2022 From: rcastanedalo at openjdk.java.net (Roberto =?UTF-8?B?Q2FzdGHDsWVkYQ==?= Lozano) Date: Mon, 30 May 2022 07:09:54 GMT Subject: RFR: 8287438: IGV: scheduling crashes on non-block-start Region with multiple predecessors In-Reply-To: References: Message-ID: On Fri, 27 May 2022 15:47:26 GMT, Vladimir Kozlov wrote: > Okay. Thanks for reviewing! ------------- PR: https://git.openjdk.java.net/jdk/pull/8921 From rcastanedalo at openjdk.java.net Mon May 30 07:10:46 2022 From: rcastanedalo at openjdk.java.net (Roberto =?UTF-8?B?Q2FzdGHDsWVkYQ==?= Lozano) Date: Mon, 30 May 2022 07:10:46 GMT Subject: RFR: 8285558: IGV: scheduling crashes on control-unreachable CFG nodes In-Reply-To: References: Message-ID: On Fri, 27 May 2022 15:46:13 GMT, Vladimir Kozlov wrote: > Good. Thanks for reviewing, Vladimir! ------------- PR: https://git.openjdk.java.net/jdk/pull/8916 From duke at openjdk.java.net Mon May 30 07:36:59 2022 From: duke at openjdk.java.net (Emanuel Peter) Date: Mon, 30 May 2022 07:36:59 GMT Subject: RFR: 8283775: VM support for graph querying in debugger with BFS traversal and node filtering [v13] In-Reply-To: References: Message-ID: > **Note: Refactoring and extension still in progress** > > I implemented VM support for BFS traversal of IR nodes, where one can filter for input/output edges, and the node-type (control / data / memory). If one specifies the `target` node, we find a shortest path between `this` and `target`. With the `options` string one can easily select which node types to traverse. > > `void Node::print_bfs(const uint max_distance, Node* target, const char* options)` > > While a sufficient summary and description is in the comments above the function definition, I want to explain some useful use-cases here. > > Please let me know if you would find this helpful, or if you have any feedback to improve it. 
> Thanks, Emanuel > > **1. Better dump()** > The function is similar to `dump()`, we can also follow input / output edges up to a certain distance. There are a few improvements/added features: > > 1. Display the distance to the specified node. This way I can quickly assess how many nodes I have at a certain distance, and I can traverse edges from one distance to the next. The parent column shows the node one step closer to the BFS root (this). > 2. Node filtering with types taken from `type->category()` help one limit traversal to control, data or memory flow. The categories are the same as in IGV. > 3. Probably controversial: coloring of node types in terminal. By default it is off. May not work on all terminals. But I find it very useful to quickly separate memory, data and control - just as in IGV - but in my terminal! > 4. After matching, we create Mach nodes from the IR nodes. For most Mach nodes, there is an associated old node. Often an assert is triggered for Mach nodes, but the graph already breaks in the IR graph. It is thus very helpful to have the old node idx displayed. > 5. I also display the head node idx of the block a Mach node is scheduled for. This can be helpful if one gets bad dominance asserts, and needs to see why the blocks were scheduled in a bad way. > > Example: > > (rr) p find_node(35)->print_bfs(2, 0, "cdmox+") > No target: perform BFS. > dis par c dump > --------------------------------------------- > 0 35 d 35 CmpP === _ 34 25 [[ 36 ]] > 1 35 d 34 LoadP === _ 31 33 [[ 35 ]] > 1 35 d 25 ConP === 0 [[ 26 27 31 35 41 ]] #NULL > 2 34 m 31 StoreP === 20 27 29 25 [[ 23 34 41 42 ]] > 2 34 d 33 AddP === _ 1 12 32 [[ 34 ]] > > > Example with Mach nodes: > > (rr) p ctrl->print_bfs(4, 0, "cdmox+OB") > No target: perform BFS. 
> dis [head idom d] old par c dump > --------------------------------------------- > 0 159 147 6 _ 159 c 159 Region === 159 57 [[ 159 158 59 ]] > 1 147 148 5 o183 159 c 57 IfTrue === 8 [[ 159 ]] > 2 147 148 5 o182 57 c 8 jmpConU === 147 9 [[ 7 57 ]] > 3 147 148 5 _ 8 c 147 Region === 147 14 [[ 147 8 ]] > 3 147 148 5 o180 8 d 9 compUL_rReg === _ 10 13 [[ 8 ]] > 4 148 149 4 o174 147 c 14 IfTrue === 15 [[ 147 ]] > 4 147 148 5 o203 9 d 10 decL_rReg === _ 11 [[ 12 9 ]] > 4 147 148 5 o179 9 d 13 convI2L_reg_reg === _ 28 [[ 9 ]] > > > **2. Find loop body** > When I find myself in a loop, I try to localize the loop head and end, and map out at least one path between them. > `loop_end->print_bfs(20, loop_head, "cox+")` > This provides us with a shortest path, given this path has a distance of at most 20. > > Example: > > (rr) p find_node(158)->print_bfs(20, find_node(160), "cox+") > Find shortest path: 158 -> 160. > > Backtrace target. > dis c dump > --------------------------------------------- > 9 c 160 OuterStripMinedLoop === 160 339 159 [[ 160 358 ]] > 8 c 358 CountedLoop === 358 160 143 [[ 358 362 363 ]] > 7 c 363 If === 358 351 [[ 364 367 ]] > 6 c 364 IfTrue === 363 [[ 128 ]] > 5 c 128 If === 364 127 [[ 129 130 ]] > 4 c 129 IfTrue === 128 [[ 155 ]] > 3 c 155 CountedLoopEnd === 129 154 [[ 157 143 ]] [lt] > 2 c 157 IfFalse === 155 [[ 162 163 ]] > 1 c 162 SafePoint === 157 1 7 1 1 163 100 1 1 13 27 133 [[ 158 ]] > 0 c 158 OuterStripMinedLoopEnd === 162 156 [[ 159 227 ]] > > Example with Mach nodes: > > (rr) p ctrl->print_bfs(10, val, "cdmox-+OB") > Find shortest path: 159 -> 27. > > Backtrace target. 
> dis [head idom d]  old e c dump
> ---------------------------------------------
>   2    24    1 2  o10 + d   27 MachProj === 24 [[ 19 28 4 59 95 99 118 ]]
>   1    56  159 7 o239 - d   59 loadB === 159 29 27 60 [[ 55 ]]
>   0   159  147 6    _   c  159 Region === 159 57 [[ 159 158 59 ]]

Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision:

  refactor dump and dump_ctrl with dump_bfs (new name for print_bfs)

-------------

Changes:
  - all: https://git.openjdk.java.net/jdk/pull/8468/files
  - new: https://git.openjdk.java.net/jdk/pull/8468/files/20e01120..5cb78328

Webrevs:
 - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=8468&range=12
 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=8468&range=11-12

  Stats: 115 lines in 2 files changed: 8 ins; 96 del; 11 mod
  Patch: https://git.openjdk.java.net/jdk/pull/8468.diff
  Fetch: git fetch https://git.openjdk.java.net/jdk pull/8468/head:pull/8468

PR: https://git.openjdk.java.net/jdk/pull/8468

From chagedorn at openjdk.java.net Mon May 30 07:36:46 2022
From: chagedorn at openjdk.java.net (Christian Hagedorn)
Date: Mon, 30 May 2022 07:36:46 GMT
Subject: RFR: 8287438: IGV: scheduling crashes on non-block-start Region with multiple predecessors
In-Reply-To: References: Message-ID:

On Fri, 27 May 2022 12:28:53 GMT, Roberto Castañeda Lozano wrote:

> IGV scheduling crashes when breaking critical edges that target Region nodes not marked with the `is_block_start` property, by failing to create an appropriate basic block between the source and the destination of the critical edge (see the JBS bug report for more detail). This changeset ensures that such a basic block is created even when the destination node is not marked with `is_block_start`.
>
> #### Testing
>
> - Tested manually on the [graph](https://bugs.openjdk.java.net/secure/attachment/99125/failure.zip) reported in the JBS issue.
>
> - Tested automatically that scheduling tens of thousands of graphs (by instrumenting IGV to schedule parsed graphs eagerly and running `java -Xcomp -XX:-TieredCompilation -XX:PrintIdealGraphLevel=4`) does not trigger any exception or assertion failure.

Looks good!

-------------

Marked as reviewed by chagedorn (Reviewer).

PR: https://git.openjdk.java.net/jdk/pull/8921

From rcastanedalo at openjdk.java.net Mon May 30 07:44:28 2022
From: rcastanedalo at openjdk.java.net (Roberto Castañeda Lozano)
Date: Mon, 30 May 2022 07:44:28 GMT
Subject: RFR: 8287438: IGV: scheduling crashes on non-block-start Region with multiple predecessors
In-Reply-To: References: Message-ID:

On Mon, 30 May 2022 07:33:21 GMT, Christian Hagedorn wrote:

> Looks good!

Thanks, Christian!

-------------

PR: https://git.openjdk.java.net/jdk/pull/8921

From chagedorn at openjdk.java.net Mon May 30 07:34:35 2022
From: chagedorn at openjdk.java.net (Christian Hagedorn)
Date: Mon, 30 May 2022 07:34:35 GMT
Subject: RFR: 8285558: IGV: scheduling crashes on control-unreachable CFG nodes
In-Reply-To: References: Message-ID:

On Fri, 27 May 2022 10:08:22 GMT, Roberto Castañeda Lozano wrote:

> This changeset relaxes IGV's schedule approximation algorithm to handle CFG nodes that are not reachable from the root node via a control path. This is done by 1) removing the assumption that, after `ServerCompilerScheduler::buildBlocks()`, `Node::block` is non-null for CFG nodes; and 2) leaving the assignment of blockless nodes to an artificial "no block" to `InputGraph::ensureNodesInBlocks()`, which is always called after running the schedule approximation algorithm.
>
> Additionally, the changeset marks unreachable CFG nodes with a warning, making it easier to identify ill-formed graphs:
>
> ![screenshot of IGV graph where some nodes are marked with a warning](https://user-images.githubusercontent.com/8792647/170678998-bffc293f-90ce-45e7-9aea-7cc5ab184026.png)
>
> #### Testing
>
> - Tested manually on the two graphs reported in the [JBS issue](https://bugs.openjdk.java.net/browse/JDK-8285558).
>
> - Tested automatically that scheduling tens of thousands of graphs (by instrumenting IGV to schedule parsed graphs eagerly and running `java -Xcomp -XX:-TieredCompilation -XX:PrintIdealGraphLevel=4`) does not introduce any exception or assertion failure.

Looks good! That's useful to mark them with a warning when analyzing dead loop problems.

-------------

Marked as reviewed by chagedorn (Reviewer).

PR: https://git.openjdk.java.net/jdk/pull/8916

From rcastanedalo at openjdk.java.net Mon May 30 07:45:42 2022
From: rcastanedalo at openjdk.java.net (Roberto Castañeda Lozano)
Date: Mon, 30 May 2022 07:45:42 GMT
Subject: RFR: 8285558: IGV: scheduling crashes on control-unreachable CFG nodes
In-Reply-To: References: Message-ID: <4Kk47njMjaNFgui89M1rI2DkNYv-78AEQLpevZzfENk=.35efc991-7ac8-4313-8004-4e9c159571b8@github.com>

On Mon, 30 May 2022 07:30:49 GMT, Christian Hagedorn wrote:

> Looks good! That's useful to mark them with a warning when analyzing dead loop problems.

Thanks for reviewing, Christian!
------------- PR: https://git.openjdk.java.net/jdk/pull/8916 From chagedorn at openjdk.java.net Mon May 30 07:47:44 2022 From: chagedorn at openjdk.java.net (Christian Hagedorn) Date: Mon, 30 May 2022 07:47:44 GMT Subject: RFR: 8287396 LIR_Opr::vreg_number() and data() can return negative number [v2] In-Reply-To: <8TqlrqeRZn0ptrN5Qb-cq_0mh3bskNwMVgSi94TRclI=.44c3f2ac-de40-4127-bffb-0ed5a3d74f5f@github.com> References: <8TqlrqeRZn0ptrN5Qb-cq_0mh3bskNwMVgSi94TRclI=.44c3f2ac-de40-4127-bffb-0ed5a3d74f5f@github.com> Message-ID: On Sat, 28 May 2022 02:21:35 GMT, Dean Long wrote: >> This PR does two things: >> - reverts the incorrect change to non_data_bits that included pointer_bits >> - treats the data() as an unsigned int to prevent a high bit being treated as a negative number > > Dean Long has updated the pull request incrementally with one additional commit since the last revision: > > set vreg_max to a more reasonable limit (10000) That looks more reasonable, thanks for fixing it! Have you also run some additional performance testing? ------------- Marked as reviewed by chagedorn (Reviewer). PR: https://git.openjdk.java.net/jdk/pull/8912 From xlinzheng at openjdk.java.net Mon May 30 07:50:41 2022 From: xlinzheng at openjdk.java.net (Xiaolin Zheng) Date: Mon, 30 May 2022 07:50:41 GMT Subject: Integrated: 8287418: riscv: Fix correctness issue of MacroAssembler::movptr In-Reply-To: References: Message-ID: On Fri, 27 May 2022 04:37:01 GMT, Xiaolin Zheng wrote: > Hi team, > > `MacroAssembler::movptr()` is designed to load a 47-bit (unsigned) address constant, ranging `[0x0, 0x7FFF_FFFF_FFFF]`, and a special case -1 (`the Universe::non_oop_word()` as we know, which is `0xFFFF_FFFF_FFFF_FFFF`). The former ones are inside a sv48 address space range[1]. Please note that under sv48 a valid address has the bit 47 equal to 0 in user space, so that `MacroAssembler::movptr()` could cover all cases under sv48. 
However, when loading an immediate value ranging `[0x7FFF_8000_0000, 0x7FFF_FFFF_FFFF]` using it, the results would wrongly become `[0xFFFF_7FFF_8000_0000, 0xFFFF_7FFF_FFFF_FFFF]`, which indicates the MSB has polluted high bits in rare cases. > > `MacroAssembler::movptr()` is a composition of `lui+addi+slli+addi+slli+addi`, and all of them are signed operations, MIPS alike. > Precisely, the first `lui+addi` aims to load the first `32-bit`; then the `slli+addi` would load the `11-bit`; finally the last `slli+addi` is going to load the remaining `5-bit`. > > To deal with this, there are two approaches: > > (a) Use an `addiw` to replace the first `addi`. `addiw` has nearly the same semantics as `addi`, but after the operation the result would be sign-extended according to the bit 31. Due to this feature, we could use this to clean up the dirty high bits at all times. This could also handle the (-1) case. However, `Assembler::li32()`, which is composed of `lui+addiw`, will conflict with the new implementation, needing further adaptations. (Personally I a bit dislike of that) > > (b) Alike V8's implementation [2], the trick here is it loads only the first 31-bit using `lui+addi`, with a leading 0 as the bit 31. So this one could prevent this issue at the beginning. As a trade-off, we need to shift one another bit because the leading 0 occupies one bit. Also this one could also handle the (-1) case as well after minor adaptations. (I like this one) > > This problem could be reproduced using `-XX:CompressedClassSpaceBaseAddress=0x7FFFF8000000 -XX:CompressedClassSpaceSize=40M -Xshare:off` with fastdebug build, and on Qemu only, for currently I have no access to hardware that supports sv48, and the kernel Ubuntu[3] relies on is Linux 5.15. The kernel (TIP) would first check if hardware sponsors sv57, if not then fall back to sv48, and so on. It is not until Linux 5.17 that sv48 is supported[4]. So this issue could never be reproduced on my boards. 
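The sign-extension effect described above can be checked with plain 64-bit arithmetic. The following is a minimal sketch that models `lui`, `addi` and `addiw` with Java `long`/`int` casts; it is not the HotSpot implementation. The class and method names (`MovptrSignExtensionSketch`, `oldMovptr`, `optionA`, `optionB`) are hypothetical. The remaining low bits are added in a single step instead of through the real `slli+addi` pairs, which is a simplification that assumes those pairs only contribute non-negative immediates.

```java
// Illustrative model of RV64 sign extension; not the HotSpot implementation.
final class MovptrSignExtensionSketch {

    // Interprets the low 12 bits of v as a signed addi immediate.
    static long signExtend12(long v) {
        return (v << 52) >> 52;
    }

    // lui + addi: the lui result is a 32-bit value sign-extended to 64 bits,
    // and addi then performs a full 64-bit add.
    static long luiAddi(long imm) {
        long lower = signExtend12(imm);
        long upper = (long) (int) (imm - lower); // wraps to 32 bits, then sign-extends
        return upper + lower;
    }

    // lui + addiw: the sum is truncated to 32 bits and sign-extended from bit 31.
    static long luiAddiw(long imm) {
        return (long) (int) luiAddi(imm);
    }

    // Old scheme: upper 32 bits of a 48-bit address via lui + addi, then 16 more bits.
    static long oldMovptr(long addr) {
        return (luiAddi(addr >> 16) << 16) + (addr & 0xFFFF);
    }

    // Approach (a): same split, but addiw instead of the first addi.
    static long optionA(long addr) {
        return (luiAddiw(addr >> 16) << 16) + (addr & 0xFFFF);
    }

    // Approach (b): only 31 bits via lui + addi, so bit 31 stays 0; 17 bits remain.
    static long optionB(long addr) {
        return (luiAddi(addr >> 17) << 17) + (addr & 0x1FFFF);
    }

    public static void main(String[] args) {
        long addr = 0x7FFFF8000000L; // the CompressedClassSpaceBaseAddress used in the reproducer
        System.out.println(Long.toHexString(oldMovptr(addr))); // prints "ffff7ffff8000000"
        System.out.println(Long.toHexString(optionA(addr)));   // prints "7ffff8000000"
        System.out.println(Long.toHexString(optionB(addr)));   // prints "7ffff8000000"
    }
}
```

With this address the upper 32 bits are `0x7FFFF800`. Its low 12 bits act as a negative `addi` immediate, so the `lui` part rounds up to `0x80000000` and is sign-extended as a negative value, which is what pollutes the high bits in the old scheme.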
But fortunately Qemu could sponsor this, because one could mmap an address in 48-bit address space even in a user-level Qemu. > > Tested with `-XX:CompressedClassSpaceBaseAddress=0x7FFFF8000000 -XX:CompressedClassSpaceSize=40M -Xshare:off` (reproducible) on Qemu with hotspot tier1 (we should ignore OOM caused the compressed class space), and other tiers are on the way. > Testing sanity hotspot tier1~tier4 (could not reproduce). Tier1 is finished without new failures. > > Thanks, > Xiaolin > > [1] https://github.com/riscv/riscv-isa-manual/blob/9ec8c0105dbf1492b57f6cafdb90a268628f476a/src/supervisor.tex#L1999-L2006 > [2] https://github.com/v8/v8/blob/main/src/codegen/riscv64/assembler-riscv64.cc#L3479-L3495 > [3] https://cdimage.ubuntu.com/releases/22.04/release/ > [4] https://www.phoronix.com/scan.php?page=news_item&px=Linux-5.17-RISC-V-sv48 This pull request has now been integrated. Changeset: 447ae006 Author: Xiaolin Zheng Committer: Fei Yang URL: https://git.openjdk.java.net/jdk/commit/447ae006163b00cc46cac1c7ebe201de311bf1a1 Stats: 20 lines in 5 files changed: 1 ins; 0 del; 19 mod 8287418: riscv: Fix correctness issue of MacroAssembler::movptr Reviewed-by: fjiang, yadongwang, fyang ------------- PR: https://git.openjdk.java.net/jdk/pull/8913 From chagedorn at openjdk.java.net Mon May 30 07:53:43 2022 From: chagedorn at openjdk.java.net (Christian Hagedorn) Date: Mon, 30 May 2022 07:53:43 GMT Subject: RFR: JDK-8287288: Fix some typos in C1 [v2] In-Reply-To: References: Message-ID: On Fri, 27 May 2022 06:13:35 GMT, Zhuojun Miao wrote: >> This is a trivial patch to fix some typos in C1. >> e.g. https://github.com/openjdk/jdk/blob/master/src/hotspot/share/c1/c1_LIRGenerator.cpp#L1190 > > Zhuojun Miao has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. 
The pull request contains two additional commits since the last revision: > > - Complete the diagram with other non-data bits > - JDK-8287288: Fix some typos in C1 Marked as reviewed by chagedorn (Reviewer). ------------- PR: https://git.openjdk.java.net/jdk/pull/8880 From zmiao at openjdk.java.net Mon May 30 07:56:44 2022 From: zmiao at openjdk.java.net (Zhuojun Miao) Date: Mon, 30 May 2022 07:56:44 GMT Subject: Integrated: JDK-8287288: Fix some typos in C1 In-Reply-To: References: Message-ID: On Wed, 25 May 2022 09:11:23 GMT, Zhuojun Miao wrote: > This is a trivial patch to fix some typos in C1. > e.g. https://github.com/openjdk/jdk/blob/master/src/hotspot/share/c1/c1_LIRGenerator.cpp#L1190 This pull request has now been integrated. Changeset: 1b9987cb Author: Zhuojun Miao Committer: Christian Hagedorn URL: https://git.openjdk.java.net/jdk/commit/1b9987cb08611a98e6351876aa7da4e56d4a5d2e Stats: 16 lines in 6 files changed: 0 ins; 3 del; 13 mod 8287288: Fix some typos in C1 Reviewed-by: aph, dholmes, dlong, chagedorn ------------- PR: https://git.openjdk.java.net/jdk/pull/8880 From ngasson at openjdk.java.net Mon May 30 08:13:44 2022 From: ngasson at openjdk.java.net (Nick Gasson) Date: Mon, 30 May 2022 08:13:44 GMT Subject: Integrated: 8287195: AArch64: Client VM build failure after JDK-8283689 In-Reply-To: References: Message-ID: On Thu, 26 May 2022 19:39:42 GMT, Nick Gasson wrote: > The client build fails because foreignGlobals_aarch64.cpp uses Matcher without `#ifdef COMPILER2`. However there's a latent bug here on SVE machines where `RegSpiller::pd_reg_size()` returns the SVE register size but other code that uses the register spill area (e.g. `DowncallStubGenerator::generate()` and `AArch64Architecture.VECTOR_REG_SIZE`) assume that we always save 16 bytes. Since we don't support any calling conventions where arguments/results are passed in long vectors we should > just save the first 128 bits of the register, like x86 does. 
This pull request has now been integrated. Changeset: 19fb8ab8 Author: Nick Gasson URL: https://git.openjdk.java.net/jdk/commit/19fb8ab8b9a3366850ed224c35f3cd163c0511e5 Stats: 18 lines in 1 file changed: 0 ins; 15 del; 3 mod 8287195: AArch64: Client VM build failure after JDK-8283689 Reviewed-by: jvernee, adinn ------------- PR: https://git.openjdk.java.net/jdk/pull/8908 From chagedorn at openjdk.java.net Mon May 30 10:41:35 2022 From: chagedorn at openjdk.java.net (Christian Hagedorn) Date: Mon, 30 May 2022 10:41:35 GMT Subject: RFR: 8286940: [IR Framework] Allow IR tests to build and use Whitebox without -DSkipWhiteBoxInstall=true In-Reply-To: References: Message-ID: On Wed, 25 May 2022 08:17:17 GMT, Christian Hagedorn wrote: > Currently, the IR framework always tries to install the Whitebox by moving the Whitebox class file to the JTreg class path. However, when a test already builds the Whitebox and uses it as part of the test, we cannot access it on certain platforms. On Windows, for example, we'll get the following exception: > > Caused by: java.nio.file.FileSystemException: sun\hotspot\WhiteBox.class: The process cannot access the file because it is being used by another process > > To mitigate this problem, one can specify `-DSkipWhiteBoxInstall=true` which was already done in [JDK-8283187](https://bugs.openjdk.java.net/browse/JDK-8283187). But this is not a good solution as the user should not need to worry about the inner workings of the IR framework. > > I propose to get rid of this flag by reworking the Whitebox installation process. > > Thanks, > Christian Thanks Vladimir for your review! 
------------- PR: https://git.openjdk.java.net/jdk/pull/8879 From duke at openjdk.java.net Mon May 30 10:51:15 2022 From: duke at openjdk.java.net (Emanuel Peter) Date: Mon, 30 May 2022 10:51:15 GMT Subject: RFR: 8283775: VM support for graph querying in debugger with BFS traversal and node filtering [v14] In-Reply-To: References: Message-ID: > **Note: Refactoring and extension still in progress** > > I implemented VM support for BFS traversal of IR nodes, where one can filter for input/output edges, and the node-type (control / data / memory). If one specifies the `target` node, we find a shortest path between `this` and `target`. With the `options` string one can easily select which node types to traverse. > > `void Node::print_bfs(const uint max_distance, Node* target, const char* options)` > > While a sufficient summary and description is in the comments above the function definition, I want to explain some useful use-cases here. > > Please let me know if you would find this helpful, or if you have any feedback to improve it. > Thanks, Emanuel > > **1. Better dump()** > The function is similar to `dump()`, we can also follow input / output edges up to a certain distance. There are a few improvements/added features: > > 1. Display the distance to the specified node. This way I can quickly assess how many nodes I have at a certain distance, and I can traverse edges from one distance to the next. The parent column shows the node one step closer to the BFS root (this). > 2. Node filtering with types taken from `type->category()` help one limit traversal to control, data or memory flow. The categories are the same as in IGV. > 3. Probably controversial: coloring of node types in terminal. By default it is off. May not work on all terminals. But I find it very useful to quickly separate memory, data and control - just as in IGV - but in my terminal! > 4. After matching, we create Mach nodes from the IR nodes. For most Mach nodes, there is an associated old node. 
Often an assert is triggered for Mach nodes, but the graph already breaks in the IR graph. It is thus very helpful to have the old node idx displayed. > 5. I also display the head node idx of the block a Mach node is scheduled for. This can be helpful if one gets bad dominance asserts, and needs to see why the blocks were scheduled in a bad way. > > Example: > > (rr) p find_node(35)->print_bfs(2, 0, "cdmox+") > No target: perform BFS. > dis par c dump > --------------------------------------------- > 0 35 d 35 CmpP === _ 34 25 [[ 36 ]] > 1 35 d 34 LoadP === _ 31 33 [[ 35 ]] > 1 35 d 25 ConP === 0 [[ 26 27 31 35 41 ]] #NULL > 2 34 m 31 StoreP === 20 27 29 25 [[ 23 34 41 42 ]] > 2 34 d 33 AddP === _ 1 12 32 [[ 34 ]] > > > Example with Mach nodes: > > (rr) p ctrl->print_bfs(4, 0, "cdmox+OB") > No target: perform BFS. > dis [head idom d] old par c dump > --------------------------------------------- > 0 159 147 6 _ 159 c 159 Region === 159 57 [[ 159 158 59 ]] > 1 147 148 5 o183 159 c 57 IfTrue === 8 [[ 159 ]] > 2 147 148 5 o182 57 c 8 jmpConU === 147 9 [[ 7 57 ]] > 3 147 148 5 _ 8 c 147 Region === 147 14 [[ 147 8 ]] > 3 147 148 5 o180 8 d 9 compUL_rReg === _ 10 13 [[ 8 ]] > 4 148 149 4 o174 147 c 14 IfTrue === 15 [[ 147 ]] > 4 147 148 5 o203 9 d 10 decL_rReg === _ 11 [[ 12 9 ]] > 4 147 148 5 o179 9 d 13 convI2L_reg_reg === _ 28 [[ 9 ]] > > > **2. Find loop body** > When I find myself in a loop, I try to localize the loop head and end, and map out at least one path between them. > `loop_end->print_bfs(20, loop_head, "cox+")` > This provides us with a shortest path, given this path has a distance of at most 20. > > Example: > > (rr) p find_node(158)->print_bfs(20, find_node(160), "cox+") > Find shortest path: 158 -> 160. > > Backtrace target. 
> dis c dump > --------------------------------------------- > 9 c 160 OuterStripMinedLoop === 160 339 159 [[ 160 358 ]] > 8 c 358 CountedLoop === 358 160 143 [[ 358 362 363 ]] > 7 c 363 If === 358 351 [[ 364 367 ]] > 6 c 364 IfTrue === 363 [[ 128 ]] > 5 c 128 If === 364 127 [[ 129 130 ]] > 4 c 129 IfTrue === 128 [[ 155 ]] > 3 c 155 CountedLoopEnd === 129 154 [[ 157 143 ]] [lt] > 2 c 157 IfFalse === 155 [[ 162 163 ]] > 1 c 162 SafePoint === 157 1 7 1 1 163 100 1 1 13 27 133 [[ 158 ]] > 0 c 158 OuterStripMinedLoopEnd === 162 156 [[ 159 227 ]] > > Example with Mach nodes: > > (rr) p ctrl->print_bfs(10, val, "cdmox-+OB") > Find shortest path: 159 -> 27. > > Backtrace target. > dis [head idom d] old e c dump > --------------------------------------------- > 2 24 1 2 o10 + d 27 MachProj === 24 [[ 19 28 4 59 95 99 118 ]] > 1 56 159 7 o239 - d 59 loadB === 159 29 27 60 [[ 55 ]] > 0 159 147 6 _ c 159 Region === 159 57 [[ 159 158 59 ]] Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: remove dead function declaration ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/8468/files - new: https://git.openjdk.java.net/jdk/pull/8468/files/5cb78328..1449b4b9 Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=8468&range=13 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=8468&range=12-13 Stats: 1 line in 1 file changed: 0 ins; 1 del; 0 mod Patch: https://git.openjdk.java.net/jdk/pull/8468.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8468/head:pull/8468 PR: https://git.openjdk.java.net/jdk/pull/8468 From duke at openjdk.java.net Mon May 30 11:16:22 2022 From: duke at openjdk.java.net (Emanuel Peter) Date: Mon, 30 May 2022 11:16:22 GMT Subject: RFR: 8283775: VM support for graph querying in debugger with BFS traversal and node filtering [v15] In-Reply-To: References: Message-ID: > **Note: Refactoring and extension still in progress** > > I implemented VM support for BFS 
traversal of IR nodes, where one can filter for input/output edges, and the node-type (control / data / memory). If one specifies the `target` node, we find a shortest path between `this` and `target`. With the `options` string one can easily select which node types to traverse. > > `void Node::print_bfs(const uint max_distance, Node* target, const char* options)` > > While a sufficient summary and description is in the comments above the function definition, I want to explain some useful use-cases here. > > Please let me know if you would find this helpful, or if you have any feedback to improve it. > Thanks, Emanuel > > **1. Better dump()** > The function is similar to `dump()`, we can also follow input / output edges up to a certain distance. There are a few improvements/added features: > > 1. Display the distance to the specified node. This way I can quickly assess how many nodes I have at a certain distance, and I can traverse edges from one distance to the next. The parent column shows the node one step closer to the BFS root (this). > 2. Node filtering with types taken from `type->category()` help one limit traversal to control, data or memory flow. The categories are the same as in IGV. > 3. Probably controversial: coloring of node types in terminal. By default it is off. May not work on all terminals. But I find it very useful to quickly separate memory, data and control - just as in IGV - but in my terminal! > 4. After matching, we create Mach nodes from the IR nodes. For most Mach nodes, there is an associated old node. Often an assert is triggered for Mach nodes, but the graph already breaks in the IR graph. It is thus very helpful to have the old node idx displayed. > 5. I also display the head node idx of the block a Mach node is scheduled for. This can be helpful if one gets bad dominance asserts, and needs to see why the blocks were scheduled in a bad way. > > Example: > > (rr) p find_node(35)->print_bfs(2, 0, "cdmox+") > No target: perform BFS. 
> dis par c dump > --------------------------------------------- > 0 35 d 35 CmpP === _ 34 25 [[ 36 ]] > 1 35 d 34 LoadP === _ 31 33 [[ 35 ]] > 1 35 d 25 ConP === 0 [[ 26 27 31 35 41 ]] #NULL > 2 34 m 31 StoreP === 20 27 29 25 [[ 23 34 41 42 ]] > 2 34 d 33 AddP === _ 1 12 32 [[ 34 ]] > > > Example with Mach nodes: > > (rr) p ctrl->print_bfs(4, 0, "cdmox+OB") > No target: perform BFS. > dis [head idom d] old par c dump > --------------------------------------------- > 0 159 147 6 _ 159 c 159 Region === 159 57 [[ 159 158 59 ]] > 1 147 148 5 o183 159 c 57 IfTrue === 8 [[ 159 ]] > 2 147 148 5 o182 57 c 8 jmpConU === 147 9 [[ 7 57 ]] > 3 147 148 5 _ 8 c 147 Region === 147 14 [[ 147 8 ]] > 3 147 148 5 o180 8 d 9 compUL_rReg === _ 10 13 [[ 8 ]] > 4 148 149 4 o174 147 c 14 IfTrue === 15 [[ 147 ]] > 4 147 148 5 o203 9 d 10 decL_rReg === _ 11 [[ 12 9 ]] > 4 147 148 5 o179 9 d 13 convI2L_reg_reg === _ 28 [[ 9 ]] > > > **2. Find loop body** > When I find myself in a loop, I try to localize the loop head and end, and map out at least one path between them. > `loop_end->print_bfs(20, loop_head, "cox+")` > This provides us with a shortest path, given this path has a distance of at most 20. > > Example: > > (rr) p find_node(158)->print_bfs(20, find_node(160), "cox+") > Find shortest path: 158 -> 160. > > Backtrace target. > dis c dump > --------------------------------------------- > 9 c 160 OuterStripMinedLoop === 160 339 159 [[ 160 358 ]] > 8 c 358 CountedLoop === 358 160 143 [[ 358 362 363 ]] > 7 c 363 If === 358 351 [[ 364 367 ]] > 6 c 364 IfTrue === 363 [[ 128 ]] > 5 c 128 If === 364 127 [[ 129 130 ]] > 4 c 129 IfTrue === 128 [[ 155 ]] > 3 c 155 CountedLoopEnd === 129 154 [[ 157 143 ]] [lt] > 2 c 157 IfFalse === 155 [[ 162 163 ]] > 1 c 162 SafePoint === 157 1 7 1 1 163 100 1 1 13 27 133 [[ 158 ]] > 0 c 158 OuterStripMinedLoopEnd === 162 156 [[ 159 227 ]] > > Example with Mach nodes: > > (rr) p ctrl->print_bfs(10, val, "cdmox-+OB") > Find shortest path: 159 -> 27. 
> > Backtrace target. > dis [head idom d] old e c dump > --------------------------------------------- > 2 24 1 2 o10 + d 27 MachProj === 24 [[ 19 28 4 59 95 99 118 ]] > 1 56 159 7 o239 - d 59 loadB === 159 29 27 60 [[ 55 ]] > 0 159 147 6 _ c 159 Region === 159 57 [[ 159 158 59 ]] Emanuel Peter has updated the pull request incrementally with four additional commits since the last revision: - Revert "remove dead function declaration" This reverts commit 1449b4b93c1d3bff19f0a784f6f594f3ba3453f0. - Revert "remove Node::related, dump_related, dump_related_compact" This reverts commit d78fb82ead76f9f60415657a030b2e1b0c15d246. - Revert "remove collection functions for Node::related, removed in last commit" This reverts commit 20e01120663b80ab79cca9fde6a1c3d9fc55f0dd. - Revert "refactor dump and dump_ctrl with dump_bfs (new name for print_bfs)" This reverts commit 5cb78328c1ec660ec0fd0791eae3b56d853ef71a. ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/8468/files - new: https://git.openjdk.java.net/jdk/pull/8468/files/1449b4b9..49d3845b Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=8468&range=14 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=8468&range=13-14 Stats: 508 lines in 13 files changed: 489 ins; 4 del; 15 mod Patch: https://git.openjdk.java.net/jdk/pull/8468.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8468/head:pull/8468 PR: https://git.openjdk.java.net/jdk/pull/8468 From duke at openjdk.java.net Mon May 30 11:35:34 2022 From: duke at openjdk.java.net (Emanuel Peter) Date: Mon, 30 May 2022 11:35:34 GMT Subject: RFR: 8283775: VM support for graph querying in debugger with BFS traversal and node filtering [v16] In-Reply-To: References: Message-ID: > **Note: Refactoring and extension still in progress** > > I implemented VM support for BFS traversal of IR nodes, where one can filter for input/output edges, and the node-type (control / data / memory). 
If one specifies the `target` node, we find a shortest path between `this` and `target`. With the `options` string one can easily select which node types to traverse. > > `void Node::print_bfs(const uint max_distance, Node* target, const char* options)` > > While a sufficient summary and description is in the comments above the function definition, I want to explain some useful use-cases here. > > Please let me know if you would find this helpful, or if you have any feedback to improve it. > Thanks, Emanuel > > **1. Better dump()** > The function is similar to `dump()`, we can also follow input / output edges up to a certain distance. There are a few improvements/added features: > > 1. Display the distance to the specified node. This way I can quickly assess how many nodes I have at a certain distance, and I can traverse edges from one distance to the next. The parent column shows the node one step closer to the BFS root (this). > 2. Node filtering with types taken from `type->category()` help one limit traversal to control, data or memory flow. The categories are the same as in IGV. > 3. Probably controversial: coloring of node types in terminal. By default it is off. May not work on all terminals. But I find it very useful to quickly separate memory, data and control - just as in IGV - but in my terminal! > 4. After matching, we create Mach nodes from the IR nodes. For most Mach nodes, there is an associated old node. Often an assert is triggered for Mach nodes, but the graph already breaks in the IR graph. It is thus very helpful to have the old node idx displayed. > 5. I also display the head node idx of the block a Mach node is scheduled for. This can be helpful if one gets bad dominance asserts, and needs to see why the blocks were scheduled in a bad way. > > Example: > > (rr) p find_node(35)->print_bfs(2, 0, "cdmox+") > No target: perform BFS. 
> dis par c dump > --------------------------------------------- > 0 35 d 35 CmpP === _ 34 25 [[ 36 ]] > 1 35 d 34 LoadP === _ 31 33 [[ 35 ]] > 1 35 d 25 ConP === 0 [[ 26 27 31 35 41 ]] #NULL > 2 34 m 31 StoreP === 20 27 29 25 [[ 23 34 41 42 ]] > 2 34 d 33 AddP === _ 1 12 32 [[ 34 ]] > > > Example with Mach nodes: > > (rr) p ctrl->print_bfs(4, 0, "cdmox+OB") > No target: perform BFS. > dis [head idom d] old par c dump > --------------------------------------------- > 0 159 147 6 _ 159 c 159 Region === 159 57 [[ 159 158 59 ]] > 1 147 148 5 o183 159 c 57 IfTrue === 8 [[ 159 ]] > 2 147 148 5 o182 57 c 8 jmpConU === 147 9 [[ 7 57 ]] > 3 147 148 5 _ 8 c 147 Region === 147 14 [[ 147 8 ]] > 3 147 148 5 o180 8 d 9 compUL_rReg === _ 10 13 [[ 8 ]] > 4 148 149 4 o174 147 c 14 IfTrue === 15 [[ 147 ]] > 4 147 148 5 o203 9 d 10 decL_rReg === _ 11 [[ 12 9 ]] > 4 147 148 5 o179 9 d 13 convI2L_reg_reg === _ 28 [[ 9 ]] > > > **2. Find loop body** > When I find myself in a loop, I try to localize the loop head and end, and map out at least one path between them. > `loop_end->print_bfs(20, loop_head, "cox+")` > This provides us with a shortest path, given this path has a distance of at most 20. > > Example: > > (rr) p find_node(158)->print_bfs(20, find_node(160), "cox+") > Find shortest path: 158 -> 160. > > Backtrace target. > dis c dump > --------------------------------------------- > 9 c 160 OuterStripMinedLoop === 160 339 159 [[ 160 358 ]] > 8 c 358 CountedLoop === 358 160 143 [[ 358 362 363 ]] > 7 c 363 If === 358 351 [[ 364 367 ]] > 6 c 364 IfTrue === 363 [[ 128 ]] > 5 c 128 If === 364 127 [[ 129 130 ]] > 4 c 129 IfTrue === 128 [[ 155 ]] > 3 c 155 CountedLoopEnd === 129 154 [[ 157 143 ]] [lt] > 2 c 157 IfFalse === 155 [[ 162 163 ]] > 1 c 162 SafePoint === 157 1 7 1 1 163 100 1 1 13 27 133 [[ 158 ]] > 0 c 158 OuterStripMinedLoopEnd === 162 156 [[ 159 227 ]] > > Example with Mach nodes: > > (rr) p ctrl->print_bfs(10, val, "cdmox-+OB") > Find shortest path: 159 -> 27. 
> > Backtrace target. > dis [head idom d] old e c dump > --------------------------------------------- > 2 24 1 2 o10 + d 27 MachProj === 24 [[ 19 28 4 59 95 99 118 ]] > 1 56 159 7 o239 - d 59 loadB === 159 29 27 60 [[ 55 ]] > 0 159 147 6 _ c 159 Region === 159 57 [[ 159 158 59 ]] Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: remove todos ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/8468/files - new: https://git.openjdk.java.net/jdk/pull/8468/files/49d3845b..a3422b30 Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=8468&range=15 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=8468&range=14-15 Stats: 3 lines in 2 files changed: 0 ins; 1 del; 2 mod Patch: https://git.openjdk.java.net/jdk/pull/8468.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8468/head:pull/8468 PR: https://git.openjdk.java.net/jdk/pull/8468 From fgao at openjdk.java.net Mon May 30 11:44:39 2022 From: fgao at openjdk.java.net (Fei Gao) Date: Mon, 30 May 2022 11:44:39 GMT Subject: RFR: 8282470: Eliminate useless sign extension before some subword integer operations [v3] In-Reply-To: References: Message-ID: <9Q6AYlmc11Qlp9wOhyoUBbym8J9fsrk_yOLpZkn91qg=.d395e9fd-332a-4ffd-a17e-3f4c9884e3c9@github.com> On Fri, 27 May 2022 15:32:17 GMT, Vladimir Kozlov wrote: >> Fei Gao has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. 
The pull request contains three additional commits since the last revision:
>>
>> - Merge branch 'master' into fg8282470
>>
>>   Change-Id: I180f1c85bd407b3d7e05937450c5fc0f81e6d70b
>> - Merge branch 'master' into fg8282470
>>
>>   Change-Id: I877ba1e9a82c0dbef04df08070223c02400eeec7
>> - 8282470: Eliminate useless sign extension before some subword integer operations
>>
>> Some loop cases of subword types, including byte and
>> short, can't be vectorized by C2's SLP. Here is an example:
>> ```
>> // SIZE and sres are assumed to be fields of the enclosing class.
>> short[] addShort(short[] a, short[] b, short[] c) {
>>   for (int i = 0; i < SIZE; i++) {
>>     b[i] = (short) (a[i] + 8);          // *line A*
>>     sres[i] = (short) (b[i] + c[i]);    // *line B*
>>   }
>>   return sres;
>> }
>> ```
>> However, similar cases of int/float/double/long/char type can
>> be vectorized successfully.
>>
>> The reason why SLP can't vectorize the short case above is
>> that, as illustrated here[1], the result of the scalar add
>> operation on *line A* has been promoted to int type. It needs
>> to be narrowed to short type first before it can work as one
>> of the source operands of the addition on *line B*. The demotion
>> is done by left-shifting 16 bits then right-shifting 16 bits.
>> The ideal graph for the process is shown below.
>>
>>       LoadS a[i]   8
>>            \      /
>>          AddI (line A)
>>          /        \
>>   StoreC b[i]   LShiftI 16 bits
>>                    \
>>                 RShiftI 16 bits   LoadS c[i]
>>                      \             /
>>                       AddI (line B)
>>                            \
>>                          StoreC sres[i]
>>
>> In SLP, for most short-type cases, we can determine the precise
>> type of the scalar int-type operation and finally execute it
>> with short-type vector operations[2], except rshift opcode and
>> abs in some situations[3]. But in this case, the source operand
>> of RShiftI is from LShiftI rather than from any LoadS[4], so we
>> can't determine its real type and conservatively assign it
>> int type rather than the real short type. The int-type operation
>> RShiftI here can't be vectorized together with other short-type
>> operations, like AddI(line B). The reason for byte loop cases
>> is the same. Similar loop cases of char type could be
>> vectorized because its demotion from int to char is done by
>> `and` with mask rather than `lshift_rshift`.
>>
>> Therefore, we try to remove the patterns like
>> `RShiftI _ (LShiftI _ valIn1 conIL ) conIR` in the byte/short
>> cases, to vectorize more scenarios. Optimizing it in the
>> mid-end by i-GVN is more reasonable.
>>
>> What we do in the mid-end is eliminate the sign extension
>> before some subword integer operations like:
>>
>> ```
>> int x, y;
>> short s = (short) (((x << Imm) >> Imm) OP y); // Imm <= 16
>> ```
>> to
>> ```
>> short s = (short) (x OP y);
>> ```
>>
>> In the patch, assuming that `x` can be any int number, we need
>> to guarantee that the optimization doesn't have any impact on
>> the result. Not all arithmetic logic OPs meet the requirements.
>> For example, assuming that `Imm` equals `16`, `x` equals `131068`,
>> `y` equals `50` and `OP` is division `/`,
>> `short s = (short) (((131068 << 16) >> 16) / 50)` is not
>> equal to `short s = (short) (131068 / 50)`. When OP is division,
>> we may get a different result with or without demotion
>> before OP, because the upper 16 bits of the operands may
>> influence the lower 16 bits of the result, so division can't be
>> optimized. All optimizable opcodes are listed in
>> StoreNode::no_need_sign_extension(): for these, the upper 16 bits
>> of the src operands don't influence the lower 16 bits of the
>> result for short type, and the upper 24 bits of the src operands
>> don't influence the lower 8 bits of the dst operand for byte.
>>
>> After the patch, the short loop case above can be vectorized as:
>> ```
>> movi v18.8h, #0x8
>> ...
>> ldr q16, [x14, #32] // vector load a[i]
>> // vector add, a[i] + 8, no promotion or demotion
>> add v17.8h, v16.8h, v18.8h
>> str q17, [x6, #32] // vector store a[i] + 8, b[i]
>> ldr q17, [x0, #32] // vector load c[i]
>> // vector add, a[i] + c[i], no promotion or demotion
>> add v16.8h, v17.8h, v16.8h
>> // vector add, a[i] + c[i] + 8, no promotion or demotion
>> add v16.8h, v16.8h, v18.8h
>> str q16, [x11, #32] // vector store sres[i]
>> ...
>> ```
>>
>> The patch works for byte cases as well.
>>
>> Here is the performance data for micro-benchmark before
>> and after this patch on both AArch64 and x64 machines.
>> We can observe about 83% improvement with this patch.
>>
>> on AArch64:
>> Before the patch:
>> Benchmark  (length)  Mode  Cnt    Score    Error  Units
>> addB            523  avgt   15  401.521 ±  0.033  ns/op
>> addS            523  avgt   15  401.512 ±  0.021  ns/op
>>
>> After the patch:
>> Benchmark  (length)  Mode  Cnt    Score    Error  Units
>> addB            523  avgt   15   68.444 ±  0.318  ns/op
>> addS            523  avgt   15   69.847 ±  0.043  ns/op
>>
>> on x86:
>> Before the patch:
>> Benchmark  (length)  Mode  Cnt    Score    Error  Units
>> addB            523  avgt   15  454.102 ± 36.180  ns/op
>> addS            523  avgt   15  432.245 ± 22.640  ns/op
>>
>> After the patch:
>> Benchmark  (length)  Mode  Cnt    Score    Error  Units
>> addB            523  avgt   15   75.812 ±  5.063  ns/op
>> addS            523  avgt   15   72.839 ± 10.109  ns/op
>>
>> [1]: https://github.com/openjdk/jdk/blob/6013d09e82693a1c07cf0bf6daffd95114b3cbfa/src/hotspot/share/opto/superword.cpp#L3241
>> [2]: https://github.com/openjdk/jdk/blob/6013d09e82693a1c07cf0bf6daffd95114b3cbfa/src/hotspot/share/opto/superword.cpp#L3206
>> [3]: https://github.com/openjdk/jdk/blob/6013d09e82693a1c07cf0bf6daffd95114b3cbfa/src/hotspot/share/opto/superword.cpp#L3249
>> [4]: https://github.com/openjdk/jdk/blob/6013d09e82693a1c07cf0bf6daffd95114b3cbfa/src/hotspot/share/opto/superword.cpp#L3251
>>
>> Change-Id: I92ce42b550ef057964a3b58716436735275d8d31
>
> Testing tier1-4 passed.
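
The division counterexample in the quoted commit message can be checked with a small standalone program. This is only a sketch in plain Java, not code from the patch: the class name `SignExtensionDemo` and its helper methods are made up for illustration, and `signExtend16` merely mimics the `LShiftI`/`RShiftI` pair by shifting left and then right by 16 bits.

```java
// Hypothetical demo class, not part of the JDK or of the patch under review.
class SignExtensionDemo {
    // Mimics the LShiftI/RShiftI pair: keep the low 16 bits and sign-extend them.
    static int signExtend16(int x) {
        return (x << 16) >> 16;
    }

    // Division: the upper 16 bits of x influence the low 16 bits of the quotient.
    static short divWithExtension(int x, int y) {
        return (short) (signExtend16(x) / y);
    }

    static short divWithoutExtension(int x, int y) {
        return (short) (x / y);
    }

    // Addition: the low 16 bits of the sum depend only on the low 16 bits of the inputs.
    static short addWithExtension(int x, int y) {
        return (short) (signExtend16(x) + y);
    }

    static short addWithoutExtension(int x, int y) {
        return (short) (x + y);
    }

    public static void main(String[] args) {
        int x = 131068; // 0x0001FFFC, the value used in the commit message
        int y = 50;
        System.out.println("div with extension:    " + divWithExtension(x, y));    // 0
        System.out.println("div without extension: " + divWithoutExtension(x, y)); // 2621
        System.out.println("add with extension:    " + addWithExtension(x, y));    // 46
        System.out.println("add without extension: " + addWithoutExtension(x, y)); // 46
    }
}
```

For these inputs the two division variants disagree (0 versus 2621) while the two addition variants agree (46), which matches the claim that addition tolerates dropping the sign extension and division does not.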
Thanks a lot for your kind review and also for your test work, @vnkozlov.

-------------

PR: https://git.openjdk.java.net/jdk/pull/7954

From duke at openjdk.java.net Mon May 30 12:03:09 2022
From: duke at openjdk.java.net (Emanuel Peter)
Date: Mon, 30 May 2022 12:03:09 GMT
Subject: RFR: 8283466: C2: missing skeleton predicates in peeled loop
Message-ID: <_DPfzm_6zsaUDuRyXXHK_rZYVTMYmJ_JMcUpoBfb6kA=.13ac6f8b-5aba-42b9-805e-174c83e45816@github.com>

Implemented initializing skeleton predicates for the peeled loop.

Some predicates / checks before loops depend on the range of the loop and are checked at runtime. When we split a loop (e.g. peeling, pre/main/post, unswitching), one of the sub-loops may get impossible data types and remove the data flow. For static type analysis of the control flow, we need so-called skeleton predicates that implement these loop checks before each split-off loop. If we do not do this static analysis, we generally get `bad graph` asserts, as only removing data flow and not control flow leads to broken graphs.

This was already implemented for pre/main/post loops and loop unswitching, but not for peeling.

Ran a large test suite. Manual inspection shows that the instantiated skeleton predicate indeed collapses in the provided regression test.

Rerunning some tests now...

-------------

Commit messages:
 - remove refactoring assert / old code
 - fixed comments after Christian's review
 - response to Christian's review comments
 - rename, is not known that it is actually loop node
 - fixed bug: only move dependency if it is the old node of a newly cloned loop_node
 - greatly reduced and commented test. And it should reproduce again after Rolands loop-incr type fix
 - small renaming and improved comments
 - small prettification
 - refactoring find_all_predicates to class Predicates
 - unnecessary code removal and renamings
 - ...
and 2 more: https://git.openjdk.java.net/jdk/compare/47500b24...a4091b66 Changes: https://git.openjdk.java.net/jdk/pull/8783/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=8783&range=00 Issue: https://bugs.openjdk.java.net/browse/JDK-8283466 Stats: 248 lines in 4 files changed: 211 ins; 0 del; 37 mod Patch: https://git.openjdk.java.net/jdk/pull/8783.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8783/head:pull/8783 PR: https://git.openjdk.java.net/jdk/pull/8783 From duke at openjdk.java.net Mon May 30 12:14:14 2022 From: duke at openjdk.java.net (Emanuel Peter) Date: Mon, 30 May 2022 12:14:14 GMT Subject: RFR: 8283775: VM support for graph querying in debugger with BFS traversal and node filtering [v17] In-Reply-To: References: Message-ID: > **Note: Refactoring and extension still in progress** > > I implemented VM support for BFS traversal of IR nodes, where one can filter for input/output edges, and the node-type (control / data / memory). If one specifies the `target` node, we find a shortest path between `this` and `target`. With the `options` string one can easily select which node types to traverse. > > `void Node::print_bfs(const uint max_distance, Node* target, const char* options)` > > While a sufficient summary and description is in the comments above the function definition, I want to explain some useful use-cases here. > > Please let me know if you would find this helpful, or if you have any feedback to improve it. > Thanks, Emanuel > > **1. Better dump()** > The function is similar to `dump()`, we can also follow input / output edges up to a certain distance. There are a few improvements/added features: > > 1. Display the distance to the specified node. This way I can quickly assess how many nodes I have at a certain distance, and I can traverse edges from one distance to the next. The parent column shows the node one step closer to the BFS root (this). > 2. 
Node filtering with types taken from `type->category()` help one limit traversal to control, data or memory flow. The categories are the same as in IGV. > 3. Probably controversial: coloring of node types in terminal. By default it is off. May not work on all terminals. But I find it very useful to quickly separate memory, data and control - just as in IGV - but in my terminal! > 4. After matching, we create Mach nodes from the IR nodes. For most Mach nodes, there is an associated old node. Often an assert is triggered for Mach nodes, but the graph already breaks in the IR graph. It is thus very helpful to have the old node idx displayed. > 5. I also display the head node idx of the block a Mach node is scheduled for. This can be helpful if one gets bad dominance asserts, and needs to see why the blocks were scheduled in a bad way. > > Example: > > (rr) p find_node(35)->print_bfs(2, 0, "cdmox+") > No target: perform BFS. > dis par c dump > --------------------------------------------- > 0 35 d 35 CmpP === _ 34 25 [[ 36 ]] > 1 35 d 34 LoadP === _ 31 33 [[ 35 ]] > 1 35 d 25 ConP === 0 [[ 26 27 31 35 41 ]] #NULL > 2 34 m 31 StoreP === 20 27 29 25 [[ 23 34 41 42 ]] > 2 34 d 33 AddP === _ 1 12 32 [[ 34 ]] > > > Example with Mach nodes: > > (rr) p ctrl->print_bfs(4, 0, "cdmox+OB") > No target: perform BFS. > dis [head idom d] old par c dump > --------------------------------------------- > 0 159 147 6 _ 159 c 159 Region === 159 57 [[ 159 158 59 ]] > 1 147 148 5 o183 159 c 57 IfTrue === 8 [[ 159 ]] > 2 147 148 5 o182 57 c 8 jmpConU === 147 9 [[ 7 57 ]] > 3 147 148 5 _ 8 c 147 Region === 147 14 [[ 147 8 ]] > 3 147 148 5 o180 8 d 9 compUL_rReg === _ 10 13 [[ 8 ]] > 4 148 149 4 o174 147 c 14 IfTrue === 15 [[ 147 ]] > 4 147 148 5 o203 9 d 10 decL_rReg === _ 11 [[ 12 9 ]] > 4 147 148 5 o179 9 d 13 convI2L_reg_reg === _ 28 [[ 9 ]] > > > **2. Find loop body** > When I find myself in a loop, I try to localize the loop head and end, and map out at least one path between them. 
> `loop_end->print_bfs(20, loop_head, "cox+")` > This provides us with a shortest path, given this path has a distance of at most 20. > > Example: > > (rr) p find_node(158)->print_bfs(20, find_node(160), "cox+") > Find shortest path: 158 -> 160. > > Backtrace target. > dis c dump > --------------------------------------------- > 9 c 160 OuterStripMinedLoop === 160 339 159 [[ 160 358 ]] > 8 c 358 CountedLoop === 358 160 143 [[ 358 362 363 ]] > 7 c 363 If === 358 351 [[ 364 367 ]] > 6 c 364 IfTrue === 363 [[ 128 ]] > 5 c 128 If === 364 127 [[ 129 130 ]] > 4 c 129 IfTrue === 128 [[ 155 ]] > 3 c 155 CountedLoopEnd === 129 154 [[ 157 143 ]] [lt] > 2 c 157 IfFalse === 155 [[ 162 163 ]] > 1 c 162 SafePoint === 157 1 7 1 1 163 100 1 1 13 27 133 [[ 158 ]] > 0 c 158 OuterStripMinedLoopEnd === 162 156 [[ 159 227 ]] > > Example with Mach nodes: > > (rr) p ctrl->print_bfs(10, val, "cdmox-+OB") > Find shortest path: 159 -> 27. > > Backtrace target. > dis [head idom d] old e c dump > --------------------------------------------- > 2 24 1 2 o10 + d 27 MachProj === 24 [[ 19 28 4 59 95 99 118 ]] > 1 56 159 7 o239 - d 59 loadB === 159 29 27 60 [[ 55 ]] > 0 159 147 6 _ c 159 Region === 159 57 [[ 159 158 59 ]] Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: allow traversal through constants, just like in dump ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/8468/files - new: https://git.openjdk.java.net/jdk/pull/8468/files/a3422b30..71723339 Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=8468&range=16 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=8468&range=15-16 Stats: 2 lines in 1 file changed: 0 ins; 0 del; 2 mod Patch: https://git.openjdk.java.net/jdk/pull/8468.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8468/head:pull/8468 PR: https://git.openjdk.java.net/jdk/pull/8468 From duke at openjdk.java.net Mon May 30 12:16:01 2022 From: duke at openjdk.java.net (Emanuel 
Peter) Date: Mon, 30 May 2022 12:16:01 GMT Subject: RFR: 8283775: VM support for graph querying in debugger with BFS traversal and node filtering [v18] In-Reply-To: References: Message-ID: > **Note: Refactoring and extension still in progress** > > I implemented VM support for BFS traversal of IR nodes, where one can filter for input/output edges, and the node-type (control / data / memory). If one specifies the `target` node, we find a shortest path between `this` and `target`. With the `options` string one can easily select which node types to traverse. > > `void Node::print_bfs(const uint max_distance, Node* target, const char* options)` > > While a sufficient summary and description is in the comments above the function definition, I want to explain some useful use-cases here. > > Please let me know if you would find this helpful, or if you have any feedback to improve it. > Thanks, Emanuel > > **1. Better dump()** > The function is similar to `dump()`, we can also follow input / output edges up to a certain distance. There are a few improvements/added features: > > 1. Display the distance to the specified node. This way I can quickly assess how many nodes I have at a certain distance, and I can traverse edges from one distance to the next. The parent column shows the node one step closer to the BFS root (this). > 2. Node filtering with types taken from `type->category()` help one limit traversal to control, data or memory flow. The categories are the same as in IGV. > 3. Probably controversial: coloring of node types in terminal. By default it is off. May not work on all terminals. But I find it very useful to quickly separate memory, data and control - just as in IGV - but in my terminal! > 4. After matching, we create Mach nodes from the IR nodes. For most Mach nodes, there is an associated old node. Often an assert is triggered for Mach nodes, but the graph already breaks in the IR graph. It is thus very helpful to have the old node idx displayed. > 5. 
I also display the head node idx of the block a Mach node is scheduled for. This can be helpful if one gets bad dominance asserts, and needs to see why the blocks were scheduled in a bad way. > > Example: > > (rr) p find_node(35)->print_bfs(2, 0, "cdmox+") > No target: perform BFS. > dis par c dump > --------------------------------------------- > 0 35 d 35 CmpP === _ 34 25 [[ 36 ]] > 1 35 d 34 LoadP === _ 31 33 [[ 35 ]] > 1 35 d 25 ConP === 0 [[ 26 27 31 35 41 ]] #NULL > 2 34 m 31 StoreP === 20 27 29 25 [[ 23 34 41 42 ]] > 2 34 d 33 AddP === _ 1 12 32 [[ 34 ]] > > > Example with Mach nodes: > > (rr) p ctrl->print_bfs(4, 0, "cdmox+OB") > No target: perform BFS. > dis [head idom d] old par c dump > --------------------------------------------- > 0 159 147 6 _ 159 c 159 Region === 159 57 [[ 159 158 59 ]] > 1 147 148 5 o183 159 c 57 IfTrue === 8 [[ 159 ]] > 2 147 148 5 o182 57 c 8 jmpConU === 147 9 [[ 7 57 ]] > 3 147 148 5 _ 8 c 147 Region === 147 14 [[ 147 8 ]] > 3 147 148 5 o180 8 d 9 compUL_rReg === _ 10 13 [[ 8 ]] > 4 148 149 4 o174 147 c 14 IfTrue === 15 [[ 147 ]] > 4 147 148 5 o203 9 d 10 decL_rReg === _ 11 [[ 12 9 ]] > 4 147 148 5 o179 9 d 13 convI2L_reg_reg === _ 28 [[ 9 ]] > > > **2. Find loop body** > When I find myself in a loop, I try to localize the loop head and end, and map out at least one path between them. > `loop_end->print_bfs(20, loop_head, "cox+")` > This provides us with a shortest path, given this path has a distance of at most 20. > > Example: > > (rr) p find_node(158)->print_bfs(20, find_node(160), "cox+") > Find shortest path: 158 -> 160. > > Backtrace target. 
> dis c dump > --------------------------------------------- > 9 c 160 OuterStripMinedLoop === 160 339 159 [[ 160 358 ]] > 8 c 358 CountedLoop === 358 160 143 [[ 358 362 363 ]] > 7 c 363 If === 358 351 [[ 364 367 ]] > 6 c 364 IfTrue === 363 [[ 128 ]] > 5 c 128 If === 364 127 [[ 129 130 ]] > 4 c 129 IfTrue === 128 [[ 155 ]] > 3 c 155 CountedLoopEnd === 129 154 [[ 157 143 ]] [lt] > 2 c 157 IfFalse === 155 [[ 162 163 ]] > 1 c 162 SafePoint === 157 1 7 1 1 163 100 1 1 13 27 133 [[ 158 ]] > 0 c 158 OuterStripMinedLoopEnd === 162 156 [[ 159 227 ]] > > Example with Mach nodes: > > (rr) p ctrl->print_bfs(10, val, "cdmox-+OB") > Find shortest path: 159 -> 27. > > Backtrace target. > dis [head idom d] old e c dump > --------------------------------------------- > 2 24 1 2 o10 + d 27 MachProj === 24 [[ 19 28 4 59 95 99 118 ]] > 1 56 159 7 o239 - d 59 loadB === 159 29 27 60 [[ 55 ]] > 0 159 147 6 _ c 159 Region === 159 57 [[ 159 158 59 ]] Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: rename to dump_bfs and remove unecessary prints ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/8468/files - new: https://git.openjdk.java.net/jdk/pull/8468/files/71723339..275e6910 Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=8468&range=17 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=8468&range=16-17 Stats: 18 lines in 2 files changed: 0 ins; 10 del; 8 mod Patch: https://git.openjdk.java.net/jdk/pull/8468.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8468/head:pull/8468 PR: https://git.openjdk.java.net/jdk/pull/8468 From duke at openjdk.java.net Mon May 30 12:42:10 2022 From: duke at openjdk.java.net (Emanuel Peter) Date: Mon, 30 May 2022 12:42:10 GMT Subject: RFR: 8283775: VM support for graph querying in debugger with BFS traversal and node filtering [v19] In-Reply-To: References: Message-ID: 
<1YaTbQd96w-rdk_oPDD3iKQEjBBAaZ6w29jIxEs8VMA=.73180a34-a63a-4ea8-917b-c39347598dca@github.com> Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: sort displayed nodes by idx with options character S ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/8468/files - new: https://git.openjdk.java.net/jdk/pull/8468/files/275e6910..90f325ba Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=8468&range=18 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=8468&range=17-18 Stats: 13 lines in 1 file changed: 10 ins; 0 del; 3 mod Patch: https://git.openjdk.java.net/jdk/pull/8468.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8468/head:pull/8468 PR: https://git.openjdk.java.net/jdk/pull/8468 From chagedorn at openjdk.java.net Mon May 30 12:46:42 2022 From: chagedorn at openjdk.java.net (Christian Hagedorn) Date: Mon, 30 May 2022 12:46:42 GMT Subject: RFR: 8285965: TestScenarios.java does not check for "" correctly In-Reply-To: References: Message-ID: On Wed, 11 May 2022 06:13:12 GMT, Christian Hagedorn wrote: > This is another rare occurrence of
`` that is not handled correctly by `TestScenarios.java`. > > We wrongly search for this safepoint message in the test VM output with `getTestVMOutput()`: > > https://github.com/openjdk/jdk/blob/9c2548414c71b4caaad6ad9e1b122f474e705300/test/hotspot/jtreg/testlibrary_tests/ir_framework/tests/Utils.java#L44-L53 > > But this does not help since the IR matcher is parsing the `hotspot_pid` file for IR matching and not the test VM output. We could therefore find this safepoint message in the `hotspot_pid` file and bail out of IR matching while the test VM output does not contain it. This causes `TestScenarios.java` to fail. > > The fix we did for other IR framework tests is to redirect the output of the JTreg test VM itself to a stream in order to search it for ``. We are dumping this message as part of a warning when the IR matcher bails out: > > https://github.com/openjdk/jdk/blob/9c2548414c71b4caaad6ad9e1b122f474e705300/test/hotspot/jtreg/compiler/lib/ir_framework/driver/irmatching/IRMatcher.java#L86-L96 > > Output for the reported failure: > > Scenario #3 - [-XX:TLABRefillWasteFraction=53]: > [...] > Found , bail out of IR matching > > > I suggest using the same fix for `TestScenarios`. > > Thanks, > Christian May I get a second review for this? ------------- PR: https://git.openjdk.java.net/jdk/pull/8647 From roland at openjdk.java.net Mon May 30 13:59:12 2022 From: roland at openjdk.java.net (Roland Westrelin) Date: Mon, 30 May 2022 13:59:12 GMT Subject: RFR: 8286451: C2: assert(nb == 1) failed: only when the head is not shared Message-ID: nb counts the number of loops that share a single head. The assert that fires is in code that handles the case of a self loop (a loop composed of a single block). There can be a self loop and multiple loops that share a head: the assert makes little sense and I propose to simply remove it.
I think there's another issue with this code: in the case of a self loop and multiple loops that share a head, the self loop can be any of the loops for which the head is cloned, not only the one that's passed as argument to ciTypeFlow::clone_loop_head(). As a consequence, I moved the logic for self loops into the loop that's applied to all loops that share the loop head. ------------- Commit messages: - fix & test Changes: https://git.openjdk.java.net/jdk/pull/8947/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=8947&range=00 Issue: https://bugs.openjdk.java.net/browse/JDK-8286451 Stats: 79 lines in 2 files changed: 65 ins; 14 del; 0 mod Patch: https://git.openjdk.java.net/jdk/pull/8947.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8947/head:pull/8947 PR: https://git.openjdk.java.net/jdk/pull/8947 From roland at openjdk.java.net Mon May 30 14:38:42 2022 From: roland at openjdk.java.net (Roland Westrelin) Date: Mon, 30 May 2022 14:38:42 GMT Subject: RFR: 8286197: C2: Optimize MemorySegment shape in int loop [v2] In-Reply-To: References: Message-ID: <0sfbxQ23ac25eth57KgqJcP7cckqKXiwvS-ubIYRDc0=.9591b5c7-73f1-4090-b251-e7439d2faed6@github.com> On Fri, 6 May 2022 09:18:27 GMT, Roland Westrelin wrote: >> This is another small enhancement for a code shape that showed up in a >> MemorySegment micro benchmark. The shape to optimize is the one from test1: >> >> >> for (int i = 0; i < size; i++) { >> long j = i * UNSAFE.ARRAY_INT_INDEX_SCALE; >> >> j = Objects.checkIndex(j, size * 4); >> >> if (((base + j) & 3) != 0) { >> throw new RuntimeException(); >> } >> >> v += UNSAFE.getInt(base + j); >> } >> >> >> In that code shape, the loop iv is first scaled, the result is then cast >> to long, range checked, and finally the address of the memory location is >> computed. >> >> The alignment check is transformed so the loop body has no check. In >> order to eliminate the range check, that loop is transformed into: >> >> >> for (int i1 = ..) { >> for (int i2 = ..)
{ >> long j = (i1 + i2) * UNSAFE.ARRAY_INT_INDEX_SCALE; >> >> j = Objects.checkIndex(j, size * 4); >> >> v += UNSAFE.getInt(base + j); >> } >> } >> >> >> The address shape is (AddP base (CastLL (ConvI2L (LShiftI (AddI ... >> >> In this case, the type of the ConvI2L is [min_jint, max_jint] and type >> of CastLL is [0, max_jint] (the CastLL has a narrower type). >> >> I propose transforming (CastLL (ConvI2L into (ConvI2L (CastII in that >> case. The ConvI2L and CastII types can be set to [0, max_jint]. The >> new address shape is then: >> >> (AddP base (ConvI2L (CastII (LShiftI (AddI ... >> >> which optimizes well. >> >> (LShiftI (AddI ... >> is transformed into >> (AddI (LShiftI ... >> because one of the AddI inputs is loop invariant (i2) and we have: >> >> (AddP base (ConvI2L (CastII (AddI (LShiftI ... >> >> Then because the ConvI2L and CastII types are [0, max_jint], the AddI >> is pushed through the ConvI2L and CastII: >> >> (AddP base (AddL (ConvI2L (CastII (LShiftI ... >> >> base and one of the inputs of the AddL are loop invariant so this is >> transformed into: >> >> (AddP (AddP ...) (ConvI2L (CastII (LShiftI ... >> >> The (AddP ...) is loop invariant so computed before entry. The >> (ConvI2L ...) only depends on the loop iv. >> >> The resulting address is a shift + an add. The address before >> transformation requires 2 adds + a shift. Also after unrolling, the >> address of the second access in the loop is cheaper to compute as it >> can be derived from the address of the first access. >> >> For all of this to work: >> 1) I added a CastLL::Ideal transformation: >> (CastLL (ConvI2L into (ConvI2L (CastII >> >> 2) I also had to prevent split-if from transforming (LShiftI (Phi for the >> iv Phi of a counted loop. >> >> >> test2 and test3 test 1) and 2) separately. > > Roland Westrelin has updated the pull request incrementally with one additional commit since the last revision: > > review Anyone else for this one?
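
A minimal Java sketch of the two arithmetic facts the reshaping quoted above appears to lean on; it is not the C2 implementation. The class and method names are hypothetical, and the node names in the comments only indicate which ideal-graph shape each expression loosely corresponds to. The first pair shows that a left shift distributes over an int add unconditionally; the second pair shows that an add can only be moved across the int-to-long widening when the int add cannot overflow, which is presumably why the [0, max_jint] types matter.

```java
// Hypothetical illustration only; none of these names come from HotSpot.
final class AddressShapeSketch {
    // Roughly (LShiftI (AddI i1 i2) 2): add first, then scale by 4.
    static int shiftOfSum(int i1, int i2) {
        return (i1 + i2) << 2;
    }

    // Roughly (AddI (LShiftI i1 2) (LShiftI i2 2)): scale each term, then add.
    // Always equal to shiftOfSum because int arithmetic wraps modulo 2^32.
    static int sumOfShifts(int i1, int i2) {
        return (i1 << 2) + (i2 << 2);
    }

    // Roughly (ConvI2L (AddI a b)): add in int, then widen to long.
    static long widenAfterAdd(int a, int b) {
        return (long) (a + b);
    }

    // Roughly (AddL (ConvI2L a) (ConvI2L b)): widen each term, then add in long.
    // Matches widenAfterAdd only when a + b does not overflow int.
    static long addAfterWiden(int a, int b) {
        return (long) a + (long) b;
    }

    public static void main(String[] args) {
        System.out.println(shiftOfSum(1000, 24) == sumOfShifts(1000, 24));
        System.out.println(widenAfterAdd(1000, 24) == addAfterWiden(1000, 24));
        System.out.println(widenAfterAdd(Integer.MAX_VALUE, 1)
                == addAfterWiden(Integer.MAX_VALUE, 1));
    }
}
```

The last comparison is false because the int addition wraps to a negative value before widening, while the long addition does not.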
------------- PR: https://git.openjdk.java.net/jdk/pull/8555 From duke at openjdk.java.net Mon May 30 15:29:39 2022 From: duke at openjdk.java.net (Emanuel Peter) Date: Mon, 30 May 2022 15:29:39 GMT Subject: RFR: 8283775: VM support for graph querying in debugger with BFS traversal and node filtering [v20] In-Reply-To: References: Message-ID: Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: implemented all paths ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/8468/files - new: https://git.openjdk.java.net/jdk/pull/8468/files/90f325ba..65144f95 Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=8468&range=19 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=8468&range=18-19 Stats: 88 lines in 1 file changed: 68 ins; 9 del; 11 mod Patch: https://git.openjdk.java.net/jdk/pull/8468.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8468/head:pull/8468 PR: https://git.openjdk.java.net/jdk/pull/8468 From xlinzheng at openjdk.java.net Tue May 31 02:40:39 2022 From: xlinzheng at openjdk.java.net (Xiaolin Zheng) Date: Tue, 31 May 2022 02:40:39 GMT Subject: RFR: 8287425: Remove unnecessary register push for MacroAssembler::check_klass_subtype_slow_path In-Reply-To: References: Message-ID: <1MRypsSB5vZMYN27tk0vSg9HrXr-PIVICU_SXigdFsg=.0096153c-b3fb-4632-bf5d-bf287dfb65ff@github.com> On
Fri, 27 May 2022 09:10:48 GMT, Xiaolin Zheng wrote: > Hi team, > > ![AE98A8E7-9F6F-4722-B310-299A9A96A957](https://user-images.githubusercontent.com/38156692/170670906-2ce37a13-af21-4cf8-acbd-ca24528bc3a9.png) > > Some perf results show unnecessary pushes in `MacroAssembler::check_klass_subtype_slow_path()` under `UseCompressedOops`. History logs show the original code is like [1], and it gets refactored in [JDK-6813212](https://bugs.openjdk.java.net/browse/JDK-6813212), and the counterparts of the `UseCompressedOops` in the diff are at [2] and [3], and we could see the push of rax is just because `encode_heap_oop_not_null()` would kill it, so here needs a push and restore. After that, [JDK-6964458](https://bugs.openjdk.java.net/browse/JDK-6964458) (removal of perm gen) at [4] removed [3] so that there is no need to do UseCompressedOops work in `MacroAssembler::check_klass_subtype_slow_path()`; but in that patch [2] didn't get removed, so we finally come here. As a result, [2] could also be safely removed. > > I was wondering if this minor change could be sponsored? > > This enhancement is raised on behalf of Wei Kuai . > > Tested x86_64 hotspot tier1~tier4 twice, aarch64 hotspot tier1~tier4 once with another jdk tier1 once, and riscv64 hotspot tier1~tier4 once. > > Thanks, > Xiaolin > > [1] https://github.com/openjdk/jdk/blob/de67e5294982ce197f2abd051cbb1c8aa6c29499/hotspot/src/cpu/x86/vm/interp_masm_x86_64.cpp#L273-L284 > [2] https://github.com/openjdk/jdk/commit/b8dbe8d8f650124b61a4ce8b70286b5b444a3316#diff-beb6684583b0a552a99bbe4b5a21828489a6d689b32a05e1a9af8c3be9f463c3R7441-R7444 > [3] https://github.com/openjdk/jdk/commit/b8dbe8d8f650124b61a4ce8b70286b5b444a3316#diff-beb6684583b0a552a99bbe4b5a21828489a6d689b32a05e1a9af8c3be9f463c3R7466-R7477 > [4] https://github.com/openjdk/jdk/commit/5c58d27aac7b291b879a7a3ff6f39fca25619103#diff-beb6684583b0a552a99bbe4b5a21828489a6d689b32a05e1a9af8c3be9f463c3L9347-L9361 Gentle ping - maybe another review of this? 
I ask because this change affects several platforms. ------------- PR: https://git.openjdk.java.net/jdk/pull/8915 From duke at openjdk.java.net Tue May 31 03:22:46 2022 From: duke at openjdk.java.net (Yuta Sato) Date: Tue, 31 May 2022 03:22:46 GMT Subject: RFR: 8286990: Add compiler name to warning messages in Compiler Directive [v3] In-Reply-To: References: Message-ID: > When using a compiler directive such as `java -XX:+UnlockDiagnosticVMOptions -XX:CompilerDirectivesFile= `, > the VM shows exactly the same message for the c1 and c2 compilers, so the user cannot tell > which compiler is affected by this message. > The messages should include the compiler name so that the user knows which compiler issued them. > > With my change, the output would look like this: > > > OpenJDK 64-Bit Server VM warning: printing of assembly code is enabled; turning on DebugNonSafepoints to gain additional output > OpenJDK 64-Bit Server VM warning: printing of assembly code is enabled; turning on DebugNonSafepoints to gain additional output > > -> > > OpenJDK 64-Bit Server VM warning: c1: printing of assembly code is enabled; turning on DebugNonSafepoints to gain additional output > OpenJDK 64-Bit Server VM warning: c2: printing of assembly code is enabled; turning on DebugNonSafepoints to gain additional output Yuta Sato has updated the pull request incrementally with one additional commit since the last revision: add const to method ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/8591/files - new: https://git.openjdk.java.net/jdk/pull/8591/files/64256f06..d1669100 Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=8591&range=02 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=8591&range=01-02 Stats: 4 lines in 2 files changed: 0 ins; 0 del; 4 mod Patch: https://git.openjdk.java.net/jdk/pull/8591.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8591/head:pull/8591 PR: https://git.openjdk.java.net/jdk/pull/8591 From duke at openjdk.java.net
Tue May 31 03:27:41 2022 From: duke at openjdk.java.net (Yuta Sato) Date: Tue, 31 May 2022 03:27:41 GMT Subject: RFR: 8286990: Add compiler name to warning messages in Compiler Directive [v2] In-Reply-To: References: Message-ID: On Sun, 29 May 2022 00:03:42 GMT, Cesar Soares wrote: >> Yuta Sato has updated the pull request incrementally with one additional commit since the last revision: >> >> Update full name > > src/hotspot/share/compiler/compilerDirectives.hpp line 131: > >> 129: static ccstrlist canonicalize_control_intrinsic(ccstrlist option_value); >> 130: void finalize(outputStream* st); >> 131: bool is_c1(CompilerDirectives* directive); > > NIT: these methods can be made "const". @JohnTortugo Thank you for your advice. I added "const" to these methods. ------------- PR: https://git.openjdk.java.net/jdk/pull/8591 From rcastanedalo at openjdk.java.net Tue May 31 07:04:26 2022 From: rcastanedalo at openjdk.java.net (Roberto Castañeda Lozano) Date: Tue, 31 May 2022 07:04:26 GMT Subject: Integrated: 8285558: IGV: scheduling crashes on control-unreachable CFG nodes In-Reply-To: References: Message-ID: On Fri, 27 May 2022 10:08:22 GMT, Roberto Castañeda Lozano wrote: > This changeset relaxes IGV's schedule approximation algorithm to handle CFG nodes that are not reachable from the root node via a control path. This is done by 1) removing the assumption that, after `ServerCompilerScheduler::buildBlocks()`, `Node::block` is non-null for CFG nodes; and 2) leaving the assignment of blockless nodes to an artificial "no block" to `InputGraph::ensureNodesInBlocks()`, which is always called after running the schedule approximation algorithm.
> > Additionally, the changeset marks unreachable CFG nodes with a warning, making it easier to identify ill-formed graphs: > > ![screenshot of IGV graph where some nodes are marked with a warning](https://user-images.githubusercontent.com/8792647/170678998-bffc293f-90ce-45e7-9aea-7cc5ab184026.png) > > #### Testing > > - Tested manually on the two graphs reported in the [JBS issue](https://bugs.openjdk.java.net/browse/JDK-8285558). > > - Tested automatically that scheduling tens of thousands of graphs (by instrumenting IGV to schedule parsed graphs eagerly and running `java -Xcomp -XX:-TieredCompilation -XX:PrintIdealGraphLevel=4`) does not introduce any exception or assertion failure. This pull request has now been integrated. Changeset: 8f59422d Author: Roberto Castañeda Lozano URL: https://git.openjdk.java.net/jdk/commit/8f59422d357a00a2270a8f421966977e3979c2fb Stats: 24 lines in 1 file changed: 10 ins; 14 del; 0 mod 8285558: IGV: scheduling crashes on control-unreachable CFG nodes Reviewed-by: kvn, chagedorn ------------- PR: https://git.openjdk.java.net/jdk/pull/8916 From zmiao at openjdk.java.net Tue May 31 07:47:24 2022 From: zmiao at openjdk.java.net (Zhuojun Miao) Date: Tue, 31 May 2022 07:47:24 GMT Subject: RFR: JDK-8287349: AArch64: Merge LDR instructions to improve C1 OSR performance [v2] In-Reply-To: References: Message-ID: > Since MacroAssembler added merge_ldst, we can use different > destination registers for contiguous-memory LDR instructions to improve performance. Zhuojun Miao has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase.
The pull request contains three additional commits since the last revision: - Merge branch 'master' into JDK-8287349 - use ldp explicitly - JDK-8287349: Merge LDR instructions to improve C1 OSR performance ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/8933/files - new: https://git.openjdk.java.net/jdk/pull/8933/files/b4b8a942..a024e1e4 Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=8933&range=01 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=8933&range=00-01 Stats: 3230 lines in 102 files changed: 1813 ins; 683 del; 734 mod Patch: https://git.openjdk.java.net/jdk/pull/8933.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8933/head:pull/8933 PR: https://git.openjdk.java.net/jdk/pull/8933 From roland at openjdk.java.net Tue May 31 07:48:53 2022 From: roland at openjdk.java.net (Roland Westrelin) Date: Tue, 31 May 2022 07:48:53 GMT Subject: RFR: 8283466: C2: missing skeleton predicates in peeled loop In-Reply-To: <_DPfzm_6zsaUDuRyXXHK_rZYVTMYmJ_JMcUpoBfb6kA=.13ac6f8b-5aba-42b9-805e-174c83e45816@github.com> References: <_DPfzm_6zsaUDuRyXXHK_rZYVTMYmJ_JMcUpoBfb6kA=.13ac6f8b-5aba-42b9-805e-174c83e45816@github.com> Message-ID: On Thu, 19 May 2022 08:56:22 GMT, Emanuel Peter wrote: > Implemented initializing skeleton predicates for the peeled loop. > > We have some predicates / checks before loops that are dependent on the range of the loop, they are checked at runtime. When we split a loop (eg. peeling, pre/main/post, unswitching) one of the sub-loops may get impossible data types and remove the data flow. For static type analysis for the control flow, we need so called skeleton predicates that implement these loop checks before each split off loop. If we do not do this static analysis we generally get `bad graph` asserts, as only removing data flow and not control flow leads to broken graphs. > > This was already implemented for pre/main/post loops and loop unswitching, but not for peeling. > > Ran large test suite. 
> Manual inspection shows that the instantiated skeleton predicate indeed collapses, in the provided regression test. > > Rerunning some tests now... Looks good to me. ------------- Marked as reviewed by roland (Reviewer). PR: https://git.openjdk.java.net/jdk/pull/8783 From aph at openjdk.java.net Tue May 31 07:55:29 2022 From: aph at openjdk.java.net (Andrew Haley) Date: Tue, 31 May 2022 07:55:29 GMT Subject: RFR: JDK-8287349: AArch64: Merge LDR instructions to improve C1 OSR performance [v2] In-Reply-To: References: Message-ID: On Tue, 31 May 2022 07:47:24 GMT, Zhuojun Miao wrote: >> Since MacroAssembler added merge_ldst, we can use different >> destination registers for contiguous-memory LDR instructions to improve performance. > > Zhuojun Miao has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains three additional commits since the last revision: > > - Merge branch 'master' into JDK-8287349 > - use ldp explicitly > - JDK-8287349: Merge LDR instructions to improve C1 OSR performance Looks fine. ------------- Marked as reviewed by aph (Reviewer). PR: https://git.openjdk.java.net/jdk/pull/8933 From ngasson at openjdk.java.net Tue May 31 07:55:29 2022 From: ngasson at openjdk.java.net (Nick Gasson) Date: Tue, 31 May 2022 07:55:29 GMT Subject: RFR: JDK-8287349: AArch64: Merge LDR instructions to improve C1 OSR performance [v2] In-Reply-To: References: Message-ID: On Tue, 31 May 2022 07:47:24 GMT, Zhuojun Miao wrote: >> Since MacroAssembler added merge_ldst, we can use different >> destination registers for contiguous-memory LDR instructions to improve performance. > > Zhuojun Miao has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. 
The pull request contains three additional commits since the last revision: > > - Merge branch 'master' into JDK-8287349 > - use ldp explicitly > - JDK-8287349: Merge LDR instructions to improve C1 OSR performance Marked as reviewed by ngasson (Reviewer). ------------- PR: https://git.openjdk.java.net/jdk/pull/8933 From zmiao at openjdk.java.net Tue May 31 07:55:30 2022 From: zmiao at openjdk.java.net (Zhuojun Miao) Date: Tue, 31 May 2022 07:55:30 GMT Subject: RFR: JDK-8287349: AArch64: Merge LDR instructions to improve C1 OSR performance [v2] In-Reply-To: References: Message-ID: <3j5jy8pQCydGZrwlHH-banRg-RlWvS1LohLdGglYpFM=.a82575e9-890e-41d9-9650-df9933133291@github.com> On Sat, 28 May 2022 06:41:33 GMT, Andrew Haley wrote: >> Zhuojun Miao has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains three additional commits since the last revision: >> >> - Merge branch 'master' into JDK-8287349 >> - use ldp explicitly >> - JDK-8287349: Merge LDR instructions to improve C1 OSR performance > > src/hotspot/cpu/aarch64/c1_LIRAssembler_aarch64.cpp line 288: > >> 286: __ ldr(r20, Address(OSR_buf, slot_offset + 1*BytesPerWord)); >> 287: __ str(r19, frame_map()->address_for_monitor_lock(i)); >> 288: __ str(r20, frame_map()->address_for_monitor_object(i)); > > I think it would be better to use `ldp` explicitly here, rather than relying on the assembler to do the merge. OK. 
------------- PR: https://git.openjdk.java.net/jdk/pull/8933 From chagedorn at openjdk.java.net Tue May 31 08:21:42 2022 From: chagedorn at openjdk.java.net (Christian Hagedorn) Date: Tue, 31 May 2022 08:21:42 GMT Subject: RFR: 8283466: C2: missing skeleton predicates in peeled loop In-Reply-To: <_DPfzm_6zsaUDuRyXXHK_rZYVTMYmJ_JMcUpoBfb6kA=.13ac6f8b-5aba-42b9-805e-174c83e45816@github.com> References: <_DPfzm_6zsaUDuRyXXHK_rZYVTMYmJ_JMcUpoBfb6kA=.13ac6f8b-5aba-42b9-805e-174c83e45816@github.com> Message-ID: On Thu, 19 May 2022 08:56:22 GMT, Emanuel Peter wrote: > Implemented initializing skeleton predicates for the peeled loop. > > We have some predicates / checks before loops that are dependent on the range of the loop, they are checked at runtime. When we split a loop (eg. peeling, pre/main/post, unswitching) one of the sub-loops may get impossible data types and remove the data flow. For static type analysis for the control flow, we need so called skeleton predicates that implement these loop checks before each split off loop. If we do not do this static analysis we generally get `bad graph` asserts, as only removing data flow and not control flow leads to broken graphs. > > This was already implemented for pre/main/post loops and loop unswitching, but not for peeling. > > Ran large test suite. > Manual inspection shows that the instantiated skeleton predicate indeed collapses, in the provided regression test. > > Rerunning some tests now... Looks good to me, too! ------------- Marked as reviewed by chagedorn (Reviewer). 
PR: https://git.openjdk.java.net/jdk/pull/8783

From fgao at openjdk.java.net Tue May 31 08:33:50 2022
From: fgao at openjdk.java.net (Fei Gao)
Date: Tue, 31 May 2022 08:33:50 GMT
Subject: RFR: 8282470: Eliminate useless sign extension before some subword integer operations [v3]
In-Reply-To:
References:
Message-ID:

On Thu, 26 May 2022 06:18:33 GMT, Fei Gao wrote:

>> Some loop cases of subword types, including byte and short, can't be vectorized by C2's SLP. Here is an example:
>>
>> short[] addShort(short[] a, short[] b, short[] c) {
>>   for (int i = 0; i < SIZE; i++) {
>>     b[i] = (short) (a[i] + 8); // line A
>>     sres[i] = (short) (b[i] + c[i]); // line B
>>   }
>> }
>>
>> However, similar cases of int/float/double/long/char type can be vectorized successfully.
>>
>> The reason why SLP can't vectorize the short case above is that, as illustrated here[1], the result of the scalar add operation on *line A* has been promoted to int type. It needs to be narrowed to short type first before it can work as one of the source operands of the addition on *line B*. The demotion is done by left-shifting 16 bits then right-shifting 16 bits. The ideal graph for the process is shown below.
>> ![image](https://user-images.githubusercontent.com/39403138/160074255-c751f84b-6511-4b56-927b-53fb512cf51b.png)
>>
>> In SLP, for most short-type cases, we can determine the precise type of the scalar int-type operation and finally execute it with short-type vector operations[2], except rshift opcode and abs in some situations[3]. But in this case, the source operand of RShiftI is from LShiftI rather than from any LoadS[4], so we can't determine its real type and conservatively assign it with int type rather than real short type. The int-type operation RShiftI here can't be vectorized together with other short-type operations, like AddI(line B). The reason for byte loop cases is the same.
Similar loop cases of char type could be vectorized because its demotion from int to char is done by `and` with mask rather than `lshift_rshift`. >> >> Therefore, we try to remove the patterns like `RShiftI _ (LShiftI _ valIn1 conIL ) conIR` in the byte/short cases, to vectorize more scenarios. Optimizing it in the mid-end by i-GVN is more reasonable. >> >> What we do in the mid-end is eliminating the sign extension before some subword integer operations like: >> >> >> int x, y; >> short s = (short) (((x << Imm) >> Imm) OP y); // Imm <= 16 >> >> to >> >> short s = (short) (x OP y); >> >> >> In the patch, assuming that `x` can be any int number, we need guarantee that the optimization doesn't have any impact on result. Not all arithmetic logic OPs meet the requirements. For example, assuming that `Imm` equals `16`, `x` equals `131068`, >> `y` equals `50` and `OP` is division`/`, `short s = (short) (((131068 << 16) >> 16) / 50)` is not equal to `short s = (short) (131068 / 50)`. When OP is division, we may get different result with or without demotion before OP, because the upper 16 bits of division may have influence on the lower 16 bits of result, which can't be optimized. All optimizable opcodes are listed in StoreNode::no_need_sign_extension(), whose upper 16 bits of src operands don't influence the lower 16 bits of result for short >> type and upper 24 bits of src operand don't influence the lower 8 bits of dst operand for byte. >> >> After the patch, the short loop case above can be vectorized as: >> >> movi v18.8h, #0x8 >> ... 
>> ldr q16, [x14, #32] // vector load a[i] >> // vector add, a[i] + 8, no promotion or demotion >> add v17.8h, v16.8h, v18.8h >> str q17, [x6, #32] // vector store a[i] + 8, b[i] >> ldr q17, [x0, #32] // vector load c[i] >> // vector add, a[i] + c[i], no promotion or demotion >> add v16.8h, v17.8h, v16.8h >> // vector add, a[i] + c[i] + 8, no promotion or demotion >> add v16.8h, v16.8h, v18.8h >> str q16, [x11, #32] //vector store sres[i] >> ... >> >> >> The patch works for byte cases as well. >> >> Here is the performance data for micro-benchmark before and after this patch on both AArch64 and x64 machines. We can observe about ~83% improvement with this patch. >> >> on AArch64: >> Before the patch: >> Benchmark (length) Mode Cnt Score Error Units >> addB 523 avgt 15 401.521 ? 0.033 ns/op >> addS 523 avgt 15 401.512 ? 0.021 ns/op >> >> After the patch: >> Benchmark (length) Mode Cnt Score Error Units >> addB 523 avgt 15 68.444 ? 0.318 ns/op >> addS 523 avgt 15 69.847 ? 0.043 ns/op >> >> on x86: >> Before the patch: >> Benchmark (length) Mode Cnt Score Error Units >> addB 523 avgt 15 454.102 ? 36.180 ns/op >> addS 523 avgt 15 432.245 ? 22.640 ns/op >> >> After the patch: >> Benchmark (length) Mode Cnt Score Error Units >> addB 523 avgt 15 75.812 ? 5.063 ns/op >> addS 523 avgt 15 72.839 ? 10.109 ns/op >> >> [1]: https://github.com/openjdk/jdk/blob/6013d09e82693a1c07cf0bf6daffd95114b3cbfa/src/hotspot/share/opto/superword.cpp#L3241 >> [2]: https://github.com/openjdk/jdk/blob/6013d09e82693a1c07cf0bf6daffd95114b3cbfa/src/hotspot/share/opto/superword.cpp#L3206 >> [3]: https://github.com/openjdk/jdk/blob/6013d09e82693a1c07cf0bf6daffd95114b3cbfa/src/hotspot/share/opto/superword.cpp#L3249 >> [4]: https://github.com/openjdk/jdk/blob/6013d09e82693a1c07cf0bf6daffd95114b3cbfa/src/hotspot/share/opto/superword.cpp#L3251 > > Fei Gao has updated the pull request with a new target base due to a merge or a rebase. 
The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains three additional commits since the last revision:
>
> - Merge branch 'master' into fg8282470
>
>   Change-Id: I180f1c85bd407b3d7e05937450c5fc0f81e6d70b
> - Merge branch 'master' into fg8282470
>
>   Change-Id: I877ba1e9a82c0dbef04df08070223c02400eeec7
> - 8282470: Eliminate useless sign extension before some subword integer operations
>
>   Some loop cases of subword types, including byte and
>   short, can't be vectorized by C2's SLP. Here is an example:
>   ```
>   short[] addShort(short[] a, short[] b, short[] c) {
>     for (int i = 0; i < SIZE; i++) {
>       b[i] = (short) (a[i] + 8); // *line A*
>       sres[i] = (short) (b[i] + c[i]); // *line B*
>     }
>   }
>   ```
>   However, similar cases of int/float/double/long/char type can
>   be vectorized successfully.
>
>   The reason why SLP can't vectorize the short case above is
>   that, as illustrated here[1], the result of the scalar add
>   operation on *line A* has been promoted to int type. It needs
>   to be narrowed to short type first before it can work as one
>   of the source operands of addition on *line B*. The demotion is
>   done by left-shifting 16 bits then right-shifting 16 bits.
>   The ideal graph for the process is shown below.
>
>     LoadS a[i]   8
>           \     /
>         AddI (line A)
>          /        \
>   StoreC b[i]   Lshift 16bits
>                     \
>                  RShiftI 16 bits   LoadS c[i]
>                         \           /
>                         AddI (line B)
>                              \
>                          StoreC sres[i]
>
>   In SLP, for most short-type cases, we can determine the precise
>   type of the scalar int-type operation and finally execute it
>   with short-type vector operations[2], except rshift opcode and
>   abs in some situations[3]. But in this case, the source operand
>   of RShiftI is from LShiftI rather than from any LoadS[4], so we
>   can't determine its real type and conservatively assign it with
>   int type rather than real short type. The int-type operation
>   RShiftI here can't be vectorized together with other short-type
>   operations, like AddI(line B).
The reason for byte loop cases > is the same. Similar loop cases of char type could be > vectorized because its demotion from int to char is done by > `and` with mask rather than `lshift_rshift`. > > Therefore, we try to remove the patterns like > `RShiftI _ (LShiftI _ valIn1 conIL ) conIR` in the byte/short > cases, to vectorize more scenarios. Optimizing it in the > mid-end by i-GVN is more reasonable. > > What we do in the mid-end is eliminating the sign extension > before some subword integer operations like: > > ``` > int x, y; > short s = (short) (((x << Imm) >> Imm) OP y); // Imm <= 16 > ``` > to > ``` > short s = (short) (x OP y); > ``` > > In the patch, assuming that `x` can be any int number, we need > guarantee that the optimization doesn't have any impact on > result. Not all arithmetic logic OPs meet the requirements. For > example, assuming that `Imm` equals `16`, `x` equals `131068`, > `y` equals `50` and `OP` is division`/`, > `short s = (short) (((131068 << 16) >> 16) / 50)` is not > equal to `short s = (short) (131068 / 50)`. When OP is division, > we may get different result with or without demotion > before OP, because the upper 16 bits of division may have > influence on the lower 16 bits of result, which can't be > optimized. All optimizable opcodes are listed in > StoreNode::no_need_sign_extension(), whose upper 16 bits of src > operands don't influence the lower 16 bits of result for short > type and upper 24 bits of src operand don't influence the lower > 8 bits of dst operand for byte. > > After the patch, the short loop case above can be vectorized as: > ``` > movi v18.8h, #0x8 > ... 
> ldr q16, [x14, #32] // vector load a[i] > // vector add, a[i] + 8, no promotion or demotion > add v17.8h, v16.8h, v18.8h > str q17, [x6, #32] // vector store a[i] + 8, b[i] > ldr q17, [x0, #32] // vector load c[i] > // vector add, a[i] + c[i], no promotion or demotion > add v16.8h, v17.8h, v16.8h > // vector add, a[i] + c[i] + 8, no promotion or demotion > add v16.8h, v16.8h, v18.8h > str q16, [x11, #32] //vector store sres[i] > ... > ``` > > The patch works for byte cases as well. > > Here is the performance data for micro-benchmark before > and after this patch on both AArch64 and x64 machines. > We can observe about ~83% improvement with this patch. > > on AArch64: > Before the patch: > Benchmark (length) Mode Cnt Score Error Units > addB 523 avgt 15 401.521 ? 0.033 ns/op > addS 523 avgt 15 401.512 ? 0.021 ns/op > > After the patch: > Benchmark (length) Mode Cnt Score Error Units > addB 523 avgt 15 68.444 ? 0.318 ns/op > addS 523 avgt 15 69.847 ? 0.043 ns/op > > on x86: > Before the patch: > Benchmark (length) Mode Cnt Score Error Units > addB 523 avgt 15 454.102 ? 36.180 ns/op > addS 523 avgt 15 432.245 ? 22.640 ns/op > > After the patch: > Benchmark (length) Mode Cnt Score Error Units > addB 523 avgt 15 75.812 ? 5.063 ns/op > addS 523 avgt 15 72.839 ? 10.109 ns/op > > [1]: https://github.com/openjdk/jdk/blob/6013d09e82693a1c07cf0bf6daffd95114b3cbfa/src/hotspot/share/opto/superword.cpp#L3241 > [2]: https://github.com/openjdk/jdk/blob/6013d09e82693a1c07cf0bf6daffd95114b3cbfa/src/hotspot/share/opto/superword.cpp#L3206 > [3]: https://github.com/openjdk/jdk/blob/6013d09e82693a1c07cf0bf6daffd95114b3cbfa/src/hotspot/share/opto/superword.cpp#L3249 > [4]: https://github.com/openjdk/jdk/blob/6013d09e82693a1c07cf0bf6daffd95114b3cbfa/src/hotspot/share/opto/superword.cpp#L3251 > > Change-Id: I92ce42b550ef057964a3b58716436735275d8d31 Can I have a second review please, @DamonFool ? 
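The division counter-example in the quoted description can be checked directly. Below is a minimal Java sketch; the class and method names (`SignExtensionDemo`, `signExtend16`, and the four helpers) are hypothetical and not taken from the patch, and the shift pair only models the `LShiftI`/`RShiftI` narrowing described above. It compares the results with and without the sign extension for `x = 131068` and `y = 50`.

```java
// Hypothetical demo class; names are illustrative and not part of the patch.
final class SignExtensionDemo {
    // Models the narrowing to short as a left shift followed by an
    // arithmetic right shift by 16 bits.
    static int signExtend16(int x) {
        return (x << 16) >> 16;
    }

    static short addWithExtension(int x, int y) {
        return (short) (signExtend16(x) + y);
    }

    static short addWithoutExtension(int x, int y) {
        return (short) (x + y);
    }

    static short divWithExtension(int x, int y) {
        return (short) (signExtend16(x) / y);
    }

    static short divWithoutExtension(int x, int y) {
        return (short) (x / y);
    }

    public static void main(String[] args) {
        int x = 131068;
        int y = 50;
        // Addition: the upper 16 bits of x cannot reach the lower 16 bits of the sum.
        System.out.println("add: " + addWithExtension(x, y) + " vs " + addWithoutExtension(x, y));
        // Division: the upper 16 bits of x do influence the lower 16 bits of the quotient.
        System.out.println("div: " + divWithExtension(x, y) + " vs " + divWithoutExtension(x, y));
    }
}
```

For addition the two variants agree (both give 46), while for division they differ (0 versus 2621), which illustrates why division cannot be treated like the optimizable opcodes.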
-------------

PR: https://git.openjdk.java.net/jdk/pull/7954

From rcastanedalo at openjdk.java.net Tue May 31 08:40:58 2022
From: rcastanedalo at openjdk.java.net (Roberto Castañeda Lozano)
Date: Tue, 31 May 2022 08:40:58 GMT
Subject: Integrated: 8287438: IGV: scheduling crashes on non-block-start Region with multiple predecessors
In-Reply-To:
References:
Message-ID:

On Fri, 27 May 2022 12:28:53 GMT, Roberto Castañeda Lozano wrote:

> IGV scheduling crashes when breaking critical edges that target Region nodes not marked with the `is_block_start` property, by failing to create an appropriate basic block between the source and the destination of the critical edge (see the JBS bug report for more detail). This changeset ensures that such a basic block is created even when the destination node is not marked with `is_block_start`.
>
> #### Testing
>
> - Tested manually on the [graph](https://bugs.openjdk.java.net/secure/attachment/99125/failure.zip) reported in the JBS issue.
>
> - Tested automatically that scheduling tens of thousands of graphs (by instrumenting IGV to schedule parsed graphs eagerly and running `java -Xcomp -XX:-TieredCompilation -XX:PrintIdealGraphLevel=4`) does not trigger any exception or assertion failure.

This pull request has now been integrated.

Changeset: 6e55a72f
Author: Roberto Castañeda Lozano
URL: https://git.openjdk.java.net/jdk/commit/6e55a72f25f7273e3a8a19e0b9a97669b84808e9
Stats: 8 lines in 1 file changed: 6 ins; 0 del; 2 mod

8287438: IGV: scheduling crashes on non-block-start Region with multiple predecessors

Reviewed-by: kvn, chagedorn

-------------

PR: https://git.openjdk.java.net/jdk/pull/8921

From fgao at openjdk.java.net Tue May 31 09:10:41 2022
From: fgao at openjdk.java.net (Fei Gao)
Date: Tue, 31 May 2022 09:10:41 GMT
Subject: RFR: 8286972: Support the new loop induction variable related PopulateIndex IR node on x86 [v5]
In-Reply-To:
References:
Message-ID:

On Fri, 20 May 2022 05:09:41 GMT, Sandhya Viswanathan wrote:

>> This PR adds x86 backend support for the new loop induction variable related PopulateIndex IR node.
>> This IR node was added as part of [JDK-8280510](https://bugs.openjdk.java.net/browse/JDK-8280510).
>>
>> The performance numbers are as follows:
>> Before:
>> Benchmark                   (count)   Mode  Cnt       Score       Error  Units
>> IndexVector.exprWithIndex1    65536  thrpt    3   64556.552 ±  1126.396  ops/s
>> IndexVector.exprWithIndex2    65536  thrpt    3   22117.050 ± 11452.098  ops/s
>> IndexVector.indexArrayFill    65536  thrpt    3  117776.383 ±  1120.957  ops/s
>>
>> After:
>> Benchmark                   (count)   Mode  Cnt       Score       Error  Units
>> IndexVector.exprWithIndex1    65536  thrpt    3  203180.290 ±  2147.807  ops/s
>> IndexVector.exprWithIndex2    65536  thrpt    3  274132.756 ±  6853.393  ops/s
>> IndexVector.indexArrayFill    65536  thrpt    3  374165.202 ± 46930.779  ops/s
>>
>> Please review.
>>
>> Best Regards,
>> Sandhya
>
> Sandhya Viswanathan has updated the pull request incrementally with one additional commit since the last revision:
>
>   review comment resolution

src/hotspot/cpu/x86/x86.ad line 8270:

> 8268:   ins_encode %{
> 8269:     assert($src2$$constant == 1, "required");
> 8270:     int vlen = Matcher::vector_length(this);

May I ask why use `Matcher::vector_length()` here, rather than `Matcher::vector_length_in_bytes()`, for `load_iota_indices()`?
Thanks. ------------- PR: https://git.openjdk.java.net/jdk/pull/8778 From pli at openjdk.java.net Tue May 31 09:17:49 2022 From: pli at openjdk.java.net (Pengfei Li) Date: Tue, 31 May 2022 09:17:49 GMT Subject: RFR: 8283307: Vectorize unsigned shift right on signed subword types [v3] In-Reply-To: References: Message-ID: On Fri, 22 Apr 2022 11:09:09 GMT, Fei Gao wrote: >> public short[] vectorUnsignedShiftRight(short[] shorts) { >> short[] res = new short[SIZE]; >> for (int i = 0; i < SIZE; i++) { >> res[i] = (short) (shorts[i] >>> 3); >> } >> return res; >> } >> >> In C2's SLP, vectorization of unsigned shift right on signed subword types (byte/short) like the case above is intentionally disabled[1]. Because the vector unsigned shift on signed subword types behaves differently from the Java spec. It's worthy to vectorize more cases in quite low cost. Also, unsigned shift right on signed subword is not uncommon and we may find similar cases in Lucene benchmark[2]. >> >> Taking unsigned right shift on short type as an example, >> ![image](https://user-images.githubusercontent.com/39403138/160313924-6bded802-c135-48db-98b8-7c5f43d8ff54.png) >> >> when the shift amount is a constant not greater than the number of sign extended bits, 16 higher bits for short type shown like >> above, the unsigned shift on signed subword types can be transformed into a signed shift and hence becomes vectorizable. Here is the transformation: >> ![image](https://user-images.githubusercontent.com/39403138/160314151-30249bfc-bdfc-4700-b4fb-97617b45184b.png) >> >> This patch does the transformation in `SuperWord::implemented()` and `SuperWord::output()`. It helps vectorize the short cases above. We can handle unsigned right shift on byte type in a similar way. The generated assembly code for one iteration on aarch64 is like: >> >> ... >> sbfiz x13, x10, #1, #32 >> add x15, x11, x13 >> ldr q16, [x15, #16] >> sshr v16.8h, v16.8h, #3 >> add x13, x17, x13 >> str q16, [x13, #16] >> ... 
>> >> >> Here is the performance data for micro-benchmark before and after this patch on both AArch64 and x64 machines. We can observe about ~80% improvement with this patch. >> >> The perf data on AArch64: >> Before the patch: >> Benchmark (SIZE) (shiftCount) Mode Cnt Score Error Units >> urShiftImmByte 1024 3 avgt 5 295.711 ? 0.117 ns/op >> urShiftImmShort 1024 3 avgt 5 284.559 ? 0.148 ns/op >> >> after the patch: >> Benchmark (SIZE) (shiftCount) Mode Cnt Score Error Units >> urShiftImmByte 1024 3 avgt 5 45.111 ? 0.047 ns/op >> urShiftImmShort 1024 3 avgt 5 55.294 ? 0.072 ns/op >> >> The perf data on X86: >> Before the patch: >> Benchmark (SIZE) (shiftCount) Mode Cnt Score Error Units >> urShiftImmByte 1024 3 avgt 5 361.374 ? 4.621 ns/op >> urShiftImmShort 1024 3 avgt 5 365.390 ? 3.595 ns/op >> >> After the patch: >> Benchmark (SIZE) (shiftCount) Mode Cnt Score Error Units >> urShiftImmByte 1024 3 avgt 5 105.489 ? 0.488 ns/op >> urShiftImmShort 1024 3 avgt 5 43.400 ? 0.394 ns/op >> >> [1] https://github.com/openjdk/jdk/blob/002e3667443d94e2303c875daf72cf1ccbbb0099/src/hotspot/share/opto/vectornode.cpp#L190 >> [2] https://github.com/jpountz/decode-128-ints-benchmark/ > > Fei Gao has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. 
The pull request contains five additional commits since the last revision: > > - Rewrite the scalar calculation to avoid inline > > Change-Id: I5959d035278097de26ab3dfe6f667d6f7476c723 > - Merge branch 'master' into fg8283307 > > Change-Id: Id3ec8594da49fb4e6c6dcad888bcb1dfc0aac303 > - Remove related comments in some test files > > Change-Id: I5dd1c156bd80221dde53737e718da0254c5381d8 > - Merge branch 'master' into fg8283307 > > Change-Id: Ic4645656ea156e8cac993995a5dc675aa46cb21a > - 8283307: Vectorize unsigned shift right on signed subword types > > ``` > public short[] vectorUnsignedShiftRight(short[] shorts) { > short[] res = new short[SIZE]; > for (int i = 0; i < SIZE; i++) { > res[i] = (short) (shorts[i] >>> 3); > } > return res; > } > ``` > In C2's SLP, vectorization of unsigned shift right on signed > subword types (byte/short) like the case above is intentionally > disabled[1]. Because the vector unsigned shift on signed > subword types behaves differently from the Java spec. It's > worthy to vectorize more cases in quite low cost. Also, > unsigned shift right on signed subword is not uncommon and we > may find similar cases in Lucene benchmark[2]. > > Taking unsigned right shift on short type as an example, > > Short: > | <- 16 bits -> | <- 16 bits -> | > | 1 1 1 ... 1 1 | data | > > when the shift amount is a constant not greater than the number > of sign extended bits, 16 higher bits for short type shown like > above, the unsigned shift on signed subword types can be > transformed into a signed shift and hence becomes vectorizable. > Here is the transformation: > > For T_SHORT (shift <= 16): > src RShiftCntV shift src RShiftCntV shift > \ / ==> \ / > URShiftVS RShiftVS > > This patch does the transformation in SuperWord::implemented() and > SuperWord::output(). It helps vectorize the short cases above. We > can handle unsigned right shift on byte type in a similar way. The > generated assembly code for one iteration on aarch64 is like: > ``` > ... 
> sbfiz x13, x10, #1, #32 > add x15, x11, x13 > ldr q16, [x15, #16] > sshr v16.8h, v16.8h, #3 > add x13, x17, x13 > str q16, [x13, #16] > ... > ``` > > Here is the performance data for micro-benchmark before and after > this patch on both AArch64 and x64 machines. We can observe about > ~80% improvement with this patch. > > The perf data on AArch64: > Before the patch: > Benchmark (SIZE) (shiftCount) Mode Cnt Score Error Units > urShiftImmByte 1024 3 avgt 5 295.711 ? 0.117 ns/op > urShiftImmShort 1024 3 avgt 5 284.559 ? 0.148 ns/op > > after the patch: > Benchmark (SIZE) (shiftCount) Mode Cnt Score Error Units > urShiftImmByte 1024 3 avgt 5 45.111 ? 0.047 ns/op > urShiftImmShort 1024 3 avgt 5 55.294 ? 0.072 ns/op > > The perf data on X86: > Before the patch: > Benchmark (SIZE) (shiftCount) Mode Cnt Score Error Units > urShiftImmByte 1024 3 avgt 5 361.374 ? 4.621 ns/op > urShiftImmShort 1024 3 avgt 5 365.390 ? 3.595 ns/op > > After the patch: > Benchmark (SIZE) (shiftCount) Mode Cnt Score Error Units > urShiftImmByte 1024 3 avgt 5 105.489 ? 0.488 ns/op > urShiftImmShort 1024 3 avgt 5 43.400 ? 0.394 ns/op > > [1] https://github.com/openjdk/jdk/blob/002e3667443d94e2303c875daf72cf1ccbbb0099/src/hotspot/share/opto/vectornode.cpp#L190 > [2] https://github.com/jpountz/decode-128-ints-benchmark/ > > Change-Id: I9bd0cfdfcd9c477e8905a4c877d5e7ff14e39161 LGTM except my concern in the IR test. test/hotspot/jtreg/compiler/c2/irTests/TestVectorizeURShiftSubword.java line 36: > 34: * @key randomness > 35: * @summary Auto-vectorization enhancement for unsigned shift right on signed subword types > 36: * @requires os.arch=="amd64" | os.arch=="x86_64" | os.arch=="aarch64" This IR test for vectorizable check looks good on AArch64. But AFAIK, some operations cannot be vectorized on old x86 CPUs with AVX=1. Could you add something like `(os.simpleArch == "x64" & vm.cpu.features ~= ".*avx2.*")` to check the CPU feature? 
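The equivalence that the quoted description relies on (for a sign-extended short, `>>>` and `>>` produce the same low 16 bits as long as the shift amount is at most 16) can be checked exhaustively in plain Java. This is only a sketch of the scalar semantics; the class and method names are hypothetical, and it says nothing about which CPUs can actually vectorize the loop.

```java
// Hypothetical demo class; it checks scalar Java semantics only.
final class UnsignedShiftDemo {
    // Returns true if (short) (s >>> shift) equals (short) (s >> shift)
    // for every possible short value s.
    static boolean shiftsAgreeForAllShorts(int shift) {
        for (int v = Short.MIN_VALUE; v <= Short.MAX_VALUE; v++) {
            short s = (short) v;
            short logical = (short) (s >>> shift);   // s is promoted to int first
            short arithmetic = (short) (s >> shift);
            if (logical != arithmetic) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Shift amounts 0..16 agree; 17 is the first amount that exposes a difference.
        for (int shift = 0; shift <= 17; shift++) {
            System.out.println("shift " + shift + ": " + shiftsAgreeForAllShorts(shift));
        }
    }
}
```

With a shift of 17, the value `-1` already separates the two forms: the unsigned shift yields `32767` after narrowing, while the signed shift yields `-1`.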
------------- PR: https://git.openjdk.java.net/jdk/pull/7979 From duke at openjdk.java.net Tue May 31 10:16:17 2022 From: duke at openjdk.java.net (Emanuel Peter) Date: Tue, 31 May 2022 10:16:17 GMT Subject: RFR: 8283775: better dump: VM support for graph querying in debugger with BFS traversal and node filtering [v21] In-Reply-To: References: Message-ID: > **What this gives you for the debugger** > - BFS traversal (inputs / outputs) > - node filtering by category > - shortest path between nodes > - all paths between nodes > - readability in terminal: alignment, sorting by node idx, distance to start, and colors (optional) > - and more > > **Some usecases** > - more readable `dump` > - follow only nodes of some categories (only control, only data, etc) > - find which control nodes depend on data node (visit data nodes, include control in boundary) > - how two nodes relate (shortest / all paths, following input/output nodes, or both) > - find loops (control / memory / data: call all paths with node as start and target) > > **Description** > I implemented VM support for BFS traversal of IR nodes, where one can filter for input/output edges, and the node-type (control / data / memory). If one specifies the `target` node, we find a shortest path between `this` and `target`. With the `options` string one can easily select which node types to visit (`cdmxo`) and which to include only in the boundary (`CDMXO`). To find all paths between two nodes, include the letter `A` in the options string. > > `void Node::dump_bfs(const int max_distance, Node* target, char const* options)` > > While a sufficient summary and description is in the comments above the function definition, I want to explain some useful use-cases here. > > Please let me know if you would find this helpful, or if you have any feedback to improve it. > Thanks, Emanuel > > **1. Better dump()** > The function is similar to `dump()`, we can also follow input / output edges up to a certain distance. 
There are a few improvements/added features: > > 1. Display the distance to the specified node. This way I can quickly assess how many nodes I have at a certain distance, and I can traverse edges from one distance to the next. > 2. Choose if you want to traverse only input `+` or output `-` edges, or both `+-`. > 3. Node filtering with types taken from `type->category()` help one limit traversal to control, data or memory flow. The categories are the same as in IGV. > 4. Separate visit / boundary filters by node type: traverse graph visiting only some node types (eg. data). On the boundary, also display but do not traverse nodes allowed by boundary filter (eg. control). This can be useful to traverse outputs of a data node recursively, and see what control nodes depend on it. Use `dcmxo` for visit filter, and `DCMXO` for boundary filter. > 5. Probably controversial: coloring of node types in terminal. By default it is off. May not work on all terminals. But I find it very useful to quickly separate memory, data and control - just as in IGV - but in my terminal! Highly recommend putting the `#` in the options string! To more easily trace chains of nodes, I highlight the node idx of all nodes that are displayed in their respective colors. > 6. After matching, we create Mach nodes from the IR nodes. For most Mach nodes, there is an associated old node. Often an assert is triggered for Mach nodes, but the graph already breaks in the IR graph. It is thus very helpful to have the old node idx displayed. Use `@` in options string. > 7. I also display the head node idx of the block a Mach node is scheduled for. This can be helpful if one gets bad dominance asserts, and needs to see why the blocks were scheduled in a bad way. Use `B` in options string. > 8. Some people like the displayed nodes to be sorted by node idx. Simply add an `S` to the option string! 
> > Example (BFS inputs): > > (rr) p find_node(161)->dump_bfs(2,0,"dcmxo+") > d dump > --------------------------------------------- > 2 159 CmpI === _ 137 40 [[ 160 ]] !orig=[144] !jvms: StringLatin1::hashCode @ bci:13 (line 193) > 2 147 IfTrue === 161 [[ 166 ]] #1 !jvms: StringLatin1::hashCode @ bci:13 (line 193) > 2 165 OuterStripMinedLoop === 165 93 164 [[ 165 166 ]] > 1 160 Bool === _ 159 [[ 161 ]] [lt] !orig=[145] !jvms: StringLatin1::hashCode @ bci:13 (line 193) > 1 166 CountedLoop === 166 165 147 [[ 166 161 102 103 ]] stride: 1 strip mined !orig=[157],[99] !jvms: StringLatin1::hashCode @ bci:16 (line 193) > 0 161 CountedLoopEnd === 166 160 [[ 162 147 ]] [lt] P=0.957374, C=19675.000000 !orig=[146] !jvms: StringLatin1::hashCode @ bci:13 (line 193) > > > Example (BFS control inputs): > > (rr) p find_node(163)->dump_bfs(5,0,"c+") > d dump > --------------------------------------------- > 5 147 IfTrue === 161 [[ 166 ]] #1 !jvms: StringLatin1::hashCode @ bci:13 (line 193) > 5 165 OuterStripMinedLoop === 165 93 164 [[ 165 166 ]] > 4 166 CountedLoop === 166 165 147 [[ 166 161 102 103 ]] stride: 1 strip mined !orig=[157],[99] !jvms: StringLatin1::hashCode @ bci:16 (line 193) > 3 161 CountedLoopEnd === 166 160 [[ 162 147 ]] [lt] P=0.957374, C=19675.000000 !orig=[146] !jvms: StringLatin1::hashCode @ bci:13 (line 193) > 2 162 IfFalse === 161 [[ 167 168 ]] #0 !orig=148 !jvms: StringLatin1::hashCode @ bci:13 (line 193) > 1 167 SafePoint === 162 1 7 1 1 168 1 136 37 40 137 1 [[ 163 ]] SafePoint !orig=138 !jvms: StringLatin1::hashCode @ bci:37 (line 193) > 0 163 OuterStripMinedLoopEnd === 167 22 [[ 164 148 ]] P=0.957374, C=19675.000000 > > We see the control flow of a strip mined loop. 
> > > Experiment (BFS only data, but display all nodes on boundary) > > (rr) p find_node(102)->dump_bfs(10,0,"dCDMOX-") > d dump > --------------------------------------------- > 0 102 Phi === 166 22 136 [[ 133 132 ]] #int !jvms: StringLatin1::hashCode @ bci:16 (line 193) > 1 133 SubI === _ 132 102 [[ 136 ]] !jvms: StringLatin1::hashCode @ bci:25 (line 194) > 1 132 LShiftI === _ 102 131 [[ 133 ]] !jvms: StringLatin1::hashCode @ bci:25 (line 194) > 2 136 AddI === _ 133 155 [[ 153 167 102 ]] !jvms: StringLatin1::hashCode @ bci:32 (line 194) > 3 153 Phi === 53 136 22 [[ 154 ]] #int !jvms: StringLatin1::hashCode @ bci:13 (line 193) > 3 167 SafePoint === 162 1 7 1 1 168 1 136 37 40 137 1 [[ 163 ]] SafePoint !orig=138 !jvms: StringLatin1::hashCode @ bci:37 (line 193) > 4 154 Return === 53 6 7 8 9 returns 153 [[ 0 ]] > > We see the dependent output nodes of the data-phi 102, we see that a SafePoint and the Return depend on it. Here colors are really helpful, as it makes it easy to separate the data-nodes (blue) from the boundary-nodes (other colors). 
>
> Example with Mach nodes:
>
> (rr) p find_node(112)->dump_bfs(2,0,"cdmxo+#@B")
> d [head idom d] old dump
> ---------------------------------------------
> 2 534 505 6 o1871 109 addI_rReg_imm === _ 44 [[ 110 102 113 230 327 ]] #-3/0xfffffffd
> 2 536 537 15 o186 139 addI_rReg_imm === _ 137 [[ 140 137 113 144 ]] #4/0x00000004 !jvms: StringLatin1::replace @ bci:13 (line 303)
> 2 537 538 14 o179 114 IfTrue === 115 [[ 536 ]] #1 !jvms: StringLatin1::replace @ bci:28 (line 304)
> 1 536 537 15 o739 113 compI_rReg === _ 139 109 [[ 112 ]]
> 1 536 537 15 _ 536 Region === 536 114 [[ 536 112 ]]
> 0 536 537 15 o741 112 jmpLoopEnd === 536 113 [[ 134 111 ]] P=0.993611, C=7200.000000 !jvms: StringLatin1::replace @ bci:19 (line 303)
>
> And the query on the old nodes:
>
> (rr) p find_old_node(741)->dump_bfs(2,0,"cdmxo+#")
> d dump
> ---------------------------------------------
> 2 o1871 AddI === _ o79 o1872 [[ o739 o1948 o761 o1477 ]]
> 2 o186 AddI === _ o1756 o1714 [[ o1756 o739 o1055 ]]
> 2 o178 If === o1159 o177 o176 [[ o179 o180 ]] P=0.800503, C=7153.000000
> 1 o739 CmpI === _ o186 o1871 [[ o740 o741 ]]
> 1 o740 Bool === _ o739 [[ o741 ]] [lt]
> 1 o179 IfTrue === o178 [[ o741 ]] #1
> 0 o741 CountedLoopEnd === o179 o740 o739 [[ o742 o190 ]] [lt] P=0.993611, C=7200.000000
>
>
> **2. Find loop body**
> When I find myself in a loop, I try to localize the loop head and end, and map out at least one path between them.
> `loop_end->dump_bfs(20, loop_head, "c+")`
> This provides us with a shortest control path, provided that path has a distance of at most 20.
> > Example (shortest path over control nodes): > > (rr) p find_node(741)->dump_bfs(20,find_node(746),"c+") > d dump > --------------------------------------------- > 5 746 CountedLoop === 746 745 190 [[ 746 148 141 ]] stride: 1 strip mined !orig=[735],[138] !jvms: StringLatin1::replace @ bci:22 (line 304) > 4 148 RangeCheck === 746 147 [[ 149 152 ]] P=0.999999, C=-1.000000 !jvms: StringLatin1::replace @ bci:25 (line 304) > 3 149 IfTrue === 148 [[ 178 166 ]] #1 !orig=170 !jvms: StringLatin1::replace @ bci:25 (line 304) > 2 178 If === 149 177 [[ 179 180 ]] P=0.800503, C=7153.000000 !jvms: StringLatin1::replace @ bci:28 (line 304) > 1 179 IfTrue === 178 [[ 741 ]] #1 !jvms: StringLatin1::replace @ bci:28 (line 304) > 0 741 CountedLoopEnd === 179 740 [[ 742 190 ]] [lt] P=0.993611, C=7200.000000 !orig=[189] !jvms: StringLatin1::replace @ bci:19 (line 303) > > > Once we see this single path in the loop, we may want to see more of the body. For this, we can run an `all paths` query, with the additional character `A` in the options string. We see all nodes that lay on a path between the start and target node, with at most the specified path length. 
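The shortest-path query shown above can be pictured as an ordinary breadth-first search that records a parent for every node it discovers and stops expanding at the maximum distance. The following Java sketch is an illustration under that assumption and is not the VM code: `ShortestPathSketch` and its methods are hypothetical names, and the edge map holds only the control inputs transcribed from the dump above.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; not the VM implementation of dump_bfs.
final class ShortestPathSketch {
    // BFS along input edges from start; returns the node idx sequence of a
    // shortest path to target, or an empty list if target is farther away
    // than maxDistance (or unreachable).
    static List<Integer> shortestPath(Map<Integer, List<Integer>> inputs,
                                      int start, int target, int maxDistance) {
        Map<Integer, Integer> distance = new HashMap<>();
        Map<Integer, Integer> parent = new HashMap<>();
        Deque<Integer> queue = new ArrayDeque<>();
        distance.put(start, 0);
        queue.add(start);
        while (!queue.isEmpty()) {
            int node = queue.remove();
            if (node == target) {
                // Walk the parent chain back to start, then reverse it.
                List<Integer> path = new ArrayList<>();
                for (Integer n = node; n != null; n = parent.get(n)) {
                    path.add(n);
                }
                Collections.reverse(path);
                return path;
            }
            int d = distance.get(node);
            if (d == maxDistance) {
                continue; // do not expand beyond the distance limit
            }
            for (int next : inputs.getOrDefault(node, List.of())) {
                if (!distance.containsKey(next)) {
                    distance.put(next, d + 1);
                    parent.put(next, node); // node one step closer to start
                    queue.add(next);
                }
            }
        }
        return List.of();
    }

    // Control inputs only, transcribed from the dump above.
    static Map<Integer, List<Integer>> sampleControlInputs() {
        Map<Integer, List<Integer>> inputs = new HashMap<>();
        inputs.put(741, List.of(179));
        inputs.put(179, List.of(178));
        inputs.put(178, List.of(149));
        inputs.put(149, List.of(148));
        inputs.put(148, List.of(746));
        inputs.put(746, List.of(746, 745, 190));
        return inputs;
    }

    public static void main(String[] args) {
        System.out.println(shortestPath(sampleControlInputs(), 741, 746, 20));
    }
}
```

On this sample data the search from `741` reaches `746` at distance 5 via `179`, `178`, `149` and `148`, matching the `d` column of the dump; with a limit of 4 no path is reported.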
> > Example (all paths between two nodes): > > (rr) p find_node(741)->dump_bfs(8,find_node(746),"cdmxo+A") > d apd dump > --------------------------------------------- > 6 8 146 CmpU === _ 141 79 [[ 147 ]] !jvms: StringLatin1::replace @ bci:25 (line 304) > 5 8 166 LoadB === 149 7 164 [[ 176 747 ]] @byte[int:>=0]:exact+any *, idx=5; #byte !jvms: StringLatin1::replace @ bci:25 (line 304) > 5 8 147 Bool === _ 146 [[ 148 ]] [lt] !jvms: StringLatin1::replace @ bci:25 (line 304) > 5 5 746 CountedLoop === 746 745 190 [[ 746 148 141 ]] stride: 1 strip mined !orig=[735],[138] !jvms: StringLatin1::replace @ bci:22 (line 304) > 4 5 141 Phi === 746 36 186 [[ 185 186 162 146 154 154 747 ]] #int:0..max-1:www #tripcount !orig=[161] !jvms: StringLatin1::replace @ bci:22 (line 304) > 4 8 176 CmpI === _ 166 169 [[ 177 ]] !jvms: StringLatin1::replace @ bci:28 (line 304) > 4 5 148 RangeCheck === 746 147 [[ 149 152 ]] P=0.999999, C=-1.000000 !jvms: StringLatin1::replace @ bci:25 (line 304) > 3 5 186 AddI === _ 141 51 [[ 185 739 141 ]] !orig=[738],... !jvms: StringLatin1::replace @ bci:13 (line 303) > 3 8 177 Bool === _ 176 [[ 178 ]] [ne] !jvms: StringLatin1::replace @ bci:28 (line 304) > 3 5 149 IfTrue === 148 [[ 178 166 ]] #1 !orig=170 !jvms: StringLatin1::replace @ bci:25 (line 304) > 2 5 739 CmpI === _ 186 79 [[ 740 ]] !orig=[187] !jvms: StringLatin1::replace @ bci:19 (line 303) > 2 5 178 If === 149 177 [[ 179 180 ]] P=0.800503, C=7153.000000 !jvms: StringLatin1::replace @ bci:28 (line 304) > 1 5 740 Bool === _ 739 [[ 741 ]] [lt] !orig=[188] !jvms: StringLatin1::replace @ bci:19 (line 303) > 1 5 179 IfTrue === 178 [[ 741 ]] #1 !jvms: StringLatin1::replace @ bci:28 (line 304) > 0 5 741 CountedLoopEnd === 179 740 [[ 742 190 ]] [lt] P=0.993611, C=7200.000000 !orig=[189] !jvms: StringLatin1::replace @ bci:19 (line 303) > > We see there are multiple paths. 
We can quickly see that there are paths with length 5 (`apd = 5`): the control flow, but also the data flow for the loop-back condition. We also see some paths with length 8, which feed into `178 If` and `148 RangeCheck`. Note that the distance `d` is the distance to the start node `741 CountedLoopEnd`. The all paths distance `apd` is the sum of the shortest path from the current node to the start and the shortest path from the current node to the target. Thus, we can easily compute the distance to the target node with `apd - d`.
>
> An alternative way to detect loops quickly is to run an all paths query from a node to itself:
>
> Example (loop detection with all paths):
>
> (rr) p find_node(741)->dump_bfs(7,find_node(741),"c+A")
> d apd dump
> ---------------------------------------------
> 6 7 190 IfTrue === 741 [[ 746 ]] #1 !jvms: StringLatin1::replace @ bci:19 (line 303)
> 5 7 746 CountedLoop === 746 745 190 [[ 746 148 141 ]] stride: 1 strip mined !orig=[735],[138] !jvms: StringLatin1::replace @ bci:22 (line 304)
> 4 7 148 RangeCheck === 746 147 [[ 149 152 ]] P=0.999999, C=-1.000000 !jvms: StringLatin1::replace @ bci:25 (line 304)
> 3 7 149 IfTrue === 148 [[ 178 166 ]] #1 !orig=170 !jvms: StringLatin1::replace @ bci:25 (line 304)
> 2 7 178 If === 149 177 [[ 179 180 ]] P=0.800503, C=7153.000000 !jvms: StringLatin1::replace @ bci:28 (line 304)
> 1 7 179 IfTrue === 178 [[ 741 ]] #1 !jvms: StringLatin1::replace @ bci:28 (line 304)
> 0 0 741 CountedLoopEnd === 179 740 [[ 742 190 ]] [lt] P=0.993611, C=7200.000000 !orig=[189] !jvms: StringLatin1::replace @ bci:19 (line 303)
>
> We get the loop control, plus the loop-back `190 IfTrue`.
>
> Example (loop detection with all paths for phi):
>
> (rr) p find_node(141)->dump_bfs(4,find_node(141),"cdmxo+A")
> d apd dump
> ---------------------------------------------
> 1 2 186 AddI === _ 141 51 [[ 185 739 141 ]] !orig=[738],...
!jvms: StringLatin1::replace @ bci:13 (line 303) > 0 0 141 Phi === 746 36 186 [[ 185 186 162 146 154 154 747 ]] #int:0..max-1:www #tripcount !orig=[161] !jvms: StringLatin1::replace @ bci:22 (line 304) > > > **Color examples** > Colors are especially useful to see chains between nodes (options character `#`). > The input and output node idx are also colored if the node is displayed somewhere in the list. This should help you find chains of nodes. > Tip: it can be worth it to configure the colors of your terminal to be more appealing. > > Example (find control dependency of data node): > ![image](https://user-images.githubusercontent.com/32593061/171135935-259d1e15-91d2-4c54-b924-8f5d4b20d338.png) > We see data nodes in blue, and find a `SafePoint` in red and the `Return` in yellow. > > Example (find memory dependency of data node): > ![image](https://user-images.githubusercontent.com/32593061/171138929-d464bd1b-a807-4b9e-b4cc-ec32735cb024.png) > > Example (loop detection): > ![image](https://user-images.githubusercontent.com/32593061/171134459-27ddaa7f-756b-4807-8a98-44ae0632ab5c.png) > We find the control and some data loop paths. 
Emanuel Peter has updated the pull request incrementally with two additional commits since the last revision: - add more examples in comments - fix int size_t conversion for windows ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/8468/files - new: https://git.openjdk.java.net/jdk/pull/8468/files/65144f95..9eef940f Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=8468&range=20 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=8468&range=19-20 Stats: 20 lines in 1 file changed: 14 ins; 4 del; 2 mod Patch: https://git.openjdk.java.net/jdk/pull/8468.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8468/head:pull/8468 PR: https://git.openjdk.java.net/jdk/pull/8468 From duke at openjdk.java.net Tue May 31 11:22:47 2022 From: duke at openjdk.java.net (Emanuel Peter) Date: Tue, 31 May 2022 11:22:47 GMT Subject: RFR: 8283775: better dump: VM support for graph querying in debugger with BFS traversal and node filtering [v22] In-Reply-To: References: Message-ID: > **What this gives you for the debugger** > - BFS traversal (inputs / outputs) > - node filtering by category > - shortest path between nodes > - all paths between nodes > - readability in terminal: alignment, sorting by node idx, distance to start, and colors (optional) > - and more > > **Some usecases** > - more readable `dump` > - follow only nodes of some categories (only control, only data, etc) > - find which control nodes depend on data node (visit data nodes, include control in boundary) > - how two nodes relate (shortest / all paths, following input/output nodes, or both) > - find loops (control / memory / data: call all paths with node as start and target) > > **Description** > I implemented VM support for BFS traversal of IR nodes, where one can filter for input/output edges, and the node-type (control / data / memory). If one specifies the `target` node, we find a shortest path between `this` and `target`. 
With the `options` string one can easily select which node types to visit (`cdmxo`) and which to include only in the boundary (`CDMXO`). To find all paths between two nodes, include the letter `A` in the options string. > > `void Node::dump_bfs(const int max_distance, Node* target, char const* options)` > > While a sufficient summary and description is in the comments above the function definition, I want to explain some useful use-cases here. > > Please let me know if you would find this helpful, or if you have any feedback to improve it. > Thanks, Emanuel > > PS: I do plan to refactor the `dump` code in `node.cpp` to use my new infrastructure. I will also remove `Node::related` and `dump_related,` since it has not been properly extended and maintained. But that refactoring would risk messing with tools that depend on `dump`, which I would like to avoid for now, and do that in a second step. > > **1. Better dump()** > The function is similar to `dump()`, we can also follow input / output edges up to a certain distance. There are a few improvements/added features: > > 1. Display the distance to the specified node. This way I can quickly assess how many nodes I have at a certain distance, and I can traverse edges from one distance to the next. > 2. Choose if you want to traverse only input `+` or output `-` edges, or both `+-`. > 3. Node filtering with types taken from `type->category()` help one limit traversal to control, data or memory flow. The categories are the same as in IGV. > 4. Separate visit / boundary filters by node type: traverse graph visiting only some node types (eg. data). On the boundary, also display but do not traverse nodes allowed by boundary filter (eg. control). This can be useful to traverse outputs of a data node recursively, and see what control nodes depend on it. Use `dcmxo` for visit filter, and `DCMXO` for boundary filter. > 5. Probably controversial: coloring of node types in terminal. By default it is off. 
May not work on all terminals. But I find it very useful to quickly separate memory, data and control - just as in IGV - but in my terminal! Highly recommend putting the `#` in the options string! To more easily trace chains of nodes, I highlight the node idx of all nodes that are displayed in their respective colors. > 6. After matching, we create Mach nodes from the IR nodes. For most Mach nodes, there is an associated old node. Often an assert is triggered for Mach nodes, but the graph already breaks in the IR graph. It is thus very helpful to have the old node idx displayed. Use `@` in options string. > 7. I also display the head node idx of the block a Mach node is scheduled for. This can be helpful if one gets bad dominance asserts, and needs to see why the blocks were scheduled in a bad way. Use `B` in options string. > 8. Some people like the displayed nodes to be sorted by node idx. Simply add an `S` to the option string! > > Example (BFS inputs): > > (rr) p find_node(161)->dump_bfs(2,0,"dcmxo+") > d dump > --------------------------------------------- > 2 159 CmpI === _ 137 40 [[ 160 ]] !orig=[144] !jvms: StringLatin1::hashCode @ bci:13 (line 193) > 2 147 IfTrue === 161 [[ 166 ]] #1 !jvms: StringLatin1::hashCode @ bci:13 (line 193) > 2 165 OuterStripMinedLoop === 165 93 164 [[ 165 166 ]] > 1 160 Bool === _ 159 [[ 161 ]] [lt] !orig=[145] !jvms: StringLatin1::hashCode @ bci:13 (line 193) > 1 166 CountedLoop === 166 165 147 [[ 166 161 102 103 ]] stride: 1 strip mined !orig=[157],[99] !jvms: StringLatin1::hashCode @ bci:16 (line 193) > 0 161 CountedLoopEnd === 166 160 [[ 162 147 ]] [lt] P=0.957374, C=19675.000000 !orig=[146] !jvms: StringLatin1::hashCode @ bci:13 (line 193) > > > Example (BFS control inputs): > > (rr) p find_node(163)->dump_bfs(5,0,"c+") > d dump > --------------------------------------------- > 5 147 IfTrue === 161 [[ 166 ]] #1 !jvms: StringLatin1::hashCode @ bci:13 (line 193) > 5 165 OuterStripMinedLoop === 165 93 164 [[ 165 166 ]] > 4 166 
CountedLoop === 166 165 147 [[ 166 161 102 103 ]] stride: 1 strip mined !orig=[157],[99] !jvms: StringLatin1::hashCode @ bci:16 (line 193) > 3 161 CountedLoopEnd === 166 160 [[ 162 147 ]] [lt] P=0.957374, C=19675.000000 !orig=[146] !jvms: StringLatin1::hashCode @ bci:13 (line 193) > 2 162 IfFalse === 161 [[ 167 168 ]] #0 !orig=148 !jvms: StringLatin1::hashCode @ bci:13 (line 193) > 1 167 SafePoint === 162 1 7 1 1 168 1 136 37 40 137 1 [[ 163 ]] SafePoint !orig=138 !jvms: StringLatin1::hashCode @ bci:37 (line 193) > 0 163 OuterStripMinedLoopEnd === 167 22 [[ 164 148 ]] P=0.957374, C=19675.000000 > > We see the control flow of a strip mined loop. > > > Experiment (BFS only data, but display all nodes on boundary) > > (rr) p find_node(102)->dump_bfs(10,0,"dCDMOX-") > d dump > --------------------------------------------- > 0 102 Phi === 166 22 136 [[ 133 132 ]] #int !jvms: StringLatin1::hashCode @ bci:16 (line 193) > 1 133 SubI === _ 132 102 [[ 136 ]] !jvms: StringLatin1::hashCode @ bci:25 (line 194) > 1 132 LShiftI === _ 102 131 [[ 133 ]] !jvms: StringLatin1::hashCode @ bci:25 (line 194) > 2 136 AddI === _ 133 155 [[ 153 167 102 ]] !jvms: StringLatin1::hashCode @ bci:32 (line 194) > 3 153 Phi === 53 136 22 [[ 154 ]] #int !jvms: StringLatin1::hashCode @ bci:13 (line 193) > 3 167 SafePoint === 162 1 7 1 1 168 1 136 37 40 137 1 [[ 163 ]] SafePoint !orig=138 !jvms: StringLatin1::hashCode @ bci:37 (line 193) > 4 154 Return === 53 6 7 8 9 returns 153 [[ 0 ]] > > We see the dependent output nodes of the data-phi 102, we see that a SafePoint and the Return depend on it. Here colors are really helpful, as it makes it easy to separate the data-nodes (blue) from the boundary-nodes (other colors). 
>
> Example with Mach nodes:
>
> (rr) p find_node(112)->dump_bfs(2,0,"cdmxo+#@B")
> d [head idom d] old dump
> ---------------------------------------------
> 2 534 505 6 o1871 109 addI_rReg_imm === _ 44 [[ 110 102 113 230 327 ]] #-3/0xfffffffd
> 2 536 537 15 o186 139 addI_rReg_imm === _ 137 [[ 140 137 113 144 ]] #4/0x00000004 !jvms: StringLatin1::replace @ bci:13 (line 303)
> 2 537 538 14 o179 114 IfTrue === 115 [[ 536 ]] #1 !jvms: StringLatin1::replace @ bci:28 (line 304)
> 1 536 537 15 o739 113 compI_rReg === _ 139 109 [[ 112 ]]
> 1 536 537 15 _ 536 Region === 536 114 [[ 536 112 ]]
> 0 536 537 15 o741 112 jmpLoopEnd === 536 113 [[ 134 111 ]] P=0.993611, C=7200.000000 !jvms: StringLatin1::replace @ bci:19 (line 303)
>
> And the query on the old nodes:
>
> (rr) p find_old_node(741)->dump_bfs(2,0,"cdmxo+#")
> d dump
> ---------------------------------------------
> 2 o1871 AddI === _ o79 o1872 [[ o739 o1948 o761 o1477 ]]
> 2 o186 AddI === _ o1756 o1714 [[ o1756 o739 o1055 ]]
> 2 o178 If === o1159 o177 o176 [[ o179 o180 ]] P=0.800503, C=7153.000000
> 1 o739 CmpI === _ o186 o1871 [[ o740 o741 ]]
> 1 o740 Bool === _ o739 [[ o741 ]] [lt]
> 1 o179 IfTrue === o178 [[ o741 ]] #1
> 0 o741 CountedLoopEnd === o179 o740 o739 [[ o742 o190 ]] [lt] P=0.993611, C=7200.000000
>
>
> **2. Find loop body**
> When I find myself in a loop, I try to locate the loop head and end, and map out at least one path between them.
> `loop_end->dump_bfs(20, loop_head, "c+")`
> This provides us with a shortest control path, provided that path has a length of at most 20.
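
The shortest-path query described above can be pictured with a minimal sketch in Java. This is not the HotSpot implementation: the class `N`, the one-character category encoding and the method `shortestPath` are hypothetical names made up for illustration, and only input edges (the `+` option) are followed. The small graph copies a few inputs from the `StringLatin1::replace` dump in this message.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not HotSpot code: N stands in for a C2 IR node.
class BfsSketch {
    static final class N {
        final int idx;
        final char category;                  // 'c' control, 'd' data (assumed encoding)
        final List<N> inputs = new ArrayList<>();
        N(int idx, char category) { this.idx = idx; this.category = category; }
    }

    // BFS over input edges from start. Only nodes whose category occurs in
    // `visit` are traversed. Returns the node idx from target back to start,
    // or an empty list if target is not reached within maxDistance.
    static List<Integer> shortestPath(N start, N target, int maxDistance, String visit) {
        Map<N, N> parent = new HashMap<>();          // node -> node one step closer to start
        Map<N, Integer> distance = new HashMap<>();
        Deque<N> queue = new ArrayDeque<>();
        distance.put(start, 0);
        queue.add(start);
        while (!queue.isEmpty()) {
            N n = queue.remove();
            if (n == target) {
                List<Integer> path = new ArrayList<>();
                for (N p = n; p != null; p = parent.get(p)) {
                    path.add(p.idx);
                }
                return path;
            }
            if (distance.get(n) == maxDistance) {
                continue;                            // do not expand beyond the limit
            }
            for (N in : n.inputs) {
                if (visit.indexOf(in.category) >= 0 && !distance.containsKey(in)) {
                    distance.put(in, distance.get(n) + 1);
                    parent.put(in, n);
                    queue.add(in);
                }
            }
        }
        return List.of();
    }

    // A few nodes and inputs taken from the example dump (subset only).
    static Map<Integer, N> exampleGraph() {
        Map<Integer, N> g = new HashMap<>();
        for (int i : new int[] {741, 179, 178, 149, 148, 746}) g.put(i, new N(i, 'c'));
        for (int i : new int[] {740, 177, 147}) g.put(i, new N(i, 'd'));
        link(g, 741, 179, 740);
        link(g, 179, 178);
        link(g, 178, 149, 177);
        link(g, 149, 148);
        link(g, 148, 746, 147);
        return g;
    }

    static void link(Map<Integer, N> g, int node, int... inputs) {
        for (int in : inputs) g.get(node).inputs.add(g.get(in));
    }

    public static void main(String[] args) {
        Map<Integer, N> g = exampleGraph();
        // prints [746, 148, 149, 178, 179, 741]
        System.out.println(shortestPath(g.get(741), g.get(746), 20, "c"));
    }
}
```

The printed order matches the dump convention of listing the farthest node first. With a `maxDistance` of 4 the sketch returns an empty list, because the target sits at distance 5.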
>
> Example (shortest path over control nodes):
>
> (rr) p find_node(741)->dump_bfs(20,find_node(746),"c+")
> d dump
> ---------------------------------------------
> 5 746 CountedLoop === 746 745 190 [[ 746 148 141 ]] stride: 1 strip mined !orig=[735],[138] !jvms: StringLatin1::replace @ bci:22 (line 304)
> 4 148 RangeCheck === 746 147 [[ 149 152 ]] P=0.999999, C=-1.000000 !jvms: StringLatin1::replace @ bci:25 (line 304)
> 3 149 IfTrue === 148 [[ 178 166 ]] #1 !orig=170 !jvms: StringLatin1::replace @ bci:25 (line 304)
> 2 178 If === 149 177 [[ 179 180 ]] P=0.800503, C=7153.000000 !jvms: StringLatin1::replace @ bci:28 (line 304)
> 1 179 IfTrue === 178 [[ 741 ]] #1 !jvms: StringLatin1::replace @ bci:28 (line 304)
> 0 741 CountedLoopEnd === 179 740 [[ 742 190 ]] [lt] P=0.993611, C=7200.000000 !orig=[189] !jvms: StringLatin1::replace @ bci:19 (line 303)
>
>
> Once we see this single path in the loop, we may want to see more of the body. For this, we can run an `all paths` query, with the additional character `A` in the options string. We see all nodes that lie on a path between the start and target node, with at most the specified path length.
> > Example (all paths between two nodes): > > (rr) p find_node(741)->dump_bfs(8,find_node(746),"cdmxo+A") > d apd dump > --------------------------------------------- > 6 8 146 CmpU === _ 141 79 [[ 147 ]] !jvms: StringLatin1::replace @ bci:25 (line 304) > 5 8 166 LoadB === 149 7 164 [[ 176 747 ]] @byte[int:>=0]:exact+any *, idx=5; #byte !jvms: StringLatin1::replace @ bci:25 (line 304) > 5 8 147 Bool === _ 146 [[ 148 ]] [lt] !jvms: StringLatin1::replace @ bci:25 (line 304) > 5 5 746 CountedLoop === 746 745 190 [[ 746 148 141 ]] stride: 1 strip mined !orig=[735],[138] !jvms: StringLatin1::replace @ bci:22 (line 304) > 4 5 141 Phi === 746 36 186 [[ 185 186 162 146 154 154 747 ]] #int:0..max-1:www #tripcount !orig=[161] !jvms: StringLatin1::replace @ bci:22 (line 304) > 4 8 176 CmpI === _ 166 169 [[ 177 ]] !jvms: StringLatin1::replace @ bci:28 (line 304) > 4 5 148 RangeCheck === 746 147 [[ 149 152 ]] P=0.999999, C=-1.000000 !jvms: StringLatin1::replace @ bci:25 (line 304) > 3 5 186 AddI === _ 141 51 [[ 185 739 141 ]] !orig=[738],... !jvms: StringLatin1::replace @ bci:13 (line 303) > 3 8 177 Bool === _ 176 [[ 178 ]] [ne] !jvms: StringLatin1::replace @ bci:28 (line 304) > 3 5 149 IfTrue === 148 [[ 178 166 ]] #1 !orig=170 !jvms: StringLatin1::replace @ bci:25 (line 304) > 2 5 739 CmpI === _ 186 79 [[ 740 ]] !orig=[187] !jvms: StringLatin1::replace @ bci:19 (line 303) > 2 5 178 If === 149 177 [[ 179 180 ]] P=0.800503, C=7153.000000 !jvms: StringLatin1::replace @ bci:28 (line 304) > 1 5 740 Bool === _ 739 [[ 741 ]] [lt] !orig=[188] !jvms: StringLatin1::replace @ bci:19 (line 303) > 1 5 179 IfTrue === 178 [[ 741 ]] #1 !jvms: StringLatin1::replace @ bci:28 (line 304) > 0 5 741 CountedLoopEnd === 179 740 [[ 742 190 ]] [lt] P=0.993611, C=7200.000000 !orig=[189] !jvms: StringLatin1::replace @ bci:19 (line 303) > > We see there are multiple paths. 
We can quickly see that there are paths with length 5 (`apd = 5`): the control flow, but also the data flow for the loop-back condition. We also see some paths with length 8, which feed into `178 If` and `148 RangeCheck`. Note that the distance `d` is the distance to the start node `741 CountedLoopEnd`. The all paths distance `apd` is the sum of the shortest path from the current node to the start and the shortest path from the current node to the target. Thus, we can easily compute the distance to the target node with `apd - d`.
>
> An alternative way to detect loops quickly is to run an all paths query from a node to itself:
>
> Example (loop detection with all paths):
>
> (rr) p find_node(741)->dump_bfs(7,find_node(741),"c+A")
> d apd dump
> ---------------------------------------------
> 6 7 190 IfTrue === 741 [[ 746 ]] #1 !jvms: StringLatin1::replace @ bci:19 (line 303)
> 5 7 746 CountedLoop === 746 745 190 [[ 746 148 141 ]] stride: 1 strip mined !orig=[735],[138] !jvms: StringLatin1::replace @ bci:22 (line 304)
> 4 7 148 RangeCheck === 746 147 [[ 149 152 ]] P=0.999999, C=-1.000000 !jvms: StringLatin1::replace @ bci:25 (line 304)
> 3 7 149 IfTrue === 148 [[ 178 166 ]] #1 !orig=170 !jvms: StringLatin1::replace @ bci:25 (line 304)
> 2 7 178 If === 149 177 [[ 179 180 ]] P=0.800503, C=7153.000000 !jvms: StringLatin1::replace @ bci:28 (line 304)
> 1 7 179 IfTrue === 178 [[ 741 ]] #1 !jvms: StringLatin1::replace @ bci:28 (line 304)
> 0 0 741 CountedLoopEnd === 179 740 [[ 742 190 ]] [lt] P=0.993611, C=7200.000000 !orig=[189] !jvms: StringLatin1::replace @ bci:19 (line 303)
>
> We get the loop control, plus the loop-back `190 IfTrue`.
>
> Example (loop detection with all paths for phi):
>
> (rr) p find_node(141)->dump_bfs(4,find_node(141),"cdmxo+A")
> d apd dump
> ---------------------------------------------
> 1 2 186 AddI === _ 141 51 [[ 185 739 141 ]] !orig=[738],...
!jvms: StringLatin1::replace @ bci:13 (line 303) > 0 0 141 Phi === 746 36 186 [[ 185 186 162 146 154 154 747 ]] #int:0..max-1:www #tripcount !orig=[161] !jvms: StringLatin1::replace @ bci:22 (line 304) > > > **Color examples** > Colors are especially useful to see chains between nodes (options character `#`). > The input and output node idx are also colored if the node is displayed somewhere in the list. This should help you find chains of nodes. > Tip: it can be worth it to configure the colors of your terminal to be more appealing. > > Example (find control dependency of data node): > ![image](https://user-images.githubusercontent.com/32593061/171135935-259d1e15-91d2-4c54-b924-8f5d4b20d338.png) > We see data nodes in blue, and find a `SafePoint` in red and the `Return` in yellow. > > Example (find memory dependency of data node): > ![image](https://user-images.githubusercontent.com/32593061/171138929-d464bd1b-a807-4b9e-b4cc-ec32735cb024.png) > > Example (loop detection): > ![image](https://user-images.githubusercontent.com/32593061/171134459-27ddaa7f-756b-4807-8a98-44ae0632ab5c.png) > We find the control and some data loop paths. 
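
The `d` and `apd` columns of the all-paths example in this message (start `741`, target `746`) can be reproduced with a small Java sketch. It is only an illustration under assumptions: the adjacency map, `bfs`, `reverse` and `allPathsDistance` are hypothetical names, the graph is a subset of the inputs shown in that dump, and the real `dump_bfs` may compute the values differently. The idea is one BFS from the start along input edges, one BFS from the target along the reversed edges, and `apd` as the sum of both distances.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch, not HotSpot code. Nodes are identified by idx only.
class AllPathsSketch {
    // Plain BFS: distance (in edges) from `from` to every reachable node.
    static Map<Integer, Integer> bfs(Map<Integer, List<Integer>> edges, int from) {
        Map<Integer, Integer> dist = new HashMap<>();
        Deque<Integer> queue = new ArrayDeque<>();
        dist.put(from, 0);
        queue.add(from);
        while (!queue.isEmpty()) {
            int n = queue.remove();
            for (int next : edges.getOrDefault(n, List.of())) {
                if (!dist.containsKey(next)) {
                    dist.put(next, dist.get(n) + 1);
                    queue.add(next);
                }
            }
        }
        return dist;
    }

    // Turn "node -> inputs" into "node -> users".
    static Map<Integer, List<Integer>> reverse(Map<Integer, List<Integer>> edges) {
        Map<Integer, List<Integer>> rev = new HashMap<>();
        for (Map.Entry<Integer, List<Integer>> e : edges.entrySet()) {
            for (int in : e.getValue()) {
                rev.computeIfAbsent(in, k -> new ArrayList<>()).add(e.getKey());
            }
        }
        return rev;
    }

    // apd(n) = distance(start -> n) + distance(n -> target), both along input
    // edges. Only nodes with apd <= max are kept.
    static Map<Integer, Integer> allPathsDistance(Map<Integer, List<Integer>> inputs,
                                                  int start, int target, int max) {
        Map<Integer, Integer> fromStart = bfs(inputs, start);
        Map<Integer, Integer> toTarget = bfs(reverse(inputs), target);
        Map<Integer, Integer> apd = new TreeMap<>();
        for (Map.Entry<Integer, Integer> e : fromStart.entrySet()) {
            Integer rest = toTarget.get(e.getKey());
            if (rest != null && e.getValue() + rest <= max) {
                apd.put(e.getKey(), e.getValue() + rest);
            }
        }
        return apd;
    }

    // Subset of the inputs from the all-paths dump.
    static Map<Integer, List<Integer>> exampleInputs() {
        Map<Integer, List<Integer>> in = new HashMap<>();
        in.put(741, List.of(179, 740));
        in.put(179, List.of(178));
        in.put(178, List.of(149, 177));
        in.put(149, List.of(148));
        in.put(148, List.of(746, 147));
        in.put(740, List.of(739));
        in.put(739, List.of(186));
        in.put(186, List.of(141));
        in.put(141, List.of(746, 186));
        in.put(177, List.of(176));
        in.put(176, List.of(166));
        in.put(166, List.of(149));
        in.put(147, List.of(146));
        in.put(146, List.of(141));
        return in;
    }

    public static void main(String[] args) {
        // Keys are node idx, values are apd, for a maximum path length of 8.
        System.out.println(allPathsDistance(exampleInputs(), 741, 746, 8));
    }
}
```

On this subset the sketch yields `apd = 5` for the control path and the loop-back condition, and `apd = 8` for `146`, `147`, `166`, `176` and `177`, which agrees with the `apd` column in the dump.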
Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: remove unecessary newlines ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/8468/files - new: https://git.openjdk.java.net/jdk/pull/8468/files/9eef940f..86be394a Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=8468&range=21 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=8468&range=20-21 Stats: 3 lines in 1 file changed: 0 ins; 3 del; 0 mod Patch: https://git.openjdk.java.net/jdk/pull/8468.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8468/head:pull/8468 PR: https://git.openjdk.java.net/jdk/pull/8468 From jiefu at openjdk.java.net Tue May 31 13:12:47 2022 From: jiefu at openjdk.java.net (Jie Fu) Date: Tue, 31 May 2022 13:12:47 GMT Subject: RFR: 8282470: Eliminate useless sign extension before some subword integer operations [v3] In-Reply-To: References: Message-ID: On Thu, 26 May 2022 06:18:33 GMT, Fei Gao wrote: >> Some loop cases of subword types, including byte and short, can't be vectorized by C2's SLP. Here is an example: >> >> short[] addShort(short[] a, short[] b, short[] c) { >> for (int i = 0; i < SIZE; i++) { >> b[i] = (short) (a[i] + 8); // line A >> sres[i] = (short) (b[i] + c[i]); // line B >> } >> } >> >> However, similar cases of int/float/double/long/char type can be vectorized successfully. >> >> The reason why SLP can't vectorize the short case above is that, as illustrated here[1], the result of the scalar add operation on *line A* has been promoted to int type. It needs to be narrowed to short type first before it can work as one of source operands of addition on *line B*. The demotion is done by left-shifting 16 bits then right-shifting 16 bits. The ideal graph for the process is showed like below. 
>> ![image](https://user-images.githubusercontent.com/39403138/160074255-c751f84b-6511-4b56-927b-53fb512cf51b.png)
>>
>> In SLP, for most short-type cases, we can determine the precise type of the scalar int-type operation and finally execute it with short-type vector operations[2], except rshift opcode and abs in some situations[3]. But in this case, the source operand of RShiftI is from LShiftI rather than from any LoadS[4], so we can't determine its real type and conservatively assign it int type rather than the real short type. The int-type operation RShiftI here can't be vectorized together with other short-type operations, like AddI (line B). The reason for byte loop cases is the same. Similar loop cases of char type can be vectorized because the demotion from int to char is done by `and` with a mask rather than `lshift_rshift`.
>>
>> Therefore, we try to remove patterns like `RShiftI _ (LShiftI _ valIn1 conIL ) conIR` in the byte/short cases, to vectorize more scenarios. Optimizing it in the mid-end by i-GVN is more reasonable.
>>
>> What we do in the mid-end is eliminate the sign extension before some subword integer operations like:
>>
>>
>> int x, y;
>> short s = (short) (((x << Imm) >> Imm) OP y); // Imm <= 16
>>
>> to
>>
>> short s = (short) (x OP y);
>>
>>
>> In the patch, assuming that `x` can be any int number, we need to guarantee that the optimization doesn't have any impact on the result. Not all arithmetic logic OPs meet the requirements. For example, assuming that `Imm` equals `16`, `x` equals `131068`, `y` equals `50` and `OP` is division (`/`), `short s = (short) (((131068 << 16) >> 16) / 50)` is not equal to `short s = (short) (131068 / 50)`. When OP is division, we may get a different result with or without demotion before OP, because the upper 16 bits of the dividend may influence the lower 16 bits of the result, so this case can't be optimized.
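
The numbers in the division example above can be checked with a few lines of Java. This is only an illustration of the arithmetic, not code from the patch; the class and method names are made up. It contrasts addition, where dropping the `(x << 16) >> 16` sign extension leaves the `short` result unchanged, with division, where it does not.

```java
// Illustrative only: names are hypothetical and unrelated to the C2 sources.
class SignExtensionDemo {
    // Sign-extend the low 16 bits of x, like the LShiftI/RShiftI pair does.
    static int signExtend16(int x) {
        return (x << 16) >> 16;
    }

    static short addWithExtension(int x, int y)    { return (short) (signExtend16(x) + y); }
    static short addWithoutExtension(int x, int y) { return (short) (x + y); }
    static short divWithExtension(int x, int y)    { return (short) (signExtend16(x) / y); }
    static short divWithoutExtension(int x, int y) { return (short) (x / y); }

    public static void main(String[] args) {
        int x = 131068;   // 0x1FFFC, the low 16 bits are 0xFFFC, that is -4 as a short
        int y = 50;
        // Addition: the low 16 bits of the sum depend only on the low 16 bits of the inputs.
        System.out.println(addWithExtension(x, y) + " " + addWithoutExtension(x, y)); // prints "46 46"
        // Division: the upper bits of x change the low 16 bits of the quotient.
        System.out.println(divWithExtension(x, y) + " " + divWithoutExtension(x, y)); // prints "0 2621"
    }
}
```

With the sign extension, `x` becomes `-4`, so the division yields `0`, while `131068 / 50` is `2621`. For addition both variants give `46`.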
All optimizable opcodes are listed in StoreNode::no_need_sign_extension(). For these opcodes, the upper 16 bits of the src operands don't influence the lower 16 bits of the result for short type, and the upper 24 bits of the src operands don't influence the lower 8 bits of the result for byte.
>>
>> After the patch, the short loop case above can be vectorized as:
>>
>> movi v18.8h, #0x8
>> ...
>> ldr q16, [x14, #32] // vector load a[i]
>> // vector add, a[i] + 8, no promotion or demotion
>> add v17.8h, v16.8h, v18.8h
>> str q17, [x6, #32] // vector store a[i] + 8, b[i]
>> ldr q17, [x0, #32] // vector load c[i]
>> // vector add, a[i] + c[i], no promotion or demotion
>> add v16.8h, v17.8h, v16.8h
>> // vector add, a[i] + c[i] + 8, no promotion or demotion
>> add v16.8h, v16.8h, v18.8h
>> str q16, [x11, #32] //vector store sres[i]
>> ...
>>
>>
>> The patch works for byte cases as well.
>>
>> Here is the performance data for micro-benchmark before and after this patch on both AArch64 and x64 machines. We can observe an improvement of about 83% with this patch.
>>
>> on AArch64:
>> Before the patch:
>> Benchmark (length) Mode Cnt   Score    Error Units
>> addB         523   avgt  15 401.521 ±  0.033 ns/op
>> addS         523   avgt  15 401.512 ±  0.021 ns/op
>>
>> After the patch:
>> Benchmark (length) Mode Cnt   Score    Error Units
>> addB         523   avgt  15  68.444 ±  0.318 ns/op
>> addS         523   avgt  15  69.847 ±  0.043 ns/op
>>
>> on x86:
>> Before the patch:
>> Benchmark (length) Mode Cnt   Score    Error Units
>> addB         523   avgt  15 454.102 ± 36.180 ns/op
>> addS         523   avgt  15 432.245 ± 22.640 ns/op
>>
>> After the patch:
>> Benchmark (length) Mode Cnt   Score    Error Units
>> addB         523   avgt  15  75.812 ±  5.063 ns/op
>> addS         523   avgt  15  72.839 ±
10.109 ns/op >> >> [1]: https://github.com/openjdk/jdk/blob/6013d09e82693a1c07cf0bf6daffd95114b3cbfa/src/hotspot/share/opto/superword.cpp#L3241 >> [2]: https://github.com/openjdk/jdk/blob/6013d09e82693a1c07cf0bf6daffd95114b3cbfa/src/hotspot/share/opto/superword.cpp#L3206 >> [3]: https://github.com/openjdk/jdk/blob/6013d09e82693a1c07cf0bf6daffd95114b3cbfa/src/hotspot/share/opto/superword.cpp#L3249 >> [4]: https://github.com/openjdk/jdk/blob/6013d09e82693a1c07cf0bf6daffd95114b3cbfa/src/hotspot/share/opto/superword.cpp#L3251 > > Fei Gao has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains three additional commits since the last revision: > > - Merge branch 'master' into fg8282470 > > Change-Id: I180f1c85bd407b3d7e05937450c5fc0f81e6d70b > - Merge branch 'master' into fg8282470 > > Change-Id: I877ba1e9a82c0dbef04df08070223c02400eeec7 > - 8282470: Eliminate useless sign extension before some subword integer operations > > Some loop cases of subword types, including byte and > short, can't be vectorized by C2's SLP. Here is an example: > ``` > short[] addShort(short[] a, short[] b, short[] c) { > for (int i = 0; i < SIZE; i++) { > b[i] = (short) (a[i] + 8); // *line A* > sres[i] = (short) (b[i] + c[i]); // *line B* > } > } > ``` > However, similar cases of int/float/double/long/char type can > be vectorized successfully. > > The reason why SLP can't vectorize the short case above is > that, as illustrated here[1], the result of the scalar add > operation on *line A* has been promoted to int type. It needs > to be narrowed to short type first before it can work as one > of source operands of addition on *line B*. The demotion is > done by left-shifting 16 bits then right-shifting 16 bits. > The ideal graph for the process is showed like below. 
>
>    LoadS a[i]   8
>          \     /
>       AddI (line A)
>        /         \
> StoreC b[i]   Lshift 16bits
>                    \
>              RShiftI 16 bits   LoadS c[i]
>                         \       /
>                       AddI (line B)
>                            \
>                        StoreC sres[i]
>
> In SLP, for most short-type cases, we can determine the precise
> type of the scalar int-type operation and finally execute it
> with short-type vector operations[2], except rshift opcode and
> abs in some situations[3]. But in this case, the source operand
> of RShiftI is from LShiftI rather than from any LoadS[4], so we
> can't determine its real type and conservatively assign it
> int type rather than the real short type. The int-type operation
> RShiftI here can't be vectorized together with other short-type
> operations, like AddI (line B). The reason for byte loop cases
> is the same. Similar loop cases of char type can be
> vectorized because the demotion from int to char is done by
> `and` with a mask rather than `lshift_rshift`.
>
> Therefore, we try to remove patterns like
> `RShiftI _ (LShiftI _ valIn1 conIL ) conIR` in the byte/short
> cases, to vectorize more scenarios. Optimizing it in the
> mid-end by i-GVN is more reasonable.
>
> What we do in the mid-end is eliminate the sign extension
> before some subword integer operations like:
>
> ```
> int x, y;
> short s = (short) (((x << Imm) >> Imm) OP y); // Imm <= 16
> ```
> to
> ```
> short s = (short) (x OP y);
> ```
>
> In the patch, assuming that `x` can be any int number, we need
> to guarantee that the optimization doesn't have any impact on
> the result. Not all arithmetic logic OPs meet the requirements. For
> example, assuming that `Imm` equals `16`, `x` equals `131068`,
> `y` equals `50` and `OP` is division (`/`),
> `short s = (short) (((131068 << 16) >> 16) / 50)` is not
> equal to `short s = (short) (131068 / 50)`. When OP is division,
> we may get a different result with or without demotion
> before OP, because the upper 16 bits of the dividend may
> influence the lower 16 bits of the result, so this case can't be
> optimized.
All optimizable opcodes are listed in
> StoreNode::no_need_sign_extension(). For these opcodes, the upper
> 16 bits of the src operands don't influence the lower 16 bits of
> the result for short type, and the upper 24 bits of the src
> operands don't influence the lower 8 bits of the result for byte.
>
> After the patch, the short loop case above can be vectorized as:
> ```
> movi v18.8h, #0x8
> ...
> ldr q16, [x14, #32] // vector load a[i]
> // vector add, a[i] + 8, no promotion or demotion
> add v17.8h, v16.8h, v18.8h
> str q17, [x6, #32] // vector store a[i] + 8, b[i]
> ldr q17, [x0, #32] // vector load c[i]
> // vector add, a[i] + c[i], no promotion or demotion
> add v16.8h, v17.8h, v16.8h
> // vector add, a[i] + c[i] + 8, no promotion or demotion
> add v16.8h, v16.8h, v18.8h
> str q16, [x11, #32] //vector store sres[i]
> ...
> ```
>
> The patch works for byte cases as well.
>
> Here is the performance data for micro-benchmark before
> and after this patch on both AArch64 and x64 machines.
> We can observe an improvement of about 83% with this patch.
>
> on AArch64:
> Before the patch:
> Benchmark (length) Mode Cnt   Score    Error Units
> addB         523   avgt  15 401.521 ±  0.033 ns/op
> addS         523   avgt  15 401.512 ±  0.021 ns/op
>
> After the patch:
> Benchmark (length) Mode Cnt   Score    Error Units
> addB         523   avgt  15  68.444 ±  0.318 ns/op
> addS         523   avgt  15  69.847 ±  0.043 ns/op
>
> on x86:
> Before the patch:
> Benchmark (length) Mode Cnt   Score    Error Units
> addB         523   avgt  15 454.102 ± 36.180 ns/op
> addS         523   avgt  15 432.245 ± 22.640 ns/op
>
> After the patch:
> Benchmark (length) Mode Cnt   Score    Error Units
> addB         523   avgt  15  75.812 ±  5.063 ns/op
> addS         523   avgt  15  72.839 ±
10.109 ns/op
>
> [1]: https://github.com/openjdk/jdk/blob/6013d09e82693a1c07cf0bf6daffd95114b3cbfa/src/hotspot/share/opto/superword.cpp#L3241
> [2]: https://github.com/openjdk/jdk/blob/6013d09e82693a1c07cf0bf6daffd95114b3cbfa/src/hotspot/share/opto/superword.cpp#L3206
> [3]: https://github.com/openjdk/jdk/blob/6013d09e82693a1c07cf0bf6daffd95114b3cbfa/src/hotspot/share/opto/superword.cpp#L3249
> [4]: https://github.com/openjdk/jdk/blob/6013d09e82693a1c07cf0bf6daffd95114b3cbfa/src/hotspot/share/opto/superword.cpp#L3251
>
> Change-Id: I92ce42b550ef057964a3b58716436735275d8d31

This patch can auto-vectorize `testShort1`, but fails to do so with `testShort2`.

    public static void testShort1(short[] a, short[] b, short[] c, short[] d) {
        for (int i = 0; i < a.length; i++) {
            b[i] = (short)(a[i] + 1);
            d[i] = (short)(b[i] + c[i]);
        }
    }

    public static void testShort2(short[] a, short[] b, short[] c, short[] d) {
        for (int i = 0; i < a.length; i++) {
            b[i] = (short)(a[i] + 1);
            d[i] = (short)(b[i] + c[i] + 1);
        }
    }

So may I ask if it's possible to also auto-vectorize `testShort2`?
Thanks.

-------------

PR: https://git.openjdk.java.net/jdk/pull/7954

From roland at openjdk.java.net Tue May 31 14:55:02 2022
From: roland at openjdk.java.net (Roland Westrelin)
Date: Tue, 31 May 2022 14:55:02 GMT
Subject: RFR: 8287227: Shenandoah: A couple of virtual thread tests failed with iu mode even without Loom enabled.
Message-ID:

With JDK-8277654, the load barrier slow path call doesn't produce raw memory anymore but the IU barrier call still does. I propose removing raw memory for that call too, which also removes the failing assert.
------------- Commit messages: - fix Changes: https://git.openjdk.java.net/jdk/pull/8958/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=8958&range=00 Issue: https://bugs.openjdk.java.net/browse/JDK-8287227 Stats: 290 lines in 2 files changed: 0 ins; 287 del; 3 mod Patch: https://git.openjdk.java.net/jdk/pull/8958.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8958/head:pull/8958 PR: https://git.openjdk.java.net/jdk/pull/8958 From psandoz at openjdk.java.net Tue May 31 15:39:49 2022 From: psandoz at openjdk.java.net (Paul Sandoz) Date: Tue, 31 May 2022 15:39:49 GMT Subject: RFR: 8283894: Intrinsify compress and expand bits on x86 [v8] In-Reply-To: References: Message-ID: On Mon, 30 May 2022 06:56:27 GMT, Jatin Bhateja wrote: >> Summary of changes: >> >> - Patch intrinsifies following newly added Java SE APIs >> - Integer.compress >> - Integer.expand >> - Long.compress >> - Long.expand >> >> - Adds C2 IR nodes and corresponding ideal transformations for new operations. >> - We see around ~10x performance speedup due to intrinsification over X86 target. >> - Adds an IR framework based test to validate newly introduced IR transformations. >> >> Kindly review and share your feedback. >> >> Best Regards, >> Jatin > > Jatin Bhateja has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains 12 additional commits since the last revision: > > - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8283894 > - 8283894: Extending new IR value routines with value propagation logic. > - 8283894: Disabling sanity test as per review suggestion. > - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8283894 > - 8283894: Removing CompressExpandSanityTest from problem list. > - 8283894: Updating test tag spec. > - 8283894: Review comments resolved. > - 8283894: Add missing -XX:+UnlockDiagnosticVMOptions. 
> - 8283894: Review comments resolutions. > - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8283894 > - ... and 2 more: https://git.openjdk.java.net/jdk/compare/3fae08cb...a36dba2e Marked as reviewed by psandoz (Reviewer). ------------- PR: https://git.openjdk.java.net/jdk/pull/8498 From ysuenaga at openjdk.java.net Tue May 31 15:49:09 2022 From: ysuenaga at openjdk.java.net (Yasumasa Suenaga) Date: Tue, 31 May 2022 15:49:09 GMT Subject: RFR: 8287491: compiler/jvmci/errors/TestInvalidDebugInfo.java fails new assert: assert((uint)t < T_CONFLICT + 1) failed: invalid type # Message-ID: We saw a new assertion error after [JDK-8286562](https://bugs.openjdk.java.net/browse/JDK-8286562) ( #8646 ) # A fatal error has been detected by the Java Runtime Environment: # # Internal Error (/home/ysuenaga/github-forked/jdk/src/hotspot/share/utilities/globalDefinitions.hpp:735), pid=2619, tid=2635 # assert((uint)t < T_CONFLICT + 1) failed: invalid type It was caused by passing `JavaKind.Illegal` as the slot kind in TestInvalidDebugInfo.java. We should pass `JavaKind.Void` instead of `JavaKind.Illegal`. See JBS for more details (I attached the hs_err log).
------------- Commit messages: - Remove TestInvalidDebugInfo from ProblemList - 8287491: compiler/jvmci/errors/TestInvalidDebugInfo.java fails new assert: assert((uint)t < T_CONFLICT + 1) failed: invalid type # Changes: https://git.openjdk.java.net/jdk/pull/8954/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=8954&range=00 Issue: https://bugs.openjdk.java.net/browse/JDK-8287491 Stats: 5 lines in 2 files changed: 0 ins; 1 del; 4 mod Patch: https://git.openjdk.java.net/jdk/pull/8954.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8954/head:pull/8954 PR: https://git.openjdk.java.net/jdk/pull/8954 From jbhateja at openjdk.java.net Tue May 31 16:04:55 2022 From: jbhateja at openjdk.java.net (Jatin Bhateja) Date: Tue, 31 May 2022 16:04:55 GMT Subject: RFR: 8284960: Integration of JEP 426: Vector API (Fourth Incubator) [v10] In-Reply-To: References: Message-ID: On Wed, 25 May 2022 06:29:23 GMT, Jatin Bhateja wrote: >> Hi All, >> >> Patch adds the planned support for new vector operations and APIs targeted for [JEP 426: Vector API (Fourth Incubator).](https://bugs.openjdk.java.net/browse/JDK-8280173) >> >> Following is the brief summary of changes:- >> >> 1) Extends the scope of existing lanewise API for following new vector operations. >> - VectorOperations.BIT_COUNT: counts the number of one-bits >> - VectorOperations.LEADING_ZEROS_COUNT: counts the number of leading zero bits >> - VectorOperations.TRAILING_ZEROS_COUNT: counts the number of trailing zero bits >> - VectorOperations.REVERSE: reversing the order of bits >> - VectorOperations.REVERSE_BYTES: reversing the order of bytes >> - compress and expand bits: Semantics are based on Hacker's Delight section 7-4 Compress, or Generalized Extract. >> >> 2) Adds following new APIs to perform cross lane vector compress and expansion operations under the influence of a mask. 
>> - Vector.compress >> - Vector.expand >> - VectorMask.compress >> >> 3) Adds predicated and non-predicated versions of following new APIs to load and store the contents of vector from foreign MemorySegments. >> - Vector.fromMemorySegment >> - Vector.intoMemorySegment >> >> 4) C2 Compiler IR enhancements and optimized X86 and AARCH64 backend support for each newly added operation. >> >> >> Patch has been regressed over AARCH64 and X86 targets different AVX levels. >> >> Kindly review and share your feedback. >> >> Best Regards, >> Jatin > > Jatin Bhateja has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 20 commits: > > - 8284960: Post merge cleanups. > - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8284960 > - 8284960: Review comments resolved. > - 8284960: Integrating incremental patches. > - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8284960 > - 8284960: Changes to enable jdk.incubator.vector to be treated as preview participant. Code re-organization related to Reverse/ReverseByte IR transforms. > - 8284960: Adding --enable-preview in vectorAPI benchmarks. > - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8284960 > - 8284960: Review comments resolution. > - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8284960 > - ... and 10 more: https://git.openjdk.java.net/jdk/compare/742644e2...0f6e1584 Thanks reviewers for your comments. 
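For readers unfamiliar with the scalar compress / expand bit semantics referenced in the summary above (Hacker's Delight, section 7-4), the following is a minimal reference sketch in plain Java. It is a simple bit-by-bit loop written for illustration only; it is not the C2 intrinsic, not the JDK implementation, and not the vector API. The class and method names (`BitCompressSketch`, `compress`, `expand`) are hypothetical.

```java
public class BitCompressSketch {
    // Gather the bits of x selected by mask and pack them,
    // in order, into the low-order bits of the result.
    static int compress(int x, int mask) {
        int result = 0;
        int out = 0; // next free bit position in the result
        for (int i = 0; i < 32; i++) {
            if (((mask >>> i) & 1) != 0) {
                result |= ((x >>> i) & 1) << out;
                out++;
            }
        }
        return result;
    }

    // Inverse direction: scatter the low-order bits of x
    // to the positions where mask has a one-bit.
    static int expand(int x, int mask) {
        int result = 0;
        int in = 0; // next low-order bit of x to consume
        for (int i = 0; i < 32; i++) {
            if (((mask >>> i) & 1) != 0) {
                result |= ((x >>> in) & 1) << i;
                in++;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        int x = 0b1011_0110;
        int mask = 0b1111_0000;
        int c = compress(x, mask);
        System.out.println(Integer.toBinaryString(c));               // 1011
        System.out.println(Integer.toBinaryString(expand(c, mask))); // 10110000
    }
}
```

One property worth checking with such a sketch: `expand(compress(x, m), m)` always equals `x & m`, since compress drops exactly the unselected bits and expand puts the selected ones back in place.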
------------- PR: https://git.openjdk.java.net/jdk/pull/8425 From jbhateja at openjdk.java.net Tue May 31 16:04:58 2022 From: jbhateja at openjdk.java.net (Jatin Bhateja) Date: Tue, 31 May 2022 16:04:58 GMT Subject: Integrated: 8284960: Integration of JEP 426: Vector API (Fourth Incubator) In-Reply-To: References: Message-ID: On Wed, 27 Apr 2022 11:03:48 GMT, Jatin Bhateja wrote: > Hi All, > > Patch adds the planned support for new vector operations and APIs targeted for [JEP 426: Vector API (Fourth Incubator).](https://bugs.openjdk.java.net/browse/JDK-8280173) > > Following is the brief summary of changes:- > > 1) Extends the scope of existing lanewise API for following new vector operations. > - VectorOperations.BIT_COUNT: counts the number of one-bits > - VectorOperations.LEADING_ZEROS_COUNT: counts the number of leading zero bits > - VectorOperations.TRAILING_ZEROS_COUNT: counts the number of trailing zero bits > - VectorOperations.REVERSE: reversing the order of bits > - VectorOperations.REVERSE_BYTES: reversing the order of bytes > - compress and expand bits: Semantics are based on Hacker's Delight section 7-4 Compress, or Generalized Extract. > > 2) Adds following new APIs to perform cross lane vector compress and expansion operations under the influence of a mask. > - Vector.compress > - Vector.expand > - VectorMask.compress > > 3) Adds predicated and non-predicated versions of following new APIs to load and store the contents of vector from foreign MemorySegments. > - Vector.fromMemorySegment > - Vector.intoMemorySegment > > 4) C2 Compiler IR enhancements and optimized X86 and AARCH64 backend support for each newly added operation. > > > Patch has been regressed over AARCH64 and X86 targets different AVX levels. > > Kindly review and share your feedback. > > Best Regards, > Jatin This pull request has now been integrated. 
Changeset: 6f6486e9 Author: Jatin Bhateja URL: https://git.openjdk.java.net/jdk/commit/6f6486e97743eadfb20b4175e1b4b2b05b59a17a Stats: 38021 lines in 228 files changed: 16652 ins; 16924 del; 4445 mod 8284960: Integration of JEP 426: Vector API (Fourth Incubator) Co-authored-by: Jatin Bhateja Co-authored-by: Paul Sandoz Co-authored-by: Sandhya Viswanathan Co-authored-by: Smita Kamath Co-authored-by: Joshua Zhu Co-authored-by: Xiaohong Gong Co-authored-by: John R Rose Co-authored-by: Eric Liu Co-authored-by: Ningsheng Jian Reviewed-by: ngasson, vlivanov, mcimadamore, jlahoda, kvn ------------- PR: https://git.openjdk.java.net/jdk/pull/8425 From duke at openjdk.java.net Tue May 31 16:39:40 2022 From: duke at openjdk.java.net (Emanuel Peter) Date: Tue, 31 May 2022 16:39:40 GMT Subject: RFR: 8283775: better dump: VM support for graph querying in debugger with BFS traversal and node filtering [v23] In-Reply-To: References: Message-ID: > **What this gives you for the debugger** > - BFS traversal (inputs / outputs) > - node filtering by category > - shortest path between nodes > - all paths between nodes > - readability in terminal: alignment, sorting by node idx, distance to start, and colors (optional) > - and more > > **Some usecases** > - more readable `dump` > - follow only nodes of some categories (only control, only data, etc) > - find which control nodes depend on data node (visit data nodes, include control in boundary) > - how two nodes relate (shortest / all paths, following input/output nodes, or both) > - find loops (control / memory / data: call all paths with node as start and target) > > **Description** > I implemented VM support for BFS traversal of IR nodes, where one can filter for input/output edges, and the node-type (control / data / memory). If one specifies the `target` node, we find a shortest path between `this` and `target`. 
With the `options` string one can easily select which node types to visit (`cdmxo`) and which to include only in the boundary (`CDMXO`). To find all paths between two nodes, include the letter `A` in the options string. > > `void Node::dump_bfs(const int max_distance, Node* target, char const* options)` > > While a sufficient summary and description is in the comments above the function definition, I want to explain some useful use-cases here. > > Please let me know if you would find this helpful, or if you have any feedback to improve it. > Thanks, Emanuel > > PS: I do plan to refactor the `dump` code in `node.cpp` to use my new infrastructure. I will also remove `Node::related` and `dump_related`, since they have not been properly extended and maintained. But that refactoring would risk messing with tools that depend on `dump`, which I would like to avoid for now; I will do it in a second step. > > **1. Better dump()** > The function is similar to `dump()`: we can also follow input / output edges up to a certain distance. There are a few improvements/added features: > > 1. Display the distance to the specified node. This way I can quickly assess how many nodes I have at a certain distance, and I can traverse edges from one distance to the next. > 2. Choose if you want to traverse only input `+` or output `-` edges, or both `+-`. > 3. Node filtering with types taken from `type->category()` helps one limit traversal to control, data or memory flow. The categories are the same as in IGV. > 4. Separate visit / boundary filters by node type: traverse the graph visiting only some node types (e.g. data). On the boundary, also display but do not traverse nodes allowed by the boundary filter (e.g. control). This can be useful to traverse outputs of a data node recursively, and see what control nodes depend on it. Use `dcmxo` for the visit filter, and `DCMXO` for the boundary filter. > 5. Probably controversial: coloring of node types in terminal. By default it is off.
May not work on all terminals. But I find it very useful to quickly separate memory, data and control - just as in IGV - but in my terminal! Highly recommend putting the `#` in the options string! To more easily trace chains of nodes, I highlight the node idx of all nodes that are displayed in their respective colors. > 6. After matching, we create Mach nodes from the IR nodes. For most Mach nodes, there is an associated old node. Often an assert is triggered for Mach nodes, but the graph already breaks in the IR graph. It is thus very helpful to have the old node idx displayed. Use `@` in options string. > 7. I also display the head node idx of the block a Mach node is scheduled for. This can be helpful if one gets bad dominance asserts, and needs to see why the blocks were scheduled in a bad way. Use `B` in options string. > 8. Some people like the displayed nodes to be sorted by node idx. Simply add an `S` to the option string! > > Example (BFS inputs): > > (rr) p find_node(161)->dump_bfs(2,0,"dcmxo+") > d dump > --------------------------------------------- > 2 159 CmpI === _ 137 40 [[ 160 ]] !orig=[144] !jvms: StringLatin1::hashCode @ bci:13 (line 193) > 2 147 IfTrue === 161 [[ 166 ]] #1 !jvms: StringLatin1::hashCode @ bci:13 (line 193) > 2 165 OuterStripMinedLoop === 165 93 164 [[ 165 166 ]] > 1 160 Bool === _ 159 [[ 161 ]] [lt] !orig=[145] !jvms: StringLatin1::hashCode @ bci:13 (line 193) > 1 166 CountedLoop === 166 165 147 [[ 166 161 102 103 ]] stride: 1 strip mined !orig=[157],[99] !jvms: StringLatin1::hashCode @ bci:16 (line 193) > 0 161 CountedLoopEnd === 166 160 [[ 162 147 ]] [lt] P=0.957374, C=19675.000000 !orig=[146] !jvms: StringLatin1::hashCode @ bci:13 (line 193) > > > Example (BFS control inputs): > > (rr) p find_node(163)->dump_bfs(5,0,"c+") > d dump > --------------------------------------------- > 5 147 IfTrue === 161 [[ 166 ]] #1 !jvms: StringLatin1::hashCode @ bci:13 (line 193) > 5 165 OuterStripMinedLoop === 165 93 164 [[ 165 166 ]] > 4 166 
CountedLoop === 166 165 147 [[ 166 161 102 103 ]] stride: 1 strip mined !orig=[157],[99] !jvms: StringLatin1::hashCode @ bci:16 (line 193) > 3 161 CountedLoopEnd === 166 160 [[ 162 147 ]] [lt] P=0.957374, C=19675.000000 !orig=[146] !jvms: StringLatin1::hashCode @ bci:13 (line 193) > 2 162 IfFalse === 161 [[ 167 168 ]] #0 !orig=148 !jvms: StringLatin1::hashCode @ bci:13 (line 193) > 1 167 SafePoint === 162 1 7 1 1 168 1 136 37 40 137 1 [[ 163 ]] SafePoint !orig=138 !jvms: StringLatin1::hashCode @ bci:37 (line 193) > 0 163 OuterStripMinedLoopEnd === 167 22 [[ 164 148 ]] P=0.957374, C=19675.000000 > > We see the control flow of a strip mined loop. > > > Experiment (BFS only data, but display all nodes on boundary) > > (rr) p find_node(102)->dump_bfs(10,0,"dCDMOX-") > d dump > --------------------------------------------- > 0 102 Phi === 166 22 136 [[ 133 132 ]] #int !jvms: StringLatin1::hashCode @ bci:16 (line 193) > 1 133 SubI === _ 132 102 [[ 136 ]] !jvms: StringLatin1::hashCode @ bci:25 (line 194) > 1 132 LShiftI === _ 102 131 [[ 133 ]] !jvms: StringLatin1::hashCode @ bci:25 (line 194) > 2 136 AddI === _ 133 155 [[ 153 167 102 ]] !jvms: StringLatin1::hashCode @ bci:32 (line 194) > 3 153 Phi === 53 136 22 [[ 154 ]] #int !jvms: StringLatin1::hashCode @ bci:13 (line 193) > 3 167 SafePoint === 162 1 7 1 1 168 1 136 37 40 137 1 [[ 163 ]] SafePoint !orig=138 !jvms: StringLatin1::hashCode @ bci:37 (line 193) > 4 154 Return === 53 6 7 8 9 returns 153 [[ 0 ]] > > We see the dependent output nodes of the data-phi 102, we see that a SafePoint and the Return depend on it. Here colors are really helpful, as it makes it easy to separate the data-nodes (blue) from the boundary-nodes (other colors). 
> > Example with Mach nodes: > > (rr) p find_node(112)->dump_bfs(2,0,"cdmxo+#@B") > d [head idom d] old dump > --------------------------------------------- > 2 534 505 6 o1871 109 addI_rReg_imm === _ 44 [[ 110 102 113 230 327 ]] #-3/0xfffffffd > 2 536 537 15 o186 139 addI_rReg_imm === _ 137 [[ 140 137 113 144 ]] #4/0x00000004 !jvms: StringLatin1::replace @ bci:13 (line 303) > 2 537 538 14 o179 114 IfTrue === 115 [[ 536 ]] #1 !jvms: StringLatin1::replace @ bci:28 (line 304) > 1 536 537 15 o739 113 compI_rReg === _ 139 109 [[ 112 ]] > 1 536 537 15 _ 536 Region === 536 114 [[ 536 112 ]] > 0 536 537 15 o741 112 jmpLoopEnd === 536 113 [[ 134 111 ]] P=0.993611, C=7200.000000 !jvms: StringLatin1::replace @ bci:19 (line 303) > > And the query on the old nodes: > > (rr) p find_old_node(741)->dump_bfs(2,0,"cdmxo+#") > d dump > --------------------------------------------- > 2 o1871 AddI === _ o79 o1872 [[ o739 o1948 o761 o1477 ]] > 2 o186 AddI === _ o1756 o1714 [[ o1756 o739 o1055 ]] > 2 o178 If === o1159 o177 o176 [[ o179 o180 ]] P=0.800503, C=7153.000000 > 1 o739 CmpI === _ o186 o1871 [[ o740 o741 ]] > 1 o740 Bool === _ o739 [[ o741 ]] [lt] > 1 o179 IfTrue === o178 [[ o741 ]] #1 > 0 o741 CountedLoopEnd === o179 o740 o739 [[ o742 o190 ]] [lt] P=0.993611, C=7200.000000 > > > **2. Find loop body** > When I find myself in a loop, I try to localize the loop head and end, and map out at least one path between them. > `loop_end->dump_bfs(20, loop_head, "c+")` > This provides us with a shortest control path, provided this path has a distance of at most 20.
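To illustrate the idea behind the distance-bounded shortest-path query described above, here is a minimal, self-contained sketch in Java. It is not the HotSpot code (which is C++ in `node.cpp`); the graph is a hypothetical map from a node idx to the idx of its control inputs, loosely modelled on the node numbers used in this mail, and the names `BfsSketch`, `sampleGraph` and `shortestPath` are made up for this example.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.LinkedList;
import java.util.List;
import java.util.Map;

public class BfsSketch {
    // Hypothetical control-input edges: node idx -> idx of its inputs.
    static Map<Integer, List<Integer>> sampleGraph() {
        return Map.of(
            741, List.of(179),
            179, List.of(178),
            178, List.of(149),
            149, List.of(148),
            148, List.of(746),
            746, List.of(745, 190),
            190, List.of(741));
    }

    // Plain BFS from start along input edges, never expanding nodes that are
    // already maxDistance away. Returns the path start..target, or an empty
    // list if target is not reachable within maxDistance.
    static List<Integer> shortestPath(Map<Integer, List<Integer>> inputs,
                                      int start, int target, int maxDistance) {
        Map<Integer, Integer> dist = new HashMap<>();
        Map<Integer, Integer> parent = new HashMap<>(); // node one step closer to start
        Deque<Integer> queue = new ArrayDeque<>();
        dist.put(start, 0);
        queue.add(start);
        while (!queue.isEmpty()) {
            int n = queue.remove();
            if (n == target) {
                LinkedList<Integer> path = new LinkedList<>();
                for (Integer cur = n; cur != null; cur = parent.get(cur)) {
                    path.addFirst(cur);
                }
                return path;
            }
            int d = dist.get(n);
            if (d == maxDistance) {
                continue; // distance limit reached, do not traverse further
            }
            for (int in : inputs.getOrDefault(n, List.of())) {
                if (!dist.containsKey(in)) {
                    dist.put(in, d + 1);
                    parent.put(in, n);
                    queue.add(in);
                }
            }
        }
        return List.of();
    }

    public static void main(String[] args) {
        // prints [741, 179, 178, 149, 148, 746]
        System.out.println(shortestPath(sampleGraph(), 741, 746, 20));
    }
}
```

With a limit of 4 instead of 20, the same call returns an empty list, because the target is 5 steps away from the start in this sample graph.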
> > Example (shortest path over control nodes): > > (rr) p find_node(741)->dump_bfs(20,find_node(746),"c+") > d dump > --------------------------------------------- > 5 746 CountedLoop === 746 745 190 [[ 746 148 141 ]] stride: 1 strip mined !orig=[735],[138] !jvms: StringLatin1::replace @ bci:22 (line 304) > 4 148 RangeCheck === 746 147 [[ 149 152 ]] P=0.999999, C=-1.000000 !jvms: StringLatin1::replace @ bci:25 (line 304) > 3 149 IfTrue === 148 [[ 178 166 ]] #1 !orig=170 !jvms: StringLatin1::replace @ bci:25 (line 304) > 2 178 If === 149 177 [[ 179 180 ]] P=0.800503, C=7153.000000 !jvms: StringLatin1::replace @ bci:28 (line 304) > 1 179 IfTrue === 178 [[ 741 ]] #1 !jvms: StringLatin1::replace @ bci:28 (line 304) > 0 741 CountedLoopEnd === 179 740 [[ 742 190 ]] [lt] P=0.993611, C=7200.000000 !orig=[189] !jvms: StringLatin1::replace @ bci:19 (line 303) > > > Once we see this single path in the loop, we may want to see more of the body. For this, we can run an `all paths` query, with the additional character `A` in the options string. We see all nodes that lie on a path between the start and target node, with at most the specified path length.
> > Example (all paths between two nodes): > > (rr) p find_node(741)->dump_bfs(8,find_node(746),"cdmxo+A") > d apd dump > --------------------------------------------- > 6 8 146 CmpU === _ 141 79 [[ 147 ]] !jvms: StringLatin1::replace @ bci:25 (line 304) > 5 8 166 LoadB === 149 7 164 [[ 176 747 ]] @byte[int:>=0]:exact+any *, idx=5; #byte !jvms: StringLatin1::replace @ bci:25 (line 304) > 5 8 147 Bool === _ 146 [[ 148 ]] [lt] !jvms: StringLatin1::replace @ bci:25 (line 304) > 5 5 746 CountedLoop === 746 745 190 [[ 746 148 141 ]] stride: 1 strip mined !orig=[735],[138] !jvms: StringLatin1::replace @ bci:22 (line 304) > 4 5 141 Phi === 746 36 186 [[ 185 186 162 146 154 154 747 ]] #int:0..max-1:www #tripcount !orig=[161] !jvms: StringLatin1::replace @ bci:22 (line 304) > 4 8 176 CmpI === _ 166 169 [[ 177 ]] !jvms: StringLatin1::replace @ bci:28 (line 304) > 4 5 148 RangeCheck === 746 147 [[ 149 152 ]] P=0.999999, C=-1.000000 !jvms: StringLatin1::replace @ bci:25 (line 304) > 3 5 186 AddI === _ 141 51 [[ 185 739 141 ]] !orig=[738],... !jvms: StringLatin1::replace @ bci:13 (line 303) > 3 8 177 Bool === _ 176 [[ 178 ]] [ne] !jvms: StringLatin1::replace @ bci:28 (line 304) > 3 5 149 IfTrue === 148 [[ 178 166 ]] #1 !orig=170 !jvms: StringLatin1::replace @ bci:25 (line 304) > 2 5 739 CmpI === _ 186 79 [[ 740 ]] !orig=[187] !jvms: StringLatin1::replace @ bci:19 (line 303) > 2 5 178 If === 149 177 [[ 179 180 ]] P=0.800503, C=7153.000000 !jvms: StringLatin1::replace @ bci:28 (line 304) > 1 5 740 Bool === _ 739 [[ 741 ]] [lt] !orig=[188] !jvms: StringLatin1::replace @ bci:19 (line 303) > 1 5 179 IfTrue === 178 [[ 741 ]] #1 !jvms: StringLatin1::replace @ bci:28 (line 304) > 0 5 741 CountedLoopEnd === 179 740 [[ 742 190 ]] [lt] P=0.993611, C=7200.000000 !orig=[189] !jvms: StringLatin1::replace @ bci:19 (line 303) > > We see there are multiple paths. 
We can quickly see that there are paths with length 5 (`apd = 5`): the control flow, but also the data flow for the loop-back condition. We also see some paths with length 8, which feed into `178 If` and `148 RangeCheck`. Note that the distance `d` is the distance to the start node `741 CountedLoopEnd`. The all paths distance `apd` is the sum of the shortest path from the current node to the start plus the shortest path to the target node. Thus, we can easily compute the distance to the target node with `apd - d`. > > An alternative way to detect loops quickly is to run an all paths query from a node to itself: > > Example (loop detection with all paths): > > (rr) p find_node(741)->dump_bfs(7,find_node(741),"c+A") > d apd dump > --------------------------------------------- > 6 7 190 IfTrue === 741 [[ 746 ]] #1 !jvms: StringLatin1::replace @ bci:19 (line 303) > 5 7 746 CountedLoop === 746 745 190 [[ 746 148 141 ]] stride: 1 strip mined !orig=[735],[138] !jvms: StringLatin1::replace @ bci:22 (line 304) > 4 7 148 RangeCheck === 746 147 [[ 149 152 ]] P=0.999999, C=-1.000000 !jvms: StringLatin1::replace @ bci:25 (line 304) > 3 7 149 IfTrue === 148 [[ 178 166 ]] #1 !orig=170 !jvms: StringLatin1::replace @ bci:25 (line 304) > 2 7 178 If === 149 177 [[ 179 180 ]] P=0.800503, C=7153.000000 !jvms: StringLatin1::replace @ bci:28 (line 304) > 1 7 179 IfTrue === 178 [[ 741 ]] #1 !jvms: StringLatin1::replace @ bci:28 (line 304) > 0 0 741 CountedLoopEnd === 179 740 [[ 742 190 ]] [lt] P=0.993611, C=7200.000000 !orig=[189] !jvms: StringLatin1::replace @ bci:19 (line 303) > > We get the loop control, plus the loop-back `190 IfTrue`. > > Example (loop detection with all paths for phi): > > (rr) p find_node(141)->dump_bfs(4,find_node(141),"cdmxo+A") > d apd dump > --------------------------------------------- > 1 2 186 AddI === _ 141 51 [[ 185 739 141 ]] !orig=[738],...
!jvms: StringLatin1::replace @ bci:13 (line 303) > 0 0 141 Phi === 746 36 186 [[ 185 186 162 146 154 154 747 ]] #int:0..max-1:www #tripcount !orig=[161] !jvms: StringLatin1::replace @ bci:22 (line 304) > > > **Color examples** > Colors are especially useful to see chains between nodes (options character `#`). > The input and output node idx are also colored if the node is displayed somewhere in the list. This should help you find chains of nodes. > Tip: it can be worth it to configure the colors of your terminal to be more appealing. > > Example (find control dependency of data node): > ![image](https://user-images.githubusercontent.com/32593061/171135935-259d1e15-91d2-4c54-b924-8f5d4b20d338.png) > We see data nodes in blue, and find a `SafePoint` in red and the `Return` in yellow. > > Example (find memory dependency of data node): > ![image](https://user-images.githubusercontent.com/32593061/171138929-d464bd1b-a807-4b9e-b4cc-ec32735cb024.png) > > Example (loop detection): > ![image](https://user-images.githubusercontent.com/32593061/171134459-27ddaa7f-756b-4807-8a98-44ae0632ab5c.png) > We find the control and some data loop paths. Emanuel Peter has updated the pull request incrementally with one additional commit since the last revision: error message on bad options character. 
print usage/example for options character h/H ------------- Changes: - all: https://git.openjdk.java.net/jdk/pull/8468/files - new: https://git.openjdk.java.net/jdk/pull/8468/files/86be394a..3199ade6 Webrevs: - full: https://webrevs.openjdk.java.net/?repo=jdk&pr=8468&range=22 - incr: https://webrevs.openjdk.java.net/?repo=jdk&pr=8468&range=21-22 Stats: 218 lines in 1 file changed: 124 ins; 60 del; 34 mod Patch: https://git.openjdk.java.net/jdk/pull/8468.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8468/head:pull/8468 PR: https://git.openjdk.java.net/jdk/pull/8468 From kvn at openjdk.java.net Tue May 31 18:53:41 2022 From: kvn at openjdk.java.net (Vladimir Kozlov) Date: Tue, 31 May 2022 18:53:41 GMT Subject: RFR: 8287491: compiler/jvmci/errors/TestInvalidDebugInfo.java fails new assert: assert((uint)t < T_CONFLICT + 1) failed: invalid type # In-Reply-To: References: Message-ID: On Tue, 31 May 2022 13:00:57 GMT, Yasumasa Suenaga wrote: > We saw a new assertion error after [JDK-8286562](https://bugs.openjdk.java.net/browse/JDK-8286562) ( #8646 ) > > > # A fatal error has been detected by the Java Runtime Environment: > # > # Internal Error (/home/ysuenaga/github-forked/jdk/src/hotspot/share/utilities/globalDefinitions.hpp:735), pid=2619, tid=2635 > # assert((uint)t < T_CONFLICT + 1) failed: invalid type > > > It was caused by passing `JavaKind.Illegal` as the slot kind in TestInvalidDebugInfo.java. We should pass `JavaKind.Void` instead of `JavaKind.Illegal`. > > See JBS for more details (I attached the hs_err log). @dougxc please look. To me, the evaluation in the bug report and the fix look reasonable. ------------- Marked as reviewed by kvn (Reviewer).
PR: https://git.openjdk.java.net/jdk/pull/8954 From dnsimon at openjdk.java.net Tue May 31 19:02:38 2022 From: dnsimon at openjdk.java.net (Doug Simon) Date: Tue, 31 May 2022 19:02:38 GMT Subject: RFR: 8287491: compiler/jvmci/errors/TestInvalidDebugInfo.java fails new assert: assert((uint)t < T_CONFLICT + 1) failed: invalid type # In-Reply-To: References: Message-ID: On Tue, 31 May 2022 13:00:57 GMT, Yasumasa Suenaga wrote: > We saw a new assertion error after [JDK-8286562](https://bugs.openjdk.java.net/browse/JDK-8286562) ( #8646 ) > > > # A fatal error has been detected by the Java Runtime Environment: > # > # Internal Error (/home/ysuenaga/github-forked/jdk/src/hotspot/share/utilities/globalDefinitions.hpp:735), pid=2619, tid=2635 > # assert((uint)t < T_CONFLICT + 1) failed: invalid type > > > It was caused by passing `JavaKind.Illegal` as the slot kind in TestInvalidDebugInfo.java. We should pass `JavaKind.Void` instead of `JavaKind.Illegal`. > > See JBS for more details (I attached the hs_err log). Marked as reviewed by dnsimon (Committer). ------------- PR: https://git.openjdk.java.net/jdk/pull/8954 From sviswanathan at openjdk.java.net Tue May 31 23:13:14 2022 From: sviswanathan at openjdk.java.net (Sandhya Viswanathan) Date: Tue, 31 May 2022 23:13:14 GMT Subject: RFR: 8287517: C2: assert(vlen_in_bytes == 64) failed: 2 Message-ID: Fixed the assertion in load_iota_indices when the length passed is less than 4. Also fixed the missing break in x86.ad match_rule_supported_vector() for the PopulateIndex case.
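As a generic illustration of the kind of bug a missing `break` causes (this is not the actual `x86.ad` code, which is C++), consider the small Java sketch below. The enum constants, parameters and conditions are hypothetical; the sketch only shows how a case without `break` falls through into the check that belongs to the next case.

```java
public class FallThroughSketch {
    // Hypothetical opcodes, named only for this illustration.
    enum Op { POPULATE_INDEX, NEXT_OP }

    // Buggy variant: POPULATE_INDEX falls through into the NEXT_OP check.
    static boolean supportedBuggy(Op op, int vlen, boolean nextOpFeature) {
        switch (op) {
            case POPULATE_INDEX:
                if (vlen < 2) {
                    return false;
                }
                // missing break: execution continues in the next case
            case NEXT_OP:
                if (!nextOpFeature) {
                    return false;
                }
                break;
        }
        return true;
    }

    // Fixed variant: each case only evaluates its own condition.
    static boolean supportedFixed(Op op, int vlen, boolean nextOpFeature) {
        switch (op) {
            case POPULATE_INDEX:
                if (vlen < 2) {
                    return false;
                }
                break;
            case NEXT_OP:
                if (!nextOpFeature) {
                    return false;
                }
                break;
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(supportedBuggy(Op.POPULATE_INDEX, 4, false)); // false
        System.out.println(supportedFixed(Op.POPULATE_INDEX, 4, false)); // true
    }
}
```

In the buggy variant, an operation can be reported as unsupported because of a condition that belongs to a different case, which is why such a fix is typically a single added line.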
------------- Commit messages: - 8287517: C2: assert(vlen_in_bytes == 64) failed: 2 Changes: https://git.openjdk.java.net/jdk/pull/8961/files Webrev: https://webrevs.openjdk.java.net/?repo=jdk&pr=8961&range=00 Issue: https://bugs.openjdk.java.net/browse/JDK-8287517 Stats: 2 lines in 2 files changed: 1 ins; 0 del; 1 mod Patch: https://git.openjdk.java.net/jdk/pull/8961.diff Fetch: git fetch https://git.openjdk.java.net/jdk pull/8961/head:pull/8961 PR: https://git.openjdk.java.net/jdk/pull/8961 From jiefu at openjdk.java.net Tue May 31 23:33:41 2022 From: jiefu at openjdk.java.net (Jie Fu) Date: Tue, 31 May 2022 23:33:41 GMT Subject: RFR: 8287517: C2: assert(vlen_in_bytes == 64) failed: 2 In-Reply-To: References: Message-ID: On Tue, 31 May 2022 23:02:18 GMT, Sandhya Viswanathan wrote: > Fixed the assertion in load_iota_indices when the length passed is less than 4. > Also fixed the missing break in x86.ad match_rule_supported_vector() for PopulateIndex case. Shall we make a jtreg test for this fix? Thanks. ------------- PR: https://git.openjdk.java.net/jdk/pull/8961